Unified deep-health probe (rc#1188 master decision doc)
Tracking: rc#1188 · companion incident: rc#1173 · companion fix: docs/2026-05-19-valkey-abort-on-connect-fail.md
This doc is the master reference for the rc#1188 deep-health probe — what each slice shipped, why, and how the parts compose.
Why
On 2026-05-18 22:28 UTC the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect(string) (with default AbortOnConnectFail=true) threw synchronously; the catch block in Program.cs downgraded IStreamPublisher to NullStreamPublisher for the pod's lifetime. For 8 hours, every agent dispatch silently no-op'd. No readiness probe failure (the /health endpoint returns 200 unconditionally). No Discord alert. No operator notice until the next morning.
Path A of the response — AbortOnConnectFail=false + ConfigurationOptions.Parse overload — shipped in PR #1175. It fixes the specific Valkey failure mode.
rc#1188 generalises the fix: any dependent system (Valkey, Marten, GitHub API, Discord) could fail similarly, and the conductor's existing health surface couldn't tell the difference between "process alive" and "actually able to do work."
The four-slice rollout
| Slice | Issue | PR | Layer | Ships |
|---|---|---|---|---|
| A | rc#1199 | #1200 | Core/Domain | DeepHealthCheck aggregation policy + DependencyCheck/DeepHealthResult records. Pure decision logic with critical/non-critical two-tier semantics. |
| B1 | rc#1201 | #1205 | Core/Ports + Api | IDependencyHealthChecker port + DeepHealthService parallel-fan-out orchestrator (2-sec per-checker hard timeout) + /healthz/deep endpoint. Tier-3 e2e with stub checkers covering all five status branches. |
| B2 | rc#1204 | #1208 | Api/Adapters | Four production checkers — Valkey (critical, PingAsync), Marten (critical, AnyAsync on AgentStatus), GitHub API (critical, GET /rate_limit), Discord (non-critical, HEAD webhook). Boot-time DI guard log when no checkers registered. |
| C | rc#1210 | #1211 | Core/SelfImprovement + Api | DependencyHealthDegradedWatcher + DependencyHealthDegradationPolicy. Singleton lifetime (first stateful watcher in the rc#947 set) accumulates per-dep observations in a 4-hour rolling buffer; fires when a dep is non-Ok ≥30 min across ≥3 consecutive scans. Auto-files gap-analysis issue. |
| D | this PR | — | k8s | deploy/k8s/deployment.yaml switches both probes from /health to /healthz/deep. Liveness 5-min grace (failureThreshold:5 × periodSeconds:60). Readiness 60s grace (failureThreshold:6 × periodSeconds:10). |
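The parallel fan-out with a per-checker hard timeout from Slice B1 can be sketched as follows. This is an illustrative Python model, not the C# DeepHealthService itself: the function names are hypothetical, and mapping a timeout or exception to Unreachable is an assumption the table does not confirm.

```python
import asyncio

PER_CHECKER_TIMEOUT_S = 2.0  # mirrors the 2-sec per-checker hard timeout


async def run_checker(name, check, timeout_s=PER_CHECKER_TIMEOUT_S):
    """Run one checker; a timeout or exception is reported as Unreachable (assumption)."""
    try:
        status = await asyncio.wait_for(check(), timeout=timeout_s)
        return name, status
    except Exception:
        # Covers the timeout raised by wait_for as well as checker failures.
        return name, "Unreachable"


async def probe_all(checkers, timeout_s=PER_CHECKER_TIMEOUT_S):
    """Fan out to every checker concurrently and collect {name: status}."""
    results = await asyncio.gather(
        *(run_checker(name, check, timeout_s) for name, check in checkers.items())
    )
    return dict(results)
```

Because the checkers run concurrently, the worst-case latency of one probe is roughly one timeout rather than the sum of all checker timeouts, which keeps the endpoint inside typical Kubernetes probe deadlines.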
Status-code semantics
The endpoint returns:
| DeepHealthResult.Overall | HTTP | When |
|---|---|---|
| Ok | 200 | All critical deps reachable; all non-critical deps Ok |
| Degraded | 200 | Any critical dep is Degraded, OR any non-critical dep is non-Ok |
| Unreachable | 503 | Any critical dep is Unreachable |
The two-tier criticality is the load-bearing design choice. Discord (Critical = false) can soft-degrade overall but cannot trip 503 — an alerting outage must not pull the pod from the load balancer.
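A minimal sketch of the two-tier aggregation, assuming the three status values in the table. The record fields (name, status, critical) and the function name are hypothetical stand-ins for the real DependencyCheck record and DeepHealthCheck policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyCheck:
    name: str
    status: str     # "Ok", "Degraded", or "Unreachable"
    critical: bool  # hypothetical field; Discord would be critical=False


def aggregate(checks):
    """Return (overall, http_status) following the status-code table."""
    # Only a critical dep being Unreachable may trip 503.
    if any(c.critical and c.status == "Unreachable" for c in checks):
        return "Unreachable", 503
    # Anything else that is non-Ok soft-degrades but still answers 200.
    if any(c.status != "Ok" for c in checks):
        return "Degraded", 200
    return "Ok", 200
```

For example, an unreachable Discord webhook alongside healthy critical deps yields ("Degraded", 200), so the pod stays in the load balancer.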
Two-system response
The probe wiring (PR-D) and the watcher (PR-C) are two complementary systems that respond to a sustained dep outage. Between them they provide three responses:
| Response | Trigger | Effect | Cadence |
|---|---|---|---|
| Kubernetes readiness | /healthz/deep 503 sustained ≥60s | Pod removed from load balancer; clients see ServiceUnavailable | Probe runs every 10s |
| Kubernetes liveness | /healthz/deep 503 sustained ≥5 min | Pod restart; fresh connection pool for every dep | Probe runs every 60s |
| rc#947 watcher auto-file | Any dep non-Ok across ≥3 consecutive scans (≥30 min) | GitHub gap-analysis issue filed under dashecorp/rig-conductor | Scan every 15 min |
The three layers escalate gradually with outage duration:
- Brief blip (<60s): no response. The orchestrator's 2-sec per-checker timeout absorbs single slow probes.
- Moderate blip (60s–5 min): readiness pulls the pod from LB; dashboard becomes ServiceUnavailable; clients retry against the cluster service IP (no other pod in this single-replica deployment, so they wait).
- Sustained outage (5 min+): pod restart. New IConnectionMultiplexer constructed, new Marten pool, fresh HttpClient handlers. If the dep is back, recovery is automatic. If not, the cycle repeats.
- Persistent outage (30 min+): watcher files a gap-analysis issue. Operator gets a paper trail.
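The escalation ladder above can be modelled as a small lookup. This is an idealized sketch: the constant and function names are hypothetical, a critical dep is assumed to stay Unreachable for the whole period, and probe and scan phase alignment is ignored, so real trigger times may lag by up to one probe or scan interval.

```python
READINESS_GRACE_S = 6 * 10        # failureThreshold 6 x periodSeconds 10
LIVENESS_GRACE_S = 5 * 60         # failureThreshold 5 x periodSeconds 60
WATCHER_MIN_DURATION_S = 30 * 60  # watcher threshold of 30 min


def responses_for_outage(duration_s):
    """List the responses that have fired once a dep has been down for duration_s seconds."""
    fired = []
    if duration_s >= READINESS_GRACE_S:
        fired.append("readiness: pod removed from load balancer")
    if duration_s >= LIVENESS_GRACE_S:
        fired.append("liveness: pod restarted")
    if duration_s >= WATCHER_MIN_DURATION_S:
        fired.append("watcher: gap-analysis issue filed")
    return fired
```

A 30-second blip returns an empty list, matching the "no response" tier, while a 10-minute outage returns both Kubernetes responses but not the watcher.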
Single-replica deployment caveat
deploy/k8s/deployment.yaml sets replicas: 1, so a readiness 503 means no pod is available. Clients see service unavailable rather than failing over to another replica. This is intentional: the conductor's state lives in Postgres / Valkey / GitHub, so a fresh pod restart is a clean recovery. Horizontal scaling is out of scope (rc#1023 tracks the design implications).
Tunables
All thresholds are deliberately hardcoded in this rollout. The candidates for follow-up env-var configuration are:
- Per-checker latency thresholds (>500 ms Valkey, >1000 ms Marten → Degraded): default values defensible until production data suggests otherwise.
- DeepHealthService.PerCheckerTimeout (2 s): conservative for the slowest known dep (Marten LINQ round-trip).
- DependencyHealthDegradationPolicy.MinDuration (30 min) + MinObservations (3): tuned to the 15-min SelfImprovementService.ScanInterval.
- K8s probe parameters (60s × 5 = 5min liveness; 10s × 6 = 60s readiness): tracked in this doc only.
A future refactor (IOptions<DeepHealthOptions> or similar) can lift them into appsettings / env without code churn.
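To see how MinDuration and MinObservations interact with the 15-min scan interval, here is a hedged sketch of the firing rule. It assumes "non-Ok ≥30 min across ≥3 consecutive scans" means the trailing run of non-Ok observations must contain at least three entries and span at least 30 minutes from first to last; the real DependencyHealthDegradationPolicy may define the window differently, and all names below are hypothetical.

```python
from datetime import datetime, timedelta

MIN_DURATION = timedelta(minutes=30)
MIN_OBSERVATIONS = 3


def should_fire(observations, min_duration=MIN_DURATION, min_observations=MIN_OBSERVATIONS):
    """observations: list of (timestamp, status) for one dep, oldest first."""
    streak = []
    # Walk backwards from the newest scan, collecting the trailing non-Ok run.
    for ts, status in reversed(observations):
        if status == "Ok":
            break
        streak.append(ts)
    if len(streak) < min_observations:
        return False
    # streak[0] is the newest non-Ok scan, streak[-1] the oldest in the run.
    return streak[0] - streak[-1] >= min_duration
```

Under this reading, three consecutive non-Ok scans taken 15 minutes apart span exactly 30 minutes, so the third scan is the earliest point at which the watcher can fire.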
Verification post-merge
After PR-D rolls out, expect:
- GET /healthz/deep returns 200 {"overall":"Ok","dependencies":[{name:"valkey",status:"Ok",...},...,4 deps total]}
- Pod logs [DeepHealth] 4 dependency health checker(s) registered: valkey, marten, github, discord(non-critical) at startup
- kubectl describe pod shows the new probe paths
- Watcher logs [DependencyHealthDegradedWatcher] N dep(s) degraded: ... ONLY when a dep has been non-Ok for ≥30 min
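The first check in the list can be scripted. The sketch below validates a captured response and assumes the body is standard JSON with quoted keys, using only the field names shown above (overall, dependencies, name, status) and the four dependency names from the startup log line. The function name is hypothetical, and any other fields in the real payload are ignored.

```python
import json

EXPECTED_DEPS = {"valkey", "marten", "github", "discord"}


def check_deep_health(http_status, body_text):
    """Return a list of problems in a /healthz/deep response; an empty list means healthy."""
    problems = []
    if http_status != 200:
        problems.append(f"expected HTTP 200, got {http_status}")
    body = json.loads(body_text)
    if body.get("overall") != "Ok":
        problems.append(f"overall is {body.get('overall')!r}, expected 'Ok'")
    # Map each reported dependency name to its status.
    deps = {d.get("name"): d.get("status") for d in body.get("dependencies", [])}
    missing = EXPECTED_DEPS - deps.keys()
    if missing:
        problems.append(f"missing dependencies: {sorted(missing)}")
    for name, status in deps.items():
        if status != "Ok":
            problems.append(f"{name} is {status!r}")
    return problems
```

One way to use it is to save the status code and body from a curl call against the pod and pass both in; an empty result corresponds to the healthy state described in the first bullet.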
See also
- docs/api.md#deep-health-rc1188: endpoint reference for operators.
- docs/2026-05-19-watcher-dependency-health-degraded.md: PR-C watcher details.
- docs/2026-05-19-valkey-abort-on-connect-fail.md: the rc#1173 Path A fix that prompted rc#1188.
- docs/2026-05-16-rc-947-self-improvement-service.md: parent watcher framework.