Skip to content

Unified deep-health probe (rc#1188 master decision doc)

Tracking: rc#1188 · companion incident: rc#1173 · companion fix: docs/2026-05-19-valkey-abort-on-connect-fail.md

This doc is the master reference for the rc#1188 deep-health probe — what each slice shipped, why, and how the parts compose.

Why

On 2026-05-18 22:28 UTC the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect(string) (with default AbortOnConnectFail=true) threw synchronously; the catch block in Program.cs downgraded IStreamPublisher to NullStreamPublisher for the pod's lifetime. For 8 hours, every agent dispatch silently no-op'd. No readiness probe failure (the /health endpoint returns 200 unconditionally). No Discord alert. No operator notice until the next morning.

Path A of the response — AbortOnConnectFail=false + ConfigurationOptions.Parse overload — shipped in PR #1175. It fixes the specific Valkey failure mode.

rc#1188 generalises the fix: any dependent system (Valkey, Marten, GitHub API, Discord) could fail similarly, and the conductor's existing health surface couldn't tell the difference between "process alive" and "actually able to do work."

The four-slice rollout

Slice Issue PR Layer Ships
A rc#1199 #1200 Core/Domain DeepHealthCheck aggregation policy + DependencyCheck/DeepHealthResult records. Pure decision logic with critical/non-critical two-tier semantics.
B1 rc#1201 #1205 Core/Ports + Api IDependencyHealthChecker port + DeepHealthService parallel-fan-out orchestrator (2-sec per-checker hard timeout) + /healthz/deep endpoint. Tier-3 e2e with stub checkers covering all five status branches.
B2 rc#1204 #1208 Api/Adapters Four production checkers — Valkey (critical, PingAsync), Marten (critical, AnyAsync on AgentStatus), GitHub API (critical, GET /rate_limit), Discord (non-critical, HEAD webhook). Boot-time DI guard log when no checkers registered.
C rc#1210 #1211 Core/SelfImprovement + Api DependencyHealthDegradedWatcher + DependencyHealthDegradationPolicy. Singleton lifetime (first stateful watcher in the rc#947 set) accumulates per-dep observations in a 4-hour rolling buffer; fires when a dep is non-Ok ≥30 min across ≥3 consecutive scans. Auto-files gap-analysis issue.
D this PR k8s deploy/k8s/deployment.yaml switches both probes from /health to /healthz/deep. Liveness 5-min grace (failureThreshold:5 × periodSeconds:60). Readiness 60s grace (failureThreshold:6 × periodSeconds:10).

Status-code semantics

The endpoint returns:

DeepHealthResult.Overall HTTP When
Ok 200 All critical deps reachable; all non-critical deps Ok
Degraded 200 Any critical dep is Degraded, OR any non-critical dep is non-Ok
Unreachable 503 Any critical dep is Unreachable

The two-tier criticality is the load-bearing design choice. Discord (Critical = false) can soft-degrade overall but cannot trip 503 — an alerting outage must not pull the pod from the load balancer.

Two-system response

The probe-wiring (PR-D) and the watcher (PR-C) are two complementary responses to a sustained dep outage:

Response Trigger Effect Cadence
Kubernetes readiness /healthz/deep 503 sustained ≥60s Pod removed from load balancer; clients see ServiceUnavailable Probe runs every 10s
Kubernetes liveness /healthz/deep 503 sustained ≥5 min Pod restart; fresh connection pool for every dep Probe runs every 60s
rc#947 watcher auto-file Any dep non-Ok across ≥3 consecutive scans (≥30 min) GitHub gap-analysis issue filed under dashecorp/rig-conductor Scan every 15 min

The three layers degrade gracefully:

  • Brief blip (<60s): no response. The orchestrator's 2-sec per-checker timeout absorbs single slow probes.
  • Moderate blip (60s–5 min): readiness pulls the pod from LB; dashboard becomes ServiceUnavailable; clients retry against the cluster service IP (no other pod in this single-replica deployment, so they wait).
  • Sustained outage (5 min+): pod restart. New IConnectionMultiplexer constructed, new Marten pool, fresh HttpClient handlers. If the dep is back, recovery is automatic. If not, the cycle repeats.
  • Persistent outage (30 min+): watcher files a gap-analysis issue. Operator gets a paper trail.

Single-replica deployment caveat

replicas: 1 in deploy/k8s/deployment.yaml. Readiness 503 means no pod available — clients see service unavailable, not "use another replica." This is intentional: the conductor's state lives in Postgres / Valkey / GitHub, so a fresh pod restart is a clean recovery. Horizontal scaling is out of scope (rc#1023 tracks the design implications).

Tunables

All thresholds are deliberately hardcoded in this rollout. The followups for env-var configuration are tracked in:

  • Per-checker latency thresholds (>500 ms Valkey, >1000 ms Marten → Degraded): default values defensible until production data suggests otherwise.
  • DeepHealthService.PerCheckerTimeout (2 s): conservative for the slowest known dep (Marten LINQ round-trip).
  • DependencyHealthDegradationPolicy.MinDuration (30 min) + MinObservations (3): tuned to the 15-min SelfImprovementService.ScanInterval.
  • K8s probe parameters (60s × 5 = 5min liveness; 10s × 6 = 60s readiness): tracked in this doc only.

A future refactor (IOptions<DeepHealthOptions> or similar) can lift them into appsettings / env without code churn.

Verification post-merge

After PR-D rolls out, expect:

  • GET /healthz/deep returns 200 {"overall":"Ok","dependencies":[{name:"valkey",status:"Ok",...},...,4 deps total]}
  • Pod logs [DeepHealth] 4 dependency health checker(s) registered: valkey, marten, github, discord(non-critical) at startup
  • kubectl describe pod shows the new probe paths
  • Watcher logs [DependencyHealthDegradedWatcher] N dep(s) degraded: ... ONLY when a dep has been non-Ok for ≥30 min

See also