Unified deep-health probe (rc#1188 master decision doc)

Tracking: rc#1188 · companion incident: rc#1173 · companion fix: docs/2026-05-19-valkey-abort-on-connect-fail.md

This doc is the master reference for the rc#1188 deep-health probe — what each slice shipped, why, and how the parts compose.

Why

On 2026-05-18 22:28 UTC the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect(string) (with default AbortOnConnectFail=true) threw synchronously; the catch block in Program.cs downgraded IStreamPublisher to NullStreamPublisher for the pod's lifetime. For 8 hours, every agent dispatch silently no-op'd. No readiness probe failure (the /health endpoint returns 200 unconditionally). No Discord alert. No operator notice until the next morning.

Path A of the response — AbortOnConnectFail=false + ConfigurationOptions.Parse overload — shipped in PR #1175. It fixes the specific Valkey failure mode.

rc#1188 generalises the fix: any dependency (Valkey, Marten, GitHub API, Discord) could fail similarly, and the conductor's existing health surface could not distinguish "process alive" from "actually able to do work."

The four-slice rollout

| Slice | Issue | PR | Layer | Ships |
|---|---|---|---|---|
| A | rc#1199 | #1200 | Core/Domain | `DeepHealthCheck` aggregation policy + `DependencyCheck`/`DeepHealthResult` records. Pure decision logic with critical/non-critical two-tier semantics. |
| B1 | rc#1201 | #1205 | Core/Ports + Api | `IDependencyHealthChecker` port + `DeepHealthService` parallel-fan-out orchestrator (2-sec per-checker hard timeout) + `/healthz/deep` endpoint. Tier-3 e2e with stub checkers covering all five status branches. |
| B2 | rc#1204 | #1208 | Api/Adapters | Four production checkers: Valkey (critical, `PingAsync`), Marten (critical, `AnyAsync` on AgentStatus), GitHub API (critical, `GET /rate_limit`), Discord (non-critical, `HEAD` webhook). Boot-time DI guard log when no checkers are registered. |
| C | rc#1210 | #1211 | Core/SelfImprovement + Api | `DependencyHealthDegradedWatcher` + `DependencyHealthDegradationPolicy`. Singleton lifetime (first stateful watcher in the rc#947 set) accumulates per-dep observations in a 4-hour rolling buffer; fires when a dep is non-Ok ≥30 min across ≥3 consecutive scans. Auto-files a gap-analysis issue. |
| D | this PR | n/a | k8s | `deploy/k8s/deployment.yaml` switches both probes from `/health` to `/healthz/deep`. Liveness 5-min grace (failureThreshold:5 × periodSeconds:60). Readiness 60s grace (failureThreshold:6 × periodSeconds:10). |
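The fan-out behaviour of slice B1 (every checker runs in parallel, each capped at 2 seconds) can be illustrated with a small sketch. This is Python rather than the conductor's C#, and it is not the shipped `DeepHealthService`: the function names, the stub checkers, and the choice to map a timeout or exception to `Unreachable` are assumptions made for illustration.

```python
import asyncio

PER_CHECKER_TIMEOUT_S = 2.0  # mirrors the 2-sec per-checker hard timeout


async def run_checker(name, check, timeout=PER_CHECKER_TIMEOUT_S):
    """Run one checker; a timeout or exception becomes 'Unreachable' (assumed mapping)."""
    try:
        status = await asyncio.wait_for(check(), timeout)
    except asyncio.TimeoutError:
        status = "Unreachable"  # checker exceeded its time budget
    except Exception:
        status = "Unreachable"  # checker failed outright
    return name, status


async def check_all(checkers, timeout=PER_CHECKER_TIMEOUT_S):
    """Fan out to all checkers concurrently; one slow dep cannot delay the others."""
    pairs = await asyncio.gather(
        *(run_checker(name, check, timeout) for name, check in checkers.items())
    )
    return dict(pairs)


# Hypothetical stub checkers standing in for real dependency probes.
async def fast_ok():
    return "Ok"


async def hangs():
    await asyncio.sleep(10)
    return "Ok"


async def throws():
    raise ConnectionError("connection refused")
```

Because the checkers run concurrently, the whole probe takes roughly as long as the slowest checker (at most the timeout), not the sum of all of them.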

Status-code semantics

The endpoint returns:

| `DeepHealthResult.Overall` | HTTP | When |
|---|---|---|
| `Ok` | 200 | All critical and non-critical deps are Ok |
| `Degraded` | 200 | No critical dep is Unreachable, AND either a critical dep is Degraded or a non-critical dep is non-Ok |
| `Unreachable` | 503 | Any critical dep is Unreachable |

The two-tier criticality is the load-bearing design choice. Discord (Critical = false) can soft-degrade overall but cannot trip 503 — an alerting outage must not pull the pod from the load balancer.
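The aggregation rule can be sketched as a pure function. This is an illustrative Python rendering of the status-code semantics described here, not the shipped `DeepHealthCheck` code; the Python type and function names are assumptions, and treating an empty checker list as `Ok` is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "Ok"
    DEGRADED = "Degraded"
    UNREACHABLE = "Unreachable"


@dataclass(frozen=True)
class DependencyCheck:
    name: str
    status: Status
    critical: bool


def aggregate(checks):
    """Return (overall status, HTTP code) under the two-tier policy."""
    # Only a critical dependency can trip the 503.
    if any(c.critical and c.status is Status.UNREACHABLE for c in checks):
        return Status.UNREACHABLE, 503
    # Anything else that is non-Ok (critical Degraded, or a non-critical dep
    # in any non-Ok state) soft-degrades the result but still returns 200.
    if any(c.status is not Status.OK for c in checks):
        return Status.DEGRADED, 200
    # Assumption: an empty list also lands here and reports Ok.
    return Status.OK, 200
```

For example, an unreachable Discord alongside healthy critical deps yields `Degraded` with HTTP 200, so the pod stays in the load balancer.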

Two-system response

The probe-wiring (PR-D) and the watcher (PR-C) are two complementary responses to a sustained dep outage:

| Response | Trigger | Effect | Cadence |
|---|---|---|---|
| Kubernetes readiness | `/healthz/deep` 503 sustained ≥60s | Pod removed from load balancer; clients see ServiceUnavailable | Probe runs every 10s |
| Kubernetes liveness | `/healthz/deep` 503 sustained ≥5 min | Pod restart; fresh connection pool for every dep | Probe runs every 60s |
| rc#947 watcher auto-file | Any dep non-Ok across ≥3 consecutive scans (≥30 min) | GitHub gap-analysis issue filed under `dashecorp/rig-conductor` | Scan every 15 min |
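The watcher trigger (non-Ok for ≥30 min across ≥3 consecutive scans, within a 4-hour rolling buffer) can be sketched as follows. This Python sketch is not the shipped `DependencyHealthDegradationPolicy`: the class and method names are hypothetical, and measuring duration as the span between the first and last observation of the trailing non-Ok streak is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta

MIN_DURATION = timedelta(minutes=30)   # non-Ok for at least this long
MIN_OBSERVATIONS = 3                   # across at least this many consecutive scans
BUFFER_WINDOW = timedelta(hours=4)     # rolling buffer length


class DegradationTracker:
    """Stateful per-dependency observation buffer (hypothetical name)."""

    def __init__(self):
        self._observations = {}  # dep name -> deque of (timestamp, is_ok)

    def record(self, dep, at, is_ok):
        buf = self._observations.setdefault(dep, deque())
        buf.append((at, is_ok))
        # Drop observations that have aged out of the rolling window.
        while buf and at - buf[0][0] > BUFFER_WINDOW:
            buf.popleft()

    def should_fire(self, dep):
        # Collect the trailing run of consecutive non-Ok observations.
        streak = []
        for at, is_ok in reversed(self._observations.get(dep, ())):
            if is_ok:
                break
            streak.append(at)
        if len(streak) < MIN_OBSERVATIONS:
            return False
        # streak[0] is the newest observation, streak[-1] the oldest.
        return streak[0] - streak[-1] >= MIN_DURATION
```

With a 15-minute scan interval, three consecutive non-Ok scans span exactly 30 minutes, so the third scan is the earliest point at which the sketch fires. A single Ok observation resets the streak.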

The three layers escalate with the duration of the outage:

- Brief blip (<60s): no response. Readiness tolerates up to 60s of failed probes, and the orchestrator's 2-sec per-checker timeout bounds any single slow probe.
- Moderate blip (60s–5 min): readiness pulls the pod from the LB; the dashboard returns ServiceUnavailable; clients retry against the cluster service IP (there is no other pod in this single-replica deployment, so they wait).
- Sustained outage (5 min+): pod restart. A new IConnectionMultiplexer, a new Marten pool, and fresh HttpClient handlers are constructed. If the dep is back, recovery is automatic. If not, the cycle repeats.
- Persistent outage (30 min+): the watcher files a gap-analysis issue, giving the operator a paper trail.
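The escalation bands follow directly from the probe arithmetic. The Python sketch below maps an outage duration to the responses one would expect; the function and constant names are hypothetical, and it simplifies in three ways: it ignores probe phase alignment, it assumes the outage makes a critical dep Unreachable (so the endpoint returns 503), and it ignores the watcher's additional requirement of ≥3 consecutive scans.

```python
READINESS_GRACE_S = 6 * 10   # failureThreshold 6 x periodSeconds 10 = 60s
LIVENESS_GRACE_S = 5 * 60    # failureThreshold 5 x periodSeconds 60 = 300s
WATCHER_MIN_S = 30 * 60      # watcher minimum duration = 1800s


def expected_responses(outage_seconds):
    """List the responses expected once an outage has lasted this long."""
    responses = []
    if outage_seconds >= READINESS_GRACE_S:
        responses.append("readiness: pod removed from load balancer")
    if outage_seconds >= LIVENESS_GRACE_S:
        responses.append("liveness: pod restart")
    if outage_seconds >= WATCHER_MIN_S:
        responses.append("watcher: gap-analysis issue filed")
    return responses
```

A 30-second blip produces an empty list, a 2-minute outage only the readiness response, and a 45-minute outage all three.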

Single-replica deployment caveat

deploy/k8s/deployment.yaml sets replicas: 1. A readiness 503 therefore means no pod is available: clients see ServiceUnavailable rather than failing over to another replica. This is intentional: the conductor's state lives in Postgres / Valkey / GitHub, so a fresh pod restart is a clean recovery. Horizontal scaling is out of scope (rc#1023 tracks the design implications).

Tunables

All thresholds are deliberately hardcoded in this rollout. The current values, each a candidate for follow-up env-var configuration, are:

- Per-checker latency thresholds (>500 ms Valkey, >1000 ms Marten → Degraded): default values are defensible until production data suggests otherwise.
- DeepHealthService.PerCheckerTimeout (2 s): conservative for the slowest known dep (Marten LINQ round-trip).
- DependencyHealthDegradationPolicy.MinDuration (30 min) + MinObservations (3): tuned to the 15-min SelfImprovementService.ScanInterval.
- K8s probe parameters (60s × 5 = 5 min liveness; 10s × 6 = 60s readiness): tracked in this doc only.

A future refactor (IOptions<DeepHealthOptions> or similar) can lift them into appsettings / env so that later tuning needs no code change.
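As a rough picture of what such an options object might look like, the Python sketch below keeps today's hardcoded values as defaults and lets environment variables override them. Everything here is hypothetical: the field names, the `DEEPHEALTH_` prefix, and the `classify_latency` helper are illustrative assumptions, not existing configuration keys or conductor code.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DeepHealthOptions:
    # Defaults mirror the hardcoded values listed in this section.
    per_checker_timeout_s: float = 2.0
    valkey_degraded_ms: float = 500.0
    marten_degraded_ms: float = 1000.0
    min_duration_min: float = 30.0
    min_observations: int = 3

    @classmethod
    def from_env(cls, env=None, prefix="DEEPHEALTH_"):
        """Build options from env vars such as DEEPHEALTH_VALKEY_DEGRADED_MS (hypothetical)."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                # Parse using the type of the default value (float or int).
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)


def classify_latency(latency_ms, degraded_above_ms):
    """A checker that responds, but slower than its threshold, reports Degraded."""
    return "Degraded" if latency_ms > degraded_above_ms else "Ok"
```

Unset variables fall back to the defaults, so a deployment that sets nothing behaves exactly as the hardcoded rollout does.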

Verification post-merge

After PR-D rolls out, expect:

- `GET /healthz/deep` returns 200 with a body like `{"overall":"Ok","dependencies":[{"name":"valkey","status":"Ok",...},...]}`, listing 4 deps in total.
- Pod logs show `[DeepHealth] 4 dependency health checker(s) registered: valkey, marten, github, discord(non-critical)` at startup.
- `kubectl describe pod` shows the new probe paths.
- Watcher logs show `[DependencyHealthDegradedWatcher] N dep(s) degraded: ...` ONLY when a dep has been non-Ok for ≥30 min.
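The first check can be automated with a small validator. The Python sketch below takes a status code and response body obtained by any HTTP client and reports what deviates from the healthy state. It assumes the body is a JSON object with the `overall`, `dependencies`, `name`, and `status` fields shown above and the four dependency names from the startup log line; the function name is hypothetical and any other fields in the real payload are ignored.

```python
import json

EXPECTED_DEPS = {"valkey", "marten", "github", "discord"}


def verify_deep_health(status_code, body):
    """Return a list of problems; an empty list means the probe looks healthy."""
    problems = []
    if status_code != 200:
        problems.append(f"expected HTTP 200, got {status_code}")
    payload = json.loads(body)
    if payload.get("overall") != "Ok":
        problems.append(f"overall is {payload.get('overall')!r}, expected 'Ok'")
    deps = payload.get("dependencies", [])
    missing = EXPECTED_DEPS - {d.get("name") for d in deps}
    if missing:
        problems.append("missing dependencies: " + ", ".join(sorted(missing)))
    for d in deps:
        if d.get("status") != "Ok":
            problems.append(f"{d.get('name')} is {d.get('status')!r}")
    return problems
```

An empty result corresponds to the expected post-rollout state; any entry points at the dependency or field to investigate.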