dependency-health-degraded watcher (rc#1188 PR-C)
Tracking: rc#1210 · parent rc#1188 · companion endpoint shipped in PR-B1 / PR-B2
Third slice of rc#1188: adds the self-improvement signature that catches the failure mode rc#1173 exposed, a dependency silently degrading without any operator visibility.
Why it exists
On 2026-05-18 the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect (with default AbortOnConnectFail=true) threw; the catch downgraded IStreamPublisher to NullStreamPublisher for the lifetime of the pod. For 8 hours every dispatch silently no-op'd. No operator notice — no readiness probe failure, no StreamConsumerWithoutHeartbeatWatcher fire (it watches the symptom, not the cause), no Discord alert.
The PR-B1 endpoint (/healthz/deep) + PR-B2 production checkers surface the per-dep state at any point in time. But "operator must remember to curl it" is not observability — it needs a daemon that watches the signal and tells someone.
This watcher does that. Every 15 min (per SelfImprovementService cadence) it polls DeepHealthService.RunAsync() and accumulates per-dep observations. When a dep has been continuously non-Ok across at least 3 scans spanning ≥30 minutes, the rc#947 SelfImprovementService auto-files a gap-analysis issue against dashecorp/rig-conductor.
The rc#1173 incident would have fired at ~45 minutes — within the operator-attention window — rather than 8 hours.
Signal
Each EvaluateAsync tick:
- Call DeepHealthService.RunAsync(ct) to get the current DeepHealthResult (4 production checkers from PR-B2: Valkey / Marten / GitHub / Discord).
- Append each DependencyCheck as a DependencyObservation(Name, Status, Critical, At) to the watcher's in-memory rolling buffer.
- Prune entries older than 4 hours and cap total entries at 200 (defensive; pruning by time should already keep it under).
- Delegate to DependencyHealthDegradationPolicy.Detect(history, now).
The policy fires for each dep d where:
- count(observations of d in window) ≥ 3
- all observations of d in window have Status != Ok
- (latest_at - earliest_at).TotalMinutes ≥ 30
Returns one SignatureOccurrence per degraded dep with EvidenceKey = "dep:{name}", so SelfImprovementService deduplicates per dep, not per watcher.
A single recovery observation clears the signal, so partial flapping doesn't trigger an issue filing. This intentionally trades earlier detection for fewer false positives.
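The three conditions above can be sketched as a pure function. This is a minimal illustration, not the shipped policy: the DepStatus enum, the record shapes, and the summary format are assumptions, and "window" is read here as the trailing run of non-Ok observations for a dep, so that one Ok observation resets it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration; the real rig-conductor types may differ.
public enum DepStatus { Ok, Degraded, Unreachable }

public sealed record DependencyObservation(string Name, DepStatus Status, bool Critical, DateTimeOffset At);

public sealed record SignatureOccurrence(string EvidenceKey, string Summary);

public static class DependencyHealthDegradationPolicySketch
{
    public const int MinObservations = 3;
    public static readonly TimeSpan MinDuration = TimeSpan.FromMinutes(30);

    public static IReadOnlyList<SignatureOccurrence> Detect(
        IEnumerable<DependencyObservation> history, DateTimeOffset now)
    {
        var occurrences = new List<SignatureOccurrence>();

        foreach (var group in history.Where(o => o.At <= now).GroupBy(o => o.Name))
        {
            // Assumption: the window is the trailing run of non-Ok observations,
            // so a single recovery (Ok) observation clears the signal.
            var run = group.OrderByDescending(o => o.At)
                           .TakeWhile(o => o.Status != DepStatus.Ok)
                           .ToList();

            if (run.Count < MinObservations) continue;

            // run is newest-first: latest_at - earliest_at.
            var span = run[0].At - run[run.Count - 1].At;
            if (span < MinDuration) continue;

            // Enum order makes Unreachable "worse" than Degraded.
            var worst = run.Max(o => o.Status);
            var label = run[0].Critical ? "critical" : "non-critical";

            occurrences.Add(new SignatureOccurrence(
                $"dep:{group.Key}",
                $"{group.Key} ({label}) has been {worst} for {span.TotalMinutes:F0} min"));
        }

        return occurrences;
    }
}
```

Under these assumptions, three Unreachable observations at 0, 15 and 30 minutes yield one occurrence keyed dep:valkey, and appending an Ok observation at 45 minutes yields none.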
Why 30 min / 3 obs
The thresholds are tuned to the SelfImprovementService 15-min scan cadence:
| Threshold | Why |
|---|---|
| MinObservations = 3 | Two consecutive non-Ok scans could be a single ~15-min Marten connection blip; three rules that out without delaying detection past ~45 min total. |
| MinDuration = 30 min | Two scans apart (15 + 15 = 30 min) is the soonest a "third consecutive non-Ok" can be observed. The threshold matches the natural scan granularity. |
If SelfImprovementService.ScanInterval ever shortens (e.g. to 5 min), both thresholds should track — the policy is parametric, just not via config today.
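If the thresholds were ever derived from the scan interval, the arithmetic could look like the sketch below. DegradationThresholds and ForScanInterval are hypothetical names that do not exist in the codebase, and the sketch assumes the observation count stays fixed while the duration scales, which is only one way to read "both thresholds should track".

```csharp
using System;

// Hypothetical helper: N consecutive scans span (N - 1) scan intervals.
public sealed record DegradationThresholds(int MinObservations, TimeSpan MinDuration)
{
    public static DegradationThresholds ForScanInterval(TimeSpan scanInterval, int minObservations = 3)
    {
        if (scanInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(scanInterval));
        if (minObservations < 2)
            throw new ArgumentOutOfRangeException(nameof(minObservations));

        // 15-min cadence, 3 observations: (3 - 1) * 15 = 30 minutes.
        var minDuration = TimeSpan.FromTicks(scanInterval.Ticks * (minObservations - 1));
        return new DegradationThresholds(minObservations, minDuration);
    }
}
```

With the current 15-minute cadence this reproduces the documented pair (3 observations, 30 minutes); a 5-minute cadence would give 3 observations over 10 minutes, which may or may not be the desired trade-off.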
Lifetime + state
This watcher is the first stateful one in the rc#947 set. Others (transient, stateless) read current Marten state per tick. This one accumulates observations across scans because the "≥30 min sustained" signal isn't queryable from a single snapshot.
DI registration:
```csharp
builder.Services.AddSingleton<DependencyHealthDegradedWatcher>();
builder.Services.AddTransient<ISelfImprovementWatcher>(sp =>
    sp.GetRequiredService<DependencyHealthDegradedWatcher>());
```
Singleton lifetime keeps the rolling buffer; the factory + AddTransient wiring means the IEnumerable<ISelfImprovementWatcher> resolution in SelfImprovementService still discovers it. State is guarded by a private lock — SelfImprovementService runs scans sequentially per cycle but defensive thread-safety is cheap.
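A lock-guarded rolling buffer of the kind described here might look roughly like this. ObservationBuffer and AppendAndSnapshot are hypothetical names, and the sketch assumes callers pass a non-decreasing now, so the list stays ordered oldest-first and the cap can drop from the front.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the watcher's in-memory state; not the actual implementation.
public sealed class ObservationBuffer<T>
{
    private static readonly TimeSpan MaxAge = TimeSpan.FromHours(4);
    private const int MaxEntries = 200;

    private readonly object _gate = new();
    private readonly List<(DateTimeOffset At, T Item)> _entries = new();

    // Appends the current scan's items, prunes, and returns a copy that is safe
    // to hand to a pure policy function outside the lock.
    public IReadOnlyList<T> AppendAndSnapshot(IEnumerable<T> items, DateTimeOffset now)
    {
        lock (_gate)
        {
            foreach (var item in items)
                _entries.Add((now, item));

            // Time-based prune: drop anything older than 4 hours.
            _entries.RemoveAll(e => now - e.At > MaxAge);

            // Defensive cap: drop the oldest entries if still above the limit.
            var overflow = _entries.Count - MaxEntries;
            if (overflow > 0)
                _entries.RemoveRange(0, overflow);

            return _entries.Select(e => e.Item).ToList();
        }
    }
}
```

Returning a snapshot instead of the live list keeps the policy call free of shared mutable state, which matches the split between a stateful watcher and a pure policy.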
Threshold for filing
SignatureThreshold = (Count: 3, Window: 7 days). Three occurrences of a degraded dep within a week is the trigger for an auto-filed issue. Per-dep dedup via EvidenceKey = "dep:{name}" means each dep tracks its own count; one bad week of Valkey doesn't suppress a new GitHub API regression.
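To illustrate how per-dep keys keep the counts independent, here is a small sketch of a count-within-window check. It is not how SelfImprovementService is known to be implemented: RecordedOccurrence, FilingThresholdSketch, and KeysToFile are hypothetical, and only the (Count, Window) shape is taken from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration; the real SelfImprovementService types may differ.
public sealed record SignatureThreshold(int Count, TimeSpan Window);

public sealed record RecordedOccurrence(string EvidenceKey, DateTimeOffset At);

public static class FilingThresholdSketch
{
    // Returns the evidence keys that reached the threshold inside the window ending at now.
    public static IReadOnlyList<string> KeysToFile(
        IEnumerable<RecordedOccurrence> occurrences, SignatureThreshold threshold, DateTimeOffset now)
    {
        var cutoff = now - threshold.Window;

        return occurrences
            .Where(o => o.At > cutoff && o.At <= now)   // only occurrences inside the window
            .GroupBy(o => o.EvidenceKey)                // each dep counts on its own
            .Where(g => g.Count() >= threshold.Count)
            .Select(g => g.Key)
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();
    }
}
```

Because grouping is by EvidenceKey, three dep:valkey occurrences in a week would cross the threshold while a single dep:github occurrence in the same week would keep accumulating separately.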
Convergence
Standard rc#947 CloseAndReset convergence (per docs/2026-05-17-convergence-auto-close.md): when the watcher reports zero occurrences for a full scan window, the auto-filed issue closes itself. The signature stays registered as a regression canary.
Out of scope
- Per-dep configurable thresholds (env-overridable): a straightforward extension, deferred until there is demand.
- Severity tiers (critical Unreachable → file P1, non-critical Degraded → file P3): current behaviour is one issue per dep, framed correctly via (critical|non-critical) in the summary.
- HTTP /healthz/deep history endpoint for the dashboard: covered in rc#1188 PR-D scope (or a follow-up dashboard slice).
Test coverage
| Tier | File | Cases |
|---|---|---|
| 1 (pure policy) | DependencyHealthDegradationPolicyTests | 12: boundary at MinDuration, just-under, single recovery clears, sustained Degraded fires, mixed-status reports worst, multiple deps emit separate occurrences, non-critical labelling, EvidenceKey dedup |
| 2 (watcher smoke) | DependencyHealthDegradedWatcherTests | 5: empty checker set, first call doesn't fire, three quick calls don't fire (zero time-span), Degraded path doesn't throw, signature metadata is conventional |
| 3 (e2e against real DeepHealthService) | Skipped: requires a TimeProvider seam through DeepHealthService (unrelated refactor). Policy tests cover the 30-min threshold logic with synthetic observations. | |
See also
- rc#1173 (Valkey AbortOnConnectFail incident): the failure mode this watcher targets.
- docs/2026-05-19-valkey-abort-on-connect-fail.md: Path A fix (prevented the specific Valkey case).
- docs/api.md#deep-health-rc1188: the underlying endpoint.
- docs/2026-05-16-rc-947-self-improvement-service.md: parent framework.