`dependency-health-degraded` watcher (rc#1188 PR-C)

Tracking: rc#1210 · parent rc#1188 · companion endpoint shipped in PR-B1 / PR-B2

Third slice of rc#1188 — adds the self-improvement signature that catches the failure mode rc#1173 exposed: a dependent system silently degrading without any operator visibility.

Why it exists

On 2026-05-18 the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect (with default AbortOnConnectFail=true) threw; the catch downgraded IStreamPublisher to NullStreamPublisher for the lifetime of the pod. For 8 hours every dispatch silently no-op'd. No operator notice — no readiness probe failure, no StreamConsumerWithoutHeartbeatWatcher fire (it watches the symptom, not the cause), no Discord alert.

The PR-B1 endpoint (/healthz/deep) + PR-B2 production checkers surface the per-dep state at any point in time. But "operator must remember to curl it" is not observability; something has to watch the signal continuously and tell someone.

This watcher does that. Every 15 min (per SelfImprovementService cadence) it polls DeepHealthService.RunAsync() and accumulates per-dep observations. When a dep has been continuously non-Ok across at least 3 scans spanning ≥30 minutes, the rc#947 SelfImprovementService auto-files a gap-analysis issue against dashecorp/rig-conductor.

For the rc#1173 incident, this watcher would have fired at ~45 minutes, within the operator-attention window, instead of the degradation going unnoticed for 8 hours.

Signal

Each EvaluateAsync tick:

1. Call DeepHealthService.RunAsync(ct) to get the current DeepHealthResult (4 production checkers from PR-B2: Valkey / Marten / GitHub / Discord).
2. Append each DependencyCheck as a DependencyObservation (Name, Status, Critical, At) to the watcher's in-memory rolling buffer.
3. Prune entries older than 4 hours and cap total entries at 200. The cap is defensive: at 4 checkers per 15-minute scan, a 4-hour window holds roughly 64 entries, so time-based pruning should already keep the buffer well under it.
4. Delegate to DependencyHealthDegradationPolicy.Detect(history, now):

The policy fires for each dep d where:

count(observations of d in window) ≥ 3
all observations of d in window have Status != Ok
(latest_at - earliest_at).TotalMinutes ≥ 30

The policy returns one SignatureOccurrence per degraded dep with EvidenceKey = "dep:{name}", so that SelfImprovementService deduplicates per dep rather than per watcher.

A single recovery observation clears the signal, so a dep that flaps between Ok and non-Ok does not trigger an issue filing. This intentionally trades earlier detection for fewer false positives.
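The three firing conditions can be sketched as a pure function over the observation history. This is a minimal illustration, not the actual DependencyHealthDegradationPolicy. The type shapes, the `DepStatus` enum ordering, and the summary format are assumptions made for the example. It also assumes that the "window" is the trailing run of non-Ok observations since the most recent Ok, and it omits the `now` argument because how `now` bounds the window is not specified here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes; the real types may differ.
public enum DepStatus { Ok, Degraded, Unreachable }

public sealed record DependencyObservation(
    string Name, DepStatus Status, bool Critical, DateTimeOffset At);

public sealed record SignatureOccurrence(string EvidenceKey, string Summary);

public static class DegradationPolicySketch
{
    public const int MinObservations = 3;
    public static readonly TimeSpan MinDuration = TimeSpan.FromMinutes(30);

    public static IReadOnlyList<SignatureOccurrence> Detect(
        IEnumerable<DependencyObservation> history)
    {
        var result = new List<SignatureOccurrence>();
        foreach (var group in history.GroupBy(o => o.Name))
        {
            // Assumption: walk back from the newest observation and stop at
            // the first Ok, so a single recovery clears the signal.
            var run = group.OrderByDescending(o => o.At)
                           .TakeWhile(o => o.Status != DepStatus.Ok)
                           .ToList();
            if (run.Count < MinObservations) continue;

            // latest_at - earliest_at over the non-Ok run.
            var span = run[0].At - run[^1].At;
            if (span < MinDuration) continue;

            // Report the worst status seen in the run (Unreachable > Degraded).
            var worst = run.Max(o => o.Status);
            var tier = run[0].Critical ? "critical" : "non-critical";
            result.Add(new SignatureOccurrence(
                $"dep:{group.Key}",
                $"{group.Key} ({tier}) {worst} for {span.TotalMinutes:F0} min"));
        }
        return result;
    }
}
```

With a 15-minute cadence, non-Ok observations at minutes 0, 15 and 30 satisfy both thresholds, while three observations taken at the same instant do not, because their time span is zero.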

Why 30 min / 3 obs

The thresholds are tuned to the SelfImprovementService 15-min scan cadence:

| Threshold | Why |
|---|---|
| `MinObservations = 3` | Two consecutive non-Ok scans could be a single ~15-min Marten connection blip; three rules that out without delaying detection past ~45 min total. |
| `MinDuration = 30 min` | Two scans apart (15 + 15 = 30 min) is the soonest a "third consecutive non-Ok" can be observed. The threshold matches the natural scan granularity. |

If SelfImprovementService.ScanInterval ever shortens (e.g. to 5 min), both thresholds should be adjusted to track it. The policy takes them as parameters, but they are not exposed through configuration today.

Lifetime + state

This watcher is the first stateful one in the rc#947 set. Others (transient, stateless) read current Marten state per tick. This one accumulates observations across scans because the "≥30 min sustained" signal isn't queryable from a single snapshot.

DI registration:

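// One instance per process, so the rolling observation buffer survives across scans.
builder.Services.AddSingleton<DependencyHealthDegradedWatcher>();
// Forward the interface registration to that same singleton instance.
builder.Services.AddTransient<ISelfImprovementWatcher>(sp =>
    sp.GetRequiredService<DependencyHealthDegradedWatcher>());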

Singleton lifetime keeps the rolling buffer; the factory + AddTransient wiring means the IEnumerable<ISelfImprovementWatcher> resolution in SelfImprovementService still discovers it. State is guarded by a private lock — SelfImprovementService runs scans sequentially per cycle but defensive thread-safety is cheap.
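The lock-guarded rolling buffer described above might look roughly like the following. This is a sketch under assumptions, not the watcher's actual code. `RollingObservationBuffer`, `Observation`, and `AppendAndSnapshot` are hypothetical names, and the status is a plain string to keep the example self-contained. The 4-hour retention and 200-entry cap come from the Signal section.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified observation shape for this example.
public sealed record Observation(string Name, string Status, DateTimeOffset At);

public sealed class RollingObservationBuffer
{
    private static readonly TimeSpan MaxAge = TimeSpan.FromHours(4);
    private const int MaxEntries = 200;

    private readonly object _gate = new();
    private readonly List<Observation> _items = new();

    // Appends one scan's observations, prunes old entries, and returns a
    // snapshot copy so callers never enumerate the list outside the lock.
    public IReadOnlyList<Observation> AppendAndSnapshot(
        IEnumerable<Observation> scan, DateTimeOffset now)
    {
        lock (_gate)
        {
            _items.AddRange(scan);

            // Time-based pruning: drop anything older than MaxAge.
            _items.RemoveAll(o => now - o.At > MaxAge);

            // Defensive cap: entries are in insertion order, so the oldest
            // ones sit at the front of the list.
            if (_items.Count > MaxEntries)
                _items.RemoveRange(0, _items.Count - MaxEntries);

            return _items.ToArray();
        }
    }
}
```

Returning a copy rather than the live list keeps the policy evaluation outside the critical section, which is one reasonable way to keep the lock cheap.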

Threshold for filing

SignatureThreshold = (Count: 3, Window: 7 days). Three occurrences of a degraded dep within a week is the trigger for an auto-filed issue. Per-dep dedup via EvidenceKey = "dep:{name}" means each dep tracks its own count; one bad week of Valkey doesn't suppress a new GitHub API regression.
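To illustrate how per-dep dedup keeps the counts independent, here is a small sketch of a count-within-window check keyed by EvidenceKey. How SelfImprovementService actually stores and counts occurrences is not described in this document, so the `Occurrence` record, the `FilingThresholdSketch` class, and the trailing-window interpretation are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one reported occurrence.
public sealed record Occurrence(string EvidenceKey, DateTimeOffset At);

public static class FilingThresholdSketch
{
    public const int RequiredCount = 3;
    public static readonly TimeSpan Window = TimeSpan.FromDays(7);

    // Returns the evidence keys that have at least RequiredCount occurrences
    // inside the trailing Window ending at `now`. Each key is counted alone.
    public static IReadOnlyList<string> KeysOverThreshold(
        IEnumerable<Occurrence> log, DateTimeOffset now) =>
        log.Where(o => o.At <= now && now - o.At <= Window)
           .GroupBy(o => o.EvidenceKey)
           .Where(g => g.Count() >= RequiredCount)
           .Select(g => g.Key)
           .OrderBy(k => k, StringComparer.Ordinal)
           .ToList();
}
```

Because the grouping is by key, three Valkey occurrences reach the threshold on their own, and two GitHub occurrences in the same week neither add to that count nor get suppressed by it.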

Convergence

Standard rc#947 CloseAndReset convergence (per docs/2026-05-17-convergence-auto-close.md): when the watcher reports zero occurrences for a full scan window, the auto-filed issue closes itself. The signature stays registered as a regression canary.

Out of scope

Per-dep configurable thresholds (env-overridable): a straightforward extension, deferred until there is demand for it.
Severity tiers (critical Unreachable → file P1, non-critical Degraded → file P3) — current behaviour is one issue per dep, framed correctly via (critical|non-critical) in the summary.
HTTP /healthz/deep history endpoint for the dashboard — covered in rc#1188 PR-D scope (or a follow-up dashboard slice).

Test coverage

| Tier | File | Cases |
|---|---|---|
| 1 (pure policy) | `DependencyHealthDegradationPolicyTests` | 12: boundary at MinDuration, just-under, single recovery clears, sustained Degraded fires, mixed-status reports worst, multiple deps emit separate occurrences, non-critical labelling, EvidenceKey dedup |
| 2 (watcher smoke) | `DependencyHealthDegradedWatcherTests` | 5: empty checker set, first call doesn't fire, three quick calls don't fire (zero time-span), Degraded path doesn't throw, signature metadata is conventional |
| 3 (e2e against real `DeepHealthService`) | (none) | Skipped: requires `TimeProvider` seam through `DeepHealthService` (unrelated refactor). Policy tests cover the 30-min threshold logic with synthetic observations. |
