
dependency-health-degraded watcher (rc#1188 PR-C)

Tracking: rc#1210 · parent rc#1188 · companion endpoint shipped in PR-B1 / PR-B2

Third slice of rc#1188 — adds the self-improvement signature that catches the failure mode rc#1173 exposed: a dependent system silently degrading without any operator visibility.

Why it exists

On 2026-05-18 the rig-conductor pod restarted while Valkey was transiently unready. ConnectionMultiplexer.Connect (with default AbortOnConnectFail=true) threw; the catch downgraded IStreamPublisher to NullStreamPublisher for the lifetime of the pod. For 8 hours every dispatch silently no-op'd. No operator notice — no readiness probe failure, no StreamConsumerWithoutHeartbeatWatcher fire (it watches the symptom, not the cause), no Discord alert.

The PR-B1 endpoint (/healthz/deep) and the PR-B2 production checkers surface the per-dep state at any point in time. But "operator must remember to curl it" is not observability: someone is only told when a daemon watches the signal and raises it.

This watcher does that. Every 15 min (per SelfImprovementService cadence) it polls DeepHealthService.RunAsync() and accumulates per-dep observations. When a dep has been continuously non-Ok across at least 3 scans spanning ≥30 minutes, the watcher reports an occurrence for that dep, and the rc#947 SelfImprovementService auto-files a gap-analysis issue against dashecorp/rig-conductor once the filing threshold (see "Threshold for filing") is met.

The rc#1173 incident would have fired at ~45 minutes — within the operator-attention window — rather than 8 hours.

Signal

Each EvaluateAsync tick:

  1. Call DeepHealthService.RunAsync(ct) to get the current DeepHealthResult (4 production checkers from PR-B2 — Valkey / Marten / GitHub / Discord).
  2. Append each DependencyCheck as a DependencyObservation (Name, Status, Critical, At) to the watcher's in-memory rolling buffer.
  3. Prune entries older than 4 hours and cap total entries at 200 (a defensive cap: with 4 deps on a 15-min cadence, time-based pruning alone should keep the buffer at roughly 4 × 16 = 64 entries).
  4. Delegate to DependencyHealthDegradationPolicy.Detect(history, now):

The policy fires for each dep d where:

  • count(observations of d in window) ≥ 3
  • all observations of d in window have Status != Ok
  • (latest_at - earliest_at).TotalMinutes ≥ 30

Returns one SignatureOccurrence per degraded dep with EvidenceKey = "dep:{name}" so SelfImprovementService dedup keys per-dep, not per-watcher.
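The three conditions above can be sketched as a pure function. This is a minimal illustration, not the shipped DependencyHealthDegradationPolicy: the DepStatus enum, the record shapes, the Summary text, and the class name DegradationPolicySketch are assumptions, and "window" is read here as "every observation in the supplied history", which is only one possible interpretation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes: field names follow the text above, the types are assumed.
public enum DepStatus { Ok, Degraded, Unreachable }

public sealed record DependencyObservation(
    string Name, DepStatus Status, bool Critical, DateTimeOffset At);

public sealed record SignatureOccurrence(string EvidenceKey, string Summary);

public static class DegradationPolicySketch
{
    public const int MinObservations = 3;
    public static readonly TimeSpan MinDuration = TimeSpan.FromMinutes(30);

    public static IReadOnlyList<SignatureOccurrence> Detect(
        IEnumerable<DependencyObservation> history, DateTimeOffset now)
    {
        var occurrences = new List<SignatureOccurrence>();

        // Assumption: `now` is only used to ignore observations stamped in the future.
        foreach (var group in history.Where(o => o.At <= now).GroupBy(o => o.Name))
        {
            var obs = group.OrderBy(o => o.At).ToList();

            if (obs.Count < MinObservations) continue;                 // count >= 3
            if (obs.Any(o => o.Status == DepStatus.Ok)) continue;      // any Ok clears
            var span = obs[^1].At - obs[0].At;
            if (span < MinDuration) continue;                          // span >= 30 min

            // Enum order (Ok < Degraded < Unreachable) makes Max the worst status.
            var worst = obs.Max(o => o.Status);
            var tier = obs[^1].Critical ? "critical" : "non-critical";

            occurrences.Add(new SignatureOccurrence(
                $"dep:{group.Key}",
                $"{group.Key} ({tier}) {worst} for {span.TotalMinutes:F0} min"));
        }

        return occurrences;
    }
}
```

Because the function takes the history and the clock as arguments, boundary cases such as "exactly 30 minutes" or "one Ok in the middle" can be exercised with synthetic observations and no real waiting.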

A single recovery observation clears the signal, so partial flapping does not trigger issue filing. This intentionally gives up some early detection in exchange for fewer false positives.

Why 30 min / 3 obs

The thresholds are tuned to the SelfImprovementService 15-min scan cadence:

| Threshold | Why |
| --- | --- |
| MinObservations = 3 | Two consecutive non-Ok scans could be a single ~15-min Marten connection blip; three rules that out without delaying detection past ~45 min total. |
| MinDuration = 30 min | Two scans apart (15 + 15 = 30 min) is the soonest a "third consecutive non-Ok" can be observed. The threshold matches the natural scan granularity. |

If SelfImprovementService.ScanInterval ever shortens (e.g. to 5 min), both thresholds should track — the policy is parametric, just not via config today.
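One way the thresholds could track the scan interval is to derive MinDuration from it. The sketch below is illustrative only: DegradationThresholds and FromScanInterval are hypothetical names, and the rule "MinDuration = (MinObservations - 1) × ScanInterval" is inferred from the 15 + 15 = 30 min reasoning in the table above, not taken from the codebase.

```csharp
using System;

// Hypothetical helper: derives the duration threshold from the scan cadence.
public sealed record DegradationThresholds(int MinObservations, TimeSpan MinDuration)
{
    public static DegradationThresholds FromScanInterval(
        TimeSpan scanInterval, int minObservations = 3)
    {
        if (scanInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(scanInterval));
        if (minObservations < 2)
            throw new ArgumentOutOfRangeException(nameof(minObservations));

        // N consecutive scans span (N - 1) intervals between first and last.
        var minDuration = TimeSpan.FromTicks(scanInterval.Ticks * (minObservations - 1));
        return new DegradationThresholds(minObservations, minDuration);
    }
}
```

With the current 15-min cadence this yields 30 minutes; a 5-min cadence would yield 10 minutes. Whether MinObservations should also grow at shorter cadences is a separate tuning decision this sketch does not make.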

Lifetime + state

This watcher is the first stateful one in the rc#947 set. Others (transient, stateless) read current Marten state per tick. This one accumulates observations across scans because the "≥30 min sustained" signal isn't queryable from a single snapshot.

DI registration:

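// One shared instance owns the rolling observation buffer.
builder.Services.AddSingleton<DependencyHealthDegradedWatcher>();
// Expose that same instance through the watcher interface.
builder.Services.AddTransient<ISelfImprovementWatcher>(sp =>
    sp.GetRequiredService<DependencyHealthDegradedWatcher>());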

Singleton lifetime keeps the rolling buffer alive across scans. The transient ISelfImprovementWatcher registration is a factory that returns that same singleton, so the IEnumerable<ISelfImprovementWatcher> resolution in SelfImprovementService still discovers it. State is guarded by a private lock: SelfImprovementService runs scans sequentially per cycle, but defensive thread-safety is cheap.
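The lock-guarded rolling buffer could look roughly like the following. It is a sketch under assumptions: ObservationBufferSketch, AppendAndSnapshot, and the record and enum shapes are hypothetical, and only the 4-hour retention and 200-entry cap come from the "Signal" section.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes mirroring the (Name, Status, Critical, At) tuple in the text.
public enum DepStatus { Ok, Degraded, Unreachable }

public sealed record DependencyObservation(
    string Name, DepStatus Status, bool Critical, DateTimeOffset At);

public sealed class ObservationBufferSketch
{
    private static readonly TimeSpan Retention = TimeSpan.FromHours(4);
    private const int MaxEntries = 200;

    private readonly object _gate = new();
    private readonly List<DependencyObservation> _history = new();

    // Appends this tick's observations, prunes, and returns a copy for the policy.
    public IReadOnlyList<DependencyObservation> AppendAndSnapshot(
        IEnumerable<DependencyObservation> current, DateTimeOffset now)
    {
        lock (_gate)
        {
            _history.AddRange(current);

            // Time-based pruning: drop anything older than the retention window.
            _history.RemoveAll(o => now - o.At > Retention);

            // Defensive cap: entries are appended in time order, so the
            // oldest ones sit at the front of the list.
            if (_history.Count > MaxEntries)
                _history.RemoveRange(0, _history.Count - MaxEntries);

            // Return a copy so callers never observe later mutations.
            return _history.ToArray();
        }
    }
}
```

Returning a snapshot rather than the live list keeps the policy call outside the lock and lets it stay a pure function of its inputs.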

Threshold for filing

SignatureThreshold = (Count: 3, Window: 7 days). Three occurrences of a degraded dep within a week is the trigger for an auto-filed issue. Per-dep dedup via EvidenceKey = "dep:{name}" means each dep tracks its own count; one bad week of Valkey doesn't suppress a new GitHub API regression.
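Per-key counting against a (Count, Window) threshold can be illustrated as follows. How SelfImprovementService actually stores and counts occurrences is not described here, so everything in this sketch (RecordedOccurrence, FilingThresholdSketch, KeysOverThreshold) is a hypothetical stand-in that only demonstrates why separate EvidenceKey values do not interfere with each other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one reported occurrence.
public sealed record RecordedOccurrence(string EvidenceKey, DateTimeOffset At);

public static class FilingThresholdSketch
{
    public const int Count = 3;
    public static readonly TimeSpan Window = TimeSpan.FromDays(7);

    // Returns the evidence keys with at least Count occurrences inside the window.
    public static IReadOnlyList<string> KeysOverThreshold(
        IEnumerable<RecordedOccurrence> occurrences, DateTimeOffset now)
        => occurrences
            .Where(o => o.At <= now && now - o.At <= Window)
            .GroupBy(o => o.EvidenceKey)
            .Where(g => g.Count() >= Count)
            .Select(g => g.Key)
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();
}
```

Each key is grouped and counted on its own, so three Valkey occurrences reach the threshold without affecting the count for "dep:GitHub", and occurrences older than seven days simply age out.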

Convergence

Standard rc#947 CloseAndReset convergence (per docs/2026-05-17-convergence-auto-close.md): when the watcher reports zero occurrences for a full scan window, the auto-filed issue closes itself. The signature stays registered as a regression canary.

Out of scope

  • Per-dep configurable thresholds (env-overridable) — straightforward extension, deferred to demand.
  • Severity tiers (critical Unreachable → file P1, non-critical Degraded → file P3) — current behaviour is one issue per dep, framed correctly via (critical|non-critical) in the summary.
  • HTTP /healthz/deep history endpoint for the dashboard — covered in rc#1188 PR-D scope (or a follow-up dashboard slice).

Test coverage

| Tier | File | Cases |
| --- | --- | --- |
| 1 (pure policy) | DependencyHealthDegradationPolicyTests | 12: boundary at MinDuration, just-under, single recovery clears, sustained Degraded fires, mixed-status reports worst, multiple deps emit separate occurrences, non-critical labelling, EvidenceKey dedup |
| 2 (watcher smoke) | DependencyHealthDegradedWatcherTests | 5: empty checker set, first call doesn't fire, three quick calls don't fire (zero time-span), Degraded path doesn't throw, signature metadata is conventional |
| 3 (e2e against real DeepHealthService) | | Skipped: requires a TimeProvider seam through DeepHealthService (unrelated refactor). Policy tests cover the 30-min threshold logic with synthetic observations. |

See also