Skip to content

Valkey AbortOnConnectFail=false

Tracking: rc#1173.

The bug

src/ConductorE.Api/Program.cs:60-85 (pre-rc#1173):

var valkeyUrl = Environment.GetEnvironmentVariable("VALKEY_URL");
try
{
    if (!string.IsNullOrEmpty(valkeyUrl))
    {
        var redis = ConnectionMultiplexer.Connect(valkeyUrl);
        builder.Services.AddSingleton<IConnectionMultiplexer>(redis);
        builder.Services.AddSingleton<IStreamPublisher, ValkeyStreamPublisher>();
        // …
    }
    // …
}
catch (Exception ex)
{
    builder.Services.AddSingleton<IStreamPublisher, NullStreamPublisher>();
    // …
}

ConnectionMultiplexer.Connect(string) has default AbortOnConnectFail = true + ConnectTimeout = 5000ms. Any transient Valkey unreadiness at conductor startup (Valkey pod restart, network flap, DNS delay) throws synchronously. The catch registers NullStreamPublisher once. No retry, no liveness-probe failure, no observable signal beyond the buried watcher noise. The conductor pod stays alive serving HTTP traffic for its full lifetime while every agent dispatch silently no-ops.

The comment said "optional, degrades gracefully" — but the degradation was silent and permanent.

The 2026-05-18 incident

Time Event
22:19 UTC Last successful Valkey stream write (PR #1151 auto-pin assignment)
22:28 UTC Conductor pod restart hit transient Valkey unreadiness → NullStreamPublisher registered → silent degradation begins
22:46 UTC PR #1165 (rc#1133 PR B, codex heartbeat fix) opened — never reaches review-e
06:00 UTC (2026-05-19) PR #1168 (rc#1166 main-guard fix) opened — same fate
06:30 UTC Orchestrator catches the silent failure; operator-approved restart finds Valkey healthy, Valkey connected: logs successfully
06:32 UTC Empty-commit nudges to both PRs → dispatch resumes

8 hours of silent dispatch failure. Only signal: StreamConsumerWithoutHeartbeatWatcher threw on every scan tick (buried in noisy logs).

The fix

var config = StackExchange.Redis.ConfigurationOptions.Parse(valkeyUrl);
config.AbortOnConnectFail = false;
config.ConnectTimeout = 5000;
config.ConnectRetry = 3;
var redis = ConnectionMultiplexer.Connect(config);

AbortOnConnectFail = false is the load-bearing change. With it, ConnectionMultiplexer.Connect() returns even when the underlying socket isn't up yet — the multiplexer connects in background and reconnects automatically on subsequent socket drops. Calls to GetDatabase() block briefly during the recovery window rather than no-op'ing forever.

ConnectRetry = 3 and ConnectTimeout = 5s keep the recovery tight: at most 15s of blocking during transient unreadiness, then either success or a fresh attempt.

The catch block stays — it handles malformed VALKEY_URL strings (which ConfigurationOptions.Parse throws on) and unforeseen library errors. Operationally the catch should be effectively unreachable for transient unreadiness.

Why this is the right path-of-three

The rc#1173 issue listed three fix paths:

Path Cost Coverage
A — AbortOnConnectFail=false (this PR) one-line + comment Eliminates the silent-degrade class entirely for transient Valkey unreadiness.
B — /healthz/valkey + liveness probe ~30 LOC + k8s yaml Defence-in-depth: if AbortOnConnectFail=false somehow doesn't recover (e.g. Valkey gone for hours), liveness probe restarts the pod.
C — dispatcher-silent-noop self-improvement watcher ~150 LOC + tests Regression detector for any future silent-dispatch failure mode (not just Valkey-specific).

A is the smallest, highest-leverage fix and ships in this PR. B is tracked for a follow-up; A makes B less critical because the failure mode it guards against now can't persist for hours. C is also tracked as a follow-up for the rc#947 self-improvement framework — it would catch silent-dispatch failures from any cause (not just Valkey), at the cost of being more LOC and depending on the watcher infrastructure.

Test coverage rationale

Per HARD RULE [TDD/DDD-on-top]: this is a config-level change in Program.cs, not a new domain policy. No new tests required by the rule.

Per HARD RULE [every-behavior-PR-e2e]: a true Tier-3 e2e test for this would kill the Valkey pod, restart the conductor, observe recovery. That requires WebApplicationFactory<Program> + a controllable Valkey container that can be stopped mid-test. Intentionally deferred because:

  1. The change wraps the upstream StackExchange.Redis library's own well-tested behavior (AbortOnConnectFail is its canonical retry hook). The library has extensive published test coverage for this exact behavior.
  2. The existing ConductorEApiFactory testcontainer-backed tests already exercise the connection path on every test run — they'll fail loudly if ConfigurationOptions.Parse regresses or the new wiring breaks.
  3. The 2026-05-18 incident IS the verification path: the next pod restart (when this PR deploys via auto-pin) is the canonical real-world test. If Valkey is unreachable at that moment, the conductor should connect in background and start dispatching within seconds — observable via assignments:* stream growth.

The cost of regression is bounded: another silent-degrade incident, detected by the same operator-observation path that found this one. Recovery is the same pod restart.

Verification path post-deploy

  1. Startup log should show Valkey configured: <url> (AbortOnConnectFail=false, IsConnected=<bool>). IsConnected=False is fine at first — the multiplexer connects in background.
  2. First few dispatches after deploy should land on assignments:* streams normally. If transient unreadiness happens, expect 2-5s blocking on GetDatabase() calls but no permanent silent-no-op.
  3. StreamConsumerWithoutHeartbeatWatcher exception trail in the conductor logs should be empty (the No service for type 'StackExchange.Redis.IConnectionMultiplexer' symptom from yesterday's degraded state shouldn't recur).

See also