Valkey AbortOnConnectFail=false¶
Tracking: rc#1173.
The bug¶
src/ConductorE.Api/Program.cs:60-85 (pre-rc#1173):
var valkeyUrl = Environment.GetEnvironmentVariable("VALKEY_URL");
try
{
if (!string.IsNullOrEmpty(valkeyUrl))
{
var redis = ConnectionMultiplexer.Connect(valkeyUrl);
builder.Services.AddSingleton<IConnectionMultiplexer>(redis);
builder.Services.AddSingleton<IStreamPublisher, ValkeyStreamPublisher>();
// …
}
// …
}
catch (Exception ex)
{
builder.Services.AddSingleton<IStreamPublisher, NullStreamPublisher>();
// …
}
ConnectionMultiplexer.Connect(string) has default AbortOnConnectFail = true + ConnectTimeout = 5000ms. Any transient Valkey unreadiness at conductor startup (Valkey pod restart, network flap, DNS delay) throws synchronously. The catch registers NullStreamPublisher once. No retry, no liveness-probe failure, no observable signal beyond the buried watcher noise. The conductor pod stays alive serving HTTP traffic for its full lifetime while every agent dispatch silently no-ops.
The comment said "optional, degrades gracefully" — but the degradation was silent and permanent.
The 2026-05-18 incident¶
| Time | Event |
|---|---|
| 22:19 UTC | Last successful Valkey stream write (PR #1151 auto-pin assignment) |
| 22:28 UTC | Conductor pod restart hit transient Valkey unreadiness → NullStreamPublisher registered → silent degradation begins |
| 22:46 UTC | PR #1165 (rc#1133 PR B, codex heartbeat fix) opened — never reaches review-e |
| 06:00 UTC (2026-05-19) | PR #1168 (rc#1166 main-guard fix) opened — same fate |
| 06:30 UTC | Orchestrator catches the silent failure; operator-approved restart finds Valkey healthy, Valkey connected: logs successfully |
| 06:32 UTC | Empty-commit nudges to both PRs → dispatch resumes |
8 hours of silent dispatch failure. Only signal: StreamConsumerWithoutHeartbeatWatcher threw on every scan tick (buried in noisy logs).
The fix¶
var config = StackExchange.Redis.ConfigurationOptions.Parse(valkeyUrl);
config.AbortOnConnectFail = false;
config.ConnectTimeout = 5000;
config.ConnectRetry = 3;
var redis = ConnectionMultiplexer.Connect(config);
AbortOnConnectFail = false is the load-bearing change. With it, ConnectionMultiplexer.Connect() returns even when the underlying socket isn't up yet — the multiplexer connects in background and reconnects automatically on subsequent socket drops. Calls to GetDatabase() block briefly during the recovery window rather than no-op'ing forever.
ConnectRetry = 3 and ConnectTimeout = 5s keep the recovery tight: at most 15s of blocking during transient unreadiness, then either success or a fresh attempt.
The catch block stays — it handles malformed VALKEY_URL strings (which ConfigurationOptions.Parse throws on) and unforeseen library errors. Operationally the catch should be effectively unreachable for transient unreadiness.
Why this is the right path-of-three¶
The rc#1173 issue listed three fix paths:
| Path | Cost | Coverage |
|---|---|---|
A — AbortOnConnectFail=false (this PR) |
one-line + comment | Eliminates the silent-degrade class entirely for transient Valkey unreadiness. |
B — /healthz/valkey + liveness probe |
~30 LOC + k8s yaml | Defence-in-depth: if AbortOnConnectFail=false somehow doesn't recover (e.g. Valkey gone for hours), liveness probe restarts the pod. |
C — dispatcher-silent-noop self-improvement watcher |
~150 LOC + tests | Regression detector for any future silent-dispatch failure mode (not just Valkey-specific). |
A is the smallest, highest-leverage fix and ships in this PR. B is tracked for a follow-up; A makes B less critical because the failure mode it guards against now can't persist for hours. C is also tracked as a follow-up for the rc#947 self-improvement framework — it would catch silent-dispatch failures from any cause (not just Valkey), at the cost of being more LOC and depending on the watcher infrastructure.
Test coverage rationale¶
Per HARD RULE [TDD/DDD-on-top]: this is a config-level change in Program.cs, not a new domain policy. No new tests required by the rule.
Per HARD RULE [every-behavior-PR-e2e]: a true Tier-3 e2e test for this would kill the Valkey pod, restart the conductor, observe recovery. That requires WebApplicationFactory<Program> + a controllable Valkey container that can be stopped mid-test. Intentionally deferred because:
- The change wraps the upstream
StackExchange.Redislibrary's own well-tested behavior (AbortOnConnectFailis its canonical retry hook). The library has extensive published test coverage for this exact behavior. - The existing
ConductorEApiFactorytestcontainer-backed tests already exercise the connection path on every test run — they'll fail loudly ifConfigurationOptions.Parseregresses or the new wiring breaks. - The 2026-05-18 incident IS the verification path: the next pod restart (when this PR deploys via auto-pin) is the canonical real-world test. If Valkey is unreachable at that moment, the conductor should connect in background and start dispatching within seconds — observable via
assignments:*stream growth.
The cost of regression is bounded: another silent-degrade incident, detected by the same operator-observation path that found this one. Recovery is the same pod restart.
Verification path post-deploy¶
- Startup log should show
Valkey configured: <url> (AbortOnConnectFail=false, IsConnected=<bool>).IsConnected=Falseis fine at first — the multiplexer connects in background. - First few dispatches after deploy should land on
assignments:*streams normally. If transient unreadiness happens, expect 2-5s blocking onGetDatabase()calls but no permanent silent-no-op. StreamConsumerWithoutHeartbeatWatcherexception trail in the conductor logs should be empty (theNo service for type 'StackExchange.Redis.IConnectionMultiplexer'symptom from yesterday's degraded state shouldn't recur).
See also¶
- rc#1173 — design issue with the full incident postmortem.
- PR #1165 — the rc#1133 codex heartbeat fix stalled 7.5h by this bug; the immediate prompt that surfaced the failure mode.
StackExchange.Redisconfiguration docs —AbortOnConnectFailsemantics.