feat(codex-cli): fail-fast gate at process() entry¶
The problem¶
Before this PR, the codex provider had a _quotaExhaustedUntil state machine that captured the reset timestamp from a 429 usage_limit_reached response — but only the conductor's CodexQuotaGuard read it. The codex provider's own process() function never checked the state at entry, so every subsequent assignment within the ~4h exhaustion window respawned codex CLI, hit the same 429, and waited ~13s before retrying.
The rar stream consumer retries up to 3× before emitting AGENT_STUCK. That meant ~40s of wasted spawn-then-fail per stuck dispatch. Operator evidence from rig-conductor#959: ~25% of recent review dispatches hit this pattern.
The fix¶
Pure policy shouldFailFastForQuotaExhaustion(quotaExhaustedUntil, now) in src/agent/providers/codex-cli.js:
Adapter gate at the top of process():
const failFast = shouldFailFastForQuotaExhaustion(_quotaExhaustedUntil, Date.now());
if (failFast.failFast) {
throw new AgentProviderError('codex-cli', summary, {
userMessage: 'Codex quota exhausted. Trying the next configured provider.',
});
}
The thrown error uses the standard fallback userMessage so agent.js chain iteration (rar#420 / PR #475) routes the work to claude immediately.
What changed in numbers¶
- Per-assignment cost during exhaustion: ~40s → <10ms (3× ~13s retries skipped)
- New codex CLI subprocesses spawned within exhaustion window: N → 0
agent_stuckevent firing latency on saturated codex: ~40s → <10ms
Why this composes with chain iteration¶
The fail-fast alone doesn't deliver the review — it just exits fast. The actual delivery via claude happens because the userMessage matches the fallback-eligible contract that processWithProviders in agent.js consumes (see chain iteration doc). Without the gate, the fail-fast would still happen but at the much slower retry-3× cadence. The gate makes chain iteration cheap.
Tests¶
7 unit tests in src/agent/providers/codex-cli.test.js:
- null state → no fail-fast
- exhaustion within future window → fail-fast with reset metadata
- partial-second remaining → rounds UP (conservative)
- boundary now === resetAt → no fail-fast (just expired)
- past expiry → no fail-fast
- 4h-class window → resetSecondsRemaining = 14400
- pure-function contract smoke check
The policy was extracted exactly so it could be tested without spawning codex — getCodexAuthenticationState requires the codex binary which CI doesn't have. This same lesson was re-learned the hard way on rar#484 / PR #484 where a buildHeartbeatSnapshot test that constructed a provider: 'codex-cli' character failed with spawn codex ENOENT on CI but passed locally; the fix was dropping the codex e2e test in favour of the policy unit tests.
Pointers¶
- Issue: rar#468
- PR: #469
- Composes with: rar#420 / PR #475 chain iteration
- Heartbeat signal: rar#482 / PR #484 degraded heartbeat
- Conductor-side recovery: rc#944 / PR #1094 quota recovery reconciliation