Skip to content

feat(codex-cli): fail-fast gate at process() entry

Tracking: rar#468 / PR #469.

The problem

Before this PR, the codex provider had a _quotaExhaustedUntil state machine that captured the reset timestamp from a 429 usage_limit_reached response — but only the conductor's CodexQuotaGuard read it. The codex provider's own process() function never checked the state at entry, so every subsequent assignment within the ~4h exhaustion window respawned codex CLI, hit the same 429, and waited ~13s before retrying.

The rar stream consumer retries up to 3× before emitting AGENT_STUCK. That meant ~40s of wasted spawn-then-fail per stuck dispatch. Operator evidence from rig-conductor#959: ~25% of recent review dispatches hit this pattern.

The fix

Pure policy shouldFailFastForQuotaExhaustion(quotaExhaustedUntil, now) in src/agent/providers/codex-cli.js:

{
  failFast: boolean,
  resetIso: string|null,
  resetSecondsRemaining: number|null,
}

Adapter gate at the top of process():

const failFast = shouldFailFastForQuotaExhaustion(_quotaExhaustedUntil, Date.now());
if (failFast.failFast) {
  throw new AgentProviderError('codex-cli', summary, {
    userMessage: 'Codex quota exhausted. Trying the next configured provider.',
  });
}

The thrown error uses the standard fallback userMessage so agent.js chain iteration (rar#420 / PR #475) routes the work to claude immediately.

What changed in numbers

  • Per-assignment cost during exhaustion: ~40s → <10ms (3× ~13s retries skipped)
  • New codex CLI subprocesses spawned within exhaustion window: N → 0
  • agent_stuck event firing latency on saturated codex: ~40s → <10ms

Why this composes with chain iteration

The fail-fast alone doesn't deliver the review — it just exits fast. The actual delivery via claude happens because the userMessage matches the fallback-eligible contract that processWithProviders in agent.js consumes (see chain iteration doc). Without the gate, the fail-fast would still happen but at the much slower retry-3× cadence. The gate makes chain iteration cheap.

Tests

7 unit tests in src/agent/providers/codex-cli.test.js: - null state → no fail-fast - exhaustion within future window → fail-fast with reset metadata - partial-second remaining → rounds UP (conservative) - boundary now === resetAt → no fail-fast (just expired) - past expiry → no fail-fast - 4h-class window → resetSecondsRemaining = 14400 - pure-function contract smoke check

The policy was extracted exactly so it could be tested without spawning codex — getCodexAuthenticationState requires the codex binary which CI doesn't have. This same lesson was re-learned the hard way on rar#484 / PR #484 where a buildHeartbeatSnapshot test that constructed a provider: 'codex-cli' character failed with spawn codex ENOENT on CI but passed locally; the fix was dropping the codex e2e test in favour of the policy unit tests.

Pointers