Skip to content

feat(reconciler): re-dispatch stuck PRs when provider quota recovers

Tracking: rc#944.

Problem

Before today's chain iteration shipped (rar#420 / PR #475), codex 429 saturation caused review-e to drop the work after 3 retries and emit AGENT_STUCKIssueStatus.State="failed". Even after the 5-hour quota window reset (~23 min later), the failed issue stayed in state=failed because the conductor's view didn't consider it reviewable.

Recovery required manual orchestrator action: gh pr update-branch <N> on each, which fired pull_request.synchronize → conductor re-dispatched.

With today's chain iteration, the silent-failure path is much rarer: codex 429 → fail-fast → claude takes over → review happens. But there are still cases where both providers saturate (operator quota mixed across multiple sessions), and there are pre-existing failed issues from before chain iteration shipped that need manual recovery. This PR closes that gap.

Design

Add a third recovery path to ReconciliationService (alongside abstention recovery and timeout recovery), gated on a pure Core policy.

Pure policy — QuotaRecoveryPolicy

Located at src/ConductorE.Core/Domain/QuotaRecoveryPolicy.cs. Two static methods:

// Match the free-form AgentStuck.Reason text against known quota signatures.
public static bool IsQuotaExhaustedReason(string? reason);

// Combine "was-quota-stuck" + current quota state → re-dispatch decision.
public static bool ShouldReDispatchOnQuotaRecovery(
    bool wasQuotaStuck,
    double? currentQuotaFiveHourPct,
    double recoveryThresholdPct = 80.0);

Signature list (case-insensitive substring match on AgentStuck.Reason):

  • "Codex quota exhausted" — emitted by rar codex-cli fail-fast gate (rar#468 / PR #469)
  • "usage_limit_reached" — verbatim from codex API 429 response
  • "rate_limit" — generic OpenAI/Anthropic rate-cap surface
  • "X-Codex-Primary-Used-Percent" — header substring that leaks into stderr when codex saturates

Recovery threshold default: 80% (configurable per-call). Quota at the exact threshold is NOT recovered (<, not <=) to prevent fence-post oscillation when the value hovers at the line.

Adapter — ReconciliationService.ReconcileQuotaRecoveryAsync

Located at src/ConductorE.Api/Services/ReconciliationService.cs. Runs every reconciliation tick (every 5 min, same cadence as the existing paths).

for each issue where State="failed" and PrNumber is set and !IsReviewEAuthored:
  walk the event stream
  find the most recent AgentStuck event
  if IsQuotaExhaustedReason(stuck.Reason):
    look up agent.QuotaFiveHourPct via IAgentQuery
    if ShouldReDispatchOnQuotaRecovery(true, agent.QuotaFiveHourPct):
      throttle check (30 min, same as abstention path)
      emit RE_REVIEW_REQUESTED with reason="prior_provider_quota_recovered"
      re-publish to assignments stream

The agent registry (IAgentQuery) is queried once per tick, not per issue, to amortise the cost. Most failed issues will share a small set of responsible agents.

Defensive guards

  • IStreamPublisher not registered → skip (test fixtures, minimal pod configs).
  • IAgentQuery not registered → skip (same — defensive against incomplete DI).
  • Agent missing from registry → skip (pod rename, archive).
  • COI guard — review-e-authored PRs are never re-dispatched to review-e (would 422 on GitHub).
  • Throttle — same 30-min window as the abstention path; prevents thundering-herd when quota recovers then re-saturates immediately.

What this does NOT change

  • Schema: no new fields on IssueStatus. The failure reason lives on the event stream as part of AgentStuck.Reason — the reconciler walks it on demand. Adding a LastFailureReason projection field would be a follow-up if we observe scan-time perf problems (1803-line ReconciliationService already walks streams, so the per-issue cost is amortised against the existing scans).
  • rar emission: no change to how rar emits AGENT_STUCK. The existing free-form reason field is parsed by the policy; rar didn't need to learn new event types.
  • Claude-cli signature coverage: the signature list is biased toward codex (the case we've actually observed). Claude rate-cap signatures can be added to the list when we observe them.

TDD + DDD discipline

Per the operator-set hard rule (2026-05-18):

  • Policy in CoreQuotaRecoveryPolicy is a pure static class with no I/O, no DI, no clock. 22 unit tests in tests/ConductorE.Core.Tests/Domain/QuotaRecoveryPolicyTests.cs cover the signature inventory, threshold boundaries, null handling, and custom-threshold paths.
  • Tests first — both policy unit tests (Core) AND adapter e2e tests (Api) were written before the implementation.
  • Adapter as thin I/O shellReconcileQuotaRecoveryAsync is the only non-pure code; it walks streams + queries agents + emits events. All decisions are delegated to the policy.

7 new adapter tests in ReconciliationServiceTests.cs cover: 1. Stuck on codex quota + quota recovered → re-dispatch 2. Stuck on codex quota + quota still saturated → no-op 3. Stuck on non-quota reason → no-op (even with low quota) 4. Recent ReReviewRequested in stream → throttled 5. Agent not in registry → skipped safely 6. Review-E-authored PR → COI guard skips 7. No IAgentQuery registered → no-op safely

Pointers