feat(reconciler): re-dispatch stuck PRs when provider quota recovers¶
Tracking: rc#944.
Problem¶
Before today's chain iteration shipped (rar#420 / PR #475), codex 429 saturation caused review-e to drop the work after 3 retries and emit AGENT_STUCK → IssueStatus.State="failed". Even after the 5-hour quota window reset (~23 min later), the failed issue stayed in state=failed because the conductor's view didn't consider it reviewable.
Recovery required manual orchestrator action: gh pr update-branch <N> on each, which fired pull_request.synchronize → conductor re-dispatched.
With today's chain iteration, the silent-failure path is much rarer: codex 429 → fail-fast → claude takes over → review happens. But there are still cases where both providers saturate (operator quota mixed across multiple sessions), and there are pre-existing failed issues from before chain iteration shipped that need manual recovery. This PR closes that gap.
Design¶
Add a third recovery path to ReconciliationService (alongside abstention recovery and timeout recovery), gated on a pure Core policy.
Pure policy — QuotaRecoveryPolicy¶
Located at src/ConductorE.Core/Domain/QuotaRecoveryPolicy.cs. Two static methods:
// Match the free-form AgentStuck.Reason text against known quota signatures.
public static bool IsQuotaExhaustedReason(string? reason);
// Combine "was-quota-stuck" + current quota state → re-dispatch decision.
public static bool ShouldReDispatchOnQuotaRecovery(
bool wasQuotaStuck,
double? currentQuotaFiveHourPct,
double recoveryThresholdPct = 80.0);
Signature list (case-insensitive substring match on AgentStuck.Reason):
"Codex quota exhausted"— emitted by rar codex-cli fail-fast gate (rar#468 / PR #469)"usage_limit_reached"— verbatim from codex API 429 response"rate_limit"— generic OpenAI/Anthropic rate-cap surface"X-Codex-Primary-Used-Percent"— header substring that leaks into stderr when codex saturates
Recovery threshold default: 80% (configurable per-call). Quota at the exact threshold is NOT recovered (<, not <=) to prevent fence-post oscillation when the value hovers at the line.
Adapter — ReconciliationService.ReconcileQuotaRecoveryAsync¶
Located at src/ConductorE.Api/Services/ReconciliationService.cs. Runs every reconciliation tick (every 5 min, same cadence as the existing paths).
for each issue where State="failed" and PrNumber is set and !IsReviewEAuthored:
walk the event stream
find the most recent AgentStuck event
if IsQuotaExhaustedReason(stuck.Reason):
look up agent.QuotaFiveHourPct via IAgentQuery
if ShouldReDispatchOnQuotaRecovery(true, agent.QuotaFiveHourPct):
throttle check (30 min, same as abstention path)
emit RE_REVIEW_REQUESTED with reason="prior_provider_quota_recovered"
re-publish to assignments stream
The agent registry (IAgentQuery) is queried once per tick, not per issue, to amortise the cost. Most failed issues will share a small set of responsible agents.
Defensive guards¶
- IStreamPublisher not registered → skip (test fixtures, minimal pod configs).
- IAgentQuery not registered → skip (same — defensive against incomplete DI).
- Agent missing from registry → skip (pod rename, archive).
- COI guard — review-e-authored PRs are never re-dispatched to review-e (would 422 on GitHub).
- Throttle — same 30-min window as the abstention path; prevents thundering-herd when quota recovers then re-saturates immediately.
What this does NOT change¶
- Schema: no new fields on IssueStatus. The failure reason lives on the event stream as part of
AgentStuck.Reason— the reconciler walks it on demand. Adding aLastFailureReasonprojection field would be a follow-up if we observe scan-time perf problems (1803-line ReconciliationService already walks streams, so the per-issue cost is amortised against the existing scans). - rar emission: no change to how rar emits AGENT_STUCK. The existing free-form
reasonfield is parsed by the policy; rar didn't need to learn new event types. - Claude-cli signature coverage: the signature list is biased toward codex (the case we've actually observed). Claude rate-cap signatures can be added to the list when we observe them.
TDD + DDD discipline¶
Per the operator-set hard rule (2026-05-18):
- Policy in Core —
QuotaRecoveryPolicyis a pure static class with no I/O, no DI, no clock. 22 unit tests intests/ConductorE.Core.Tests/Domain/QuotaRecoveryPolicyTests.cscover the signature inventory, threshold boundaries, null handling, and custom-threshold paths. - Tests first — both policy unit tests (Core) AND adapter e2e tests (Api) were written before the implementation.
- Adapter as thin I/O shell —
ReconcileQuotaRecoveryAsyncis the only non-pure code; it walks streams + queries agents + emits events. All decisions are delegated to the policy.
7 new adapter tests in ReconciliationServiceTests.cs cover:
1. Stuck on codex quota + quota recovered → re-dispatch
2. Stuck on codex quota + quota still saturated → no-op
3. Stuck on non-quota reason → no-op (even with low quota)
4. Recent ReReviewRequested in stream → throttled
5. Agent not in registry → skipped safely
6. Review-E-authored PR → COI guard skips
7. No IAgentQuery registered → no-op safely
Pointers¶
- Issue: rc#944
- Policy:
src/ConductorE.Core/Domain/QuotaRecoveryPolicy.cs - Adapter:
src/ConductorE.Api/Services/ReconciliationService.cs—ReconcileQuotaRecoveryAsync - Tests:
tests/ConductorE.Core.Tests/Domain/QuotaRecoveryPolicyTests.cs,tests/ConductorE.Api.Tests/ReconciliationServiceTests.cs— searchReconcileQuotaRecovery_* - Pairs with: rar#468/PR#469 (in-pod codex fail-fast), rar#420/PR#475 (chain iteration), rc#959 gap #1+#2 (complement)