Provider chain iteration¶
The problem¶
character.json allows a fallbackProviders list:
{
"llm": {
"provider": "codex-cli",
"fallbackProviders": ["claude-cli"],
"providers": { "claude-cli": { "model": "claude-opus-4-7[1m]" } }
}
}
The chain was BUILT in memory (providers Map in agent.js#createAgent) but agent.process() only ever called the active provider. On error it logged "Provider X failed: ... Trying the next configured provider." — but no actual fallback happened. The provider chain was decorative.
Combined with codex's usage_limit_reached 429 (covered by rar#468 fail-fast gate) and the operator's quota cycle, ~25% of review dispatches stuck because the second provider was right there but never tried (see rig-conductor#959 for the operator-verified evidence).
The design¶
Two pieces:
1. Pure policy: shouldTryNextProvider(providerError) -> boolean¶
In src/agent/provider-error.js:
export function shouldTryNextProvider(providerError) {
if (providerError === null || providerError === undefined) return false;
return providerError.fallbackEligible !== false;
}
AgentProviderError has always had a fallbackEligible field defaulting to true. It was never consumed. This classifier reads it.
- Returns
truefor the typical provider error (codex/claude/openai all defaulttruewhen throwing fallback-friendly errors). - Returns
falsewhen an error specifically opts out vianew AgentProviderError(p, m, { fallbackEligible: false }). - Returns
falsedefensively fornull/undefined— never fall back on missing input.
2. Adapter: processWithProviders({ providers, chain, active, ... })¶
Exported from src/agent.js. Iterates the chain, trying each provider until one succeeds:
export async function processWithProviders({ providers, chain, active, message, context, dashboard, onProgress }) {
const ordered = [active, ...chain.filter(p => p !== active)];
let lastError = null;
for (let i = 0; i < ordered.length; i++) {
const provider = ordered[i];
const agent = providers.get(provider);
if (!agent) {
lastError = new Error(`Configured provider '${provider}' is not available.`);
continue;
}
if (onProgress) {
const label = i === 0
? `Using provider ${provider}.`
: `Falling back to provider ${provider}.`;
await onProgress(label);
}
try {
return await agent.process(message, context, dashboard, onProgress);
} catch (error) {
const providerError = toAgentProviderError(provider, error);
lastError = providerError;
// ... log / notify ...
const isLast = i === ordered.length - 1;
if (!shouldTryNextProvider(providerError) || isLast) {
// Re-throw with the standard recovery-flag shape.
const finalError = new Error(providerError.message);
finalError.userMessage = providerError.userMessage || 'The selected provider failed. Switch provider and retry.';
finalError.retryable = providerError.retryable;
finalError.sessionNotFound = providerError.sessionNotFound;
finalError.provider = provider;
throw finalError;
}
}
}
throw lastError || new Error('No providers were available.');
}
The public createAgent factory wraps this with the real providers Map:
return {
async process(message, context, dashboard, onProgress) {
return processWithProviders({
providers,
chain: configuredProviders,
active: getActiveProvider(character),
message, context, dashboard, onProgress,
});
},
};
This is a Core/Api split in spirit — the pure policy + iteration logic is testable in isolation; the factory is the thin I/O adapter. Past rar test convention had ZERO coverage for createAgent.process() because the factory wired closure state directly; now processWithProviders is a named export with stub-provider tests.
What this does NOT change¶
- Session pinning: rar#420 marks this as "preferred" but not required. The implementation here is per-invocation only — the NEXT assignment still tries the active provider first. Combined with rar#468 fail-fast, the active-provider attempt costs <10ms when quota is exhausted, so the per-invocation overhead is negligible.
- Heartbeat alive-but-degraded (rig-conductor#959 gap #2): the conductor's quota-aware dispatcher (rc#942) still sees the same heartbeat shape. Tracked separately.
- Stream-consumer rerouting (rig-conductor#959 gap #3): once Redis XADDs to
assignments:review-e, the chain iteration runs whichever consumer-pod wins XREADGROUP. Cross-pod rerouting is conductor-side concern, tracked separately.
Failure modes the design preserves¶
| Scenario | Behavior |
|---|---|
| Active provider succeeds | Return immediately; no chain walk |
| Active throws fallback-eligible | Try next provider; succeed or continue walking |
| Active throws non-fallback-eligible | Re-throw immediately; secondary never called |
| All providers throw fallback-eligible | Re-throw the LAST provider's error |
| Single-provider chain | Re-throw on failure (no fallback target) |
| Active != chain[0] (manual override) | Active still tried first |
| Provider in chain missing from Map | Skip to next entry |
Telemetry¶
Each hop emits two log lines via the existing onProgress mechanism:
Using provider codex-cli.(orFalling back to provider claude-cli.for the secondary)Provider codex-cli failed: Codex quota exhausted. Trying the next configured provider.
Dashboard addLog('error', ...) fires for each provider failure. The final re-thrown Error preserves userMessage, retryable, sessionNotFound, and sets .provider to the LAST attempted provider — so the stream consumer's agent_stuck emission names the actual point of failure.
Tests¶
src/agent.test.js covers 9 cases for processWithProviders with stub providers (primary success, fallback-eligible hop, non-fallback-eligible short-circuit, all-fail, single-provider, active != chain[0], sessionNotFound preservation, onProgress labels, missing-provider skip).
src/agent/provider-error.test.js covers 8 cases for shouldTryNextProvider (typed and untyped error shapes, null/undefined defensiveness, wrapped Error preservation).
Both files were written first (TDD red → green) per the operator-set hard rule (2026-05-18).
Related¶
- rar#468 / PR #469 — codex fail-fast gate (the prerequisite that makes per-invocation chain iteration cheap)
- rar#470 / PR #471 — npm test CI gate (the safety net for this PR + future provider work)
- rig-conductor#959 — the operator-filed evidence of the three gaps; this PR closes gap #1
- rig-conductor#942 — quota-aware dispatch (conductor-side complement)