Provider chain iteration¶

Tracking: rar#420 / PR #475.

The problem¶

character.json allows a fallbackProviders list:

{
  "llm": {
    "provider": "codex-cli",
    "fallbackProviders": ["claude-cli"],
    "providers": { "claude-cli": { "model": "claude-opus-4-7[1m]" } }
  }
}

The chain was BUILT in memory (providers Map in agent.js#createAgent) but agent.process() only ever called the active provider. On error it logged "Provider X failed: ... Trying the next configured provider." — but no actual fallback happened. The provider chain was decorative.

Combined with codex's usage_limit_reached 429 (covered by rar#468 fail-fast gate) and the operator's quota cycle, ~25% of review dispatches stuck because the second provider was right there but never tried (see rig-conductor#959 for the operator-verified evidence).

The design¶

Two pieces:

1. Pure policy: `shouldTryNextProvider(providerError) -> boolean`¶

In src/agent/provider-error.js:

export function shouldTryNextProvider(providerError) {
  if (providerError === null || providerError === undefined) return false;
  return providerError.fallbackEligible !== false;
}

AgentProviderError has always had a fallbackEligible field defaulting to true. It was never consumed. This classifier reads it.

Returns true for the typical provider error (codex/claude/openai all default true when throwing fallback-friendly errors).
Returns false when an error specifically opts out via new AgentProviderError(p, m, { fallbackEligible: false }).
Returns false defensively for null / undefined — never fall back on missing input.

2. Adapter: `processWithProviders({ providers, chain, active, ... })`¶

Exported from src/agent.js. Iterates the chain, trying each provider until one succeeds:

export async function processWithProviders({ providers, chain, active, message, context, dashboard, onProgress }) {
  const ordered = [active, ...chain.filter(p => p !== active)];

  let lastError = null;
  for (let i = 0; i < ordered.length; i++) {
    const provider = ordered[i];
    const agent = providers.get(provider);
    if (!agent) {
      lastError = new Error(`Configured provider '${provider}' is not available.`);
      continue;
    }

    if (onProgress) {
      const label = i === 0
        ? `Using provider ${provider}.`
        : `Falling back to provider ${provider}.`;
      await onProgress(label);
    }

    try {
      return await agent.process(message, context, dashboard, onProgress);
    } catch (error) {
      const providerError = toAgentProviderError(provider, error);
      lastError = providerError;
      // ... log / notify ...

      const isLast = i === ordered.length - 1;
      if (!shouldTryNextProvider(providerError) || isLast) {
        // Re-throw with the standard recovery-flag shape.
        const finalError = new Error(providerError.message);
        finalError.userMessage = providerError.userMessage || 'The selected provider failed. Switch provider and retry.';
        finalError.retryable = providerError.retryable;
        finalError.sessionNotFound = providerError.sessionNotFound;
        finalError.provider = provider;
        throw finalError;
      }
    }
  }

  throw lastError || new Error('No providers were available.');
}

The public createAgent factory wraps this with the real providers Map:

return {
  async process(message, context, dashboard, onProgress) {
    return processWithProviders({
      providers,
      chain: configuredProviders,
      active: getActiveProvider(character),
      message, context, dashboard, onProgress,
    });
  },
};

This is a Core/Api split in spirit — the pure policy + iteration logic is testable in isolation; the factory is the thin I/O adapter. Past rar test convention had ZERO coverage for createAgent.process() because the factory wired closure state directly; now processWithProviders is a named export with stub-provider tests.

What this does NOT change¶

Session pinning: rar#420 marks this as "preferred" but not required. The implementation here is per-invocation only — the NEXT assignment still tries the active provider first. Combined with rar#468 fail-fast, the active-provider attempt costs <10ms when quota is exhausted, so the per-invocation overhead is negligible.
Heartbeat alive-but-degraded (rig-conductor#959 gap #2): the conductor's quota-aware dispatcher (rc#942) still sees the same heartbeat shape. Tracked separately.
Stream-consumer rerouting (rig-conductor#959 gap #3): once Redis XADDs to assignments:review-e, the chain iteration runs whichever consumer-pod wins XREADGROUP. Cross-pod rerouting is conductor-side concern, tracked separately.

Failure modes the design preserves¶

Scenario	Behavior
Active provider succeeds	Return immediately; no chain walk
Active throws fallback-eligible	Try next provider; succeed or continue walking
Active throws non-fallback-eligible	Re-throw immediately; secondary never called
All providers throw fallback-eligible	Re-throw the LAST provider's error
Single-provider chain	Re-throw on failure (no fallback target)
Active != chain[0] (manual override)	Active still tried first
Provider in chain missing from Map	Skip to next entry

Telemetry¶

Each hop emits two log lines via the existing onProgress mechanism:

Using provider codex-cli. (or Falling back to provider claude-cli. for the secondary)
Provider codex-cli failed: Codex quota exhausted. Trying the next configured provider.

Dashboard addLog('error', ...) fires for each provider failure. The final re-thrown Error preserves userMessage, retryable, sessionNotFound, and sets .provider to the LAST attempted provider — so the stream consumer's agent_stuck emission names the actual point of failure.

Tests¶

src/agent.test.js covers 9 cases for processWithProviders with stub providers (primary success, fallback-eligible hop, non-fallback-eligible short-circuit, all-fail, single-provider, active != chain[0], sessionNotFound preservation, onProgress labels, missing-provider skip).

src/agent/provider-error.test.js covers 8 cases for shouldTryNextProvider (typed and untyped error shapes, null/undefined defensiveness, wrapped Error preservation).

Both files were written first (TDD red → green) per the operator-set hard rule (2026-05-18).

rar#468 / PR #469 — codex fail-fast gate (the prerequisite that makes per-invocation chain iteration cheap)
rar#470 / PR #471 — npm test CI gate (the safety net for this PR + future provider work)
rig-conductor#959 — the operator-filed evidence of the three gaps; this PR closes gap #1
rig-conductor#942 — quota-aware dispatch (conductor-side complement)