Provider & Runtime Portability — Not Just Anthropic

TL;DR

The rig is vendor-neutral at four layers: coordination (Conductor-E events are provider-agnostic), gateway (LiteLLM routes to 100+ providers), instrumentation (OpenTelemetry GenAI semantic conventions), and instructions (AGENTS.md is a cross-tool standard read by Claude Code, Codex CLI, Gemini CLI, Cursor, and Aider). Claude Code + Anthropic is the default runtime for Dev-E / Review-E / Spec-E because it's what the team ships today, not because the architecture requires it. Any provider with an OpenAI-compatible API or a LiteLLM adapter works.

Why this document exists

Earlier drafts of the whitepaper referenced Claude, Anthropic, and Sonnet / Opus / Haiku as if they were the only option. They are not. The underlying architecture is multi-vendor by design — but the prose hid it. This doc corrects that framing and makes the portability story explicit. If you read "Claude Sonnet 4.6" anywhere else in the whitepaper, read it as "the configured default model, currently Sonnet 4.6" — not as a required pick.

The four portability layers

Each layer has been chosen specifically so no single vendor is load-bearing:

graph TB
    classDef layer fill:#e3f2fd,stroke:#1565c0,color:#000
    classDef vendor fill:#fff3e0,stroke:#e65100,color:#000
    classDef defense fill:#e8f5e9,stroke:#2e7d32,color:#000

    L1[1. Coordination layer<br/>Conductor-E events + projections]:::layer
    L2[2. Gateway layer<br/>LiteLLM proxy]:::layer
    L3[3. Instrumentation layer<br/>OpenTelemetry GenAI conventions]:::layer
    L4[4. Instruction layer<br/>AGENTS.md cross-tool standard]:::layer

    V1[Anthropic<br/>Claude Opus / Sonnet / Haiku]:::vendor
    V2[OpenAI<br/>GPT-5.2 / mini / o3]:::vendor
    V3[Google<br/>Gemini 3.1 Pro / Flash]:::vendor
    V4[Local<br/>Ollama: llama3.2 / custom]:::vendor
    V5[Aggregator<br/>OpenRouter]:::vendor

    D1[Events carry no vendor-specific shape]:::defense
    D2[LiteLLM virtual keys route per-model]:::defense
    D3[OTel traces portable across backends]:::defense
    D4[AGENTS.md works in Claude Code / Codex CLI / Gemini CLI / Cursor / Aider]:::defense

    L1 --> D1
    L2 --> D2
    L3 --> D3
    L4 --> D4
    D2 --> V1
    D2 --> V2
    D2 --> V3
    D2 --> V4
    D2 --> V5

Layer 1 — Coordination (Conductor-E events)

The event record types in ConductorE.Core.Domain.Events — IssueAssigned, WorkStarted, PrCreated, ReviewPassed, GuardBlocked, TokenUsage, and the rest — carry no provider-specific fields. TokenUsage has model and provider as plain strings ("anthropic/claude-sonnet-4-6", "openai/gpt-5.2", "google/gemini-3.1-pro", "ollama/llama3.2"). Projections compute per-provider totals without changing the schema.
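A Python sketch of that shape — the real records are .NET types in ConductorE.Core.Domain.Events, so the field names and projection here are illustrative, not the rig's actual code:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenUsage:
    # Provider identity is a plain string, never a vendor-specific
    # payload shape — new providers need no schema change.
    provider: str
    model: str
    input_tokens: int
    output_tokens: int

def per_provider_totals(events):
    """Projection: total tokens per provider string."""
    totals = defaultdict(int)
    for e in events:
        totals[e.provider] += e.input_tokens + e.output_tokens
    return dict(totals)

events = [
    TokenUsage("anthropic", "claude-sonnet-4-6", 1200, 300),
    TokenUsage("openai", "gpt-5.2", 800, 200),
    TokenUsage("anthropic", "claude-haiku-4-5", 400, 100),
]
```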

Layer 2 — Gateway (LiteLLM)

LiteLLM is the single-process proxy in front of every agent. It speaks the OpenAI API wire format and translates to 100+ backends. From the agent's point of view, calling https://llm-proxy.rig.svc/v1/chat/completions looks like talking to OpenAI — the proxy decides where the request actually goes based on the virtual-key config.
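A minimal sketch of what "looks like talking to OpenAI" means on the wire — the helper and the virtual-key value are hypothetical; only the proxy URL and the OpenAI chat-completions shape come from the text:

```python
import json

PROXY_BASE = "https://llm-proxy.rig.svc/v1"  # in-cluster LiteLLM proxy

def build_chat_request(virtual_key: str, model_name: str, prompt: str):
    """Build the OpenAI-format request an agent would send. The proxy,
    not the agent, resolves model_name to an actual provider."""
    url = f"{PROXY_BASE}/chat/completions"
    headers = {
        # A virtual key, not a real provider key: the proxy maps it to
        # a model list, fallbacks, and a budget envelope.
        "Authorization": f"Bearer {virtual_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model_name,  # logical name from litellm-config.yaml
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, json.dumps(body)

url, headers, body = build_chat_request("sk-dev-e-primary", "sonnet-4-6", "hi")
```

Swapping the backing provider changes nothing in this call — only the proxy-side config.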

Concrete LiteLLM config for a multi-vendor rig:

# litellm-config.yaml (managed by Flux)
model_list:
  - model_name: sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: opus-4-7
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-5-2
    litellm_params:
      model: openai/gpt-5.2
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gemini-3-1-pro
    litellm_params:
      model: gemini/gemini-3.1-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama.rig.svc:11434

virtual_keys:
  # Primary path
  - key_alias: dev-e-primary
    models: [sonnet-4-6, opus-4-7, haiku-4-5]
    fallback_models: [gpt-5-2, gemini-3-1-pro]
    max_budget: 20.00
    budget_duration: 1d

  # Alternative agent class — default to OpenAI
  - key_alias: dev-e-gpt
    models: [gpt-5-2]
    fallback_models: [sonnet-4-6]
    max_budget: 20.00
    budget_duration: 1d

  # Local-first for cost-sensitive tasks
  - key_alias: spec-e-local
    models: [llama-local]
    fallback_models: [haiku-4-5]
    max_budget: 3.00
    budget_duration: 1d

The fallback_models list is the resilience story: if the primary provider returns a 429 (rate-limited) or 529 (overloaded), LiteLLM automatically retries on the next model in the list. Agent code never handles the failover.
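A toy model of that failover loop — this is not LiteLLM's actual implementation; the send callable and model names are illustrative, only the 429/529 semantics come from the text:

```python
RETRYABLE = {429, 529}  # rate-limited / overloaded

def call_with_fallback(models, send):
    """Try each model in order; fall through on retryable status codes.
    `send(model)` returns (status_code, response_text)."""
    last = None
    for model in models:
        status, resp = send(model)
        if status not in RETRYABLE:
            return model, status, resp
        last = (model, status, resp)
    return last  # every model exhausted; surface the final failure

# Simulated backend: primary is rate-limited, fallback succeeds.
responses = {"sonnet-4-6": (429, "rate limited"), "gpt-5-2": (200, "ok")}
result = call_with_fallback(["sonnet-4-6", "gpt-5-2"], lambda m: responses[m])
```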

Layer 3 — Instrumentation (OpenTelemetry GenAI conventions)

This is the single highest-leverage portability defense (tool-choices.md called it out). Instrument agent code with the OTel GenAI semantic conventions (gen_ai.system, gen_ai.request.model, gen_ai.usage.*) rather than vendor-specific SDK hooks. Spans emitted this way land in any OTel-compatible backend (Langfuse, Phoenix, Grafana Cloud, Helicone, Datadog, New Relic) without re-instrumenting.

Effect: switching observability backend is "point the OTel exporter at a new URL" — not "rewrite every trace call." Switching LLM provider is "change gen_ai.system from anthropic to openai" — the dashboard keeps working because the query is keyed on gen_ai.request.model, not the provider brand.
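A sketch of the attribute set, independent of any particular SDK — the helper is hypothetical, but the gen_ai.* keys follow the OTel GenAI semantic conventions named above:

```python
def genai_span_attributes(provider: str, model: str,
                          input_tokens: int, output_tokens: int):
    """Span attributes per the OTel GenAI semantic conventions.
    Any OTel-compatible backend can query these keys, whichever
    SDK or runtime actually emits the span."""
    return {
        "gen_ai.system": provider,             # "anthropic", "openai", ...
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = genai_span_attributes("anthropic", "claude-sonnet-4-6", 1200, 300)
```

Switching provider changes the attribute values, never the keys — which is why the dashboards survive the switch.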

Layer 4 — Instructions (AGENTS.md cross-tool)

The AGENTS.md file in each repo is readable by:

| Runtime | Reads AGENTS.md? | Hooks system |
|---|---|---|
| Claude Code (Anthropic) | Yes (via CLAUDE.md import) | PreToolUse, PostToolUse, Stop, SessionStart, UserPromptSubmit |
| Codex CLI (OpenAI) | Yes (native AGENTS.md support) | .codex/hooks.json with PreToolUse/PostToolUse/SessionStart/Stop/UserPromptSubmit |
| Gemini CLI (Google) | Yes (via GEMINI.md or AGENTS.md import) | Native hooks system, similar contract |
| Cursor (Anysphere) | Yes (Cursor Rules can pull from AGENTS.md) | Cursor Rules + MCP |
| Aider | Yes (explicit read of AGENTS.md) | — (pair-mode single-shot, hooks less relevant) |

This is the reason AGENTS.md is in the whitepaper, not CLAUDE.md: the instruction contract is cross-runtime by design. Humans and agents read the same rules regardless of which tool the human is using that day.

Supported providers (shipping today)

| Provider | Models | Best fit | Caveats |
|---|---|---|---|
| Anthropic | Opus 4.7 / Sonnet 4.6 / Haiku 4.5 | Default for Dev-E, Review-E, Architect-E, Spec-E | Shared Max plan across agents — see cost-framework.md for rate-limit-aware dispatch |
| OpenAI | GPT-5.2 / GPT-5-mini / o3 | Secondary / fallback; strong on tool-use reliability | Separate pay-per-token account; different rate-limit model |
| Google | Gemini 3.1 Pro / Flash | Secondary; competitive on SWE-bench Pro (80.6% vs Sonnet 79.6%); large context window | Tool-use API differs subtly from Anthropic's |
| Ollama (local) | llama3.2, llama3.1, custom quantized (our ibuild-e:3b) | Spec-E clarifiers, pre-flight cost prediction, low-sensitivity classification | Quality ceiling lower; only feasible when the task doesn't need large context |
| OpenRouter | Aggregates all of the above | Emergency fallback if a primary account is blocked | 5.5% markup on credit purchases — not cheapest steady-state |

Supported agent runtimes

Runtime is orthogonal to provider. Any supported runtime can talk to any supported provider via LiteLLM.

| Runtime | Vendor tie | Use case in the rig |
|---|---|---|
| Claude Code CLI | Anthropic (API only, models via LiteLLM) | Default for Dev-E, Review-E, Spec-E, Architect-E, repair-dispatch |
| Codex CLI | OpenAI (API only, models via LiteLLM) | Alternative Dev-E flavor; hooks are similar to Claude Code |
| Gemini CLI | Google (API only, models via LiteLLM) | Alternative for Architect-E (large-context strengths) |
| Cursor | Anysphere (IDE-integrated) | Human pair-mode primary; agents via Cursor Cloud |
| Aider | None | Human pair-mode alternative; model-agnostic |
| Custom | via Anthropic/OpenAI SDK | For bespoke agents where we control the loop |

The AGENTS.md + OpenTelemetry conventions combo means a single TaskSpec can dispatch to any runtime without the TaskSpec itself knowing which is used.

Per-agent-per-task-class model selection

Default mapping (configurable via HelmRelease values or human override in pair mode):

| Agent role | Default model | Alternative | Why |
|---|---|---|---|
| Spec-E (intake refinement) | Haiku 4.5 | Gemini Flash, llama3.2 local | Many small calls, cost-sensitive |
| Dev-E (issue-dispatch) | Sonnet 4.6 | GPT-5.2, Gemini 3.1 Pro | Default balance of capability and cost |
| Dev-E (repair-dispatch) | Sonnet 4.6 | Opus 4.7 for complex diagnosis | Same agent; sometimes needs a bigger model for ambiguous traces |
| Review-E | Sonnet 4.6 | Opus 4.7 on T2/T3 PRs | Judgment sensitivity |
| Architect-E | Opus 4.7 | Gemini 3.1 Pro (long-context strengths) | High-stakes interface decisions |
| LLM-as-judge (quality sampling) | Opus 4.7 on Sonnet output | GPT-5.2 judging Claude (cross-family check) | Avoid one-provider confirmation bias |

None of these are hard-coded. Swap a row, redeploy the HelmRelease, and the agent shifts provider on the next scale-up.
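A hypothetical values-file shape for that mapping — the real HelmRelease schema lives in the rig's chart, so key names here are illustrative; only the role → default + alternative structure is the point:

```yaml
# Hypothetical HelmRelease values sketch (not the chart's real schema).
# Model names are logical LiteLLM model_name entries, not vendor IDs.
agents:
  spec-e:
    model: haiku-4-5
    fallbacks: [llama-local]
  dev-e:
    model: sonnet-4-6
    fallbacks: [gpt-5-2, gemini-3-1-pro]
  review-e:
    model: sonnet-4-6
    escalation: opus-4-7   # used on T2/T3 PRs
```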

Vendor rate-limit handling

Every major provider has a rate-limit story, none identical:

| Provider | Behaviors |
|---|---|
| Anthropic | Three simultaneous limits (5h rolling, 7d weekly, TPM/RPM). Dashboard shows one. 429 at 72% "reported" util is common |
| OpenAI | TPM + RPM per model; retry-after header reliable |
| Google | RPM + TPD; less predictable 429s during launches |
| Ollama local | CPU/GPU bound; not rate-limited but slow |

The LiteLLM proxy handles them uniformly: any 429/529 → fallback. Per-provider budget envelopes in Conductor-E projections make cost attribution visible: "Dev-E's Anthropic spend is tracking 80% of budget but OpenAI fallback is at 12% — maybe we're overusing primary."
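The attribution check in that quote is a tiny projection over per-provider spend — the helper and the dollar figures below are illustrative, chosen to reproduce the 80%/12% example:

```python
def budget_utilization(spend, budgets):
    """Percent of each provider's budget envelope consumed, given
    per-provider spend totals (as a projection might surface them)."""
    return {p: round(100 * spend.get(p, 0.0) / budgets[p], 1)
            for p in budgets}

spend = {"anthropic": 16.0, "openai": 2.4}    # USD so far today
budgets = {"anthropic": 20.0, "openai": 20.0}  # daily envelopes
util = budget_utilization(spend, budgets)
```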

The lock-in assessment, honestly

From tool-choices.md, refined for the vendor angle:

  • Anthropic specifically: HIGH lock-in. Claude prompts are not portable without rewriting; Anthropic's tool-use format differs from OpenAI's, which differs from Gemini's. Even with OTel conventions keeping traces portable, the prompt layer is vendor-specific. If we deliberately wrote prompts for Sonnet, they won't match OpenAI behavior out of the box.
  • OpenAI lock-in: MEDIUM. GPT-style structured outputs, function-calling schemas, and prompt conventions are different enough that ports need testing. Less total lock-in than Anthropic because LiteLLM + OpenAI-format is the lingua franca.
  • Google lock-in: MEDIUM-LOW for now. Small footprint in the rig. Gemini CLI hooks are less mature than Claude Code's.
  • Local (Ollama) lock-in: NONE. The model is a file.

Net: the architecture is portable; the prompts are the sticky layer. Every agent's system prompt is versioned in git (per tool-choices.md and drift-detection.md), which means provider migration is a mechanical task: re-author system prompts against the target provider, A/B the eval suite against both, flip the HelmRelease model_name.

What this means for the whitepaper

Everywhere the whitepaper says "Claude Sonnet 4.6", read "the currently-configured default, Sonnet 4.6, swappable". Everywhere it says "Anthropic's API", read "the LLM provider (Anthropic by default, but the gateway is LiteLLM)".

Specific places that previously implied Anthropic-only:

  • cost-framework.md — Anthropic-heavy examples; the LiteLLM config above shows the multi-vendor shape
  • observability.md — highlights Claude Code's native OTel support; Codex CLI and Gemini CLI offer the same, and the OTel conventions make the runtime difference irrelevant
  • safety.md — CaMeL separation applies to any LLM; examples used Anthropic-specific CVEs because that's where the attacks happened, but the defense pattern is provider-agnostic
  • security.md — egress allowlist includes api.anthropic.com; extend with api.openai.com, generativelanguage.googleapis.com, local Ollama when used
  • development-process.md — team topology lists "Sonnet-backed" as a tag, not a requirement

Open questions for multi-vendor operation

Things the whitepaper doesn't yet resolve:

  • Prompt portability testing cadence. When should we dual-run prompts across providers to check behavior parity? Today the eval harness only runs one model per agent config. A "cross-provider regression" step is not in the weekly cadence yet.
  • CaMeL structured-output equivalence. Anthropic tool-use via Instructor is well-tested. OpenAI's structured outputs work differently. Gemini's differ again. The quarantine-plane extraction pattern is the same, but the per-provider polyfills aren't written.
  • Prompt caching economics. Anthropic's cache hits are ~10% of input token cost. OpenAI's prompt caching (beta) has different discount curves. Gemini's context caching is separate again. The cost framework should route toward the provider with the cheapest cache economics for the hot path.
  • Tool-use call reliability per provider. Claude Sonnet 4.6 is well-tested on the rig's specific tool set. GPT-5.2 and Gemini 3.1 Pro would need their own reliability baseline before we could hand them T2 work.
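On the caching-economics bullet, the blended input cost under a given hit rate is simple arithmetic — the helper and the base price are illustrative assumptions; only the ~10%-of-input-cost figure for cache reads comes from the text:

```python
def effective_input_cost(base_cost_per_mtok, cache_hit_rate,
                         cached_fraction_of_cost):
    """Blended input cost per million tokens at a given cache hit rate.
    cached_fraction_of_cost is the provider's cache-read discount
    (~0.10 for Anthropic per the text; other providers differ, which
    is exactly the open question)."""
    hit = cache_hit_rate * cached_fraction_of_cost
    miss = (1 - cache_hit_rate) * 1.0
    return base_cost_per_mtok * (hit + miss)

# Hypothetical $3.00/MTok base price: a 50% hit rate at a 10% read cost
# drops the blended price to 55% of uncached.
blended = effective_input_cost(3.00, 0.5, 0.10)
```

Routing the hot path toward the provider with the steepest cached_fraction_of_cost is the optimization the cost framework would make.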

Each of these is an ADR candidate once we have operating data with more than one provider.

See also