Cost Framework — Budgets, Rate Limits, Prompt Caching, Proxy Enforcement¶
TL;DR
LLM-driven development is bounded by three overlapping constraints: dollar cost, rate limits (each provider has its own: Anthropic 5h/7d/TPM/RPM; OpenAI TPM/RPM; Gemini RPM/TPD), and shared budget across agents on a given plan. One looping agent can burn the hourly cap for every other agent on that provider. Four cumulative defense layers (pre-flight prediction → dispatch token-bucket → LiteLLM proxy per-key budgets → Langfuse post-hoc attribution) enforce a hard ceiling at the proxy layer, not at trust level in the agent. LiteLLM automatically fails over to a secondary provider on 429/529 — see provider-portability.md.
The four enforcement layers¶
```mermaid
graph LR
    classDef pred fill:#e3f2fd,color:#000
    classDef disp fill:#fff3e0,color:#000
    classDef proxy fill:#ffebee,color:#000
    classDef attr fill:#e8f5e9,color:#000
    L1[1. Pre-flight prediction<br/>cheap model<br/>Haiku or Ollama<br/>estimate tokens<br/>abort if > budget]:::pred
    L2[2. Dispatch token-bucket<br/>Conductor-E projection<br/>circuit breaker state]:::disp
    L3[3. LiteLLM proxy<br/>per-agent virtual keys<br/>HARD CEILING — 429s<br/>before reaching provider]:::proxy
    L4[4. Langfuse attribution<br/>per-task cost<br/>drives future budgets]:::attr
    L1 --> L2 --> L3 --> A[LLM provider API<br/>default: Anthropic]
    A --> L4
    L4 -.->|weekly budget review| L2
```
Each layer is cumulative and independent. Layer 3 is the one that cannot be bypassed — the proxy returns 429 before the call reaches the model provider.
The hard claim — with a caveat
No agent can burn more than its hourly budget without explicit human approval. Enforced at the proxy layer, not trust-based at the agent layer. A compromised or looping agent cannot exceed its budget because the proxy returns 429 before the request reaches the LLM provider (Anthropic on the default path; same mechanic applies to OpenAI, Gemini, or any other LiteLLM-supported backend — see provider-portability.md).
Caveat (added after honest re-review): LiteLLM has known config-sensitive bugs in budget enforcement — specifically issue #12905, where user-level budgets are not enforced inside team configurations. Treat the proxy as the primary defense, not an absolute one. Verify your specific virtual-key setup with a synthetic budget-overrun test (deliberately exceed the limit, confirm the 429 fires) before relying on it. Record the test; re-run it on every LiteLLM upgrade.
Layer 1: Pre-flight cost prediction¶
Why¶
The agent is about to do something expensive. Before sending the request, estimate:
- Input tokens: system prompt + current context + fetched content
- Output tokens: estimate based on task class (small edit ~2k, refactor ~10k, full feature ~30k)
- Cost = (input × $3/1M) + (output × $15/1M) for Sonnet 4.6 pricing (2026)
For Opus 4.7: (input × $15/1M) + (output × $75/1M).
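As a sketch, the cost arithmetic above looks like this (the rates mirror the figures quoted in this section; the function and table names are illustrative, not part of any rig component):

```python
# Illustrative sketch of the pre-flight cost arithmetic.
# Rates are the 2026 per-million-token figures quoted above.
RATES_PER_MTOK = {
    "sonnet-4-6": {"input": 3.00, "output": 15.00},
    "opus-4-7": {"input": 15.00, "output": 75.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost = input x input-rate/1M + output x output-rate/1M."""
    r = RATES_PER_MTOK[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000
```

A 10k-input / 2k-output Sonnet call comes out to $0.06; the same shape on Opus is five times as much per token.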
When¶
Triggered when a call's estimated token count exceeds `TaskSpec.expected_effort_tokens` × 0.1 (10% of the task budget as a single-call threshold). For a task with an 80k-token budget, calls estimated at more than 8k tokens trigger prediction.
How¶
A cheap model does the estimation:
- Haiku 4.5 ($1/1M input, $5/1M output) — fast, can run inline
- Local Ollama model (llama3.2:3b or similar) — for agents already on iBuild-E with Ollama installed
The estimation prompt:
```text
Given this system prompt: [truncated to 500 tokens]
And this user request: [full]
And this tool-use history: [last 5 calls]
Estimate: input_tokens, output_tokens, confidence (0-1).
Output JSON only.
```
If confidence × cost > budget_remaining, the request is deferred and the task is paused pending human approval.
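A minimal sketch of that gate decision (the function name and return values are illustrative, not the actual Conductor-E API):

```python
def preflight_gate(confidence: float, est_cost: float, budget_remaining: float) -> str:
    """Defer the call when confidence-weighted cost exceeds remaining budget."""
    if confidence * est_cost > budget_remaining:
        return "defer"   # task pauses pending human approval
    return "send"
```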
What this catches¶
- The "agent discovered it needs to refactor 20 files, now wants to call Opus 4.7 with the entire codebase as context" case. Prediction says 250k input tokens; budget says 40k remaining; task pauses; human sees the escalation.
What this doesn't catch¶
- Requests that are individually within budget but accumulate across a long session. Handled by layer 2 (Conductor-E token bucket) and layer 3 (LiteLLM proxy).
Layer 2: Dispatch-time budget check¶
Conductor-E's budget projection¶
```csharp
record AgentBudget(
    string AgentId,
    decimal HourlySpent,
    decimal HourlyLimit,
    decimal DailySpent,
    decimal DailyLimit,
    DateTimeOffset HourWindow,
    DateTimeOffset DayWindow,
    CircuitState CircuitState   // Closed | HalfOpen | Open
);
```
Updated by a projection that consumes TokenUsage events emitted by agents at every LLM call.
Dispatch check¶
Before ClaimNextAssignmentAsync returns work to an agent:
```csharp
var budget = await GetAgentBudget(agentId);
if (budget.HourlySpent >= budget.HourlyLimit * 0.95m) return null;
if (budget.DailySpent >= budget.DailyLimit * 0.95m) return null;
if (budget.CircuitState == CircuitState.Open) return null;

var task = FindAssignableTask(agentId);
if (task is null) return null;

// Convert the token estimate to dollars before comparing against the
// dollar limit (EstimateCost: tokens -> $ at the agent's default model rate).
if (EstimateCost(task.ExpectedEffortTokens) + budget.HourlySpent > budget.HourlyLimit)
    return null;
return task;
```
Result: the dispatcher won't hand an agent a task it cannot afford to complete.
The circuit breaker state machine¶
Three states:
- Closed — normal operation, dispatch proceeds
- Open — budget exhausted or 529 storm detected, no dispatch
- Half-open — after cooldown, single probe dispatch to test
Transitions:
```mermaid
stateDiagram-v2
    Closed --> Open: Budget exhausted
    Closed --> Open: 3× consecutive 529
    Open --> HalfOpen: Cooldown elapsed (30min)
    HalfOpen --> Closed: Probe succeeds
    HalfOpen --> Open: Probe fails
```
Budget-exhaustion cooldown: until next hour rollover (wall-clock). 529-storm cooldown: 30 minutes.
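The state machine above can be sketched as follows (class and method names are illustrative; for brevity this uses a single cooldown, where the real breaker distinguishes hour-rollover from the 30-minute 529 cooldown):

```python
from datetime import datetime, timedelta, timezone

class CircuitBreaker:
    """Sketch of the three-state breaker described above."""

    def __init__(self, cooldown: timedelta = timedelta(minutes=30)):
        self.state = "closed"
        self.cooldown = cooldown
        self.opened_at = None
        self.consecutive_529 = 0

    def record_529(self, now: datetime) -> None:
        self.consecutive_529 += 1
        if self.consecutive_529 >= 3:        # 3x consecutive 529 -> Open
            self._open(now)

    def record_budget_exhausted(self, now: datetime) -> None:
        self._open(now)                      # budget exhausted -> Open

    def _open(self, now: datetime) -> None:
        self.state, self.opened_at = "open", now

    def allow_dispatch(self, now: datetime) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"         # cooldown elapsed: allow one probe
            return True
        return False                         # still cooling down, or probe in flight

    def probe_result(self, ok: bool, now: datetime) -> None:
        if self.state != "half-open":
            return
        if ok:
            self.state, self.consecutive_529 = "closed", 0
        else:
            self._open(now)                  # probe failed -> back to Open
```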
Per-agent budget defaults¶
Model names in the table are illustrative defaults, not requirements
Per provider-portability.md, any LiteLLM-supported model can substitute for the defaults shown — see the fallback_models config pattern in the LiteLLM section below. The dollar budgets are the load-bearing numbers; the Sonnet / Haiku / Opus tags are just the current default routing.
| Agent | Hourly | Daily | Rationale |
|---|---|---|---|
| Dev-E (default: Sonnet 4.6) — issue dispatch | $2 | $20 | Active development; most common runtime |
| Dev-E (default: Sonnet 4.6) — repair dispatch | $2 | $10 | Same agent class, separate budget envelope because bursty and incident-driven |
| Review-E (default: Sonnet 4.6) | $0.50 | $5 | Shorter sessions, smaller context |
| Spec-E (default: Haiku 4.5) | $0.30 | $3 | Many short calls |
| Architect-E (default: Opus 4.7) | $5 | $30 | High-stakes, large context |
Budgets adjustable per TaskSpec. A T3 task may carry an explicit higher budget with human approval.
Layer 3: Request-time proxy enforcement (LiteLLM)¶
Architecture¶
```mermaid
graph LR
    A[Agent pod] -->|OpenAI-format SDK<br/>e.g. Anthropic, OpenAI, Google| L[LiteLLM proxy<br/>virtual-key auth]
    L -->|rate-limit check| L2[Per-key bucket]
    L -->|budget check| L3[Per-key budget]
    L -->|if ok| AN[Primary provider<br/>default: api.anthropic.com]
    L2 -.->|429 if exceeded| A
    L3 -.->|429 if exhausted| A
```
Why LiteLLM (and the Portkey escape hatch)
LiteLLM is MIT-licensed and the only OSS option with per-virtual-key budget enforcement plus duration-based resets. Risk factor: BerriAI is YC-stage with ~$2.1M seed; venture-stage risk is real but not imminent. Documented fallback: Portkey Gateway (fully open-sourced March 2026, processing 1T+ tokens/day). See tool-choices.md for the full comparison (LiteLLM vs Portkey vs OpenRouter vs Kong AI Gateway vs Cloudflare AI Gateway).
LiteLLM sits as a proxy between agent pods and whichever LLM provider is configured (Anthropic by default; OpenAI, Gemini, OpenRouter, or local Ollama per provider-portability.md). Each agent pod has its own virtual key with:
- Per-key rate limits (RPM, TPM)
- Per-key budget (`max_budget`, `budget_duration: 1h` and `24h`)
- Per-key allowed models
- Per-key tag for Langfuse correlation
Configuration¶
```yaml
# litellm-config.yaml (managed by Flux)
model_list:
  - model_name: sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: opus-4-7
    litellm_params:
      model: anthropic/claude-opus-4-7

virtual_keys:
  - key_alias: dev-e-dotnet
    max_budget: 20.00
    budget_duration: 1d
    rpm_limit: 60
    tpm_limit: 400000
    models: [sonnet-4-6, haiku-4-5]
  - key_alias: architect-e
    max_budget: 30.00
    budget_duration: 1d
    rpm_limit: 30
    tpm_limit: 600000
    models: [opus-4-7, sonnet-4-6]
```
Why the proxy layer is mandatory¶
Langfuse and Helicone track cost via attribution but neither enforces cutoffs at the request layer. They alert after the fact. Only a proxy that returns 429 before the call reaches the provider prevents runaway spend.
429 handling in agents¶
When the proxy returns 429 to an agent:
- Agent backs off (1s, 2s, 4s, 8s, 16s, 32s, 64s)
- After 64s total, emit an `AgentBudgetExhausted` event
- Task enters the `paused-budget` state in Conductor-E
- Next dispatch check sees `CircuitState == Open` and withholds new work
- When a human approves or the budget rolls over, the circuit closes
Layer 4: Post-hoc attribution (Langfuse)¶
Every LLM call creates a Langfuse trace with:
- `agent_id`, `task_id`, `repo`, `issue_number`
- Model, input tokens, output tokens, cost
- Tool calls and their costs
- Total session cost
Langfuse dashboards:
- Cost per (agent × day × task class)
- Cost per merged PR (unit economics — is Dev-E getting more efficient?)
- Cost per SWE-bench-Pro pass (efficiency trend)
- Cache hit rate (targeting ≥80% on long system prompts)
Data flows into budgets¶
Weekly: the cost framework reviews actual spend vs. defaults and suggests budget adjustments. An agent consistently hitting 95% of its hourly cap might get bumped from $2 to $3, or might have its task routing reshaped. A task class consistently costing more than budgeted gets its default expected_effort_tokens adjusted in the TaskSpec schema.
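That weekly rule can be sketched as follows. The 1.5× bump and the function name are assumptions for illustration; the text only says a $2 cap might become $3 for an agent pinned at 95% of its limit:

```python
def suggest_hourly_budget(current_limit: float, peak_utilization: float) -> float:
    """Suggest a bump when an agent consistently runs at >=95% of its hourly cap.

    Hypothetical review rule: the actual adjustment may instead reshape
    task routing, per the text above.
    """
    if peak_utilization >= 0.95:
        return round(current_limit * 1.5, 2)   # e.g. $2 -> $3
    return current_limit
```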
Prompt caching — the 10× optimization¶
Anthropic's prompt cache: cache reads are billed at ~10% of the normal input-token rate. For a long system prompt + AGENTS.md + CLAUDE.md context, cache hits are free money.
How to maximize cache hits¶
- Stable system prompts — if the system prompt changes every call, cache cannot hit. Pin the system prompt version per agent × version.
- Cache breakpoints — explicitly mark breakpoints at the end of the system prompt, after the CLAUDE.md context, after any long tool definitions. Anthropic recommends cache-breaking at natural boundaries.
- Don't inject per-call variables in the cached region — put timestamps, user IDs, etc. at the end of the prompt, not in the system instructions.
Claude Code's automatic caching¶
Claude Code does this automatically for the session's own prompts. For calls made directly from Conductor-E (e.g., Spec-E's clarification prompts), caching is opt-in — pass `cache_control: {type: "ephemeral"}` in the message blocks.
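For those direct calls, a minimal payload sketch under the Anthropic Messages API shape (the prompt texts and function name are placeholders; only `cache_control` is the real field):

```python
def build_payload(system_prompt: str, agents_md: str, user_msg: str) -> dict:
    """Mark stable context blocks cacheable; keep per-call content uncached."""
    return {
        "model": "claude-sonnet-4-6",   # default routing per this doc
        "max_tokens": 4096,
        "system": [
            # Stable region: cache breakpoint at the end of each block.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": agents_md,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Per-call variables (timestamps, user IDs) live here,
        # outside the cached region.
        "messages": [{"role": "user", "content": user_msg}],
    }
```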
Measurement¶
Langfuse tracks `cache_read_input_tokens` vs. `input_tokens`. The ratio is the cache hit rate. Target: ≥80% on long system prompts. Alert if <60%.
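A sketch of that ratio, assuming `input_tokens` counts only the uncached tokens (as in Anthropic's usage object), so the denominator is the sum of both fields:

```python
def cache_hit_rate(cache_read_input_tokens: int, input_tokens: int) -> float:
    """Fraction of input tokens served from cache."""
    total = cache_read_input_tokens + input_tokens
    return cache_read_input_tokens / total if total else 0.0
```

8k cached reads against 2k fresh input tokens is an 80% hit rate, right at the target.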
The shared-plan problem (per provider)¶
Every provider has some variant of a shared-quota constraint. Anthropic is the one we feel most — multiple agents on the same Max plan contend for three overlapping limits simultaneously (5h rolling, 7d weekly, TPM/RPM). OpenAI has per-organization TPM/RPM limits. Gemini has RPM + TPD per project. Local Ollama has none (CPU/GPU-bound instead).
Anthropic dashboards show only one of the three limits; users regularly see 429s at 72% "reported" utilization because a different ceiling is actually the blocker.
Our mitigation (works across providers)¶
- LiteLLM per-agent virtual keys — forces agents to contend at the proxy, with explicit per-agent budgets. Proxy exhaustion visible and predictable regardless of underlying provider.
- Per-agent hourly caps below 1/N of the shared plan's hourly capacity — ensures no single agent can consume more than its fair share of the primary provider's allocation.
- Model-size routing — large-model tokens are rate-limit-expensive on every provider; reserve the largest (Opus 4.7, GPT-5.2, Gemini 3.1 Pro) for Architect-E and difficult repair-dispatch diagnosis. Routine work uses mid-tier (Sonnet 4.6, GPT-5-mini, Gemini Flash).
- Cheapest-viable for high-volume small calls — Haiku 4.5 / GPT-5-mini / Gemini Flash / local Ollama for Spec-E clarifiers and classification. 10× cheaper and "plenty good" for that task class.
- Cross-provider fallback — LiteLLM's `fallback_models` list automatically retries on a secondary provider on 429/529. Visible in Langfuse with `billing_source: <provider>-paygo`.
- Paid API account separate from a shared-plan account — for spend-heavy tasks, route through a pay-per-token key, bypassing any shared-plan ceiling.
See provider-portability.md for the full multi-vendor story including Ollama-local fallback as a zero-cost escape hatch.
Rate-limit handling in detail¶
Provider-side (Anthropic example; same pattern for OpenAI and Gemini via LiteLLM)¶
- 429 Rate Limited — the proxy returns this when virtual-key budget or rate limit exhausted. Handled uniformly by LiteLLM regardless of underlying provider.
- 529 Overloaded (Anthropic-specific) — fleet at capacity; global signal, not personal. The published rule: 5 min wait → single retry → 15 min wait → stop. Never loop on 529. OpenAI and Gemini use 503 for the same class of error; LiteLLM normalizes.
- 401 Unauthorized — key revoked or expired. Surface to human; do not retry.
- 400 Bad Request — malformed request. Agent bug; do not retry without fix.
Agent-side backoff¶
Exponential: 1s, 2s, 4s, 8s, 16s, 32s, 64s (cumulative ~2min). Then:
- 429: check if it's the virtual-key limit → emit `BudgetExhausted`, wait for the next window
- 429: check if it's Anthropic-side → wait for the `retry-after` header; if absent, assume 60s
- 529: 5 min, single retry, 15 min, stop (never loop)
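The backoff-and-classify flow can be sketched as follows (the `send` callable, return strings, and function name are illustrative, not the agent SDK; the sketch returns markers where the real agent would emit events):

```python
import time

def call_with_backoff(send, base: float = 1.0, max_attempts: int = 7) -> str:
    """Exponential backoff for proxy 429s: 1s, 2s, ..., 64s, then give up.

    `send` returns an HTTP status code.
    """
    for attempt in range(max_attempts):
        status = send()
        if status == 200:
            return "ok"
        if status in (400, 401):
            return "surface-to-human"        # agent bug / revoked key: never retry
        if status == 529:
            return "529-protocol"            # hand off to 5min/retry/15min/stop
        if status == 429:
            time.sleep(base * 2 ** attempt)  # 1, 2, 4, 8, 16, 32, 64
    return "budget-exhausted"                # would emit AgentBudgetExhausted
```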
Checkpoint before retry¶
Before any retry that would sleep >10s, agent checkpoints state to NFS (session context + event cursor) so a hard-stop doesn't lose work. Stuck agents are recoverable via the checkpoint.
Token-budget accounting per task¶
```mermaid
graph LR
    A[TaskSpec.expected_effort_tokens<br/>e.g. 80000] --> R[Reserved at dispatch]
    R --> U[Used tokens]
    U -->|<100%| OK[Task succeeds in budget]
    U -->|100-120%| SO[SoftOverrun event<br/>log + dashboard]
    U -->|>120%| HO[HardOverrun event<br/>pause task + alert]
    HO --> H[Human approval<br/>or abort]
```
SoftOverrun (100-120%): logged for calibration; task continues. HardOverrun (>120%): pause, escalate, human decides.
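The classification is a two-threshold check (function name illustrative; thresholds are the ones above):

```python
def classify_overrun(used_tokens: int, reserved_tokens: int) -> str:
    """Map token usage against the reserved budget to the overrun events above."""
    ratio = used_tokens / reserved_tokens
    if ratio < 1.0:
        return "ok"                 # task succeeds in budget
    if ratio <= 1.2:
        return "SoftOverrun"        # log + dashboard; task continues
    return "HardOverrun"            # pause task + alert; human decides
```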
Cost visibility — the dashboard stays honest¶
The cost dashboard shows:
- Daily burn by agent (stacked area)
- Cost per merged PR trend (line, weekly)
- Cost per agent per task class (bar chart)
- Cache hit rate trend
- Budget-gate rejections (count)
- Top 5 expensive tasks this week (table)
The dashboard is public to all humans working on the rig. Hiding cost from humans is a failure of principle 1 (measurable).
What we consciously do not do¶
- Build our own LLM gateway. LiteLLM is a mature OSS proxy. Building ours violates principle 10.
- Run every task through the cheapest possible model. Principle 4 (execute, don't trust) says output quality matters; routing Architect-E work to Haiku to save money is false economy.
- Cache prompt-injection-risky content. Cache entries for content from untrusted sources become attack vectors. Cache only stable system content.
- Over-engineer cost attribution. Per-token attribution at session level is sufficient. Per-function attribution costs more to operate than it saves.
- Set budgets by guessing. Budgets are derived from measured week-over-week spend + 20% headroom, then adjusted by observation. Default budgets are starting points, not targets.
Evolving the budgets¶
Budget configuration is a T2 change (multi-service impact, policy surface):
- PR to `dashecorp/rig-gitops/litellm/virtual-keys.yaml`
- Review-E review
- Human approval
- Flux reconciles LiteLLM config
Emergency budget increases (e.g., "Architect-E needs $100 extra for this week's interface work"): a tagged exception with human co-sign, expires at end of week, audit event in Conductor-E.
See also¶
- index.md
- principles.md — principle 2 (bounded blast radius) applied to cost
- observability.md — Langfuse cost attribution source
- self-healing.md — budget exhaustion as a health signal
- limitations.md — what budget enforcement cannot prevent