Limitations — What the Trusted Rig Cannot Do

TL;DR

A trusted rig earns trust by being honest about its limits. This page enumerates twelve things the rig does not do, what humans still own, and which threats fall outside the guarantees. Read it before treating the whitepaper as a promise — it is an ambition document calibrated by reality.

Where the rig stops and humans begin

graph TB
    classDef rig fill:#e8f5e9,stroke:#2e7d32,color:#000
    classDef human fill:#fff3e0,stroke:#e65100,color:#000
    classDef shared fill:#e3f2fd,stroke:#1565c0,color:#000

    subgraph "Rig handles (T0–T1, measured track record)"
        R1[Bounded implementation]:::rig
        R2[Test-covered refactors]:::rig
        R3[Doc updates + YAML lint]:::rig
        R4[Proposed fixes in canary]:::rig
        R5[Per-event attestation]:::rig
    end

    subgraph "Shared (T2, agent implements, human shapes)"
        S1[Cross-repo + interface design]:::shared
        S2[Event schema changes]:::shared
        S3[New public APIs]:::shared
    end

    subgraph "Humans (T3 + semantic)"
        H1[Intent / product strategy]:::human
        H2[Auth / payments / secrets]:::human
        H3[Destructive DB migrations]:::human
        H4[Kyverno policy changes]:::human
        H5[Novel task classes first 20 runs]:::human
        H6[Ambiguous Repair-E escalations]:::human
        H7[Foundational infra recovery]:::human
    end

What the rig does not do

1. Decide intent

The rig does not decide what to build. "Is this the right feature?" "Should we support use case X?" "Does this align with the product strategy?" are human decisions.

Spec-E refines stated intent into precise TaskSpecs. It does not create intent from nothing. A Spec-E that invents features would be a violation of principle 7 (humans at semantic boundaries).

2. Replace humans for T3 actions

Destructive DB migrations, auth/authz changes, payment logic, credential rotation, cluster-scope RBAC — these remain human-driven. The agent assists (writes candidate code, runs tests, prepares the PR), but the human drives, and the Kyverno two-attestor policy requires a human OIDC signature at admission.

This is not a workflow to be optimized later. The irreversibility of T3 actions means the human co-sign requirement is structural, not transitional.

T3 on a strict-single-operator rig is structurally blocked

The two-attestor Kyverno policy requires two distinct human OIDC identities to cosign. On a 1-person rig, this is impossible by design — there is no second human to sign. Honest acknowledgment:

  • During single-operator windows (1-person team, or one person traveling/sick), T3 changes cannot ship under the standard policy. This is not a gap to close via "trust the sole operator" — that would defeat the policy's whole purpose. It is a structural limit.
  • Escape paths (each explicit and limited):
    1. Advisor-tier human: designate one or two trusted outside advisors (contracted, mentor, former team member) whose OIDC identity is pre-registered as an eligible second attestor. Their cosign is the mechanism. Use sparingly — each use is attested.
    2. 24-hour cool-down in lieu of second cosign: a T3 PR can carry a cooldown-attested label allowing it to ship 24 hours after opening if (a) no objection is raised, (b) it has been posted to a dedicated review channel, and (c) the cool-down itself is cosigned off-chain by the primary operator. Weaker than two-attestor — document in the attestation that this mode was used.
    3. Defer: genuinely irreversible T3 work (destructive migrations, payment logic) waits for team growth or an advisor cosign. This is the expected default; the two escape paths are for when the work cannot wait.
  • This is a growth trigger, not a workaround. If T3 deferrals pile up, the team needs to grow past 1 operator.
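The cool-down escape path above reduces to a simple three-way conjunction. A minimal sketch, assuming a hypothetical PR record whose field names (`opened_at`, `posted_to_review_channel`, `objections`, `cooldown_cosigned`) are illustrative, not the rig's actual schema:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of the 24-hour cool-down check (escape path 2).
# Field names are placeholders, not the rig's real event schema.
COOL_DOWN = timedelta(hours=24)

def cooldown_ship_ok(pr: dict, now: datetime) -> bool:
    """A T3 PR may ship under cool-down only if all conditions hold."""
    aged_out = now - pr["opened_at"] >= COOL_DOWN       # 24 h have elapsed
    return (
        aged_out
        and not pr["objections"]                        # (a) no objection raised
        and pr["posted_to_review_channel"]              # (b) visible for review
        and pr["cooldown_cosigned"]                     # (c) operator cosigned the mode
    )
```

Any single failing condition blocks the ship, which keeps the weaker mode auditable rather than discretionary.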

3. Guarantee against all prompt-injection variants

CaMeL-style architectural separation (safety.md) provides a formal guarantee against the classes of attack defined at the time of the paper. Novel injection classes (side-channels, timing attacks, metadata manipulation, adversarial model-output patterns not yet characterized) are unaddressed until identified and catalogued.

The 2025-2026 CVE wave (CVE-2025-54794/95, CVE-2025-59536/CVE-2026-21852, CVE-2025-68143-45) demonstrates that new injection vectors emerge with every new tool integration. Our defense is designed to be evolvable, not final.

4. Scale linearly with added agents

Anthropic's "Building Effective Agents" warning stands: most multi-agent setups are slower and worse than a single agent with good tools. Our four-role shape (Dev-E, Review-E, Spec-E, Architect-E — with Dev-E handling repair-dispatch as a mode rather than being a fifth role) is the ceiling.

Worth noting: Cognition's "Don't Build Multi-Agents" essay specifically targets fine-grained intra-task multi-agent — many sub-agents collaborating on one function with shared intermediate state. Our coarse role separation with explicit event handoffs (Dev writes → PR → Review reviews) is not the pattern they warn against. I cited that essay more broadly than its actual scope in earlier drafts; the honest reading is narrower.

Adding a sixth agent role ("QA-E", "Security-E", etc.) requires explicit justification: the new role must have a clean event-shaped boundary with existing agents, and must not share intra-task context with them. GPT Pilot's archived 6-role pipeline is the cautionary tale.

5. Recover from foundational infrastructure loss

The rig's self-healing depends on Conductor-E being alive, Marten/Postgres being intact, Flux being functional. If any of those fail:

  • Conductor-E Postgres corruption — requires backup restore, human-driven
  • Flux controller failure — manual Helm chart re-application
  • k3s cluster-level failure — rebuild cluster, restore state
  • OIDC provider outage — Sigstore signing fails, all admission blocked

Runbooks exist for these scenarios, but they are not auto-recoverable. Humans operate them.

6. Write security-critical code unsupervised

Even at high autonomy tiers, the following are always human-reviewed:

  • Authentication / authorization logic
  • Cryptographic primitives (key generation, signing, encryption)
  • Session management
  • Secret handling code paths
  • Kyverno policies themselves
  • Changes to the attestation chain

Review-E approval is not sufficient. Human approval is mandatory.
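One way to make "always human-reviewed" mechanical rather than advisory is a path-based gate on the changed files. A sketch under assumptions: the glob patterns below are invented examples, and the real list would live in repo configuration, not code:

```python
import fnmatch

# Hypothetical path globs for security-critical code; the actual list
# would be maintained in repo config alongside the Kyverno policies.
HUMAN_REVIEW_GLOBS = [
    "*/auth/*", "*/authz/*", "*crypto*", "*/session*",
    "*/secrets/*", "policies/kyverno/*", "attestation/*",
]

def requires_human_review(changed_paths: list[str]) -> bool:
    """True if any changed file touches a security-critical path.
    Review-E approval alone is never sufficient for these."""
    return any(
        fnmatch.fnmatch(path, glob)
        for path in changed_paths
        for glob in HUMAN_REVIEW_GLOBS
    )
```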

7. Handle novel task classes reliably on first attempt

Autonomy is earned per task class. A new class (never seen before) starts at T0 regardless of the agent's ceiling on other classes. "Dev-E is great at backend refactors" does not imply "Dev-E is great at frontend React."

First attempts at novel task classes have higher failure rates. Track record is statistical — it takes ~20 runs to establish a reliable baseline. During that period, human observation is required.
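The per-class graduation rule can be sketched as a small function. The thresholds here (20 runs, 95% success) are assumptions for illustration; the source only states "~20 runs":

```python
def earned_tier(runs: list[bool], min_runs: int = 20,
                t1_success: float = 0.95) -> str:
    """Autonomy is earned per task class: below ~20 observed runs the
    class stays at T0 regardless of the agent's ceiling elsewhere.
    The 0.95 graduation threshold is an assumed value for this sketch."""
    if len(runs) < min_runs:
        return "T0"          # novel class: human observes every run
    success_rate = sum(runs) / len(runs)
    return "T1" if success_rate >= t1_success else "T0"
```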

8. Guarantee zero downtime absolutely

"Close to zero downtime" is the target; zero downtime is aspirational, not absolute. Scenarios that break the guarantee:

  • Catastrophic infrastructure failure (hardware, network, DNS)
  • Model provider outage (default Anthropic API unavailable; cross-provider fallback via LiteLLM mitigates — see provider-portability.md — but doesn't eliminate)
  • Foundational bug in the self-healing pipeline itself (e.g., Flagger bug)
  • Security incidents requiring emergency human action
  • Novel failure modes outside our playbook

Our SLO targets (99.9% for most services) acknowledge ~43 minutes/month of downtime as acceptable. Higher targets require multi-region failover we don't have.
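The ~43 minutes/month figure falls directly out of the SLO arithmetic, sketched here for a 30-day month:

```python
def monthly_downtime_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per month implied by an availability SLO."""
    return days * 24 * 60 * (1 - slo)

# 99.9% over a 30-day month allows 43.2 minutes of downtime;
# 99.99% would allow only ~4.3 minutes, which is why higher targets
# require multi-region failover.
```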

9. Detect all classes of model drift

Drift detection (drift-detection.md) catches output-hash changes, embedding drift, refusal-rate shifts, and LLM-as-judge-scored regressions. It does not catch:

  • Very slow gradual drift below the week-over-week threshold
  • Drift localized to rare inputs outside the canary suite
  • Drift that changes reasoning patterns without changing final-output hash (subtle but consequential)
  • Provider-side targeted drift against our specific prompts (possible with a compromised provider, not otherwise detectable without multi-provider cross-checking)

Quarterly human review of the canary suite is the defense against meta-drift — the detector itself becoming stale.
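The first bullet, slow drift below the week-over-week threshold, is easy to see concretely. A sketch, assuming a hypothetical 5-percentage-point alert threshold (the real threshold lives in drift-detection.md):

```python
def weekly_drift_alert(this_week: float, last_week: float,
                       threshold: float = 0.05) -> bool:
    """Fires when a tracked metric (refusal rate, judge score, ...)
    moves more than `threshold` week-over-week. The 0.05 value is an
    assumed threshold for illustration."""
    return abs(this_week - last_week) > threshold

# A slow ramp of +2 percentage points per week for ten weeks:
weeks = [0.02 * i for i in range(11)]
alerts = [weekly_drift_alert(b, a) for a, b in zip(weeks, weeks[1:])]
# No single week trips the detector, yet cumulative drift is 20 points.
```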

10. Make code reviews a solved problem

Review-E provides a first-pass review. Humans still need to review:

  • Every T2+ change
  • A statistical sample of T1 changes
  • Any change that Review-E and the LLM-as-judge disagree on
  • Any change where cost or latency is anomalous

Review-E is an assistant, not a replacement.

11. Operate reliably below minimum scale

Certain detectors break at sub-10-QPS service load:

  • SLO budget math is noisy (a single 500 consumes 10% of the budget)
  • Canary analysis lacks statistical power
  • Burn-rate alerts produce false positives

Mitigation: synthetic probes provide a baseline rate. For services that genuinely see < 1 QPS in production, different monitoring patterns apply — weekly review rather than burn-rate alerts.
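The "single 500 consumes 10% of the budget" claim is a property of low request volume, which a short calculation makes explicit. A sketch, assuming a 30-day window and the 99.9% SLO from above:

```python
def budget_fraction_of_one_error(monthly_requests: int,
                                 slo: float = 0.999) -> float:
    """Fraction of the monthly error budget consumed by one 500."""
    budget_errors = monthly_requests * (1 - slo)
    return 1 / budget_errors

# At ~0.004 QPS (10,000 requests/month) the 99.9% budget is only
# 10 errors, so a single 500 burns 10% of it. At 10 QPS the same
# error is statistically negligible.
```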

12. Replace the need for on-call humans

"Self-healing" does not mean "no on-call." Human on-call is required for:

  • P0 incidents (security, data loss, full outage)
  • Ambiguous diagnoses where Repair-E's confidence < 0.5
  • T3 incidents (auth, payments, destructive changes)
  • Novel failure signatures the rig hasn't seen before
  • Escalations that auto-escalated through severity tiers to DM + @mention

The rig reduces on-call load; it does not eliminate it.

What the rig's guarantees don't extend to

External dependencies

  • GitHub's availability — if GitHub goes down, so does most of the pipeline
  • The configured primary LLM provider — if the default (api.anthropic.com) is down, agents stop unless a LiteLLM fallback_models entry routes to OpenAI / Gemini / Ollama for the affected agents; see provider-portability.md for the cross-provider failover story
  • Container registry availability — if GHCR is unreachable, no new deploys
  • Cloudflare Pages for docs — if Pages is down, docs are unavailable (not runtime-critical)
  • Grafana Cloud for observability — if unavailable, local Prometheus continues; Flagger analysis still runs
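The LLM-provider dependency above is mitigated by an ordered fallback chain. This is a generic failover sketch, not LiteLLM's actual API; `call_provider` and the provider names are placeholders for the configured chain in provider-portability.md:

```python
class AllProvidersDown(Exception):
    """Raised only when every configured provider has failed."""

def complete_with_fallback(prompt: str, providers: list[str],
                           call_provider) -> str:
    """Try each provider in order; collect errors; raise if all fail."""
    errors = []
    for name in providers:
        try:
            return call_provider(name, prompt)
        except Exception as exc:        # outage, rate limit, timeout, ...
            errors.append((name, exc))
    raise AllProvidersDown(errors)
```

The design choice this illustrates: failover softens a primary-provider outage into degraded service, but a simultaneous outage of every configured provider still stops the agents.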

Third-party packages

Defenses catch most supply-chain attacks (Dependabot + Socket.dev + SBOM scans + package-age policy + ephemeral install sandboxes). A deeply-embedded, high-reputation, long-aged malicious dependency can still pass through. Mitigation: the L7 egress allowlist limits what a compromised dependency can do.
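The package-age policy mentioned above is the simplest of these defenses to sketch. The 14-day quarantine window is an assumed value, not the rig's documented one:

```python
from datetime import date, timedelta

def passes_age_policy(published: date, today: date,
                      min_age_days: int = 14) -> bool:
    """Assumed quarantine: freshly published versions are held back,
    which blunts fast-moving supply-chain attacks but does nothing
    against a long-aged, high-reputation malicious dependency."""
    return today - published >= timedelta(days=min_age_days)
```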

Human errors

If a human mistakenly approves a T3 action that should have been rejected, the attestation records the human's identity. The approval executes. Post-hoc audit catches it; real-time gates do not.

Insider threats

A human with T3 approval authority who acts maliciously is outside the rig's model. Standard mitigations (separation of duties, audit logs, review cadence) apply but are organizational, not technical.

What requires updating this document

This is not a static list. The following discoveries require adding to it:

  • A new failure class observed in production
  • A newly-disclosed prompt-injection technique
  • A new provider behavior (e.g., Anthropic, OpenAI, or Google announces model-version pinning — limitation #9 softens; see provider-portability.md for the multi-vendor posture)
  • A change in scope (e.g., we start serving customer traffic — SLO and compliance limits shift)
  • Deprecation of a mitigating tool (Sigstore changes root of trust, etc.)

Every new limitation is an attestation in git, reviewed by a human, and tracked. Never hidden.

The underlying tension

The whitepaper describes an ambitious target. Every limitation above is a place where the ambition meets reality. Three honest framings:

  1. Ambition drives design. The whitepaper describes what we want. Limitations describe where we're honest we don't have it.
  2. Limitations are not failures. Every constraint above is also a design choice — humans at semantic boundaries is a feature, not a bug.
  3. The rig changes. These limits are for the trusted rig as envisioned in early 2026. A year from now, some will be closable; others will remain; new ones will emerge.

The one-sentence summary

The trusted rig handles the volume of engineering work within its bounded blast radius reliably enough to earn trust; humans handle the semantic, irreversible, and novel-category work that the rig is structurally unfit to handle.
