Design Principles — The Trusted Rig¶

Ten rules. Each states the principle, what it rejects, the engineering consequence, and where it shows up in the system.

How to use this document

When a proposed feature seems appealing, check it against all ten principles. If it violates one, either the principle is wrong (rare — requires an ADR) or the feature is wrong (common). Principles are filters, not decoration.

1. Measurable over hoped¶

Rule: Every property that matters — quality, cost, latency, drift, reliability — must have a live metric. If it can't be measured, it can't be trusted.

Rejects: "The agents seem to be working well lately." "Costs feel reasonable." "I think this prompt is better."

Consequences: - Every agent operation emits an OpenTelemetry span with token counts, tool names, duration. - Every deploy emits metrics via Prometheus that a Flagger AnalysisTemplate can evaluate. - Every LLM call passes through Langfuse so prompt-version × model × task-type × outcome is a queryable tuple. - Before a feature lands, the dashboard line for it is drawn. Before a feature is "shipped," the line moves in the right direction on real data.

Where: observability.md, quality-and-evaluation.md, drift-detection.md.

2. Bounded blast radius¶

Rule: Every agent action has a maximum scope, enforced by a gate, not by trust.

Rejects: "The agent is well-behaved, it won't touch X." "Agents have the same permissions as humans — if the human trusts itself, the agent can be trusted."

Consequences: - Dangerous-command guard blocks destructive shell commands at the PreToolUse hook layer. - Kyverno admission policies reject manifests that target sensitive namespaces without a human co-signer attestation. - Cilium L7 NetworkPolicy restricts each agent pod's egress to a concrete allowlist. - LiteLLM proxy caps per-agent token budgets so a single looping agent can't drain the shared rate limit. - Tiered autonomy (T0-T3) classifies every task by blast radius and routes T3 through explicit human approval.

Where: safety.md, security.md, trust-model.md, cost-framework.md.

3. Reversible before irreversible¶

Rule: Prefer feature-flag kill > rollback > rollforward-fix. Agents default to the most reversible action that solves the problem.

Rejects: "It will probably work, let's just deploy the fix." "Rollback is embarrassing." "A kill switch adds complexity."

Consequences: - flagd + OpenFeature feature flags are the first-line kill switch (~30 seconds via git commit → Flux reconcile). - Flagger canary auto-rollback is the second line (~5 minutes). - Forward-fix via Repair-E is the third line (PR → canary → promote), only after flag-kill or rollback already stabilized production. - Destructive DB changes use pgroll expand/contract so every migration step is individually reversible until the contract step — and the contract step requires a human co-sign.

Where: self-healing.md.

4. Execute, don't trust¶

Rule: LLM output is a hypothesis. Tests, type-checkers, compilers, schemas, canary analysis, property-based tests, and production metrics are the verifiers.

Rejects: "The code looks right." "The model is reasoning correctly here." "Review-E approved it, so it must be correct."

Consequences: - Every agent-authored diff must pass a test suite. Insufficient coverage → agent adds tests first. - Structured output forced through JSON Schema tool-use. Instructor + Pydantic validates every tool call. - Property-based tests (Hypothesis) generated by a subagent for every non-trivial function. arXiv:2510.09907 shows LLM-generated property tests find bugs beyond unit-test coverage. - Formatter-reversion check: if running prettier/black/gofmt on the diff reverses it, the diff was semantically null — reject. - Canary SLO gate verifies the change survives production traffic before promotion.

Where: safety.md, quality-and-evaluation.md, self-healing.md.

5. Attestable, replayable, auditable¶

Rule: Every change is traceable from the original intent through to the deployed artifact, with cryptographic evidence at each step, and any step can be replayed.

Rejects: Audit trails built by grep-ing logs. "Agent did it." Changes where the plan, the code, and the deploy have no cryptographic link.

Consequences: - Every agent commit is gitsign-signed with an ephemeral Fulcio certificate tied to the agent's OIDC identity. - Every build emits SLSA v1.0 Provenance via slsa-github-generator, signed and logged in Rekor. - Every image is cosign-signed. - Every deploy's Kyverno admission decision records the attestation chain. - rig-conductor's event store is the replay substrate — give it the issue ID and it reconstructs every event, every agent call, every tool invocation.

Where: security.md, observability across the board.

6. Progressive autonomy¶

Rule: Agents start at T0 autonomy on new task classes. They earn higher tiers by maintaining measured track records: N successful canaries in a row, zero human-rework on M merged PRs of that class, zero rollbacks.

Rejects: Setting agent trust level by gut feel. Handing T3-blast-radius work to an agent because it did well on T0 work. Flat permission models.

Consequences: - TaskSpec.blastRadius classification at intake determines starting tier. - rig-conductor tracks per-agent × per-task-class outcomes and adjusts autonomy over rolling 90-day windows. - Policy exceptions (promoting an agent from T1 to T2) are recorded attestations, themselves subject to human co-sign. - When a new model version is deployed (e.g., Sonnet 4.6 → 4.7, or cross-vendor swap to GPT-5.2 / Gemini 3.1 Pro — see provider-portability.md), the track record resets to conservative tiers and re-accumulates.

Where: trust-model.md, quality-and-evaluation.md.

7. Humans at semantic boundaries¶

Rule: The rig handles implementation; humans decide intent. Value trade-offs, ethics, product direction, and semantic-invariant decisions remain human.

Rejects: Agents deciding whether a feature should exist. Agents auto-approving business trade-offs. "The agent suggested this so it must be the right call."

Consequences: - Spec-E refines intent but does not create goals; it asks clarifying questions back to a human. - Architect-E shapes interfaces where the semantic shape matters; the final interface sign-off is human. - Kyverno two-attestor policies enforce human co-sign on T3 actions. - Escalation routes unresolved ambiguity to humans rather than letting agents guess.

Where: trust-model.md, limitations.md.

8. Trusted control plane, untrusted data plane¶

Rule: Separate the code that decides what to do from the data that informs the decision. Untrusted content (issue bodies, code comments, README files, external API responses) never controls the agent's reasoning — it's treated as data, processed by a quarantined LLM without tool access, with outputs passing through a validator back to the trusted plane.

Rejects: Giving the same LLM instance both the authority to call tools and the task of parsing attacker-controllable text. "Our prompts are good, we don't need separation."

Consequences: - CaMeL-style architecture (DeepMind, arXiv:2503.18813) for all agent tasks that consume external input. - Prompt-injection defenses layered (CaMeL + L7 egress + tool scoping + content classification), not stacked. - Tool surfaces minimized per agent role — Review-E has no shell; Dev-E has no admin database access.

Where: safety.md, security.md.

9. Fail closed, fail known¶

Rule: When uncertain, stop. Emit a known-state failure event, not a guess. Unknown state must be indistinguishable from "do nothing" in its external effect.

Rejects: Agents that attempt to recover from unexpected errors by "trying something reasonable." Silent event-loss that looks like healthy idle. Agents that say "I think I fixed it" when they can't verify.

Consequences: - Hook reliability spool retains events locally if rig-conductoris unreachable; no event loss looks like healthy idle. - StuckGuard emits AgentStuck and exits the loop rather than letting a confused agent continue. - Kyverno rejects manifests with unverifiable attestation rather than "allowing for now." - LLM responses without complete structured output are retried with schema feedback, not parsed loosely. - Honest-refusal metric: track how often agents correctly say "I don't know" on a fixed unanswerable-prompt suite.

Where: safety.md, observability.md.

10. Simple enough to operate¶

Rule: 1-2 humans must be able to run the rig. Every tool added is a tax. Every abstraction is a tax. Prefer the concrete option with fewer moving parts.

Rejects: Complex orchestration layers whose operational cost exceeds their capability gain. Frameworks adopted "for future-proofing." 18-role pipelines (Sweep AI's cautionary shape).

Consequences: - Five agent roles, not fifty. GPT Pilot's archived 6-role pipeline is the warning. - Single centralized event store ( rig-conductor+ Marten + Postgres), not parallel systems. - YAML policies (Kyverno) over programmatic policies (OPA Gatekeeper Rego). - Self-hosted Langfuse, not "let's build our own observability platform." - Managed Grafana Cloud Free for traces and logs, not LGTM-on-8GB. - Flagger + flagd (both CNCF) over bespoke canary scripts.

Where: observability.md, security.md, everywhere.

How the principles interact¶

graph TB
    classDef m fill:#e8f5e9,color:#000
    classDef s fill:#fff3e0,color:#000
    classDef a fill:#e3f2fd,color:#000
    classDef h fill:#fce4ec,color:#000

    P1[1. Measurable]:::m
    P4[4. Execute, don't trust]:::m
    P5[5. Attestable]:::m

    P2[2. Bounded blast radius]:::s
    P3[3. Reversible before irreversible]:::s
    P8[8. Trusted control + untrusted data]:::s

    P6[6. Progressive autonomy]:::a
    P7[7. Humans at semantic boundaries]:::a

    P9[9. Fail closed, fail known]:::h
    P10[10. Simple enough to operate]:::h

    P1 -->|enables measurement for| P6
    P1 -->|generates evidence for| P5
    P5 -->|underpins tier advancement| P6
    P4 -->|forces execution of| P2
    P8 -->|enforces| P2
    P2 -->|makes| P3
    P3 -->|enables| P9
    P9 -->|routes exceptions to| P7
    P6 -->|respects| P7
    P10 -->|filters| P1
    P10 -->|filters| P2
    P10 -->|filters| P5

Measurement (1, 4, 5) is the foundation — without it, none of the other principles are enforceable. Safety (2, 3, 8) is the blast-radius contract. Autonomy (6, 7) is the trust-earning process. Honesty (9, 10) is the operational integrity layer.

The ten principles are not separately defensible — each depends on the others.

Principle conflicts¶

Two principles will occasionally conflict. Recorded resolutions:

Conflict	Resolution
Measurability (1) vs. Simplicity (10) — "more metrics = more infra"	Managed observability (Grafana Cloud) + self-hosted only what must be local (Prometheus for Flagger). Adds one SaaS dependency, keeps simplicity.
Autonomy (6) vs. Human-at-semantic-boundary (7) — when does an agent earn the right to decide?	Never for T3. For T0-T2, autonomy is scoped to implementation decisions; the semantic intent (captured by Spec-E and the `TaskSpec`) is always human-shaped.
Execute (4) vs. Cost (10) — running tests + property tests + canary analysis is expensive	Cost is a measurable input; the error-budget gate makes cost part of the decision. High-risk changes get more verification; low-risk changes get less.
Fail-closed (9) vs. Self-healing (3) — "when uncertain, stop" conflicts with "auto-fix production"	Repair-E fails closed on ambiguity: if the trace + git blame + diff doesn't yield a high-confidence fix, it escalates to human instead of guessing. The closed loop still runs, just with a human hand-off on the hard cases.

Anti-principles¶

Specific rules the trusted rig rejects, even though some adjacent systems adopt them:

"Agents should behave like good employees." Anthropomorphizing leads to trust-by-vibes. We trust by measurement.
"Fully autonomous end-to-end." Gastown's GUPP model is coherent for that goal; our model is human-in-loop for the right reasons (PR-based review, semantic judgment, irreversible actions). We are not that system.
"The best prompt solves prompt injection." No amount of prompting is a substitute for architectural separation.
"Add an LLM to supervise the LLM." LLM-as-judge is useful for quality sampling; it is not a substitute for deterministic guards at the tool-call layer.
"One cluster, one agent, one model." Diversity of models and explicit role separation is a feature. Over-consolidation amplifies single-point failure modes.
"Ship fast, observe later." Measurement precedes autonomy. Phase 2 of the roadmap is observability for this reason.

Violating a principle

The only legitimate path to violate a principle is to update this document first (an ADR) with the reason, the consequences, and the compensating controls. "We can't measure this one thing, so here's what we do instead" is fine. "We'll do it and fix it later" is not.