
Development Process — How the Trusted Rig Gets Built

TL;DR

The rig can't bootstrap its autonomous operation — humans ship phase 0 (safety floor) before any cron-dispatched Dev-E runs unattended. Three eras: Bootstrap (pair mode — human-in-loop synchronous, no autonomous dispatch), Supervised (Dev-E autonomous with 100% human PR review), Earned (Review-E gates T0–T1, humans gate T2–T3). Everything runs through the same quality gates at merge; canary + SLO gates at deploy. Small batches, short branches, continuous T0/T1, weekly T2, scheduled T3. One weekly 30-minute human review is the most load-bearing ritual. This document is itself a T2 artifact — process changes go through the same trust model as everything else.

This is the operating manual for the team that builds and runs the trusted rig. It's opinionated — methodology-theater is explicitly rejected. Every rule here either enables a measurable outcome or prevents a specific failure mode.

This doc assumes you've read the whitepaper

Terms like T0–T3, TaskSpec, Spec-E, Dev-E, Review-E, repair-dispatch, blast radius, error budget are defined elsewhere. See glossary.md if any are unfamiliar. The trust-model.md tier definitions are load-bearing for everything below.

The bootstrap reality shapes everything

The rig today cannot be trusted to run autonomously. Dev-E has no StuckGuard; hook events can silently vanish; dangerous-command guards don't exist; the attestation chain is partial. Asking Dev-E to pick up TaskSpecs from a cron queue and run unsupervised for hours — when those missing floor pieces are exactly what would make unsupervised operation safe — is the chicken-and-egg problem of self-improving systems.

But "cannot be trusted to run autonomously" is not the same as "cannot be used at all." Two modes of human-agent development with different safety properties:

| Mode | What it looks like | Safety comes from |
| --- | --- | --- |
| Pair mode (safe in Era 1 without the floor) | Human drives intent turn-by-turn, AI produces output, human reviews in real time, iteration synchronous. Example: this entire whitepaper was written in pair mode. | The human is the control loop. Catches loops, stops bad commands, redirects in seconds. Bounded scope per turn. |
| Async autonomous (requires the floor) | Dev-E polls /api/assignments/next every 5 minutes, picks up a TaskSpec, writes code unattended for an hour while the human is asleep, opens a PR that sits awaiting review. | The floor is the control loop. StuckGuard, dangerous-command guard, egress policy, hook reliability spool, budget proxy — all deterministic gates because the human isn't watching in real time. |

The safety floor exists to replace the human-in-loop when the human steps away. Pair mode keeps the human in the loop, so the floor is not yet required.
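The era gate reduces to a small predicate. A minimal sketch, assuming hypothetical component names (the real floor list lives in the whitepaper roadmap):

```python
# Illustrative only: component names are assumptions, not the rig's config.
# Autonomous dispatch is legal only when the entire safety floor is live;
# anything less keeps work in pair mode, however routine the task looks.

SAFETY_FLOOR = {"stuckguard", "dangerous_command_guard",
                "egress_policy", "hook_spool", "budget_proxy"}

def dispatch_mode(live_components: set[str]) -> str:
    """Return the only mode the current floor state permits."""
    missing = SAFETY_FLOOR - live_components
    return "pair" if missing else "async-autonomous"

print(dispatch_mode({"stuckguard", "egress_policy"}))  # pair
print(dispatch_mode(SAFETY_FLOOR))                     # async-autonomous
```

Note the asymmetry: pair mode is always available; async autonomy is a privilege the floor unlocks.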

The process explicitly handles this with three eras. You do not skip eras. Era 1 must ship before Era 2 can begin. Era 2 must measure a baseline before Era 3 can begin.

```mermaid
graph LR
    classDef e1 fill:#ffebee,stroke:#c62828,color:#000
    classDef e2 fill:#fff8e1,stroke:#f57f17,color:#000
    classDef e3 fill:#e8f5e9,stroke:#2e7d32,color:#000

    E1[Era 1 — Bootstrap<br/>~1-3 weeks<br/>Pair mode<br/>Humans + AI synchronous<br/>No autonomous dispatch]:::e1
    E2[Era 2 — Supervised<br/>~2-3 months<br/>Dev-E autonomous<br/>100% human PR review<br/>Review-E not trusted yet]:::e2
    E3[Era 3 — Earned<br/>Ongoing<br/>Review-E gates T0-T1<br/>Humans gate T2-T3<br/>Autonomy by track record]:::e3

    E1 -->|safety floor shipped<br/>verified by human audit<br/>abuse tests pass| E2
    E2 -->|quality baseline established<br/>Review-E track record built| E3
```

Era 1 — Bootstrap (pair mode, no autonomous dispatch)

Scope: phase 0 of the whitepaper roadmap — dangerous-command guard, agent identity in git, default-deny egress NetworkPolicy, git worktrees per task. Plus hook reliability spool as the bridge to Era 2.

Who: humans + AI in pair mode. The human drives intent turn-by-turn, the AI produces output (code, YAML, scripts, docs, diffs), the human reviews in real time, iteration is synchronous. Agents can author — the human is the control loop for loops, bad commands, scope drift, and hallucinated APIs.

What does not happen yet: autonomous dispatch. No Dev-E cron picks up TaskSpecs from /api/assignments/next while humans are away. No agent runs unattended for longer than a pair-mode turn. No PR sits for async review while the author goes offline.

Exit criterion: the safety floor is deployed, verified by manual audit, and passes abuse tests:

  • Can an autonomous agent circumvent the dangerous-command guard? (Attempt to bypass, confirm blocked.)
  • Can an event go missing while Conductor-E is unreachable? (Simulate outage, confirm spool retains + drains.)
  • Does StuckGuard actually fire on looping tool-call patterns? (Synthetic loop test.)
  • Is egress denied to anything outside the allowlist? (Try to curl attacker.example, confirm blocked.)

Only after those tests pass does Era 2 begin.
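To make the third bullet concrete, here is a hedged sketch of a synthetic-loop check. The repetition rule (the same (tool, args) pair filling the tail of the call history) is an assumed heuristic for illustration, not StuckGuard's actual detector:

```python
# Assumed heuristic: an agent is "stuck" when its last N tool calls are the
# identical (tool, args) pair. The real StuckGuard may use richer signals.

def is_stuck(tool_calls: list[tuple[str, str]], threshold: int = 3) -> bool:
    if len(tool_calls) < threshold:
        return False
    return len(set(tool_calls[-threshold:])) == 1

# Synthetic loop: the same failing grep issued over and over must fire.
assert is_stuck([("grep", "-r TODO src/")] * 5)
# Varied, productive calls must not fire.
assert not is_stuck([("grep", "a"), ("edit", "b"), ("pytest", "")])
```

An abuse test in CI would drive a sandboxed agent into such a loop and assert that the guard terminates the run.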

Why this matters: the safety floor exists to protect against loss of immediate human oversight. In Era 1 pair mode, the human is the oversight, which is why pair mode is safe before StuckGuard ships. You need the floor before you step out of the loop, not before you invite AI assistance.

Pair mode is faster than humans-alone

~1-3 weeks for phase 0 in pair mode is realistic, vs. ~3-4 weeks humans-alone. The AI pair accelerates boilerplate (YAML manifests, Kyverno policies, test scaffolding) and can reference docs that a human would have to read. The human still owns every merge decision.

Era 2 — Supervised autonomous delegation

Trigger: safety floor shipped, abuse tests passed, autonomous dispatch enabled.

Scope: phases 1-2 of the whitepaper — StuckGuard, Human Prime, observability stack, eval harness, LiteLLM budget proxy (hook reliability spool is a phase-0 prerequisite).

Who: Dev-E runs autonomously — the cron poll against /api/assignments/next is turned on. Dev-E picks up TaskSpecs unattended, writes code, opens PRs. Every PR gets human review before merge. Pair mode is still available for hard problems; async mode handles routine ones.

Review-E may be running alongside in shadow mode (reviewing, posting comments, but its verdicts do not gate merge) to accumulate a track record vs. human verdicts.

Exit criterion: Dev-E has a measured baseline across at least 3 task classes (goal accuracy, rework rate, cost per merged PR) and Review-E's verdicts have agreed with humans ≥ 90% over 50 PRs.

Duration and exit criterion: reality check

50 PRs at 1-2 person scale is not 2-3 months. Realistic rate during Era 2 supervised-autonomous is ~3-8 merged PRs/week (Dev-E runs against the spec backlog, human reviews every one). That's 6-17 weeks to reach 50 PRs if everything goes smoothly — i.e., 1.5-4 months just to build the dataset. Add Review-E shadow-mode accumulation time (has to see each PR, not just Dev-E's), plus time for the baselines-across-3-task-classes to exist at all, and realistic Era 2 duration is 3-6 months, not 2-3.

The earlier "2-3 months" estimate was aggressive. The honest number is 3-6 months with a tail-risk toward longer if T2 work or backlog gaps slow things down.

Why this matters: you need data to promote to Era 3. "Review-E seems fine" is not a basis for trusting Review-E with merge authority. The shadow-mode period is cheap to run and makes the promotion defensible.

Era 3 — Earned delegation

Scope: phases 3+ (per-consumer cursor, subscription registry, bounded-loop sentinel, escalation routing) and ongoing feature work.

Who: Dev-E + Review-E handle T0 and (after per-class track records) T1 without human review. T2 requires human interface approval. T3 is always human-driven.

Promotion rule per agent × task class: 20 successful runs, zero rollbacks → tier ceiling rises. Any rollback attributable to the agent → immediate demotion, 30-day cooldown.
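The per-(agent × task class) rule is deliberately mechanical, which makes it easy to encode. A sketch; the verdict strings and constants are illustrative:

```python
# Encodes the Era 3 track-record rule: 20 clean runs raise the tier ceiling,
# any attributed rollback demotes immediately with a 30-day cooldown.

PROMOTION_RUNS = 20
COOLDOWN_DAYS = 30

def evaluate_track_record(successes: int, rollbacks: int) -> str:
    if rollbacks > 0:
        return f"demote + {COOLDOWN_DAYS}-day cooldown"
    if successes >= PROMOTION_RUNS:
        return "raise tier ceiling"
    return "hold"

assert evaluate_track_record(25, 0) == "raise tier ceiling"
assert evaluate_track_record(19, 0) == "hold"
assert evaluate_track_record(40, 1) == "demote + 30-day cooldown"
```

Rollbacks dominate successes by design: one attributed rollback wipes out any accumulated run count.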

Team topology

Honest staffing, based on what 1-2 humans can actually sustain:

Model names in the table are illustrative defaults, not requirements

The "Haiku-backed" / "Sonnet-backed" tags below are the current default routing. Per provider-portability.md, any LiteLLM-supported model can substitute; see the fallback_models config pattern in cost-framework.md. The role / responsibility / time-commitment columns are the load-bearing content.

| Role | Who | Responsibilities | Time commitment |
| --- | --- | --- | --- |
| Architect / Operator (human) | 1-2 people | Product intent, T3 decisions, weekly quality review, incident command, tool-choice re-evaluation, Kyverno policy changes, pair-mode partner during Era 1 | High-touch during Era 1 pair mode, tapering through Era 2 as autonomous dispatch takes routine work |
| Spec-E (agent, build in Era 2) | default: Haiku 4.5 | Refine fuzzy issues → TaskSpec with acceptance criteria + blast-radius class | Many small calls, cheap |
| Dev-E (agent) | default: Sonnet 4.6 | Implement TaskSpecs — T0/T1 autonomously, T2 under human interface approval, assist on T3 | Continuous |
| Review-E (agent, trusted only in Era 3) | default: Sonnet 4.6 | PR review, bounded-loop sentinel on review/dev ping-pong, never reviews own code | Triggered per PR |
| Dev-E in repair-dispatch mode (Era 3) | default: Sonnet 4.6, same pod class as normal Dev-E | Incident diagnosis from trace + git blame + deploy correlation, propose forward-fix or revert | Triggered by SLO burn alerts, not a separate agent role |

The role-count ceiling

Five agent roles is the ceiling. Adding a sixth ("QA-E", "Docs-E", "Security-E") requires the new role to have a clean event-shaped boundary with existing agents and not share intra-task context. GPT Pilot's archived six-role pipeline is the cautionary tale. When in doubt, build a tool, not a new agent.

Work intake and planning

The unit of work is a GitHub Issue refined into a TaskSpec. There is no separate project management tool.

```mermaid
sequenceDiagram
    participant H as Human
    participant I as GitHub Issue
    participant S as Spec-E
    participant CE as Conductor-E
    participant D as Dispatcher

    H->>I: Create issue with intent
    I-->>S: Webhook event
    S->>S: Read issue body + repo context
    alt ambiguous
        S->>I: Post clarifying questions<br/>label needs-clarification
    else clear
        S->>CE: Commit TaskSpec<br/>tier, acceptance_criteria,<br/>test_strategy, expected_tokens
    end
    H->>I: Answer clarifications
    I-->>S: Re-evaluate
    S->>CE: Commit refined TaskSpec
    CE->>D: Dispatch per tier rules
```

What a TaskSpec looks like is defined in trust-model.md. Spec-E doesn't exist yet in Era 1 — humans write TaskSpecs by hand until Era 2 when Spec-E comes online.
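For orientation only, a TaskSpec might look roughly like the fragment below. Field names are assumptions drawn from this document's vocabulary; trust-model.md is the authoritative schema:

```yaml
# Hypothetical TaskSpec shape -- see trust-model.md for the real definition.
task_id: issue-142
tier: T1                    # blast-radius class; gates dispatch and test depth
title: "feat: add cursor checkpoint to event consumer"
acceptance_criteria:
  - consumer resumes from the last committed cursor after restart
  - replay test shows no duplicate event delivery
test_strategy: unit + integration against real Conductor-E and Postgres
expected_tokens: 80000      # ceiling enforced by the budget proxy
```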

Multi-PR work uses the GitHub Spec Kit layout: .specify/spec.md, .specify/plan.md, .specify/tasks/*.md in the target repo. Parent GitHub Issue links to the spec directory; sub-issues are auto-generated from tasks. See tool-choices.md for why we chose Spec Kit over Backlog.md or Beads.

The development loop

Short-branch trunk-based development. No long-lived feature branches.

```mermaid
graph TB
    classDef author fill:#e3f2fd,color:#000
    classDef gate fill:#fff3e0,color:#000
    classDef deploy fill:#e8f5e9,color:#000
    classDef measure fill:#f3e5f5,color:#000

    A[TaskSpec dispatched]:::author --> B[Branch: feature/issue-N-*]:::author
    B --> C[Implementation]:::author
    C --> D[Commit with gitsign]:::author
    D --> E[Push + open PR]:::author
    E --> G1[CI: tests lint types docs-check]:::gate
    G1 --> G2[Property tests on non-trivial changes]:::gate
    G2 --> G3[Review-E or human review]:::gate
    G3 --> G4[LLM-as-judge sample 10% T1 100% T2]:::gate
    G4 --> M[Squash merge to main]:::author
    M --> F[Flux reconciles]:::deploy
    F --> CAN[Flagger canary 5% → SLI → 25% → 50% → 100%]:::deploy
    CAN --> PROM[Promoted]:::deploy
    PROM --> MEAS[Track record projection updated]:::measure
    MEAS --> TIER[Tier promotion eligibility re-evaluated]:::measure
```

Branch rules:

  • feature/issue-N-short-slug — most work
  • fix/issue-N-short-slug — bug fixes
  • docs/* — documentation
  • chore/*, ci/*, test/*, refactor/* — matches AGENTS.md branch prefix list
  • Lifespan <24 hours for T0/T1. Longer branches are a signal the task should have been broken down.

Commit rules (enforced by CI):

  • Conventional Commits format: feat: ..., fix: ..., docs: ...
  • Agent commits signed with gitsign (ephemeral Fulcio cert bound to agent OIDC)
  • PRs merged with squash, never merge-commit or rebase-merge — preserves the conventional commit on main

PR rules (enforced by branch protection):

  • Title = conventional commit format, <70 chars
  • Body = ## Summary + ## Test plan sections (from AGENTS.md)
  • Closes #N reference to parent issue
  • Agent-authored PRs end with Generated with Claude Code footer
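The title gate is the easiest of these to automate. A sketch of the check, assuming the accepted types mirror the branch-prefix list (the actual CI wiring is not shown):

```python
import re

# Conventional-commit PR title check: type(scope)?: subject, under 70 chars.
# The accepted type list is an assumption mirroring the branch prefixes.
TITLE_RE = re.compile(r"^(feat|fix|docs|chore|ci|test|refactor)(\([a-z0-9-]+\))?: .+")

def title_ok(title: str) -> bool:
    return len(title) < 70 and bool(TITLE_RE.match(title))

assert title_ok("feat: add cursor checkpoint to event consumer")
assert title_ok("fix(dispatcher): retry dispatch on transient 503")
assert not title_ok("Added some stuff")        # not conventional format
assert not title_ok("feat: " + "x" * 80)       # over the length cap
```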

Testing strategy, layered per blast-radius tier

Not every change needs every test. Test depth scales with tier.

| Tier | Required tests | What this catches |
| --- | --- | --- |
| T0 (docs, YAML, test scaffolding) | Lint + frontmatter validation + link check | Syntactic correctness |
| T1 (single-service feature) | T0 + unit tests holding coverage + property-based tests (Hypothesis) on non-trivial changes + integration tests against real Conductor-E + real Postgres | Semantic correctness within bounded scope |
| T2 (cross-cutting) | T1 + schema contract tests + LLM-as-judge (Opus reviews Sonnet's diff against TaskSpec acceptance criteria) | Cross-service integration correctness |
| T3 (irreversible) | T2 + reproduction in ephemeral namespace + explicit rollback playbook + human-authored test | Semantic correctness + reversal plan |

Three evaluation cadences

| Cadence | What runs | Who looks at it |
| --- | --- | --- |
| Per-PR | Full gate set per tier (above) | CI blocks merge on any failure |
| Nightly (~8 h, ~$20-40) | 30-task SWE-bench Pro subset + 10-task internal golden suite + accumulated regression cases | Dashboard + alert on >10% regression |
| Weekly (30 min, human-driven) | Quality dashboard review — goal accuracy, rework rate, cost per merged PR, rollback rate per agent | Decide tier promotions / demotions; file issues on regressions |
| Monthly (1 human, 1 hour) | Tool-choices ADR re-evaluate-when triggers; canary suite freshness audit | Decide tool swaps, prompt updates |
| Quarterly (1 human, half-day) | Whitepaper review — is anything in here wrong given what we learned? | Update docs; refresh evaluation suite |

The weekly review is load-bearing

One person, 30 minutes, dashboard-driven, outcomes recorded. Skip it and the rig drifts without anyone noticing. If only one ritual survives from this document, it should be this one.

Property-based testing specifically

For every non-trivial agent-authored change:

  1. After Dev-E submits a PR, spawn a PropertyTest-E subagent
  2. PropertyTest-E reads the diff, identifies invariants, writes 5-10 Hypothesis property tests
  3. Runs them with reasonable bounds (100 examples per property, 60s max — prevents Hypothesis-heavy CI)
  4. If any property fails, Dev-E iterates
  5. If all pass, they become permanent regression tests checked into the repo

Non-trivial = changes > 20 lines, new function/method, or heuristic match on business-logic names (compute, calculate, validate, process). arXiv:2510.09907 (October 2025) shows LLM-generated property tests find bugs beyond unit-test coverage.
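A minimal example of the artifact PropertyTest-E would produce, with a hypothetical target function and invariant; the bounds mirror the rule above (100 examples, tight deadline):

```python
from hypothesis import given, settings, strategies as st

def compute_discount(price: float, pct: int) -> float:
    """Hypothetical business-logic target: apply a 0-100% discount."""
    return price * ((100 - pct) / 100)

@settings(max_examples=100, deadline=1000)  # bounded, keeps CI fast
@given(price=st.floats(min_value=0, max_value=1e6),
       pct=st.integers(min_value=0, max_value=100))
def test_discount_invariants(price, pct):
    result = compute_discount(price, pct)
    # Invariant probed across the whole input space: the discounted price
    # is never negative and never exceeds the original price.
    assert 0 <= result <= price

test_discount_invariants()  # passing properties get checked in as regressions
```

Once checked in, Hypothesis replays previously failing examples from its database, which is what turns step 5 into a durable regression net.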

Quality gates — what must be true at every merge

Every single gate below is automated. None of them are "check manually" or "the reviewer eyeballs it." Miss one → automatic block.

  1. CI green: tests + lint + type-check + docs-check + link-check + frontmatter validation
  2. Review approved: Review-E (if Era 3 and agent has earned the tier) OR human
  3. No conflicts on main: rebase if needed before merge
  4. Conventional-commit title: validated by CI
  5. Squash-merge policy: enforced by branch protection
  6. New dependencies pass supply chain check: Socket.dev score ≥ threshold + package age ≥ 14 days + Dependabot malware clean
  7. New event types or interfaces: subscription.yaml in rig-gitops updated in the same PR
  8. Prompt changes: golden-suite regression test passes
  9. T3 changes: two-attestor Kyverno policy satisfied (agent Sigstore sig + human OIDC cosign)
  10. Attestation chain complete: gitsign commit + SLSA provenance + cosign image sig all present and verifiable

No ad-hoc 'merge and fix later'

The gates exist because every failure mode above has burned someone in production. Bypassing them on a Friday evening is the single most reliable way to produce a weekend-ruining incident. If a gate is genuinely wrong for a given PR, the fix is to change the gate in a T2 PR, not to bypass it.

Release cadence

| Tier | Cadence | Rationale |
| --- | --- | --- |
| T0 / T1 | Continuous — merge-to-production <30 min in Era 3 | Canary handles safety. Fast loops = fast learning. |
| T2 | Weekly batch (Monday merges) | Weekend gives humans a buffer to catch regressions; Flagger still enforces SLO gates on each promotion. |
| T3 | As-scheduled by humans | Never rush. Planned deliberately, executed with human co-sign. |

The error-budget gate substitutes for formal feature freezes. When a service's SLO burn rate hits threshold, only fixes (attested as such) can merge until budget recovers. This avoids the "it's Friday, we're stopping all deploys" pattern that doesn't map well to a small-team rig.
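A sketch of that gate as merge-time logic; the threshold value and the attested-fix label are illustrative, not the rig's actual configuration:

```python
# Error-budget merge gate: a healthy budget admits anything, a burning
# budget admits only PRs attested as fixes. Label name is an assumption.
BURN_THRESHOLD = 1.0  # >= 1.0 means budget is burning unsustainably

def merge_allowed(burn_rate: float, pr_labels: set[str]) -> bool:
    if burn_rate < BURN_THRESHOLD:
        return True
    return "attested-fix" in pr_labels

assert merge_allowed(0.3, {"feature"})        # budget healthy
assert not merge_allowed(2.0, {"feature"})    # features frozen
assert merge_allowed(2.0, {"attested-fix"})   # fixes still flow
```

The gate opens and closes automatically with the burn rate, which is what replaces calendar-driven freezes.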

Feedback loops

The rig improves only if these loops close. Each one is visible on a dashboard or in the event log.

```mermaid
graph TB
    classDef inc fill:#ffebee,color:#000
    classDef learn fill:#e8f5e9,color:#000
    classDef ops fill:#e3f2fd,color:#000

    INC[Every incident]:::inc --> PM[Post-mortem record in Conductor-E<br/>+ regression test added]:::learn
    RB[Every rollback]:::inc --> ATT[Attributed PR tagged<br/>agent tier drops one level]:::learn
    WK[Every week]:::ops --> TC[Tool-choices ADR triggers checked<br/>ownership shifts license changes CVEs]:::learn
    MO[Every month]:::ops --> CS[Canary suite refreshed<br/>is it still catching relevant drift]:::learn
    QT[Every quarter]:::ops --> WP[Whitepaper review<br/>what did we learn that contradicts this doc]:::learn

    PM --> TIER[Tier policy updates]:::learn
    ATT --> TIER
    TC --> SWAP[Tool swap decision]:::learn
    CS --> UPDATE[Update evaluation suite]:::learn
    WP --> ADR[ADR records for changed picks]:::learn
```

Specifically: the post-incident loop

Every resolved incident — whether auto-fixed by Dev-E (repair-dispatch) or human-resolved — produces:

  1. Structured incident record in Conductor-E (SLI that fired, trace IDs, diff, decision log, time-to-resolve)
  2. Templated post-mortem GitHub Issue opened automatically
  3. The fix PR tagged with incident-id:N so cross-linking is queryable
  4. A regression test added to the nightly suite covering the failure signature
  5. If a similar signature fires later, Dev-E (repair-dispatch) retrieves prior fixes first — the loop closes
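The record in step 1 is deliberately structured so that step 5's retrieval is a query, not an archaeology dig. An illustrative shape (field names are assumptions, not Conductor-E's schema):

```python
from dataclasses import dataclass

@dataclass
class IncidentRecord:
    incident_id: int
    sli_fired: str               # which SLI breached
    trace_ids: list[str]         # traces captured during the burn
    fix_ref: str                 # PR / diff that resolved it
    decision_log: list[str]      # what was tried, in order
    time_to_resolve_min: int
    regression_test: str         # test added to the nightly suite

rec = IncidentRecord(
    incident_id=17,
    sli_fired="checkout-latency-p99",
    trace_ids=["a1b2c3"],
    fix_ref="PR #214",
    decision_log=["flag kill", "forward-fix"],
    time_to_resolve_min=22,
    regression_test="tests/regression/test_incident_17.py",
)
assert rec.regression_test  # a post-mortem without a test violates the loop
```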

Emergency process

When production breaks (SLO burning, customers affected):

  1. Feature flag kill first — flagd, ~30 s via git commit → Flux reconcile. Reversible. Lowest blast.
  2. If that fails, rollback — Flagger canary re-promotion of previous version, ~5 min. Total but slow.
  3. Never ad-hoc hot-fix without canary. No emergency fast path. This is the Cloudflare Dec 5 2025 lesson enforced by Kyverno policies.
  4. T3 still requires human co-sign even in emergencies. A destructive migration to fix production still needs two attestors. Production urgency is not a reason to weaken safety guarantees.
  5. Within 24 h: post-mortem record + regression test + tool-choice re-evaluation if a tool contributed.
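Step 1 in flagd terms: the kill is a one-line state flip in the flag file, committed to git and reconciled by Flux. The flag name and variants below are hypothetical; the state / variants / defaultVariant layout follows flagd's flag-definition format:

```json
{
  "flags": {
    "new-checkout-flow": {
      "state": "DISABLED",
      "defaultVariant": "off",
      "variants": { "on": true, "off": false }
    }
  }
}
```

Flipping "state" to DISABLED sends every evaluation down the application's default code path, which is why this is the lowest-blast first move.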

The on-call reality

Self-healing reduces on-call load but does not eliminate it. Human on-call is required for:

  • P0 incidents (security, data loss, full outage)
  • Ambiguous diagnoses where Dev-E (repair-dispatch) confidence < 0.5
  • T3 incidents (auth, payments, destructive changes)
  • Novel failure signatures the rig hasn't seen
  • Escalations that auto-bumped through severity tiers to DM + @mention

See self-healing.md for the full incident response pipeline and limitations.md for what self-healing cannot do.

Measuring whether the process itself is working

The development process has its own SLOs. If these slip, the process is failing:

| Metric | Target | Alert on |
| --- | --- | --- |
| Median PR lead time (issue-created → merged) | <1 day T0, <1 week T1, <1 month T2 | >2× baseline for 2 weeks |
| T2 human-approval turnaround (PR opened → interface approved or rejected) | <24 h business-hours | >48 h, or any one ≥72 h |
| Rework rate (commits added after initial PR draft) | <10% | >20% for a class over 30 d |
| Rollback rate | <5% of merges | Any T3 rollback |
| Weekly review attendance | Every week | 2 consecutive misses |
| Time-to-post-mortem (incident closed → post-mortem published) | <24 h | >48 h |
| Regression-test-add rate (post-mortems that added a test) | 100% | Any post-mortem without a test |
| Tool-choices re-evaluation cadence | Monthly | Skipped for 2 months |
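Each of these is computable from the event log with a few lines. A sketch of the first metric's alert condition, with hypothetical inputs (the "for 2 weeks" persistence condition would be layered on top):

```python
from statistics import median

def lead_time_alert(lead_times_days: list[float], baseline_days: float) -> bool:
    """Fire when median issue-created -> merged time breaches 2x baseline."""
    return median(lead_times_days) > 2 * baseline_days

# Healthy T0 week: median 0.5 days against a 0.5-day baseline.
assert not lead_time_alert([0.3, 0.5, 0.8], baseline_days=0.5)
# Degraded T1 month: median 15 days against a 5-day baseline.
assert lead_time_alert([12, 15, 20], baseline_days=5)
```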

T2 approval is a bottleneck at 1-person scale

On a 1-2 person rig the T2 gate chain — GitHub Deployments API entry + required-reviewer approval + Sigstore human-cosign — serializes through one human. If that human is sick, on vacation, or heads-down on something else, every T2 PR blocks. Mitigations:

  • T2 work queues: Conductor-E batches T2 PRs into a reviewable list; no single PR blocks the pipeline
  • Default-open-then-approve: T2 can ship to a feature branch and canary but cannot merge to main without the cosign — the SLO above governs the cosign, not the PR open
  • Escalation: if the T2 turnaround metric breaches 48h, Conductor-E emits a T2_APPROVAL_STALE event which routes to Discord DM — same severity as a P2 incident

At strict-single-operator scale, this SLO can legitimately slip during travel/illness. The honest framing: the SLO is what we aim for in steady-state; the mitigations are what kick in when it slips. See limitations.md for the structural-limit acknowledgment.

These are dashboarded like any other SLO. They exist because a broken development process is invisible until it fails catastrophically — the point of metrics is to make it visible before that.

What to actively avoid

Standup theater. One human + agents doesn't need daily standups. The Conductor-E event stream is the standup. Humans read the dashboard on Monday morning and that's enough.

Sprint ceremonies for their own sake. The weekly quality review is the only calendar ritual. Everything else runs on work-in-flight triggers.

Calendar-driven feature freezes. Use the error-budget-exhaustion gate instead. When SLO burns, only fixes merge. Recovery closes the gate.

Waterfall T2 planning. T2 interface design goes in a small GitHub Issue + .specify/spec.md sidecar, reviewed by one human, then dispatched. Not a multi-week design doc that expires before it's used.

Process bureaucracy for agents. Dev-E doesn't need a Jira ticket — it needs a TaskSpec. Keep the agent-facing surface frictionless; humans deal with the strategic layer.

Enabling autonomous dispatch before phase 0 ships. This is the single biggest cultural failure mode. Autonomous agents without the safety floor produce incidents at a rate and severity that pair mode doesn't (because pair mode has the human control loop). Pair mode can start on day one — autonomous dispatch waits for the floor. Do not confuse "agents helping" with "agents running unsupervised."

Adopting phase 5+ features before phase 0 ships. Related failure mode. You cannot build Dev-E (repair-dispatch) for autonomous production repair before StuckGuard is shipped — Dev-E (repair-dispatch) would inherit all the failure modes StuckGuard catches, and worse, Dev-E (repair-dispatch) would be acting in production with those failure modes active. Pair-mode development of Dev-E (repair-dispatch) in Era 1 is fine; Dev-E (repair-dispatch) dispatched on production incidents is not.

"Let's skip the tests for speed." The gates exist because every one of them represents a production incident someone already paid for. Skipping a gate is paying the cost again. Gates are changed via T2 process, not bypassed.

Human-reviewed-by-the-PR-author. An author cannot be their own reviewer — this is basic but worth stating because single-human teams drift into self-review under time pressure. If only one human is available, pair with the agent reviewer (Review-E) + LLM-as-judge sampling, and accept that self-review is a gap.

The meta-rule: this process is itself T2

Changes to this document — new agent roles, new tiers, new test cadences, new release rules — are T2 changes. They require:

  • PR against dashecorp/rig-gitops
  • Human interface approval (another human-OIDC cosign)
  • Review-E approval (once Review-E is earned)
  • Measurable rationale for the change (what metric will move? in which direction?)

This doc is not a constitution. It is a living ADR that updates as the rig learns. When a process rule stops serving us, we change it — through the same mechanism we change anything else.

The one-sentence summary

Humans ship the floor, agents handle the volume, every change passes the same gates, the weekly review catches what the gates miss, and the process itself is subject to the trust model.

Open questions for the team

Things this doc deliberately does not answer because they depend on team preferences:

  1. Who's on-call? In a 1-2 person team, on-call is shared but the split is a team choice.
  2. Work hours and async vs sync? Agents work 24/7; humans do not. The team decides whether human-gated T2/T3 work happens on business days only or whenever.
  3. Recognition and compensation for agent-authored work. Agent identity in git attributes commits (principle 5) but the HR/accounting side is a team policy, not a technical rule.
  4. Onboarding a new human. Read order, access provisioning, first-week tasks. Pending when the team grows.
  5. Offboarding a human. Credential rotation, Kyverno identity removal, access audit. Pending when it happens.

These become their own ADRs when the team addresses them.

See also