MVP Scope — Minimum Viable Rig¶
The core question this doc answers
What is the smallest set of deployed capabilities that makes the rig usable for real work? MVP is Phase 0 + Phase 1 + Phase 2 from the whitepaper roadmap (~4-6 weeks total) plus selective Phase 3 elements. That gives us:
- Safety floor: dangerous-command guard, agent identity in git, egress locks, worktrees
- Reliability: stuck detection, hook resiliency
- Measurement: observability (OTel + Langfuse), cost tracking
- Coordination upgrade: per-consumer cursor (not full subscription registry)
Estimated effort: 3-4 pair-mode weeks. Deliverable: one end-to-end GitHub issue → Dev-E implements → Review-E approves → merge, with bounded cost, no destructive mistakes, and measurable quality signals.
What "MVP" means in this context¶
The rig earns trust for a task when:
- The blast radius is bounded — code can be rolled back, effects reversed
- We have measured track record on that task class — not just "agents are good at this" but "this rig, this repo, succeeds N%"
- Every action is attestable — cryptographic chain from intent to artifact
- Failure modes are known and handled — stuck detection, budget exhaustion stops work, no silent failures
An MVP rig has all four for a single narrow task class: "file a GitHub issue for a Node.js repo feature, Dev-E implements with tests, Review-E approves, code lands and deploys." A full rig extends this to 10+ task classes with different autonomy tiers and blast radii.
This MVP is not production-ready for a large team. It is sufficient for a single developer or a 2-3 person team to ship small features to their own infrastructure and catch logic bugs in staging before they reach prod, because the loop closes: measure, detect drift, escalate.
What's already deployed (today, as of 2026-04-17)¶
Deployed (21 capabilities across 7 domains):
Coordination¶
- Conductor-E event store (Marten/Postgres): 28 event types, all projections live
- `POST /api/events` endpoint: production-active
- Assignment dispatch (`GET /api/assignments/next`): priority + FIFO (no capacity check yet)
- Review claim endpoint (`GET /api/reviews/next`): confirmed working; docs were stale
Agent execution¶
- Dev-E (Node variant): active, 5-minute cron dispatch
- Review-E: deployed, independent review gate
- Both agents: Claude Code CLI runtime, GitHub MCP, advisor MCP, memory MCP (pre-installed)
Security¶
- SOPS + age encryption + Flux inline decryption: confirmed working across all Kustomizations
- GitHub App tokens (GitHub Actions workflow): deployed for CI
Observability¶
- OpenTelemetry Collector: deployed for Conductor-E spans
- Local Prometheus: deployed via kube-prometheus-stack (ready but not yet source of truth for Flagger)
- Cost dashboard (basic): static HTML, TokenUsageProjection aggregates per agent × repo
Memory¶
- Postgres + pgvector storage (co-located with Marten): ready
- HNSW + GIN indexes: ready
- OpenAI embeddings (optional, silent BM25 fallback): deployed
- `search_memories` MCP tool: hybrid vector + BM25 search
- Session-start memory LOAD: confirmed in logs
- Advisor handoff protocol: prompt-level (PR #71), zero enforcement
Cluster and runtime¶
- k3s on 8GB GCP VM (`invotek-k3s`): stable
- KEDA autoscaling: deployed
- FluxCD GitOps: syncing rig-gitops every 10m
- GitHub Actions + GHCR: per-repo builds published
- Cloudflare Tunnel: conductor-e.dashecorp.com live
- Discord webhooks: Conductor-E event listener posts to #dev-e, #review-e
Development process¶
- AGENTS.md standard: deployed, enforced across all repos
- Mermaid CI check: `.github/workflows/mermaid-check.yml` on all PRs
Partial status (7 capabilities — gaps acknowledged):
- Dev-E dotnet variant: HelmRelease exists, `cron.enabled: false` → functionally dormant
- Dev-E python variant: same as dotnet
- Memory write pipeline (`save_pattern`): exists but agents don't emit the `### Learnings` section
- Memory `mark_used`: tool exists, never called → hit-counter metric is fiction
- Memory `compact_repo`: tool exists, no cron trigger
- Flux-detected code/config drift: Flux detects it, but it isn't yet surfaced as alerts
- TokenUsage projection: basic aggregation only, no hard enforcement (LiteLLM proxy not deployed)
Critical gap analysis: what's blocking MVP¶
Phase 0 items (4 capabilities, ~1 week):
1. Dangerous-command guard¶
What breaks without it: Dev-E can execute rm -rf /, git push --force, sudo, drop table. One hallucination and the repo is corrupted. Unacceptable.
What it does: PreToolUse hook, rejects:
- sudo, sudo -i, sudo -s
- rm -rf /, rm -rf ., git reset --hard, git clean -fdx (destructive)
- git push --force (allow --force-with-lease only)
- SQL drop table, truncate, delete from without WHERE clause
- Package-manager installs (pip install, npm install via tool — ephemeral sandbox only)
- No override flag (Gastown's deliberate choice)
Evidence: Fully specified in example-first-story.md with user story, test matrix, rollout sequence. This is the first real user story.
Effort: 2 pair-mode days (hook + 3 test suites: unit, integration, e2e on actual agent)
Dependencies: None; ship first.
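For concreteness, the reject logic can be sketched as a small PreToolUse filter. This is a minimal illustration of the rule list above, not the deployed hook — the patterns, function name, and return shape are assumptions:

```python
import re

# Hypothetical rule set mirroring the list above — illustrative patterns,
# not the deployed hook configuration.
DENY_PATTERNS = [
    r"^\s*sudo(\s|$)",                                     # sudo, sudo -i, sudo -s
    r"\brm\s+-rf\s+(/|\.)(\s|$)",                          # rm -rf / and rm -rf .
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+clean\s+-fdx\b",
    r"\bgit\s+push\b(?!.*--force-with-lease).*--force\b",  # bare --force only
    r"\b(drop\s+table|truncate)\b",
]

def guard(command: str) -> tuple[bool, str]:
    """PreToolUse check: return (allowed, reason). No override flag by design."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return False, f"rejected by rule: {pattern}"
    # DELETE FROM without a WHERE clause is its own rule
    if re.search(r"\bdelete\s+from\b", command, re.IGNORECASE) and \
            not re.search(r"\bwhere\b", command, re.IGNORECASE):
        return False, "DELETE without WHERE clause"
    return True, "ok"
```

Note the `--force-with-lease` carve-out: the negative lookahead lets the safer flag through while rejecting a bare `--force`.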
2. Agent identity in git¶
What breaks without it: Commits appear from generic bot, no audit trail. Dev-E can hide who broke what.
What it does: Every Dev-E/Review-E commit signed with agent's SSH key (deployed via Conductor-E secret store), author = agent identity, message tags [Dev-E] / [Review-E].
Effort: 1 pair-mode day (SSH key rotation, GitHub signing verification config)
Dependencies: None; parallel to guard.
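As a sketch, the per-agent identity could be wired with plain git config. The email, key path, and repo layout here are illustrative placeholders, not the deployed secret-store values:

```shell
# Hypothetical Dev-E identity setup; run inside the agent's checkout.
set -e
REPO=$(mktemp -d) && cd "$REPO" && git init -q

git config user.name  "Dev-E"
git config user.email "dev-e@agents.local"          # placeholder address
git config gpg.format ssh                            # sign with an SSH key, not GPG
git config user.signingkey "$HOME/.ssh/dev-e.pub"    # delivered via secret store
git config commit.gpgsign true                       # every commit is signed
```

GitHub-side verification additionally requires registering the public key as a signing key on the agent's account.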
3. Default-deny egress NetworkPolicy¶
What breaks without it: Dev-E can exfiltrate secrets to attacker-controlled domain. CaMeL (formal prompt-injection defense) is too heavy for Phase 0, but network-level egress control is cheap.
What it does: Cilium NetworkPolicy (L7 DNS + HTTP allowlist). Allows:
- api.github.com (GitHub API)
- git.github.com (Git fetch/push)
- cdn.jsdelivr.net, registry.npmjs.org (npm install)
- conductor-e-api (internal)
- python.org, pypi.org (Python only in dev-e-python pod)
- Nothing else.
Effort: 2 pair-mode days (Cilium policy template, per-pod overrides, testing ingress/DNS/HTTP separately)
Dependencies: Cilium already deployed; just write the policies.
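A sketch of what one policy could look like as a CiliumNetworkPolicy, showing a subset of the allowlist. Names and labels are illustrative; note the real policy needs an explicit DNS egress rule so `toFQDNs` resolution works at all:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dev-e-egress-allowlist   # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: dev-e
  egress:
    # DNS must be allowed (and inspected) for toFQDNs rules to resolve
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Explicit allowlist; anything not listed is denied by default
    - toFQDNs:
        - matchName: api.github.com
        - matchName: registry.npmjs.org
        - matchName: cdn.jsdelivr.net
```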
4. Git worktrees per agent task¶
What breaks without it: Agents clone the full repo for every issue, then delete. Slow cold-start, high I/O, breaks if clone fails midway.
What it does: Use git worktree add per issue (Cursor 2026 pattern). Shared main checkout, ephemeral branch worktrees, ~1.5s faster per task.
Effort: 1.5 pair-mode days (update Claude Code spawn logic, cleanup cron)
Dependencies: Purely runtime; no other items depend on it.
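The per-issue flow, sketched in plain git (paths and branch names are illustrative):

```shell
set -e
BASE=$(mktemp -d)

# Shared main checkout, cloned once at pod start
git init -q "$BASE/main" && cd "$BASE/main"
git -c user.name=rig -c user.email=rig@local commit -q --allow-empty -m "init"

# One ephemeral worktree + branch per issue — no full re-clone
git worktree add -q -b issue-123 "$BASE/issue-123"

# ... agent works inside $BASE/issue-123 ...

# Cleanup after the task lands (a cron sweeps up anything orphaned)
git worktree remove "$BASE/issue-123"
git branch -q -D issue-123
```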
Phase 1 items (2 capabilities, ~1.5 weeks):
5. Hook reliability spool¶
What breaks without it: Conductor-E goes down (restart, upgrade, outage). Agents stop receiving heartbeats. Silent death — no escalation.
What it does:
- Agents emit events to local spool file (write-ahead log, /tmp/events.jsonl)
- Local heartbeat to Conductor-E every 60 seconds
- On 5xx or timeout: retry spool against Conductor-E when it recovers
- Spooled events delivered at least once (idempotent event handler on Conductor-E side)
Effort: 2 pair-mode days (WAL format, spool-flush logic, idempotency on Conductor-E)
Dependencies: Must come after Phase 0 (if guards fail, we want events spooled before escalation).
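A minimal sketch of the spool mechanics. The class name, file layout, and endpoint are assumptions; the deployed version would add backoff and the 60-second heartbeat:

```python
import json
import os
import urllib.request

class EventSpool:
    """Write-ahead spool: events land on disk first, then flush to Conductor-E.
    Delivery is at-least-once, so the server must dedupe on event_id."""

    def __init__(self, path, endpoint="http://conductor-e-api/api/events"):
        self.path, self.endpoint = path, endpoint

    def emit(self, event: dict) -> None:
        # Append-only JSONL write; fsync so a crash can't lose the event
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def flush(self, post=None) -> int:
        """Replay spooled events; failed ones stay spooled for the next cycle.
        Returns the number of events delivered."""
        post = post or self._post
        with open(self.path) as f:
            events = [json.loads(line) for line in f if line.strip()]
        delivered, remaining = 0, []
        for ev in events:
            try:
                post(ev)
                delivered += 1
            except OSError:
                remaining.append(ev)   # Conductor-E still down; keep for retry
        with open(self.path, "w") as f:
            f.writelines(json.dumps(ev) + "\n" for ev in remaining)
        return delivered

    def _post(self, ev):
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(ev).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)
```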
6. StuckGuard middleware (5 patterns)¶
What breaks without it: Dev-E loops indefinitely on same tool call (e.g., retrying a download 100× in 1 minute). Tokens burn, no progress detected.
What it does: Deterministic loop detection at the tool-call layer:
1. Tool-call repetition — same tool 5+ times in 10 turns → stuck
2. Error repetition — same error message 3+ times → stuck
3. State unchanged — repo file count unchanged after 10 steps → stuck
4. Output no-op — command succeeds but produces no output 5× running → stuck
5. Token budget burn — input tokens > 90% of allocation → warn before stuck
From OpenHands, Goose, Sweep research — three independent codebases converged here.
When triggered: emit AgentStuck event → Conductor-E escalates → Discord notification + on-call page.
Effort: 2.5 pair-mode days (5 patterns, metrics collection, test on real agent stalls)
Dependencies: Hook reliability (Phase 1, item 5) must be solid first.
Phase 2 items (3 capabilities, ~1.5 weeks):
7. OpenTelemetry + Langfuse self-hosted¶
What breaks without it: No visibility into Dev-E decisions. Did it succeed? Did it hallucinate? Where did tokens go? Can't evaluate the rig.
What it does:
- Claude Code natively emits OTel GenAI semantic conventions (CLAUDE_CODE_ENABLE_TELEMETRY=1)
- OTel Collector forwards traces to Langfuse (self-hosted, container on 8GB VM)
- Langfuse ingests: tool calls, token counts, latencies, LLM decisions, hallucinations
- Conductor-E events wired to Langfuse for full session trace
Why not SaaS: LangSmith/Braintrust data egress + dependency; we stay internal.
Why not full LGTM: Grafana Loki on 8GB = thrashing. Hybrid: local Prometheus for Flagger gates, Grafana Cloud Free for logs/traces.
Effort: 2.5 pair-mode days (Langfuse Helm chart, OTel config, Conductor-E ↔ Langfuse bridge, dashboards)
Dependencies: Must come after Phase 1 (only works with reliable events).
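The agent-side wiring is mostly environment configuration. A sketch: `CLAUDE_CODE_ENABLE_TELEMETRY` is from the Claude Code docs, the `OTEL_*` variables are standard OpenTelemetry settings, and the collector service name is an assumption:

```shell
# Enable Claude Code's native OTel emission (GenAI semantic conventions)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
# Point at the in-cluster collector (service name is illustrative)
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```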
8. LiteLLM budget proxy¶
What breaks without it: No hard cost ceiling. Dev-E burns through Anthropic plan on one looping bug.
What it does:
- LiteLLM proxy between agents and the provider API
- Per-agent virtual keys with hourly + daily token budgets
- 429 circuit breaker (stop-the-world, escalate to human)
- Cross-provider fallback (if Anthropic 429s, try OpenAI)
Budgets: Dev-E 100k tokens/day, Review-E 50k/day (conservative; adjust after 1 week of data).
Effort: 1.5 pair-mode days (LiteLLM Helm chart, per-agent key provisioning, Conductor-E integration)
Dependencies: Must have Phase 1 stuck detection first (otherwise a looping agent burns budget before stuck is detected).
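The stop-the-world behavior, independent of LiteLLM's actual config format, can be sketched like this. The class and method names are illustrative, not LiteLLM's API; the budgets are the starting numbers above:

```python
class BudgetBreaker:
    """Sketch of the proxy's per-agent budget gate (not LiteLLM's API)."""

    DAILY_BUDGET = {"dev-e": 100_000, "review-e": 50_000}  # tokens/day

    def __init__(self):
        self.spent = {agent: 0 for agent in self.DAILY_BUDGET}
        self.tripped = set()

    def record(self, agent: str, tokens: int) -> bool:
        """Count usage; False means the circuit is open — stop work, escalate."""
        if agent in self.tripped:
            return False
        self.spent[agent] += tokens
        if self.spent[agent] > self.DAILY_BUDGET[agent]:
            self.tripped.add(agent)   # mirrors the 429 circuit breaker
            return False
        return True
```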
9. Nightly golden suite + regression gate¶
What breaks without it: No quantified measure that "we're not regressing." Ship a PR, agents mysteriously slower, nobody knows.
What it does:
- 10 internal golden-suite tasks (mix of repos, issue types)
- Run nightly, seeded with timestamp
- Measure: success rate, tokens, tool-call count, PR-merge-readiness
- Regression >20%: PR merge gate fails
- Weekly summary dashboard in Langfuse
Cost: ~$3-8/night at Sonnet 4.6 pricing.
Effort: 1.5 pair-mode days (task set, regression eval CI job, dashboard)
Dependencies: Langfuse (Phase 2, item 7) must be live first.
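The gate math is a simple relative-drop comparison. A sketch — the real gate would aggregate several metrics, not success rate alone:

```python
def regression_gate(baseline: float, current: float, threshold: float = 0.20) -> bool:
    """Pass if the current success rate hasn't dropped more than
    `threshold` relative to the baseline."""
    if baseline == 0:
        return True   # no baseline yet → nothing to regress against
    drop = (baseline - current) / baseline
    return drop <= threshold
```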
Phase 3a item (1 capability, ~1 week) — necessary for MVP to close the loop:
10. Per-consumer cursor projection¶
What breaks without it: Assignment dispatch is dumb. Dev-E can be assigned 5 issues if it doesn't call /api/assignments/next. No capacity awareness.
What it does:
- New projection: per-agent (AgentId, LastEventOrdinal, SubscribedEventTypes, ConcurrentSlots, InFlightAssignments)
- Derived from Marten event stream per agent
- API: GET /api/cursor?agentId=dev-e returns {lastOrdinal, inFlight: [repo#N, repo#M]}
- Dispatch checks cursor: "Is this agent at capacity?" before assigning new work
This is the minimum needed for "exactly-once-per-agent" and capacity awareness. Does not include full subscription registry (Phase 3) — that's aspirational.
Effort: 2 pair-mode days (Marten projection, API endpoint, Conductor-E integration)
Dependencies: Conductor-E event sourcing is already rock solid; this is plumbing.
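The capacity check the projection enables, sketched in-memory. The real version is a Marten projection on the Conductor-E side; field names here follow the shape above but the API is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AgentCursor:
    """In-memory sketch of the per-consumer cursor projection."""
    agent_id: str
    last_event_ordinal: int = 0
    concurrent_slots: int = 1
    in_flight: list[str] = field(default_factory=list)

    def at_capacity(self) -> bool:
        return len(self.in_flight) >= self.concurrent_slots

    def assign(self, issue: str) -> bool:
        """Dispatch-side check: refuse new work when the agent is at capacity."""
        if self.at_capacity():
            return False
        self.in_flight.append(issue)
        return True

    def complete(self, issue: str, ordinal: int) -> None:
        """Release the slot and advance the cursor past the completion event."""
        self.in_flight.remove(issue)
        self.last_event_ordinal = max(self.last_event_ordinal, ordinal)
```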
What's explicitly NOT in MVP¶
Deferred to Phase 4+:
- Bounded-loop sentinel: Review-E ↔ Dev-E ping-pong detection. Valuable but not MVP-blocking; escalation + human review catches it.
- Escalation routing: Severity-routed Discord pings (P0 → @mention, P1 → thread, P2 → no ping). Spreadsheet + Discord is fine for 2 people.
- Error budget projection: SLO burn-rate tracking. Too early; we don't have 28 days of clean production data.
- Flagger canary: Progressive delivery via Prometheus SLI gates. k3s ↔ prod is one click; no rollout window needed yet.
- pgroll migrations: Expand/contract safety. Schema changes are manual + human approval for MVP.
- Spec-E (intake refiner): Separate agent to clarify fuzzy GitHub issues. Humans will refine intent for MVP.
- Architect-E: Interface shaper for T2 decisions. Human architects handle this.
- Dev-E repair-dispatch mode: SLO-triggered production incident response. Zero production data yet.
Rejected explicitly (will not ship, ever):
- CaMeL separation phase 0: DeepMind formal guarantee is real, but operational cost for 1-2 person team is high. Phase 6 (full defense-in-depth).
- SLSA L3 + Sigstore: Supply chain is table stakes, but MVP can ship with unsigned images + human review gates. Phase 4.
- Kyverno admission: Ditto; native admission is Phase 4+.
- Property-based test generation: Nice-to-have; unit + golden suite cover MVP.
- Drift canary: Phase 6+.
- Memory scope hierarchy: Keep soft tagging (session/task/repo tags, no enforcement). Hard enforcement is Phase 3.
Honest gaps to accept in MVP¶
These will ship as "Partial" and we'll document the limitation:
1. Dev-E multi-variant support (dormant, documented)¶
HelmReleases exist for Python + dotnet variants. Both have cron.enabled: false — they're not running, and we're not blocking them from shipping.
Reason: Node variant covers most use cases. Python/dotnet add operational surface (3 runtimes to track). Ship with Node only, add multi-variant later when we have data on demand.
How we mitigate: Label issues requires-python / requires-dotnet if filed; Dev-E rejects them. Cleanup PR to remove dormant HelmReleases is a follow-up.
2. Memory write pipeline incomplete¶
save_pattern (auto-scrape of ### Learnings from Dev-E output) is wired, but agents don't emit the marker section. The system prompt asks for it, yet without enforcement it remains a soft target.
Reason: Memory is nice-to-have; not blocking MVP. Memory LOAD (recall) works fine.
How we mitigate: Session-start loads agent memory automatically. Dev-E can call write_memory manually if needed. Track "agents called write_memory" metric — if zero after 1 week, revisit system prompt in Phase 2.
3. Memory TTL + compaction crons not scheduled¶
Tables ready, columns defined, tools exist. Cronjobs don't run.
Reason: On 8GB VM with baseline load ~40%, memory at ~3GB. Room to grow; clean up later.
How we mitigate: Manual cleanup every 30 days (1 SQL statement). Phase 3 adds cron.
4. Cost tracking is passive, not enforced¶
TokenUsageProjection aggregates per agent × repo. LiteLLM proxy enforces hard limits, but it's not wired to the system yet (Phase 2 item 8).
Reason: Phase 2 activity; MVP has soft tracking (dashboard), hard limits come with proxy.
How we mitigate: Daily cost report (cronjob in conductor-e) emailed to team. Manual override possible but flagged.
5. Prometheus not yet source of truth for deploy gates¶
Prometheus is deployed, collecting metrics. Flagger (self-healing) is not — so we can't gate deploys on SLO burn rate yet.
Reason: Phase 5 activity (self-healing). MVP gates on code review + human approval.
How we mitigate: Manual gates for now; Flagger coming Phase 5.
6. Observability is Conductor-E + Langfuse only; agent traces are blank until Phase 2¶
OTel Collector is ready, but Claude Code traces won't flow until we set CLAUDE_CODE_ENABLE_TELEMETRY=1 and Langfuse is live.
Reason: Phase 2 item 7; MVP can ship with Conductor-E events alone (enough to trace workflow).
How we mitigate: Phase 2 full deploy brings agent-level detail. For MVP, Langfuse empty but operational.
MVP exit criteria — how we know it's done¶
An MVP is usable when all of these are true:
- One full issue-to-merge cycle, no human intervention beyond Type-2 approvals:
  - Human files `#123 Feature: add --verbose flag to CLI`
  - Dev-E claims, implements, tests, creates PR
  - Review-E reviews independently (approves or requests changes)
  - Human approves (Type-2 gate)
  - Merge → production deploy (automatic via Flux)
  - No human intervention between review approval and production live
- Safety guards are active and tested:
  - Dangerous-command guard tested: `rm -rf /` attempted → rejected with reason logged
  - Agent identity verified: commits have agent author + signing key
  - Egress lock tested: attempt to curl an attacker domain → denied by NetworkPolicy
  - Cost gate tested: token budget exceeded → escalation event fired
- Observability is measurable:
  - Langfuse dashboard shows: tokens used, tool calls, decision points, hallucination rate
  - Conductor-E event stream shows: work assigned, work started, PR created, review passed, merged, deployed
  - Cost dashboard shows: per-agent daily token spend, trending
- Failure modes are handled:
  - Dev-E loops 10× on the same tool → StuckGuard detects → escalation event fired → Discord notification sent
  - Conductor-E heartbeat missing >5 min → escalation event fired
  - Token budget exceeded → work stops, escalation event fired
- Golden suite regression gate is live and gating:
  - 10-task golden suite runs every night
  - Results dashboard in Langfuse
  - PR merge fails if the golden suite regresses >20%
- Measured track record exists:
  - Minimum 20 successful issues merged (T0 level) with zero rework
  - Minimum 10 Review-E reviews completed
  - Zero false-positive escalations (stuck detection triggered on real stuck work, not flakes)
MVP-to-full-rig roadmap¶
After MVP ships (~Week 4), the team enters Phase 3-7 work (full rig, 3-6 months):
Next up (Weeks 5-8):
- Full subscription registry (Phase 3) — who consumes whose events, topology validation at deploy
- Bounded-loop sentinel (Phase 4) — Dev-E ↔ Review-E ping-pong cap
- Escalation routing (Phase 4) — Discord severity-based notification shape
- Flagger + Prometheus SLO gates (Phase 5) — canary-gated production deploys

Stretch goal (Months 2-3):
- Spec-E intake refiner (Phase 7)
- Dev-E repair-dispatch mode (Phase 5, Stage 2) — incident response automation
- CaMeL trust separation (Phase 6) — formal prompt-injection defense

Long term (Months 4-6+):
- Architect-E (Phase 7)
- 4-tier memory scoping enforcement (Phase 3+)
- Property-based testing (Phase 2+)
- Drift detection canary (Phase 6)
MVP dependencies and build order¶
```mermaid
graph TB
    subgraph "Week 1 — Phase 0 (parallel)"
        guard["Dangerous-command guard"]
        identity["Agent identity in git"]
        egress["Egress NetworkPolicy"]
        worktrees["Git worktrees"]
    end
    subgraph "Week 2-3 — Phase 1 + 2"
        spool["Hook reliability spool"]
        stuck["StuckGuard middleware"]
        otel["OTel + Langfuse"]
        litellm["LiteLLM proxy"]
        golden["Golden suite regression"]
    end
    subgraph "Week 3-4 — Phase 3a"
        cursor["Per-consumer cursor"]
    end
    guard --> spool
    identity --> spool
    egress --> spool
    worktrees --> spool
    spool --> stuck
    spool --> otel
    stuck --> litellm
    otel --> golden
    litellm --> golden
    golden --> cursor
    cursor --> exit["MVP exit criteria met"]
    classDef phase0 fill:#e8f5e9,stroke:#2e7d32,color:#000
    classDef phase12 fill:#fff3e0,stroke:#e65100,color:#000
    classDef phase3 fill:#e3f2fd,stroke:#1565c0,color:#000
    classDef exit fill:#fce4ec,stroke:#ad1457,color:#000
    class guard,identity,egress,worktrees phase0
    class spool,stuck,otel,litellm,golden phase12
    class cursor phase3
    class exit exit
```
Key dependency: Phase 0 must complete before Phase 1 → all safety + identity locks in place before we scale observability and cost. Cannot ship observability without stuck detection (otherwise metrics lie).
Parallelism: Phase 0 items (1-4) are fully independent; assign them to different teammates if available.
Status doc corrections & reconciliations¶
Based on a code audit, the implementation-status.md doc is accurate — the review found no rows needing correction. Row-by-row results:
| Row | Current Status | Should Be | Reason |
|---|---|---|---|
| `GET /api/reviews/next` | Deployed ✓ | (no change) | Confirmed in Program.cs:804–808; README was stale, not the endpoint |
| Dev-E (dotnet variant) | Partial ✓ | (no change) | Correct; `cron.enabled: false` |
| Dev-E (python variant) | Partial ✓ | (no change) | Correct; same state as dotnet |
| Memory `save_pattern` | Partial ✓ | (no change) | Correct; tool exists, agents don't emit marker |
| Memory `mark_used` | Partial ✓ | (no change) | Correct; metric is 0% |
| Prometheus | Partial ✓ | (no change) | Correct; not yet source of truth for Flagger (Flagger not deployed) |
| OTel Collector | Partial ✓ | (no change) | Correct; Conductor-E emits, agents don't yet (Phase 2 item 7) |
| Cost dashboard | Partial ✓ | (no change) | Correct; basic aggregation, no hard enforcement |
No rows need fixing. The status doc is honest and up-to-date. Each "Partial" row has a clear gap acknowledged.
Success metrics: how we measure MVP quality¶
By end of Week 4:
| Metric | MVP Target | Why it matters |
|---|---|---|
| Issues closed (any size) | ≥20 | Volume signal; proves dispatch loop works |
| Zero rework rate (T0) | ≥95% | First-time approval without "changes requested" |
| Review-E reviews completed | ≥10 | Independent review gate is working |
| Stuck detection precision | 100% TP, 0 FP | If we escalate, it's real stuck, not flake |
| Cost per issue merged | <$0.50 | Sanity check; budget gates working |
| Mean time to merge (from issue) | <30 min | Feasible for human + agent 1-turn loop |
| Langfuse golden suite regress | 0% (stable) | New system hasn't broken baseline |
Honest scope confessions¶
These claims are not proven; they're educated guesses:
- "Dangerous-command guard is enough for Phase 0 safety." We're betting that rejects on `rm -rf /` and `drop table` catch 90% of catastrophic hallucinations. Novel attacks (e.g., time-bomb code) still slip through. We're not claiming perfection, just "good enough to deploy."
- "Per-consumer cursor is the right capacity model." It works for Conductor-E's event-driven architecture, but we've never run at scale (>5 concurrent agents). It might need rework if we hit deadlock under load.
- "Golden suite regression gate catches real degradation." We're assuming 10 representative tasks + a 20% threshold is calibrated right. The first month of data will tell us if it's too strict (too many false positives) or too loose (misses real regressions).
- "LiteLLM proxy is operationally simple." It's a straightforward proxy-in-container, but multi-provider fallback logic is complex. The first incident under cross-provider fallback might reveal surprises.
- "Langfuse self-hosted on 8GB won't thrash." Prometheus is only ~500MB, but once we add agent traces (50+/day), the ingest rate might surprise us. We may need to drop to weekly eval instead of nightly if disk I/O becomes a problem.
Risk register for MVP¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Dangerous-command guard misses 0-day exploit | Low | Critical | Weekly external review of guard rules; honeypot tests (intentional attack attempts) |
| LiteLLM proxy silently drops requests on 429 | Medium | High | Metrics-driven alert (429 count > threshold); manual fallback drill every 2 weeks |
| Stuck detection false-positive > 10% | Medium | Medium | Tune thresholds after Week 1; adjust before Phase 3 |
| Golden suite too brittle (constant regression) | Low | Medium | Simplify task set or threshold if merge-rate drops below 50% |
| Langfuse disk fills in 2 weeks | Low | High | Implement TTL pruning before nightly eval starts; monitor disk pressure daily |
What success looks like in one sentence¶
By the end of Week 4, a human can file a GitHub issue, Dev-E implements it (with tests and docs), Review-E approves, code merges and deploys, and the whole team can inspect the trace and cost with confidence that nothing destructive happened autonomously.
That is the minimum viable rig.
This document is living and will be updated weekly as MVP items ship. Track changes in the rig-gitops CHANGELOG.