MVP Scope — Minimum Viable Rig

The core question this doc answers

What is the smallest set of deployed capabilities that makes the rig usable for real work? MVP is Phase 0 + Phase 1 + Phase 2 from the whitepaper roadmap, plus selective Phase 3 elements — roughly 3-4 weeks of work in total. That gives us:

  • Safety floor: dangerous-command guard, agent identity in git, egress locks, worktrees
  • Reliability: stuck detection, hook resiliency
  • Measurement: observability (OTel + Langfuse), cost tracking
  • Coordination upgrade: per-consumer cursor (not full subscription registry)

Estimated effort: 3-4 pair-mode weeks. Deliverable: one end-to-end GitHub issue → Dev-E implements → Review-E approves → merge, with bounded cost, no destructive mistakes, and measurable quality signals.

What "MVP" means in this context

The rig earns trust for a task when:

  1. The blast radius is bounded — code can be rolled back, effects reversed
  2. We have measured track record on that task class — not just "agents are good at this" but "this rig, this repo, succeeds N%"
  3. Every action is attestable — cryptographic chain from intent to artifact
  4. Failure modes are known and handled — stuck detection, budget exhaustion stops work, no silent failures

An MVP rig has all four for a single narrow task class: "file a GitHub issue for a Node.js repo feature, Dev-E implements with tests, Review-E approves, code lands and deploys." A full rig extends this to 10+ task classes with different autonomy tiers and blast radii.

This MVP is not production-ready for a large team. It is viable for a single developer or a 2-3 person team shipping small features to their own infrastructure, catching logic bugs in staging before they reach prod, because the loop closes: measure, detect drift, escalate.


What's already deployed (today, as of 2026-04-17)

Deployed (21 capabilities across 7 domains):

Coordination

  • Conductor-E event store (Marten/Postgres): 28 event types, all projections live
  • POST /api/events endpoint: production-active
  • Assignment dispatch (GET /api/assignments/next): priority + FIFO (no capacity check yet)
  • Review claim endpoint (GET /api/reviews/next): confirmed working, docs were stale

Agent execution

  • Dev-E (Node variant): active, 5-minute cron dispatch
  • Review-E: deployed, independent review gate
  • Both agents: Claude Code CLI runtime, GitHub MCP, advisor MCP, memory MCP (pre-installed)

Security

  • SOPS + age encryption + Flux inline decryption: confirmed working across all Kustomizations
  • GitHub App tokens (GitHub Actions workflow): deployed for CI

Observability

  • OpenTelemetry Collector: deployed for Conductor-E spans
  • Local Prometheus: deployed via kube-prometheus-stack (ready but not yet source of truth for Flagger)
  • Cost dashboard (basic): static HTML, TokenUsageProjection aggregates per agent × repo

Memory

  • Postgres + pgvector storage (co-located with Marten): ready
  • HNSW + GIN indexes: ready
  • OpenAI embeddings (optional, silent BM25 fallback): deployed
  • search_memories MCP tool: hybrid vector + BM25 search
  • Session-start memory LOAD: confirmed in logs
  • Advisor handoff protocol: prompt-level (PR #71), zero enforcement

Cluster and runtime

  • k3s on 8GB GCP VM (invotek-k3s): stable
  • KEDA autoscaling: deployed
  • FluxCD GitOps: syncing rig-gitops every 10m
  • GitHub Actions + GHCR: per-repo builds published
  • Cloudflare Tunnel: conductor-e.dashecorp.com live
  • Discord webhooks: Conductor-E event listener posts to #dev-e, #review-e

Development process

  • AGENTS.md standard: deployed, enforced across all repos
  • Mermaid CI check: .github/workflows/mermaid-check.yml on all PRs

Partial status (7 capabilities — gaps acknowledged):

  • Dev-E dotnet variant: HelmRelease exists, cron.enabled: false → functionally dormant
  • Dev-E python variant: same as dotnet
  • Memory write pipeline (save_pattern): exists but agents don't emit the ### Learnings section
  • Memory mark_used: tool exists, never called → hit-counter metric is fiction
  • Memory compact_repo: tool exists, no cron trigger
  • Flux-detected code/config drift: Flux detects, not yet surfaced as alerts
  • TokenUsage projection: basic aggregation only, no hard enforcement (LiteLLM proxy not deployed)

Critical gap analysis: what's blocking MVP

Phase 0 items (4 capabilities, ~1 week):

1. Dangerous-command guard

What breaks without it: Dev-E can execute rm -rf /, git push --force, sudo, drop table. One hallucination and the repo is corrupted. Unacceptable.

What it does: PreToolUse hook that rejects:

  • sudo, sudo -i, sudo -s
  • rm -rf /, rm -rf ., git reset --hard, git clean -fdx (destructive)
  • git push --force (allow --force-with-lease only)
  • SQL drop table, truncate, and delete from without a WHERE clause
  • Package-manager installs (pip install, npm install via tool — ephemeral sandbox only)

No override flag (Gastown's deliberate choice).
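To make the shape of the guard concrete, here is a minimal sketch of the rejection logic, assuming the hook receives the raw command string. The pattern list and the `check_command` name are illustrative, not the shipped rule set:

```python
import re

# Illustrative deny rules; the real hook's rule set lives in the guard itself.
DENY_PATTERNS = [
    r"^\s*sudo(\s|$)",                          # sudo / sudo -i / sudo -s
    r"\brm\s+-[a-z]*r[a-z]*f[a-z]*\s+(/|\.)",   # rm -rf / or rm -rf .
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+clean\s+-[a-z]*f",
    r"\bgit\s+push\b.*--force(?!-with-lease)",  # allow --force-with-lease only
    r"\b(drop\s+table|truncate)\b",
    r"\bdelete\s+from\s+\w+\s*;?\s*$",          # DELETE with no WHERE clause
]

def check_command(cmd: str):
    """Return (allowed, reason). No override flag, by design."""
    lowered = cmd.lower()
    for pattern in DENY_PATTERNS:
        if re.search(pattern, lowered):
            return False, f"blocked by rule: {pattern}"
    return True, "ok"
```

The real hook would also have to handle shell quoting and command chaining (`&&`, `;`), which this sketch ignores.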

Evidence: Fully specified in example-first-story.md with user story, test matrix, rollout sequence. This is the first real user story.

Effort: 2 pair-mode days (hook + 3 test suites: unit, integration, e2e on actual agent)

Dependencies: None; ship first.

2. Agent identity in git

What breaks without it: Commits appear from generic bot, no audit trail. Dev-E can hide who broke what.

What it does: Every Dev-E/Review-E commit signed with agent's SSH key (deployed via Conductor-E secret store), author = agent identity, message tags [Dev-E] / [Review-E].

Effort: 1 pair-mode day (SSH key rotation, GitHub signing verification config)

Dependencies: None; parallel to guard.

3. Default-deny egress NetworkPolicy

What breaks without it: Dev-E can exfiltrate secrets to attacker-controlled domain. CaMeL (formal prompt-injection defense) is too heavy for Phase 0, but network-level egress control is cheap.

What it does: Cilium NetworkPolicy (L7 DNS + HTTP allowlist). Allows:

  • api.github.com (GitHub API)
  • git.github.com (Git fetch/push)
  • cdn.jsdelivr.net, registry.npmjs.org (npm install)
  • conductor-e-api (internal)
  • python.org, pypi.org (only in the dev-e-python pod)

Nothing else.
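The enforcement lives in a CiliumNetworkPolicy manifest, but the policy's intent can be unit-tested before it is applied. A hedged Python mirror of the allowlist — the `egress_allowed` helper and pod names are hypothetical:

```python
# Mirror of the intended L7 DNS allowlist, for testing policy intent only;
# the enforced artifact is the CiliumNetworkPolicy manifest, not this code.
EGRESS_ALLOWLIST = {
    "api.github.com",
    "git.github.com",
    "cdn.jsdelivr.net",
    "registry.npmjs.org",
    "conductor-e-api",   # internal service name
}

# Per-pod overrides (hypothetical pod names).
PER_POD_EXTRAS = {
    "dev-e-python": {"python.org", "pypi.org"},
}

def egress_allowed(pod: str, host: str) -> bool:
    """True if the policy intends to let `pod` reach `host`."""
    return host in EGRESS_ALLOWLIST or host in PER_POD_EXTRAS.get(pod, set())
```

A table like this makes the "testing ingress/DNS/HTTP separately" step in the effort estimate cheap to automate.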

Effort: 2 pair-mode days (Cilium policy template, per-pod overrides, testing ingress/DNS/HTTP separately)

Dependencies: Cilium already deployed; just write the policies.

4. Git worktrees per agent task

What breaks without it: Agents clone the full repo for every issue, then delete. Slow cold-start, high I/O, breaks if clone fails midway.

What it does: Use git worktree add per issue (Cursor 2026 pattern). Shared main checkout, ephemeral branch worktrees, ~1.5s faster per task.
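A minimal sketch of the worktree-per-issue flow, assuming the spawn logic shells out to git; the branch and path naming convention here is invented for illustration:

```python
import subprocess
from pathlib import Path

def create_task_worktree(repo: Path, issue: int) -> Path:
    """Create an ephemeral branch + worktree for one issue.

    `repo` is the shared main checkout; naming is illustrative, not the
    rig's actual convention.
    """
    branch = f"issue-{issue}"
    worktree = repo.parent / f"wt-{branch}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree)],
        check=True, capture_output=True,
    )
    return worktree

def remove_task_worktree(repo: Path, worktree: Path) -> None:
    """Cleanup after the task finishes; the rig would run this from a cron."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(worktree)],
        check=True, capture_output=True,
    )
```

Because the main checkout's object store is shared, the per-issue cost drops from a full clone to a branch checkout.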

Effort: 1.5 pair-mode days (update Claude Code spawn logic, cleanup cron)

Dependencies: Purely runtime; no other items depend on it.


Phase 1 items (2 capabilities, ~1.5 weeks):

5. Hook reliability spool

What breaks without it: Conductor-E goes down (restart, upgrade, outage). Agents stop receiving heartbeats. Silent death — no escalation.

What it does:

  • Agents emit events to a local spool file (write-ahead log, /tmp/events.jsonl)
  • Local heartbeat to Conductor-E every 60 seconds
  • On 5xx or timeout: retry the spool against Conductor-E when it recovers
  • Spooled events are delivered at least once (idempotent event handler on the Conductor-E side)
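A sketch of the spool mechanics under the assumptions above — `EventSpool` and its method names are hypothetical, and `send` stands in for the real HTTP call to Conductor-E:

```python
import json
import uuid
from pathlib import Path

class EventSpool:
    """Write-ahead spool: events hit disk first, then flush to Conductor-E.

    `send` is any callable that raises on 5xx/timeout. Illustrative sketch,
    not the shipped agent code.
    """
    def __init__(self, path: Path):
        self.path = path
        self.path.touch(exist_ok=True)

    def append(self, event: dict) -> dict:
        event = {**event, "idempotency_key": str(uuid.uuid4())}
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")   # durable before any network I/O
        return event

    def flush(self, send) -> int:
        """Retry every spooled event; keep failures for the next attempt."""
        lines = self.path.read_text().splitlines()
        remaining, delivered = [], 0
        for line in lines:
            try:
                send(json.loads(line))          # at-least-once; server dedupes
                delivered += 1
            except Exception:
                remaining.append(line)
        self.path.write_text("".join(l + "\n" for l in remaining))
        return delivered
```

The idempotency key is what lets Conductor-E deduplicate the at-least-once redelivery after a recovery.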

Effort: 2 pair-mode days (WAL format, spool-flush logic, idempotency on Conductor-E)

Dependencies: Must come after Phase 0 (if guards fail, we want events spooled before escalation).

6. StuckGuard middleware (5 patterns)

What breaks without it: Dev-E loops indefinitely on same tool call (e.g., retrying a download 100× in 1 minute). Tokens burn, no progress detected.

What it does: Deterministic loop detection at the tool-call layer:

  1. Tool call repetition — same tool 5+ times in 10 turns → stuck
  2. Error repetition — same error message 3+ times → stuck
  3. State unchanged — repo file count the same after 10 steps → stuck
  4. Output no-op — command succeeds but produces no output 5× running → stuck
  5. Token budget burn — input tokens > 90% of allocation → warn before stuck

From OpenHands, Goose, Sweep research — three independent codebases converged here.

When triggered: emit AgentStuck event → Conductor-E escalates → Discord notification + on-call page.
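Two of the five patterns (tool-call repetition and error repetition) are easy to show in miniature. `StuckGuard` below is an illustrative sketch using the thresholds above, not the shipped middleware; a `True` return stands in for emitting AgentStuck:

```python
from collections import deque

class StuckGuard:
    """Sketch of two of the five detectors.

    Thresholds mirror the doc: same tool 5+ times in a 10-turn window,
    same error message 3+ times.
    """
    def __init__(self, tool_limit=5, window=10, error_limit=3):
        self.tool_limit = tool_limit
        self.error_limit = error_limit
        self.recent_tools = deque(maxlen=window)   # sliding 10-turn window
        self.error_counts = {}

    def on_tool_call(self, tool: str) -> bool:
        self.recent_tools.append(tool)
        return self.recent_tools.count(tool) >= self.tool_limit

    def on_error(self, message: str) -> bool:
        self.error_counts[message] = self.error_counts.get(message, 0) + 1
        return self.error_counts[message] >= self.error_limit
```

In the rig, a `True` from either detector would translate into the AgentStuck event that Conductor-E escalates.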

Effort: 2.5 pair-mode days (5 patterns, metrics collection, test on real agent stalls)

Dependencies: Hook reliability (Phase 1, item 5) must be solid first.


Phase 2 items (3 capabilities, ~1.5 weeks):

7. OpenTelemetry + Langfuse self-hosted

What breaks without it: No visibility into Dev-E decisions. Did it succeed? Did it hallucinate? Where did tokens go? Can't evaluate the rig.

What it does:

  • Claude Code natively emits OTel GenAI semantic conventions (CLAUDE_CODE_ENABLE_TELEMETRY=1)
  • The OTel Collector forwards traces to Langfuse (self-hosted, container on the 8GB VM)
  • Langfuse ingests: tool calls, token counts, latencies, LLM decisions, hallucinations
  • Conductor-E events wired to Langfuse for a full session trace

Why not SaaS: LangSmith/Braintrust would add data egress and an external dependency; we stay internal.

Why not the full LGTM stack: Grafana Loki on an 8GB VM would thrash. Hybrid instead: local Prometheus for Flagger gates, Grafana Cloud Free for logs/traces.

Effort: 2.5 pair-mode days (Langfuse Helm chart, OTel config, Conductor-E ↔ Langfuse bridge, dashboards)

Dependencies: Must come after Phase 1 (only works with reliable events).

8. LiteLLM budget proxy

What breaks without it: No hard cost ceiling. Dev-E burns through Anthropic plan on one looping bug.

What it does:

  • LiteLLM proxy between agents and the provider API
  • Per-agent virtual keys with hourly + daily token budgets
  • 429 circuit breaker (stop-the-world, escalate to a human)
  • Cross-provider fallback (if Anthropic 429s, try OpenAI)

Budgets: Dev-E 100k tokens/day, Review-E 50k/day (conservative; adjust after 1 week of data).
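LiteLLM enforces this server-side via virtual keys; the policy itself is simple enough to mirror in a few lines. `BudgetGate` is an illustrative mirror of the daily-budget circuit breaker, not LiteLLM's API:

```python
class BudgetGate:
    """Illustrative mirror of the per-agent daily budget + circuit breaker.

    Real enforcement is LiteLLM virtual keys; this just shows the policy.
    """
    DAILY_BUDGETS = {"dev-e": 100_000, "review-e": 50_000}  # tokens/day

    def __init__(self):
        self.spent = {agent: 0 for agent in self.DAILY_BUDGETS}
        self.tripped = set()    # agents halted pending human review

    def record(self, agent: str, tokens: int) -> str:
        if agent in self.tripped:
            return "halted"
        self.spent[agent] += tokens
        if self.spent[agent] > self.DAILY_BUDGETS[agent]:
            self.tripped.add(agent)   # stop-the-world, escalate to a human
            return "halted"
        return "ok"
```

Note the breaker is sticky: once tripped, the agent stays halted until a human resets it, which matches the stop-the-world intent.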

Effort: 1.5 pair-mode days (LiteLLM Helm chart, per-agent key provisioning, Conductor-E integration)

Dependencies: Must have Phase 1 stuck detection first (otherwise a looping agent burns budget before stuck is detected).

9. Nightly golden suite + regression gate

What breaks without it: No quantified measure that "we're not regressing." Ship a PR, agents mysteriously slower, nobody knows.

What it does:

  • 10 internal golden-suite tasks (mix of repos, issue types)
  • Run nightly, seeded with a timestamp
  • Measure: success rate, tokens, tool-call count, PR-merge-readiness
  • Regression >20%: the PR merge gate fails
  • Weekly summary dashboard in Langfuse

Cost: ~$3-8/night at Sonnet 4.6 pricing.
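The gate's comparison step can be sketched as a pure function over nightly metrics. Metric names here are assumptions, and only higher-is-better metrics are shown (token and tool-call counts would be compared in the opposite direction):

```python
def regression_gate(baseline: dict, nightly: dict, threshold: float = 0.20) -> list:
    """Return the metrics that regressed by more than `threshold`.

    Higher-is-better metrics only (success rate, merge-readiness).
    """
    failures = []
    for metric, base in baseline.items():
        current = nightly.get(metric, 0.0)
        if base > 0 and (base - current) / base > threshold:
            failures.append(metric)
    return failures
```

The CI job would fail the merge gate whenever this list is non-empty.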

Effort: 1.5 pair-mode days (task set, regression eval CI job, dashboard)

Dependencies: Langfuse (Phase 2, item 7) must be live first.


Phase 3a item (1 capability, ~1 week) — necessary for MVP to close the loop:

10. Per-consumer cursor projection

What breaks without it: Assignment dispatch is naive. Dev-E can accumulate 5 assigned issues if it stops calling /api/assignments/next, because dispatch has no capacity awareness.

What it does:

  • New per-agent projection: (AgentId, LastEventOrdinal, SubscribedEventTypes, ConcurrentSlots, InFlightAssignments)
  • Derived from the Marten event stream per agent
  • API: GET /api/cursor?agentId=dev-e returns {lastOrdinal, inFlight: [repo#N, repo#M]}
  • Dispatch checks the cursor — "Is this agent at capacity?" — before assigning new work

This is the minimum needed for "exactly-once-per-agent" and capacity awareness. Does not include full subscription registry (Phase 3) — that's aspirational.
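The capacity-check half of the projection, sketched in Python for illustration (the real projection is a Marten projection on the Conductor-E side; field names follow the list above):

```python
from dataclasses import dataclass, field

@dataclass
class AgentCursor:
    """Sketch of the per-consumer cursor projection state."""
    agent_id: str
    last_event_ordinal: int = 0
    concurrent_slots: int = 1
    in_flight: list = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.in_flight) < self.concurrent_slots

def try_dispatch(cursor: AgentCursor, assignment: str) -> bool:
    """Assign work only if the agent is below its slot limit."""
    if not cursor.has_capacity():
        return False
    cursor.in_flight.append(assignment)
    return True
```

With one concurrent slot, a second dispatch is refused until the in-flight assignment completes, which is exactly the "at capacity" check the dispatch endpoint is missing today.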

Effort: 2 pair-mode days (Marten projection, API endpoint, Conductor-E integration)

Dependencies: Conductor-E event sourcing is already rock solid; this is plumbing.


What's explicitly NOT in MVP

Deferred to Phase 4+:

  • Bounded-loop sentinel: Review-E ↔ Dev-E ping-pong detection. Valuable but not MVP-blocking; escalation + human review catches it.
  • Escalation routing: Severity-routed Discord pings (P0 → @mention, P1 → thread, P2 → no ping). Spreadsheet + Discord is fine for 2 people.
  • Error budget projection: SLO burn-rate tracking. Too early; we don't have 28 days of clean production data.
  • Flagger canary: Progressive delivery via Prometheus SLI gates. k3s ↔ prod is one click; no rollout window needed yet.
  • pgroll migrations: Expand/contract safety. Schema changes are manual + human approval for MVP.
  • Spec-E (intake refiner): Separate agent to clarify fuzzy GitHub issues. Humans will refine intent for MVP.
  • Architect-E: Interface shaper for T2 decisions. Human architects handle this.
  • Dev-E repair-dispatch mode: SLO-triggered production incident response. Zero production data yet.

Rejected explicitly (will not ship, ever):

  • CaMeL separation phase 0: DeepMind formal guarantee is real, but operational cost for 1-2 person team is high. Phase 6 (full defense-in-depth).
  • SLSA L3 + Sigstore: Supply chain is table stakes, but MVP can ship with unsigned images + human review gates. Phase 4.
  • Kyverno admission: Ditto; native admission is Phase 4+.
  • Property-based test generation: Nice-to-have; unit + golden suite cover MVP.
  • Drift canary: Phase 6+.
  • Memory scope hierarchy: Keep soft tagging (session/task/repo tags, no enforcement). Hard enforcement is Phase 3.

Honest gaps to accept in MVP

These will ship as "Partial" and we'll document the limitation:

1. Dev-E multi-variant support (dormant, documented)

HelmReleases exist for the Python and dotnet variants. Both have cron.enabled: false — they're not running, and we won't let them block the MVP from shipping.

Reason: Node variant covers most use cases. Python/dotnet add operational surface (3 runtimes to track). Ship with Node only, add multi-variant later when we have data on demand.

How we mitigate: Label issues requires-python / requires-dotnet if filed; Dev-E rejects them. Cleanup PR to remove dormant HelmReleases is a follow-up.

2. Memory write pipeline incomplete

save_pattern (auto-scrape of the ### Learnings section from Dev-E output) is wired, but agents don't emit the marker section. The system prompt asks for it, but it's a soft target without enforcement.

Reason: Memory is nice-to-have; not blocking MVP. Memory LOAD (recall) works fine.

How we mitigate: Session-start loads agent memory automatically. Dev-E can call write_memory manually if needed. Track "agents called write_memory" metric — if zero after 1 week, revisit system prompt in Phase 2.

3. Memory TTL + compaction crons not scheduled

Tables ready, columns defined, tools exist. Cronjobs don't run.

Reason: On 8GB VM with baseline load ~40%, memory at ~3GB. Room to grow; clean up later.

How we mitigate: Manual cleanup every 30 days (1 SQL statement). Phase 3 adds cron.

4. Cost tracking is passive, not enforced

TokenUsageProjection aggregates per agent × repo. LiteLLM proxy enforces hard limits, but it's not wired to the system yet (Phase 2 item 8).

Reason: Phase 2 activity; MVP has soft tracking (dashboard), hard limits come with proxy.

How we mitigate: Daily cost report (cronjob in conductor-e) emailed to team. Manual override possible but flagged.

5. Prometheus not yet source of truth for deploy gates

Prometheus is deployed, collecting metrics. Flagger (self-healing) is not — so we can't gate deploys on SLO burn rate yet.

Reason: Phase 5 activity (self-healing). MVP gates on code review + human approval.

How we mitigate: Manual gates for now; Flagger coming Phase 5.

6. Observability is Conductor-E + Langfuse only; agent traces are blank until Phase 2

OTel Collector is ready, but Claude Code traces won't flow until we set CLAUDE_CODE_ENABLE_TELEMETRY=1 and Langfuse is live.

Reason: Phase 2 item 7; MVP can ship with Conductor-E events alone (enough to trace workflow).

How we mitigate: Phase 2 full deploy brings agent-level detail. For MVP, Langfuse empty but operational.


MVP exit criteria — how we know it's done

An MVP is usable when all of these are true:

  1. One full issue-to-merge cycle, with no human intervention beyond Type-2 approvals:
       • Human files: #123 Feature: add --verbose flag to CLI
       • Dev-E claims, implements, tests, creates a PR
       • Review-E reviews independently (approves or requests changes)
       • Human approves (Type-2 gate)
       • Merge → production deploy (automatic via Flux; Flagger gating arrives in Phase 5)
       • No human intervention between review approval and production live

  2. Safety guards are active and tested:
       • Dangerous-command guard tested: rm -rf / attempted → rejected with reason logged
       • Agent identity verified: commits carry the agent author + signing key
       • Egress lock tested: attempt to curl an attacker domain → denied by NetworkPolicy
       • Cost gate tested: token budget exceeded → escalation event fired

  3. Observability is measurable:
       • Langfuse dashboard shows: tokens used, tool calls, decision points, hallucination rate
       • Conductor-E event stream shows: work assigned, work started, PR created, review passed, merged, deployed
       • Cost dashboard shows: per-agent daily token spend, trending

  4. Failure modes are handled:
       • Dev-E loops 10× on the same tool → StuckGuard detects → escalation event fired → Discord notification sent
       • Conductor-E heartbeat missing >5 min → escalation event fired
       • Token budget exceeded → work stops, escalation event fired

  5. Golden suite regression gate is live and gating:
       • 10-task golden suite runs every night
       • Results dashboard in Langfuse
       • PR merge fails if the golden suite regresses >20%

  6. Measured track record exists:
       • Minimum 20 successful issues merged (T0 level) with zero rework
       • Minimum 10 Review-E reviews completed
       • Zero false-positive escalations (stuck detection triggered on real stuck states, not flakes)

MVP-to-full-rig roadmap

After MVP ships (~Week 4), the team enters Phase 3-7 work (full rig, 3-6 months):

Next up (Weeks 5-8):

  • Full subscription registry (Phase 3) — who consumes whose events, topology validation at deploy
  • Bounded-loop sentinel (Phase 4) — Dev-E ↔ Review-E ping-pong cap
  • Escalation routing (Phase 4) — Discord severity-based notification shape
  • Flagger + Prometheus SLO gates (Phase 5) — canary-gated production deploys

Stretch goal (Months 2-3):

  • Spec-E intake refiner (Phase 7)
  • Dev-E repair-dispatch mode (Phase 5, Stage 2) — incident response automation
  • CaMeL trust separation (Phase 6) — formal prompt-injection defense

Long term (Months 4-6+):

  • Architect-E (Phase 7)
  • 4-tier memory scoping enforcement (Phase 3+)
  • Property-based testing (Phase 2+)
  • Drift detection canary (Phase 6)


MVP dependencies and build order

graph TB
    subgraph "Week 1 — Phase 0 (parallel)"
        guard["Dangerous-command guard"]
        identity["Agent identity in git"]
        egress["Egress NetworkPolicy"]
        worktrees["Git worktrees"]
    end

    subgraph "Week 2-3 — Phase 1 + 2"
        spool["Hook reliability spool"]
        stuck["StuckGuard middleware"]
        otel["OTel + Langfuse"]
        litellm["LiteLLM proxy"]
        golden["Golden suite regression"]
    end

    subgraph "Week 3-4 — Phase 3a"
        cursor["Per-consumer cursor"]
    end

    guard --> spool
    identity --> spool
    egress --> spool
    worktrees --> spool

    spool --> stuck
    spool --> otel
    stuck --> litellm
    otel --> golden
    litellm --> golden

    golden --> cursor
    cursor --> exit["MVP exit criteria met"]

    classDef phase0 fill:#e8f5e9,stroke:#2e7d32,color:#000
    classDef phase12 fill:#fff3e0,stroke:#e65100,color:#000
    classDef phase3 fill:#e3f2fd,stroke:#1565c0,color:#000
    classDef exit fill:#fce4ec,stroke:#ad1457,color:#000

    class guard,identity,egress,worktrees phase0
    class spool,stuck,otel,litellm,golden phase12
    class cursor phase3
    class exit exit

Key dependency: Phase 0 must complete before Phase 1 → all safety + identity locks in place before we scale observability and cost. Cannot ship observability without stuck detection (otherwise metrics lie).

Parallelism: Phase 0 items (1-4) are fully independent; assign them to different teammates if available.


Status doc corrections & reconciliations

Based on code audit, the implementation-status.md doc is mostly accurate. Minor corrections:

| Row | Current Status | Should Be | Reason |
|---|---|---|---|
| GET /api/reviews/next | Deployed | ✓ (no change) | Confirmed in Program.cs:804–808; README was stale, not the endpoint |
| Dev-E (dotnet variant) | Partial | ✓ (no change) | Correct; cron.enabled: false |
| Dev-E (python variant) | Partial | ✓ (no change) | Correct; same state as dotnet |
| Memory save_pattern | Partial | ✓ (no change) | Correct; tool exists, agents don't emit the marker |
| Memory mark_used | Partial | ✓ (no change) | Correct; metric is 0% |
| Prometheus | Partial | ✓ (no change) | Correct; not yet source of truth for Flagger (Flagger not deployed) |
| OTel Collector | Partial | ✓ (no change) | Correct; Conductor-E emits, agents don't yet (Phase 2 item 7) |
| Cost dashboard | Partial | ✓ (no change) | Correct; basic aggregation, no hard enforcement |

No rows need fixing. The status doc is honest and up-to-date. Each "Partial" row has a clear gap acknowledged.


Success metrics: how we measure MVP quality

By end of Week 4:

| Metric | MVP Target | Why it matters |
|---|---|---|
| Issues closed (any size) | ≥20 | Volume signal; proves the dispatch loop works |
| Zero-rework rate (T0) | ≥95% | First-time approval without "changes requested" |
| Review-E reviews completed | ≥10 | Independent review gate is working |
| Stuck detection precision | 100% TP, 0 FP | If we escalate, it's a real stuck, not a flake |
| Cost per issue merged | <$0.50 | Sanity check; budget gates working |
| Mean time to merge (from issue) | <30 min | Feasible for a human + agent 1-turn loop |
| Langfuse golden suite regression | 0% (stable) | New system hasn't broken the baseline |

Honest scope confessions

These claims are not proven; they're educated guesses:

  1. "Dangerous-command guard is enough for Phase 0 safety." We're betting that rejecting rm -rf /, drop table, and similar commands catches ~90% of catastrophic hallucinations. Novel attacks (e.g., time-bomb code) still slip through. We're not claiming perfection, just "good enough to deploy."

  2. "Per-consumer cursor is the right capacity model." It works for Conductor-E's event-driven architecture, but we've never run at scale (>5 concurrent agents). Might need rework if we hit deadlock under load.

  3. "Golden suite regression gate catches real degradation." We're assuming 10 representative tasks + a 20% threshold is calibrated right. The first month of data will tell us whether it's too strict (too many false positives) or too loose (misses real regressions).

  4. "LiteLLM proxy is operationally simple." It's a straightforward proxy-in-container, but multi-provider fallback logic is complex. First incident under cross-provider fallback might reveal surprises.

  5. "Langfuse self-hosted on 8GB won't thrash." Prometheus is only ~500MB, but once we add agent traces (50+/day), ingest rate might surprise us. May need to drop to weekly eval instead of nightly if disk I/O becomes a problem.


Risk register for MVP

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Dangerous-command guard misses a 0-day exploit | Low | Critical | Weekly external review of guard rules; honeypot tests (intentional attack attempts) |
| LiteLLM proxy silently drops requests on 429 | Medium | High | Metrics-driven alert (429 count > threshold); manual fallback drill every 2 weeks |
| Stuck detection false-positive rate > 10% | Medium | Medium | Tune thresholds after Week 1; adjust before Phase 3 |
| Golden suite too brittle (constant regression) | Low | Medium | Simplify the task set or threshold if merge rate drops below 50% |
| Langfuse disk fills in 2 weeks | Low | High | Implement TTL pruning before nightly eval starts; monitor disk pressure daily |

What success looks like in one sentence

By the end of Week 4, a human can file a GitHub issue, Dev-E implements it (with tests and docs), Review-E approves, code merges and deploys, and the whole team can inspect the trace and cost with confidence that nothing destructive happened autonomously.

That is the minimum viable rig.


This document is living and will be updated weekly as MVP items ship. Track changes in the rig-gitops CHANGELOG.