MVP Scope — Minimum Viable Rig¶
The core question this doc answers
What is the smallest set of deployed capabilities that makes the rig usable for real work? MVP is Phase 0 + Phase 1 + Phase 2 from the whitepaper roadmap (~4-6 weeks total) plus selective Phase 3 elements. That gives us:
- Safety floor: dangerous-command guard, agent identity in git, egress locks, worktrees
- Reliability: stuck detection, hook resiliency
- Measurement: observability (OTel + Langfuse), cost tracking
- Coordination upgrade: per-consumer cursor (not full subscription registry)
Estimated effort: 3-4 pair-mode weeks. Deliverable: one end-to-end GitHub issue → Dev-E implements → Review-E approves → merge, with bounded cost, no destructive mistakes, and measurable quality signals.
What "MVP" means in this context¶
The rig earns trust for a task when:
- The blast radius is bounded — code can be rolled back, effects reversed
- We have measured track record on that task class — not just "agents are good at this" but "this rig, this repo, succeeds N%"
- Every action is attestable — cryptographic chain from intent to artifact
- Failure modes are known and handled — stuck detection, budget exhaustion stops work, no silent failures
An MVP rig has all four for a single narrow task class: "file a GitHub issue for a Node.js repo feature, Dev-E implements with tests, Review-E approves, code lands and deploys." A full rig extends this to 10+ task classes with different autonomy tiers and blast radii.
This MVP is not production-ready for a large team. It is sufficient for a single developer or a 2-3 person team to ship small features to their own infrastructure and catch logic bugs in staging before they reach prod, because the loop closes: measure, detect drift, escalate.
What's already deployed (today, as of 2026-04-17)¶
Deployed (21 capabilities across 7 domains):
Coordination¶
- Conductor-E event store (Marten/Postgres): 28 event types, all projections live
- `POST /api/events` endpoint: production-active
- Assignment dispatch (`GET /api/assignments/next`): priority + FIFO (no capacity check yet)
- Review claim endpoint (`GET /api/reviews/next`): confirmed working; docs were stale
Agent execution¶
- Dev-E (Node variant): active, 5-minute cron dispatch
- Review-E: deployed, independent review gate
- Both agents: Claude Code CLI runtime, GitHub MCP, advisor MCP, memory MCP (pre-installed)
Security¶
- SOPS + age encryption + Flux inline decryption: confirmed working across all Kustomizations
- GitHub App tokens (GitHub Actions workflow): deployed for CI
Observability¶
- OpenTelemetry Collector: deployed for Conductor-E spans
- Local Prometheus: deployed via kube-prometheus-stack (ready but not yet source of truth for Flagger)
- Cost dashboard (basic): static HTML, TokenUsageProjection aggregates per agent × repo
Memory¶
- Postgres + pgvector storage (co-located with Marten): ready
- HNSW + GIN indexes: ready
- OpenAI embeddings (optional, silent BM25 fallback): deployed
- `search_memories` MCP tool: hybrid vector + BM25 search
- Session-start memory LOAD: confirmed in logs
- Advisor handoff protocol: prompt-level (PR #71), zero enforcement
Cluster and runtime¶
- k3s on 8GB GCP VM (`invotek-k3s`): stable
- KEDA autoscaling: deployed
- FluxCD GitOps: syncing rig-gitops every 10m
- GitHub Actions + GHCR: per-repo builds published
- Cloudflare Tunnel: conductor-e.dashecorp.com live
- Discord webhooks: Conductor-E event listener posts to #dev-e, #review-e
Development process¶
- AGENTS.md standard: deployed, enforced across all repos
- Mermaid CI check: `.github/workflows/mermaid-check.yml` on all PRs
Partial status (7 capabilities — gaps acknowledged):
- Dev-E dotnet variant: HelmRelease exists, `cron.enabled: false` → functionally dormant
- Dev-E python variant: same as dotnet
- Memory write pipeline (`save_pattern`): exists but agents don't emit the `### Learnings` section
- Memory `mark_used`: tool exists, never called → hit-counter metric is fiction
- Memory `compact_repo`: tool exists, no cron trigger
- Flux-detected code/config drift: Flux detects it, but it isn't yet surfaced as alerts
- TokenUsage projection: basic aggregation only, no hard enforcement (LiteLLM proxy not deployed)
Critical gap analysis: what's blocking MVP¶
Phase 0 items (4 capabilities, ~1 week):
1. Dangerous-command guard¶
What breaks without it: Dev-E can execute rm -rf /, git push --force, sudo, drop table. One hallucination and the repo is corrupted. Unacceptable.
What it does: PreToolUse hook, rejects:
- sudo, sudo -i, sudo -s
- rm -rf /, rm -rf ., git reset --hard, git clean -fdx (destructive)
- git push --force (allow --force-with-lease only)
- SQL drop table, truncate, delete from without WHERE clause
- Package-manager installs (pip install, npm install via tool — ephemeral sandbox only)
- No override flag (Gastown's deliberate choice)
Evidence: Fully specified in example-first-story.md with user story, test matrix, rollout sequence. This is the first real user story.
Effort: 2 pair-mode days (hook + 3 test suites: unit, integration, e2e on actual agent)
Dependencies: None; ship first.
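For concreteness, the reject logic can be sketched as a small PreToolUse filter. This is a minimal illustration of the rule list above, not the deployed hook — the patterns, function name, and return shape are assumptions:

```python
import re

# Hypothetical rule set mirroring the list above — illustrative patterns,
# not the deployed hook configuration.
DENY_PATTERNS = [
    r"^\s*sudo(\s|$)",                                     # sudo, sudo -i, sudo -s
    r"\brm\s+-rf\s+(/|\.)(\s|$)",                          # rm -rf / and rm -rf .
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+clean\s+-fdx\b",
    r"\bgit\s+push\b(?!.*--force-with-lease).*--force\b",  # bare --force only
    r"\b(drop\s+table|truncate)\b",
]

def guard(command: str) -> tuple[bool, str]:
    """PreToolUse check: return (allowed, reason). No override flag by design."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return False, f"rejected by rule: {pattern}"
    # DELETE FROM without a WHERE clause is its own rule
    if re.search(r"\bdelete\s+from\b", command, re.IGNORECASE) and \
            not re.search(r"\bwhere\b", command, re.IGNORECASE):
        return False, "DELETE without WHERE clause"
    return True, "ok"
```

Note the `--force-with-lease` carve-out: the negative lookahead lets the safer flag through while rejecting a bare `--force`.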
2. Agent identity in git¶
What breaks without it: Commits appear from generic bot, no audit trail. Dev-E can hide who broke what.
What it does: Every Dev-E/Review-E commit signed with agent's SSH key (deployed via Conductor-E secret store), author = agent identity, message tags [Dev-E] / [Review-E].
Effort: 1 pair-mode day (SSH key rotation, GitHub signing verification config)
Dependencies: None; parallel to guard.
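As a sketch, the per-agent identity could be wired with plain git config. The email, key path, and repo layout here are illustrative placeholders, not the deployed secret-store values:

```shell
# Hypothetical Dev-E identity setup; run inside the agent's checkout.
set -e
REPO=$(mktemp -d) && cd "$REPO" && git init -q

git config user.name  "Dev-E"
git config user.email "dev-e@agents.local"          # placeholder address
git config gpg.format ssh                            # sign with an SSH key, not GPG
git config user.signingkey "$HOME/.ssh/dev-e.pub"    # delivered via secret store
git config commit.gpgsign true                       # every commit is signed
```

GitHub-side verification additionally requires registering the public key as a signing key on the agent's account.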
3. Default-deny egress NetworkPolicy¶
What breaks without it: Dev-E can exfiltrate secrets to attacker-controlled domain. CaMeL (formal prompt-injection defense) is too heavy for Phase 0, but network-level egress control is cheap.
What it does: Cilium NetworkPolicy (L7 DNS + HTTP allowlist). Allows:
- api.github.com (GitHub API)
- git.github.com (Git fetch/push)
- cdn.jsdelivr.net, registry.npmjs.org (npm install)
- conductor-e-api (internal)
- python.org, pypi.org (Python only in dev-e-python pod)
- Nothing else.
Effort: 2 pair-mode days (Cilium policy template, per-pod overrides, testing ingress/DNS/HTTP separately)
Dependencies: Cilium already deployed; just write the policies.
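A sketch of what one policy could look like as a CiliumNetworkPolicy, showing a subset of the allowlist. Names and labels are illustrative; note the real policy needs an explicit DNS egress rule so `toFQDNs` resolution works at all:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dev-e-egress-allowlist   # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: dev-e
  egress:
    # DNS must be allowed (and inspected) for toFQDNs rules to resolve
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Explicit allowlist; anything not listed is denied by default
    - toFQDNs:
        - matchName: api.github.com
        - matchName: registry.npmjs.org
        - matchName: cdn.jsdelivr.net
```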
4. Git worktrees per agent task¶
What breaks without it: Agents clone the full repo for every issue, then delete. Slow cold-start, high I/O, breaks if clone fails midway.
What it does: Use git worktree add per issue (Cursor 2026 pattern). Shared main checkout, ephemeral branch worktrees, ~1.5s faster per task.
Effort: 1.5 pair-mode days (update Claude Code spawn logic, cleanup cron)
Dependencies: Purely runtime; no other items depend on it.
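The per-issue flow, sketched in plain git (paths and branch names are illustrative):

```shell
set -e
BASE=$(mktemp -d)

# Shared main checkout, cloned once at pod start
git init -q "$BASE/main" && cd "$BASE/main"
git -c user.name=rig -c user.email=rig@local commit -q --allow-empty -m "init"

# One ephemeral worktree + branch per issue — no full re-clone
git worktree add -q -b issue-123 "$BASE/issue-123"

# ... agent works inside $BASE/issue-123 ...

# Cleanup after the task lands (a cron sweeps up anything orphaned)
git worktree remove "$BASE/issue-123"
git branch -q -D issue-123
```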
Phase 1 items (2 capabilities, ~1.5 weeks):
5. Hook reliability spool¶
What breaks without it: Conductor-E goes down (restart, upgrade, outage). Agents stop receiving heartbeats. Silent death — no escalation.
What it does:
- Agents emit events to local spool file (write-ahead log, /tmp/events.jsonl)
- Local heartbeat to Conductor-E every 60 seconds
- On 5xx or timeout: retry spool against Conductor-E when it recovers
- Spooled events delivered at least once (idempotent event handler on Conductor-E side)
Effort: 2 pair-mode days (WAL format, spool-flush logic, idempotency on Conductor-E)
Dependencies: Must come after Phase 0 (if guards fail, we want events spooled before escalation).
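A minimal sketch of the spool mechanics. The class name, file layout, and endpoint are assumptions; the deployed version would add backoff and the 60-second heartbeat:

```python
import json
import os
import urllib.request

class EventSpool:
    """Write-ahead spool: events land on disk first, then flush to Conductor-E.
    Delivery is at-least-once, so the server must dedupe on event_id."""

    def __init__(self, path, endpoint="http://conductor-e-api/api/events"):
        self.path, self.endpoint = path, endpoint

    def emit(self, event: dict) -> None:
        # Append-only JSONL write; fsync so a crash can't lose the event
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def flush(self, post=None) -> int:
        """Replay spooled events; failed ones stay spooled for the next cycle.
        Returns the number of events delivered."""
        post = post or self._post
        with open(self.path) as f:
            events = [json.loads(line) for line in f if line.strip()]
        delivered, remaining = 0, []
        for ev in events:
            try:
                post(ev)
                delivered += 1
            except OSError:
                remaining.append(ev)   # Conductor-E still down; keep for retry
        with open(self.path, "w") as f:
            f.writelines(json.dumps(ev) + "\n" for ev in remaining)
        return delivered

    def _post(self, ev):
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(ev).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)
```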
6. StuckGuard middleware (5 patterns)¶
What breaks without it: Dev-E loops indefinitely on same tool call (e.g., retrying a download 100× in 1 minute). Tokens burn, no progress detected.
What it does: Deterministic loop detection at the tool-call layer:
1. Tool-call repetition — same tool 5+ times in 10 turns → stuck
2. Error repetition — same error message 3+ times → stuck
3. State unchanged — repo file count unchanged after 10 steps → stuck
4. Output no-op — command succeeds but produces no output 5× running → stuck
5. Token budget burn — input tokens > 90% of allocation → warn before stuck
From OpenHands, Goose, Sweep research — three independent codebases converged here.
When triggered: emit AgentStuck event → Conductor-E escalates → Discord notification + on-call page.
Effort: 2.5 pair-mode days (5 patterns, metrics collection, test on real agent stalls)
Dependencies: Hook reliability (Phase 1, item 5) must be solid first.
Phase 2 items (3 capabilities, ~1.5 weeks):
7. OpenTelemetry + Langfuse self-hosted¶
What breaks without it: No visibility into Dev-E decisions. Did it succeed? Did it hallucinate? Where did tokens go? Can't evaluate the rig.
What it does:
- Claude Code natively emits OTel GenAI semantic conventions (CLAUDE_CODE_ENABLE_TELEMETRY=1)
- OTel Collector forwards traces to Langfuse (self-hosted, container on 8GB VM)
- Langfuse ingests: tool calls, token counts, latencies, LLM decisions, hallucinations
- Conductor-E events wired to Langfuse for full session trace
Why not SaaS: LangSmith/Braintrust data egress + dependency; we stay internal.
Why not full LGTM: Grafana Loki on 8GB = thrashing. Hybrid: local Prometheus for Flagger gates, Grafana Cloud Free for logs/traces.
Effort: 2.5 pair-mode days (Langfuse Helm chart, OTel config, Conductor-E ↔ Langfuse bridge, dashboards)
Dependencies: Must come after Phase 1 (only works with reliable events).
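The agent-side wiring is mostly environment configuration. A sketch: `CLAUDE_CODE_ENABLE_TELEMETRY` is from the Claude Code docs, the `OTEL_*` variables are standard OpenTelemetry settings, and the collector service name is an assumption:

```shell
# Enable Claude Code's native OTel emission (GenAI semantic conventions)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
# Point at the in-cluster collector (service name is illustrative)
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```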
8. LiteLLM budget proxy¶
What breaks without it: No hard cost ceiling. Dev-E burns through Anthropic plan on one looping bug.
What it does:
- LiteLLM proxy between agents and the provider API
- Per-agent virtual keys with hourly + daily token budgets
- 429 circuit breaker (stop-the-world, escalate to human)
- Cross-provider fallback (if Anthropic 429s, try OpenAI)
Budgets: Dev-E 100k tokens/day, Review-E 50k/day (conservative; adjust after 1 week of data).
Effort: 1.5 pair-mode days (LiteLLM Helm chart, per-agent key provisioning, Conductor-E integration)
Dependencies: Must have Phase 1 stuck detection first (otherwise a looping agent burns budget before stuck is detected).
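The stop-the-world behavior, independent of LiteLLM's actual config format, can be sketched like this. The class and method names are illustrative, not LiteLLM's API; the budgets are the starting numbers above:

```python
class BudgetBreaker:
    """Sketch of the proxy's per-agent budget gate (not LiteLLM's API)."""

    DAILY_BUDGET = {"dev-e": 100_000, "review-e": 50_000}  # tokens/day

    def __init__(self):
        self.spent = {agent: 0 for agent in self.DAILY_BUDGET}
        self.tripped = set()

    def record(self, agent: str, tokens: int) -> bool:
        """Count usage; False means the circuit is open — stop work, escalate."""
        if agent in self.tripped:
            return False
        self.spent[agent] += tokens
        if self.spent[agent] > self.DAILY_BUDGET[agent]:
            self.tripped.add(agent)   # mirrors the 429 circuit breaker
            return False
        return True
```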
9. Nightly golden suite + regression gate¶
What breaks without it: No quantified measure that "we're not regressing." Ship a PR, agents mysteriously slower, nobody knows.
What it does:
- 10 internal golden-suite tasks (mix of repos, issue types)
- Run nightly, seeded with timestamp
- Measure: success rate, tokens, tool-call count, PR-merge-readiness
- Regression >20%: PR merge gate fails
- Weekly summary dashboard in Langfuse
Cost: ~$3-8/night at Sonnet 4.6 pricing.
Effort: 1.5 pair-mode days (task set, regression eval CI job, dashboard)
Dependencies: Langfuse (Phase 2, item 7) must be live first.
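The gate math is a simple relative-drop comparison. A sketch — the real gate would aggregate several metrics, not success rate alone:

```python
def regression_gate(baseline: float, current: float, threshold: float = 0.20) -> bool:
    """Pass if the current success rate hasn't dropped more than
    `threshold` relative to the baseline."""
    if baseline == 0:
        return True   # no baseline yet → nothing to regress against
    drop = (baseline - current) / baseline
    return drop <= threshold
```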
Phase 3a item (1 capability, ~1 week) — necessary for MVP to close the loop:
10. Per-consumer cursor projection¶
What breaks without it: Assignment dispatch is dumb. Dev-E can be assigned 5 issues if it doesn't call /api/assignments/next. No capacity awareness.
What it does:
- New projection: per-agent (AgentId, LastEventOrdinal, SubscribedEventTypes, ConcurrentSlots, InFlightAssignments)
- Derived from Marten event stream per agent
- API: GET /api/cursor?agentId=dev-e returns {lastOrdinal, inFlight: [repo#N, repo#M]}
- Dispatch checks cursor: "Is this agent at capacity?" before assigning new work
This is the minimum needed for "exactly-once-per-agent" and capacity awareness. Does not include full subscription registry (Phase 3) — that's aspirational.
Effort: 2 pair-mode days (Marten projection, API endpoint, Conductor-E integration)
Dependencies: Conductor-E event sourcing is already rock solid; this is plumbing.
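The capacity check the projection enables, sketched in-memory. The real version is a Marten projection on the Conductor-E side; field names here follow the shape above but the API is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AgentCursor:
    """In-memory sketch of the per-consumer cursor projection."""
    agent_id: str
    last_event_ordinal: int = 0
    concurrent_slots: int = 1
    in_flight: list[str] = field(default_factory=list)

    def at_capacity(self) -> bool:
        return len(self.in_flight) >= self.concurrent_slots

    def assign(self, issue: str) -> bool:
        """Dispatch-side check: refuse new work when the agent is at capacity."""
        if self.at_capacity():
            return False
        self.in_flight.append(issue)
        return True

    def complete(self, issue: str, ordinal: int) -> None:
        """Release the slot and advance the cursor past the completion event."""
        self.in_flight.remove(issue)
        self.last_event_ordinal = max(self.last_event_ordinal, ordinal)
```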
What's explicitly NOT in MVP¶
Deferred to Phase 4+:
- Bounded-loop sentinel: Review-E ↔ Dev-E ping-pong detection. Valuable but not MVP-blocking; escalation + human review catches it.
- Escalation routing: Severity-routed Discord pings (P0 → @mention, P1 → thread, P2 → no ping). Spreadsheet + Discord is fine for 2 people.
- Error budget projection: SLO burn-rate tracking. Too early; we don't have 28 days of clean production data.
- Flagger canary: Progressive delivery via Prometheus SLI gates. k3s ↔ prod is one click; no rollout window needed yet.
- pgroll migrations: Expand/contract safety. Schema changes are manual + human approval for MVP.
- Spec-E (intake refiner): Separate agent to clarify fuzzy GitHub issues. Humans will refine intent for MVP.
- Architect-E: Interface shaper for T2 decisions. Human architects handle this.
- Dev-E repair-dispatch mode: SLO-triggered production incident response. Zero production data yet.
Rejected explicitly (will not ship, ever):
- CaMeL separation phase 0: DeepMind formal guarantee is real, but operational cost for 1-2 person team is high. Phase 6 (full defense-in-depth).
- SLSA L3 + Sigstore: Supply chain is table stakes, but MVP can ship with unsigned images + human review gates. Phase 4.
- Kyverno admission: Ditto; native admission is Phase 4+.
- Property-based test generation: Nice-to-have; unit + golden suite cover MVP.
- Drift canary: Phase 6+.
- Memory scope hierarchy: Keep soft tagging (session/task/repo tags, no enforcement). Hard enforcement is Phase 3.
Honest gaps to accept in MVP¶
These will ship as "Partial" and we'll document the limitation:
1. Dev-E multi-variant support (dormant, documented)¶
HelmReleases exist for Python + dotnet variants. Both have cron.enabled: false — they're not running, and we're not blocking them from shipping.
Reason: Node variant covers most use cases. Python/dotnet add operational surface (3 runtimes to track). Ship with Node only, add multi-variant later when we have data on demand.
How we mitigate: Label issues requires-python / requires-dotnet if filed; Dev-E rejects them. Cleanup PR to remove dormant HelmReleases is a follow-up.
2. Memory write pipeline incomplete¶
save_pattern (auto-scrape of ### Learnings from Dev-E output) is wired, but agents don't emit the marker section. The system prompt asks for it, yet without enforcement it remains a soft target.
Reason: Memory is nice-to-have; not blocking MVP. Memory LOAD (recall) works fine.
How we mitigate: Session-start loads agent memory automatically. Dev-E can call write_memory manually if needed. Track "agents called write_memory" metric — if zero after 1 week, revisit system prompt in Phase 2.
3. Memory TTL + compaction crons not scheduled¶
Tables ready, columns defined, tools exist. Cronjobs don't run.
Reason: On 8GB VM with baseline load ~40%, memory at ~3GB. Room to grow; clean up later.
How we mitigate: Manual cleanup every 30 days (1 SQL statement). Phase 3 adds cron.
4. Cost tracking is passive, not enforced¶
TokenUsageProjection aggregates per agent × repo. LiteLLM proxy enforces hard limits, but it's not wired to the system yet (Phase 2 item 8).
Reason: Phase 2 activity; MVP has soft tracking (dashboard), hard limits come with proxy.
How we mitigate: Daily cost report (cronjob in conductor-e) emailed to team. Manual override possible but flagged.
5. Prometheus not yet source of truth for deploy gates¶
Prometheus is deployed, collecting metrics. Flagger (self-healing) is not — so we can't gate deploys on SLO burn rate yet.
Reason: Phase 5 activity (self-healing). MVP gates on code review + human approval.
How we mitigate: Manual gates for now; Flagger coming Phase 5.
6. Observability is Conductor-E + Langfuse only; agent traces are blank until Phase 2¶
OTel Collector is ready, but Claude Code traces won't flow until we set CLAUDE_CODE_ENABLE_TELEMETRY=1 and Langfuse is live.
Reason: Phase 2 item 7; MVP can ship with Conductor-E events alone (enough to trace workflow).
How we mitigate: Phase 2 full deploy brings agent-level detail. For MVP, Langfuse empty but operational.
MVP exit criteria — how we know it's done¶
An MVP is usable when all of these are true:
- One full issue-to-merge cycle, no human intervention beyond Type-2 approvals:
  - Human files `#123 Feature: add --verbose flag to CLI`
  - Dev-E claims, implements, tests, creates PR
  - Review-E reviews independently (approves or requests changes)
  - Human approves (Type-2 gate)
  - Merge → production deploy (automatic via Flux)
  - No human intervention between review approval and production live
- Safety guards are active and tested:
  - Dangerous-command guard tested: `rm -rf /` attempted → rejected with reason logged
  - Agent identity verified: commits have agent author + signing key
  - Egress lock tested: attempt to curl an attacker domain → denied by NetworkPolicy
  - Cost gate tested: token budget exceeded → escalation event fired
- Observability is measurable:
  - Langfuse dashboard shows: tokens used, tool calls, decision points, hallucination rate
  - Conductor-E event stream shows: work assigned, work started, PR created, review passed, merged, deployed
  - Cost dashboard shows: per-agent daily token spend, trending
- Failure modes are handled:
  - Dev-E loops 10× on the same tool → StuckGuard detects → escalation event fired → Discord notification sent
  - Conductor-E heartbeat missing >5 min → escalation event fired
  - Token budget exceeded → work stops, escalation event fired
- Golden suite regression gate is live and gating:
  - 10-task golden suite runs every night
  - Results dashboard in Langfuse
  - PR merge fails if the golden suite regresses >20%
- Measured track record exists:
  - Minimum 20 successful issues merged (T0 level) with zero rework
  - Minimum 10 Review-E reviews completed
  - Zero false-positive escalations (stuck detection triggered on real stuck work, not flakes)
MVP-to-full-rig roadmap¶
After MVP ships (~Week 4), the team enters Phase 3-7 work (full rig, 3-6 months):
Next up (Weeks 5-8):
- Full subscription registry (Phase 3) — who consumes whose events, topology validation at deploy
- Bounded-loop sentinel (Phase 4) — Dev-E ↔ Review-E ping-pong cap
- Escalation routing (Phase 4) — Discord severity-based notification shape
- Flagger + Prometheus SLO gates (Phase 5) — canary-gated production deploys

Stretch goal (Months 2-3):
- Spec-E intake refiner (Phase 7)
- Dev-E repair-dispatch mode (Phase 5, Stage 2) — incident response automation
- CaMeL trust separation (Phase 6) — formal prompt-injection defense

Long term (Months 4-6+):
- Architect-E (Phase 7)
- 4-tier memory scoping enforcement (Phase 3+)
- Property-based testing (Phase 2+)
- Drift detection canary (Phase 6)
MVP dependencies and build order¶
```mermaid
graph TB
    subgraph "Week 1 — Phase 0 (parallel)"
        guard["Dangerous-command guard"]
        identity["Agent identity in git"]
        egress["Egress NetworkPolicy"]
        worktrees["Git worktrees"]
    end
    subgraph "Week 2-3 — Phase 1 + 2"
        spool["Hook reliability spool"]
        stuck["StuckGuard middleware"]
        otel["OTel + Langfuse"]
        litellm["LiteLLM proxy"]
        golden["Golden suite regression"]
    end
    subgraph "Week 3-4 — Phase 3a"
        cursor["Per-consumer cursor"]
    end
    guard --> spool
    identity --> spool
    egress --> spool
    worktrees --> spool
    spool --> stuck
    spool --> otel
    stuck --> litellm
    otel --> golden
    litellm --> golden
    golden --> cursor
    cursor --> exit["MVP exit criteria met"]
    classDef phase0 fill:#e8f5e9,stroke:#2e7d32,color:#000
    classDef phase12 fill:#fff3e0,stroke:#e65100,color:#000
    classDef phase3 fill:#e3f2fd,stroke:#1565c0,color:#000
    classDef exit fill:#fce4ec,stroke:#ad1457,color:#000
    class guard,identity,egress,worktrees phase0
    class spool,stuck,otel,litellm,golden phase12
    class cursor phase3
    class exit exit
```
Key dependency: Phase 0 must complete before Phase 1 → all safety + identity locks in place before we scale observability and cost. Cannot ship observability without stuck detection (otherwise metrics lie).
Parallelism: Phase 0 items (1-4) are fully independent; assign them to different teammates if available.
Status doc corrections & reconciliations¶
Based on a code audit, the implementation-status.md doc is accurate — the review found no rows needing correction. Row-by-row results:
| Row | Current Status | Should Be | Reason |
|---|---|---|---|
| `GET /api/reviews/next` | Deployed ✓ | (no change) | Confirmed in Program.cs:804–808; README was stale, not the endpoint |
| Dev-E (dotnet variant) | Partial ✓ | (no change) | Correct; `cron.enabled: false` |
| Dev-E (python variant) | Partial ✓ | (no change) | Correct; same state as dotnet |
| Memory `save_pattern` | Partial ✓ | (no change) | Correct; tool exists, agents don't emit marker |
| Memory `mark_used` | Partial ✓ | (no change) | Correct; metric is 0% |
| Prometheus | Partial ✓ | (no change) | Correct; not yet source of truth for Flagger (Flagger not deployed) |
| OTel Collector | Partial ✓ | (no change) | Correct; Conductor-E emits, agents don't yet (Phase 2 item 7) |
| Cost dashboard | Partial ✓ | (no change) | Correct; basic aggregation, no hard enforcement |
No rows need fixing. The status doc is honest and up-to-date. Each "Partial" row has a clear gap acknowledged.
Success metrics: how we measure MVP quality¶
By end of Week 4:
| Metric | MVP Target | Why it matters |
|---|---|---|
| Issues closed (any size) | ≥20 | Volume signal; proves dispatch loop works |
| Zero rework rate (T0) | ≥95% | First-time approval without "changes requested" |
| Review-E reviews completed | ≥10 | Independent review gate is working |
| Stuck detection precision | 100% TP, 0 FP | If we escalate, it's real stuck, not flake |
| Cost per issue merged | <$0.50 | Sanity check; budget gates working |
| Mean time to merge (from issue) | <30 min | Feasible for human + agent 1-turn loop |
| Langfuse golden suite regress | 0% (stable) | New system hasn't broken baseline |
Honest scope confessions¶
These claims are not proven; they're educated guesses:
- "Dangerous-command guard is enough for Phase 0 safety." We're betting that rejects on `rm -rf /` and `drop table` catch 90% of catastrophic hallucinations. Novel attacks (e.g., time-bomb code) still slip through. We're not claiming perfection, just "good enough to deploy."
- "Per-consumer cursor is the right capacity model." It works for Conductor-E's event-driven architecture, but we've never run at scale (>5 concurrent agents). It might need rework if we hit deadlock under load.
- "Golden suite regression gate catches real degradation." We're assuming 10 representative tasks + a 20% threshold is calibrated right. The first month of data will tell us if it's too strict (too many false positives) or too loose (misses real regressions).
- "LiteLLM proxy is operationally simple." It's a straightforward proxy-in-container, but multi-provider fallback logic is complex. The first incident under cross-provider fallback might reveal surprises.
- "Langfuse self-hosted on 8GB won't thrash." Prometheus is only ~500MB, but once we add agent traces (50+/day), the ingest rate might surprise us. We may need to drop to weekly eval instead of nightly if disk I/O becomes a problem.
Risk register for MVP¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Dangerous-command guard misses 0-day exploit | Low | Critical | Weekly external review of guard rules; honeypot tests (intentional attack attempts) |
| LiteLLM proxy silently drops requests on 429 | Medium | High | Metrics-driven alert (429 count > threshold); manual fallback drill every 2 weeks |
| Stuck detection false-positive > 10% | Medium | Medium | Tune thresholds after Week 1; adjust before Phase 3 |
| Golden suite too brittle (constant regression) | Low | Medium | Simplify task set or threshold if merge-rate drops below 50% |
| Langfuse disk fills in 2 weeks | Low | High | Implement TTL pruning before nightly eval starts; monitor disk pressure daily |
What success looks like in one sentence¶
By the end of Week 4, a human can file a GitHub issue, Dev-E implements it (with tests and docs), Review-E approves, code merges and deploys, and the whole team can inspect the trace and cost with confidence that nothing destructive happened autonomously.
That is the minimum viable rig.
This document is living and will be updated weekly as MVP items ship. Track changes in the rig-gitops CHANGELOG.