
Engineering Rig — Proposed Improvements

Five improvements to close the gaps identified in architecture-current.md. Inspired by patterns from Gastown, a production multi-agent orchestration system.

1. Session Recovery (Prime)

Problem

When an agent restarts, crashes, or Claude Code compacts its context, all working state is lost. The agent starts fresh with no idea what it was working on. Humans hit the same problem when starting a new Claude Code session.

Current flow (broken)

sequenceDiagram
    participant A as Agent
    participant C as Conductor-E

    A->>A: Restart / context compaction
    A->>A: Lost: branch, issue, PR, review comments
    A->>C: GET /api/assignments/next
    Note over A: Picks up NEW work<br/>instead of resuming

Proposed flow

sequenceDiagram
    participant A as Agent / Human
    participant S as Prime Script
    participant C as Conductor-E
    participant G as GitHub

    A->>A: Session starts (or compaction)
    A->>S: SessionStart hook fires
    S->>S: Read current git branch
    S->>C: GET /api/agents/{agentId}
    S->>G: Check open PRs for this branch
    S->>S: Build context summary
    S-->>A: Inject: "You are working on repo#42,<br/>branch feature/issue-42-login,<br/>PR #15 has 2 review comments"
    A->>A: Resume work where it left off

Implementation

Add hooks/conductor-e-prime.sh to rig-tools (and bake it into the devcontainer image):

# Reads: git branch, Conductor-E agent status, GitHub PRs
# Outputs: context summary injected via SessionStart hook
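A minimal sketch of what the script could do, assuming AGENT_ID and CONDUCTOR_E_URL env vars and the gh CLI; the /api/agents endpoint comes from the diagram above, but the JSON field name (`currentIssue`) is a guess:

```shell
#!/usr/bin/env bash
# Sketch of hooks/conductor-e-prime.sh -- not the final implementation.
set -uo pipefail

# Summary line that the SessionStart hook injects as context.
build_summary() {
  local branch=$1 issue=$2 pr=$3
  echo "You are working on issue ${issue}, branch ${branch}, PR ${pr}"
}

branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown")

# Current assignment from Conductor-E (field name is an assumption)
issue=$(curl -fsS "${CONDUCTOR_E_URL:-http://localhost:8080}/api/agents/${AGENT_ID:-unknown}" \
          2>/dev/null | jq -r '.currentIssue // "none"' 2>/dev/null || echo "none")

# Open PR for this branch, if any (requires the gh CLI)
pr=$(gh pr list --head "$branch" --json number --jq '.[0].number // "none"' \
       2>/dev/null || echo "none")

build_summary "$branch" "$issue" "$pr"
```

Whatever the script prints to stdout is what the SessionStart hook hands back to the session as context.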

Add to Claude Code settings.json:

{
  "hooks": {
    "SessionStart": [{"type": "command", "command": "conductor-e-prime"}]
  }
}

Effort: Small (1 shell script + hook config)


2. Pre-Tool Guards

Problem

Agents can run destructive commands without guardrails. A confused agent could run git push --force, rm -rf /workspace, or kubectl delete namespace production.

Current flow (unprotected)

graph LR
    agent[Agent] -->|git push --force| git[Git]
    agent -->|rm -rf /| fs[Filesystem]
    agent -->|kubectl delete ns| k8s[Cluster]
    style git fill:#ff6666,color:#000
    style fs fill:#ff6666,color:#000
    style k8s fill:#ff6666,color:#000

Proposed flow

graph LR
    agent[Agent] -->|command| guard[PreToolUse Guard]
    guard -->|safe| exec[Execute]
    guard -->|dangerous| block[Block + Log]
    block -->|event| conductor[Conductor-E]
    style block fill:#ff9999,color:#000
    style exec fill:#99ff99,color:#000

What gets blocked

| Pattern | Why |
| --- | --- |
| git push --force | Destroys remote history |
| git reset --hard | Loses uncommitted work |
| rm -rf / or rm -rf ~ | Filesystem destruction |
| kubectl delete namespace | Cluster destruction |
| DROP TABLE, DROP DATABASE | Data loss |
| chmod 777 | Security risk |

Implementation

Add hooks/pretool-guard.sh to rig-tools:

# Reads tool_input from Claude Code PreToolUse hook
# Checks against blocklist
# Exit 2 to block, exit 0 to allow
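A sketch of the guard logic. In the real hook the command arrives as PreToolUse JSON on stdin (roughly `jq -r .tool_input.command`); this sketch takes it as an argument to stay self-contained, and the patterns simply mirror the blocklist table above:

```shell
#!/usr/bin/env bash
# Sketch of hooks/pretool-guard.sh -- patterns are illustrative.
set -uo pipefail

# Exit 2 blocks the tool call; exit 0 allows it.
check_command() {
  local cmd=$1
  local patterns=(
    'git push .*--force'
    'git reset --hard'
    'rm -rf +(/|~)'
    'kubectl delete (ns|namespace)'
    'DROP (TABLE|DATABASE)'
    'chmod 777'
  )
  local p
  for p in "${patterns[@]}"; do
    if grep -Eq "$p" <<<"$cmd"; then
      echo "Blocked dangerous command: $cmd" >&2
      return 2
    fi
  done
  return 0
}

check_command "${1:-}"
```

Exit code 2 is what Claude Code treats as a block from a PreToolUse hook; anything the script writes to stderr is surfaced to the agent as the reason.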

Add to Claude Code settings.json:

{
  "hooks": {
    "PreToolUse": [{"type": "command", "command": "pretool-guard"}]
  }
}

Effort: Small (1 shell script + hook config)


3. Agent Identity Attribution

Problem

Git commits from agents use generic names. When reviewing history, you can't tell which agent made a change or trace quality issues to a specific agent instance.

Current state

abc1234 feat: add login (Dev-E <noreply@dashecorp.com>)
def5678 fix: auth bug (Dev-E <noreply@dashecorp.com>)

Which Dev-E? Node? Dotnet? Was it a human or agent?

Proposed state

abc1234 feat: add login (dev-e-node <agent@dashecorp.com>)
def5678 fix: auth bug (human-stig <stig@dashecorp.com>)
ghi9012 refactor: cleanup (dev-e-dotnet <agent@dashecorp.com>)

Implementation

graph TB
    subgraph "Agent (k8s)"
        env[AGENT_ID=dev-e-node]
        git_config[git config user.name = dev-e-node]
    end

    subgraph "Human (local)"
        hooks_env[CONDUCTOR_AGENT_ID=human-stig]
        git_user[git config user.name = human-stig]
    end

    subgraph "Conductor-E"
        history[Work history per agent identity]
        cost[Cost tracking per agent identity]
    end

Set in HelmRelease values:

extraEnv:
  - name: GIT_AUTHOR_NAME
    value: "dev-e-node"
  - name: GIT_AUTHOR_EMAIL
    value: "agent@dashecorp.com"

For humans, rig-tools install sets CONDUCTOR_AGENT_ID=human-$(whoami).
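A sketch of the identity setup rig-tools install could run. The human-$(whoami) convention and the agent@dashecorp.com address come from this doc; the helper name is illustrative:

```shell
#!/usr/bin/env bash
# Sketch: derive and apply the commit identity -- not the final script.
set -uo pipefail

# Agents get AGENT_ID from the HelmRelease; humans fall back to whoami.
resolve_identity() {
  if [ -n "${AGENT_ID:-}" ]; then
    echo "${AGENT_ID} agent@dashecorp.com"
  else
    echo "human-$(whoami) $(whoami)@dashecorp.com"
  fi
}

read -r name email <<<"$(resolve_identity)"
git config --global user.name "$name"
git config --global user.email "$email"
export CONDUCTOR_AGENT_ID="$name"
```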

Effort: Small (env vars in HelmRelease + rig-tools)


4. Centralized Hooks Config

Problem

Each developer and agent workspace configures Claude Code hooks independently. No consistency. New team members miss critical hooks. Updates require manual changes everywhere.

Current state

graph TB
    ws1[Workspace 1<br/>settings.json] -->|manual| hooks1[heartbeat hook]
    ws2[Workspace 2<br/>settings.json] -->|manual| hooks2[heartbeat + guard]
    ws3[Workspace 3<br/>settings.json] -->|missing| hooks3[no hooks]
    style hooks3 fill:#ff9999,color:#000

Proposed state

graph TB
    base[rig-tools/hooks-base.json<br/>Base config for everyone]
    dev_override[hooks-overrides/dev.json<br/>Dev-specific overrides]
    review_override[hooks-overrides/reviewer.json<br/>Reviewer overrides]

    base --> merge1[Merge]
    dev_override --> merge1
    merge1 --> ws1[Dev workspace<br/>settings.json]

    base --> merge2[Merge]
    review_override --> merge2
    merge2 --> ws2[Reviewer workspace<br/>settings.json]

    base --> ws3[Default workspace<br/>settings.json]

Base hooks (all roles)

| Hook | Event | Purpose |
| --- | --- | --- |
| conductor-e-prime | SessionStart | Resume context after restart |
| conductor-e-hook | PostToolUse | Heartbeat + event detection |
| conductor-e-hook | Stop | Mark idle |
| pretool-guard | PreToolUse | Block dangerous commands |
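As a sketch, hooks-base.json could mirror the base hooks directly, using the same shape as the settings.json snippets earlier in this doc (verify against the actual Claude Code settings schema):

```json
{
  "hooks": {
    "SessionStart": [{"type": "command", "command": "conductor-e-prime"}],
    "PostToolUse":  [{"type": "command", "command": "conductor-e-hook"}],
    "Stop":         [{"type": "command", "command": "conductor-e-hook"}],
    "PreToolUse":   [{"type": "command", "command": "pretool-guard"}]
  }
}
```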

Implementation

Add to rig-tools:

hooks-base.json                    # Shared base config
hooks-overrides/
  dev.json                         # Dev-E specific
  reviewer.json                    # Review-E specific
scripts/hooks-sync.sh             # Generate settings.json from merged config

./install.sh runs hooks-sync.sh automatically.
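The merge step can be sketched with jq (assumed available). File names follow the layout above; the merge semantics shown (override keys win, arrays replaced wholesale rather than concatenated) are a design choice, not a spec:

```shell
#!/usr/bin/env bash
# Sketch of scripts/hooks-sync.sh -- not the final implementation.
set -uo pipefail

# Merge the base config with a role override into settings.json.
# jq's `*` operator merges objects recursively; on conflicts the
# override wins, and arrays are replaced, not concatenated.
hooks_sync() {
  local base=$1 override=$2 out=$3
  if [ -f "$override" ]; then
    jq -s '.[0] * .[1]' "$base" "$override" > "$out"
  else
    cp "$base" "$out"
  fi
}

# As ./install.sh might call it (ROLE is illustrative):
#   hooks_sync hooks-base.json "hooks-overrides/${ROLE}.json" ~/.claude/settings.json
```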

Effort: Medium (config files + merge script + install update)


5. Escalation with Severity Routing

Problem

When agents get stuck, they post a message to Discord. No severity levels, no routing, no tracking, no re-escalation. Critical issues get the same treatment as minor blockers.

Current flow

graph LR
    agent[Stuck Agent] -->|"🛑 Stuck on repo#42"| discord[Discord #tasks]
    discord -->|human notices... eventually| human[Human]
    style discord fill:#ffcc00,color:#000

Proposed flow

graph TB
    agent[Agent] -->|ESCALATE P2| conductor[Conductor-E]

    conductor -->|P2: Medium| thread[Discord Thread<br/>on the PR]
    conductor -->|P1: High| channel[Discord Channel<br/>#admin]
    conductor -->|P0: Critical| dm[Discord DM<br/>+ @mention]

    conductor -->|4h unacked?| bump[Bump Severity<br/>P2→P1→P0]
    bump -->|re-route| conductor

Severity levels

| Level | When | Notification | Auto-escalate |
| --- | --- | --- | --- |
| P2 | Minor blocker, needs guidance | Discord thread | → P1 after 4h |
| P1 | CI stuck, review conflict | Discord #admin | → P0 after 4h |
| P0 | Security issue, data risk | Discord DM + @mention | Stays P0 |

Implementation

New Conductor-E events:

ESCALATION_CREATED  { severity, reason, agentId, repo, issueNumber }
ESCALATION_ACKED    { escalationId }
ESCALATION_CLOSED   { escalationId, resolution }

New rig-tools command:

conductor-e-hook ESCALATE --severity P1 "CI fails on auth tests, tried 3 times"

A Conductor-E cron job checks unacked escalations every hour and bumps their severity once the threshold passes.
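The hourly bump check reduces to a small state transition. A sketch, with the P-level ordering and 4h threshold taken from the severity table above (helper names are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the severity-bump rule -- not the final implementation.
set -uo pipefail

# One step up the ladder: P2 -> P1 -> P0; P0 stays P0.
bump_severity() {
  case $1 in
    P2) echo P1 ;;
    P1) echo P0 ;;
    P0) echo P0 ;;
  esac
}

# Bump only when an escalation has sat unacked past the threshold (hours).
maybe_bump() {
  local severity=$1 unacked_hours=$2 threshold=${3:-4}
  if [ "$unacked_hours" -ge "$threshold" ]; then
    bump_severity "$severity"
  else
    echo "$severity"
  fi
}
```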

Effort: Medium (Conductor-E API changes + rig-tools CLI + Discord routing)


Implementation Roadmap

gantt
    title Rig Improvements
    dateFormat YYYY-MM-DD

    section Phase 1 (Quick Wins)
    Session Recovery (Prime)        :p1, 2026-04-17, 2d
    Pre-Tool Guards                 :p2, 2026-04-17, 1d
    Agent Identity Attribution      :p3, 2026-04-17, 1d

    section Phase 2 (Consistency)
    Centralized Hooks Config        :p4, after p1, 3d

    section Phase 3 (Reliability)
    Escalation System               :p5, after p4, 5d

What This Does NOT Change

  • Conductor-E stays as the central coordinator (not replaced by a CLI)
  • GitHub Issues stays as the issue tracker (not replaced by Beads)
  • FluxCD stays for GitOps (no change)
  • PostgreSQL + Marten stays for event sourcing (not replaced by Dolt)
  • Discord stays for communication (enhanced, not replaced)
  • KEDA scale-to-zero stays (no change)

These improvements layer on top of the existing architecture. No rewrites.