Engineering Rig — Current Architecture

Overview

The engineering rig is an AI-assisted development platform where AI agents and humans collaborate on code. Agents run on a GCP k3s cluster. Humans run Claude Code locally. Both report to a central coordinator (Conductor-E) and follow the same workflow rules.

System Diagram

```mermaid
graph TB
    subgraph "GitHub"
        issues[GitHub Issues]
        prs[Pull Requests]
        webhooks[Webhooks]
    end

    subgraph "GCP k3s Cluster"
        subgraph "Conductor-E"
            api[Conductor-E API<br/>.NET 10]
            postgres[(PostgreSQL<br/>Event Store)]
            valkey[(Valkey<br/>Streams + Signals)]
            cost[Cost Dashboard]
        end

        subgraph "Dev-E Agents"
            dev_node[Dev-E Node<br/>StatefulSet]
            dev_dotnet[Dev-E Dotnet<br/>StatefulSet]
            dev_python[Dev-E Python<br/>StatefulSet]
        end

        subgraph "Review-E"
            review[Review-E<br/>StatefulSet]
        end

        subgraph "Infrastructure"
            keda[KEDA<br/>Autoscaler]
            flux[FluxCD<br/>GitOps]
            tunnel[Cloudflare<br/>Tunnel]
            weave[Weave GitOps<br/>Dashboard]
        end
    end

    subgraph "Human Workstations"
        claude[Claude Code<br/>Workspaces]
        hooks[rig-tools<br/>Hooks]
    end

    subgraph "Monitoring"
        discord[Discord<br/>Channels]
        flux_dash[flux.dashecorp.com]
        conductor_dash[conductor-e.dashecorp.com]
    end

    issues -->|label: agent-ready| webhooks
    webhooks -->|POST /api/webhook/github| api
    api --> postgres
    api --> valkey
    valkey -->|signal| keda
    keda -->|scale 0→1| dev_node
    keda -->|scale 0→1| dev_dotnet
    keda -->|scale 0→1| dev_python
    keda -->|scale 0→1| review
    dev_node -->|clone, branch, implement| prs
    dev_dotnet -->|clone, branch, implement| prs
    review -->|review PR| prs
    prs -->|webhook| api
    api -->|alerts| discord
    claude -->|hooks| hooks
    hooks -->|events| api
    flux -->|reconcile| dev_node
    flux -->|reconcile| review
    tunnel --> api
    tunnel --> weave
```

Components

Conductor-E (the brain)

Event-sourced coordinator. Receives GitHub webhooks, assigns work to agents, tracks progress.

| Endpoint | Method | Purpose |
|---|---|---|
| `/api/webhook/github` | POST | Receives issues, PRs, reviews, check_runs |
| `/api/events` | POST | Agents report: WORK_STARTED, PR_CREATED, HEARTBEAT, AGENT_STUCK |
| `/api/assignments/next` | GET | Agent claims next assignment |
| `/api/issues` | GET | All tracked issues with state |
| `/api/agents` | GET | Agent status (working/idle/stuck) |

Tech: .NET 10, Marten event sourcing, PostgreSQL, Valkey for streams/signals.
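For illustration, an agent-side call to `POST /api/events` might look like the sketch below. Only the endpoint path and the event names come from the table above; the payload fields (`agentId`, `repo`, `issueNumber`, `timestamp`) are assumptions, not the documented schema.

```typescript
// Sketch: reporting a lifecycle event to Conductor-E.
// Payload shape is assumed — only /api/events and the event names are documented.
type RigEvent = {
  agentId: string;
  type: "WORK_STARTED" | "PR_CREATED" | "HEARTBEAT" | "AGENT_STUCK";
  repo?: string;
  issueNumber?: number;
  timestamp: string;
};

function buildEvent(
  agentId: string,
  type: RigEvent["type"],
  extra: Partial<RigEvent> = {},
): RigEvent {
  // Stamp the event at creation time so Conductor-E can order it.
  return { agentId, type, timestamp: new Date().toISOString(), ...extra };
}

async function reportEvent(baseUrl: string, event: RigEvent): Promise<void> {
  const res = await fetch(`${baseUrl}/api/events`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
  });
  if (!res.ok) throw new Error(`Conductor-E rejected event: ${res.status}`);
}
```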

State machine for each issue:

```mermaid
stateDiagram-v2
    [*] --> queued: issue labeled agent-ready
    queued --> assigned: agent claims
    assigned --> in_progress: WORK_STARTED
    in_progress --> in_review: PR_CREATED
    in_review --> changes_requested: review rejects
    changes_requested --> in_review: agent pushes fix
    in_review --> ready_to_merge: review approves
    ready_to_merge --> done: PR merged
    in_progress --> stuck: AGENT_STUCK
    stuck --> in_progress: unstuck / reassigned
```
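The same state machine can be written as a transition table. State and event names mirror the diagram; the table form itself is an illustrative sketch, not Conductor-E's implementation.

```typescript
// Transition table for the issue state machine above.
type IssueState =
  | "queued" | "assigned" | "in_progress" | "in_review"
  | "changes_requested" | "ready_to_merge" | "stuck" | "done";

const transitions: Record<string, IssueState> = {
  "queued:claim": "assigned",
  "assigned:WORK_STARTED": "in_progress",
  "in_progress:PR_CREATED": "in_review",
  "in_review:changes_requested": "changes_requested",
  "changes_requested:push_fix": "in_review",
  "in_review:approved": "ready_to_merge",
  "ready_to_merge:merged": "done",
  "in_progress:AGENT_STUCK": "stuck",
  "stuck:unstuck": "in_progress",
};

function next(state: IssueState, event: string): IssueState {
  const target = transitions[`${state}:${event}`];
  // Reject anything the diagram doesn't allow (e.g. merging from "queued").
  if (!target) throw new Error(`Illegal transition: ${state} + ${event}`);
  return target;
}
```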

Rig Agent Runtime (the hands)

Shared Node.js runtime that all agents use. Loaded with a character config (personality, tools, LLM provider, MCP servers). One image, many agents.

```mermaid
graph LR
    char[character.json] --> runtime[Rig Agent Runtime]
    runtime --> discord[Discord Gateway]
    runtime --> llm[LLM Provider<br/>Claude CLI / Codex / API]
    runtime --> mcp[MCP Servers<br/>GitHub, Advisor, Memory]
    runtime --> heartbeat[Heartbeat<br/>→ Conductor-E]
    runtime --> dashboard[Dashboard<br/>:3000]
```

Character config defines everything about an agent:

```yaml
character:
  name: "Dev-E (Node)"
  personality: "You are Dev-E, a development agent..."
  llm:
    provider: claude-cli
    model: claude-sonnet-4-6
  mcpServers:
    github:
      command: npx
      args: ["-y", "@modelcontextprotocol/server-github"]
  cron:
    schedule: "*/5 * * * *"
    prompt: "Check for work..."
```

Multi-stack images — same runtime, different language tooling:

| Tag | Tools | Used by |
|---|---|---|
| `base` | Node.js 22, Claude CLI, Codex CLI, gh | Base for all |
| `node` | + TypeScript, Jest, ESLint | Dev-E Node |
| `dotnet` | + .NET 10 SDK | Dev-E Dotnet |
| `python` | + Python 3, pytest, black | Dev-E Python |

Dev-E (the developers)

Three stack variants, all running the same runtime with different character configs. Each polls Conductor-E for assignments every 5 minutes.

Work flow:

```mermaid
sequenceDiagram
    participant C as Conductor-E
    participant D as Dev-E
    participant G as GitHub
    participant R as Review-E

    D->>C: GET /api/assignments/next?agentId=dev-e
    C-->>D: Assignment: repo#42 "Add login"
    D->>C: POST WORK_STARTED
    D->>G: Clone repo, create branch
    D->>D: Implement with Claude Code CLI
    D->>G: Push branch, create PR
    D->>C: POST PR_CREATED
    G->>C: Webhook: pull_request opened
    C->>R: Routes to Review-E
    R->>G: Review PR
    alt Approved
        R->>G: Approve
        G->>G: Auto-merge
        G->>C: Webhook: PR merged
        C->>C: Issue → done
    else Changes Requested
        R->>G: Request changes
        G->>C: Webhook: review submitted
        C->>D: Routes back to Dev-E
        D->>D: Fix, push
    end
```

Review-E (quality gate)

Reviews every PR from Dev-E. Structurally separate — the agent that writes code cannot approve it.

Review checklist:

1. Correctness — does it match the issue?
2. Security — OWASP top 10
3. Tests — adequate coverage
4. Docs — updated if behavior changed, valid YAML frontmatter
5. Commits — conventional format

Human gate: Sensitive files (auth, payment, migration, GDPR, schema) trigger escalation to human. Review-E will NOT approve these.
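A check along these lines could implement the human gate. The pattern list is inferred from the categories named above (auth, payment, migration, GDPR, schema); the real rules Review-E applies may be stricter or path-specific.

```typescript
// Sketch: flag PRs touching sensitive files for human review.
// Patterns are assumptions inferred from the documented categories.
const SENSITIVE_PATTERNS: RegExp[] = [
  /auth/i,
  /payment/i,
  /migration/i,
  /gdpr/i,
  /schema/i,
];

function requiresHumanReview(changedFiles: string[]): boolean {
  // Any single sensitive file escalates the whole PR.
  return changedFiles.some((file) =>
    SENSITIVE_PATTERNS.some((pattern) => pattern.test(file)),
  );
}
```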

KEDA (scale-to-zero)

Agents idle most of the time. KEDA watches Valkey for signals and scales agents 0→1 when work arrives. Cooldown: 20 minutes.

No work → 0 pods (zero cost)
Issue labeled → Conductor-E writes signal to Valkey
KEDA detects signal → scales Dev-E to 1 pod
Work completes → 20 min cooldown → back to 0
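A ScaledObject along these lines could express this flow. The trigger metadata, stream name, and resource names here are assumptions for illustration — the rig's actual manifests live in dashecorp/rig-gitops.

```yaml
# Sketch only — names and trigger metadata are assumed, not the real manifests.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-e-node
spec:
  scaleTargetRef:
    kind: StatefulSet
    name: dev-e-node
  minReplicaCount: 0        # scale to zero when idle
  maxReplicaCount: 1
  cooldownPeriod: 1200      # the 20-minute cooldown described above
  triggers:
    - type: redis-streams   # Valkey speaks the Redis protocol
      metadata:
        address: valkey:6379
        stream: agent-signals
        consumerGroup: dev-e-node
        pendingEntriesCount: "1"
```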

Human Developers

Humans use Claude Code locally with the same workflow rules as agents.

rig-tools hooks connect humans to Conductor-E:

```mermaid
graph LR
    human[Human + Claude Code] -->|PostToolUse| hook[conductor-e-hook]
    hook -->|HEARTBEAT| conductor[Conductor-E API]
    human -->|git checkout -b| hook
    hook -->|WORK_STARTED| conductor
    human -->|gh pr create| hook
    hook -->|PR_CREATED| conductor
```

Hooks fire automatically via Claude Code settings.json. For other AI tools (Codex, Copilot, Cursor), call the CLI directly.
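A settings.json wiring might look like the fragment below. The hook structure follows Claude Code's hooks schema; the `conductor-e-hook heartbeat` invocation is an assumed subcommand of the rig-tools hook shown in the diagram above.

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "conductor-e-hook heartbeat" }
        ]
      }
    ]
  }
}
```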

Devcontainers

Humans can work inside the same container image as agents:

Agent on k8s:  rig-agent-runtime:dotnet → Conductor-E
Human locally: rig-agent-runtime:dotnet (devcontainer) → Conductor-E

Each repo has .devcontainer/devcontainer.json pointing to the right stack image.
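A minimal devcontainer.json for a .NET repo might look like this. The image path combines the Artifact Registry prefix listed under Infrastructure with the runtime image name, and the `CONDUCTOR_E_URL` variable name is assumed for illustration.

```json
{
  "name": "rig-dotnet",
  "image": "europe-north1-docker.pkg.dev/invotek-github-infra/dashecorp/rig-agent-runtime:dotnet",
  "remoteEnv": {
    "CONDUCTOR_E_URL": "https://conductor-e.dashecorp.com"
  }
}
```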

Infrastructure

GCP k3s Cluster

| Resource | Detail |
|---|---|
| VM | `invotek-k3s` — e2-standard-2 (2 vCPU, 8 GB) |
| Zone | europe-north1-b |
| K8s | k3s v1.34, single node |
| GitOps | FluxCD — watches dashecorp/rig-gitops |
| Images | GCP Artifact Registry: `europe-north1-docker.pkg.dev/invotek-github-infra/dashecorp/` |
| Tunnel | Cloudflare tunnel `dashecorp-gcp`: conductor-e.dashecorp.com, flux.dashecorp.com |
| Access | `gcloud compute ssh invotek-k3s --zone europe-north1-b --project invotek-github-infra` |

Monitoring

| System | What | Where |
|---|---|---|
| Flux Discord alerts | Reconciliation failures | Discord channel |
| Weave GitOps | Visual Flux dashboard | https://flux.dashecorp.com |
| Conductor-E cost dashboard | Per-agent token usage | https://conductor-e.dashecorp.com |
| Conductor-E API | Agent status, issue state | https://conductor-e.dashecorp.com/api/agents |

GitOps Flow

```mermaid
graph LR
    dev[Developer] -->|PR| gitops[dashecorp/rig-gitops]
    gitops -->|FluxCD watches| flux[Flux on k3s]
    flux -->|reconcile| cluster[k8s Resources]
    flux -->|error| discord[Discord Alert]
```

All deployments go through git. No manual `kubectl apply`.

Repositories

| Repo | Purpose |
|---|---|
| dashecorp/conductor-e | Event store, assignment engine, API |
| dashecorp/rig-agent-runtime | Shared agent runtime + Helm chart |
| dashecorp/rig-gitops | FluxCD manifests, AGENTS.md, docs, templates |
| dashecorp/dev-e | Dev agent .NET worker (future replacement) |
| dashecorp/review-e | Review agent .NET worker (future replacement) |
| dashecorp/rig-tools | Developer hooks, workflow sync |
| dashecorp/infra | OpenTofu — GCP VM, Cloudflare, GitHub repos |

Memory

All agents use a shared rig-memory-mcp server backed by the Marten Postgres (with pgvector). Memory is cross-agent — Dev-E, Dev-E Dotnet, Dev-E Python, and Review-E all read/write to the same store.

| Component | Detail |
|---|---|
| MCP server | `@dashecorp/rig-memory-mcp` (pre-installed in rig-agent-runtime image) |
| Backend | PostgreSQL + pgvector extension — same instance as Marten event store |
| Connection | `DB_URL` env var from `{agent}-secrets.database-url` |
| Extension init | `apps/conductor-e/postgres-pgvector-job.yaml` (one-time Job) |
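By analogy with the GitHub MCP entry in the character config above, wiring the memory server into a character might look like the fragment below. The `command`/`args` shape is assumed from that GitHub example; only the package name and the `DB_URL` variable come from the table.

```yaml
mcpServers:
  memory:
    command: npx
    args: ["-y", "@dashecorp/rig-memory-mcp"]
    env:
      DB_URL: "${DB_URL}"  # injected from {agent}-secrets.database-url
```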

Gaps and Limitations

  1. No pre-tool guards — agents can run destructive commands unchecked
  2. No agent identity attribution — git commits use generic names
  3. No escalation routing — stuck agents just post to Discord
  4. No centralized hooks config — each workspace configured independently
  5. Single node — no HA, single point of failure
  6. No inter-agent messaging — all communication routes through Conductor-E or Discord