Architecture

C4 Diagrams

See the C4 Diagrams page for all rendered diagrams.

[Figure: C4 Containers diagram]

Clean Architecture

The API follows Clean Architecture — dependencies point inward only.

┌─────────────────────────────────────────────────────┐
│ ConductorE.Api (Frameworks & Drivers)               │
│                                                     │
│  Program.cs          DI wiring, endpoint routing    │
│  Adapters/                                          │
│    MartenEventStore  Implements IEventStore         │
│    MartenIssueQuery  Implements IIssueQuery         │
│    MartenAgentQuery  Implements IAgentQuery         │
│    MartenProjections Marten-specific projections    │
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │ ConductorE.Core (Domain + Use Cases)        │    │
│  │                                             │    │
│  │  Domain/                                    │    │
│  │    Events.cs      Pure event records        │    │
│  │    ReadModels.cs  Pure read model records   │    │
│  │                                             │    │
│  │  Ports/                                     │    │
│  │    IEventStore    Interface                 │    │
│  │    IIssueQuery    Interface                 │    │
│  │    IAgentQuery    Interface                 │    │
│  │                                             │    │
│  │  UseCases/                                  │    │
│  │    SubmitEvent    Maps request → event      │    │
│  │                                             │    │
│  │  ⚠ ZERO external dependencies              │    │
│  └─────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘

Core has zero NuGet packages. No Marten, no ASP.NET, no framework code. Domain events, read models, ports, and use cases are pure C#.

Marten is referenced only in the Api project (adapters). If we ever swap Marten on PostgreSQL for another event store, only the adapters change; Core stays untouched.
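The port/adapter split described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than the project's C#, and every name in it (`EventStore`, `InMemoryEventStore`, `IssueQueued`, `submit_event`) is a hypothetical stand-in, not the actual ConductorE types. The point is only that the use case depends on the port and never on the storage library.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class IssueQueued:
    """Pure domain event: plain data, no framework types (hypothetical name)."""
    repo: str
    issue_number: int


class EventStore(Protocol):
    """Port: the only persistence contract the use case knows about."""

    def append(self, stream_id: str, event: object) -> None: ...


@dataclass
class InMemoryEventStore:
    """Adapter stand-in; a Marten-backed adapter would satisfy the same port."""
    streams: dict[str, list[object]] = field(default_factory=dict)

    def append(self, stream_id: str, event: object) -> None:
        self.streams.setdefault(stream_id, []).append(event)


def submit_event(store: EventStore, repo: str, issue_number: int) -> str:
    """Use case: map a request to a domain event and append it via the port."""
    stream_id = f"{repo}#{issue_number}"
    store.append(stream_id, IssueQueued(repo, issue_number))
    return stream_id
```

Swapping the adapter means passing a different object to `submit_event`; the use case itself does not change.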

Two-Component Design

rig-conductor runs as two components in the same Kubernetes namespace:

┌────────────────── rig-conductor namespace ───────────────────┐
│                                                              │
│  ┌──────────────────┐        ┌────────────────────┐          │
│  │ rig-conductor    │  HTTP  │ rig-conductor API  │          │
│  │ (Rig Agent)      │───────▶│ (.NET 10)          │          │
│  │                  │        │                    │          │
│  │ Discord bot      │        │ Domain:            │          │
│  │ GitHub MCP tools │        │   Events           │          │
│  │ Claude Haiku     │        │   ReadModels       │          │
│  │ 1-year sub token │        │ Ports:             │          │
│  └──────────────────┘        │   IEventStore      │          │
│                              │   IIssueQuery      │          │
│                              │ Adapters:          │          │
│                              │   Marten → PG      │          │
│                              └─────────┬──────────┘          │
│                                        │                     │
│                              ┌─────────▼──────────┐          │
│                              │ PostgreSQL 16      │          │
│                              │ Marten schema      │          │
│                              └────────────────────┘          │
└──────────────────────────────────────────────────────────────┘

Event Flow

sequenceDiagram
    participant H as Human/Agent
    participant A as Rig Agent Runtime (Discord)
    participant UC as SubmitEvent (Use Case)
    participant P as IEventStore (Port)
    participant M as MartenEventStore (Adapter)
    participant PG as PostgreSQL

    H->>A: Message in #conductor-e
    A->>A: Claude processes message
    A->>UC: SubmitEventRequest
    UC->>UC: MapToEvent (domain logic)
    UC->>P: AppendAsync(streamId, event)
    P->>M: Marten session.Events.Append
    M->>PG: INSERT + inline projection update
    PG-->>M: Stored
    M-->>UC: Done
    UC-->>A: SubmitEventResponse
    A-->>H: "Issue #547 queued"

Stream Identity

Stream IDs are strings, not Guids:

  • Issue streams: dashecorp/rig-conductor#42
  • Agent streams: dev-e-1
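As an illustration of the issue stream format shown above, here is a small Python sketch that builds and parses such IDs. The helper names are hypothetical, and the parsing rule (split on the last `#`, then on the first `/`) is an assumption inferred from the single example, not a documented contract.

```python
def issue_stream_id(owner: str, repo: str, number: int) -> str:
    """Build an issue stream ID such as 'dashecorp/rig-conductor#42'."""
    return f"{owner}/{repo}#{number}"


def parse_issue_stream_id(stream_id: str) -> tuple[str, str, int]:
    """Split an issue stream ID back into (owner, repo, number).

    Raises ValueError for anything that is not owner/repo#N, which
    includes agent stream IDs such as 'dev-e-1'.
    """
    repo_path, _, number = stream_id.rpartition("#")
    owner, _, repo = repo_path.partition("/")
    if not owner or not repo or not number.isdigit():
        raise ValueError(f"not an issue stream id: {stream_id!r}")
    return owner, repo, int(number)
```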

Projections

Marten inline projections (update synchronously with event append):

| Projection | Source Events | Key Fields |
|---|---|---|
| IssueStatus | All lifecycle events | state, agentId, prNumber, priority, attempt, queuedAt, orphanStatus |
| AgentStatus | Heartbeat, IssueAssigned, WorkStarted, IssueUnassigned, AgentStuck, TokenUsage, AgentQuotaReported, MemoryQueried | status, currentIssue, IssuesFailed, lastHeartbeatAt, providers, capabilities, quota, memory metrics |
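To make "inline" concrete: the read model is updated in the same step as the event append, so a query issued right after an append already sees the new state. The Python sketch below mimics that behaviour with an in-memory store. It is not Marten's API; the class and field names are simplified, hypothetical versions of the events and fields listed above.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    agent_id: str
    at: datetime


@dataclass(frozen=True)
class IssueAssigned:
    agent_id: str
    issue: str


@dataclass(frozen=True)
class AgentStatus:
    """Read model: a fold over the agent's events."""
    agent_id: str
    current_issue: Optional[str] = None
    last_heartbeat_at: Optional[datetime] = None


def apply(state: AgentStatus, event: object) -> AgentStatus:
    """Pure projection step: previous state + one event -> next state."""
    if isinstance(event, Heartbeat):
        return replace(state, last_heartbeat_at=event.at)
    if isinstance(event, IssueAssigned):
        return replace(state, current_issue=event.issue)
    return state  # events this projection does not care about


class InlineProjectingStore:
    """Appends the event and updates the read model in the same call."""

    def __init__(self) -> None:
        self.events: dict[str, list[object]] = {}
        self.agents: dict[str, AgentStatus] = {}

    def append(self, stream_id: str, event: object) -> None:
        self.events.setdefault(stream_id, []).append(event)
        current = self.agents.get(stream_id, AgentStatus(stream_id))
        self.agents[stream_id] = apply(current, event)
```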

Dashboard

Dashboard.html is a single-page control plane UI served at /. It auto-refreshes every 30 seconds.

| Panel | Data source | Description |
|---|---|---|
| Agents | GET /api/agents | Online/offline status, current task, heartbeat age |
| Queue | GET /api/queue, GET /api/issues?state=in_progress | Pending and active issues |
| Queues | GET /api/streams/status, GET /api/streams/{agentId} | Per-agent Valkey stream depths with last-assigned entry. Rows with >20 queued items are highlighted amber. Clicking a row expands to show the last 10 queued items with GitHub links. |
| Costs | GET /api/costs/summary | 30-day cost breakdown per agent |
| Logs | GET /api/agent-logs + SSE | Real-time agent log stream |

Health Probes (rc#1188)

GET /healthz/deep runs every registered IDependencyHealthChecker in parallel with a hard 2-second timeout per checker, aggregates the per-dependency results via DeepHealthCheck.AggregateOverall, and returns HTTP 503 when any critical dependency (Valkey, Marten, or GitHub) is Unreachable. Non-critical dependencies (Discord) can soft-degrade the overall status but never trip a 503.
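The aggregation rule can be sketched as follows. This is an illustrative Python/asyncio version, not the DeepHealthService code: the `Health` enum, the lowercase dependency keys, and the decision to treat a timeout or exception as Unreachable are all assumptions made for the example.

```python
import asyncio
from enum import Enum


class Health(Enum):
    OK = "Ok"
    DEGRADED = "Degraded"        # assumed name for the soft-degrade state
    UNREACHABLE = "Unreachable"


CRITICAL = {"valkey", "marten", "github"}  # critical dependencies per the text
TIMEOUT_S = 2.0                            # hard per-checker timeout


async def run_checker(name, check, timeout):
    """Run one checker; a timeout or exception counts as Unreachable (assumption)."""
    try:
        return name, await asyncio.wait_for(check(), timeout)
    except Exception:  # asyncio.TimeoutError is an Exception subclass
        return name, Health.UNREACHABLE


async def deep_health(checkers, timeout=TIMEOUT_S):
    """Run all checkers in parallel and derive the HTTP status code."""
    pairs = await asyncio.gather(
        *(run_checker(name, check, timeout) for name, check in checkers.items())
    )
    results = dict(pairs)
    critical_down = any(
        status is Health.UNREACHABLE
        for name, status in results.items()
        if name in CRITICAL
    )
    return (503 if critical_down else 200), results
```

Because the checkers run concurrently, the worst-case probe latency is roughly one timeout, not the sum of all timeouts.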

K8s probes target /healthz/deep:

  • Readiness: sustained 503 for 60s → pod pulled from LB
  • Liveness: sustained 503 for 5 min → pod restart (fresh connection pools)

DependencyHealthDegradedWatcher (rc#947 pattern) tracks observations across scan ticks; auto-files a gap-analysis issue when any dep stays non-Ok for ≥30 min across ≥3 consecutive scans.
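A minimal sketch of the two-threshold rule (at least 30 minutes and at least 3 consecutive non-Ok scans), written in Python with hypothetical names. How the real watcher stores observations or de-duplicates filed issues is not described here, so this only models the threshold check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MIN_DURATION = timedelta(minutes=30)
MIN_CONSECUTIVE_SCANS = 3


@dataclass
class DependencyObservation:
    """Tracks one dependency across scan ticks (hypothetical structure)."""
    first_bad_at: Optional[datetime] = None
    consecutive_bad: int = 0

    def observe(self, is_ok: bool, now: datetime) -> bool:
        """Record one scan result; return True once both thresholds are met."""
        if is_ok:
            # Any Ok observation resets the streak.
            self.first_bad_at = None
            self.consecutive_bad = 0
            return False
        if self.first_bad_at is None:
            self.first_bad_at = now
        self.consecutive_bad += 1
        return (
            self.consecutive_bad >= MIN_CONSECUTIVE_SCANS
            and now - self.first_bad_at >= MIN_DURATION
        )
```

Note that this sketch keeps returning True on every later non-Ok tick; a real watcher would presumably file the issue only once per incident.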

See docs/2026-05-19-deep-health-probe.md for the full design, the four-slice rollout (PR-A → PR-D), and the rc#1173 silent-degrade incident that prompted it.

Implementation: src/ConductorE.Api/Adapters/{DeepHealthService.cs,*HealthChecker.cs} and src/ConductorE.Api/Services/SelfImprovement/DependencyHealthDegradedWatcher.cs.

Webhook Handlers

GitHub sends webhook events to POST /api/webhook/github. rig-conductor handles the following event/action pairs; everything else returns { skipped: true }.

| Event | Action | What it does |
|---|---|---|
| issues | labeled (with agent-ready) | Dispatches the issue: reads .rig-agent.yaml for stack, routes to dev-e-<stack> or ibuild-e, emits ISSUE_APPROVED + ISSUE_ASSIGNED, pushes to Valkey stream. |
| issues | closed | Emits ISSUE_DONE to transition tracked entries. Catches issues closed directly on GitHub (as duplicate, not-planned, or shipped via a PR that didn't link them). |
| pull_request | opened / ready_for_review / synchronize | Links PR to its execution log, fires PR_CREATED, routes agent+Dependabot PRs to Review-E. |
| pull_request | review_requested | Catches human-authored PRs that request Review-E explicitly (via per-repo request-review.yml). |
| pull_request | closed (with merged=true) | Parses Closes #N from the PR body, emits MERGED + ISSUE_DONE, then runs DuplicateCloseService to close sibling PRs. |
| pull_request_review | submitted | Routes changes_requested back to the implementing agent. |

Implementation: src/ConductorE.Api/Program.cs (search for eventType == "...").
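The routing rules in the table can be pictured as a lookup on the (event, action) pair plus two payload conditions. The Python sketch below is an illustration only: the handler names and the non-skipped response shape are hypothetical, while `action`, `label.name`, and `pull_request.merged` are standard GitHub webhook payload fields.

```python
# (event type, action) -> hypothetical handler name
HANDLED = {
    ("issues", "labeled"): "dispatch_issue",
    ("issues", "closed"): "emit_issue_done",
    ("pull_request", "opened"): "link_pr",
    ("pull_request", "ready_for_review"): "link_pr",
    ("pull_request", "synchronize"): "link_pr",
    ("pull_request", "review_requested"): "route_to_review",
    ("pull_request", "closed"): "handle_merge",
    ("pull_request_review", "submitted"): "route_review_verdict",
}


def route(event_type: str, payload: dict) -> dict:
    """Pick a handler for a webhook delivery, or report it as skipped."""
    handler = HANDLED.get((event_type, payload.get("action")))

    # 'labeled' only counts when the added label is agent-ready.
    if handler == "dispatch_issue":
        if payload.get("label", {}).get("name") != "agent-ready":
            handler = None

    # 'closed' only counts when the PR was actually merged.
    if handler == "handle_merge":
        if not payload.get("pull_request", {}).get("merged"):
            handler = None

    if handler is None:
        return {"skipped": True}
    return {"skipped": False, "handler": handler}
```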

Reconciler Recovery

ReconciliationService runs every 5 min and reconciles conductor state with live GitHub state. Beyond the main state-transition pass, it runs three independent recovery scans for known stall patterns:

| Path | Trigger | Tracking |
|---|---|---|
| Abstention recovery | PR in in_review with only a COMMENTED review-e review (no binding verdict) | rc#608 / rc#610 |
| Timeout recovery | PR with ≥2 ReviewFailed(reason="timeout") events (CLI ran but never posted) | rc#765 |
| Quota recovery | PR in state=failed with last AgentStuck.Reason matching a quota signature, and agent quota recovered to <80% | rc#944 |

Each path emits RE_REVIEW_REQUESTED with a path-specific reason and re-publishes to signal:review-e. All three share a 30-min throttle window and the same COI (conflict-of-interest) guard: review-e-authored PRs are never re-dispatched to review-e. See docs/principles.md §5 for the design and docs/2026-05-18-quota-recovery-reconciliation.md for the most recent (quota) path.
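The two shared guards can be expressed as one small pure function. This Python sketch uses hypothetical names and assumes the PR author login is literally `review-e`; the real policy code may key on different data.

```python
from datetime import datetime, timedelta
from typing import Optional

THROTTLE_WINDOW = timedelta(minutes=30)
REVIEWER = "review-e"  # assumed author login for the COI check


def may_request_re_review(
    pr_author: str,
    last_requested_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Apply the COI guard and the 30-min throttle shared by all recovery paths."""
    if pr_author == REVIEWER:
        return False  # COI guard: never send review-e its own PRs
    if last_requested_at is not None and now - last_requested_at < THROTTLE_WINDOW:
        return False  # throttled: a re-review was requested too recently
    return True
```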

Implementation: src/ConductorE.Api/Services/ReconciliationService.cs with pure policies in src/ConductorE.Core/Domain/QuotaRecoveryPolicy.cs and src/ConductorE.Core/UseCases/ReconcileIssue.cs.

Stream Reclaim

StreamReclaimService runs every 60 s as the stream-side actuator for rc#959. It scans XPENDING on each agent's assignments:<agentId> stream; if an entry has been idle for more than 5 min AND its owning consumer belongs to an agent with no heartbeat in the last 10 min, it XCLAIMs the entry to a healthy consumer in the same group.

Pure policy in src/ConductorE.Core/Domain/StreamReclaimPolicy.cs decides ShouldReclaim + PickReclaimTarget. The companion detector StreamConsumerWithoutHeartbeatWatcher (in the rc#947 SelfImprovementService framework) observes the same pattern and files gap-analysis issues — both run in steady state so a regression in either surface remains visible.
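A rough Python sketch of what such a pure policy could look like. The function names echo ShouldReclaim and PickReclaimTarget, but the signatures, the strict/non-strict boundaries, and the "most recent heartbeat wins" target choice are assumptions for illustration, not the contents of StreamReclaimPolicy.cs.

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_THRESHOLD = timedelta(minutes=5)
HEARTBEAT_THRESHOLD = timedelta(minutes=10)


def should_reclaim(
    entry_idle: timedelta,
    owner_last_heartbeat: Optional[datetime],
    now: datetime,
) -> bool:
    """Reclaim only when the entry is idle AND its owner looks dead."""
    owner_silent = (
        owner_last_heartbeat is None
        or now - owner_last_heartbeat > HEARTBEAT_THRESHOLD
    )
    return entry_idle > IDLE_THRESHOLD and owner_silent


def pick_reclaim_target(
    heartbeats: dict[str, datetime],
    now: datetime,
) -> Optional[str]:
    """Choose the healthy consumer with the most recent heartbeat, if any."""
    healthy = [
        name for name, seen in heartbeats.items()
        if now - seen <= HEARTBEAT_THRESHOLD
    ]
    if not healthy:
        return None
    return max(healthy, key=lambda name: heartbeats[name])
```

Keeping both decisions free of I/O is what lets them live in Core and be unit-tested without Valkey.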

See docs/2026-05-18-stream-side-reaper.md for the full design.

Implementation: src/ConductorE.Api/Services/StreamReclaimService.cs.

Orphan Issue Detection

When an issue has been in the queue too long without any agent claiming it, conductor alerts and escalates automatically — no human grep required.

See Orphan Issue Detection for the full spec.

Summary:

  • OrphanScanService runs hourly
  • > 24h queued, no claim → Discord #admin alert, orphan label applied, ISSUE_ORPHANED event emitted
  • > 48h queued, still unclaimed → escalated alert, needs-human label, agent-ready removed, auto-dispatch stopped
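The two thresholds in the summary amount to a simple classification. The Python sketch below is illustrative: the level names (`none`, `orphaned`, `escalated`) are hypothetical, and the actual values stored in the orphanStatus field are not specified here.

```python
from datetime import timedelta

ORPHAN_AFTER = timedelta(hours=24)
ESCALATE_AFTER = timedelta(hours=48)


def classify_orphan(queued_for: timedelta, claimed: bool) -> str:
    """Map time-in-queue to an orphan level (hypothetical level names)."""
    if claimed:
        return "none"
    if queued_for > ESCALATE_AFTER:
        return "escalated"   # needs-human label, agent-ready removed
    if queued_for > ORPHAN_AFTER:
        return "orphaned"    # #admin alert, orphan label, ISSUE_ORPHANED
    return "none"
```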

Implementation: src/ConductorE.Api/Services/OrphanScanService.cs

Duplicate PR Auto-Close

When two agents claim the same issue in parallel (claim-race), each opens a PR. After the winning PR merges, rig-conductor automatically closes any sibling PRs that reference the same issue.

Trigger: MERGED event (from MergeGate or GitHub webhook pull_request.closed)

Behaviour:

  1. Scans all open PRs in the target repo for bodies matching Closes/Fixes/Resolves #N (or cross-repo owner/repo#N).
  2. Closes each sibling PR with a comment:

     **Duplicate of #<MERGED_PR>** — closed automatically by rig-conductor.
     PR #<MERGED_PR> covered the same issue #<N> and merged at <MERGED_AT>.
     Two agents claimed the same issue; this one lost the race.

  3. Deletes the Valkey key session:{repo}#{N} (the agent claim lock).
  4. Emits a DUPLICATE_PR_CLOSED event with fields: repo, mergedPrNumber, closedPrNumber, issueNumber.
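Step 1 depends on recognising closing references in a PR body. Here is one way to do that in Python, as a sketch under assumptions: the regular expression and the helper name are hypothetical, it matches only the three keyword forms named above (case-insensitively), and the real DuplicateCloseService may match more or fewer variants.

```python
import re

# Matches "Closes #42" as well as the cross-repo form "Fixes owner/repo#42".
CLOSING_REF = re.compile(
    r"\b(?:closes|fixes|resolves)\s+"
    r"(?P<repo>[\w.-]+/[\w.-]+)?#(?P<number>\d+)",
    re.IGNORECASE,
)


def referenced_issues(body: str, default_repo: str) -> set[tuple[str, int]]:
    """Return the (repo, issue number) pairs a PR body claims to close."""
    return {
        (match.group("repo") or default_repo, int(match.group("number")))
        for match in CLOSING_REF.finditer(body or "")
    }
```

A sibling PR is then any other open PR whose set of references contains the issue that the merged PR closed.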

Implementation: src/ConductorE.Api/Services/DuplicateCloseService.cs, called from MergeGate.Merge() and the webhook pull_request.closed handler.

Note: This does not prevent the double-claim — atomic SETNX claim:{repo}#{N} is tracked in a separate issue.

Test Coverage

| Suite | Tests | Line | Branch |
|---|---|---|---|
| Core (unit) | 53 | | |
| API (unit, DuplicateCloseService + OrphanScanService) | 22 | | |
| API (integration, Testcontainers PostgreSQL) | 11 | 32.9% | 25.4% |
| Total | 86 | | |

Core coverage gap is auto-generated record methods. API coverage gap is ASP.NET framework-generated code. All business logic is covered.