Architecture
C4 Diagrams
See the C4 Diagrams page for all rendered diagrams.
Clean Architecture
The API follows Clean Architecture — dependencies point inward only.
┌───────────────────────────────────────────────────────┐
│  ConductorE.Api (Frameworks & Drivers)                │
│                                                       │
│  Program.cs            DI wiring, endpoint routing    │
│  Adapters/                                            │
│    MartenEventStore    Implements IEventStore         │
│    MartenIssueQuery    Implements IIssueQuery         │
│    MartenAgentQuery    Implements IAgentQuery         │
│    MartenProjections   Marten-specific projections    │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │  ConductorE.Core (Domain + Use Cases)           │  │
│  │                                                 │  │
│  │  Domain/                                        │  │
│  │    Events.cs       Pure event records           │  │
│  │    ReadModels.cs   Pure read model records      │  │
│  │                                                 │  │
│  │  Ports/                                         │  │
│  │    IEventStore     Interface                    │  │
│  │    IIssueQuery     Interface                    │  │
│  │    IAgentQuery     Interface                    │  │
│  │                                                 │  │
│  │  UseCases/                                      │  │
│  │    SubmitEvent     Maps request → event         │  │
│  │                                                 │  │
│  │  ⚠ ZERO external dependencies                   │  │
│  └─────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────┘
Core has zero NuGet packages. No Marten, no ASP.NET, no framework code. Domain events, read models, ports, and use cases are pure C#.
Marten is only in the Api project (adapters). If we ever swap PostgreSQL for another event store, only the adapters change — Core stays untouched.
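As a rough illustration of that boundary, here is a minimal sketch in plain C#. Only IEventStore, AppendAsync(streamId, event) and SubmitEvent come from this page; the IssueEvent record, the HandleAsync signature, and the InMemoryEventStore stand-in are assumptions made for the example, not the project's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Composition root (the Api layer's job): choose an adapter, hand it to the use case.
var store = new InMemoryEventStore();
var useCase = new SubmitEvent(store);
await useCase.HandleAsync("dashecorp/rig-conductor#42", "ISSUE_APPROVED");
Console.WriteLine(store.Streams["dashecorp/rig-conductor#42"].Count); // prints 1

// --- Core side: no framework references ---
public record IssueEvent(string Type, DateTimeOffset At); // hypothetical event shape

public interface IEventStore // port
{
    Task AppendAsync(string streamId, object @event);
}

public sealed class SubmitEvent // use case; HandleAsync is an assumed signature
{
    private readonly IEventStore _store;
    public SubmitEvent(IEventStore store) => _store = store;

    public Task HandleAsync(string streamId, string type) =>
        _store.AppendAsync(streamId, new IssueEvent(type, DateTimeOffset.UtcNow));
}

// --- Api side: adapter. The real one wraps Marten; this stand-in keeps events in memory. ---
public sealed class InMemoryEventStore : IEventStore
{
    public Dictionary<string, List<object>> Streams { get; } = new();

    public Task AppendAsync(string streamId, object @event)
    {
        if (!Streams.TryGetValue(streamId, out var events))
            Streams[streamId] = events = new List<object>();
        events.Add(@event);
        return Task.CompletedTask;
    }
}
```

Because the use case only sees the port, swapping the in-memory stand-in for a Marten-backed adapter changes the composition root and nothing in Core.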
Two-Component Design
rig-conductor runs as two components in the same Kubernetes namespace:
┌──────────────────── rig-conductor namespace ────────────────────┐
│                                                                 │
│   ┌──────────────────┐        ┌───────────────────┐             │
│   │ rig-conductor    │  HTTP  │ rig-conductor API │             │
│   │ (Rig Agent)      │───────▶│ (.NET 10)         │             │
│   │                  │        │                   │             │
│   │ Discord bot      │        │ Domain:           │             │
│   │ GitHub MCP tools │        │   Events          │             │
│   │ Claude Haiku     │        │   ReadModels      │             │
│   │ 1-year sub token │        │ Ports:            │             │
│   └──────────────────┘        │   IEventStore     │             │
│                               │   IIssueQuery     │             │
│                               │ Adapters:         │             │
│                               │   Marten → PG     │             │
│                               └─────────┬─────────┘             │
│                                         │                       │
│                               ┌─────────▼─────────┐             │
│                               │ PostgreSQL 16     │             │
│                               │ Marten schema     │             │
│                               └───────────────────┘             │
└─────────────────────────────────────────────────────────────────┘
Event Flow
sequenceDiagram
participant H as Human/Agent
participant A as Rig Agent Runtime (Discord)
participant UC as SubmitEvent (Use Case)
participant P as IEventStore (Port)
participant M as MartenEventStore (Adapter)
participant PG as PostgreSQL
H->>A: Message in #conductor-e
A->>A: Claude processes message
A->>UC: SubmitEventRequest
UC->>UC: MapToEvent (domain logic)
UC->>P: AppendAsync(streamId, event)
P->>M: Marten session.Events.Append
M->>PG: INSERT + inline projection update
PG-->>M: Stored
M-->>UC: Done
UC-->>A: SubmitEventResponse
A-->>H: "Issue #547 queued"
Stream Identity
String-based (not Guid):
- Issue streams: dashecorp/rig-conductor#42
- Agent streams: dev-e-1
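A small sketch of how issue stream IDs in this format could be built and parsed. The StreamIds helper and its method names are hypothetical and only illustrate the owner/repo#N convention shown above.

```csharp
using System;

Console.WriteLine(StreamIds.ForIssue("dashecorp/rig-conductor", 42)); // dashecorp/rig-conductor#42
var (repo, number) = StreamIds.ParseIssue("dashecorp/rig-conductor#42");
Console.WriteLine($"{repo} / {number}"); // dashecorp/rig-conductor / 42

public static class StreamIds // hypothetical helper, not from the codebase
{
    public static string ForIssue(string repo, int number) => $"{repo}#{number}";

    public static (string Repo, int Number) ParseIssue(string streamId)
    {
        // Split on the last '#': everything before is owner/repo, everything after is N.
        var hash = streamId.LastIndexOf('#');
        if (hash <= 0 || !int.TryParse(streamId.AsSpan(hash + 1), out var number))
            throw new FormatException($"Not an issue stream id: {streamId}");
        return (streamId[..hash], number);
    }
}
```

Agent stream IDs such as dev-e-1 contain no '#', so ParseIssue rejects them with a FormatException rather than guessing.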
Projections
Marten inline projections (update synchronously with event append):
| Projection | Source Events | Key Fields |
|---|---|---|
| IssueStatus | All lifecycle events | state, agentId, prNumber, priority, attempt, queuedAt, orphanStatus |
| AgentStatus | Heartbeat, IssueAssigned, WorkStarted, IssueUnassigned, AgentStuck, TokenUsage, AgentQuotaReported, MemoryQueried | status, currentIssue, IssuesFailed, lastHeartbeatAt, providers, capabilities, quota, memory metrics |
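Conceptually, each projection is a fold over the stream's events. The sketch below shows that idea in plain C# for a cut-down AgentStatus. The record shapes and the AgentStatusProjection class are assumptions for illustration and do not use Marten's projection API.

```csharp
using System;

var status = new AgentStatus("dev-e-1");
object[] stream =
{
    new Heartbeat(DateTimeOffset.Parse("2026-01-01T00:00:00Z")),
    new IssueAssigned("dashecorp/rig-conductor#42"),
    new IssueUnassigned("dashecorp/rig-conductor#42"),
};
foreach (var e in stream) status = AgentStatusProjection.Apply(status, e);
Console.WriteLine(status.CurrentIssue ?? "(idle)"); // prints (idle)

// Hypothetical event and read-model shapes.
public record Heartbeat(DateTimeOffset At);
public record IssueAssigned(string IssueId);
public record IssueUnassigned(string IssueId);
public record AgentStatus(string AgentId, string? CurrentIssue = null, DateTimeOffset? LastHeartbeatAt = null);

public static class AgentStatusProjection
{
    // Pure function: previous read model + one event = next read model.
    public static AgentStatus Apply(AgentStatus s, object e) => e switch
    {
        Heartbeat h       => s with { LastHeartbeatAt = h.At },
        IssueAssigned a   => s with { CurrentIssue = a.IssueId },
        IssueUnassigned _ => s with { CurrentIssue = null },
        _                 => s, // events this projection does not care about
    };
}
```

With an inline projection, this kind of update is applied as part of the event append, so a read immediately after the append sees the new state.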
Dashboard
Dashboard.html is a single-page control plane UI served at /. It auto-refreshes every 30 seconds.
| Panel | Data source | Description |
|---|---|---|
| Agents | GET /api/agents | Online/offline status, current task, heartbeat age |
| Queue | GET /api/queue, GET /api/issues?state=in_progress | Pending and active issues |
| Queues | GET /api/streams/status, GET /api/streams/{agentId} | Per-agent Valkey stream depths with last-assigned entry. Rows with >20 queued items are highlighted amber. Clicking a row expands to show the last 10 queued items with GitHub links. |
| Costs | GET /api/costs/summary | 30-day cost breakdown per agent |
| Logs | GET /api/agent-logs + SSE | Real-time agent log stream |
Health Probes (rc#1188)
GET /healthz/deep runs every registered IDependencyHealthChecker in parallel with a 2-second per-checker hard timeout, aggregates per-dep results via DeepHealthCheck.AggregateOverall, and returns HTTP 503 when any critical dep is Unreachable (Valkey / Marten / GitHub). Non-critical deps (Discord) can soft-degrade overall but never trip 503.
K8s probes target /healthz/deep:
- Readiness: sustained 503 for 60s → pod pulled from LB
- Liveness: sustained 503 for 5 min → pod restart (fresh connection pools)
DependencyHealthDegradedWatcher (rc#947 pattern) tracks observations across scan ticks; auto-files a gap-analysis issue when any dep stays non-Ok for ≥30 min across ≥3 consecutive scans.
See docs/2026-05-19-deep-health-probe.md for the full design, the four-slice rollout (PR-A → PR-D), and the rc#1173 silent-degrade incident that prompted it.
Implementation: src/ConductorE.Api/Adapters/{DeepHealthService.cs,*HealthChecker.cs} and src/ConductorE.Api/Services/SelfImprovement/DependencyHealthDegradedWatcher.cs.
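The parallel fan-out with a hard per-checker timeout can be sketched as follows. This is an illustration under assumptions, not the code in DeepHealthService.cs: the DepStatus enum, the DepResult record, and the tuple-based checker list are hypothetical stand-ins for IDependencyHealthChecker, and a timeout is treated as Unreachable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var checkers = new (string Name, bool Critical, Func<Task<DepStatus>> Check)[]
{
    ("valkey",  true,  () => Task.FromResult(DepStatus.Ok)),
    ("discord", false, () => Task.FromResult(DepStatus.Unreachable)),
    // Simulates a hung dependency: never answers inside the 2-second budget.
    ("github",  true,  async () => { await Task.Delay(TimeSpan.FromSeconds(10)); return DepStatus.Ok; }),
};

var results = await DeepHealthSketch.RunAsync(checkers, TimeSpan.FromSeconds(2));
Console.WriteLine(DeepHealthSketch.HttpStatus(results)); // prints 503

public enum DepStatus { Ok, Degraded, Unreachable }
public record DepResult(string Name, bool Critical, DepStatus Status);

public static class DeepHealthSketch
{
    public static async Task<DepResult[]> RunAsync(
        IEnumerable<(string Name, bool Critical, Func<Task<DepStatus>> Check)> checkers,
        TimeSpan perCheckerTimeout)
    {
        // Start every check first, then await them together so they run in parallel.
        var tasks = checkers.Select(async c =>
        {
            try
            {
                // WaitAsync throws TimeoutException when the check outlives its budget.
                var status = await c.Check().WaitAsync(perCheckerTimeout);
                return new DepResult(c.Name, c.Critical, status);
            }
            catch (Exception)
            {
                return new DepResult(c.Name, c.Critical, DepStatus.Unreachable);
            }
        });
        return await Task.WhenAll(tasks);
    }

    // Only an unreachable critical dependency trips 503.
    public static int HttpStatus(IEnumerable<DepResult> results) =>
        results.Any(r => r.Critical && r.Status == DepStatus.Unreachable) ? 503 : 200;
}
```

In this example the unreachable Discord check alone would leave the response at 200; the 503 comes from the critical GitHub check timing out.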
Webhook Handlers
GitHub sends webhook events to POST /api/webhook/github. rig-conductor handles the following event/action pairs; everything else returns { skipped: true }.
| Event | Action | What it does |
|---|---|---|
| issues | labeled (with agent-ready) | Dispatches the issue: reads .rig-agent.yaml for stack, routes to dev-e-<stack> or ibuild-e, emits ISSUE_APPROVED + ISSUE_ASSIGNED, pushes to Valkey stream. |
| issues | closed | Emits ISSUE_DONE to transition tracked entries. Catches issues closed directly on GitHub (as duplicate, not-planned, or shipped via a PR that didn't link them). |
| pull_request | opened / ready_for_review / synchronize | Links PR to its execution log, fires PR_CREATED, routes agent+Dependabot PRs to Review-E. |
| pull_request | review_requested | Catches human-authored PRs that request Review-E explicitly (via per-repo request-review.yml). |
| pull_request | closed (with merged=true) | Parses Closes #N from the PR body, emits MERGED + ISSUE_DONE, then runs DuplicateCloseService to close sibling PRs. |
| pull_request_review | submitted | Routes changes_requested back to the implementing agent. |
Implementation: src/ConductorE.Api/Program.cs (search for eventType == "...").
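The table above amounts to a lookup on the (event, action) pair plus a couple of payload conditions. A minimal sketch of that dispatch, with hypothetical handler names returned as strings (the real handlers live inline in Program.cs and are not shown on this page):

```csharp
using System;

Console.WriteLine(WebhookRouter.Route("issues", "labeled", label: "agent-ready")); // dispatch-issue
Console.WriteLine(WebhookRouter.Route("pull_request", "closed", merged: false));   // skipped
Console.WriteLine(WebhookRouter.Route("issues", "edited"));                        // skipped

public static class WebhookRouter // hypothetical; handler names are placeholders
{
    public static string Route(string eventType, string action, string? label = null, bool merged = false) =>
        (eventType, action) switch
        {
            ("issues", "labeled") when label == "agent-ready" => "dispatch-issue",
            ("issues", "closed") => "emit-issue-done",
            ("pull_request", "opened" or "ready_for_review" or "synchronize") => "link-pr",
            ("pull_request", "review_requested") => "route-to-review-e",
            ("pull_request", "closed") when merged => "emit-merged-and-close-duplicates",
            ("pull_request_review", "submitted") => "route-changes-requested",
            _ => "skipped", // everything else is acknowledged and ignored
        };
}
```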
Reconciler Recovery
ReconciliationService runs every 5 min and reconciles conductor state with live GitHub state. Beyond the main state-transition pass, it runs three independent recovery scans for known stall patterns:
| Path | Trigger | Tracking |
|---|---|---|
| Abstention recovery | PR in in_review with only a COMMENTED review-e review (no binding verdict) | rc#608 / rc#610 |
| Timeout recovery | PR with ≥2 ReviewFailed(reason="timeout") events (CLI ran but never posted) | rc#765 |
| Quota recovery | PR in state=failed with last AgentStuck.Reason matching a quota signature, and agent quota recovered to <80% | rc#944 |
Each path emits RE_REVIEW_REQUESTED with a path-specific reason and re-publishes to signal:review-e. All three share a 30-min throttle window and the same COI guard (review-e-authored PRs are never re-dispatched to review-e). See docs/principles.md §5 for the design and docs/2026-05-18-quota-recovery-reconciliation.md for the most recent (quota) path.
Implementation: src/ConductorE.Api/Services/ReconciliationService.cs with pure policies in src/ConductorE.Core/Domain/QuotaRecoveryPolicy.cs and src/ConductorE.Core/UseCases/ReconcileIssue.cs.
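A sketch of how the quota path's trigger could combine with the shared throttle and COI guard. Everything here is an assumption made for illustration: the PrSnapshot shape, reading "<80%" as quota usage below 80 percent, and treating any reason containing "quota" as a quota signature. The actual rules live in QuotaRecoveryPolicy.cs.

```csharp
using System;

var now = DateTimeOffset.Parse("2026-05-18T12:00:00Z");
var pr = new PrSnapshot("failed", "quota exceeded", "dev-e-1", now.AddHours(-1));
Console.WriteLine(RecoverySketch.ShouldRequestReReview(pr, quotaUsedPercent: 42, now)); // True

public record PrSnapshot(string State, string LastStuckReason, string Author, DateTimeOffset? LastReReviewAt);

public static class RecoverySketch // hypothetical, not the project's policy class
{
    static readonly TimeSpan Throttle = TimeSpan.FromMinutes(30);

    public static bool ShouldRequestReReview(PrSnapshot pr, double quotaUsedPercent, DateTimeOffset now)
    {
        // COI guard: review-e never reviews its own PRs.
        if (pr.Author == "review-e") return false;
        // Throttle: at most one re-dispatch per 30-minute window.
        if (pr.LastReReviewAt is { } last && now - last < Throttle) return false;
        // Quota trigger (assumed signature match and assumed meaning of "<80%").
        return pr.State == "failed"
            && pr.LastStuckReason.Contains("quota", StringComparison.OrdinalIgnoreCase)
            && quotaUsedPercent < 80;
    }
}
```

Keeping the decision a pure function of a snapshot and the clock is what lets it sit in Core and be unit-tested without GitHub or Valkey.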
Stream Reclaim
StreamReclaimService runs every 60 s as the stream-side actuator for rc#959. It scans XPENDING on each agent's assignments:<agentId> stream; if an entry is idle past 5 min AND the consumer it's assigned to owns an agent with no heartbeat in 10 min, it XCLAIMs the entry to a healthy consumer in the same group.
Pure policy in src/ConductorE.Core/Domain/StreamReclaimPolicy.cs decides ShouldReclaim + PickReclaimTarget. The companion detector StreamConsumerWithoutHeartbeatWatcher (in the rc#947 SelfImprovementService framework) observes the same pattern and files gap-analysis issues — both run in steady state so a regression in either surface remains visible.
See docs/2026-05-18-stream-side-reaper.md for the full design.
Implementation: src/ConductorE.Api/Services/StreamReclaimService.cs.
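The reclaim condition described above is a conjunction of two thresholds. A minimal sketch, assuming the policy receives the entry's idle time and the owning agent's last heartbeat; the class name and signature are hypothetical and may differ from StreamReclaimPolicy.cs:

```csharp
using System;

var now = DateTimeOffset.UtcNow;
Console.WriteLine(ReclaimSketch.ShouldReclaim(TimeSpan.FromMinutes(7), now.AddMinutes(-12), now)); // True
Console.WriteLine(ReclaimSketch.ShouldReclaim(TimeSpan.FromMinutes(7), now.AddMinutes(-1), now));  // False

public static class ReclaimSketch
{
    static readonly TimeSpan MaxIdle = TimeSpan.FromMinutes(5);
    static readonly TimeSpan HeartbeatGrace = TimeSpan.FromMinutes(10);

    public static bool ShouldReclaim(TimeSpan entryIdle, DateTimeOffset? lastHeartbeatAt, DateTimeOffset now)
    {
        // An agent that has never sent a heartbeat counts as silent (an assumption).
        var ownerSilent = lastHeartbeatAt is null || now - lastHeartbeatAt.Value > HeartbeatGrace;
        // Both conditions must hold: the entry is stale AND its owner looks dead.
        return entryIdle > MaxIdle && ownerSilent;
    }
}
```

Requiring both conditions avoids stealing work from a slow but healthy agent: a long-idle entry is left alone while its owner is still heartbeating.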
Orphan Issue Detection
When an issue has been in the queue too long without any agent claiming it, conductor alerts and escalates automatically — no human grep required.
See Orphan Issue Detection for the full spec.
Summary:
- OrphanScanService runs hourly
- > 24h queued, no claim → Discord #admin alert, orphan label applied, ISSUE_ORPHANED event emitted
- > 48h queued, still unclaimed → escalated alert, needs-human label, agent-ready removed, auto-dispatch stopped
Implementation: src/ConductorE.Api/Services/OrphanScanService.cs
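The two thresholds in the summary can be read as a simple classification on queue age. The sketch below assumes hypothetical names (OrphanStatus, OrphanRules) and leaves out the side effects (alerts, labels, events) that OrphanScanService performs.

```csharp
using System;

Console.WriteLine(OrphanRules.Classify(TimeSpan.FromHours(30), claimed: false)); // Orphaned
Console.WriteLine(OrphanRules.Classify(TimeSpan.FromHours(50), claimed: false)); // Escalated
Console.WriteLine(OrphanRules.Classify(TimeSpan.FromHours(50), claimed: true));  // None

public enum OrphanStatus { None, Orphaned, Escalated }

public static class OrphanRules
{
    public static OrphanStatus Classify(TimeSpan queuedFor, bool claimed)
    {
        if (claimed) return OrphanStatus.None;                                  // someone picked it up
        if (queuedFor > TimeSpan.FromHours(48)) return OrphanStatus.Escalated;  // needs-human
        if (queuedFor > TimeSpan.FromHours(24)) return OrphanStatus.Orphaned;   // first alert
        return OrphanStatus.None;
    }
}
```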
Duplicate PR Auto-Close
When two agents claim the same issue in parallel (claim-race), each opens a PR. After the winning PR merges, rig-conductor automatically closes any sibling PRs that reference the same issue.
Trigger: MERGED event (from MergeGate or GitHub webhook pull_request.closed)
Behaviour:
1. Scans all open PRs in the target repo for bodies matching Closes/Fixes/Resolves #N (or cross-repo owner/repo#N).
2. Closes each sibling PR with a comment:
**Duplicate of #<MERGED_PR>** — closed automatically by rig-conductor.
PR #<MERGED_PR> covered the same issue #<N> and merged at <MERGED_AT>.
Two agents claimed the same issue; this one lost the race.
3. … session:{repo}#{N} (the agent claim lock).
4. Emits a DUPLICATE_PR_CLOSED event with fields: repo, mergedPrNumber, closedPrNumber, issueNumber.
Implementation: src/ConductorE.Api/Services/DuplicateCloseService.cs, called from MergeGate.Merge() and the webhook pull_request.closed handler.
Note: This does not prevent the double-claim itself; an atomic SETNX claim:{repo}#{N} is tracked in a separate issue.
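Step 1 of the behaviour above depends on recognising closing keywords in PR bodies. One possible way to do that with a regular expression is sketched below. The pattern, the IssueRef record, and the ClosingRefs class are assumptions for illustration and may not match what DuplicateCloseService actually accepts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var body = "Fixes #42 and also closes dashecorp/rig-conductor#7.";
foreach (var r in ClosingRefs.Parse(body))
{
    var repo = r.Repo ?? "(same repo)";
    Console.WriteLine($"{repo} #{r.Number}");
}

public record IssueRef(string? Repo, int Number);

public static class ClosingRefs // hypothetical helper
{
    // Closes / Fixes / Resolves, then an optional owner/repo prefix, then #N.
    static readonly Regex Pattern = new(
        @"\b(?:closes|fixes|resolves)\s+(?<repo>[\w.-]+/[\w.-]+)?#(?<n>\d+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static IssueRef[] Parse(string prBody) =>
        Pattern.Matches(prBody)
            .Select(m => new IssueRef(
                m.Groups["repo"].Success ? m.Groups["repo"].Value : null,
                int.Parse(m.Groups["n"].Value)))
            .ToArray();
}
```

Two open PRs are siblings for this purpose when their parsed references include the same issue. A null Repo means the reference is to an issue in the PR's own repository.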
Test Coverage
| Suite | Tests | Line | Branch |
|---|---|---|---|
| Core (unit) | 53 | — | — |
| API (unit, DuplicateCloseService + OrphanScanService) | 22 | — | — |
| API (integration, Testcontainers PostgreSQL) | 11 | 32.9% | 25.4% |
| Total | 86 | — | — |
Core coverage gap is auto-generated record methods. API coverage gap is ASP.NET framework-generated code. All business logic is covered.