Patterns & Principles

Architecture Patterns

1. Clean Architecture

Dependencies point inward only. The Core project has zero external dependencies — no Marten, no ASP.NET, no framework code.

Layer	Project	Depends On
Domain	`ConductorE.Core/Domain/`	Nothing
Ports	`ConductorE.Core/Ports/`	Domain
Use Cases	`ConductorE.Core/UseCases/`	Domain + Ports
Adapters	`ConductorE.Api/Adapters/`	Ports + Marten
Controllers	`ConductorE.Api/Program.cs`	Use Cases + Ports

If we swap PostgreSQL for another store, only the adapters change. Core stays untouched.

2. Event Sourcing

Every action emits an immutable event. Events are the source of truth — current state is derived from replaying events.

Append-only: events are never modified or deleted
Inline projections: read models (IssueStatus, AgentStatus) update synchronously with event appends
String-keyed streams: the stream ID is repo#issueNumber for issues and agentId for agents
Full audit trail: replay events to see exactly what happened, when, and by whom

We use Marten on PostgreSQL.
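
The properties listed above can be illustrated with a small in-memory model. This is a sketch in Python for brevity, not the Marten-backed implementation: the class and function names and the "in_progress" status are assumptions, while the WORK_STARTED and PR_CREATED event types and the in_review state are taken from elsewhere in this document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Event type -> resulting status. "in_progress" is an illustrative assumption.
TRANSITIONS = {"WORK_STARTED": "in_progress", "PR_CREATED": "in_review"}


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    stream_id: str  # "repo#issueNumber" for issues, agentId for agents
    type: str
    actor: str
    at: datetime


def apply(status: Optional[str], event: Event) -> Optional[str]:
    """Fold one event into the current status; unknown types change nothing."""
    return TRANSITIONS.get(event.type, status)


def replay(events: List[Event]) -> Optional[str]:
    """Derive the current status purely from the event history."""
    status = None
    for event in events:
        status = apply(status, event)
    return status


@dataclass
class InMemoryEventStore:
    log: List[Event] = field(default_factory=list)
    issue_status: Dict[str, Optional[str]] = field(default_factory=dict)

    def append(self, event: Event) -> None:
        self.log.append(event)  # append-only: no update or delete method exists
        # Inline projection: the read model changes in the same call as the append.
        current = self.issue_status.get(event.stream_id)
        self.issue_status[event.stream_id] = apply(current, event)

    def read_stream(self, stream_id: str) -> List[Event]:
        return [e for e in self.log if e.stream_id == stream_id]
```

Because the projection and replay share the same apply function, the read model and a full replay of the stream cannot drift apart in this sketch.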

3. Ports & Adapters

Use cases depend on interfaces (ports), not implementations. Adapters implement ports and live in the outer layer.

Use Case → IEventStore (port) → MartenEventStore (adapter) → PostgreSQL

This enables:

Unit testing with FakeEventStore (no database needed)
Swapping infrastructure without changing business logic
Clear boundaries between domain and framework code
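
A minimal sketch of the port, a fake adapter, and a use case follows. It is written in Python for brevity. The names echo IEventStore, FakeEventStore, and SubmitEvent from this document, but the method names and signatures are assumptions rather than the project's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Event:
    stream_id: str
    type: str


class EventStore(Protocol):
    """Port: the only thing a use case knows about storage."""

    def append(self, event: Event) -> None: ...


class FakeEventStore:
    """Test adapter: keeps events in memory, so no database is needed."""

    def __init__(self) -> None:
        self.appended: List[Event] = []

    def append(self, event: Event) -> None:
        self.appended.append(event)


class SubmitEvent:
    """Use case: maps a raw request to a domain event and appends it."""

    def __init__(self, store: EventStore) -> None:
        self._store = store  # depends on the port, never on a concrete adapter

    def execute(self, repo: str, issue_number: int, event_type: str) -> Event:
        event = Event(stream_id=f"{repo}#{issue_number}", type=event_type)
        self._store.append(event)
        return event


# Usage in a unit test: inject the fake, then inspect what was appended.
store = FakeEventStore()
SubmitEvent(store).execute("rig-conductor", 7, "WORK_STARTED")
```

A production adapter would implement the same append method against the real store; the use case would not change.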

4. Adapter Pattern

Platform-specific concerns are abstracted behind unified interfaces. Applied in Rig Agent Runtime for multi-platform messaging:

Discord Adapter ─┐
                  ├─→ Message Handler (platform-agnostic) → Agent Loop
Slack Adapter ───┘

Same agent code works on Discord or Slack — change messaging.platform in config, no code changes.
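
The diagram above can be sketched as follows, in Python for brevity. Only the messaging.platform config key comes from this document; the adapter interface, the registry, and the handler are hypothetical, and the send methods return strings where a real adapter would call the platform API.

```python
from typing import Dict, Protocol, Type


class MessagingAdapter(Protocol):
    def send(self, channel: str, text: str) -> str: ...


class DiscordAdapter:
    def send(self, channel: str, text: str) -> str:
        return f"[discord:{channel}] {text}"  # stand-in for a Discord API call


class SlackAdapter:
    def send(self, channel: str, text: str) -> str:
        return f"[slack:{channel}] {text}"  # stand-in for a Slack API call


# Hypothetical registry keyed by the value of messaging.platform.
ADAPTERS: Dict[str, Type] = {"discord": DiscordAdapter, "slack": SlackAdapter}


def build_adapter(config: dict) -> MessagingAdapter:
    platform = config["messaging"]["platform"]
    if platform not in ADAPTERS:
        raise ValueError(f"unsupported messaging.platform: {platform!r}")
    return ADAPTERS[platform]()


def handle_message(adapter: MessagingAdapter, channel: str, text: str) -> str:
    """Platform-agnostic handler: it only sees the adapter interface."""
    return adapter.send(channel, f"ack: {text}")
```

Switching platforms changes only the config value passed to build_adapter; handle_message is untouched.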

5. Configuration over Code

Agents are defined by character.json configuration, not by writing new code. Rig Agent Runtime is the shared runtime — one Docker image serves all agents.

{
  "name": "rig-conductor",
  "messaging": { "platform": "discord" },
  "llm": { "model": "claude-haiku-4-5-20251001" },
  "tools": [...]
}

New agent = new config file + optional backend API. No code changes to the runtime.
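
As an illustration of how a shared runtime might read such a file, here is a small Python sketch. The AgentConfig type and load_agent_config function are hypothetical. The tools list is left empty because the example above elides its contents.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentConfig:
    name: str
    platform: str
    model: str
    tools: Tuple


def load_agent_config(raw: str) -> AgentConfig:
    """Parse character.json text; a missing required key raises KeyError."""
    data = json.loads(raw)
    return AgentConfig(
        name=data["name"],
        platform=data["messaging"]["platform"],
        model=data["llm"]["model"],
        tools=tuple(data.get("tools", [])),
    )


# Same fields as the example above, with the elided tools list left empty.
EXAMPLE = """
{
  "name": "rig-conductor",
  "messaging": { "platform": "discord" },
  "llm": { "model": "claude-haiku-4-5-20251001" },
  "tools": []
}
"""

config = load_agent_config(EXAMPLE)
```

Failing fast on a missing key keeps a misconfigured agent from starting with silent defaults; whether the real runtime behaves this way is not stated here.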

Design Principles

1. SOLID

Principle	Application
Single Responsibility	Each use case does one thing. `SubmitEvent` maps and appends — nothing else.
Open/Closed	Add new event types by adding records to Domain, no existing code modified.
Liskov Substitution	`FakeEventStore` substitutes `MartenEventStore` in tests.
Interface Segregation	`IEventStore`, `IIssueQuery`, `IAgentQuery` — three focused interfaces, not one mega-interface.
Dependency Inversion	Use cases depend on `IEventStore` (abstraction), not `MartenEventStore` (implementation).

2. YAGNI

Build only what's needed now. Don't add abstractions, features, or configurability for hypothetical future requirements.

Three lines of similar code is better than a premature abstraction
No feature flags or backwards-compatibility shims
If it's not in the current issue, it doesn't go in the PR

3. TDD + DDD (hard rule)

Test-first and policy-first. Operator-set 2026-05-18.

For every behavior PR:

Identify the domain invariant the change enforces ("review-e dispatches require an explicit prNumber"; "phantom IssueStatus rows have at least one event and none are IssueApproved"; etc.).
Write the pure Core policy at src/ConductorE.Core/<Area>/Policies/<Name>Policy.cs or src/ConductorE.Core/Domain/<Name>.cs — plain functions over plain inputs, no DI, no I/O, clock injected.
Write the policy unit tests FIRST in tests/ConductorE.Core.Tests — covering the contract and edge cases. They will fail (red).
Implement the policy to make the tests pass (green).
Then write the Api adapter (thin I/O shell calling the policy) and the e2e test that exercises the adapter end-to-end via ConductorEApiFactory or equivalent.
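
To make the steps concrete, here is a sketch of a pure policy and its unit tests for the first invariant quoted above. It is written in Python for brevity, whereas the project's policies live in .cs files. The function name, the Decision type, and the reason strings are hypothetical. This invariant does not depend on time; a time-dependent policy would take the current time as a plain argument instead of reading the system clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def review_dispatch_policy(agent_id: str, pr_number: Optional[int]) -> Decision:
    """Pure function over plain inputs: no I/O and no DI container."""
    if agent_id == "review-e" and pr_number is None:
        return Decision(False, "review-e dispatch requires an explicit prNumber")
    return Decision(True, "ok")


# Unit tests. In the test-first flow these are written before the policy body
# and fail (red) until the function above is implemented (green).
def test_review_e_without_pr_number_is_rejected() -> None:
    decision = review_dispatch_policy("review-e", None)
    assert decision.allowed is False
    assert "prNumber" in decision.reason


def test_review_e_with_pr_number_is_allowed() -> None:
    assert review_dispatch_policy("review-e", 123).allowed is True


def test_other_agents_do_not_need_a_pr_number() -> None:
    assert review_dispatch_policy("dev-e", None).allowed is True
```

The adapter written afterwards would only gather the inputs, call the policy, and act on the returned decision.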

Test layers:

Unit (Core) — pure domain logic, no infrastructure. Fast (<100ms).
Projection-contract — events through POST /api/events → assert on the materialised projection shape. See IssueStatusProjectionContractTests (rc#1080). Pins read-model contracts watchers and dashboards rely on.
Adapter unit — Api-side services with stubbed ports.
E2e — full webhook → projection → watcher path via ConductorEApiFactory testcontainer.

PR body must state explicitly:

"Policy in Core, tests written first" — for behavior PRs.
"No-behavior refactor, no new tests needed" — for renames, mechanical refactors, or dependency bumps.

Run dotnet test before every push. CI runs both projects on every PR.

Past evidence the rule pays off: rc#1046 (review-e codex-crash watcher), rc#1071 (review-e-spurious-pr widening), rc#1075 (phantom-cleanup refactor), rc#1080 (projection-contract test layer) all shipped policy-first with the pure unit tests written before any production code. Skipping the discipline cost a 30-min production outage on 2026-05-18 (rar#456 prNumber shadow crash) when I shipped code-first then tests-after.

4. Separation of Concerns

The agent that produces a thing cannot approve that thing. This is structural, not cultural.

Agent	Can Do	Cannot Do
Dev-E	Write code, create PRs	Approve its own PRs
Review-E	Review code, approve/reject	Write implementation code
rig-conductor	Assign work, escalate	Write code or review code
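
One way to make the table structural is to encode it as data and check it before any approval is recorded. The Python sketch below is illustrative only: the lowercase agent IDs, the permission names, and the can_approve function are assumptions, not the project's actual enforcement mechanism.

```python
from typing import Dict, Set

# The table above expressed as data. Permission names are hypothetical.
PERMISSIONS: Dict[str, Set[str]] = {
    "dev-e": {"write_code", "create_pr"},
    "review-e": {"review_code", "approve_pr", "reject_pr"},
    "rig-conductor": {"assign_work", "escalate"},
}


def can_approve(reviewer: str, pr_author: str) -> bool:
    """An agent may approve only if it holds approve_pr and did not author the PR."""
    has_permission = "approve_pr" in PERMISSIONS.get(reviewer, set())
    return has_permission and reviewer != pr_author
```

The second condition covers the producer-cannot-approve rule even for an agent that does hold the approve permission.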

Operational Rules

1. Fix Forward

When production breaks, fix forward. Never auto-rollback.

Production breaks → Agent attempts fix → If failed, reassign
  → If failed again → Escalate to CTO
  → CTO decides: fix forward or rollback (human decision only)

Rollback is never automatic. That's always a CTO decision.

2. Two Strikes Then Human

If an agent fails the same issue twice (two different attempts), escalate to a human. No third automatic attempt.

Strike	Action
1st failure	`AGENT_STUCK` → reassign to different agent, fresh branch
2nd failure	`ESCALATED` → post to Discord #admin, wait for human
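
The escalation table reduces to a small pure function. This Python sketch uses the AGENT_STUCK and ESCALATED event names from the table; the Action enum and next_action function are hypothetical.

```python
from enum import Enum


class Action(Enum):
    REASSIGN = "AGENT_STUCK"  # reassign to a different agent on a fresh branch
    ESCALATE = "ESCALATED"    # post to Discord #admin and wait for a human


def next_action(failed_attempts: int) -> Action:
    """Map the number of failed attempts on one issue to the next step."""
    if failed_attempts < 1:
        raise ValueError("next_action is only defined after at least one failure")
    # Exactly one failure allows one automatic retry; anything beyond goes to a human.
    return Action.REASSIGN if failed_attempts == 1 else Action.ESCALATE
```

Treating every count above one as an escalation means a miscounted third failure still cannot trigger another automatic attempt.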

3. Diagram-First

Create C4 diagrams before coding complex systems. If the diagram is complex, the code will be complex — simplify the diagram first.

Level	When to Create
L1 Context	Before starting a new system
L2 Containers	Before adding services or databases
L3 Components	Before refactoring internal architecture
L4 Flow	Before implementing complex interactions

Use PlantUML for C4 diagrams. Mermaid for everything else (flows, state machines, timelines).

4. Event-Driven Coordination

Agents communicate through events, not direct calls. The event store is the shared nervous system.

Dev-E emits WORK_STARTED → Event Store → rig-conductor reads → assigns next
Dev-E emits PR_CREATED → Event Store → rig-conductor reads → monitors review
Review-E approves → GitHub → rig-conductor reads → auto-merge

No agent calls another agent directly. All coordination flows through events.
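
The first two flows above can be sketched as a reader that polls a shared log from a cursor. This is a Python illustration under stated assumptions: the EventLog and Conductor classes, the cursor mechanism, and the reaction names are hypothetical, and how rig-conductor actually consumes events is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    type: str
    agent_id: str


@dataclass
class EventLog:
    """Shared log: producers append, consumers read from a position."""
    entries: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.entries.append(event)

    def read_from(self, position: int) -> List[Event]:
        return self.entries[position:]


# Hypothetical mapping from event type to the conductor's reaction.
REACTIONS: Dict[str, str] = {
    "WORK_STARTED": "assign_next",
    "PR_CREATED": "monitor_review",
}


@dataclass
class Conductor:
    log: EventLog
    cursor: int = 0
    actions: List[Tuple[str, str]] = field(default_factory=list)

    def poll(self) -> None:
        """React to events appended since the last poll; never call an agent directly."""
        new_events = self.log.read_from(self.cursor)
        for event in new_events:
            reaction = REACTIONS.get(event.type)
            if reaction is not None:
                self.actions.append((event.agent_id, reaction))
        self.cursor += len(new_events)
```

The producer only appends to the log and the conductor only reads from it, so neither holds a reference to the other.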

5. Three-path reconciler recovery

ReconciliationService ticks every 5 min and, after each main reconciliation pass, runs three independent recovery paths. Each path detects a different stall pattern, applies a 30-min throttle to avoid a thundering herd, and emits RE_REVIEW_REQUESTED with a path-specific reason before re-publishing to signal:review-e.

Path	Reason	Detects	Tracking
Abstention	`prior_review_was_abstention`	PR in `in_review` whose only review-e review is `COMMENTED` (no binding verdict)	rc#608 / rc#610
Timeout	`prior_review_timed_out`	PR with ≥2 `ReviewFailed(reason="timeout")` events — review-e CLI completed but never posted a review	rc#765
Quota recovery	`prior_provider_quota_recovered`	PR in `state=failed` where the last `AgentStuck.Reason` matches a quota-saturation signature (codex 429, claude rate-cap) AND the agent's `QuotaFiveHourPct` has dropped below 80%	rc#944 / PR #1094

Pure policies in ConductorE.Core/Domain/ decide the recovery; the service is the thin I/O shell that walks streams, queries agents, and emits events.

Defensive guards shared across all three:

COI (conflict-of-interest) guard: review-e-authored PRs are never re-dispatched to review-e (GitHub 422).
Throttle: same 30-min window via AbstainedReviewReDispatchThrottle.
Idempotency key: RE_REVIEW_REQUESTED:<repo>#<issue>:<pr>:<path-discriminator>:<minute>, so a same-tick duplicate is deduped at the event-store layer.
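
The three guards can be sketched as pure functions with the clock passed in. This is a Python illustration, not the project's code. The function names are hypothetical, the "timeout" path discriminator in the usage line is made up for the example, and the minute format (UTC, YYYY-MM-DDTHH:MM) is an assumption, since the document only names the <minute> placeholder.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

THROTTLE_WINDOW = timedelta(minutes=30)


def idempotency_key(repo: str, issue: int, pr: int, path: str, now: datetime) -> str:
    """Build the dedupe key; two calls within the same minute produce the same key."""
    # Assumed minute format: UTC, truncated to the minute.
    minute = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    return f"RE_REVIEW_REQUESTED:{repo}#{issue}:{pr}:{path}:{minute}"


def should_redispatch(
    pr_author: str, last_dispatch: Optional[datetime], now: datetime
) -> bool:
    """Apply the conflict-of-interest guard, then the 30-min throttle."""
    if pr_author == "review-e":
        return False  # review-e never reviews its own PR
    if last_dispatch is not None and now - last_dispatch < THROTTLE_WINDOW:
        return False  # throttled: re-dispatched too recently
    return True


now = datetime(2026, 5, 18, 12, 0, 30, tzinfo=timezone.utc)
key = idempotency_key("rig-conductor", 12, 34, "timeout", now)
```

Truncating the timestamp to the minute is what lets the event store reject a duplicate emitted within the same tick.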

See docs/2026-05-18-quota-recovery-reconciliation.md for the quota-recovery path details.

Quota-aware dispatch (proactive)

ReconciliationService's quota recovery (above) is reactive: it salvages stalled PRs after one provider saturates. QuotaAwareReviewRouter is the proactive twin: at each review dispatch, it picks the candidate (review-e vs review-e-codex) with the most quota headroom before the assignment lands on a stream.

The pure policy lives in ConductorE.Core/UseCases/QuotaAwareReviewRouter.cs. The thin adapter ReviewDispatchRouter.SelectAsync(IAgentQuery) is used by every review-dispatch site (webhook + reconciler scan). It falls back to review-e when no candidate is both alive and non-saturated, preserving the legacy default. See docs/2026-05-18-quota-aware-review-dispatch.md.
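
The selection rule might look like the following Python sketch. The Candidate type and select_reviewer function are hypothetical. The 80% saturation threshold is borrowed from the quota-recovery path as an assumption; the router's real threshold and any tie-breaking rules are described in the linked design doc, not here.

```python
from dataclasses import dataclass
from typing import List

# Assumption: reuse the 80% figure from the quota-recovery path.
SATURATION_PCT = 80.0
DEFAULT_REVIEWER = "review-e"  # legacy default named in this document


@dataclass(frozen=True)
class Candidate:
    agent_id: str
    alive: bool
    quota_five_hour_pct: float  # percent of the five-hour quota already used


def select_reviewer(candidates: List[Candidate]) -> str:
    """Pick the live, non-saturated candidate with the most quota headroom."""
    eligible = [
        c for c in candidates
        if c.alive and c.quota_five_hour_pct < SATURATION_PCT
    ]
    if not eligible:
        return DEFAULT_REVIEWER  # nobody qualifies: keep the legacy behavior
    # Lowest usage means most headroom; min() returns the first on a tie.
    return min(eligible, key=lambda c: c.quota_five_hour_pct).agent_id
```

Keeping the function free of I/O means the adapter only has to fetch candidate status and pass it in.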

6. Stream-side reclamation

ReconciliationService (§5) handles state-level recovery: issues stuck in known bad states. StreamReclaimService is its transport-level sibling: it recovers Redis Streams entries stuck in a consumer's pending entries list (PEL) because the consumer pod went silent (crashed, OOM-killed, or hung in a slow CLI) without sending XACK.

Runs every 60 s. For each known agent's assignments:<agentId> stream:

Step	Behavior
`XPENDING`	List pending entries across all consumers in the `agents` group.
Per-entry policy	`StreamReclaimPolicy.ShouldReclaim` — reclaim iff entry idle > 5 min AND the assigned consumer's agent has no heartbeat within 10 min.
Target selection	`StreamReclaimPolicy.PickReclaimTarget` — pick a consumer in the same group whose agent has a fresh heartbeat; prefer freshest among healthy candidates.
`XCLAIM`	Force-move the entry to the target consumer. The target's next `XREADGROUP` picks it up.
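
The two policy rows in the table can be sketched as pure functions. This Python version is illustrative: the function signatures are assumptions, and treating a "fresh" heartbeat as one inside the same 10-min window is also an assumption. The XPENDING and XCLAIM calls themselves belong to the I/O shell and are not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

ENTRY_IDLE_LIMIT = timedelta(minutes=5)
HEARTBEAT_LIMIT = timedelta(minutes=10)


def should_reclaim(
    entry_idle: timedelta, owner_heartbeat: Optional[datetime], now: datetime
) -> bool:
    """Reclaim only if the entry is idle AND its owner has gone silent."""
    owner_silent = (
        owner_heartbeat is None or now - owner_heartbeat > HEARTBEAT_LIMIT
    )
    return entry_idle > ENTRY_IDLE_LIMIT and owner_silent


def pick_reclaim_target(
    heartbeats: Dict[str, Optional[datetime]], owner: str, now: datetime
) -> Optional[str]:
    """Choose the healthy consumer with the freshest heartbeat, if there is one."""
    healthy = {
        consumer: beat
        for consumer, beat in heartbeats.items()
        if consumer != owner
        and beat is not None
        and now - beat <= HEARTBEAT_LIMIT
    }
    if not healthy:
        return None  # no safe target: leave the entry where it is
    return max(healthy, key=lambda consumer: healthy[consumer])
```

Requiring both conditions avoids stealing work from a consumer that is slow but still alive.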

Pairs with a detector watcher: StreamConsumerWithoutHeartbeatWatcher in the rc#947 SelfImprovementService framework. The detector files gap-analysis issues; this service takes action. Both stay active so a regression in either surface remains visible. Tracking: rc#959.

See docs/2026-05-18-stream-side-reaper.md for the full design + tuning knobs.