Dashecorp Rig — Brain¶
Fresh-agent entry point. Read this first. One fetch (~27 KB) gives you the repo manifest, deployed surfaces (including rig-conductor's 13 endpoints and built-in Dashboard), agent instances, primary flows, frontmatter schema, 40+ event types (summary; full schemas at /events.md), 18-whitepaper catalog, and the current backlog with prior_art links. Every claim traces to its source file in `facts/`.
Compiled from `facts/*.yaml` + live GitHub state (`gh api /orgs/dashecorp/repos` for the repo list; manifest validation for agents). Do not hand-edit BRAIN.md. Regenerate with `npm run brain`. CI runs `--check` and fails on drift.
What this is¶
The Dashecorp rig is an autonomous coding-agent system. A human posts a user
story; agents research, propose, code, review, and ship. Canonical docs live
in dashecorp/rig-docs (Astro Starlight); operational memory lives in a
Postgres + pgvector Memory MCP; deployments are Flux-managed on a k3s
cluster running on a GCE VM (invotek-k3s in invotek-github-infra).
Published surfaces¶
Rig landing — discoverable index of all surfaces¶
- URL: https://rig.dashecorp.com/
- Type: html
Canonical brain entry point (this file, rendered)¶
- URL: https://docs.rig.dashecorp.com/brain/
- Raw: https://research.rig.dashecorp.com/BRAIN.md
- Type: markdown
Brain map — visual architecture + doc-linkage graph¶
- URL: https://research.rig.dashecorp.com/map/
- Type: astro-starlight
- Note: Two auto-derived diagrams (architecture from facts/, linkage from doc frontmatter). See the shape of what the rig knows before fetching individual pages.
LLM site map (research, proposals, user-stories)¶
- URL: https://research.rig.dashecorp.com/llms.txt
- Type: llms-txt
Full content dump (single-shot ingestion)¶
- URL: https://research.rig.dashecorp.com/llms-full.txt
- Type: llms-full-txt
Research, proposals, user-stories (rendered Starlight site)¶
- URL: https://research.rig.dashecorp.com/
- Type: astro-starlight
- Source: dashecorp/rig-docs
Aggregated engineering docs (architecture, guides, whitepapers, per-repo docs)¶
- URL: https://docs.rig.dashecorp.com/
- Type: mkdocs-material
- Source: dashecorp/rig-gitops (docs-site/)
- Note: Built by scripts/build-docs.sh in rig-gitops on push + hourly cron. Pulls each rig repo's docs/ via gh api. Different scope from research.rig.dashecorp.com (engineering reference vs. research).
Sitemap (XML)¶
- URL: https://research.rig.dashecorp.com/sitemap-index.xml
- Type: sitemap-xml
rig-conductor API (cluster-internal)¶
- Type: rest-api
- Visibility: cluster-internal-only
- Endpoints:
  - `POST /api/events` — Submit any of the 40+ event types — see /events.md
  - `GET /api/assignments/next` — Claim next issue assignment. Query: agentId=dev-e-node
  - `GET /api/pr-reviews/next` — Claim direct-PR review (no issue) for infra/tooling PRs
  - `GET /api/pr-reviews/item` — Inspect a single PR review item. Query: repo, prNumber
  - `POST /api/pr-reviews/merge` — Server-side merge gate for direct PR reviews (rc#1028)
  - `GET /api/issues` — List tracked issues. Query: state=open|done|stuck
  - `GET /api/issues/item` — Fetch a single issue projection by (repo, issueNumber)
  - `GET /api/issues/trace` — Per-issue event trace + state transitions for debugging
  - `GET /api/stuck-issues` — List issues in a non-terminal state for too long (stuck-watcher candidate set)
  - `GET /api/queue` — Current dispatch queue state
  - `GET /api/usage` — Token / cost usage by agent and/or repo. Query: agentId, repo
  - `GET /api/costs/issue` — Cost for a specific issue. Query: repo, issueNumber
  - `GET /api/costs/summary` — Aggregate cost. Query: days (default 7)
  - `GET /api/costs/daily` — Daily cost time series. Query: days
  - `GET /api/events/live` — SSE stream of live events (for Dashboard.html)
  - `GET /api/streams/status` — Stream consumer status
  - `GET /api/streams/{agentId}` — Per-agent stream tail (recent assignment messages). Query: count
  - `GET /api/agents` — List registered agents (heartbeat + status). Query: archived=true
  - `DELETE /api/agents/{agentId}` — Forcibly archive a specific agent (admin)
  - `DELETE /api/agents/offline` — Bulk-archive all agents that are offline (no recent heartbeat)
  - `GET /api/agent-capacity` — Per-agent capacity / quota / dispatch eligibility snapshot
  - `POST /api/webhook/github` — GitHub webhook intake — normalizes GH events into rig-conductor stream
  - `POST /api/webhook/flux` — Flux deploy confirmation webhook (rc#413 in_deploy → deployed)
  - `POST /api/merge` — Server-side merge gate
  - `POST /api/execution-logs` — Create execution log envelope
  - `POST /api/execution-logs/{id}/logs` — Append log entries
  - `POST /api/execution-logs/{id}/steps` — Append structured step
  - `POST /api/execution-logs/{id}/complete` — Mark log complete
  - `GET /api/execution-logs/{id}` — Fetch log by id
  - `GET /api/execution-logs/issue` — Logs per issue. Query: repo, issueNumber
  - `GET /api/execution-logs` — List logs. Query: limit, status
  - `POST /api/execution-logs/cleanup` — Prune old logs
  - `GET /api/repo-learnings` — Fetch learnings. Query: repo
  - `POST /api/repo-learnings` — Upsert learning
  - `DELETE /api/repo-learnings` — Delete learning. Query: repo, key
  - `GET /api/guard-blocked` — Guard-block counts per agent. Query: agentId
  - `GET /health` — Liveness probe — always 200 if the process is alive
  - `GET /healthz/deep` — Deep readiness probe — Marten + Valkey + dependency checks (rc#1188)
  - `GET /api/health` — Detailed health snapshot for the dashboard (component-level)
  - `GET /api/version` — Build version + git SHA
  - `GET /dashboard` — Built-in single-page dashboard (HTML) — Engineering Rig control plane
  - `GET /api/events/stream` — Single event-stream tail by stream id. Query: id
  - `GET /api/events/recent` — Recent events across all streams. Query: hours
  - `GET /api/main-ci` — Main-branch CI status snapshot. Query: repo
  - `GET /api/ci-failures` — List CI failures across repos. Query: repo, includeAcked
  - `POST /api/ci-failures/{repo}/{workflowName}/{runId:long}/ack` — Ack a CI failure so it stops showing as active
  - `GET /api/main-guard/incidents` — Main-guard incidents (rc#1226 + rc#1234). Query: repo, status
  - `GET /api/a11y` — Accessibility scan results per repo. Query: repo
  - `GET /api/stuck-watch` — Live stuck-watch snapshot (proxies upstream cluster check)
  - `GET /api/stuck-patterns` — List active stuck patterns. Query: includeResolved=true for all
  - `POST /api/stuck-patterns/{fingerprint}/resolve` — Mark a stuck pattern as resolved (writes memory)
  - `GET /api/stuck-patterns/brain-section` — Generate the `## Known stuck patterns` markdown for BRAIN.md
  - `GET /api/agent-logs` — List recent agent log entries across all agents. Query: count
  - `GET /api/agent-logs/{agentId}` — Tail recent log entries for one agent. Query: count
  - `POST /api/agent-logs` — Append a batch of log entries from an agent (push from pod)
  - `GET /api/agent-logs/live` — SSE stream of live agent log entries
  - `GET /api/self-improvement/signatures` — Watcher signature states (rc#947): occurrences, OpenIssue, clean-tick counter
  - `POST /api/admin/issues/force-done` — Operator force-close an issue's read-model state to Done (admin)
  - `POST /api/admin/overrides` — Record an operator override event (audit trail)
  - `GET /api/admin/overrides` — List recent operator overrides for audit
  - `POST /api/planner/trigger` — Dispatch a planner task (planner agent stream)
- Note: The conductor's in-cluster API endpoint. Reachable only from inside the cluster — exact host/port intentionally not surfaced publicly.
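As a rough illustration of how the query parameters listed above attach to the endpoint paths, the sketch below builds request URLs without sending anything. The base URL is a made-up placeholder (the real in-cluster host and port are intentionally not published), and `conductor_url` is a hypothetical helper, not part of rig-conductor.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical placeholder: the real in-cluster host/port is not surfaced in this document.
CONDUCTOR_BASE = "http://rig-conductor.example.internal"


def conductor_url(path: str, **query: object) -> str:
    """Build a conductor URL for `path`, dropping query parameters set to None."""
    params = {k: v for k, v in query.items() if v is not None}
    url = urljoin(CONDUCTOR_BASE, path)
    return f"{url}?{urlencode(params)}" if params else url


# Paths and parameter names are taken from the endpoint list above.
next_assignment = conductor_url("/api/assignments/next", agentId="dev-e-node")
weekly_cost = conductor_url("/api/costs/summary", days=7)
health = conductor_url("/health")

print(next_assignment)
print(weekly_cost)
print(health)
```

Keeping URL construction separate from the HTTP call makes it easy to swap in whatever base address the cluster actually uses.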
rig-conductor Dashboard (the built-in cost/activity UI)¶
- Type: html-dashboard
- Source: dashecorp/rig-conductor (src/ConductorE.Api/Dashboard.html)
- Visibility: cluster-internal-only
- Note: 42 KB single-page HTML dashboard — "Engineering Rig — Control Plane". Has Costs, Issues, Agents, Streams tabs. Driven by `/api/costs/*`, `/api/usage`, `/api/issues`, `/api/streams/*` endpoints. No separate Grafana/Starlight dashboard is needed — this one already renders per-agent / per-issue / per-day cost.
Memory MCP (Postgres + pgvector)¶
- Type: mcp-server
- Package: @dashecorp/rig-memory-mcp
- Tools:
  - `read_memories` — Query prior memory by topic/repo/scope with vector similarity
  - `write_memory` — Persist a new memory with scope/kind/importance/tags
  - `mark_used` — Increment hit_count on a memory that informed a decision
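For orientation, MCP tools are invoked through a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a request envelope; the argument names (`topic`, `repo`) are assumptions inferred from the tool descriptions above, not the server's confirmed schema, and `tool_call` is a hypothetical helper.

```python
import json
from itertools import count

_ids = count(1)  # simple monotonically increasing request ids


def tool_call(name: str, arguments: dict) -> dict:
    """Wrap a tool invocation in the JSON-RPC 2.0 envelope used for MCP tools/call."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names are hypothetical; consult the server's tool schemas for the real ones.
read_req = tool_call("read_memories", {"topic": "merge-gate", "repo": "rig-conductor"})
print(json.dumps(read_req, indent=2))
```

In practice an MCP client library would build and send this envelope; the sketch is only meant to show where the tool name and arguments sit.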
Discord agent channels (notifications)¶
- Type: discord
- Channels: #dev-e, #review-e, #ibuild-e, #admin
- Note: Agents post thread updates here; humans watch for stuck / pending state.
Repos¶
Live from gh api /orgs/dashecorp/repos merged with facts/repos.yaml annotations. Archived repos are dropped automatically.
| Repo | Purpose | Language | Depends on | AGENTS.md |
|---|---|---|---|---|
| rig-gitops | GitOps manifests (Flux HelmReleases, Kustomize bases) and the canonical AGENTS.md shared by every rig repo via `@dashecorp/rig-gitops/AGENTS | shell | — | compiled |
| rig-agent-runtime | The AI agent runtime (Node) — one image that deploys as Dev-E, Review-E, or iBuild-E depending on character file + environment. Handles prom | javascript | rig-memory-mcp, rig-conductor | imports-rig-gitops |
| rig-memory-mcp | MCP server backing persistent agent memory with Postgres + pgvector. Exposes read_memories / write_memory / mark_used tools consumed b | javascript | postgres-pgvector | claude-md |
| rig-conductor | Event store + dispatch service (C# + Marten + Postgres). Receives PR/issue events, assigns work, tracks turns/cost/stuck state, serves the ` | csharp | postgres, pgvector | imports-rig-gitops |
| rig-docs | Research, proposals, user-stories, and rig-wide reference (Astro Starlight). This repo — you're reading its BRAIN.md. Deploys to research.ri | astro | — | hand |
| rig-tools | Shell scripts, Git hooks, and workflow sync for AI-assisted development. Developer tooling, not deployed. The one repo without an AGENTS.md | shell | — | none |
| infra | OpenTofu/Terraform for GitHub org settings, Cloudflare (DNS, Pages, tunnels), GCP (k3s cluster on a GCE VM (invotek-k3s) hosting the rig), a | hcl | — | imports-rig-gitops |
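The table is described as live repo data merged with annotations, with archived repos dropped. A minimal sketch of that merge follows, assuming the live list carries the `name`, `archived`, and `language` fields found in GitHub's repository objects; the annotation shape and the `old-experiment` repo are invented for illustration, and the real compile step may differ.

```python
def merge_repos(live: list[dict], annotations: dict[str, dict]) -> list[dict]:
    """Drop archived repos, then overlay annotations keyed by repo name."""
    merged = []
    for repo in live:
        if repo.get("archived"):
            continue  # archived repos never reach the table
        row = {"name": repo["name"], "language": repo.get("language")}
        row.update(annotations.get(repo["name"], {}))
        merged.append(row)
    return sorted(merged, key=lambda r: r["name"])


# Illustrative inputs only; "old-experiment" is a made-up archived repo.
live = [
    {"name": "rig-docs", "archived": False, "language": "Astro"},
    {"name": "old-experiment", "archived": True, "language": "Shell"},
]
annotations = {"rig-docs": {"agents_md": "hand"}}

rows = merge_repos(live, annotations)
print(rows)
```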
Per-repo doc index (token-efficient discovery)¶
Before cloning a repo to find docs, consult this list to decide which docs are relevant to your issue. Then fetch raw markdown for only the relevant ones:
Auto-derived per compile via gh api /repos/<r>/contents/docs. Repos without a docs/ dir are omitted.
- `rig-gitops` — architecture-current.md, architecture-proposed-v2.md, architecture-proposed.md, documentation-standard.md, onboarding.md, research-multi-agent-platforms.md, review-e-bootstrap.md, sops.md
- `rig-agent-runtime` — architecture.md, configuration.md, dashboard.md, deployment.md, discord-setup.md, heartbeat.md, index.md, memory.md, messaging.md, observability.md, quickstart.md, usage-tracking.md
- `rig-memory-mcp` — api.md
- `rig-conductor` — api.md, architecture.md, deployment.md, event-store.md, index.md, principles.md
- `rig-tools` — agent-workflow.md
Agents (deployment instances)¶
Dev-E — writes code¶
- Runtime: dashecorp/rig-agent-runtime
- Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
- Manifest: `dashecorp/rig-gitops/apps/dev-e/`
- Variants:
  - node: `apps/dev-e/rig-agent-helmrelease.yaml`
  - python: `apps/dev-e/python-helmrelease.yaml`
  - dotnet: `apps/dev-e/dotnet-helmrelease.yaml`
- Character: baked into HelmRelease values
- Triggers: rig-conductor dispatch (issue.assigned events)
Review-E — reviews PRs¶
- Runtime: dashecorp/rig-agent-runtime
- Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
- Manifest: `dashecorp/rig-gitops/apps/review-e/rig-agent-helmrelease.yaml`
- Cron: `*/5 * * * *`
- Search filter: `org:dashecorp is:pr is:open author:app/dev-e-bot author:app/ibuild-e-bot -reviewed-by:app/review-e-bot`
- Discord: #review-e
- Notes: The cron search_filter targets agent-bot authors only, but human/operator-authored PRs are still routed to Review-E by rig-conductor's ReviewScanService broad GitHub-poll (ReviewScanService.cs) — so operator PRs to rig repos ARE reviewed (verified: review-e-bot approved + auto-merged dashecorp/rig-conductor#1488, 2026-06-08).
iBuild-E — macOS / iOS builds¶
- Runtime: dashecorp/rig-agent-runtime
- Deployed in: Mac Mini (Oslo, on the operator's Tailnet)
- Manifest: not-in-cluster
- Discord: #ibuild-e
- Notes: Apple Silicon host, Xcode + App Store Connect. Auto-reauth cron refreshes OAuth every 5 min. Separate from the GCE-hosted agents because iOS builds require macOS.
Planner-E — plans sprints, manages backlog, assigns issues to agents¶
- Runtime: dashecorp/rig-agent-runtime
- Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
- Manifest: `dashecorp/rig-gitops/apps/rig-planner/`
- Triggers: signal:rig-planner LIST + assignments:rig-planner STREAM
- Discord: #planner
- Notes: GitHub App rig-planner-bot (App ID 3546083) handles GitHub issue intake. KEDA scales 0→1 on signal:rig-planner (Redis LIST); also reads assignments:rig-planner (Redis STREAM). Provider: claude-cli + claude-sonnet-4-6. Persona reference: /whitepaper/planner/.
Primary flows¶
PR lifecycle in dashecorp (orchestrator-owned — DO NOT copy legacy personal-org workflow files)¶
Trigger: Any PR opened in a dashecorp-org repo (`dashe-*`, `rig-*`, infra, etc.)
- GitHub — Fires webhook to POST rig-conductor /api/webhook/github
- rig-conductor — Normalizes the PR event, enforces gates (issue-link rule, labels), assigns review
- Review-E — Polls GET /api/reviews/next (or /api/pr-reviews/next for direct infra PRs), reviews, posts approval or CHANGES_REQUESTED
- rig-conductor — On approval + green CI + no unresolved threads + no `manual-merge` label, calls POST /api/merge to merge server-side
Rules:
- Do NOT copy the operator's per-repo .github/workflows/request-review.yml or auto-merge.yml from legacy personal-org repos into dashecorp repos. Those files are the legacy pattern from before rig-conductor. The conductor endpoints above own this lifecycle for dashecorp. If a dashecorp repo isn't getting reviewed or merged, the fix is to configure the GitHub webhook, not to add a workflow file.
- The operator's personal-org repos still use the per-repo workflow pattern because they predate rig-conductor's scope. That pattern stays until those repos are archived post-migration.
Complete when: conductor emits PR_MERGED event and downstream consumers (CF Pages, iBuild-E, etc.) react
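The merge gate in the final step of this flow is a conjunction of four conditions. The sketch below models it as a pure predicate; `PullRequestState` and `merge_allowed` are hypothetical names, and the real gate lives server-side in rig-conductor and may check more than this.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequestState:
    """Hypothetical snapshot of the facts the gate needs about one PR."""
    approved: bool
    ci_green: bool
    unresolved_threads: int
    labels: set[str] = field(default_factory=set)


def merge_allowed(pr: PullRequestState) -> bool:
    """True only when every gate condition named in the flow holds."""
    return (
        pr.approved
        and pr.ci_green
        and pr.unresolved_threads == 0
        and "manual-merge" not in pr.labels
    )


ready = PullRequestState(approved=True, ci_green=True, unresolved_threads=0)
held = PullRequestState(approved=True, ci_green=True, unresolved_threads=0,
                        labels={"manual-merge"})
print(merge_allowed(ready), merge_allowed(held))
```

Expressing the gate as a pure function keeps it easy to unit-test separately from webhook handling.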
Epic to merged work¶
Trigger: Human opens a user-story GitHub issue in dashecorp/rig-docs
- rig-conductor — Scans open issues, classifies, dispatches to appropriate agent
- Dev-E — Reads issue + relevant research; authors research / proposal / code PR
- Review-E (cron every 5 min) — Finds PR, reviews against AGENTS.md + memory, requests changes or approves
- Human — Merges (or Review-E's approval satisfies branch protection; auto-merge fires)
- Cloudflare Pages — Redeploys research.rig.dashecorp.com and docs.rig.dashecorp.com
Complete when: issue closed via `Closes
Research and proposal authoring¶
Trigger: An Epic needs investigation before implementation
- author dated research/YYYY-MM-DD-slug.md with user_story frontmatter
- author proposals/YYYY-MM-DD-slug.md with source_research frontmatter
- user_story file gets research_docs and proposal fields pointing back
- RelatedDocs component auto-renders the graph; no manual cross-linking
Rules:
- bidirectional links required
- schema enforced in src/content.config.ts
- CI rejects PRs missing required fields
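To make the bidirectional-link rule concrete, here is a small sketch that checks whether each research or proposal doc is linked back from its user story. It works on plain dictionaries keyed by doc path; the function name and the sample paths are invented for illustration, and the actual enforcement is whatever src/content.config.ts and CI implement.

```python
def missing_backlinks(docs: dict[str, dict]) -> list[str]:
    """Report forward links (doc -> user story) that lack the matching back link."""
    problems = []
    for path, fm in docs.items():
        story = fm.get("user_story")
        if story is None:
            continue
        story_fm = docs.get(story, {})
        # Research docs should be listed in the story's research_docs array.
        if fm.get("type") == "research" and path not in story_fm.get("research_docs", []):
            problems.append(f"{story} does not list {path} in research_docs")
        # A proposal should be the story's single proposal value.
        if fm.get("type") == "proposal" and story_fm.get("proposal") != path:
            problems.append(f"{story} does not point to {path} as its proposal")
    return problems


# Made-up example paths: the story links back to the research doc but not the proposal.
docs = {
    "user-stories/2026-01-01-example": {
        "type": "user-story",
        "research_docs": ["research/2026-01-02-example"],
    },
    "research/2026-01-02-example": {
        "type": "research",
        "user_story": "user-stories/2026-01-01-example",
    },
    "proposals/2026-01-03-example": {
        "type": "proposal",
        "user_story": "user-stories/2026-01-01-example",
    },
}
problems = missing_backlinks(docs)
print(problems)
```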
Cold-start agent session¶
Trigger: Fresh agent with blank memory receives an Epic or task
- WebFetch https://research.rig.dashecorp.com/brain/ (or raw BRAIN.md)
- Parse facts/repos.yaml equivalent in BRAIN.md — learn repo manifest
- Parse facts/surfaces.yaml equivalent — learn URLs and endpoints
- WebFetch https://research.rig.dashecorp.com/llms.txt for topic index
- WebFetch relevant research/proposal docs directly via raw URL
- For the target repo, fetch its AGENTS.md (compiled or imports-rig-gitops)
- read_memories scoped to repo + topic via Memory MCP
- Begin work with full context in ~15 KB total
Token budget: ~15 KB read, leaves 200K+ for actual work on Opus
Docs-memory promotion (weekly Lint)¶
Trigger: Weekly scheduled Lint job
- Scan Memory MCP for rows with importance >= 4 AND hit_count >= 5
- For each candidate, check if docs already cover the topic (BM25 sim)
- If not covered, propose a docs PR with the memory content promoted
- Human approves PR, merge triggers redeploy
Status: not-yet-built (design in research/2026-04-18-docs-memory-drift-lint)
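Since this flow is not yet built, the following is only a sketch of its first step: selecting promotion candidates by the two thresholds named above. The row shape beyond `importance` and `hit_count` is assumed, and `promotion_candidates` is a hypothetical name.

```python
def promotion_candidates(memories, min_importance=4, min_hits=5):
    """Keep memories that meet both thresholds from the flow above."""
    return [
        m for m in memories
        if m["importance"] >= min_importance and m["hit_count"] >= min_hits
    ]


# Illustrative rows; the real Memory MCP row shape is not specified in this document.
memories = [
    {"id": 1, "importance": 5, "hit_count": 9},
    {"id": 2, "importance": 4, "hit_count": 4},   # too few hits
    {"id": 3, "importance": 3, "hit_count": 20},  # importance too low
]
candidates = promotion_candidates(memories)
print([m["id"] for m in candidates])
```

The docs-coverage check (BM25 similarity) and the PR proposal would follow this filter and are left out here.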
Diagram-as-code authoring¶
Trigger: A research / proposal / user-story needs a diagram
Rule: Mermaid source inline in fenced code block. No PNG or SVG ever committed.
Rendering: remark-mermaid plugin wraps in <figure> with <pre class=mermaid> and <details> source; mermaid.js renders client-side; source preserved post-render for agent readers.
Frontmatter schema (for authoring rig-docs content)¶
- type (optional): one of `research|proposal|decision|postmortem|reference|user-story|runbook`
- audience (optional): one of `human|agent|both` — not a free-form array
- Required: `title`, `description`
- Optional linkage fields (paths are relative to src/content/docs/, no leading slash, no .md or .mdx extension):
  - `type` — See type enum above.
  - `subtype` — See subtype enum above (whitepapers only).
  - `audience` — See audience enum above.
  - `created` — ISO date string YYYY-MM-DD.
  - `updated` — ISO date string YYYY-MM-DD.
  - `topic` — Short slug grouping related docs.
  - `source_refs` — Array of URLs (external sources supporting this doc).
  - `supersedes` — Path to doc this replaces (no leading slash, no .md extension).
  - `superseded_by` — Path to newer doc that replaces this (same format).
  - `user_story` — (research/proposal only) Path to the user story this supports.
  - `research_docs` — (user-story only) Array of research doc paths this story spawned.
  - `proposal` — (user-story only) Path to the proposal answering this story.
  - `source_research` — (proposal only) Array of research paths this proposal synthesises.
  - `github_issue` — (user-story only) Full GitHub issue URL. Omit the field entirely if there is no issue — do NOT use empty string.
  - `whitepaper` — (user-story, optional) Slug of the single whitepaper this story primarily supports. Matches the whitepaper filename without extension (e.g. "safety", "memory", "observability"). Used by the Starlight sidebar to roll up story counts next to each whitepaper link at build time.
  - `whitepapers` — (user-story, optional) List form of `whitepaper:` — use when a story supports more than one whitepaper (e.g. a domain paper AND a synthesis paper). Format `whitepapers: [a, b]` or a block list. A story tagged for multiple papers counts on each paper's sidebar badge and appears in each paper's page-level Related list. Accepts both inline and block-list YAML. Mutually exclusive in spirit with `whitepaper:`, but supplying both is tolerated (values are merged and deduped).
Path examples: user-stories/2026-04-18-docs-memory-strategy, research/2026-04-18-docs-tools-evaluation, proposals/2026-04-18-docs-tooling-decision, decisions/2026-04-18-docs-tooling-decision.
Omit a field entirely when it has no value — do not use empty string.
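The rules above can be approximated by a small checker. This is a sketch over an already-parsed frontmatter dictionary, not the schema in src/content.config.ts; `frontmatter_errors` and the choice of which fields to path-check are assumptions made for illustration.

```python
TYPES = {"research", "proposal", "decision", "postmortem", "reference", "user-story", "runbook"}
AUDIENCES = {"human", "agent", "both"}
# Single-path linkage fields checked here (array-valued fields are skipped for brevity).
PATH_FIELDS = ("supersedes", "superseded_by", "user_story", "proposal")


def _in_enum(value, allowed) -> bool:
    """Enum values must be plain strings, so a list like ['human'] is rejected."""
    return isinstance(value, str) and value in allowed


def frontmatter_errors(fm: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the frontmatter passes."""
    errors = []
    for key in ("title", "description"):
        if not fm.get(key):
            errors.append(f"missing required field: {key}")
    if "type" in fm and not _in_enum(fm["type"], TYPES):
        errors.append(f"invalid type: {fm['type']!r}")
    if "audience" in fm and not _in_enum(fm["audience"], AUDIENCES):
        errors.append(f"invalid audience: {fm['audience']!r}")
    for key, value in fm.items():
        if value == "":
            errors.append(f"empty string in {key}: omit the field instead")
    for key in PATH_FIELDS:
        value = fm.get(key)
        if isinstance(value, str) and (value.startswith("/") or value.endswith((".md", ".mdx"))):
            errors.append(f"{key} must be a relative path without extension: {value}")
    return errors


good = {
    "title": "T", "description": "D", "type": "research", "audience": "both",
    "user_story": "user-stories/2026-04-18-docs-memory-strategy",
}
bad = {
    "title": "T", "type": "essay", "audience": ["human"],
    "github_issue": "", "proposal": "/proposals/x.md",
}
print(frontmatter_errors(good))
for problem in frontmatter_errors(bad):
    print(problem)
```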
Whitepapers (private — catalog only)¶
These whitepapers live at dashecorp/rig-gitops/docs/whitepaper/*.md (private repo — requires gh auth to fetch). BRAIN.md surfaces their titles + 1-line summaries so agents know what exists. Full content must be fetched with: gh api /repos/dashecorp/rig-gitops/contents/docs/whitepaper/<file> --jq .download_url | xargs curl -sL.
- Whitepaper index (`index.md`) — Entry point listing all whitepaper sections and their companion docs.
- MVP scope (`mvp-scope.md`) — What the rig does in the minimum viable release. Gatekeeper for "is this in scope?"
- Design principles (`principles.md`) — First principles (measurement precedes trust; honest gaps; provider portability).
- Trust model (`trust-model.md`) — Who can approve what, which gates exist, human-in-the-loop rules.
- Safety (`safety.md`) — Dangerous-command guards, sandboxing, blast-radius containment.
- Security (`security.md`) — Secrets handling, attestation, audit trail, SOPS+age.
- Agent secrets broker (`agent-secrets-broker.md`) — Capability-based secret lifecycle broker for LLM agents. Agents operate on opaque references; the broker handles plaintext across Bitwarden, GitHub, SOPS, k8s, and Cloudflare — plaintext never enters a prompt, tool argument, or log line. Covers tool surface (mint/store/deploy/rotate/retire/verify/list/generate_and_deploy), destination ref grammar (gh:, gh-env:, sops:, k8s:, cf-worker:, bw:), policy model with hardware-key override, and append-only audit schema. Complementary to security.md (supply-chain: Sigstore/SLSA/Kyverno); covers the runtime secret-lifecycle layer.
- Provider portability (`provider-portability.md`) — Multi-runtime (Claude Code, Codex CLI, Gemini CLI) via OTel GenAI conventions. Swap runtime without changing backend.
- Observability — OTel, Langfuse, Prometheus, SLOs (`observability.md`) — Self-hosted Langfuse (agent traces) + Grafana Cloud (infra) + local Prometheus (SLO gates) hybrid. Native OTel via `CLAUDE_CODE_ENABLE_TELEMETRY=1`. OTel Collector runs per-cluster, routes LLM traces to Langfuse, infra to managed. Per implementation-status: OTel Collector "Partial" (deployed for rig-conductor, agents not yet emitting), Langfuse "Planned", cost dashboard "Partial" (TokenUsageProjection exists, no LiteLLM proxy yet).
- Cost framework (`cost-framework.md`) — Budget policy, per-model rate tables, cost attribution strategy. Companion to observability.
- Self-healing (`self-healing.md`) — Automatic recovery loops, StaleHeartbeatService, escalation severity routing.
- Memory architecture (`memory.md`) — Memory MCP scope, importance/hit_count model, promotion-to-docs threshold design.
- Quality and evaluation (`quality-and-evaluation.md`) — How the rig evaluates its own output. Judge-agent pattern, fixed rubrics.
- Drift detection (`drift-detection.md`) — Schema drift, docs drift, infra drift — detection thresholds and response.
- Development process (`development-process.md`) — Issue → Epic → research → proposal → PR lifecycle, agent-human gates.
- Example first story (`example-first-story.md`) — Worked walkthrough of one Epic end-to-end.
- Glossary (`glossary.md`) — Rig-specific terminology (Epic, proposal, rig-conductor, Review-E, etc).
- Known limitations (`limitations.md`) — Honest catalog of what the rig can't do today.
- Implementation status (`implementation-status.md`) — Single source of truth for deployed vs planned per capability. 78 tracked across 11 domains; 21 deployed/partial (27%), 44 planned/deferred (56%). Every capability named in the whitepapers gets a row with status + whitepaper section + ticket/evidence.
- Tool choices (ADRs) (`tool-choices.md`) — Decision records for tooling. Includes rejection list with rationale.
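For agents that script the whitepaper fetch, the first half of the `gh api` pipeline quoted above can be assembled per file as follows. This is a sketch: `whitepaper_gh_args` is a hypothetical helper, it only builds the argument list, and running it still requires `gh` to be installed and authenticated against the private repo.

```python
WHITEPAPER_DIR = "docs/whitepaper"


def whitepaper_gh_args(filename: str) -> list[str]:
    """argv that asks the GitHub contents API for one whitepaper's download_url."""
    if "/" in filename or not filename.endswith(".md"):
        raise ValueError(f"expected a bare .md filename, got {filename!r}")
    path = f"/repos/dashecorp/rig-gitops/contents/{WHITEPAPER_DIR}/{filename}"
    return ["gh", "api", path, "--jq", ".download_url"]


args = whitepaper_gh_args("safety.md")
print(" ".join(args))
# The printed URL from that command would then be fetched, e.g. with curl -sL.
```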
Most agents should start with: the /implementation/ dashboard (structured per-capability status — see summary below) and whichever domain-specific whitepaper matches the Epic.
Capability status (38 in registry · full dashboard)¶
shipped:15 · partial:7 · planned:15 · deferred:0 (registry seed — full migration tracked in rig-docs#124)
Top blockers: default-deny-egress (dashecorp/rig-docs#57)
rig-conductor event types (POST /api/events)¶
All events from dashecorp/rig-conductor/src/ConductorE.Core/UseCases/SubmitEvent.cs MapToEvent switch. Names only here — fetch /events.md for full field schemas (no auth required).
Pipeline (issue → PR → merge → deploy): ISSUE_APPROVED, ISSUE_ASSIGNED, ISSUE_UNASSIGNED, WORK_STARTED, BRANCH_CREATED, PR_CREATED, CI_PASSED, CI_FAILED, REVIEW_ASSIGNED, REVIEW_PASSED, REVIEW_DISPUTED, HUMAN_GATE_TRIGGERED, HUMAN_GATE_REMINDER, MERGED, MERGE_GATE_WAITING, MERGE_GATE_MERGED, MERGE_GATE_TIMEOUT, MAIN_CI_STARTED, MAIN_CI_PASSED, MAIN_CI_FAILED, DEPLOYED_STAGING, DEPLOYED_PRODUCTION, SMOKE_PASSED, SMOKE_FAILED, BUILD_FAILED, VERIFIED, ISSUE_DONE, ESCALATED, MILESTONE_COMPLETE, DUPLICATE_PR_CLOSED
Direct PR path (no issue): PR_OPENED, PR_REVIEW_ASSIGNED, PR_REVIEW_APPROVED, PR_REVIEW_REJECTED
Agent lifecycle: AGENT_STARTED, HEARTBEAT, AGENT_STUCK
CLI sessions: CLI_STARTED, CLI_PROGRESS, CLI_COMPLETED
Observability (cost + tooling): TOKEN_USAGE, TOOL_USED
Memory MCP: MEMORY_WRITE, MEMORY_READ, MEMORY_HIT_USED
Known gaps (rig backlog)¶
Cold-start agents should see these so they don't re-discover what's already identified. Each gap links to prior_art — existing stubs, research, or PRs that have already touched it. When a gap is being worked, linked_user_story points to the user story; when closed, the entry is removed from facts/backlog.yaml.
[observability] Cost tracking mostly deployed — LiteLLM proxy + external access are the remaining gaps¶
DO NOT propose "build a cost pipeline" — most of it is already shipped:
- Data pipeline: TokenUsageProjection + CostProjection in rig-conductor consume TOKEN_USAGE + CLI_COMPLETED events. Read models live on Marten/Postgres.
- API: GET /api/usage, /api/costs/issue, /api/costs/summary, /api/costs/daily on the rig-conductor cluster-internal URL (see BRAIN.md Published surfaces). Query by agent, repo, date range.
- Dashboard: src/ConductorE.Api/Dashboard.html (~42 KB SPA, "Engineering Rig — Control Plane"). Served at / and /dashboard. Has a Costs tab driven by the /api/costs/* endpoints.
The remaining gaps:
a. LiteLLM proxy — not deployed. Blocks hard budget enforcement (agent ceiling kill-switch).
b. External access — /dashboard is cluster-internal. A human on laptop can't view it without kubectl port-forward or a Cloudflare tunnel. Consider publishing a read-only projection.
c. Alerting — no Discord webhook on cost threshold breach yet.
Rough current spend: ~$5-15/day fleet-wide (order-of-magnitude only).
Prior art:
- rig-conductor cost endpoints and Dashboard.html — dashecorp/rig-conductor src/ConductorE.Api/
- TokenUsageProjection + CostProjection source: dashecorp/rig-conductor src/ConductorE.Api/Adapters/MartenProjections.cs
- TOKEN_USAGE + CLI_COMPLETED events defined and emitted — see /events.md
- Cost framework design: rig-gitops/docs/whitepaper/cost-framework.md (private)
- Observability whitepaper: rig-gitops/docs/whitepaper/observability.md (private; summary in facts/whitepapers.yaml)
- LiteLLM proxy not yet deployed — blocks hard budget enforcement
Status: mostly-deployed
[observability] OTel collector deployed for rig-conductor only — agents not yet emitting¶
OpenTelemetry Collector is "Partial": deployed for rig-conductor; agent
pods (Dev-E, Review-E, iBuild-E) have not yet enabled native OTel via
CLAUDE_CODE_ENABLE_TELEMETRY=1. Langfuse (self-hosted) and Grafana
Cloud ingest are both "Planned". Full design in the observability
whitepaper.
Prior art:
- Observability whitepaper: rig-gitops/docs/whitepaper/observability.md (private; summary in facts/whitepapers.yaml)
- Implementation status: whitepaper/implementation-status.md marks OTel Collector 'Partial', Langfuse 'Planned'
- rig-memory-mcp/events.js FUTURE comment: migrate to OTel GenAI spans
- Env var to enable native OTel: CLAUDE_CODE_ENABLE_TELEMETRY=1 + OTEL_EXPORTER_OTLP_ENDPOINT pointed at the in-cluster collector
Status: partial
[docs-memory] Docs-memory drift lint not implemented¶
Weekly LLM-as-judge pass that promotes memory→docs (when importance≥4 AND hit_count≥5), flags stale research, catches orphan docs. Designed but no runtime built.
Prior art:
- Full design in research/2026-04-18-docs-memory-drift-lint
- Parent user story: user-stories/2026-04-18-docs-memory-strategy
- Principles synthesis: research/2026-04-18-docs-vs-memory-principles
Linked user story: user-stories/2026-04-18-docs-memory-strategy
Status: open
[docs-surfaces] Two docs surfaces with overlapping scope¶
docs.rig.dashecorp.com (MkDocs aggregation from rig-gitops/docs-site/) and research.rig.dashecorp.com (Starlight research hub from dashecorp/rig-docs). Both host rig docs; boundaries not formalised. Agents currently learn this empirically. Eventually unify or formalise the split.
Prior art:
- MkDocs site built by dashecorp/rig-gitops/scripts/build-docs.sh
- Starlight site defined in dashecorp/rig-docs/ (this repo)
- Docs tooling decision: decisions/2026-04-18-docs-tooling-decision (picked Starlight for research hub; MkDocs kept for aggregation)
Status: open
[deployment] CLOUDFLARE_API_TOKEN / CLOUDFLARE_ACCOUNT_ID not in rig-docs repo secrets¶
The deploy workflow gracefully skips deploy when secrets absent (notice
only). Current deploys happen via direct wrangler pages deploy from
the operator's laptop. Adding the secrets would enable per-PR preview
deploys and automatic main-branch publishing.
Prior art:
- .github/workflows/deploy.yml has the has_cf_secrets guard
- Cloudflare Pages project already exists: rig-research (created via wrangler)
Status: open
[agents] ATL-E retired, no active coordinator agent¶
ATL-E (a legacy personal-org atl-agent repo) was previously deployed as
a k3s CronJob on a personal host and handled handoff-stall Discord
notifications. As of ~2026-03-26 it is no longer deployed (not present
in the operator's personal-org cluster GitOps manifests). The repo still
exists but is dormant. If an Epic needs a coordinator/team-lead role,
decide whether to redeploy ATL-E or build a replacement.
Prior art:
- Dormant personal-org atl-agent repo (last push 2026-03-26)
- Operator's personal-org cluster GitOps repo — no atl-agent ArgoCD manifest
Status: open
[networking] iBuild-E cannot reach rig-conductor cluster-internal API¶
Empirically verified on 2026-04-19: from iBuild-E (Mac Mini, Oslo, on the
operator's Tailnet), curling the conductor's in-cluster API endpoint
(/api/health) fails with DNS resolve timeout. The cluster-internal DNS
name only resolves inside the k3s cluster via CoreDNS; Tailscale connects
the host but doesn't federate cluster DNS.
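The distinction drawn here (name resolution fails, as opposed to the service being slow) can be checked with a resolver-only probe. The sketch below uses the standard library's `socket.getaddrinfo`; `can_resolve` is a hypothetical helper, and the cluster service name in the comment is a made-up placeholder because the real in-cluster DNS name is not published.

```python
import socket


def can_resolve(host: str, port: int = 80) -> bool:
    """Return True if the local resolver can map `host` to at least one address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except OSError:  # socket.gaierror (resolution failure) is a subclass of OSError
        return False


# A numeric address needs no DNS lookup, so this succeeds anywhere.
print(can_resolve("127.0.0.1"))

# From a host outside the cluster, a cluster-only name would be expected to fail:
# can_resolve("rig-conductor.example.svc.cluster.local")  # hypothetical name
```

A probe like this separates "cannot resolve" from "resolves but is unreachable or slow", which is the ambiguity the earlier cold-start sessions kept flagging.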
Impact: iBuild-E today cannot:
- Send TOKEN_USAGE / HEARTBEAT / CLI_COMPLETED events (POST /api/events)
- Pick up assignments (GET /api/assignments/next)
- Reach the cost Dashboard or /api/costs/*
iBuild-E is effectively disconnected from rig-conductor coordination. She operates from GitHub issues + Discord channels directly.
Fix options (none implemented):
a. Tailscale subnet router on a cluster node → expose the cluster service range
b. Ingress / GCP load balancer for the conductor API with mTLS
c. Cloudflare tunnel into the cluster
d. Accept the gap: iBuild-E never sees rig-conductor; she runs on GitHub-only flows
This has been a chronic "unknown" flagged by every cold-start test (v1 through v5). Now measured.
Prior art:
- facts/agents.yaml — iBuild-E: deployed_in: Mac Mini (Oslo, on the operator's Tailnet)
- curl to the conductor's in-cluster API endpoint → DNS resolve timeout after 3s (measured 2026-04-19)
- Every cold-start test session-log flagged 'iBuild-E routing through cluster-internal services — latency unknown'. Not latency — reachability. Zero, not high.
Status: open
[cleanup] Plane residue — uninstall GitHub App + archive workspace¶
Plane was retired 2026-04-18 but the makeplane GitHub App is still installed on the dashecorp org, and the Plane workspace at app.plane.so is still alive (token revoked). Manual UI action needed.
Prior art:
- Retraction decision: decisions/2026-04-18-docs-tooling-decision (What retires section)
- Retirement commit: dashecorp/infra PR #74
Status: open
Architecture at a glance¶
```mermaid
flowchart LR
    H[Human]
    subgraph Code["Code repos"]
        RD[rig-docs]
        RG[rig-gitops]
        RAR[rig-agent-runtime]
        CE_R[rig-conductor]
        RMM_R[rig-memory-mcp]
        RT[rig-tools]
        INF[infra]
    end
    subgraph Deployed["Deployed services + agents"]
        direction TB
        CE[rig-conductor svc]
        RMM[rig-memory-mcp svc]
        DE[Dev-E pod]
        RE[Review-E cron]
        IB[iBuild-E — Mac Mini]
    end
    subgraph Publish["Published surfaces"]
        direction TB
        S1[research.rig.dashecorp.com<br/>Astro Starlight]
        S2[docs.rig.dashecorp.com<br/>MkDocs aggregator]
        CFP[Cloudflare Pages]
    end
    %% Authoring + dispatch
    H -->|user-story issue| RD
    RD -->|dispatch| CE
    CE -->|assign issue| DE
    CE -->|assign PR review| RE
    CE -->|assign iOS build| IB
    DE -->|author PR| RD
    RD -->|PR opens| RE
    RE -->|approve / request changes| RD
    RD -->|merge| CFP
    CFP -->|publish| S1
    RG -->|docs aggregation| S2
    %% MCP + memory
    DE -->|tool use| RMM
    RE -->|tool use| RMM
    IB -->|tool use| RMM
    RMM_R -.implements.-> RMM
    %% Flux GitOps
    RG -->|Flux deploys| CE
    RG -->|Flux deploys| RMM
    RG -->|Flux deploys| DE
    RG -->|Flux deploys| RE
    %% Runtime image used by all agent deployments
    RAR -.image.-> DE
    RAR -.image.-> RE
    RAR -.image.-> IB
    CE_R -.image.-> CE
    %% Per-repo docs/ feeding into the MkDocs aggregator
    RG -.docs/.-> S2
    RAR -.docs/.-> S2
    CE_R -.docs/.-> S2
    RMM_R -.docs/.-> S2
    RT -.docs/.-> S2
    %% Infra — outside the loop but manages everything above
    INF -.OpenTofu.-> CFP
```
Legend: solid arrows are runtime flows (dispatch, tool calls, deploys). Dashed arrows are source-of relationships — "this repo's image powers that pod" or "this repo's docs/ feeds that site". Every rig repo from facts/repos.yaml is represented.
Conventions (rig-wide)¶
- Docs are markdown with YAML frontmatter. Required fields: `title`, `description`, `type`, `audience`, `created`/`updated`, `topic`. See AGENTS.md in this repo.
- Bidirectional linkage. User story ↔ research ↔ proposal → decision via `research_docs`, `proposal`, `user_story`, `source_research`, `supersedes`/`superseded_by`. RelatedDocs component renders the graph.
- Diagrams as code. Mermaid source inline in markdown. No PNG or SVG committed. Source preserved post-render via `<details>` blocks.
- Per-repo CLAUDE.md auto-loads when Claude Code starts a session in that repo's cwd (Claude Code reads `CLAUDE.md`, not `AGENTS.md`; the cross-vendor standard is AGENTS.md but the loader is CLAUDE.md). Same-repo local `@AGENTS.md` imports work; cross-repo `@owner/repo/file` does not fetch from GitHub (filesystem-only, max 5 hops).
- Rig-wide agent instructions live in TWO places: (1) each running agent's HelmRelease `character.personality` prompt (authoritative for Dev-E, Review-E in-cluster), (2) each repo's root `CLAUDE.md` (authoritative for interactive sessions). Both include the BRAIN.md fetch at session start.
- Closes #N required in PR bodies. Review-E blocks on this.
- Memory MCP scope: operational / ephemeral state only. Durable knowledge goes to rig-docs.
- Default to a two-PR split for feature work >500 LOC. `large-pr-ok` is reserved for migrations, codemods, dependency bumps, and generated code, not feature work that decomposes into policy + adapter. A/B-validated 2026-05-18: same code shipped as a labelled single PR got zero code-level feedback; the disciplined split caught 3 real bugs. Rig-side enforcement in rar#492; full decision tree in research/2026-05-18-pr-size-and-large-pr-ok-semantics.
- Behavior PRs ship their doc updates in the same PR. Per-file convention: when `src/<X>.{cs,js,ts,go,py,...}` changes, `docs/<X>.md` (if it exists) updates alongside. Rig-side enforcement in rar#497 (`detectDocMismatches` surfaces a warning in the size-gate review body).
- Three-layer drift-prevention playbook. When the operator repeatedly catches the orchestrator drifting on a discipline, and the drift is structurally observable and has a measurable cost: ship an L1 memory rule + L2 rig-side enforcement at the trigger point + an L3 durable artifact. Three instances codified the week of 2026-05-18 (PR-split shortcut, doc-staleness, main-guard rig-internal dispatch). Meta-playbook in research/2026-05-18-three-layer-drift-prevention-playbook.
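The three PR-level conventions above (a `Closes #N` reference, the >500 LOC split default, and per-file doc updates) can be expressed as small checks. This is an illustrative sketch only, not the rig-side enforcement in rar#492 or rar#497: the function names, the case-insensitive match on "Closes", and the choice to map every `src/` file extension to a `docs/<X>.md` path are assumptions made here.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, List, Set

SPLIT_THRESHOLD_LOC = 500       # threshold from the convention above
LARGE_PR_LABEL = "large-pr-ok"  # label named in the convention above
# Assumption: match "Closes #N" case-insensitively.
CLOSES_RE = re.compile(r"\bcloses\s+#\d+\b", re.IGNORECASE)


def has_closes_reference(pr_body: str) -> bool:
    """True when the PR body contains a 'Closes #N' reference."""
    return CLOSES_RE.search(pr_body) is not None


def should_split(changed_loc: int, labels: Iterable[str]) -> bool:
    """True when the PR exceeds the threshold and lacks the large-pr-ok label."""
    return changed_loc > SPLIT_THRESHOLD_LOC and LARGE_PR_LABEL not in set(labels)


def doc_mismatches(changed: Iterable[str], existing_docs: Set[str]) -> List[str]:
    """Return docs/<X>.md paths that exist but were not changed with src/<X>.*."""
    changed_set = set(changed)
    stale = []
    for path in sorted(changed_set):
        parts = PurePosixPath(path).parts
        if len(parts) < 2 or parts[0] != "src":
            continue
        # Map src/<X>.<ext> to docs/<X>.md, keeping any subdirectories.
        doc = str(PurePosixPath("docs", *parts[1:]).with_suffix(".md"))
        if doc in existing_docs and doc not in changed_set:
            stale.append(doc)
    return stale
```

For example, a PR that touches `src/router.ts` while `docs/router.md` exists and is unchanged would yield `["docs/router.md"]` from `doc_mismatches`. A doc that does not exist is never reported, matching the "(if it exists)" clause.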
Token-efficient cold start¶
When you pick up a new Epic with blank memory, the cheapest order of operations:
- Fetch this file (`https://research.rig.dashecorp.com/BRAIN.md`, public, no auth): ~27 KB.
- Fetch `/llms.txt` for the research hub topic index: ~2 KB.
- Identify 1-3 relevant research / proposal docs, fetch raw: ~5-15 KB.
- Fetch target repo's `AGENTS.md` (each repo's is ≤8 KB): ~5 KB.
- `read_memories` from Memory MCP scoped to repo + topic: ~2 KB.
Total cold-start context: ~41-51 KB, the sum of the estimates above. Leaves the rest of the budget for actual work.
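As a sanity check on the budget, the per-step estimates can be summed. A minimal sketch: the step labels and the helper names `total_budget_kb` and `fits` are chosen here for illustration, and the sizes are the approximate figures from the list above.

```python
from typing import List, Tuple

# (step, min_kb, max_kb), copied from the per-step estimates above.
COLD_START_STEPS: List[Tuple[str, int, int]] = [
    ("BRAIN.md", 27, 27),
    ("/llms.txt topic index", 2, 2),
    ("1-3 research / proposal docs", 5, 15),
    ("target repo AGENTS.md", 5, 5),
    ("read_memories (repo + topic)", 2, 2),
]


def total_budget_kb(steps: List[Tuple[str, int, int]]) -> Tuple[int, int]:
    """Return the (low, high) total in KB across all steps."""
    low = sum(step[1] for step in steps)
    high = sum(step[2] for step in steps)
    return low, high


def fits(steps: List[Tuple[str, int, int]], budget_kb: int) -> bool:
    """True when even the high estimate stays within budget_kb."""
    return total_budget_kb(steps)[1] <= budget_kb
```

With these figures, `total_budget_kb(COLD_START_STEPS)` returns `(41, 51)`. The spread comes entirely from how many research / proposal docs are fetched in the third step.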
When this file needs updating¶
Manual fields that live in facts/*.yaml — update when the matching reality changes:
- `facts/repos.yaml`: annotations only (`purpose`, `depends_on`, `used_by`, `agents_md`, `docs_surface`). The repo list itself is auto-derived from `gh api` on every compile. Adding a new annotation, or updating an existing one, happens here.
- `facts/surfaces.yaml`: URLs, API endpoints, MCP tools. Update when an endpoint changes or a new surface is published.
- `facts/agents.yaml`: agent deployment instances. Compile validates each `manifest:` path exists on GitHub and warns on drift (how ATL-E retirement was caught).
- `facts/flows.yaml`: documented rig processes. Update after retrospectives.
- `facts/schema.yaml`: mirrors the Zod schema in `src/content.config.ts`. Keep in sync manually when the schema changes.
- `facts/events.yaml`: rig-conductor event types. Keep in sync with `MapToEvent` in the C# source.
- `facts/backlog.yaml`: known gaps. Add when identified; remove when closed.
Then run `npm run brain`. CI (build workflow) runs `brain:check` and fails on drift.
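The "fails on drift" idea is a regenerate-and-compare check: compile the file again from `facts/`, diff it against the committed copy, and return a non-zero status if they differ. The sketch below shows that general technique in Python. It is not the implementation behind `brain:check` (which runs under npm), and the names `drift_diff` and `check` are hypothetical.

```python
import difflib
from typing import List


def drift_diff(generated: str, committed: str, name: str = "BRAIN.md") -> List[str]:
    """Return unified-diff lines between the committed file and a fresh compile.

    An empty list means the two are in sync. Lines are compared without
    their terminators, so a trailing-newline-only difference is ignored.
    """
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            generated.splitlines(),
            fromfile=f"{name} (committed)",
            tofile=f"{name} (regenerated)",
            lineterm="",
        )
    )


def check(generated: str, committed: str) -> int:
    """Exit-code style result: 0 when in sync, 1 on drift (diff is printed)."""
    diff = drift_diff(generated, committed)
    for line in diff:
        print(line)
    return 1 if diff else 0
```

A CI step built this way would pass the freshly compiled text and the checked-in file contents to `check` and use the return value as the process exit code, so a hand-edited BRAIN.md or a stale `facts/*.yaml` compile fails the build with a readable diff.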