API Reference
Base URL: http://rig-conductor-api:8080 (internal ClusterIP)
Endpoints
Health
Deep Health (rc#1188)
Returns the status of every dependent system the conductor needs to function. Critical deps (Valkey, Marten, GitHub API) drive the readiness verdict; non-critical deps (Discord) can soft-degrade the overall status to Degraded but cannot trigger the 503.
Status-code mapping:
| Overall | HTTP | Meaning |
|---|---|---|
| Ok | 200 | all critical deps reachable; non-critical deps Ok |
| Degraded | 200 | reachable but slow / rate-limited / a non-critical dep is non-Ok; pod stays in the load balancer |
| Unreachable | 503 | at least one critical dep is unreachable; readiness probe should pull the pod |
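A minimal sketch of how the mapping above could be computed from a list of per-dependency results. The function name `overall_verdict` is hypothetical, and the rule that any non-Ok status short of a critical Unreachable yields Degraded is read from the table, not taken from the conductor's source.

```python
from typing import Iterable, Tuple


def overall_verdict(deps: Iterable[dict]) -> Tuple[str, int]:
    """Fold per-dependency results into (overall, http_status)."""
    deps = list(deps)
    # Only a critical dependency that is Unreachable may trip the 503.
    if any(d["critical"] and d["status"] == "Unreachable" for d in deps):
        return "Unreachable", 503
    # Any other non-Ok status (a Degraded critical dep, or a non-critical dep
    # that is Degraded or Unreachable) soft-degrades but keeps HTTP 200.
    if any(d["status"] != "Ok" for d in deps):
        return "Degraded", 200
    return "Ok", 200
```

With this reading, an unreachable Discord yields `("Degraded", 200)`, while an unreachable Valkey yields `("Unreachable", 503)`.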
Post-merge baseline (PR-B1 only):
Populated response (after PR-B2 wires production checkers):
{
"overall": "Degraded",
"dependencies": [
{ "name": "valkey", "status": "Ok", "critical": true, "reason": null, "latencyMs": 3, "lastCheckAt": "..." },
{ "name": "marten", "status": "Ok", "critical": true, "reason": null, "latencyMs": 12, "lastCheckAt": "..." },
{ "name": "github", "status": "Degraded", "critical": true, "reason": "http 429", "latencyMs": 88, "lastCheckAt": "..." },
{ "name": "discord", "status": "Ok", "critical": false, "reason": null, "latencyMs": 41, "lastCheckAt": "..." }
],
"checkedAt": "2026-05-19T13:42:00Z"
}
reason is null on Ok and short on non-Ok ("high ping latency", "http 429", "ping failed: RedisConnectionException", etc.). latencyMs is the per-checker round-trip; the orchestrator runs checkers in parallel with a 2-second per-checker hard timeout that maps to Status=Degraded, reason="timeout".
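The parallel run with a per-checker timeout could look roughly like the sketch below. It is written in Python with `asyncio` for illustration only; the conductor's actual orchestrator is not shown on this page, and `run_one`, `run_all`, and the checker signature are hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

# Hypothetical checker shape: an async callable returning {"status", "reason"}.
Checker = Callable[[], Awaitable[dict]]


async def run_one(name: str, check: Checker, timeout_s: float) -> dict:
    started = time.monotonic()
    try:
        result = await asyncio.wait_for(check(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # A checker that exceeds the hard timeout is reported as Degraded.
        result = {"status": "Degraded", "reason": "timeout"}
    latency_ms = int((time.monotonic() - started) * 1000)
    return {"name": name, **result, "latencyMs": latency_ms}


async def run_all(checkers: Dict[str, Checker], timeout_s: float = 2.0) -> List[dict]:
    # gather() runs all checkers concurrently and preserves input order.
    return await asyncio.gather(
        *(run_one(name, check, timeout_s) for name, check in checkers.items())
    )
```

Because each checker is wrapped individually, one slow dependency cannot delay the others beyond its own timeout.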
Rollout slices:
- PR-B1 (this slice): port + orchestrator + /healthz/deep endpoint; empty deps array
- PR-B2 (rc#1204): four production checkers (Valkey / Marten / GitHub API / Discord) + DI
- PR-D: Kubernetes readiness + liveness probes switched from /health to /healthz/deep + decision doc
See rc#1188 for the overall design and rc#1173 for the Valkey silent-degrade incident that motivated it.
Submit Event
Submit any rig event. The stream ID is derived from repo#issueNumber for issue events, or agentId for heartbeats.
Request:
{
"type": "ISSUE_APPROVED",
"repo": "dashecorp/rig-conductor",
"issueNumber": 42,
"title": "feat: Add health check endpoint",
"priority": "normal",
"dependsOn": []
}
Response (200):
{
"streamId": "dashecorp/rig-conductor#42",
"type": "ISSUE_APPROVED",
"timestamp": "2026-04-02T06:56:22Z"
}
Error (400):
See Event Store for all event types and their fields.
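As a rough illustration of the stream ID rule described in this section, the sketch below derives the ID from an event payload. Deciding by field presence is an assumption made for this example; how the conductor classifies event types other than issue events and heartbeats is not covered here, and `derive_stream_id` is a hypothetical name.

```python
def derive_stream_id(event: dict) -> str:
    """Return repo#issueNumber for issue events, agentId for heartbeats."""
    # Assumption: issue events carry both repo and issueNumber.
    if "repo" in event and "issueNumber" in event:
        return f'{event["repo"]}#{event["issueNumber"]}'
    # Assumption: heartbeats carry an agentId and no issue reference.
    if "agentId" in event:
        return event["agentId"]
    raise ValueError("event has neither repo/issueNumber nor agentId")
```

For the request shown above this gives `dashecorp/rig-conductor#42`, matching the `streamId` in the response.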
Get Issues
Returns all tracked issues, optionally filtered by state.
States: queued, assigned, in_progress, in_review, deploying, done, failed
Get Issue Trace (rc#951)
Returns the ordered Marten event history for one issue stream, plus the current IssueStatus.State from the projection. Events are sorted by Marten's global Sequence, which is authoritative; timestamps are not used for ordering because they can collide or arrive out of order under clock skew.
Returns 404 Not Found when the stream has zero events (the issue never landed in conductor).
{
"issueId": "dashecorp/rig-conductor#244",
"currentState": "in_progress",
"eventCount": 28,
"events": [
{"at": "2026-05-14T07:26:17Z", "type": "work_started", "data": {...}, "sequence": 364467},
{"at": "2026-05-14T07:26:46Z", "type": "merged", "data": {...}, "sequence": 364470}
]
}
Replaces the prior psql-into-Marten + log-tail workflow (~10 min per triage cycle). Also feeds the rc#947 SelfImprovementService a stable per-issue event-sequence API.
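A client that merges or re-sorts trace events should key on `sequence`, not `at`. The sketch below shows the idea on a hand-made payload; `ordered_events` is a hypothetical helper and the sample events are invented for illustration.

```python
def ordered_events(trace: dict) -> list:
    """Sort trace events by Marten's global sequence, not by timestamp."""
    return sorted(trace["events"], key=lambda event: event["sequence"])


# Two events share a timestamp; only the sequence disambiguates them.
sample_trace = {
    "events": [
        {"at": "2026-05-14T07:26:17Z", "type": "merged", "sequence": 364470},
        {"at": "2026-05-14T07:26:17Z", "type": "work_started", "sequence": 364467},
    ]
}
event_types = [event["type"] for event in ordered_events(sample_trace)]
```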
Get Priority Queue
Returns unassigned issues sorted by priority (critical > high > normal), with the oldest issue first within each priority level.
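One way to read the ordering is priority rank first, then age. The sketch below assumes that "oldest" is a tie-breaker within a priority level and that each issue carries a `queuedAt` timestamp; both are assumptions for illustration, and `queuedAt` and `queue_order` are hypothetical names.

```python
# Lower rank sorts first.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2}


def queue_order(issues: list) -> list:
    """Sort by priority rank, then oldest first within the same priority."""
    # ISO-8601 UTC timestamps in the same format sort correctly as strings.
    return sorted(
        issues,
        key=lambda issue: (PRIORITY_RANK[issue["priority"]], issue["queuedAt"]),
    )
```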
Get Next Assignment
Returns the top-priority unassigned issue, or 204 No Content if nothing available.
Response (200):
{
"streamId": "dashecorp/rig-conductor#42",
"issue": {
"number": 42,
"repo": "dashecorp/rig-conductor",
"title": "feat: Add health check endpoint",
"milestone": null
},
"priority": "normal",
"attempt": 1
}
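A caller needs to treat 204 as "nothing to do" and not as an error. A minimal sketch of that handling, assuming the caller already has the status code and raw body from its HTTP client; `parse_next_assignment` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_next_assignment(status_code: int, body: bytes) -> Optional[dict]:
    """Return the assignment on 200, None on 204, and raise otherwise."""
    if status_code == 204:
        return None  # queue is empty; the body is empty too
    if status_code == 200:
        return json.loads(body)
    raise RuntimeError(f"unexpected status {status_code}")
```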
Get Agent Status
Returns status of all known agents. The status and currentIssue/currentRepo fields are
updated by both heartbeat events and assignment events (ISSUE_ASSIGNED, WORK_STARTED)
so the dashboard reflects live work immediately without waiting for a heartbeat cycle.
Each item includes:
- isOnline: computed from recent heartbeat freshness (last heartbeat within 2 minutes)
- status: idle, working, or stuck. Set to working when ISSUE_ASSIGNED or WORK_STARTED is dispatched; cleared to idle on ISSUE_DONE, ISSUE_CANCELLED, or the next idle heartbeat.
- currentIssue: issue number the agent is currently working on (null if idle)
- currentRepo: repo of the current issue (null if idle)
- activeProvider
- availableProviders
- providers[]: provider health snapshot
- integrations[]: integration health snapshot
- lastModel: last LLM model the agent reported via TOKEN_USAGE (e.g. claude-opus-4-7). null until the agent emits its first TOKEN_USAGE event after deploy. Added in #530 (rc#529) so the dashboard can show which agents have switched models after a config change.
Response:
[
{
"id": "dev-e-1",
"status": "working",
"isOnline": true,
"currentIssue": 136,
"currentRepo": "dashecorp/rig-conductor",
"activeProvider": "claude",
"availableProviders": ["claude", "codex"],
"providers": [
{ "name": "claude", "status": "ready", "details": "Claude Code auth configured", "active": true },
{ "name": "codex", "status": "authenticated", "details": "logged in using ChatGPT", "active": false }
],
"integrations": [
{ "name": "github", "status": "ready", "details": "GitHub App configured" },
{ "name": "discord-webhook", "status": "ready", "details": "operator webhook configured" }
],
"lastModel": "claude-opus-4-7",
"lastHeartbeat": "2026-04-02T09:14:22Z",
"issuesCompleted": 12,
"issuesFailed": 1
}
]
lastModel is null for agents that have not yet emitted a TOKEN_USAGE event after deploy.
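The isOnline freshness rule can be reproduced client-side from lastHeartbeat, for example to cross-check a dashboard. This is a sketch only: treating exactly two minutes as still online is an assumption, and `is_online` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

ONLINE_WINDOW = timedelta(minutes=2)


def is_online(last_heartbeat: str, now: datetime) -> bool:
    """True when the last heartbeat is no older than the 2-minute window."""
    # fromisoformat() accepts a trailing "Z" only from Python 3.11 onward,
    # so normalise it to an explicit UTC offset first.
    seen = datetime.fromisoformat(last_heartbeat.replace("Z", "+00:00"))
    return now - seen <= ONLINE_WINDOW
```

Passing `now` explicitly keeps the function deterministic, for instance `is_online("2026-04-02T09:14:22Z", datetime.now(timezone.utc))`.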
Get Execution Logs
Returns recent execution log summaries (without full step/log payload). Used by the dashboard overview.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | int | 50 | Max results to return |
| status | string | (all) | Filter by status. Must be one of: running, completed, failed, stuck. Case-insensitive: RUNNING, Running, and running are all accepted and treated identically. |
Status filter validation: If status is provided and is not one of the allowed values, the endpoint returns 400 Bad Request:
Note: cancelled is not a valid status; the domain uses stuck to represent issues that have stalled. Sending ?status=cancelled returns 400.
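A client can apply the same validation before sending the request to avoid a round trip that ends in 400. The sketch below mirrors the documented rules; `normalize_status` is a hypothetical helper, not part of the API.

```python
from typing import Optional

ALLOWED_STATUSES = {"running", "completed", "failed", "stuck"}


def normalize_status(raw: Optional[str]) -> Optional[str]:
    """Lower-case a status filter and reject values the endpoint would 400 on."""
    if raw is None:
        return None  # no filter: all statuses are returned
    value = raw.lower()
    if value not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status filter: {raw!r}")
    return value
```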
Response (200 OK):
[
{
"id": "uuid",
"repo": "dashecorp/rig-conductor",
"issueNumber": 42,
"prNumber": 99,
"agentId": "dev-e-1",
"status": "completed",
"startedAt": "2026-04-23T10:00:00Z",
"completedAt": "2026-04-23T10:12:00Z",
"durationSeconds": 720,
"totalCostUsd": 0.18,
"totalTurns": 24,
"model": "claude-sonnet-4-5",
"stepCount": 7,
"logCount": 130
}
]
Get Event Stream
Returns all events for a specific stream. The id parameter must be URL-encoded (use %23 for #).
Response:
[
{
"id": "uuid",
"type": "IssueApproved",
"data": { "repo": "...", "issueNumber": 42, "title": "..." },
"timestamp": "2026-04-02T06:56:22Z"
}
]
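Encoding the stream ID by hand is error-prone, so a client would normally use a library call. A sketch with Python's `urllib.parse.quote` follows. Whether the `/` in the repo slug must also be encoded depends on the route, which is not shown in this section, so that is exposed as a flag here; `encode_stream_id` is a hypothetical helper.

```python
from urllib.parse import quote


def encode_stream_id(stream_id: str, encode_slash: bool = True) -> str:
    """Percent-encode a stream ID for use in a URL path segment."""
    # "#" is always encoded as %23; "/" is encoded unless listed as safe.
    safe = "" if encode_slash else "/"
    return quote(stream_id, safe=safe)
```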
Token Usage
GET /api/usage
GET /api/usage?agentId=dev-e-node
GET /api/usage?repo=dashecorp%2Frig-conductor
GET /api/usage?days=7
Returns per-agent token usage totals including cache token counts.
Query parameters:
| Param | Default | Description |
|---|---|---|
| agentId | (all) | Filter to a single agent |
| repo | (all repos) | Filter to a specific repo |
| days | (all time) | Rolling window in days. Required to compare with /api/costs/summary; without it the endpoint returns all-time projection totals, which are naturally larger than any windowed cost summary. |
Response:
[
{
"agentId": "dev-e-node",
"totalInputTokens": 1200000,
"totalOutputTokens": 480000,
"totalCacheReadTokens": 5600000,
"totalCacheCreationTokens": 920000,
"totalCostUsd": 39.591,
"byRepo": [
{
"repo": "dashecorp/rig-conductor",
"inputTokens": 800000,
"outputTokens": 300000,
"cacheReadTokens": 3200000,
"cacheCreationTokens": 600000,
"costUsd": 25.4
}
]
}
]
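The query variants listed at the top of this section can be built with a small helper so that the repo slug is always encoded. This is a sketch; `usage_url` is a hypothetical name, and the base URL is the internal ClusterIP address given at the top of this page.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://rig-conductor-api:8080"


def usage_url(agent_id: Optional[str] = None,
              repo: Optional[str] = None,
              days: Optional[int] = None) -> str:
    """Build a /api/usage URL, including only the filters that are set."""
    params = {}
    if agent_id is not None:
        params["agentId"] = agent_id
    if repo is not None:
        params["repo"] = repo  # urlencode turns "/" into %2F
    if days is not None:
        params["days"] = days
    query = urlencode(params)
    return f"{BASE_URL}/api/usage" + (f"?{query}" if query else "")
```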
Cost Summary
Returns agent-level cost breakdown by category for the rolling window. Uses the same raw-event source as /api/usage?days=N — both endpoints agree for the same window.
Cost computation (fixed in #148):
- Events with zero tokens (inputTokens=outputTokens=cacheReadTokens=cacheCreationTokens=0) contribute $0, regardless of any costUsd value in the event. This prevents phantom idle costs.
- Events with cache tokens (cacheReadTokens > 0 || cacheCreationTokens > 0) are recomputed using the Anthropic pricing table in AnthropicPricing rather than trusting the agent-reported costUsd, which historically excluded cache costs.
- Legacy events (no cache token fields) continue to use the agent-reported costUsd.
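The three rules above can be sketched as a single function. This is an illustration, not the conductor's implementation: the rate table is passed in by the caller as a stand-in for AnthropicPricing, its per-million-token shape is an assumption, and events that have cache fields set to zero but non-zero input or output tokens are assumed to fall back to the reported costUsd.

```python
TOKEN_FIELDS = ("inputTokens", "outputTokens", "cacheReadTokens", "cacheCreationTokens")


def effective_cost_usd(event: dict, usd_per_million: dict) -> float:
    """Apply the zero-token, cache-recompute, and legacy rules to one event."""
    tokens = {field: event.get(field) or 0 for field in TOKEN_FIELDS}
    # Rule 1: no tokens at all means no cost, whatever costUsd claims.
    if all(count == 0 for count in tokens.values()):
        return 0.0
    # Rule 2: cache tokens present, so recompute from the pricing table.
    if tokens["cacheReadTokens"] > 0 or tokens["cacheCreationTokens"] > 0:
        total = sum(tokens[f] * usd_per_million[f] for f in TOKEN_FIELDS)
        return total / 1_000_000
    # Rule 3: legacy event (or no cache usage), so trust the reported value.
    return float(event.get("costUsd", 0.0))
```

The rates used to call this function must come from the real pricing table; any numbers used while experimenting with the sketch are placeholders.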
TOKEN_USAGE event now accepts optional cache token fields:
{
"type": "TOKEN_USAGE",
"agentId": "dev-e-node",
"repo": "dashecorp/rig-conductor",
"issueNumber": 148,
"model": "claude-sonnet-4-5",
"inputTokens": 10,
"outputTokens": 4994,
"cacheReadTokens": 160855,
"cacheCreationTokens": 28927,
"costUsd": 0.242194,
"category": "work"
}
See cost attribution for the full pricing model and fix details.
Daily Costs
Returns a per-agent, per-day breakdown for the rolling window.
Response:
{
"period": "7d",
"entries": [
{ "date": "2026-04-23", "agentId": "dev-e-node", "costUsd": 12.30 }
]
}
Stream Status
Returns per-agent Valkey stream status. The primary field to watch is lag inside each
consumer-group entry — it represents messages the consumer group has not yet delivered,
i.e. the real backlog. xlen (total stream length) is included for diagnostics only; it
does not decrease when messages are acknowledged and will accumulate until the stream is
explicitly trimmed.
Response:
{
"dev-e-dotnet": {
"xlen": 17,
"groups": {
"agents": { "lag": 0, "pending": 0 }
}
},
"dev-e-node": {
"xlen": 27,
"groups": {
"agents": { "lag": 2, "pending": 1 }
}
}
}
- lag: unread entries waiting to be delivered to the group (the real queue backlog)
- pending: entries delivered but not yet acknowledged (in-flight)
- xlen: total stream length (misleading as a backlog metric; stays high until trim)
When no consumer group has been created yet (stream never consumed), groups is empty and
xlen reflects the raw message count.
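A monitoring script would typically reduce this response to one backlog number per agent. The sketch below sums lag across consumer groups. Falling back to xlen when no group exists is an assumption made for this example, on the reasoning that nothing has been consumed yet; `backlog_by_agent` is a hypothetical helper.

```python
def backlog_by_agent(stream_status: dict) -> dict:
    """Map each agent to its undelivered-message backlog."""
    backlog = {}
    for agent, info in stream_status.items():
        groups = info.get("groups", {})
        if groups:
            # lag is the real backlog; pending and xlen are ignored here.
            backlog[agent] = sum(group["lag"] for group in groups.values())
        else:
            # Assumption: with no consumer group, every entry is unread.
            backlog[agent] = info["xlen"]
    return backlog
```

Applied to the response above, this yields a backlog of 0 for dev-e-dotnet and 2 for dev-e-node, even though their xlen values are 17 and 27.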
Force-Done (Epic override)
Operator escape hatch for a parent epic that was blocked by the epic-completion guard
(rc#459). Use when the team has deliberately abandoned remaining sub-issues and wants to
declare the epic done without waiting for all subs to reach done/production.
Auth: No token required. Relies on cluster-network trust (internal ClusterIP only —
the same model used by all /api/admin/* endpoints). Do not expose via public ingress.
Query params:
| Param | Type | Required | Description |
|---|---|---|---|
| repo | string | ✅ | Repository slug, e.g. dashecorp/rig-conductor (URL-encode the /) |
| issueNumber | int | ✅ | GitHub issue number of the parent epic |
Responses:
| Status | Body | Meaning |
|---|---|---|
| 200 | {"forcedDone": true, "previousState": "deploying", "repo": "...", "issueNumber": N} | Transition emitted |
| 200 | {"alreadyDone": true, "state": "done", ...} | Epic was already done/production (no-op) |
| 404 | {"error": "Issue ... not found in conductor"} | Issue not tracked in conductor |
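Since both success cases return 200, a script has to inspect the body to tell them apart. A minimal sketch of building the request URL and classifying the three documented responses follows; `force_done_url` and `interpret_force_done` are hypothetical helpers, and the returned labels are invented for this example.

```python
from urllib.parse import urlencode

BASE_URL = "http://rig-conductor-api:8080"  # internal ClusterIP only


def force_done_url(repo: str, issue_number: int) -> str:
    """Build the force-done URL with the repo slug percent-encoded."""
    query = urlencode({"repo": repo, "issueNumber": issue_number})
    return f"{BASE_URL}/api/admin/issues/force-done?{query}"


def interpret_force_done(status_code: int, body: dict) -> str:
    """Classify a force-done response into one of three outcomes."""
    if status_code == 404:
        return "not-tracked"
    if status_code == 200 and body.get("forcedDone"):
        return "forced"
    if status_code == 200 and body.get("alreadyDone"):
        return "already-done"
    raise RuntimeError(f"unexpected response: {status_code} {body}")
```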
Example:
curl -X POST \
"http://rig-conductor-api:8080/api/admin/issues/force-done?repo=dashecorp%2Frig-conductor&issueNumber=433"
Architecture
POST /api/events → SubmitEvent (Use Case) → IEventStore (Port) → MartenEventStore (Adapter) → PostgreSQL
GET /api/queue → IIssueQuery (Port) → MartenIssueQuery (Adapter) → PostgreSQL (Marten projection)
Clean Architecture: endpoints delegate to use cases/ports, never touch Marten directly.
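The port/adapter split can be illustrated with a small sketch. It is written in Python for consistency with the other examples on this page and does not reflect the conductor's actual code: the method names `append` and `execute` and the in-memory adapter are hypothetical, and only the names IEventStore and SubmitEvent come from the flow above.

```python
from typing import Dict, List, Protocol


class IEventStore(Protocol):
    """Port: the only thing the use case knows about storage."""

    def append(self, stream_id: str, event: dict) -> None: ...


class InMemoryEventStore:
    """Adapter stand-in for MartenEventStore, useful in tests."""

    def __init__(self) -> None:
        self.streams: Dict[str, List[dict]] = {}

    def append(self, stream_id: str, event: dict) -> None:
        self.streams.setdefault(stream_id, []).append(event)


class SubmitEvent:
    """Use case: depends on the port, never on a concrete store."""

    def __init__(self, store: IEventStore) -> None:
        self._store = store

    def execute(self, event: dict) -> dict:
        stream_id = f'{event["repo"]}#{event["issueNumber"]}'
        self._store.append(stream_id, event)
        return {"streamId": stream_id, "type": event["type"]}
```

Swapping the adapter changes where events are persisted without touching the use case or the endpoint that calls it.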