API Reference¶

Base URL: http://rig-conductor-api:8080 (internal ClusterIP)

Endpoints¶

Health¶

GET /health

{"status": "healthy", "timestamp": "2026-04-02T06:56:08Z"}

Deep Health (rc#1188)¶

GET /healthz/deep

Returns the status of every dependent system the conductor needs to function. Critical deps (Valkey, Marten, GitHub API) drive the readiness verdict; non-critical deps (Discord) can soft-degrade overall but cannot trip the 503.

Status-code mapping:

Overall	HTTP	Meaning
`Ok`	200	all critical deps reachable; non-critical deps Ok
`Degraded`	200	reachable but slow / rate-limited / a non-critical dep is non-Ok — pod stays in the load balancer
`Unreachable`	503	at least one critical dep is unreachable — readiness probe should pull the pod
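The mapping above can be sketched as a small fold over per-dependency results. This is an illustrative sketch, not the conductor's actual code: `overall_verdict` is a hypothetical name, and the dictionaries only borrow the `status` and `critical` fields from the response shape shown in this section.

```python
from typing import Iterable, Mapping, Tuple


def overall_verdict(dependencies: Iterable[Mapping]) -> Tuple[str, int]:
    """Fold per-dependency results into (overall, http_status)."""
    deps = list(dependencies)
    # Only an unreachable *critical* dependency can make the pod unready.
    if any(d["critical"] and d["status"] == "Unreachable" for d in deps):
        return "Unreachable", 503
    # Any other non-Ok result, critical or not, soft-degrades the verdict
    # but keeps the pod in the load balancer.
    if any(d["status"] != "Ok" for d in deps):
        return "Degraded", 200
    # No dependencies at all (the PR-B1 baseline) also lands here.
    return "Ok", 200
```

Under these assumptions an unreachable Discord yields `("Degraded", 200)`, while an unreachable Valkey yields `("Unreachable", 503)`.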

Post-merge baseline (PR-B1 only):

{
  "overall": "Ok",
  "dependencies": [],
  "checkedAt": "2026-05-19T13:42:00Z"
}

Populated response (after PR-B2 wires production checkers):

{
  "overall": "Degraded",
  "dependencies": [
    { "name": "valkey",  "status": "Ok",         "critical": true,  "reason": null,         "latencyMs": 3,    "lastCheckAt": "..." },
    { "name": "marten",  "status": "Ok",         "critical": true,  "reason": null,         "latencyMs": 12,   "lastCheckAt": "..." },
    { "name": "github",  "status": "Degraded",   "critical": true,  "reason": "http 429",   "latencyMs": 88,   "lastCheckAt": "..." },
    { "name": "discord", "status": "Ok",         "critical": false, "reason": null,         "latencyMs": 41,   "lastCheckAt": "..." }
  ],
  "checkedAt": "2026-05-19T13:42:00Z"
}

reason is null when status is Ok and a short description otherwise ("high ping latency", "http 429", "ping failed: RedisConnectionException", etc.). latencyMs is the per-checker round-trip time. The orchestrator runs checkers in parallel with a hard 2-second timeout per checker; a checker that times out is reported as Status=Degraded, reason="timeout".
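The parallel run with a per-checker timeout could look roughly like the following. It is a minimal sketch in Python using `asyncio`, not the conductor's orchestrator: `run_checker`, `run_all`, and the two stand-in checkers are hypothetical, and a checker that raises an exception is not handled here.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

Checker = Callable[[], Awaitable[Dict]]


async def run_checker(name: str, check: Checker, timeout_s: float) -> Dict:
    """Run one checker, mapping a timeout to Degraded / "timeout"."""
    started = time.monotonic()
    try:
        result = await asyncio.wait_for(check(), timeout=timeout_s)
    except asyncio.TimeoutError:
        result = {"status": "Degraded", "reason": "timeout"}
    latency_ms = int((time.monotonic() - started) * 1000)
    return {"name": name, **result, "latencyMs": latency_ms}


async def run_all(checkers: Dict[str, Checker], timeout_s: float = 2.0) -> List[Dict]:
    """Run every checker concurrently; results keep the input order."""
    tasks = [run_checker(name, check, timeout_s) for name, check in checkers.items()]
    return list(await asyncio.gather(*tasks))


# Stand-in checkers for illustration only.
async def fast_check() -> Dict:
    return {"status": "Ok", "reason": None}


async def slow_check() -> Dict:
    await asyncio.sleep(0.2)
    return {"status": "Ok", "reason": None}
```

Calling `asyncio.run(run_all({"valkey": fast_check, "github": slow_check}, timeout_s=0.05))` reports the slow checker as Degraded with reason "timeout" without delaying the fast one beyond the timeout.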

Rollout slices:

- PR-B1 (this slice): port + orchestrator + /healthz/deep endpoint; empty deps array
- PR-B2 (rc#1204): four production checkers (Valkey / Marten / GitHub API / Discord) + DI
- PR-D: Kubernetes readiness + liveness probes switched from /health to /healthz/deep + decision doc

See rc#1188 for the overall design and rc#1173 for the Valkey silent-degrade incident that motivated it.

Submit Event¶

POST /api/events

Submit any rig event. The stream ID is derived from repo#issueNumber for issue events, or agentId for heartbeats.

Request:

{
  "type": "ISSUE_APPROVED",
  "repo": "dashecorp/rig-conductor",
  "issueNumber": 42,
  "title": "feat: Add health check endpoint",
  "priority": "normal",
  "dependsOn": []
}

Response (200):

{
  "streamId": "dashecorp/rig-conductor#42",
  "type": "ISSUE_APPROVED",
  "timestamp": "2026-04-02T06:56:22Z"
}

Error (400):

{"error": "Unknown event type: INVALID"}

See Event Store for all event types and their fields.
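The stream ID rule for this endpoint can be illustrated with a short helper. This is a sketch under an assumption: it decides by the presence of `issueNumber`, whereas the real endpoint may switch on the event type. The function name and the `HEARTBEAT` type string in the usage note are hypothetical.

```python
def derive_stream_id(event: dict) -> str:
    """Return repo#issueNumber for issue events, else the agent ID."""
    if event.get("issueNumber") is not None:
        # Issue events: one stream per issue, keyed by repo and number.
        return f'{event["repo"]}#{event["issueNumber"]}'
    # Heartbeats carry no issue, so the stream is keyed by the agent.
    return event["agentId"]
```

With the request shown above, the helper returns `dashecorp/rig-conductor#42`, matching the `streamId` in the sample response; an event such as `{"type": "HEARTBEAT", "agentId": "dev-e-1"}` maps to `dev-e-1`.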

Get Issues¶

GET /api/issues
GET /api/issues?state=queued

Returns all tracked issues, optionally filtered by state.

States: queued, assigned, in_progress, in_review, deploying, done, failed

Get Issue Trace (rc#951)¶

GET /api/issues/trace?repo={repo}&issueNumber={n}

Returns the ordered Marten event history for one issue stream, plus the current IssueStatus.State from the projection. Events are sorted by Marten's global Sequence, which is authoritative; timestamps are not used for ordering because they can collide or arrive out of order under clock skew.

404 Not Found when the stream has zero events (issue never landed in conductor).

{
  "issueId": "dashecorp/rig-conductor#244",
  "currentState": "in_progress",
  "eventCount": 28,
  "events": [
    {"at": "2026-05-14T07:26:17Z", "type": "work_started", "data": {...}, "sequence": 364467},
    {"at": "2026-05-14T07:26:46Z", "type": "merged",       "data": {...}, "sequence": 364470}
  ]
}

Replaces the prior psql-into-Marten + log-tail workflow (~10 min per triage cycle). Also feeds the rc#947 SelfImprovementService a stable per-issue event-sequence API.
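A client that merges or re-sorts trace events should follow the same ordering rule. The sketch below assumes only the `sequence` field from the sample response; `order_events` is a hypothetical helper name.

```python
from typing import Dict, List


def order_events(events: List[Dict]) -> List[Dict]:
    """Order events by the global sequence, never by the "at" timestamp."""
    return sorted(events, key=lambda e: e["sequence"])
```

Two events that share the same `at` value still come out in a deterministic order, because `sequence` is unique per event.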

Get Priority Queue¶

GET /api/queue

Returns unassigned issues sorted by priority (critical > high > normal), with the oldest issue first within each priority.
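One way to read this ordering is priority first, then age as the tie-breaker. The sketch below assumes that reading; the `queuedAt` field is hypothetical (the queue response shape is not shown in this reference), and it relies on UTC ISO-8601 strings sorting chronologically.

```python
from typing import Dict, List

# Lower rank sorts first.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2}


def sort_queue(issues: List[Dict]) -> List[Dict]:
    """Sort by priority, then oldest first within the same priority."""
    return sorted(issues, key=lambda i: (PRIORITY_RANK[i["priority"]], i["queuedAt"]))
```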

Get Next Assignment¶

GET /api/assignments/next?agentId=dev-e-1

Returns the top-priority unassigned issue, or 204 No Content if nothing available.

Response (200):

{
  "streamId": "dashecorp/rig-conductor#42",
  "issue": {
    "number": 42,
    "repo": "dashecorp/rig-conductor",
    "title": "feat: Add health check endpoint",
    "milestone": null
  },
  "priority": "normal",
  "attempt": 1
}

Get Agent Status¶

GET /api/agents

Returns status of all known agents. The status and currentIssue/currentRepo fields are updated by both heartbeat events and assignment events (ISSUE_ASSIGNED, WORK_STARTED) so the dashboard reflects live work immediately without waiting for a heartbeat cycle.

Each item includes:

isOnline — computed from recent heartbeat freshness (last heartbeat within 2 minutes)
status — idle, working, or stuck. Set to working when ISSUE_ASSIGNED or WORK_STARTED is dispatched; cleared to idle on ISSUE_DONE, ISSUE_CANCELLED, or next idle heartbeat.
currentIssue — issue number the agent is currently working on (null if idle)
currentRepo — repo of the current issue (null if idle)
activeProvider
availableProviders
providers[] — provider health snapshot
integrations[] — integration health snapshot
lastModel — last LLM model the agent reported via TOKEN_USAGE (e.g. claude-opus-4-7). null until the agent emits its first TOKEN_USAGE event after deploy. Added in #530 (rc#529) so the dashboard can show which agents have switched models after a config change.

Response:

[
  {
    "id": "dev-e-1",
    "status": "working",
    "isOnline": true,
    "currentIssue": 136,
    "currentRepo": "dashecorp/rig-conductor",
    "activeProvider": "claude",
    "availableProviders": ["claude", "codex"],
    "providers": [
      { "name": "claude", "status": "ready", "details": "Claude Code auth configured", "active": true },
      { "name": "codex", "status": "authenticated", "details": "logged in using ChatGPT", "active": false }
    ],
    "integrations": [
      { "name": "github", "status": "ready", "details": "GitHub App configured" },
      { "name": "discord-webhook", "status": "ready", "details": "operator webhook configured" }
    ],
    "lastModel": "claude-opus-4-7",
    "lastHeartbeat": "2026-04-02T09:14:22Z",
    "issuesCompleted": 12,
    "issuesFailed": 1
  }
]

lastModel is null for agents that have not yet emitted a TOKEN_USAGE event after deploy.
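The isOnline and status rules described in the field list can be sketched as two pure functions. This is an illustration only: the function names are hypothetical, the 2-minute boundary is assumed to be inclusive, and the idle-heartbeat path that also clears status is left out.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_FRESHNESS = timedelta(minutes=2)


def is_online(last_heartbeat: str, now: datetime) -> bool:
    """True when the last heartbeat is at most 2 minutes old."""
    seen = datetime.fromisoformat(last_heartbeat.replace("Z", "+00:00"))
    return now - seen <= HEARTBEAT_FRESHNESS


def next_status(current: str, event_type: str) -> str:
    """Apply the assignment-event transitions; other events leave status as-is."""
    if event_type in ("ISSUE_ASSIGNED", "WORK_STARTED"):
        return "working"
    if event_type in ("ISSUE_DONE", "ISSUE_CANCELLED"):
        return "idle"
    return current
```

For the sample agent above, a `now` of 2026-04-02T09:15:22Z is one minute after `lastHeartbeat`, so `is_online` returns `True`.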

Get Execution Logs¶

GET /api/execution-logs?limit=50&status=running

Returns recent execution log summaries (without full step/log payload). Used by the dashboard overview.

Query parameters:

Parameter	Type	Default	Description
`limit`	int	`50`	Max results to return
`status`	string	(all)	Filter by status. Must be one of: `running`, `completed`, `failed`, `stuck`. Case-insensitive — `RUNNING`, `Running`, and `running` are all accepted and treated identically.

Status filter validation: If status is provided and is not one of the allowed values, the endpoint returns 400 Bad Request:

{"error": "Invalid status 'xyz'. Allowed values: running, completed, failed, stuck."}

Note: cancelled is not a valid status — the domain uses stuck to represent issues that have stalled. Sending ?status=cancelled returns 400.

Response (200 OK):

[
  {
    "id": "uuid",
    "repo": "dashecorp/rig-conductor",
    "issueNumber": 42,
    "prNumber": 99,
    "agentId": "dev-e-1",
    "status": "completed",
    "startedAt": "2026-04-23T10:00:00Z",
    "completedAt": "2026-04-23T10:12:00Z",
    "durationSeconds": 720,
    "totalCostUsd": 0.18,
    "totalTurns": 24,
    "model": "claude-sonnet-4-5",
    "stepCount": 7,
    "logCount": 130
  }
]
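The status filter validation described for this endpoint can be sketched as follows. `validate_status` is a hypothetical helper, and echoing the caller's original spelling in the error message is an assumption; the reference only shows the message for a lower-case input.

```python
from typing import Optional, Tuple, Union

ALLOWED_STATUSES = ("running", "completed", "failed", "stuck")


def validate_status(raw: Optional[str]) -> Tuple[int, Union[None, str, dict]]:
    """Return (http_status, normalized_filter_or_error_body)."""
    if raw is None:
        return 200, None  # no filter: all statuses
    normalized = raw.lower()  # matching is case-insensitive
    if normalized in ALLOWED_STATUSES:
        return 200, normalized
    allowed = ", ".join(ALLOWED_STATUSES)
    return 400, {"error": f"Invalid status '{raw}'. Allowed values: {allowed}."}
```

`validate_status("RUNNING")` normalizes to `running`, while `validate_status("cancelled")` produces the 400 body.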

Get Execution Log Stats¶

GET /api/execution-logs/stats?agentId=review-e&days=14&status=completed

Aggregated turn / cost / token distribution for execution logs in a window. It is the instrument that sizes cost levers, e.g. choosing a safe --max-turns value above the p95 of real runs. Pure aggregation lives in ConductorE.Core.UseCases.ExecutionStatsPolicy; see the Review-E cost-reduction decision and Cost Attribution.

Query parameters:

Parameter	Type	Default	Description
`agentId`	string	(all agents)	Filter to a single agent. Omit to aggregate across all agents; echoed back as `null` when omitted.
`days`	int	`14`	Look-back window from now. Must be `1`–`365`; out-of-range returns `400` (a non-positive window would push the cutoff into the future and report zero rows, indistinguishable from "no runs").
`status`	string	`completed`	Status to aggregate. One of `running`, `completed`, `failed`, `stuck` (case-insensitive). An unknown value returns `400` — it is not silently treated as "no data", since an empty result would size `--max-turns` from a zero p95.

Response (200 OK):

{
  "agentId": "review-e",
  "windowDays": 14,
  "status": "completed",
  "stats": {
    "runCount": 42,
    "turns": { "count": 42, "min": 3, "p50": 18, "p90": 47, "p95": 61, "max": 120, "mean": 21.4 },
    "totalCostUsd": 38.21,
    "meanCostUsd": 0.91,
    "totalInputTokens": 18450000,
    "totalOutputTokens": 412000,
    "totalCacheReadTokens": 15200000,
    "totalCacheCreationTokens": 980000
  }
}

turns percentiles are nearest-rank (rank = ceil(p/100 * N), 1-based). An agent with no runs in the window returns 200 with runCount: 0 and an all-zero distribution (never a 404 or error).
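The nearest-rank rule can be written out directly. The sketch below is not the ExecutionStatsPolicy code; `nearest_rank` is a hypothetical name, `p` is assumed to be an integer percentile, and integer arithmetic is used for the ceiling to avoid floating-point rounding at exact boundaries.

```python
from typing import Sequence


def nearest_rank(values: Sequence[int], p: int) -> int:
    """Nearest-rank percentile: rank = ceil(p/100 * N), 1-based."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return 0  # empty window: all-zero distribution
    rank = max(1, -(-p * n // 100))  # integer ceiling of p*n/100
    return ordered[rank - 1]
```

For ten runs with turn counts 1 through 10, p50 has rank 5 and p95 has rank ceil(9.5) = 10, so the results are 5 and 10.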

Get Event Stream¶

GET /api/events/stream?id=dashecorp/rig-conductor%2342

Returns all events for a specific stream. The id parameter must be URL-encoded (use %23 for #).

Response:

[
  {
    "id": "uuid",
    "type": "IssueApproved",
    "data": { "repo": "...", "issueNumber": 42, "title": "..." },
    "timestamp": "2026-04-02T06:56:22Z"
  }
]
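Building the request URL for this endpoint with the `#` encoded can be done with the standard library. `event_stream_url` is a hypothetical helper; it leaves `/` unencoded to match the example request above.

```python
from urllib.parse import quote


def event_stream_url(base_url: str, stream_id: str) -> str:
    """Percent-encode the stream ID so '#' is not parsed as a URL fragment."""
    return f"{base_url}/api/events/stream?id={quote(stream_id, safe='/')}"
```

`event_stream_url("http://rig-conductor-api:8080", "dashecorp/rig-conductor#42")` produces the same query string as the example at the top of this section.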

Token Usage¶

GET /api/usage
GET /api/usage?agentId=dev-e-node
GET /api/usage?repo=dashecorp%2Frig-conductor
GET /api/usage?days=7

Returns per-agent token usage totals including cache token counts.

Query parameters:

Param	Default	Description
`agentId`	(all)	Filter to a single agent
`repo`	(all repos)	Filter to a specific repo
`days`	(all time)	Rolling window in days. Required to compare with `/api/costs/summary` — without it the endpoint returns all-time projection totals which are naturally larger than any windowed cost summary.

Response:

[
  {
    "agentId": "dev-e-node",
    "totalInputTokens": 1200000,
    "totalOutputTokens": 480000,
    "totalCacheReadTokens": 5600000,
    "totalCacheCreationTokens": 920000,
    "totalCostUsd": 39.591,
    "byRepo": [
      {
        "repo": "dashecorp/rig-conductor",
        "inputTokens": 800000,
        "outputTokens": 300000,
        "cacheReadTokens": 3200000,
        "cacheCreationTokens": 600000,
        "costUsd": 25.4
      }
    ]
  }
]

Cost Summary¶

GET /api/costs/summary
GET /api/costs/summary?days=7

Returns agent-level cost breakdown by category for the rolling window. Uses the same raw-event source as /api/usage?days=N — both endpoints agree for the same window.

Cost computation (fixed in #148):

Events with zero tokens (inputTokens=outputTokens=cacheReadTokens=cacheCreationTokens=0) contribute $0, regardless of any costUsd value in the event. This prevents phantom idle costs.
Events with cache tokens (cacheReadTokens > 0 || cacheCreationTokens > 0) are recomputed using the Anthropic pricing table in AnthropicPricing rather than trusting the agent-reported costUsd, which historically excluded cache costs.
Legacy events (no cache token fields) continue to use the agent-reported costUsd.

TOKEN_USAGE event now accepts optional cache token fields:

{
  "type": "TOKEN_USAGE",
  "agentId": "dev-e-node",
  "repo": "dashecorp/rig-conductor",
  "issueNumber": 148,
  "model": "claude-sonnet-4-5",
  "inputTokens": 10,
  "outputTokens": 4994,
  "cacheReadTokens": 160855,
  "cacheCreationTokens": 28927,
  "costUsd": 0.242194,
  "category": "work"
}

See cost attribution for the full pricing model and fix details.
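The three cost rules in this section can be sketched as one function. This is an illustration, not the AnthropicPricing implementation: `effective_cost_usd` is a hypothetical name, the rates are supplied by the caller per million tokens, and `PLACEHOLDER_RATES` holds made-up numbers that are not real Anthropic prices. An event whose cache fields are present but both zero is treated like a legacy event here, which is an assumption.

```python
from typing import Dict

# Made-up USD rates per million tokens, for illustration only.
PLACEHOLDER_RATES = {"input": 1.0, "output": 2.0, "cacheRead": 0.5, "cacheCreation": 1.5}


def effective_cost_usd(event: Dict, rates: Dict[str, float]) -> float:
    inp = event.get("inputTokens", 0)
    out = event.get("outputTokens", 0)
    cache_read = event.get("cacheReadTokens", 0)
    cache_create = event.get("cacheCreationTokens", 0)

    # Rule 1: zero-token events cost nothing, whatever costUsd says.
    if inp == out == cache_read == cache_create == 0:
        return 0.0

    # Rule 2: events with cache tokens are recomputed from the pricing table.
    if cache_read > 0 or cache_create > 0:
        total = (
            inp * rates["input"]
            + out * rates["output"]
            + cache_read * rates["cacheRead"]
            + cache_create * rates["cacheCreation"]
        )
        return total / 1_000_000

    # Rule 3: legacy events keep the agent-reported cost.
    return event.get("costUsd", 0.0)
```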

Daily Costs¶

GET /api/costs/daily?days=7

Returns a per-agent, per-day breakdown for the rolling window.

Response:

{
  "period": "7d",
  "entries": [
    { "date": "2026-04-23", "agentId": "dev-e-node", "costUsd": 12.30 }
  ]
}

Stream Status¶

GET /api/streams/status

Returns per-agent Valkey stream status. The primary field to watch is lag inside each consumer-group entry — it represents messages the consumer group has not yet delivered, i.e. the real backlog. xlen (total stream length) is included for diagnostics only; it does not decrease when messages are acknowledged and will accumulate until the stream is explicitly trimmed.

Response:

{
  "dev-e-dotnet": {
    "xlen": 17,
    "groups": {
      "agents": { "lag": 0, "pending": 0 }
    }
  },
  "dev-e-node": {
    "xlen": 27,
    "groups": {
      "agents": { "lag": 2, "pending": 1 }
    }
  }
}

lag — unread entries waiting to be delivered to the group (the real queue backlog)
pending — entries delivered but not yet acknowledged (in-flight)
xlen — total stream length (misleading as a backlog metric; stays high until trim)

When no consumer group has been created yet (stream never consumed), groups is empty and xlen reflects the raw message count.
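A monitoring client could reduce this response to a per-agent backlog figure along the following lines. `backlog_by_agent` is a hypothetical helper; falling back to `xlen` when no consumer group exists is an interpretation of the note above (nothing has been consumed, so every message is unread).

```python
from typing import Dict


def backlog_by_agent(status: Dict[str, Dict]) -> Dict[str, int]:
    """Sum consumer-group lag per agent; use xlen only when no group exists."""
    backlog = {}
    for agent, info in status.items():
        groups = info.get("groups", {})
        if groups:
            backlog[agent] = sum(g["lag"] for g in groups.values())
        else:
            backlog[agent] = info["xlen"]
    return backlog
```

For the sample response, this reports a backlog of 0 for `dev-e-dotnet` and 2 for `dev-e-node`, even though their `xlen` values are 17 and 27.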

List Tenants (Cockpit roster)¶

GET /api/admin/tenants

Operator-facing read of the rig_control tenant allowlist (the source the resolver consults), projected to a small DTO that the Rig Cockpit (iOS app + web SPA via the rig-cockpit-worker /api/* proxy) renders as a Tenants tab (rc#1919). It fills a gap left by GET /api/tenancy: that census returns only a count plus bare lower-cased tenant-ID strings, with no name, status, or type.

Auth: No in-app token. Same Cloudflare-Access posture as the sibling POST/DELETE /api/admin/tenants write paths and the rest of /api/admin/*. Do not expose via public ingress.

Query params: none. Returns the full roster. There is no owner-scoping today because there is effectively one active tenant (invotek); owner-scoping is future work if the roster grows.

Response (200):

{
  "tenants": [
    {
      "id": "invotek",
      "name": "Invotek",
      "githubOrg": "dashecorp",
      "status": "active",
      "type": "first-party",
      "installationId": 12345678
    }
  ]
}

Per-row fields:

Field	Type	Description
`id`	string	Stable lower-case tenant slug (Marten document identity)
`name`	string	Human-readable name
`githubOrg`	string	GitHub org that maps to this tenant (lower-case)
`status`	string	`"active"` or `"suspended"` — only `"active"` tenants resolve at the ingress boundary
`type`	string	Operator-policy classification, raw — `"first-party"` or `"b2c"` (rc#1873). Surfaced as-is; clients render their own label, no server-side B2B/B2C mapping
`installationId`	long?	GitHub App installation id when known; `null` until per-tenant App installs are mapped

The projection is closed — adding a field to the Tenant Marten document does NOT silently leak through this endpoint. Internal-only fields stay internal.
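A closed projection means every output field is copied explicitly, so unknown fields on the source document are dropped by construction. The sketch below shows the idea in Python; `TenantRow`, `project_tenant`, and the `internalNotes` field in the usage note are hypothetical, and the real DTO lives in the conductor codebase.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantRow:
    id: str
    name: str
    githubOrg: str
    status: str
    type: str
    installationId: Optional[int]


def project_tenant(doc: dict) -> dict:
    """Copy only the six public fields; anything else on the document is ignored."""
    row = TenantRow(
        id=doc["id"],
        name=doc["name"],
        githubOrg=doc["githubOrg"],
        status=doc["status"],
        type=doc["type"],
        installationId=doc.get("installationId"),
    )
    return asdict(row)
```

A source document that also carries, say, an `internalNotes` field projects to the same six keys as before.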

Force-Done (Epic override)¶

POST /api/admin/issues/force-done?repo=owner%2Frepo&issueNumber=N

Operator escape hatch for a parent epic that was blocked by the epic-completion guard (rc#459). Use when the team has deliberately abandoned remaining sub-issues and wants to declare the epic done without waiting for all subs to reach done/production.

Auth: No token required. Relies on cluster-network trust (internal ClusterIP only — the same model used by all /api/admin/* endpoints). Do not expose via public ingress.

Query params:

Param	Type	Required	Description
`repo`	string	✅	Repository slug, e.g. `dashecorp/rig-conductor` (URL-encode the `/`)
`issueNumber`	int	✅	GitHub issue number of the parent epic

Responses:

Status	Body	Meaning
200	`{"forcedDone": true, "previousState": "deploying", "repo": "...", "issueNumber": N}`	Transition emitted
200	`{"alreadyDone": true, "state": "done", ...}`	Epic was already done/production — no-op
404	`{"error": "Issue ... not found in conductor"}`	Issue not tracked in conductor

Example:

curl -X POST \
  "http://rig-conductor-api:8080/api/admin/issues/force-done?repo=dashecorp%2Frig-conductor&issueNumber=433"

Architecture¶

POST /api/events → SubmitEvent (Use Case) → IEventStore (Port) → MartenEventStore (Adapter) → PostgreSQL

GET /api/queue   → IIssueQuery (Port) → MartenIssueQuery (Adapter) → PostgreSQL (Marten projection)

Clean Architecture: endpoints delegate to use cases/ports, never touch Marten directly.
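The port/adapter split can be illustrated with a small sketch. The names `IEventStore` and `SubmitEvent` come from the flow above; everything else is an assumption made for illustration, including the `append` method, the in-memory adapter standing in for MartenEventStore, and the use of Python rather than the service's own language.

```python
from typing import Dict, List, Protocol


class IEventStore(Protocol):
    """Port: the only thing the use case knows about persistence."""

    def append(self, stream_id: str, event: Dict) -> None: ...


class InMemoryEventStore:
    """Adapter: a stand-in for the Marten-backed store."""

    def __init__(self) -> None:
        self.streams: Dict[str, List[Dict]] = {}

    def append(self, stream_id: str, event: Dict) -> None:
        self.streams.setdefault(stream_id, []).append(event)


class SubmitEvent:
    """Use case: depends on the port, never on a concrete store."""

    def __init__(self, store: IEventStore) -> None:
        self._store = store

    def handle(self, event: Dict) -> str:
        stream_id = f'{event["repo"]}#{event["issueNumber"]}'
        self._store.append(stream_id, event)
        return stream_id
```

Swapping the adapter (for a test double or a different database) requires no change to `SubmitEvent`, which is the point of routing endpoints through ports.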