
Observability — OpenTelemetry, Langfuse, Prometheus, SLOs

TL;DR

Two observability domains joined at Conductor-E: agent (any OTel-emitting runtime — Claude Code, Codex CLI, or Gemini CLI on the default path — into self-hosted Langfuse) and infrastructure (OTel Collector → Grafana Cloud Free + local Prometheus). Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs, so the backend and the provider are both swappable (see provider-portability.md). The hybrid keeps decision-making local (Flagger analysis) while pushing long-term storage to managed. Total added memory: ~2 GB.

Measurement precedes trust (principle 1). This document covers what the rig measures, how it surfaces the data, and what thresholds become decisions.

The stack

graph TB
    subgraph "Agent runtime"
        CC[Claude Code CLI<br/>CLAUDE_CODE_ENABLE_TELEMETRY=1]
        A[Custom agent code]
    end

    subgraph "OTel Collector (one per cluster)"
        OC[OpenTelemetry Collector]
    end

    subgraph "Agent observability — self-hosted"
        LF[Langfuse<br/>Postgres + ClickHouse]
    end

    subgraph "Infra observability — managed"
        GC[Grafana Cloud<br/>Prometheus, Loki, Tempo]
    end

    subgraph "Infra observability — local"
        LP[Local Prometheus<br/>Flagger analysis source]
    end

    subgraph "Control plane"
        CE[Conductor-E<br/>reads aggregate signals]
    end

    subgraph "Dashboards"
        LD[Langfuse UI<br/>agent quality, cost]
        GD[Grafana dashboards<br/>services, SLOs, drift]
    end

    CC -->|OTLP spans + metrics| OC
    A -->|OTLP| OC
    OC -->|LLM traces| LF
    OC -->|infra traces + logs| GC
    OC -->|service metrics| GC
    OC -->|service metrics| LP
    LP -->|analysis queries| FL[Flagger]
    LF -->|cost/quality aggregates| CE
    GC -->|SLO/error-budget| CE
    LF --> LD
    GC --> GD

Self-hosted + managed hybrid is a deliberate choice. Self-hosting everything on an 8GB VM starves the rig under load. Managing everything externally removes the local source of truth for Flagger analysis. The hybrid gets local decision-making for the hot path and managed long-term storage for the rest.

Native OpenTelemetry across agent runtimes

Every major agent runtime we support emits OpenTelemetry spans natively as of late 2025 — Claude Code, Codex CLI, and Gemini CLI all ship with OTel integration and agree on the GenAI semantic conventions. That agreement is the key: swap the runtime or the provider and spans keep flowing into the same backend. See provider-portability.md for the full multi-runtime story. The example below uses Claude Code env vars; the equivalents for Codex CLI (CODEX_OTEL_ENDPOINT) and Gemini CLI (GEMINI_OTEL_EXPORTER_OTLP_ENDPOINT) work identically.

Enable in Claude Code via:

export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_SERVICE_NAME=dev-e-dotnet
export OTEL_RESOURCE_ATTRIBUTES="agent.class=dev-e,agent.stack=dotnet,agent.pod=${HOSTNAME}"

Spans emitted per:

  • Model request (attributes: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons)
  • Tool execution (attributes: tool name, duration, error)
  • Session lifecycle (start, compact, end)

Metadata is always captured. Prompt text and tool-result content are opt-in via CLAUDE_CODE_TELEMETRY_PROMPTS=1 and CLAUDE_CODE_TELEMETRY_TOOL_RESULTS=1. For privacy-sensitive environments, metadata-only suffices for cost attribution and tool-call accounting.

OpenTelemetry GenAI semantic conventions

The conventions for LLM spans are still experimental in 2026. We dual-emit via OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai/dup to survive future stabilization. Convention:

  • Span name: {gen_ai.operation.name} {gen_ai.request.model} (e.g., chat claude-sonnet-4-6)
  • Tool call as child span: name execute_tool {tool_name}
  • Error capture via span status + exception events
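A minimal sketch of what the convention above produces for one chat completion — span name plus `gen_ai.*` attributes — without pulling in an SDK. The attribute values here are illustrative, not from a real session.

```python
# Sketch: span name and attributes a conforming emitter would produce
# for a single model request, per the GenAI semantic conventions.

def genai_span_name(operation: str, model: str) -> str:
    # Convention: "{gen_ai.operation.name} {gen_ai.request.model}"
    return f"{operation} {model}"

span = {
    "name": genai_span_name("chat", "claude-sonnet-4-6"),
    "attributes": {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": "claude-sonnet-4-6",
        "gen_ai.usage.input_tokens": 1200,      # illustrative values
        "gen_ai.usage.output_tokens": 450,
        "gen_ai.response.finish_reasons": ["end_turn"],
    },
}
```

Tool calls follow the same shape with the `execute_tool {tool_name}` naming.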

LLM observability — Langfuse or Phoenix, decided by VM size

Changed from original whitepaper

The original whitepaper named Langfuse unconditionally. Honest re-evaluation (see tool-choices.md) concludes:

  • Langfuse v3 officially requires 4 CPU / 16 GB RAM for the app + separate ClickHouse cluster — too heavy for our 8 GB VM
  • Arize Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick at our scale
  • Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs — the observability backend becomes swappable

The pick below assumes we either (a) stay on 8 GB → use Phoenix, or (b) scale to 16 GB+ → use Langfuse.

Langfuse is the self-hosted OSS choice when the VM supports it (MIT core, ClickHouse + Postgres + Redis). Free tier supports 50k billable units/month on SaaS; no tier limits on-prem. Strongest OSS prompt versioning + evals + cost attribution.

Arize Phoenix is the lighter alternative (ELv2 source-available, OTel-native, SQLite or Postgres). Better fit for our 8 GB VM today. Eval-first product; less polish on team workflows than Langfuse.

See tool-choices.md for the full per-option evaluation — Langfuse, Phoenix, Helicone, LangSmith, Braintrust, W&B Traces, and Cloudflare AI Gateway (which we run as a free secondary observability layer regardless) — including license, owner, pricing, lock-in, and migration cost.

What Langfuse tracks

For each agent session:

  • Cost — tokens × model price, attributed per agent × task × repo
  • Latency — per model call, per tool call
  • Quality signals — tool-call error rate, schema-validation failure rate, session success/fail
  • Prompt versioning — every system prompt change creates a new version, enables A/B across versions
  • Eval results — nightly harness writes scores here
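The cost line item above is tokens × model price, rolled up per agent × task. A minimal sketch of that attribution; the per-million-token prices here are hypothetical placeholders, not actual provider rates.

```python
# Hypothetical price table, $ per 1M tokens — placeholder values only.
PRICE_PER_MTOK = {"claude-sonnet-4-6": {"input": 3.00, "output": 15.00}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: tokens x per-model price."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A session is a list of (model, input_tokens, output_tokens) calls.
session = [("claude-sonnet-4-6", 40_000, 2_000),
           ("claude-sonnet-4-6", 55_000, 3_500)]
session_cost = sum(call_cost(m, i, o) for m, i, o in session)
```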

The Langfuse UI is the primary dashboard for agent-quality questions. "Is Dev-E getting better this week?" is a Langfuse query.

Prometheus (local) — Flagger analysis source

Flagger's AnalysisTemplate queries Prometheus during canary promotion. A canary analysis spec like:

analysis:
  interval: 1m
  threshold: 5
  maxWeight: 50
  stepWeight: 10
  metrics:
  - name: success-rate
    thresholdRange: { min: 99 }
    interval: 1m
    templateRef:
      name: success-rate
      namespace: observability
  - name: latency-p99
    thresholdRange: { max: 500 }
    interval: 1m
    templateRef:
      name: latency-p99

must work even if egress to Grafana Cloud is blipping. Local Prometheus is the source of truth for deploy decisions. Resource budget: 1GB RAM, 10GB disk, 14-day retention.

What Prometheus scrapes

  • kubelet cadvisor metrics (pod health)
  • kube-state-metrics (cluster state)
  • Service endpoints via ServiceMonitor (every service exposes /metrics)
  • Flagger's own metrics

Everything else (logs, traces, long-term metrics) goes to Grafana Cloud.

Grafana Cloud Free tier — managed long-term

Grafana Cloud Free (April 2026 limits):

  • 10k series metrics
  • 50GB logs
  • 50GB traces
  • 14-day retention
  • 3 users

Suffices for a small rig. Upgrade path is linear cost-per-ingest when traffic grows.

What goes to Grafana Cloud

  • All traces (service + agent)
  • All logs (service + agent)
  • Long-term metrics (retention beyond local Prometheus's 14 days, via remote_write)

Grafana dashboards are built for:

  • Service-level: rate, errors, duration (RED), SLO burn rate, error budget remaining
  • Agent-level (cross-referenced with Langfuse): tokens per task per agent, cost per agent per day
  • Drift: model output hash delta, prompt-eval regression alerts

SLOs and error budgets

SLO = target reliability over a rolling window. Error budget = (100% − SLO) × window — the amount of unreliability you're allowed to spend before breaching the target.
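A worked instance of the definition, assuming a 99.5% SLO over a rolling 28-day window:

```python
# Error budget = (1 - SLO) x window.
slo = 0.995
window_minutes = 28 * 24 * 60            # 40,320 minutes in the rolling window
budget_minutes = (1 - slo) * window_minutes   # ~201.6 minutes of allowed unavailability
```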

SLOs for the rig's own services — to be set, not asserted

Honest retraction

Earlier drafts listed specific SLO numbers (99.9% for Conductor-E POST, 99.5% for the GET endpoint, etc.) as though they were commitments. The honest position: we don't serve customer traffic, so no external promise is being made, and the right SLO targets should be derived from operating data, not chosen up front. Asserting "99.9%" before we've run the service is plant-a-flag framing, not measurement.

What we commit to today:

  • Each service has an SLI defined (success rate, latency, goal accuracy — a concrete, dashboarded metric)
  • Each service has a provisional SLO set to a conservative default and revisited monthly
  • Budget-gated rollout discipline applies regardless of the specific target — if budget is burning, only fixes merge
  • The ceiling for any single-VM service is ~99.9% (a single 8 GB VM cannot credibly claim four 9s without multi-region failover we don't have)

Concrete per-service targets are filled in by the on-call engineer after the service has enough real traffic to calibrate — not in this document. Example template the engineer fills in:

| Service | SLI | Provisional SLO | Status |
|---|---|---|---|
| Conductor-E /api/events POST | success rate over 28d | 99.5% | under observation |
| Conductor-E /api/assignments/next GET | success rate over 28d | 99% | under observation |
| Dev-E session → merged PR | goal accuracy | 80% | under observation |
| Review-E session → approve/reject | consistency with human-judged (sampled) | 85% | under observation |
| Dev-E repair-dispatch auto-fix | fix-survives-24h | 70% | not yet deployed |
| Flagger canary promotion | false-positive rollback rate | <10% | under observation |

"Provisional" means "we're measuring this and will re-set the number after 60 days of operating data." That's the discipline — not the specific numerical target.

Burn-rate alerts (Honeycomb pattern)

Page when the current burn rate, if continued, would exhaust the monthly budget within about two days — the classic fast-burn threshold of 14.4×. This is the right shape for an AI agent to consume — it's forward-looking and rate-based, not threshold-based.

# Prometheus alert rule
- alert: SLOBurnFast
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) > ( (1 - 0.999) * 14.4 )  # 14.4x burn exhausts a 28d budget in ~2 days (~2%/hour)
  for: 2m
  labels: { severity: P1 }
  annotations:
    summary: "Error budget will be exhausted within ~2 days at current burn rate"
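The threshold in the rule above can be derived directly. Burn rate B means errors arrive at B× the budgeted rate, so a 28-day budget lasts 672h / B:

```python
# Derivation of the fast-burn alert threshold.
slo = 0.999
burn_rate = 14.4
window_hours = 28 * 24                          # 672h rolling window

error_rate_threshold = (1 - slo) * burn_rate    # the expr in the alert rule
hours_to_exhaustion = window_hours / burn_rate  # ~46.7h, i.e. about 2 days
```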

Burn-rate alerts fire EscalationRequired events into Conductor-E; the escalation router decides routing by severity.

What works vs what fails at small scale

  • SLOs at sub-1-QPS are statistical noise. A single 500 burns ~10% of an hourly budget. Use 7-day windows minimum and define SLIs over synthetic probes (constant QPS) rather than user traffic. Honeycomb-style burn-rate alerts break down below ~10 QPS.
  • Multi-replica required for true canary. Canary with 1 replica is a blue/green. Either run 2 replicas per canaried service (fine on k3s for Go/.NET) or use blue/green with traffic mirroring instead of percentage canary.
  • Chaos engineering on one node is cargo-cult. Kill the pod, it restarts, you learned nothing about cross-node failure. Defer until multi-node.
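The low-QPS noise point is easy to quantify. A sketch assuming 0.5 QPS and a 99.5% SLO (both assumed figures, chosen to match the rig's scale):

```python
# Why sub-1-QPS SLOs are statistical noise: at low traffic the hourly
# error budget is only a handful of requests, so one 500 burns a
# double-digit slice of it.
qps = 0.5
slo = 0.995

requests_per_hour = qps * 3600                     # 1800 requests
allowed_failures = requests_per_hour * (1 - slo)   # 9 failures budgeted per hour
single_500_burn = 1 / allowed_failures             # ~11% of the hourly budget
```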

The closed loop: metrics → decisions

sequenceDiagram
    participant P as Prometheus
    participant G as Grafana (alert)
    participant CE as Conductor-E
    participant R as Router
    participant F as Flagger / flagd
    participant D as Discord
    participant A as Repair-E

    Note over P: Burn rate exceeds threshold
    P->>G: SLOBurnFast alert fires
    G->>CE: EscalationRequired event
    CE->>R: Route by severity P1
    R->>D: Post to #admin channel
    R->>F: Enable flagd kill switch<br/>for affected feature
    R->>A: Dispatch Repair-E for diagnosis
    A->>P: Query top-N slow/error traces
    A->>A: Extract code location + commit
    A->>CE: Propose fix PR with attestation
    CE->>F: Run fix through canary pipeline
    F-->>CE: Canary analysis result
    alt canary passes
        F->>CE: Promote, clear flag kill
    else canary fails
        F->>CE: Abort, escalate to P0
    end

Every arrow in this diagram is an event + a metric. The loop is measurable end-to-end. Mean-time-to-detect, mean-time-to-escalate, mean-time-to-propose, mean-time-to-canary, mean-time-to-resolve — all queryable.

Cost observability

Cost is a first-class metric. The cost framework (cost-framework.md) enforces limits; observability surfaces the data.

Dashboards show:

  • $ per agent per day — line chart, week-over-week
  • $ per merged PR — unit economics
  • $ per SWE-bench-Pro pass — efficiency metric
  • Cache hit rate — prompt caching effectiveness (≥80% target on long system prompts)
  • Token budget utilization — per-agent slice, for rate-limit planning
  • Cost per (model × task-class) — route optimization data

Alerts:

  • Daily budget exhaustion projected — any agent projected to consume >80% of its daily cap before end of day
  • Anomalous spend — per-agent spend >2× week-over-week rolling average
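The two alert conditions above as predicates. The linear end-of-day projection is an assumption of this sketch; the thresholds mirror the list.

```python
def projected_daily_spend(spent: float, hours_elapsed: float) -> float:
    # Naive linear projection of end-of-day spend (sketch assumption).
    return spent / hours_elapsed * 24

def budget_exhaustion_projected(spent: float, hours_elapsed: float,
                                daily_cap: float) -> bool:
    # Fires when projected spend crosses 80% of the daily cap.
    return projected_daily_spend(spent, hours_elapsed) > 0.8 * daily_cap

def anomalous_spend(today: float, rolling_week_avg: float) -> bool:
    # Fires when today's spend exceeds 2x the week-over-week rolling average.
    return today > 2 * rolling_week_avg
```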

Agent quality observability

Metrics beyond cost:

  • Goal accuracy — % of dispatched tasks ending in merged PR without human rework
  • Hallucination rate — schema rejections + citation validation failures per session
  • Token efficiency — tokens per successful completion, per task class
  • Change Failure Rate (CFR) — rollback rate, DORA metric adapted
  • Rework rate — commits added to PR after initial draft, excluding Review-E-requested changes
  • Time-to-merge — p50, p99
  • Eval regression count — nightly harness regressions, weekly
  • Refusal accuracy — % of "unanswerable" prompts correctly refused

Stanford/NIST AI Agent Standards (February 2026) formalize the first four. We adopt them plus the rig-specific ones.

What "quality" means operationally

A tier promotion from T1 → T2 requires:

  • goal_accuracy > 85% over last 30d for that task class
  • CFR < 2% over last 30d
  • rework_rate < 10% over last 30d

Measurable, not vibes.
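The promotion gate reduces to a predicate over three trailing-30d metrics. The metric names are the document's; the function shape is an illustrative sketch.

```python
def eligible_for_t2(goal_accuracy: float, cfr: float, rework_rate: float) -> bool:
    """T1 -> T2 promotion gate; all inputs measured over the trailing 30 days
    for the task class in question."""
    return goal_accuracy > 0.85 and cfr < 0.02 and rework_rate < 0.10
```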

Alert rationalization

Alerts must be:

  1. Actionable — the on-call knows what to do
  2. Novel — not a dupe of a currently-firing alert
  3. Correlated to an event we care about — not "metric X moved"

The trusted rig's alert taxonomy:

| Alert | Severity | Action |
|---|---|---|
| SLOBurnFast | P1 | Flip flag kill switch, dispatch Repair-E |
| AdmissionRejectionSpike | P1 | Investigate compromised agent or misconfigured policy |
| AgentStuckRateHigh | P2 | Check if new model deployed; review recent prompt changes |
| BudgetExhaustion | P2 | Suspend dispatch, human investigation |
| ModelOutputDriftDetected | P2 | Run full eval suite; compare to baseline |
| DependencyMalwareDetected | P1 | Quarantine affected service; rotate secrets |
| T3AdmissionWithoutHumanCosign | P0 | Never expected; emergency investigation |
| CanaryFalsePositiveRateHigh | P3 | Tune analysis template; longer consecutiveSuccessLimit |

P0 is DM + @mention (Discord). P1 is #admin channel. P2 is per-issue thread. P3 is dashboard-only. See self-healing.md for severity routing.

Logs

Logs flow through the OTel Collector to Grafana Cloud Loki. Convention:

  • JSON structured log lines
  • Correlation with trace ID via trace_id field
  • Every agent log line carries agent.id, task.id, event.id
  • Every service log line carries service.name, version, request_id
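A log line following the convention above — JSON-structured, trace-correlated, carrying the required agent fields. The IDs here are illustrative placeholders.

```python
import json

# One agent log line per the convention: JSON, trace_id for correlation,
# plus the mandatory agent.id / task.id / event.id fields.
line = json.dumps({
    "level": "info",
    "msg": "tool call completed",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # illustrative
    "agent.id": "dev-e-dotnet-0",
    "task.id": "task-42",
    "event.id": "evt-9001",
})
```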

Log retention: 14 days (Grafana Cloud Free limit). Persistent audit trail goes through Conductor-E's event store (Postgres), not logs.

Traces

OTel traces end-to-end:

[user submits issue #42]
  → Spec-E span (infer tier)
    → LLM call span
    → Conductor-E TaskSpec commit span
  → Dispatch span
    → Dev-E assignment span
      → Tool call span (Read)
      → Tool call span (Grep)
      → ...
      → Tool call span (gh pr create)
        → GitHub API span
  → Review-E span
    → LLM call span
    → PR comment span
  → Kyverno admission span
  → Flagger canary span
    → Prometheus analysis span (×N)
  → Promotion span

Sampling: 100% for T2/T3 tasks, 10% for T1, 1% for T0. Traces are the replay substrate for any post-incident investigation.
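The tier-based policy above can be applied as a deterministic sampling decision — hash the trace ID so every collector makes the same call for the same trace. The rates mirror the text; the hashing scheme is an assumption of this sketch.

```python
import hashlib

# Sampling rates per task tier, per the policy above.
SAMPLE_RATE = {"T0": 0.01, "T1": 0.10, "T2": 1.0, "T3": 1.0}

def keep_trace(trace_id: str, tier: str) -> bool:
    # Hash the trace ID into one of 10,000 buckets; keep the trace if the
    # bucket falls under the tier's rate. Deterministic across collectors.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < SAMPLE_RATE[tier] * 10_000
```

T2/T3 traces are always kept, which is what makes them usable as the audit replay substrate.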

Dashboards users care about

| Dashboard | Audience | Primary questions |
|---|---|---|
| Rig Quality | Humans, weekly review | "Are the agents getting better?" "What's the trend on goal accuracy per agent?" |
| Cost | Humans, daily | "Are we within budget?" "Which agent is expensive this week?" |
| Production Health (SREs) | Humans, on-call | "Are any services burning budget?" "What's the current error budget remaining?" |
| Agent Liveness | Both | "Is each agent online and making progress?" "What's the stuck rate?" |
| Attestation Chain | Humans, audit | "Is every prod change attested?" "Which admissions were rejected and why?" |
| Drift | Humans, weekly | "Did the model behavior change this week?" "Any prompt regressions in CI?" |

All stored as code (dashboards/ in rig-gitops), deployed via Flux.

The minimum-viable observability for an 8GB VM

If the full stack is the target, the first cut is smaller:

  1. OTel Collector (200MB)
  2. Langfuse self-hosted (~800MB with Postgres + ClickHouse) — or Phoenix at a lighter footprint, per the sizing note above
  3. Local Prometheus (1GB)
  4. Grafana Cloud Free (0MB local)

Total: ~2GB. Sufficient for a 1-2 person rig. Grows linearly with service count.

If even this is too much, drop to:

  1. OTel Collector (200MB)
  2. Grafana Cloud free for everything including Prometheus-compatible metrics (0MB local)
  3. No Langfuse — ingest LLM traces into Grafana Tempo with GenAI semantic conventions

Trade-off: Grafana Tempo is not as strong for LLM-specific analysis (prompt versioning, eval scoring) but keeps the memory footprint minimal.

What not to do

  • Ship without observability. Phase 2 of the roadmap (index.md) is explicit: measurement before autonomy. Violating this is violating principle 1.
  • Self-host the full LGTM stack on 8GB. Memory-starves the rig under load. Hybrid is the sane choice.
  • Sample at 1% everywhere. T2/T3 tasks need 100% sampling — they're the audit-critical ones.
  • Alert on every metric deviation. Alert fatigue is the single biggest failure mode of observability programs. Every alert must be actionable.
  • Build dashboards after the fact. Dashboards are built as the metric is added. "We'll add dashboards later" means nobody looks at the metric.

See also