Observability — OpenTelemetry, Langfuse, Prometheus, SLOs¶
TL;DR
Two observability domains joined at Conductor-E: agent (any OTel-emitting runtime — Claude Code, Codex CLI, or Gemini CLI on the default path — into self-hosted Langfuse) and infrastructure (OTel Collector → Grafana Cloud Free + local Prometheus). Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs, so the backend and the provider are both swappable (see provider-portability.md). The hybrid keeps decision-making local (Flagger analysis) while pushing long-term storage to managed. Total added memory: ~2 GB.
Measurement precedes trust (principle 1). This document covers what the rig measures, how it surfaces the data, and what thresholds become decisions.
The stack¶
```mermaid
graph TB
    subgraph "Agent runtime"
        CC[Claude Code CLI<br/>CLAUDE_CODE_ENABLE_TELEMETRY=1]
        A[Custom agent code]
    end
    subgraph "OTel Collector (one per cluster)"
        OC[OpenTelemetry Collector]
    end
    subgraph "Agent observability — self-hosted"
        LF[Langfuse<br/>Postgres + ClickHouse]
    end
    subgraph "Infra observability — managed"
        GC[Grafana Cloud<br/>Prometheus, Loki, Tempo]
    end
    subgraph "Infra observability — local"
        LP[Local Prometheus<br/>Flagger analysis source]
    end
    subgraph "Control plane"
        CE[Conductor-E<br/>reads aggregate signals]
    end
    subgraph "Dashboards"
        LD[Langfuse UI<br/>agent quality, cost]
        GD[Grafana dashboards<br/>services, SLOs, drift]
    end
    CC -->|OTLP spans + metrics| OC
    A -->|OTLP| OC
    OC -->|LLM traces| LF
    OC -->|infra traces + logs| GC
    OC -->|service metrics| GC
    OC -->|service metrics| LP
    LP -->|analysis queries| FL[Flagger]
    LF -->|cost/quality aggregates| CE
    GC -->|SLO/error-budget| CE
    LF --> LD
    GC --> GD
```
Self-hosted + managed hybrid is a deliberate choice. Self-hosting everything on an 8GB VM starves the rig under load. Managing everything externally removes local source of truth for Flagger analysis. The hybrid gets local decision-making for the hot path and managed long-term storage for the rest.
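The split the diagram shows can be sketched as a Collector config. A minimal sketch under assumptions: exporter names and endpoint URLs are illustrative, and the LLM-vs-infra trace split is shown as two pipelines — a real setup would partition spans with a filter processor or routing connector keyed on `gen_ai.*` attributes.

```yaml
# Illustrative OTel Collector config — endpoints and names are placeholders.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  otlphttp/langfuse:        # LLM traces → self-hosted Langfuse (OTLP endpoint)
    endpoint: http://langfuse.observability.svc/api/public/otel
  otlphttp/grafana:         # infra traces, logs, long-term metrics → managed
    endpoint: https://otlp-gateway.grafana.net/otlp
  prometheusremotewrite/local:  # hot-path metrics → local Prometheus (Flagger)
    endpoint: http://prometheus.observability.svc:9090/api/v1/write
service:
  pipelines:
    traces/llm:             # in practice: gate on gen_ai.* attributes here
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/langfuse]
    traces/infra:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/grafana, prometheusremotewrite/local]
```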
Native OpenTelemetry across agent runtimes¶
Every major agent runtime we support emits OpenTelemetry spans natively as of late 2025 — Claude Code, Codex CLI, and Gemini CLI all ship with OTel integration and agree on the GenAI semantic conventions. That agreement is the key: swap the runtime or the provider, and spans keep flowing into the same backend. See provider-portability.md for the full multi-runtime story. The example below uses Claude Code env vars; the Codex CLI (CODEX_OTEL_ENDPOINT) and Gemini CLI (GEMINI_OTEL_EXPORTER_OTLP_ENDPOINT) equivalents work identically.
Enable in Claude Code via:
```bash
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_SERVICE_NAME=dev-e-dotnet
export OTEL_RESOURCE_ATTRIBUTES="agent.class=dev-e,agent.stack=dotnet,agent.pod=${HOSTNAME}"
```
Spans are emitted per:
- Model request (attributes: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.response.finish_reasons`)
- Tool execution (attributes: tool name, duration, error)
- Session lifecycle (start, compact, end)
Metadata is always captured. Prompt text and tool-result content are opt-in via CLAUDE_CODE_TELEMETRY_PROMPTS=1 and CLAUDE_CODE_TELEMETRY_TOOL_RESULTS=1. For privacy-sensitive environments, metadata-only suffices for cost attribution and tool-call accounting.
OpenTelemetry GenAI semantic conventions¶
The conventions for LLM spans are still experimental in 2026. We dual-emit via `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai/dup` to survive future stabilization. Convention:
- Span name: `{gen_ai.operation.name} {gen_ai.request.model}` (e.g., `chat claude-sonnet-4-6`)
- Tool call as a child span named `execute_tool {tool_name}`
- Error capture via span status + exception events
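The naming convention above is mechanical enough to pin down in code — a minimal illustration of the rule, not an SDK API:

```python
# Span names per the GenAI semantic conventions, as described above.
def genai_span_name(operation: str, model: str) -> str:
    """Model-call span: '{gen_ai.operation.name} {gen_ai.request.model}'."""
    return f"{operation} {model}"

def tool_span_name(tool_name: str) -> str:
    """Tool-call child span: 'execute_tool {tool_name}'."""
    return f"execute_tool {tool_name}"

print(genai_span_name("chat", "claude-sonnet-4-6"))  # chat claude-sonnet-4-6
print(tool_span_name("Read"))                        # execute_tool Read
```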
LLM observability — Langfuse or Phoenix, decided by VM size¶
Changed from original whitepaper
The original whitepaper named Langfuse unconditionally. Honest re-evaluation (see tool-choices.md) concludes:
- Langfuse v3 officially requires 4 CPU / 16 GB RAM for the app + separate ClickHouse cluster — too heavy for our 8 GB VM
- Arize Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick at our scale
- Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs — the observability backend becomes swappable
The pick below assumes we either (a) stay on 8 GB → use Phoenix, or (b) scale to 16 GB+ → use Langfuse.
Langfuse is the self-hosted OSS choice when the VM supports it (MIT core, ClickHouse + Postgres + Redis). Free tier supports 50k billable units/month on SaaS; no tier limits on-prem. Strongest OSS prompt versioning + evals + cost attribution.
Arize Phoenix is the lighter alternative (ELv2 source-available, OTel-native, SQLite or Postgres). Better fit for our 8 GB VM today. Eval-first product; less polish on team workflows than Langfuse.
See tool-choices.md for the full per-option evaluation — Langfuse vs Phoenix vs Helicone vs LangSmith vs Braintrust vs W&B Traces vs Cloudflare AI Gateway (the last of which we run as a free secondary observability layer regardless) — including license, owner, pricing, lock-in, and migration cost.
What Langfuse tracks¶
For each agent session:
- Cost — tokens × model price, attributed per agent × task × repo
- Latency — per model call, per tool call
- Quality signals — tool-call error rate, schema-validation failure rate, session success/fail
- Prompt versioning — every system prompt change creates a new version, enables A/B across versions
- Eval results — nightly harness writes scores here
The Langfuse UI is the primary dashboard for agent-quality questions. "Is Dev-E getting better this week?" is a Langfuse query.
Prometheus (local) — Flagger analysis source¶
Flagger queries Prometheus during canary promotion, via the MetricTemplates referenced from the canary's analysis spec. An analysis block like:

```yaml
analysis:
  interval: 1m
  threshold: 5
  maxWeight: 50
  stepWeight: 10
  metrics:
    - name: success-rate
      thresholdRange: { min: 99 }
      interval: 1m
      templateRef:
        name: success-rate
        namespace: observability
    - name: latency-p99
      thresholdRange: { max: 500 }
      interval: 1m
      templateRef:
        name: latency-p99
```
must work even if egress to Grafana Cloud is blipping. Local Prometheus is the source of truth for deploy decisions. Resource budget: 1GB RAM, 10GB disk, 14-day retention.
What Prometheus scrapes¶
- kubelet/cAdvisor metrics (pod health)
- kube-state-metrics (cluster state)
- Service endpoints via ServiceMonitor (every service exposes /metrics)
- Flagger's own metrics
Everything else (logs, traces, long-term metrics) goes to Grafana Cloud.
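The per-service scrape is declared as a ServiceMonitor. A sketch with illustrative names (service label, port name, and namespace are placeholders, not the rig's actual manifests):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: conductor-e          # illustrative
  namespace: observability
spec:
  selector:
    matchLabels:
      app: conductor-e       # matches the Service's labels
  endpoints:
    - port: http             # named port on the Service
      path: /metrics
      interval: 30s
```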
Grafana Cloud Free tier — managed long-term¶
Grafana Cloud Free (April 2026 limits):
- 10k series metrics
- 50GB logs
- 50GB traces
- 14-day retention
- 3 users
Suffices for a small rig. Upgrade path is linear cost-per-ingest when traffic grows.
What goes to Grafana Cloud¶
- All traces (service + agent)
- All logs (service + agent)
- Long-term metrics (retention beyond local Prometheus's 14-day window, via `remote_write`)
Grafana dashboards are built for:
- Service-level: rate, errors, duration (RED), SLO burn rate, error budget remaining
- Agent-level (cross-referenced with Langfuse): tokens per task per agent, cost per agent per day
- Drift: model output hash delta, prompt-eval regression alerts
SLOs and error budgets¶
SLO = target reliability over a rolling window. Error budget = the (100% − SLO) slice of that window you are allowed to fail.
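The arithmetic, worked for a 28-day window (the window used throughout this section):

```python
# Error budget = (1 - SLO) x window, expressed here as minutes of full downtime.
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of total failure the budget allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))  # ~40.3 min over 28 days
print(error_budget_minutes(0.995))  # ~201.6 min over 28 days
```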
SLOs for the rig's own services — to be set, not asserted¶
Honest retraction
Earlier drafts listed specific SLO numbers (99.9% for Conductor-E POST, 99.5% for the GET endpoint, etc.) as though they were commitments. The honest position: we don't serve customer traffic, so no external promise is being made, and the right SLO targets should be derived from operating data, not chosen up front. Asserting "99.9%" before we've run the service is plant-a-flag framing, not measurement.
What we commit to today:
- Each service has a SLI defined (success rate, latency, goal accuracy — concrete metric, dashboarded)
- Each service has a provisional SLO set to a conservative default and revisited monthly
- Budget-gated rollout discipline applies regardless of the specific target — if budget is burning, only fixes merge
- The ceiling for any single-VM service is ~99.9% (a single 8 GB VM cannot credibly claim four 9s without multi-region failover we don't have)
Concrete per-service targets are filled in by the on-call engineer after the service has enough real traffic to calibrate — not in this document. Example template the engineer fills in:
| Service | SLI | Provisional SLO | Status |
|---|---|---|---|
| Conductor-E `/api/events` POST | success rate over 28d | provisional 99.5% | under observation |
| Conductor-E `/api/assignments/next` GET | success rate over 28d | provisional 99% | under observation |
| Dev-E session → merged PR | goal accuracy | provisional 80% | under observation |
| Review-E session → approve/reject | consistency with human judgment (sampled) | provisional 85% | under observation |
| Dev-E repair-dispatch auto-fix | fix survives 24h | provisional 70% | not yet deployed |
| Flagger canary promotion | false-positive rollback rate | provisional <10% | under observation |
"Provisional" means "we're measuring this and will re-set the number after 60 days of operating data." That's the discipline — not the specific numerical target.
Burn-rate alerts (Honeycomb pattern)¶
Page when the error budget is burning fast enough that, sustained, the monthly budget would be gone in days rather than weeks. A burn rate of 14.4 — the standard fast-burn threshold — exhausts a 28-day budget in under two days. This is the right shape for an AI agent to consume: it's forward-looking and rate-based, not threshold-based.
```yaml
# Prometheus alert rule
- alert: SLOBurnFast
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) > ( (1 - 0.999) * 14.4 )  # 14.4 = fast-burn factor: exhausts a 28d budget in <2 days
  for: 2m
  labels: { severity: P1 }
  annotations:
    summary: "Error budget burning at 14x the sustainable rate"
```
Burn-rate alerts fire EscalationRequired events into Conductor-E; the escalation router decides routing by severity.
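The alert's arithmetic, sketched under the standard burn-rate definition (observed error rate divided by the allowed error rate); function names are illustrative:

```python
# Burn rate: how many times faster than "sustainable" the budget is burning.
def burn_rate(error_rate: float, slo: float) -> float:
    return error_rate / (1.0 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """At a constant burn rate, hours until the window's budget is gone."""
    return window_days * 24 / rate

r = burn_rate(error_rate=0.0144, slo=0.999)  # 14.4x the sustainable error rate
print(r)                                      # ~14.4
print(hours_to_exhaustion(r))                 # ~46.7 h — under two days
```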
What works vs what fails at small scale¶
- SLOs at sub-1-QPS are statistical noise. A single 500 burns ~10% of an hourly budget. Use 7-day windows minimum and define SLIs over synthetic probes (constant QPS) rather than user traffic. Honeycomb-style burn-rate alerts break down below ~10 QPS.
- Multi-replica required for true canary. Canary with 1 replica is a blue/green. Either run 2 replicas per canaried service (fine on k3s for Go/.NET) or use blue/green with traffic mirroring instead of percentage canary.
- Chaos engineering on one node is cargo-cult. Kill the pod, it restarts, you learned nothing about cross-node failure. Defer until multi-node.
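The noise argument is just arithmetic — the fraction of an hourly error budget that one single 500 consumes, as a function of traffic. Illustrative numbers, not the rig's measured QPS:

```python
# Share of an hour's error budget consumed by a single failed request.
def budget_share_of_one_error(qps: float, slo: float) -> float:
    allowed_errors_per_hour = qps * 3600 * (1.0 - slo)
    return 1.0 / allowed_errors_per_hour

# At 0.5 QPS with a 99.5% SLO, one 500 eats ~11% of the hour's budget...
print(budget_share_of_one_error(0.5, 0.995))    # ~0.111
# ...while at 100 QPS the same 500 is statistical noise.
print(budget_share_of_one_error(100.0, 0.995))  # ~0.00056
```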
The closed loop: metrics → decisions¶
sequenceDiagram
participant P as Prometheus
participant G as Grafana (alert)
participant CE as Conductor-E
participant R as Router
participant F as Flagger / flagd
participant D as Discord
participant A as Repair-E
Note over P: Burn rate exceeds threshold
P->>G: SLOBurnFast alert fires
G->>CE: EscalationRequired event
CE->>R: Route by severity P1
R->>D: Post to #admin channel
R->>F: Enable flagd kill switch<br/>for affected feature
R->>A: Dispatch Repair-E for diagnosis
A->>P: Query top-N slow/error traces
A->>A: Extract code location + commit
A->>CE: Propose fix PR with attestation
CE->>F: Run fix through canary pipeline
F-->>CE: Canary analysis result
alt canary passes
F->>CE: Promote, clear flag kill
else canary fails
F->>CE: Abort, escalate to P0
end
Every arrow in this diagram is an event + a metric. The loop is measurable end-to-end. Mean-time-to-detect, mean-time-to-escalate, mean-time-to-propose, mean-time-to-canary, mean-time-to-resolve — all queryable.
Cost observability¶
Cost is a first-class metric. The cost framework (cost-framework.md) enforces limits; observability surfaces the data.
Dashboards show:
- $ per agent per day — line chart, week-over-week
- $ per merged PR — unit economics
- $ per SWE-bench-Pro pass — efficiency metric
- Cache hit rate — prompt caching effectiveness (≥80% target on long system prompts)
- Token budget utilization — per-agent slice, for rate-limit planning
- Cost per (model × task-class) — route optimization data
Alerts:
- Daily budget exhaustion projected — any agent on track to exceed daily cap by >80% before end of day
- Anomalous spend — per-agent spend >2× week-over-week rolling average
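The anomalous-spend rule above as a pure function — threshold and window per the alert; the function name and data shape are illustrative:

```python
# Flag an agent whose daily spend exceeds 2x its trailing-week average.
def anomalous_spend(today: float, trailing_week: list[float], factor: float = 2.0) -> bool:
    baseline = sum(trailing_week) / len(trailing_week)
    return today > factor * baseline

week = [2.0, 2.5, 1.8, 2.2, 2.0, 2.6, 1.9]   # $/day, illustrative
print(anomalous_spend(9.0, week))  # True — 9.0 > 2x the ~2.14 average
print(anomalous_spend(3.0, week))  # False
```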
Agent quality observability¶
Metrics beyond cost:
- Goal accuracy — % of dispatched tasks ending in merged PR without human rework
- Hallucination rate — schema rejections + citation validation failures per session
- Token efficiency — tokens per successful completion, per task class
- Change Failure Rate (CFR) — rollback rate, DORA metric adapted
- Rework rate — commits added to PR after initial draft, excluding Review-E-requested changes
- Time-to-merge — p50, p99
- Eval regression count — nightly harness regressions, weekly
- Refusal accuracy — % of "unanswerable" prompts correctly refused
Stanford/NIST AI Agent Standards (February 2026) formalize the first four. We adopt them plus the rig-specific ones.
What "quality" means operationally¶
A tier promotion from T1 → T2 requires:
- goal_accuracy > 85% over last 30d for that task class
- CFR < 2% over last 30d
- rework_rate < 10% over last 30d
Measurable, not vibes.
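The gate above as a pure function — thresholds are the ones stated; the field names are illustrative:

```python
# T1 -> T2 promotion gate: all three 30-day metrics must clear threshold.
def eligible_for_t2(goal_accuracy: float, cfr: float, rework_rate: float) -> bool:
    return goal_accuracy > 0.85 and cfr < 0.02 and rework_rate < 0.10

print(eligible_for_t2(0.91, 0.01, 0.07))  # True
print(eligible_for_t2(0.91, 0.03, 0.07))  # False — CFR over budget
```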
Alert rationalization¶
Alerts must be:
- Actionable — the on-call knows what to do
- Novel — not a dupe of a currently-firing alert
- Correlated to an event we care about — not "metric X moved"
The trusted rig's alert taxonomy:
| Alert | Severity | Action |
|---|---|---|
| `SLOBurnFast` | P1 | Flip flag kill switch, dispatch Repair-E |
| `AdmissionRejectionSpike` | P1 | Investigate compromised agent or misconfigured policy |
| `AgentStuckRateHigh` | P2 | Check if new model deployed; review recent prompt changes |
| `BudgetExhaustion` | P2 | Suspend dispatch, human investigation |
| `ModelOutputDriftDetected` | P2 | Run full eval suite; compare to baseline |
| `DependencyMalwareDetected` | P1 | Quarantine affected service; rotate secrets |
| `T3AdmissionWithoutHumanCosign` | P0 | Never expected; emergency investigation |
| `CanaryFalsePositiveRateHigh` | P3 | Tune analysis template; longer consecutiveSuccessLimit |
P0 is DM + @mention (Discord). P1 is #admin channel. P2 is per-issue thread. P3 is dashboard-only. See self-healing.md for severity routing.
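The severity routing above, as a sketch (the channel names mirror the text; this is not self-healing.md's actual router):

```python
# Severity -> Discord destination, per the routing rules above.
ROUTES = {
    "P0": "DM + @mention",
    "P1": "#admin channel",
    "P2": "per-issue thread",
    "P3": "dashboard only",
}

def route(severity: str) -> str:
    # Unknown severities degrade to the quietest destination.
    return ROUTES.get(severity, "dashboard only")

print(route("P0"))  # DM + @mention
print(route("P1"))  # #admin channel
```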
Logs¶
Logs flow through the OTel Collector to Grafana Cloud Loki. Convention:
- JSON structured log lines
- Correlation with trace ID via a `trace_id` field
- Every agent log line carries `agent.id`, `task.id`, `event.id`
- Every service log line carries `service.name`, `version`, `request_id`
Log retention: 14 days (Grafana Cloud Free limit). Persistent audit trail goes through Conductor-E's event store (Postgres), not logs.
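A log line following the convention above — a minimal sketch; all field values are illustrative:

```python
import json

# Build one agent log line per the convention: JSON, trace-correlated,
# with the mandatory agent.id / task.id / event.id fields.
def agent_log_line(msg: str, trace_id: str, agent_id: str, task_id: str, event_id: str) -> str:
    return json.dumps({
        "msg": msg,
        "trace_id": trace_id,   # correlates the line to its OTel trace
        "agent.id": agent_id,
        "task.id": task_id,
        "event.id": event_id,
    })

line = agent_log_line("tool call failed", "4bf92f35", "dev-e-dotnet-0", "task-42", "evt-9001")
print(line)
```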
Traces¶
OTel traces end-to-end:
```
[user submits issue #42]
→ Spec-E span (infer tier)
  → LLM call span
  → Conductor-E TaskSpec commit span
→ Dispatch span
→ Dev-E assignment span
  → Tool call span (Read)
  → Tool call span (Grep)
  → ...
  → Tool call span (gh pr create)
  → GitHub API span
→ Review-E span
  → LLM call span
  → PR comment span
→ Kyverno admission span
→ Flagger canary span
  → Prometheus analysis span (×N)
  → Promotion span
```
Sampling: 100% for T2/T3 tasks, 10% for T1, 1% for T0. Traces are the replay substrate for any post-incident investigation.
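The tier-based policy above as a head-sampler sketch (a simple random sampler for illustration; a real deployment would likely use the Collector's sampling processors):

```python
import random

# Sampling rates per task tier: 100% for T2/T3, 10% for T1, 1% for T0.
RATES = {"T0": 0.01, "T1": 0.10, "T2": 1.0, "T3": 1.0}

def keep_trace(tier: str, rng: random.Random) -> bool:
    # Unknown tiers fail safe: keep everything.
    return rng.random() < RATES.get(tier, 1.0)

rng = random.Random(7)
kept = sum(keep_trace("T1", rng) for _ in range(10_000))
print(kept)  # roughly 1,000 of 10,000 T1 traces survive sampling
```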
Dashboards users care about¶
| Dashboard | Audience | Primary questions |
|---|---|---|
| Rig Quality | Humans, weekly review | "Are the agents getting better?" "What's the trend on goal accuracy per agent?" |
| Cost | Humans, daily | "Are we within budget?" "Which agent is expensive this week?" |
| Production Health (SREs) | Humans, on-call | "Are any services burning budget?" "What's the current error budget remaining?" |
| Agent Liveness | Both | "Is each agent online and making progress?" "What's the stuck rate?" |
| Attestation Chain | Humans, audit | "Is every prod change attested?" "Which admissions were rejected and why?" |
| Drift | Humans, weekly | "Did the model behavior change this week?" "Any prompt regressions in CI?" |
All stored as code (dashboards/ in rig-gitops), deployed via Flux.
The minimum-viable observability for an 8GB VM¶
If the full stack is the target, the first cut is smaller:
- OTel Collector (200MB)
- LLM trace backend (~800MB — Phoenix on today's 8 GB VM, or Langfuse with Postgres + ClickHouse once the VM scales; see the Langfuse-or-Phoenix decision above)
- Local Prometheus (1GB)
- Grafana Cloud Free (0MB local)
Total: ~2GB. Sufficient for a 1-2 person rig. Grows linearly with service count.
If even this is too much, drop to:
- OTel Collector (200MB)
- Grafana Cloud free for everything including Prometheus-compatible metrics (0MB local)
- No Langfuse — ingest LLM traces into Grafana Tempo with GenAI semantic conventions
Trade-off: Grafana Tempo is not as strong for LLM-specific analysis (prompt versioning, eval scoring) but keeps the memory footprint minimal.
What not to do¶
- Ship without observability. Phase 2 of the roadmap (index.md) is explicit: measurement before autonomy. Violating this is violating principle 1.
- Self-host the full LGTM stack on 8GB. Memory-starves the rig under load. Hybrid is the sane choice.
- Sample at 1% everywhere. T2/T3 tasks need 100% sampling — they're the audit-critical ones.
- Alert on every metric deviation. Alert fatigue is the single biggest failure mode of observability programs. Every alert must be actionable.
- Build dashboards after the fact. Dashboards are built as the metric is added. "We'll add dashboards later" means nobody looks at the metric.
See also¶
- index.md
- principles.md — principle 1 (measurable) operationalized here
- self-healing.md — the loop observability feeds into
- cost-framework.md — how cost observability becomes cost enforcement
- quality-and-evaluation.md — how quality observability feeds the eval harness
- drift-detection.md — drift signals derived from observability