
Observability — OpenTelemetry, Langfuse, Prometheus, SLOs

TL;DR

Two observability domains joined at Conductor-E: agent (any OTel-emitting runtime — Claude Code, Codex CLI, or Gemini CLI on the default path — into self-hosted Langfuse) and infrastructure (OTel Collector → Grafana Cloud Free + local Prometheus). Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs, so the backend and the provider are both swappable (see provider-portability.md). The hybrid keeps decision-making local (Flagger analysis) while pushing long-term storage to managed. Total added memory: ~2 GB.

Measurement precedes trust (principle 1). This document covers what the rig measures, how it surfaces the data, and what thresholds become decisions.

The stack

graph TB
    subgraph "Agent runtime"
        CC[Claude Code CLI<br/>CLAUDE_CODE_ENABLE_TELEMETRY=1]
        A[Custom agent code]
    end

    subgraph "OTel Collector (one per cluster)"
        OC[OpenTelemetry Collector]
    end

    subgraph "Agent observability — self-hosted"
        LF[Langfuse<br/>Postgres + ClickHouse]
    end

    subgraph "Infra observability — managed"
        GC[Grafana Cloud<br/>Prometheus, Loki, Tempo]
    end

    subgraph "Infra observability — local"
        LP[Local Prometheus<br/>Flagger analysis source]
    end

    subgraph "Control plane"
        CE[Conductor-E<br/>reads aggregate signals]
    end

    subgraph "Dashboards"
        LD[Langfuse UI<br/>agent quality, cost]
        GD[Grafana dashboards<br/>services, SLOs, drift]
    end

    CC -->|OTLP spans + metrics| OC
    A -->|OTLP| OC
    OC -->|LLM traces| LF
    OC -->|infra traces + logs| GC
    OC -->|service metrics| GC
    OC -->|service metrics| LP
    LP -->|analysis queries| FL[Flagger]
    LF -->|cost/quality aggregates| CE
    GC -->|SLO/error-budget| CE
    LF --> LD
    GC --> GD

Self-hosted + managed hybrid is a deliberate choice. Self-hosting everything on an 8GB VM starves the rig under load. Managing everything externally removes the local source of truth for Flagger analysis. The hybrid gets local decision-making for the hot path and managed long-term storage for the rest.

Native OpenTelemetry across agent runtimes

Every major agent runtime we support emits OpenTelemetry spans natively as of late 2025 — Claude Code, Codex CLI, and Gemini CLI all ship with OTel integration and agree on the GenAI semantic conventions. That agreement is the key: swap the runtime or the provider and spans keep flowing into the same backend. See provider-portability.md for the full multi-runtime story. The example below uses Claude Code env vars; the equivalents for Codex CLI (CODEX_OTEL_ENDPOINT) and Gemini CLI (GEMINI_OTEL_EXPORTER_OTLP_ENDPOINT) work identically.

Enable in Claude Code via:

export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_SERVICE_NAME=dev-e-dotnet
export OTEL_RESOURCE_ATTRIBUTES="agent.class=dev-e,agent.stack=dotnet,agent.pod=${HOSTNAME}"

Spans emitted per:

  • Model request (attributes: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons)
  • Tool execution (attributes: tool name, duration, error)
  • Session lifecycle (start, compact, end)

Metadata is always captured. Prompt text and tool-result content are opt-in via CLAUDE_CODE_TELEMETRY_PROMPTS=1 and CLAUDE_CODE_TELEMETRY_TOOL_RESULTS=1. For privacy-sensitive environments, metadata-only suffices for cost attribution and tool-call accounting.

OpenTelemetry GenAI semantic conventions

The conventions for LLM spans are still experimental in 2026. We dual-emit via OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai/dup to survive future stabilization. Convention:

  • Span name: {gen_ai.operation.name} {gen_ai.request.model} (e.g., chat claude-sonnet-4-6)
  • Tool call as child span: name execute_tool {tool_name}
  • Error capture via span status + exception events
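A minimal sketch of what the convention above produces for one chat completion — span name plus `gen_ai.*` attributes — without pulling in an SDK. The attribute values here are illustrative, not from a real session.

```python
# Sketch: span name and attributes a conforming emitter would produce
# for a single model request, per the GenAI semantic conventions.

def genai_span_name(operation: str, model: str) -> str:
    # Convention: "{gen_ai.operation.name} {gen_ai.request.model}"
    return f"{operation} {model}"

span = {
    "name": genai_span_name("chat", "claude-sonnet-4-6"),
    "attributes": {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": "claude-sonnet-4-6",
        "gen_ai.usage.input_tokens": 1200,      # illustrative values
        "gen_ai.usage.output_tokens": 450,
        "gen_ai.response.finish_reasons": ["end_turn"],
    },
}
```

Tool calls follow the same shape with the `execute_tool {tool_name}` naming.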

LLM observability — Langfuse or Phoenix, decided by VM size

Changed from original whitepaper

The original whitepaper named Langfuse unconditionally. Honest re-evaluation (see tool-choices.md) concludes:

  • Langfuse v3 officially requires 4 CPU / 16 GB RAM for the app + separate ClickHouse cluster — too heavy for our 8 GB VM
  • Arize Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick at our scale
  • Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs — the observability backend becomes swappable

The pick below assumes we either (a) stay on 8 GB → use Phoenix, or (b) scale to 16 GB+ → use Langfuse.

Langfuse is the self-hosted OSS choice when the VM supports it (MIT core, ClickHouse + Postgres + Redis). Free tier supports 50k billable units/month on SaaS; no tier limits on-prem. Strongest OSS prompt versioning + evals + cost attribution.

Arize Phoenix is the lighter alternative (ELv2 source-available, OTel-native, SQLite or Postgres). Better fit for our 8 GB VM today. Eval-first product; less polish on team workflows than Langfuse.

See tool-choices.md for the full per-option evaluation — Langfuse, Phoenix, Helicone, LangSmith, Braintrust, W&B Traces, and Cloudflare AI Gateway (which we run as a free secondary observability layer regardless) — including license, owner, pricing, lock-in, and migration cost.

What Langfuse tracks

For each agent session:

  • Cost — tokens × model price, attributed per agent × task × repo
  • Latency — per model call, per tool call
  • Quality signals — tool-call error rate, schema-validation failure rate, session success/fail
  • Prompt versioning — every system prompt change creates a new version, enables A/B across versions
  • Eval results — nightly harness writes scores here
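The cost line item above is tokens × model price, rolled up per agent × task. A minimal sketch of that attribution; the per-million-token prices here are hypothetical placeholders, not actual provider rates.

```python
# Hypothetical price table, $ per 1M tokens — placeholder values only.
PRICE_PER_MTOK = {"claude-sonnet-4-6": {"input": 3.00, "output": 15.00}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: tokens x per-model price."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A session is a list of (model, input_tokens, output_tokens) calls.
session = [("claude-sonnet-4-6", 40_000, 2_000),
           ("claude-sonnet-4-6", 55_000, 3_500)]
session_cost = sum(call_cost(m, i, o) for m, i, o in session)
```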

The Langfuse UI is the primary dashboard for agent-quality questions. "Is Dev-E getting better this week?" is a Langfuse query.

Prometheus (local) — Flagger analysis source

Flagger's AnalysisTemplate queries Prometheus during canary promotion. A canary analysis spec like:

analysis:
  interval: 1m
  threshold: 5
  maxWeight: 50
  stepWeight: 10
  metrics:
  - name: success-rate
    thresholdRange: { min: 99 }
    interval: 1m
    templateRef:
      name: success-rate
      namespace: observability
  - name: latency-p99
    thresholdRange: { max: 500 }
    interval: 1m
    templateRef:
      name: latency-p99

must work even if egress to Grafana Cloud is blipping. Local Prometheus is the source of truth for deploy decisions. Resource budget: 1GB RAM, 10GB disk, 14-day retention.

What Prometheus scrapes

  • kubelet cadvisor metrics (pod health)
  • kube-state-metrics (cluster state)
  • Service endpoints via ServiceMonitor (every service exposes /metrics)
  • Flagger's own metrics

Everything else (logs, traces, long-term metrics) goes to Grafana Cloud.

Grafana Cloud Free tier — managed long-term

Grafana Cloud Free (April 2026 limits):

  • 10k series metrics
  • 50GB logs
  • 50GB traces
  • 14-day retention
  • 3 users

Suffices for a small rig. Upgrade path is linear cost-per-ingest when traffic grows.

What goes to Grafana Cloud

  • All traces (service + agent)
  • All logs (service + agent)
  • Long-term metrics (retention beyond local Prometheus's 14 days, via remote_write)

Grafana dashboards are built for:

  • Service-level: rate, errors, duration (RED), SLO burn rate, error budget remaining
  • Agent-level (cross-referenced with Langfuse): tokens per task per agent, cost per agent per day
  • Drift: model output hash delta, prompt-eval regression alerts

SLOs and error budgets

SLO = target reliability over a rolling window. Error budget = (100% − SLO) × window — the amount of unreliability you're allowed to spend before breaching the target.
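A worked instance of the definition, assuming a 99.5% SLO over a rolling 28-day window:

```python
# Error budget = (1 - SLO) x window.
slo = 0.995
window_minutes = 28 * 24 * 60            # 40,320 minutes in the rolling window
budget_minutes = (1 - slo) * window_minutes   # ~201.6 minutes of allowed unavailability
```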

SLOs for the rig's own services — to be set, not asserted

Honest retraction

Earlier drafts listed specific SLO numbers (99.9% for Conductor-E POST, 99.5% for the GET endpoint, etc.) as though they were commitments. The honest position: we don't serve customer traffic, so no external promise is being made, and the right SLO targets should be derived from operating data, not chosen up front. Asserting "99.9%" before we've run the service is plant-a-flag framing, not measurement.

What we commit to today:

  • Each service has an SLI defined (success rate, latency, goal accuracy — a concrete, dashboarded metric)
  • Each service has a provisional SLO set to a conservative default and revisited monthly
  • Budget-gated rollout discipline applies regardless of the specific target — if budget is burning, only fixes merge
  • The ceiling for any single-VM service is ~99.9% (a single 8 GB VM cannot credibly claim four 9s without multi-region failover we don't have)

Concrete per-service targets are filled in by the on-call engineer after the service has enough real traffic to calibrate — not in this document. Example template the engineer fills in:

| Service | SLI | Provisional SLO | Status |
|---|---|---|---|
| Conductor-E /api/events POST | success rate over 28d | 99.5% | under observation |
| Conductor-E /api/assignments/next GET | success rate over 28d | 99% | under observation |
| Dev-E session → merged PR | goal accuracy | 80% | under observation |
| Review-E session → approve/reject | consistency with human-judged (sampled) | 85% | under observation |
| Dev-E repair-dispatch auto-fix | fix-survives-24h | 70% | not yet deployed |
| Flagger canary promotion | false-positive rollback rate | <10% | under observation |

"Provisional" means "we're measuring this and will re-set the number after 60 days of operating data." That's the discipline — not the specific numerical target.

Burn-rate alerts (Honeycomb pattern)

Page when the current burn rate, if continued, would exhaust the monthly budget within about two days — the classic fast-burn threshold of 14.4×. This is the right shape for an AI agent to consume — it's forward-looking and rate-based, not threshold-based.

# Prometheus alert rule
- alert: SLOBurnFast
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) > ( (1 - 0.999) * 14.4 )  # 14.4x burn exhausts a 28d budget in ~2 days (~2%/hour)
  for: 2m
  labels: { severity: P1 }
  annotations:
    summary: "Error budget will be exhausted within ~2 days at current burn rate"
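The threshold in the rule above can be derived directly. Burn rate B means errors arrive at B× the budgeted rate, so a 28-day budget lasts 672h / B:

```python
# Derivation of the fast-burn alert threshold.
slo = 0.999
burn_rate = 14.4
window_hours = 28 * 24                          # 672h rolling window

error_rate_threshold = (1 - slo) * burn_rate    # the expr in the alert rule
hours_to_exhaustion = window_hours / burn_rate  # ~46.7h, i.e. about 2 days
```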

Burn-rate alerts fire EscalationRequired events into Conductor-E; the escalation router decides routing by severity.

What works vs what fails at small scale

  • SLOs at sub-1-QPS are statistical noise. A single 500 burns ~10% of an hourly budget. Use 7-day windows minimum and define SLIs over synthetic probes (constant QPS) rather than user traffic. Honeycomb-style burn-rate alerts break down below ~10 QPS.
  • Multi-replica required for true canary. Canary with 1 replica is a blue/green. Either run 2 replicas per canaried service (fine on k3s for Go/.NET) or use blue/green with traffic mirroring instead of percentage canary.
  • Chaos engineering on one node is cargo-cult. Kill the pod, it restarts, you learned nothing about cross-node failure. Defer until multi-node.
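The low-QPS noise point is easy to quantify. A sketch assuming 0.5 QPS and a 99.5% SLO (both assumed figures, chosen to match the rig's scale):

```python
# Why sub-1-QPS SLOs are statistical noise: at low traffic the hourly
# error budget is only a handful of requests, so one 500 burns a
# double-digit slice of it.
qps = 0.5
slo = 0.995

requests_per_hour = qps * 3600                     # 1800 requests
allowed_failures = requests_per_hour * (1 - slo)   # 9 failures budgeted per hour
single_500_burn = 1 / allowed_failures             # ~11% of the hourly budget
```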

The closed loop: metrics → decisions

sequenceDiagram
    participant P as Prometheus
    participant G as Grafana (alert)
    participant CE as Conductor-E
    participant R as Router
    participant F as Flagger / flagd
    participant D as Discord
    participant A as Repair-E

    Note over P: Burn rate exceeds threshold
    P->>G: SLOBurnFast alert fires
    G->>CE: EscalationRequired event
    CE->>R: Route by severity P1
    R->>D: Post to #admin channel
    R->>F: Enable flagd kill switch<br/>for affected feature
    R->>A: Dispatch Repair-E for diagnosis
    A->>P: Query top-N slow/error traces
    A->>A: Extract code location + commit
    A->>CE: Propose fix PR with attestation
    CE->>F: Run fix through canary pipeline
    F-->>CE: Canary analysis result
    alt canary passes
        F->>CE: Promote, clear flag kill
    else canary fails
        F->>CE: Abort, escalate to P0
    end

Every arrow in this diagram is an event + a metric. The loop is measurable end-to-end. Mean-time-to-detect, mean-time-to-escalate, mean-time-to-propose, mean-time-to-canary, mean-time-to-resolve — all queryable.

Cost observability

Cost is a first-class metric. The cost framework (cost-framework.md) enforces limits; observability surfaces the data.

Dashboards show:

  • $ per agent per day — line chart, week-over-week
  • $ per merged PR — unit economics
  • $ per SWE-bench-Pro pass — efficiency metric
  • Cache hit rate — prompt caching effectiveness (≥80% target on long system prompts)
  • Token budget utilization — per-agent slice, for rate-limit planning
  • Cost per (model × task-class) — route optimization data

Alerts:

  • Daily budget exhaustion projected — any agent projected to consume >80% of its daily cap before end of day
  • Anomalous spend — per-agent spend >2× week-over-week rolling average
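The two alert conditions above as predicates. The linear end-of-day projection is an assumption of this sketch; the thresholds mirror the list.

```python
def projected_daily_spend(spent: float, hours_elapsed: float) -> float:
    # Naive linear projection of end-of-day spend (sketch assumption).
    return spent / hours_elapsed * 24

def budget_exhaustion_projected(spent: float, hours_elapsed: float,
                                daily_cap: float) -> bool:
    # Fires when projected spend crosses 80% of the daily cap.
    return projected_daily_spend(spent, hours_elapsed) > 0.8 * daily_cap

def anomalous_spend(today: float, rolling_week_avg: float) -> bool:
    # Fires when today's spend exceeds 2x the week-over-week rolling average.
    return today > 2 * rolling_week_avg
```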

Agent quality observability

Metrics beyond cost:

  • Goal accuracy — % of dispatched tasks ending in merged PR without human rework
  • Hallucination rate — schema rejections + citation validation failures per session
  • Token efficiency — tokens per successful completion, per task class
  • Change Failure Rate (CFR) — rollback rate, DORA metric adapted
  • Rework rate — commits added to PR after initial draft, excluding Review-E-requested changes
  • Time-to-merge — p50, p99
  • Eval regression count — nightly harness regressions, weekly
  • Refusal accuracy — % of "unanswerable" prompts correctly refused

Stanford/NIST AI Agent Standards (February 2026) formalize the first four. We adopt them plus the rig-specific ones.

What "quality" means operationally

A tier promotion from T1 → T2 requires:

  • goal_accuracy > 85% over last 30d for that task class
  • CFR < 2% over last 30d
  • rework_rate < 10% over last 30d

Measurable, not vibes.
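The promotion gate reduces to a predicate over three trailing-30d metrics. The metric names are the document's; the function shape is an illustrative sketch.

```python
def eligible_for_t2(goal_accuracy: float, cfr: float, rework_rate: float) -> bool:
    """T1 -> T2 promotion gate; all inputs measured over the trailing 30 days
    for the task class in question."""
    return goal_accuracy > 0.85 and cfr < 0.02 and rework_rate < 0.10
```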

Alert rationalization

Alerts must be:

  1. Actionable — the on-call knows what to do
  2. Novel — not a dupe of a currently-firing alert
  3. Correlated to an event we care about — not "metric X moved"

The trusted rig's alert taxonomy:

| Alert | Severity | Action |
|---|---|---|
| SLOBurnFast | P1 | Flip flag kill switch, dispatch Repair-E |
| AdmissionRejectionSpike | P1 | Investigate compromised agent or misconfigured policy |
| AgentStuckRateHigh | P2 | Check if new model deployed; review recent prompt changes |
| BudgetExhaustion | P2 | Suspend dispatch, human investigation |
| ModelOutputDriftDetected | P2 | Run full eval suite; compare to baseline |
| DependencyMalwareDetected | P1 | Quarantine affected service; rotate secrets |
| T3AdmissionWithoutHumanCosign | P0 | Never expected; emergency investigation |
| CanaryFalsePositiveRateHigh | P3 | Tune analysis template; longer consecutiveSuccessLimit |

P0 is DM + @mention (Discord). P1 is #admin channel. P2 is per-issue thread. P3 is dashboard-only. See self-healing.md for severity routing.

Logs

Logs flow through the OTel Collector to Grafana Cloud Loki. Convention:

  • JSON structured log lines
  • Correlation with trace ID via trace_id field
  • Every agent log line carries agent.id, task.id, event.id
  • Every service log line carries service.name, version, request_id
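A log line following the convention above — JSON-structured, trace-correlated, carrying the required agent fields. The IDs here are illustrative placeholders.

```python
import json

# One agent log line per the convention: JSON, trace_id for correlation,
# plus the mandatory agent.id / task.id / event.id fields.
line = json.dumps({
    "level": "info",
    "msg": "tool call completed",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # illustrative
    "agent.id": "dev-e-dotnet-0",
    "task.id": "task-42",
    "event.id": "evt-9001",
})
```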

Log retention: 14 days (Grafana Cloud Free limit). Persistent audit trail goes through Conductor-E's event store (Postgres), not logs.

Traces

OTel traces end-to-end:

[user submits issue #42]
  → Spec-E span (infer tier)
    → LLM call span
    → Conductor-E TaskSpec commit span
  → Dispatch span
    → Dev-E assignment span
      → Tool call span (Read)
      → Tool call span (Grep)
      → ...
      → Tool call span (gh pr create)
        → GitHub API span
  → Review-E span
    → LLM call span
    → PR comment span
  → Kyverno admission span
  → Flagger canary span
    → Prometheus analysis span (×N)
  → Promotion span

Sampling: 100% for T2/T3 tasks, 10% for T1, 1% for T0. Traces are the replay substrate for any post-incident investigation.
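The tier-based policy above can be applied as a deterministic sampling decision — hash the trace ID so every collector makes the same call for the same trace. The rates mirror the text; the hashing scheme is an assumption of this sketch.

```python
import hashlib

# Sampling rates per task tier, per the policy above.
SAMPLE_RATE = {"T0": 0.01, "T1": 0.10, "T2": 1.0, "T3": 1.0}

def keep_trace(trace_id: str, tier: str) -> bool:
    # Hash the trace ID into one of 10,000 buckets; keep the trace if the
    # bucket falls under the tier's rate. Deterministic across collectors.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < SAMPLE_RATE[tier] * 10_000
```

T2/T3 traces are always kept, which is what makes them usable as the audit replay substrate.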

Dashboards users care about

| Dashboard | Audience | Primary questions |
|---|---|---|
| Rig Quality | Humans, weekly review | "Are the agents getting better?" "What's the trend on goal accuracy per agent?" |
| Cost | Humans, daily | "Are we within budget?" "Which agent is expensive this week?" |
| Production Health (SREs) | Humans, on-call | "Are any services burning budget?" "What's the current error budget remaining?" |
| Agent Liveness | Both | "Is each agent online and making progress?" "What's the stuck rate?" |
| Attestation Chain | Humans, audit | "Is every prod change attested?" "Which admissions were rejected and why?" |
| Drift | Humans, weekly | "Did the model behavior change this week?" "Any prompt regressions in CI?" |

All stored as code (dashboards/ in rig-gitops), deployed via Flux.

The minimum-viable observability for an 8GB VM

If the full stack is the target, the first cut is smaller:

  1. OTel Collector (200MB)
  2. Langfuse self-hosted (~800MB with Postgres + ClickHouse) — or Phoenix at a lighter footprint, per the sizing note above
  3. Local Prometheus (1GB)
  4. Grafana Cloud Free (0MB local)

Total: ~2GB. Sufficient for a 1-2 person rig. Grows linearly with service count.

If even this is too much, drop to:

  1. OTel Collector (200MB)
  2. Grafana Cloud free for everything including Prometheus-compatible metrics (0MB local)
  3. No Langfuse — ingest LLM traces into Grafana Tempo with GenAI semantic conventions

Trade-off: Grafana Tempo is not as strong for LLM-specific analysis (prompt versioning, eval scoring) but keeps the memory footprint minimal.

What not to do

  • Ship without observability. Phase 2 of the roadmap (index.md) is explicit: measurement before autonomy. Violating this is violating principle 1.
  • Self-host the full LGTM stack on 8GB. Memory-starves the rig under load. Hybrid is the sane choice.
  • Sample at 1% everywhere. T2/T3 tasks need 100% sampling — they're the audit-critical ones.
  • Alert on every metric deviation. Alert fatigue is the single biggest failure mode of observability programs. Every alert must be actionable.
  • Build dashboards after the fact. Dashboards are built as the metric is added. "We'll add dashboards later" means nobody looks at the metric.

See also