Self-Healing Production — Canary, SLO Gating, Kill Switches, Repair-E
TL;DR
Production bugs get detected, diagnosed, fixed, canaried, and promoted — with humans only at semantic boundaries — within minutes of first SLO impact. Five stages (know → rollback → diagnose → fix → learn). The trusted rig targets stages 0–3 for our services; stage 4 is aspirational. Even very well-engineered teams (Stripe, GitHub, Cloudflare) do not fully achieve stages 2–3 for logic bugs.
Terminology: Repair-E = Dev-E in repair-dispatch mode
This document uses the name "Repair-E" as shorthand for Dev-E dispatched by an SLO-burn alert with a repair-specific system prompt. It is not a separate agent class — same pod class, same model, different trigger + prompt. Earlier drafts framed it as a fifth agent role; honest re-evaluation (see glossary.md) found the event-shaped-boundary test isn't cleanly met. The name is kept as a convenient label for a dispatch mode, not a separate agent.
The ladder
The realistic self-healing ladder, restated from the conversation:
| Stage | Capability | Target |
|---|---|---|
| 0 | Know prod is broken | OTel + Prometheus + SLOs + error-budget math |
| 1 | Auto-rollback on SLO breach | Flagger + flagd, signed images, trustworthy rollback target |
| 2 | Auto-diagnose | Repair-E reads trace + deploy + git blame, proposes fix with confidence score |
| 3 | Auto-fix + canary + progressive rollout | Reproduction harness, DB migration safety, feedback loop |
| 4 | Learn from incidents | Post-incident projection, prior updates, preemptive detection |
Stages 0-1 are engineering that can ship. Stages 2-3 are frontier work: Cursor, Devin, and Anthropic's internal tooling all have pieces, but none publicly demonstrates full coverage. Stage 4 is research.
The trusted rig targets stages 0-3 for our services. Stage 4 is aspirational.
The canonical pipeline
```mermaid
sequenceDiagram
    participant P as Prometheus
    participant A as Alertmanager
    participant CE as Conductor-E
    participant R as Router
    participant RE as Repair-E
    participant FD as flagd
    participant F as Flagger
    participant KV as Kyverno
    participant D as Discord
    P->>A: SLO burn-rate exceeds threshold
    A->>CE: EscalationRequired severity P1
    CE->>R: Route by severity
    R->>D: Post to admin channel
    R->>FD: Flip kill switch for affected feature (~30s)
    R->>RE: Dispatch with trace context
    RE->>RE: Pull top-N slow/error traces
    RE->>RE: Extract code.function + code.filepath
    RE->>RE: git log -S for changed function, last 24h
    RE->>RE: Cross-reference recent deploys
    alt clear diagnosis
        RE->>CE: Propose fix PR (attestation chain)
        CE->>F: Submit Canary
        F->>P: Run metric analysis (success rate, p99 latency)
        alt canary passes
            F->>KV: Promote (attested)
            KV->>KV: Verify signatures
            KV-->>F: Admitted
            F->>F: Progressive rollout 5% → 25% → 50% → 100%
            F->>CE: Promoted
            CE->>FD: Clear kill switch
        else canary fails
            F->>CE: Aborted
            CE->>R: Escalate to P0
        end
    else ambiguous
        RE->>CE: Low confidence — escalate to human
        R->>D: P0 DM with mention
    end
```
Every arrow is an event. Every decision is attested. Every metric is in the dashboards.
Stage 0: Know production is broken
Signals
- Burn rate — current error rate projected forward; Honeycomb-style 4h forward look triggers P1
- Latency p99 regression — 2× week-over-week baseline for 5 minutes
- Error rate spike — 3σ above rolling hourly baseline
- Synthetic probe failure — constant-QPS synthetic traffic catches what user traffic misses at low QPS
- Dependency failure — upstream service unreachable or 5xx spike
- Deployment correlation — within 15 min of a deploy, any of the above is elevated severity
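The deployment-correlation rule can be sketched as a pure function. This is a minimal sketch: `effective_severity` and the P0/P1/P2 strings are illustrative, not the rig's real alert types.

```python
from datetime import datetime, timedelta

# Hypothetical helper: elevate alert severity when the alert fires within
# 15 minutes of a deploy, per the "Deployment correlation" signal above.
DEPLOY_CORRELATION_WINDOW = timedelta(minutes=15)

def effective_severity(base_severity: str, alert_time: datetime,
                       recent_deploy_times: list[datetime]) -> str:
    """Bump P2→P1 and P1→P0 when the alert lands inside the deploy window."""
    correlated = any(
        timedelta(0) <= alert_time - t <= DEPLOY_CORRELATION_WINDOW
        for t in recent_deploy_times
    )
    if not correlated:
        return base_severity
    bump = {"P2": "P1", "P1": "P0"}
    return bump.get(base_severity, base_severity)
```

An alert 10 minutes after a deploy gets elevated; the same alert 30 minutes later keeps its base severity.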
Why synthetic probes matter at small scale
At < 10 QPS, organic traffic is statistical noise. A single 500 burns 10% of an hourly budget. Constant-rate synthetic probes (every 15s, say) provide a signal baseline that doesn't depend on user traffic. Prometheus Blackbox Exporter + scheduled probes hitting the service's health endpoints + key user journeys.
Error budget projection

Per service, compute:

```
budget_remaining = (1 - SLO_target) * total_window_events - failed_events
burn_rate = failed_events_current_rate / failed_events_budgeted_rate
```
Honeycomb's pattern: alert when the current failure rate, projected 4h forward, exceeds budget_remaining — at the current rate, we'd exhaust the budget within 4h. Conductor-E projects this per service and exposes it as GET /api/services/{name}/budget.
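The projection can be sketched in a few lines. Helper names and the example numbers (a 99.9% SLO over 1M window events) are illustrative:

```python
def budget_remaining(slo_target: float, total_window_events: int,
                     failed_events: int) -> float:
    """Error events the SLO still allows in the current window."""
    return (1 - slo_target) * total_window_events - failed_events

def would_exhaust_in(hours: float, failures_per_hour: float,
                     remaining: float) -> bool:
    """Honeycomb-style check: at the current failure rate, does the
    remaining budget run out within `hours`?"""
    return failures_per_hour * hours > remaining

# 99.9% SLO over 1M events budgets 1000 failures; 400 already burned.
remaining = budget_remaining(0.999, 1_000_000, 400)   # ≈ 600 left
fire_p1 = would_exhaust_in(4, 200, remaining)          # 200/h × 4h = 800 > 600
```

At 200 failures/hour the 4h projection (800) exceeds the remaining budget (600), so the P1 fires.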
Stage 1: Auto-rollback on SLO breach
Flagger as the default deploy path
Every service in the rig gets a Flagger Canary resource. No service deploys via raw Deployment apply.
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5        # consecutive failed checks → abort
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: latency-p99
        thresholdRange: { max: 500 }
        interval: 1m
    webhooks:
      - name: conductor-e-notify
        url: http://conductor-e.conductor-e.svc:8080/api/events
        timeout: 5s
        metadata:
          type: CanaryPhase
```
Rollout: 10% canary for 1 minute → analysis passes → 20% → ... → 50% → promotion (matching stepWeight: 10). Any failed analysis aborts; maxWeight: 50 means we never canary past half of traffic before full promotion.
Why Flagger over Argo Rollouts
Flux-native. Wraps existing Deployment resources rather than requiring a swap to a new Rollout CRD. Webhook hooks at every phase (pre-rollout, confirm-promotion, post-rollout) are the natural place to plug Conductor-E decisions. Argo Rollouts is better if ArgoCD is the GitOps tool — it isn't for us, and the recurring Flux-vs-Rollouts field-drift fights confirm this. See tool-choices.md for full evaluation.
flagd as the faster kill switch
YAGNI caveat
Feature flags at our current scale (1-2 humans, few services, no A/B testing need) are arguably overkill — env vars + Kustomize overlays cover deploy-time toggles for zero operational cost. Adopt flagd when we have a concrete runtime-toggle or targeting need. See tool-choices.md for the honest YAGNI discussion and alternatives (Flipt, GrowthBook, PostHog flags). Note that Unleash reached OSS EOL 2025-12-31 — explicitly reject.
Rollback takes 5 minutes (canary re-promotion of the previous version). A feature flag flip takes 30 seconds. For incident response, flag-kill > rollback.
OpenFeature + flagd pattern:
```yaml
# feature-flags.yaml (in Flux-managed repo)
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: payments-flags
spec:
  flagSpec:
    flags:
      new-payment-path:
        state: ENABLED
        variants: { "on": true, "off": false }
        defaultVariant: "on"
```
To kill: a PR flips `defaultVariant` to `off`, Flux reconciles in ~30s, all pods see the new value via the flagd sidecar, and the feature is disabled globally. No deploy, no rollback.
DB migration safety: pgroll (with hedge)
The rule: every migration splits into expand (backward-compatible additive) → deploy dual-write code → contract (destructive) → deploy read-new code. Each as a separate deploy. No NOT NULL on first deploy. No column rename as a single step. No destructive DDL in the same release as code that depends on the new shape.
pgroll automates this for Postgres: creates shadow columns, backfills, installs triggers for dual-write, keeps both schema versions queryable via views. A migration YAML declares the intended final shape; pgroll generates and executes the safe intermediate steps.
Single-vendor bus factor hedge
pgroll is Apache-2.0 but Xata-driven (~27 employees, still operating). If Xata folds, there's no big-company co-maintainer. Hedge (corrected): pgroll migration files are pgroll-specific YAML, not portable SQL. The correct hedge is to commit a parallel SQL trail (pgroll can emit generated SQL) alongside each operation YAML, so schema history stays reconstructible if we ever have to migrate to Flyway or Atlas. See tool-choices.md#db-migration-safety.
Cloudflare Dec 5 2025: the emergency-fast-path lesson
Cloudflare's December 5, 2025 post-mortem: gradual rollouts for code, but their global config system bypassed gradual rollout by design for speed. A config change detonated globally in seconds — 25-minute global outage.
Our rule: every mutable surface (code, config, feature flags, Kyverno policies, AGENTS.md, SLA definitions) flows through the same staged rollout pipeline. No fast path. Enforceable by Kyverno admission policies that deny emergency paths.
Stage 2: Auto-diagnose (Repair-E)
The pipeline
```mermaid
sequenceDiagram
    participant AL as Alert
    participant RE as Repair-E
    participant OT as OTel / Grafana
    participant G as GitHub / git
    participant CE as Conductor-E
    AL->>RE: Invoke with service, alert_type, SLO_target
    RE->>OT: Query top-N slow/error traces last 5min
    OT-->>RE: Spans with code.function, code.filepath, service.version
    RE->>G: git log -S for changed function on service.version
    G-->>RE: Commit history touching that function
    RE->>CE: Query recent deploy events for service
    CE-->>RE: Deploy timestamps + commit SHAs
    RE->>RE: Correlate alert time to deploy time to commit
    RE->>RE: Propose fix (revert or forward-fix)
    RE->>CE: DiagnosisComplete with commit, confidence, action
    alt confidence high
        RE->>G: Open PR with fix
    else
        RE->>CE: Escalate to human (ambiguous)
    end
```
What Repair-E actually sees
Inputs:
- Alert metadata (service, SLO, burn rate, timestamp)
- Top-N traces from OTel (by error or latency)
- Code location from span attributes (code.function, code.filepath, code.namespace)
- Recent commits touching that location (git log -S)
- Recent deploy events from Conductor-E
- Related OpenTelemetry logs via trace_id correlation
- Recent error messages from Sentry/Loki
Outputs:
- Structured diagnosis: {root_cause, affected_commit, confidence}
- Proposed fix: PR or feature-flag-kill decision
- Attestation chain (Repair-E identity, trace IDs consulted, reasoning hash)
Confidence thresholds — derived, not self-reported
LLM self-reported confidence is uncalibrated
Earlier drafts quoted numeric thresholds ("> 0.8 auto-fix, 0.5–0.8 propose, < 0.5 human") as if the LLM could emit a meaningful self-confidence score. It cannot. LLM self-reported confidence is famously uncalibrated: the agent says "95% confident" with the same tone whether it's right or wrong. Confidence is a derived metric, not a self-report.
Confidence is computed from four measurable signals available at diagnosis time, each scored 0–1:
| Signal | How it's measured | Why it's a proxy for correctness |
|---|---|---|
| Deploy-to-alert correlation strength | Minutes between the most recent deploy and the first error signal (from Conductor-E deploy events + Prometheus burn alert) | Shorter gap → the deploy is more likely the root cause |
| Trace-to-commit precision | Does the offending span's code.function + code.filepath appear in the recent commit's diff? (git blame intersection) | Direct topology match = high precision |
| Test coverage of the affected path | Coverage report for the file/function, pulled from CI artifacts | High coverage means the change is less likely to hide a logic bug in covered territory |
| Historical same-signature fix success | Lookup in Conductor-E's incident-history projection: have we seen this trace fingerprint before, and did prior fixes survive 24h? | Known pattern with known resolution |
These four signals combine (configurable weights, default equal) into a single score. Derived, not guessed.
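The combination step can be sketched as a weighted average over the four signals. The signal keys and the equal-weight default follow the table above; the function itself is illustrative, not the rig's real scorer:

```python
# Default equal weights over the four derived signals from the table above.
DEFAULT_WEIGHTS = {
    "deploy_correlation": 0.25,
    "trace_commit_precision": 0.25,
    "test_coverage": 0.25,
    "history_match": 0.25,
}

def derived_confidence(signals: dict[str, float],
                       weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted combination of the four 0-1 signals into one 0-1 score.

    Each signal is clamped to [0, 1]; the result is normalized by the
    total weight so non-equal weight configurations still yield 0-1.
    """
    assert set(signals) == set(weights), "all four signals are required"
    total = sum(weights.values())
    return sum(weights[k] * max(0.0, min(1.0, v))
               for k, v in signals.items()) / total
```

With a tight deploy correlation (0.9), precise trace-to-commit match (0.8), moderate coverage (0.5), and no historical match (0.2), the equal-weight score is 0.6.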
Calibration — the score itself must be measured
Thresholds "auto-fix / propose / human" are not fixed numbers — they are tuned by measuring predicted confidence against actual fix-survives-24h outcomes over rolling N incidents. Process:
- Start conservative: high auto-fix threshold (e.g., 0.85), most incidents go to human
- After each incident, record (predicted score, outcome)
- After 20+ incidents with known outcomes, fit the threshold so the auto-fix bucket shows ≥95% fix-survives-24h
- Propose/human buckets tune similarly (propose bucket: 70-95% survival; human bucket: <70%)
- Re-tune quarterly — never freeze the thresholds, since the model, the codebase, and the failure mode distribution all drift
Until 20+ calibration incidents have landed, everything is human-driven regardless of the predicted score. The auto-fix bucket literally does not exist yet. This is a measurement-gated capability, not a day-one feature.
Current thresholds (provisional until calibration)
- All three buckets route to human until 20+ incidents have calibrated the scoring
- During calibration, Repair-E still proposes (and logs the predicted score), but never auto-fixes — the human either applies, modifies, or rejects
- After calibration, thresholds become real — initially conservative (e.g., auto-fix only above 0.85 if the 95% survival criterion holds)
This is honestly measurement-gated progress, not aspirational numbers treated as real.
Reproduction harness
Before a proposed fix is promoted past canary, it must reproduce the failure in a sandbox:
- Ephemeral namespace — `k create namespace repair-{incident_id}`
- Service deploy — the buggy version
- Traffic replay — recorded requests from the failing trace window, replayed via Envoy tap or service-specific replay tooling
- Assert failure — verify the bug manifests
- Apply fix — deploy Repair-E's proposed patch
- Re-run — verify the fix resolves
Only fixes that reproduce-then-resolve in the harness are dispatched to the real canary. The reproduction harness is the single most important artifact separating "AI-generated looks-like-a-fix" from "verified-to-work fix."
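The reproduce-then-resolve gate can be sketched as pure orchestration. The step callables are injected so the control flow is testable; real implementations would shell out to kubectl, Flux, and the replay tooling. Function and parameter names are illustrative:

```python
from typing import Callable

def reproduce_then_resolve(deploy_buggy: Callable[[], None],
                           replay_shows_failure: Callable[[], bool],
                           apply_fix: Callable[[], None],
                           teardown: Callable[[], None]) -> bool:
    """True only if the bug manifests pre-fix AND disappears post-fix."""
    try:
        deploy_buggy()
        if not replay_shows_failure():
            return False  # couldn't reproduce → fix is unverifiable, no canary
        apply_fix()
        return not replay_shows_failure()  # fix must make the failure vanish
    finally:
        teardown()  # ephemeral namespace always destroyed
```

Note the asymmetry: a non-reproducing bug blocks the canary just as hard as a fix that fails to resolve, because an unverified fix is exactly the "looks-like-a-fix" artifact the harness exists to filter out.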
The state of the art — honesty
As of early 2026, no production system publicly demonstrates full auto-diagnose + auto-reproduce + auto-fix + auto-canary for logic bugs. Components exist:
- Datadog Bits AI SRE, Rootly AI, Resolve.ai, incident.io — AI-assisted diagnosis, human-approved fix
- Cursor Cloud Agents, Cognition Devin — AI-authored fix + PR, human review
- Harness Self-Healing — partial pipeline automation
The trusted rig's claim: we wire these components into a closed loop. The novelty is the integration, not the individual pieces.
Stage 3: Auto-fix + canary + progressive rollout
The feedback loop
Repair-E's fix follows the same canary pipeline as any other change:
- Attestation — Repair-E commits with gitsign, the image builds with SLSA provenance, cosign-signed
- Kyverno admission — verifies the attestation chain, admits to the canary namespace
- Flagger canary — 10% → analysis → 20% → ... → 50% → full promotion
- Post-promotion monitoring — 3× the canary interval after promotion, alert still armed
- Observation — after 24h, Conductor-E queries the post-incident health and updates Repair-E's track record
The fix succeeds only if it survives 24h in production. "Deployed" is not "done."
T3 bypass: never
Even for urgent fixes, T3 changes never bypass the two-attestor policy. A destructive DB migration to fix a production bug requires human co-sign. The principle: production urgency is not a reason to weaken safety guarantees. Kill-switch first (no destructive migration needed), then careful human-driven repair.
What this does not fix
- Bugs in logic that manifest only at scale or under specific data conditions the sandbox doesn't reproduce
- Bugs in shared infrastructure (Conductor-E itself, Flux, cluster networking) — meta-bugs requiring human intervention
- Bugs whose fix requires new business-logic decisions — falls to human semantic judgment
- Novel failure modes with no prior-incident pattern to match — Repair-E's confidence drops below threshold, human-driven
Stage 4: Learn (aspirational)
After every auto-resolved incident:
- Structured incident record: SLI that fired, trace IDs, diff, decision log, time-to-resolve
- Open a GitHub Issue with a templated post-mortem (Rootly/incident.io pattern)
- Tag the fix PR with the incident ID; cross-link
- When a similar signature fires, Repair-E retrieves prior fixes first
Post-incident learning at small scale is a 200-line Conductor-E handler plus a Langfuse eval template that scores future Repair-E proposals against the historical resolution log.
Stage 4 is where the rig starts actively improving itself. It is the goal, not a near-term deliverable.
Blast radius of self-healing
Self-healing expands the rig's autonomy. That expansion must be bounded by tier policy:
| Action | Blast radius | Who decides |
|---|---|---|
| Flip kill switch | Contained (one feature flag) | Repair-E auto, with attested reason |
| Roll back to previous version | Contained (one service) | Repair-E auto |
| Forward-fix PR (code-only) | T1 | Repair-E auto, through canary |
| Forward-fix PR (config) | T1-T2 | Repair-E with Review-E gate; T2 if config spans services |
| Forward-fix PR (schema change) | T2 | Repair-E proposes, human approves interface |
| Forward-fix PR (auth/payments/destructive) | T3 | Human drives, Repair-E assists |
The tier classification at intake (Spec-E) applies at fix-time (Repair-E). The policy is unified.
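The blast-radius table can be encoded as a small fail-closed lookup. The strings are shorthand for illustration, not the rig's real enums; the fail-closed default for unknown fix types is an assumption consistent with the T3 rules above:

```python
# Illustrative encoding of the blast-radius table: fix type → (tier, decider).
FIX_POLICY = {
    "kill_switch":               ("contained", "repair-e-auto"),
    "rollback":                  ("contained", "repair-e-auto"),
    "code_only":                 ("T1",        "repair-e-auto-via-canary"),
    "config":                    ("T1-T2",     "repair-e-with-review-e-gate"),
    "schema":                    ("T2",        "human-approves-interface"),
    "auth_payments_destructive": ("T3",        "human-drives"),
}

def decide(fix_type: str) -> tuple[str, str]:
    """Unknown fix types fail closed: highest tier, human drives."""
    return FIX_POLICY.get(fix_type, ("T3", "human-drives"))
```

Keeping this as one table is what "the policy is unified" means operationally: Spec-E at intake and Repair-E at fix-time consult the same mapping.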
Metrics that mark success
The weekly self-healing dashboard:
- Mean-time-to-detect (MTTD) — from production incident to alert firing
- Mean-time-to-escalate (MTTE) — from alert to Discord notification
- Mean-time-to-diagnose (MTTDiag) — from dispatch to Repair-E diagnosis committed
- Mean-time-to-fix (MTTF) — from diagnosis to canary-promoted fix
- Mean-time-to-resolve (MTTR) — end to end, from first production impact to error-budget restoration
- False-positive rollback rate — canary aborts where no actual bug
- Fix-survives-24h rate — of auto-fixed incidents, % that don't revert within 24h
- Auto-resolve rate — % of incidents resolved without human intervention
- Human-override rate — % of Repair-E proposals humans rejected or modified
Target values
For the trusted rig (end state, not today):
- MTTD: < 1 minute (synthetic probe or burn-rate alert)
- MTTE: < 30 seconds (flagd kill switch flipped via git-commit-to-reconcile)
- MTTDiag: < 5 minutes for T1 bugs with clear trace-to-commit correlation
- MTTR: < 15 minutes for T1 bugs; < 1 hour for T2 requiring human approval
- Fix-survives-24h: > 80%
- Auto-resolve rate: > 60% of T1 incidents
These are aggressive but consistent with published pilot data from Stripe, Cursor, and Datadog Bits AI SRE.
The honest limits
- Zero downtime is aspirational, not absolute. Under catastrophic failure (full cluster outage, Postgres corruption), human intervention is mandatory.
- T3 incidents do not self-heal. Auth bugs, payment bugs, and destructive data issues require human decision.
- Novel bugs are slower. Without prior-incident patterns, Repair-E's confidence is low, and humans drive.
- Reproduction harness coverage is finite. If the bug only manifests under specific load or data, the sandbox may not reproduce it, and auto-fix cannot proceed.
- The feedback loop takes time. An auto-fix that "works" in canary but fails 2 days later is caught by the 24h survival metric, but during those 2 days it's not visible as a failure.
What NOT to do
- No emergency fast path. Even when SLO is burning, every change flows through the same gated pipeline. Cloudflare Dec 2025 is the lesson.
- No skipping canary for "obvious" fixes. The obvious-fix-that-breaks-everything is a documented failure class.
- No LLM-judged automatic promotion. Promotion is SLO-gated by Prometheus analysis, not LLM-reviewed. Deterministic gate.
- No auto-fix on T3. Never. Humans drive.
- No silent rollback. Every rollback emits events, updates dashboards, opens a post-mortem issue.
- No persistent staging environment that diverges from prod. Reproduction harness is ephemeral, created from recent prod state, destroyed after incident. Long-lived staging drifts.
Phase-by-phase exit criteria
Tied to the roadmap in index.md:
Phase 5 (self-healing) exit criteria:

- [ ] Flagger canary operates on every production service
- [ ] flagd feature flag sidecars injected via OpenFeature Operator
- [ ] pgroll gates every DB migration; non-pgroll migrations rejected by CI
- [ ] Error-budget projection live in Conductor-E with per-service breakdown
- [ ] SLO burn-rate alerts route through Conductor-E to Discord with severity routing
- [ ] Repair-E dispatches on P1 alerts, logs diagnosis with attestation
- [ ] Reproduction harness ephemeral-namespace pattern works for at least one service end-to-end
- [ ] Kill-switch latency measured < 60s from commit to pod-observed-change
- [ ] 24h fix-survival rate measured on dashboard
- [ ] Documented runbook for when self-healing fails (on-call procedure)
Only when every checkbox is checked does the phase close.
See also
- index.md
- principles.md — principles 3 (reversible), 4 (execute), 6 (progressive autonomy) in action
- trust-model.md — blast-radius policy applied to fixes
- security.md — attestation chain that self-healing depends on
- observability.md — metrics and alerts that trigger self-healing
- limitations.md — where self-healing stops and humans begin