
Self-Healing Production — Canary, SLO Gating, Kill Switches, Repair-E

TL;DR

Production bugs get detected, diagnosed, fixed, canaried, and promoted — with humans only at semantic boundaries — within minutes of first SLO impact. Five stages (know → rollback → diagnose → fix → learn). The trusted rig targets stages 0–3 for our services; stage 4 is aspirational. Even very well-engineered teams (Stripe, GitHub, Cloudflare) do not fully achieve stages 2–3 for logic bugs.

Terminology: Repair-E = Dev-E in repair-dispatch mode

This document uses the name "Repair-E" as shorthand for Dev-E dispatched by an SLO-burn alert with a repair-specific system prompt. It is not a separate agent class — same pod class, same model, different trigger + prompt. Earlier drafts framed it as a fifth agent role; honest re-evaluation (see glossary.md) found the event-shaped-boundary test isn't cleanly met. The name is kept as a convenient label for a dispatch mode, not a separate agent.

The ladder

The realistic self-healing ladder, restated from the conversation:

| Stage | Capability | Target |
|---|---|---|
| 0 | Know prod is broken | OTel + Prometheus + SLOs + error-budget math |
| 1 | Auto-rollback on SLO breach | Flagger + flagd, signed images, trustworthy rollback target |
| 2 | Auto-diagnose | Repair-E reads trace + deploy + git blame, proposes fix with confidence score |
| 3 | Auto-fix + canary + progressive rollout | Reproduction harness, DB migration safety, feedback loop |
| 4 | Learn from incidents | Post-incident projection, prior updates, preemptive detection |

Stages 0-1 are engineering that can ship today. Stages 2-3 are frontier work: Cursor, Devin, and Anthropic's internal tooling all have pieces, but none publicly demonstrates full coverage. Stage 4 is research.

The trusted rig targets stages 0-3 for our services. Stage 4 is aspirational.

The canonical pipeline

sequenceDiagram
    participant P as Prometheus
    participant A as Alertmanager
    participant CE as Conductor-E
    participant R as Router
    participant RE as Repair-E
    participant FD as flagd
    participant F as Flagger
    participant KV as Kyverno
    participant D as Discord

    P->>A: SLO burn-rate exceeds threshold
    A->>CE: EscalationRequired severity P1
    CE->>R: Route by severity
    R->>D: Post to admin channel
    R->>FD: Flip kill switch for affected feature (~30s)
    R->>RE: Dispatch with trace context
    RE->>RE: Pull top-N slow/error traces
    RE->>RE: Extract code.function + code.filepath
    RE->>RE: git log with -S for changed function, last 24h
    RE->>RE: Cross-reference recent deploys
    alt clear diagnosis
        RE->>CE: Propose fix PR (attestation chain)
        CE->>F: Submit Canary
        F->>P: Run canary analysis (success rate, p99 latency)
        alt canary passes
            F->>KV: Promote (attested)
            KV->>KV: Verify signatures
            KV-->>F: Admitted
            F->>F: Progressive rollout 10% → 20% → ... → 50% → 100%
            F->>CE: Promoted
            CE->>FD: Clear kill switch
        else canary fails
            F->>CE: Aborted
            CE->>R: Escalate to P0
        end
    else ambiguous
        RE->>CE: Low confidence — escalate to human
        R->>D: P0 DM with mention
    end

Every arrow is an event. Every decision is attested. Every metric is in the dashboards.

Stage 0: Know production is broken

Signals

  • Burn rate — current error rate projected forward; Honeycomb-style 4h-forward-look triggers P1
  • Latency p99 regression — 2× week-over-week baseline for 5 minutes
  • Error rate spike — 3σ above rolling hourly baseline
  • Synthetic probe failure — constant-QPS synthetic traffic catches what user traffic misses at low QPS
  • Dependency failure — upstream service unreachable or 5xx spike
  • Deployment correlation — within 15 min of a deploy, any of the above is elevated severity

Why synthetic probes matter at small scale

At < 10 QPS, organic traffic is statistical noise. A single 500 burns 10% of an hourly budget. Constant-rate synthetic probes (every 15s, say) provide a signal baseline that doesn't depend on user traffic. Prometheus Blackbox Exporter + scheduled probes hitting the service's health endpoints + key user journeys.
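
A minimal sketch of such a prober, assuming Python with the requests and prometheus_client libraries; the target URLs and port are hypothetical, and in practice the Blackbox Exporter plus a Prometheus probe config plays this role:

```python
# synthetic_probe.py - constant-rate probe of service health endpoints (sketch).
import time
import requests
from prometheus_client import Counter, Histogram, start_http_server

PROBE_TARGETS = [  # hypothetical endpoints: health check + one key user journey
    "http://payments-service:8080/healthz",
    "http://payments-service:8080/api/payments/smoke",
]

probe_total = Counter("synthetic_probe_total", "Probe attempts", ["target", "result"])
probe_latency = Histogram("synthetic_probe_duration_seconds", "Probe latency", ["target"])

def probe(target: str) -> None:
    start = time.monotonic()
    try:
        resp = requests.get(target, timeout=2)
        result = "ok" if resp.status_code < 400 else "error"
    except requests.RequestException:
        result = "error"
    probe_latency.labels(target=target).observe(time.monotonic() - start)
    probe_total.labels(target=target, result=result).inc()

if __name__ == "__main__":
    start_http_server(9105)   # Prometheus scrapes these counters from this port
    while True:
        for t in PROBE_TARGETS:
            probe(t)
        time.sleep(15)         # constant 15s cadence, independent of user traffic
```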

Error budget projection

Per service, compute:

budget_remaining = (1 - SLO_target) * total_window_events - failed_events
burn_rate = failed_events_current_rate / failed_events_budgeted_rate

Honeycomb's pattern: alert when failed_events_current_rate * 4h > budget_remaining (at the current failure rate, the remaining budget is exhausted within 4h). Conductor-E projects this per service and exposes it as GET /api/services/{name}/budget.
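
A minimal sketch of the projection in Python, assuming the event counts and rates have already been pulled from Prometheus; variable names mirror the formulas above:

```python
# error_budget.py - per-service budget projection, mirroring the formulas above (sketch).
from dataclasses import dataclass

@dataclass
class BudgetStatus:
    budget_remaining: float   # failed events the budget can still absorb this window
    burn_rate: float          # current failure rate / budgeted failure rate
    exhausts_within_4h: bool  # Honeycomb-style 4h forward-look alert condition

def project(slo_target: float,                   # e.g. 0.999
            total_window_events: int,            # events observed so far in the SLO window
            failed_events: int,                  # failures observed so far in the window
            failed_events_current_rate: float,   # failures/hour, recent (e.g. last 5m)
            failed_events_budgeted_rate: float,  # failures/hour the budget allows
            ) -> BudgetStatus:
    budget_remaining = (1 - slo_target) * total_window_events - failed_events
    burn_rate = failed_events_current_rate / max(failed_events_budgeted_rate, 1e-9)
    projected_4h_failures = failed_events_current_rate * 4
    return BudgetStatus(
        budget_remaining=budget_remaining,
        burn_rate=burn_rate,
        exhausts_within_4h=projected_4h_failures > budget_remaining,
    )
```

Conductor-E would serve the resulting BudgetStatus from GET /api/services/{name}/budget.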

Stage 1: Auto-rollback on SLO breach

Flagger as the default deploy path

Every service in the rig gets a Flagger Canary resource. No service deploys via raw Deployment apply.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5           # failed metric checks before abort + rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate   # Flagger built-in metric (percent)
      thresholdRange: { min: 99 }
      interval: 1m
    - name: request-duration       # Flagger built-in metric, request latency in ms
      thresholdRange: { max: 500 }
      interval: 1m
    webhooks:
    - name: conductor-e-notify
      type: event                  # notification-only webhook, not a gating check
      url: http://conductor-e.conductor-e.svc:8080/api/events
      timeout: 5s
      metadata:
        type: CanaryPhase

Rollout: the canary starts at 10% of traffic for 1 minute; each passing analysis run steps it up by 10% until maxWeight: 50, then Flagger promotes to 100%. Five failed metric checks (threshold: 5) abort and roll back; maxWeight: 50 means we never canary past half of traffic before full promotion.
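
The conductor-e-notify webhook above is how Flagger reports each canary phase to Conductor-E. A minimal sketch of the receiving endpoint, assuming FastAPI; the payload fields (name, namespace, phase, metadata) follow Flagger's webhook payload but should be verified against the Flagger version in use:

```python
# conductor-e webhook receiver for Flagger canary phase notifications (sketch).
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/api/events")
async def canary_event(request: Request) -> dict:
    payload = await request.json()
    # Flagger posts the canary name, namespace, phase, and the webhook's metadata;
    # field names are an assumption to verify against the deployed Flagger version.
    event = {
        "type": payload.get("metadata", {}).get("type", "CanaryPhase"),
        "canary": payload.get("name"),
        "namespace": payload.get("namespace"),
        "phase": payload.get("phase"),
    }
    # Conductor-E would persist this as an attested event and fan it out
    # (Discord routing, dashboards, Repair-E dispatch on failed phases).
    return {"accepted": True, "event": event}
```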

Why Flagger over Argo Rollouts

Flux-native. Wraps existing Deployment resources rather than requiring a swap to a new Rollout CRD. Webhook hooks at every phase (pre-rollout, confirm-promotion, post-rollout) are the natural place to plug Conductor-E decisions. Argo Rollouts is better if ArgoCD is the GitOps tool — it isn't for us, and the recurring Flux-vs-Rollouts field-drift fights confirm this. See tool-choices.md for full evaluation.

flagd as the faster kill switch

YAGNI caveat

Feature flags at our current scale (1-2 humans, few services, no A/B testing need) are arguably overkill — env vars + Kustomize overlays cover deploy-time toggles for zero operational cost. Adopt flagd when we have a concrete runtime-toggle or targeting need. See tool-choices.md for the honest YAGNI discussion and alternatives (Flipt, GrowthBook, PostHog flags). Note that Unleash reached OSS EOL 2025-12-31 — explicitly reject.

Rollback takes 5 minutes (canary re-promotion of the previous version). A feature flag flip takes 30 seconds. For incident response, flag-kill > rollback.

OpenFeature + flagd pattern:

# feature-flags.yaml (in Flux-managed repo)
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: payments-flags
spec:
  flagSpec:
    flags:
      new-payment-path:
        state: ENABLED
        variants: { on: true, off: false }
        defaultVariant: on

To kill: PR changes defaultVariant: off, Flux reconciles in ~30s, all pods see the new flag via the flagd sidecar, the feature is disabled globally. No deploy, no rollback.
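
On the service side the flag is read through the OpenFeature SDK backed by the flagd sidecar. A minimal sketch, assuming the Python OpenFeature SDK with the flagd provider package and the default sidecar port; the package import path, port, and handler functions are assumptions:

```python
# Reading the kill-switch flag via the flagd sidecar (sketch).
from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider

# The flagd sidecar injected by the OpenFeature Operator listens on localhost.
api.set_provider(FlagdProvider(host="localhost", port=8013))
client = api.get_client()

def handle_payment(request):
    # Default to the old path if flagd is unreachable: fail safe, not fail open.
    if client.get_boolean_value("new-payment-path", False):
        return new_payment_path(request)    # hypothetical handler
    return legacy_payment_path(request)     # hypothetical handler
```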

DB migration safety: pgroll (with hedge)

The rule: every migration splits into expand (backward-compatible additive) → deploy dual-write code → deploy read-new code → contract (destructive). Each as a separate deploy. No NOT NULL on first deploy. No column rename as a single step. No destructive DDL in the same release as code that depends on the new shape.

pgroll automates this for Postgres: creates shadow columns, backfills, installs triggers for dual-write, keeps both schema versions queryable via views. A migration YAML declares the intended final shape; pgroll generates and executes the safe intermediate steps.

Single-vendor bus factor hedge

pgroll is Apache-2.0 but Xata-driven (~27 employees, still operating). If Xata folds, there's no big-company co-maintainer. Hedge (corrected): pgroll migration files are pgroll-specific YAML, not portable SQL. The correct hedge is to commit a parallel SQL trail (pgroll can emit generated SQL) alongside each operation YAML, so schema history stays reconstructible if we ever have to migrate to Flyway or Atlas. See tool-choices.md#db-migration-safety.

Cloudflare Dec 5 2025: the emergency-fast-path lesson

Cloudflare's December 5, 2025 post-mortem: gradual rollouts for code, but their global config system bypassed gradual rollout by design for speed. A config change detonated globally in seconds — 25-minute global outage.

Our rule: every mutable surface (code, config, feature flags, Kyverno policies, AGENTS.md, SLA definitions) flows through the same staged rollout pipeline. No fast path. Enforceable by Kyverno admission policies that deny emergency paths.

Stage 2: Auto-diagnose (Repair-E)

The pipeline

sequenceDiagram
    participant AL as Alert
    participant RE as Repair-E
    participant OT as OTel / Grafana
    participant G as GitHub / git
    participant CE as Conductor-E

    AL->>RE: Invoke with service, alert_type, SLO_target
    RE->>OT: Query top-N slow/error traces last 5min
    OT-->>RE: Spans with code.function, code.filepath, service.version
    RE->>G: git log -S for changed function on service.version
    G-->>RE: Commit history touching that function
    RE->>CE: Query recent deploy events for service
    CE-->>RE: Deploy timestamps + commit SHAs
    RE->>RE: Correlate alert time to deploy time to commit
    RE->>RE: Propose fix (revert or forward-fix)
    RE->>CE: DiagnosisComplete with commit, confidence, action
    alt confidence high
        RE->>G: Open PR with fix
    else
        RE->>CE: Escalate to human (ambiguous)
    end

What Repair-E actually sees

Inputs:

  • Alert metadata (service, SLO, burn rate, timestamp)
  • Top-N traces from OTel (by error or latency)
  • Code location from span attributes (code.function, code.filepath, code.namespace)
  • Recent commits touching that location (git log -S)
  • Recent deploy events from Conductor-E
  • Related OpenTelemetry logs via trace_id correlation
  • Recent error messages from Sentry/Loki

Outputs:

  • Structured diagnosis: {root_cause, affected_commit, confidence}
  • Proposed fix: PR or feature-flag-kill decision
  • Attestation chain (Repair-E identity, trace IDs consulted, reasoning hash)
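
A minimal sketch of the correlation step that turns those inputs into root-cause candidates, assuming Python; query_error_traces and recent_deploys are hypothetical stand-ins for the OTel/Grafana and Conductor-E queries:

```python
# repair_e_diagnose.py - correlate failing spans with recent commits and deploys (sketch).
import subprocess
from datetime import datetime, timedelta

def commits_touching(function_name: str, since_hours: int = 24) -> list[str]:
    # git log -S finds commits whose diffs add or remove the function name.
    out = subprocess.run(
        ["git", "log", f"--since={since_hours} hours ago", "-S", function_name,
         "--pretty=format:%H"],
        capture_output=True, text=True, check=True,
    )
    return [sha for sha in out.stdout.splitlines() if sha]

def root_cause_candidates(service: str, alert_time: datetime) -> list[dict]:
    traces = query_error_traces(service, window_minutes=5)   # hypothetical OTel/Grafana query
    deploys = recent_deploys(service, since_hours=24)         # hypothetical Conductor-E query
    candidates = []
    for span in traces:
        for sha in commits_touching(span["code.function"]):
            for deploy in deploys:
                # Suspicious: the commit shipped in a deploy shortly before the alert fired.
                if deploy["commit"] == sha and timedelta(0) <= alert_time - deploy["time"] <= timedelta(hours=2):
                    candidates.append({"commit": sha,
                                       "function": span["code.function"],
                                       "file": span["code.filepath"],
                                       "deployed_at": deploy["time"]})
    return candidates
```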

Confidence thresholds — derived, not self-reported

LLM self-reported confidence is uncalibrated

Earlier drafts quoted numeric thresholds ("> 0.8 auto-fix, 0.5–0.8 propose, < 0.5 human") as if the LLM could emit a meaningful self-confidence score. It cannot. LLM self-reported confidence is famously uncalibrated: the agent says "95% confident" with the same tone whether it's right or wrong. Confidence is a derived metric, not a self-report.

Confidence is computed from four measurable signals available at diagnosis time, each scored 0–1:

| Signal | How it's measured | Why it's a proxy for correctness |
|---|---|---|
| Deploy-to-alert correlation strength | Minutes between the most recent deploy and first error signal (from Conductor-E deploy events + Prometheus burn alert) | Shorter gap → deploy is more likely the root cause |
| Trace-to-commit precision | Does the offending span's code.function + code.filepath appear in the recent commit's diff? (git blame intersection) | Direct topology match = high precision |
| Test coverage of the affected path | Coverage report for the file/function, pulled from CI artifacts | High coverage means the change is less likely a logic bug in covered territory |
| Historical same-signature fix success | Lookup in Conductor-E's incident-history projection: have we seen this trace fingerprint before, and did prior fixes survive 24h? | Known pattern with known resolution |
These four signals combine (configurable weights, default equal) into a single score. Derived, not guessed.
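
A minimal sketch of that combination, assuming Python; each signal is already normalised to 0–1 upstream, and the weights are the configurable (default equal) weights:

```python
# confidence.py - derived confidence from four measurable signals (sketch).
from dataclasses import dataclass

@dataclass
class DiagnosisSignals:
    deploy_alert_correlation: float   # 1.0 = alert fired right after a deploy
    trace_commit_precision: float     # 1.0 = failing span's file/function is in the suspect diff
    test_coverage: float              # line coverage of the affected path, from CI artifacts
    historical_fix_success: float     # prior same-fingerprint fixes that survived 24h

DEFAULT_WEIGHTS = {
    "deploy_alert_correlation": 0.25,
    "trace_commit_precision": 0.25,
    "test_coverage": 0.25,
    "historical_fix_success": 0.25,
}

def confidence(signals: DiagnosisSignals, weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    total = sum(weights.values())
    score = sum(weights[name] * getattr(signals, name) for name in weights)
    return score / total   # weighted mean in [0, 1]; derived, never self-reported
```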

Calibration — the score itself must be measured

Thresholds "auto-fix / propose / human" are not fixed numbers — they are tuned by measuring predicted confidence against actual fix-survives-24h outcomes over rolling N incidents. Process:

  1. Start conservative: high auto-fix threshold (e.g., 0.85), most incidents go to human
  2. After each incident, record (predicted score, outcome)
  3. After 20+ incidents with known outcomes, fit the threshold so the auto-fix bucket shows ≥95% fix-survives-24h
  4. Propose/human buckets tune similarly (propose bucket: 70-95% survival; human bucket: <70%)
  5. Re-tune quarterly — never freeze the thresholds, since the model, the codebase, and the failure mode distribution all drift

Until 20+ calibration incidents have landed, everything is human-driven regardless of the predicted score. The auto-fix bucket literally does not exist yet. This is a measurement-gated capability, not a day-one feature.
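
A minimal sketch of the threshold fit, assuming Python and a rolling log of (predicted score, fix-survived-24h) pairs; the 20-incident minimum and 95% survival bar come from the process above:

```python
# calibration.py - tune the auto-fix threshold from observed outcomes (sketch).
def fit_autofix_threshold(history: list[tuple[float, bool]],
                          min_incidents: int = 20,
                          required_survival: float = 0.95) -> float | None:
    """Return the lowest threshold whose auto-fix bucket shows the required
    fix-survives-24h rate, or None (stay fully human-driven) if there is not
    enough calibration data or no bucket meets the bar."""
    if len(history) < min_incidents:
        return None
    for threshold in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95):  # permissive to conservative
        bucket = [survived for score, survived in history if score >= threshold]
        if bucket and sum(bucket) / len(bucket) >= required_survival:
            return threshold
    return None
```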

Current thresholds (provisional until calibration)

  • All three buckets route to human until 20+ incidents have calibrated the scoring
  • During calibration, Repair-E still proposes (and logs the predicted score), but never auto-fixes — the human either applies, modifies, or rejects
  • After calibration, thresholds become real — initially conservative (e.g., auto-fix only above 0.85 if the 95% survival criterion holds)

This is honestly measurement-gated progress, not aspirational numbers treated as real.

Reproduction harness

Before a proposed fix is promoted past canary, it must reproduce the failure in a sandbox:

  • Ephemeral namespace — kubectl create namespace repair-{incident_id}
  • Service deploy — the buggy version
  • Traffic replay — recorded requests from the failing trace window, replayed via Envoy tap or service-specific replay tooling
  • Assert failure — verify the bug manifests
  • Apply fix — deploy Repair-E's proposed patch
  • Re-run — verify the fix resolves

Only fixes that reproduce-then-resolve in the harness are dispatched to the real canary. The reproduction harness is the single most important artifact separating "AI-generated looks-like-a-fix" from "verified-to-work fix."
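
A minimal sketch of the harness driver, assuming Python with kubectl on the path; the replay helpers are placeholders for Envoy tap or service-specific replay tooling, and the manifest paths are hypothetical:

```python
# repro_harness.py - reproduce-then-resolve gate before a fix reaches the real canary (sketch).
import subprocess

def kubectl(*args: str) -> None:
    subprocess.run(["kubectl", *args], check=True)

def reproduce_then_resolve(incident_id: str, buggy_manifest: str, fixed_manifest: str) -> bool:
    ns = f"repair-{incident_id}"
    kubectl("create", "namespace", ns)
    try:
        # 1. Deploy the buggy version and replay traffic from the failing trace window.
        kubectl("apply", "-n", ns, "-f", buggy_manifest)
        if not replay_and_expect_failure(ns):      # hypothetical replay tooling
            return False   # bug doesn't reproduce here: auto-fix cannot proceed
        # 2. Apply the proposed fix and replay the same traffic again.
        kubectl("apply", "-n", ns, "-f", fixed_manifest)
        return replay_and_expect_success(ns)        # hypothetical replay tooling
    finally:
        kubectl("delete", "namespace", ns, "--wait=false")
```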

The state of the art — honesty

As of early 2026, no production system publicly demonstrates full auto-diagnose + auto-reproduce + auto-fix + auto-canary for logic bugs. Components exist:

  • Datadog Bits AI SRE, Rootly AI, Resolve.ai, incident.io — AI-assisted diagnosis, human-approved fix
  • Cursor Cloud Agents, Cognition Devin — AI-authored fix + PR, human review
  • Harness Self-Healing — partial pipeline automation

The trusted rig's claim: we wire these components into a closed loop. The novelty is the integration, not the individual pieces.

Stage 3: Auto-fix + canary + progressive rollout

The feedback loop

Repair-E's fix follows the same canary pipeline as any other change:

  1. Attestation — Repair-E commits with gitsign, the image builds with SLSA provenance, cosign-signed
  2. Kyverno admission — verifies the attestation chain, admits to the canary namespace
  3. Flagger canary — 10% → analysis → 20% → ... → 50% → 100%
  4. Post-promotion monitoring — 3× the canary interval after promotion, alert still armed
  5. Observation — after 24h, Conductor-E queries the post-incident health and updates Repair-E's track record

The fix succeeds only if it survives 24h in production. "Deployed" is not "done."
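
A minimal sketch of the 24h survival check, assuming Python, the Prometheus HTTP API, and an illustrative error-ratio query; the metric names and threshold are assumptions:

```python
# fix_survival.py - 24h post-promotion check feeding Repair-E's track record (sketch).
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"   # assumed in-cluster address

def fix_survived_24h(service: str, max_error_rate: float = 0.001) -> bool:
    # Error ratio over the 24h since promotion; metric names are illustrative.
    query = (
        f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[24h]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[24h]))'
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    error_rate = float(results[0]["value"][1]) if results else 0.0
    return error_rate <= max_error_rate
```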

T3 bypass: never

Even for urgent fixes, T3 changes never bypass the two-attestor policy. A destructive DB migration to fix a production bug requires human co-sign. The principle: production urgency is not a reason to weaken safety guarantees. Kill-switch first (no destructive migration needed), then careful human-driven repair.

What this does not fix

  • Bugs in logic that manifest only at scale or under specific data conditions the sandbox doesn't reproduce
  • Bugs in shared infrastructure (Conductor-E itself, Flux, cluster networking) — meta-bugs requiring human intervention
  • Bugs whose fix requires new business-logic decisions — falls to human semantic judgment
  • Novel failure modes with no prior-incident pattern to match — Repair-E's confidence drops below threshold, human-driven

Stage 4: Learn (aspirational)

After every auto-resolved incident:

  • Structured incident record: SLI that fired, trace IDs, diff, decision log, time-to-resolve
  • Open a GitHub Issue with a templated post-mortem (Rootly/incident.io pattern)
  • Tag the fix PR with the incident ID; cross-link
  • When a similar signature fires, Repair-E retrieves prior fixes first

Post-incident learning at small scale is a 200-line Conductor-E handler plus a Langfuse eval template that scores future Repair-E proposals against the historical resolution log.
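
A minimal sketch of that handler's record shape and the prior-fix lookup, assuming Python; the fingerprint scheme (service + failing function + alert type) is an assumption:

```python
# incident_history.py - structured incident records and prior-fix retrieval (sketch).
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    incident_id: str
    sli_fired: str                 # which SLI/alert triggered the incident
    trace_ids: list[str]
    fingerprint: str               # e.g. "payments-service:charge_card:error-rate" (assumed scheme)
    fix_commit: str
    decision_log: list[str] = field(default_factory=list)
    time_to_resolve_s: int = 0
    survived_24h: bool | None = None

class IncidentHistory:
    def __init__(self) -> None:
        self._records: list[IncidentRecord] = []

    def record(self, incident: IncidentRecord) -> None:
        self._records.append(incident)

    def prior_fixes(self, fingerprint: str) -> list[IncidentRecord]:
        # Repair-E consults these first when a similar signature fires again.
        return [r for r in self._records
                if r.fingerprint == fingerprint and r.survived_24h]
```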

Stage 4 is where the rig starts actively improving itself. It is the goal, not a near-term deliverable.

Blast radius of self-healing

Self-healing expands the rig's autonomy. That expansion must be bounded by tier policy:

| Action | Blast radius | Who decides |
|---|---|---|
| Flip kill switch | Contained (one feature flag) | Repair-E auto, with attested reason |
| Roll back to previous version | Contained (one service) | Repair-E auto |
| Forward-fix PR (code-only) | T1 | Repair-E auto, through canary |
| Forward-fix PR (config) | T1-T2 | Repair-E with Review-E gate; T2 if config spans services |
| Forward-fix PR (schema change) | T2 | Repair-E proposes, human approves interface |
| Forward-fix PR (auth/payments/destructive) | T3 | Human drives, Repair-E assists |

The tier classification at intake (Spec-E) applies at fix-time (Repair-E). The policy is unified.

Metrics that mark success

The weekly self-healing dashboard:

  • Mean-time-to-detect (MTTD) — from production incident to alert firing
  • Mean-time-to-escalate (MTTE) — from alert to Discord notification
  • Mean-time-to-diagnose (MTTDiag) — from dispatch to Repair-E diagnosis committed
  • Mean-time-to-fix (MTTF) — from diagnosis to canary-promoted fix
  • Mean-time-to-resolve (MTTR) — total, from first production impact to budget restoration
  • False-positive rollback rate — canary aborts where no actual bug
  • Fix-survives-24h rate — of auto-fixed incidents, % that don't revert within 24h
  • Auto-resolve rate — % of incidents resolved without human intervention
  • Human-override rate — % of Repair-E proposals humans rejected or modified
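
A minimal sketch of how these durations fall out of the incident event timeline, assuming Python and per-incident event timestamps; the event names are assumptions mapped to the stages above:

```python
# healing_metrics.py - per-incident durations feeding the weekly dashboard (sketch).
from datetime import datetime
from statistics import mean

def durations(incident: dict[str, datetime]) -> dict[str, float]:
    # Expected keys (assumed event names): impact_start, alert_fired, discord_notified,
    # repair_e_dispatched, diagnosis_committed, fix_promoted, budget_restored.
    def secs(a: str, b: str) -> float:
        return (incident[b] - incident[a]).total_seconds()
    return {
        "ttd": secs("impact_start", "alert_fired"),
        "tte": secs("alert_fired", "discord_notified"),
        "ttdiag": secs("repair_e_dispatched", "diagnosis_committed"),
        "ttf": secs("diagnosis_committed", "fix_promoted"),
        "ttr": secs("impact_start", "budget_restored"),
    }

def weekly_means(incidents: list[dict[str, datetime]]) -> dict[str, float]:
    per_incident = [durations(i) for i in incidents]
    if not per_incident:
        return {}
    return {k: mean(d[k] for d in per_incident) for k in per_incident[0]}
```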

Target values

For the trusted rig (end state, not today):

  • MTTD: < 1 minute (synthetic probe or burn-rate alert)
  • MTTE: < 30 seconds (flagd kill switch flipped via git-commit-to-reconcile)
  • MTTDiag: < 5 minutes for T1 bugs with clear trace-to-commit correlation
  • MTTR: < 15 minutes for T1 bugs; < 1 hour for T2 requiring human approval
  • Fix-survives-24h: > 80%
  • Auto-resolve rate: > 60% of T1 incidents

These are aggressive but consistent with published pilot data from Stripe, Cursor, and Datadog Bits AI SRE.

The honest limits

  • Zero downtime is aspirational, not absolute. Under catastrophic failure (full cluster outage, Postgres corruption), human intervention is mandatory.
  • T3 incidents do not self-heal. Auth bugs, payment bugs, and destructive data issues require human decision.
  • Novel bugs are slower. Without prior-incident patterns, Repair-E's confidence is low, and humans drive.
  • Reproduction harness coverage is finite. If the bug only manifests under specific load or data, the sandbox may not reproduce it, and auto-fix cannot proceed.
  • The feedback loop takes time. An auto-fix that "works" in canary but regresses two days later is only caught after the fact by ongoing SLO monitoring and survival tracking; during those two days it is not visible as a failure.

What NOT to do

  • No emergency fast path. Even when SLO is burning, every change flows through the same gated pipeline. Cloudflare Dec 2025 is the lesson.
  • No skipping canary for "obvious" fixes. The obvious-fix-that-breaks-everything is a documented failure class.
  • No LLM-judged automatic promotion. Promotion is SLO-gated by Prometheus analysis, not LLM-reviewed. Deterministic gate.
  • No auto-fix on T3. Never. Humans drive.
  • No silent rollback. Every rollback emits events, updates dashboards, opens a post-mortem issue.
  • No persistent staging environment that diverges from prod. Reproduction harness is ephemeral, created from recent prod state, destroyed after incident. Long-lived staging drifts.

Phase-by-phase exit criteria

Tied to the roadmap in index.md:

Phase 5 (self-healing) exit criteria:

- [ ] Flagger canary operates on every production service
- [ ] flagd feature flag sidecars injected via OpenFeature Operator
- [ ] pgroll gates every DB migration; non-pgroll migrations rejected by CI
- [ ] Error-budget projection live in Conductor-E with per-service breakdown
- [ ] SLO burn-rate alerts route through Conductor-E to Discord with severity routing
- [ ] Repair-E dispatches on P1 alerts, logs diagnosis with attestation
- [ ] Reproduction harness ephemeral-namespace pattern works for at least one service end-to-end
- [ ] Kill-switch latency measured < 60s from commit to pod-observed-change
- [ ] 24h fix-survival rate measured on dashboard
- [ ] Documented runbook for when self-healing fails (on-call procedure)

Only when every checkbox is checked does the phase close.

See also