
Quality and Evaluation — Nightly Harness, SWE-bench Pro, Property Tests, DORA Metrics

TL;DR

Quality is measured, not asserted. A split-cadence eval harness runs the internal golden suite + regression cases nightly and a SWE-bench Pro subset weekly (budget-sensitive — see caveat below). Property-based testing (Hypothesis) on labeled or high-risk changes, not every PR — the original "every non-trivial change" policy was too expensive. DORA metrics adapted to agents. The measurements feed directly into autonomy-tier promotion, prompt-change regression gates, and model-upgrade reset policy.

"The agents are doing well" is not evidence; a dashboard line is.

Quality signals feeding tier promotion

graph LR
    classDef sig fill:#e3f2fd,color:#000
    classDef gate fill:#fff3e0,color:#000
    classDef out fill:#e8f5e9,color:#000

    S1[Nightly SWE-bench Pro<br/>30-task subset]:::sig
    S2[Internal golden suite<br/>10 tasks]:::sig
    S3[Regression cases<br/>per-incident]:::sig
    S4[Property-based tests<br/>label-gated]:::sig
    S5[LLM-as-judge sampling<br/>T2 100%, T1 10%]:::sig
    S6[DORA metrics<br/>CFR, lead time, MTTR]:::sig

    S1 & S2 & S3 --> G1[Weekly dashboard<br/>regression gate]:::gate
    S4 --> G2[Per-PR gate]:::gate
    S5 --> G3[Disagreement flag]:::gate
    S6 --> G4[Tier promotion projection]:::gate

    G1 & G2 & G3 & G4 --> O[Autonomy tier<br/>raise / hold / demote]:::out

What quality means

A trusted rig's output passes five tests:

  1. It compiles / type-checks / lints clean. Baseline; non-negotiable.
  2. Tests pass. Unit tests, integration tests, and property-based tests.
  3. It preserves semantic invariants. A different-model LLM-as-judge agrees the diff matches the TaskSpec intent.
  4. It survives the canary gate. Production metrics don't regress.
  5. It survives production for 24+ hours without rollback. The long-tail check.

Code that passes #1 + #2 but fails #3 is the "works but subtly wrong" signal. Code that passes #1-4 but fails #5 is a measurement failure — our canary or tests didn't catch something. Both are tracked.
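
A minimal sketch of how these five checks and the two named failure signatures could be recorded per PR; the field names are illustrative, not the rig's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QualityVerdict:
    """One record per agent-authored PR; field names are illustrative."""
    compiles_clean: bool   # 1. compiles / type-checks / lints clean
    tests_pass: bool       # 2. unit + integration + property tests pass
    judge_agrees: bool     # 3. LLM-as-judge confirms TaskSpec intent
    canary_ok: bool        # 4. canary gate shows no metric regression
    survived_24h: bool     # 5. no rollback within 24 hours of deploy

    def classify(self) -> str:
        if self.compiles_clean and self.tests_pass and not self.judge_agrees:
            return "subtly-wrong"          # passes 1-2, fails 3
        if (self.compiles_clean and self.tests_pass and self.judge_agrees
                and self.canary_ok and not self.survived_24h):
            return "measurement-failure"   # passes 1-4, fails 5
        all_pass = (self.compiles_clean and self.tests_pass and self.judge_agrees
                    and self.canary_ok and self.survived_24h)
        return "trusted" if all_pass else "rejected-earlier"
```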

The Stanford/NIST AI Agent Standards

The February 2026 Stanford/NIST AI Agent Standards consolidate four dimensions:

| Dimension | Definition | Target for trusted rig |
|---|---|---|
| Goal accuracy | % of dispatched tasks ending in the intended outcome (merged PR without human rework) | >85% for T1, >75% for T2 |
| Hallucination rate | % of outputs containing fabricated content (hallucinated APIs, invalid citations, nonexistent files) | <2% |
| Token efficiency | Cost per successful goal completion | Decreasing week-over-week |
| Change failure rate | % of merged PRs requiring rollback or hotfix within 7 days | <5% |

Plus two rig-specific metrics:

| Metric | Definition | Target |
|---|---|---|
| Rework rate | % of commits added to a PR after initial draft, excluding Review-E-requested changes | <10% |
| Refusal accuracy | % of "unanswerable" tasks correctly escalated rather than fabricated | >95% |
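
For concreteness, a hedged sketch of how two of these metrics could be computed from per-task records; the field names (`dispatched`, `merged_without_rework`, `unanswerable`, `escalated`) are assumptions, not the rig's actual event schema:

```python
def goal_accuracy(tasks: list[dict]) -> float:
    """% of dispatched tasks ending in a merged PR with no human rework."""
    dispatched = [t for t in tasks if t["dispatched"]]
    if not dispatched:
        return 0.0
    good = sum(1 for t in dispatched if t["merged_without_rework"])
    return 100.0 * good / len(dispatched)


def refusal_accuracy(tasks: list[dict]) -> float:
    """% of unanswerable tasks correctly escalated rather than fabricated."""
    unanswerable = [t for t in tasks if t["unanswerable"]]
    if not unanswerable:
        return 100.0  # nothing to refuse; treat as perfect by convention
    escalated = sum(1 for t in unanswerable if t["escalated"])
    return 100.0 * escalated / len(unanswerable)
```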

The evaluation harness — split cadence

Earlier drafts had the cost math wrong

An earlier draft proposed "nightly, ~50 tasks, $20-40/night" framed as "5-10% of direct production spend." That ratio is only true if total LLM spend is >$100k/year. For a 1-2 person rig with total annual LLM spend of ~$10-30k, $20-40 × 365 = $7.3-14.6k/year, which is 25-75% of total spend — unsustainable. Corrected to a split cadence below.

Split-cadence target setup

Two scheduled K8s Jobs, not one:

Nightly (lightweight) — the regression gate

  1. Checks out rig-gitops at current main
  2. Runs agents against the golden suite (10 tasks) + accumulated regression cases
  3. Uploads results to Langfuse
  4. Posts Grafana dashboard update
  5. Fails the pipeline (emits alert) if regression > 10% on any metric

Approximate cost: ~$3-8/night × 365 = $1.1-2.9k/year. Runs fast (~30-60 min wall-clock), catches actual regressions in our own task set.
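
A minimal sketch of the step-5 regression gate, assuming per-cohort pass rates are written to simple JSON files (file names and format are illustrative):

```python
import json
import sys

REGRESSION_THRESHOLD = 0.10  # fail the pipeline on a >10% relative drop

def check_regression(baseline_path: str, tonight_path: str) -> int:
    """Compare tonight's per-cohort pass rates against the stored baseline.

    Assumed file format: {"golden_suite": 0.90, "regression_cases": 1.0}.
    Returns a nonzero exit code so the scheduled Job (and pipeline) fails.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(tonight_path) as f:
        tonight = json.load(f)

    regressions = []
    for cohort, base_rate in baseline.items():
        rate = tonight.get(cohort, 0.0)
        if base_rate > 0 and (base_rate - rate) / base_rate > REGRESSION_THRESHOLD:
            regressions.append(f"{cohort}: {base_rate:.2f} -> {rate:.2f}")

    if regressions:
        print("QualityRegressionAlert:", "; ".join(regressions))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_regression("baseline.json", "tonight.json"))
```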

Weekly (benchmark) — the trend line

  1. Runs agents against SWE-bench Pro 30-task subset (the contamination-resistant benchmark)
  2. Uploads to Langfuse
  3. Updates weekly trend in the dashboard
  4. No CI pipeline failure — this is a trend, not a gate

Approximate cost: ~$20-40/week × 52 = $1.0-2.1k/year. Runs overnight once a week, ~8-hour wall-clock.

Total evaluation budget

Combined: ~$2.1-5.0k/year. Roughly 15-20% of a $10-30k small-rig LLM budget — expensive but sustainable, not the "25-75%" a nightly-everything design would cost. Budgeted explicitly in the cost framework.

The evaluation suite (same cohorts, different cadences)

| Cohort | Size | Cadence | Purpose |
|---|---|---|---|
| Internal golden suite | 10 tasks | Nightly | Catches regressions in our own task distribution |
| Regression cases | N (grows per incident) | Nightly | Prevents re-introducing past bugs |
| SWE-bench Pro subset | 30 tasks | Weekly | Trend vs. general benchmark; compares to published numbers |
| LiveCodeBench subset | 50 tasks | Quarterly | Contamination-resistant secondary signal |

SWE-bench Verified is contaminated (late 2025+)

Top models across vendors now cluster within about a point of each other on Verified (Anthropic Opus 4.6: 80.8%, Anthropic Sonnet 4.6: 79.6%, Google Gemini 3.1 Pro: 80.6%, OpenAI GPT-5.2: 80.0%) — the benchmark no longer discriminates. That the four vendors' numbers cluster this tightly is itself evidence for the portability thesis in provider-portability.md. SWE-bench Pro drops the same models to 46–57%. We use Pro, not Verified. LiveCodeBench is contamination-resistant but measures raw model quality, not agent-scaffolding quality — we run it quarterly as a secondary signal.

Eval pipeline

sequenceDiagram
    participant S as Scheduled Job
    participant G as Git checkout
    participant A as Agent
    participant V as Verifier
    participant L as Langfuse
    participant GR as Grafana
    participant AL as Alertmanager

    S->>G: Pull main + agent HelmRelease
    S->>A: For each task, dispatch
    A->>A: Runs task (claims, commits, PR)
    A->>V: Report outcome
    V->>V: Run tests, lint, type-check
    V->>V: Run property tests (Hypothesis)
    V->>V: LLM-as-judge semantic check
    V->>L: Upload per-task result
    S->>GR: Update nightly dashboard
    alt regression detected
        S->>AL: Fire QualityRegressionAlert
    end

Dashboard

The nightly dashboard shows:

  • Pass rate per cohort per agent (line, 30d)
  • Tokens per successful task per agent (line, 30d)
  • Wall-clock per successful task per agent (line, 30d)
  • Cost per successful task per agent (line, 30d)
  • Regression count week-over-week (bar)
  • New-regression-case adds per week (bar)

Alerts: >10% regression in any cohort triggers P2 (per-issue thread); >25% triggers P1 (#admin).

Property-based testing

From arXiv:2510.09907 (October 2025): LLM-generated property tests find bugs beyond unit-test coverage. Originally the whitepaper proposed running this on every non-trivial agent-authored change. Honest re-evaluation: that is too expensive for our scale — one extra LLM invocation per PR plus CI runtime per PR. Property tests shine for algorithmic code with real invariants, not routine CRUD features.

Revised gating

A subagent runs the property-test generator only when a change is explicitly marked or matches high-risk heuristics (a minimal gating sketch follows the list):

  • PR has property-tests label (explicit author opt-in)
  • File touched is in an allowlist (e.g., src/core/**, projections/**, migration scripts)
  • Change is a fix for a production bug (regression insurance — always runs)
  • Change adds a new pure function (detected by AST: no mutation, no I/O)
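
A minimal sketch of that gating decision; the PR record fields and the allowlist patterns are assumptions for illustration:

```python
import re

# Illustrative allowlist; the real path patterns live in the rig's config.
PROPERTY_TEST_PATHS = (r"^src/core/", r"^projections/", r"^migrations/")

def should_generate_property_tests(pr: dict) -> bool:
    """Decide whether the property-test subagent runs for this PR.

    The `pr` fields (labels, files, fixes_production_bug, adds_pure_function)
    are assumptions for illustration, not the rig's actual PR schema.
    """
    if "property-tests" in pr.get("labels", []):   # explicit author opt-in
        return True
    if pr.get("fixes_production_bug"):             # regression insurance: always runs
        return True
    if pr.get("adds_pure_function"):               # AST-detected: no mutation, no I/O
        return True
    # Fall back to the path allowlist.
    return any(re.match(pattern, path)
               for pattern in PROPERTY_TEST_PATHS
               for path in pr.get("files", []))
```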

When it does run, the subagent prompt:

Your task: read the diff, identify invariants, write 5-10 Hypothesis property tests.
Run them. Report any failures. If all pass, write them as permanent regression tests into the repo.

Hypothesis runs are bounded (default 100 examples per property, 60s max).

Trivial changes skip the phase

Renames, import reordering, comment updates, doc-only PRs. Most T0 and T1 changes fall here.

Integration with CI

Hypothesis tests run in CI alongside regular tests; failures block merge. The same bounds (100 examples per property, 60s max) keep agent-generated property tests from dominating CI runtime.
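
For reference, a hedged example of what a generated property test looks like under these bounds. `normalize_tags` is a stand-in pure function, not a real rig module, and the 60s cap assumes the pytest-timeout plugin is installed:

```python
import pytest
from hypothesis import given, settings, strategies as st

def normalize_tags(tags: list[str]) -> list[str]:
    """Stand-in pure function under test: strip, lowercase, dedupe, sort."""
    return sorted({t.strip().lower() for t in tags})

@pytest.mark.timeout(60)                    # hard wall-clock cap (pytest-timeout)
@settings(max_examples=100, deadline=None)  # bounded example count per property
@given(st.lists(st.text(min_size=1, max_size=20)))
def test_normalize_tags_is_idempotent(tags):
    once = normalize_tags(tags)
    assert normalize_tags(once) == once     # applying twice changes nothing

@pytest.mark.timeout(60)
@settings(max_examples=100, deadline=None)
@given(st.lists(st.text(min_size=1, max_size=20)))
def test_normalize_tags_output_sorted_unique(tags):
    out = normalize_tags(tags)
    assert out == sorted(set(out))          # output is sorted and duplicate-free
```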

Adoption trajectory

Phase 1: Property tests generated, run locally, reported but not enforced. Collect data on false-positive rate.

Phase 2: Property tests enforced for new files (not yet for legacy files). Lower-risk rollout.

Phase 3: Property tests enforced repo-wide.

DORA metrics adapted to agents

DORA (deployment frequency, lead time, MTTR, change-failure rate) adapts directly:

| DORA metric | Agent equivalent | Measured via |
|---|---|---|
| Deployment frequency | PRs merged per week per agent | GitHub API |
| Lead time | Issue-created to PR-merged | Conductor-E event log |
| MTTR | Incident-detected to SLO-restored | Self-healing pipeline |
| Change failure rate | % of merged PRs requiring rollback within 7d | Rollback events ∩ PR list |
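
A sketch of the change-failure-rate computation (the "rollback events ∩ PR list" row); the record shapes are assumptions for illustration:

```python
from datetime import timedelta

def change_failure_rate(merged_prs: list[dict], rollbacks: list[dict]) -> float:
    """CFR: % of merged PRs needing a rollback or hotfix within 7 days.

    Assumed record shapes: each PR has `number` and `merged_at` (datetime);
    each rollback has `pr_number` and `occurred_at` (datetime).
    """
    if not merged_prs:
        return 0.0
    rollbacks_by_pr: dict[int, list] = {}
    for rb in rollbacks:
        rollbacks_by_pr.setdefault(rb["pr_number"], []).append(rb["occurred_at"])

    window = timedelta(days=7)
    failed = sum(
        1 for pr in merged_prs
        if any(timedelta(0) <= t - pr["merged_at"] <= window
               for t in rollbacks_by_pr.get(pr["number"], []))
    )
    return 100.0 * failed / len(merged_prs)
```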

Target ranges (Google's DORA 2025 "Elite" criteria, adapted):

  • Deployment frequency: multiple per day per active agent
  • Lead time: < 1 hour for T0, < 1 day for T1, < 1 week for T2
  • MTTR: < 1 hour
  • Change failure rate: < 15% (Elite), < 5% (our aspirational target)

LLM-as-judge for semantic quality

LLM-as-judge is useful for one thing: detecting that agent-authored code matches the stated intent. It's not a replacement for execution-based verification.

Pattern (default: a bigger / cross-family model judges the implementer's diff — e.g., Opus 4.7 judging Sonnet 4.6 output, or GPT-5.2 judging Sonnet 4.6 output as a cross-family check; configurable per provider-portability.md):

Judge model reviews implementer's diff:

Given TaskSpec.acceptance_criteria:
- Criterion 1: ...
- Criterion 2: ...
And the diff:
---
[diff content]
---
Does the diff satisfy each acceptance criterion?
Output: JSON { criterion_1_met: bool, criterion_2_met: bool, reasoning: string, overall_confidence: float }

Applied to:

  • Merged PRs: 10% sample
  • T2 PRs: 100%, blocking on disagreement
  • Repair-E auto-fixes: 100%

Disagreements between Review-E and the judge are flagged for human review. Over time, judge-human disagreement rate is itself a metric (quality of Review-E's judgment).
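
A minimal sketch of how the judge's JSON verdict could be parsed and compared against Review-E's decision; the confidence floor and helper name are illustrative, not fixed policy:

```python
import json

def judge_disagrees(judge_output: str, review_e_approved: bool,
                    confidence_floor: float = 0.7) -> bool:
    """Parse the judge's JSON verdict and flag disagreement with Review-E.

    The JSON shape mirrors the prompt above; `confidence_floor` is an
    illustrative threshold.
    """
    verdict = json.loads(judge_output)
    criteria_met = all(v for k, v in verdict.items() if k.endswith("_met"))
    confident = verdict.get("overall_confidence", 0.0) >= confidence_floor
    judge_approves = criteria_met and confident
    # A disagreement queues the PR for human review (and is itself a metric).
    return judge_approves != review_e_approved
```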

Prompt regression testing

When an agent's prompt changes, a CI job runs:

  1. Replay a golden suite of 20 prior tasks with the new prompt
  2. Compare outcomes against the old-prompt baseline
  3. Fail the PR if any golden task regresses beyond tolerance (e.g., pass → fail)

Golden suite is small enough to run in CI (~5 minutes, ~$5). Captures the "I tweaked the prompt to fix X but it broke Y" failure mode.
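
A hedged sketch of the compare step (steps 2-3), assuming pass/fail maps keyed by golden-task id; the names are illustrative, not the CI job's actual interface:

```python
def regressed_golden_tasks(baseline: dict[str, bool],
                           replay: dict[str, bool]) -> list[str]:
    """Golden tasks that passed with the old prompt but fail with the new one."""
    return [task for task, passed in baseline.items()
            if passed and not replay.get(task, False)]

# CI usage sketch: fail the prompt-change PR on any pass -> fail flip.
# regressions = regressed_golden_tasks(old_results, new_results)
# assert not regressions, f"Prompt change regressed golden tasks: {regressions}"
```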

Braintrust's distinguishing pattern: every production trace → one-click convert to eval case. Our lighter version: a weekly script searches Langfuse for traces Review-E flagged as poor quality, suggests them as golden-suite additions, and a human approves.

Integration with autonomy tiers

Quality metrics drive autonomy promotion (trust-model.md). Concretely:

  • T0 → T1 promotion (for a task class): requires goal_accuracy > 85% over 20 most recent T0 runs of that class, zero rollbacks
  • T1 → T2 promotion: requires goal_accuracy > 85% over 20 most recent T1 runs, zero canary aborts, zero SLO-budget depletions
  • Demotion: any rollback attributable to agent's work on that class → immediate demotion

Measurable. Automatic. Audit-trailed.
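
A minimal sketch of the T0 → T1 check for one task class under the thresholds above; the run-record fields are assumed:

```python
def tier_promotion_decision(recent_runs: list[dict], window: int = 20,
                            accuracy_threshold: float = 0.85) -> str:
    """Promotion check for one task class, per the criteria above.

    Each run record is assumed to carry `goal_met` and `rolled_back` flags;
    in the real rig a rollback triggers demotion immediately, not at review time.
    """
    if any(r["rolled_back"] for r in recent_runs):
        return "demote"
    window_runs = recent_runs[-window:]
    if len(window_runs) < window:
        return "hold"                      # not enough evidence yet
    accuracy = sum(1 for r in window_runs if r["goal_met"]) / window
    return "raise" if accuracy > accuracy_threshold else "hold"
```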

The "subtly wrong" signal

Tests pass + lint passes + types check + canary analysis passes → but the code is subtly wrong. Known failure class.

Detection signals:

  • LLM-as-judge disagrees with Review-E — semantic check flag
  • Property tests find failures — bug was in an invariant not checked by unit tests
  • Increased bug-report rate on the affected code — signal from production monitoring
  • Similar-pattern rollback — incidents tagged with code location + pattern matcher

Each of these is a separate metric. Correlations between them refine the detection.

What we consciously don't measure

  • Subjective code quality scores (complexity heuristics, smell counts, architectural purity) — too noisy, game-able, doesn't correlate with production outcomes.
  • Absolute speed of agents — speed is only meaningful relative to task difficulty; we measure throughput instead (successful task rate per unit time).
  • "Agent happiness" metrics — anthropomorphizing leads to misplaced priorities.
  • Per-commit AST diffs — the diff-is-correct check is handled by tests + canary; adding AST analysis is operational overhead for marginal signal.

Continuous vs. scheduled

  • Continuous (per-PR): test pass, lint, type-check, property tests, LLM-as-judge on T2/T3
  • Scheduled (nightly): SWE-bench Pro subset, internal golden suite, regression cases
  • Scheduled (weekly): DORA aggregates, autonomy-tier review, model drift canary
  • Scheduled (quarterly): LiveCodeBench, dashboard audit, eval-case curation

Evaluation runs are attested too

Each eval run produces:

  • A signed attestation (Sigstore) binding the run to the specific agent-config-commit × model-version
  • A Langfuse trace with all per-task outcomes
  • A dashboard update

If someone argues "the promotion was unfair," the attestation + traces are replayable proof of exactly what was evaluated and what the score was.
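
A sketch of the payload such an attestation could bind; the Sigstore signing step itself is outside this sketch and the field names are illustrative:

```python
import hashlib
import json
from datetime import datetime, timezone

def eval_attestation_payload(agent_config_commit: str, model_version: str,
                             per_task_results: dict) -> bytes:
    """Build the statement an eval run gets signed with (field names illustrative)."""
    payload = {
        "type": "rig-eval-run",
        "agent_config_commit": agent_config_commit,
        "model_version": model_version,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "results_sha256": hashlib.sha256(
            json.dumps(per_task_results, sort_keys=True).encode()
        ).hexdigest(),
    }
    return json.dumps(payload, sort_keys=True).encode()
```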

Cost of quality — honest numbers at small-rig scale

Quality measurement itself costs tokens. Numbers assume default Anthropic routing at current Sonnet 4.6 / Opus 4.7 pricing; same shape applies under OpenAI or Gemini routing with shifted dollar amounts — see provider-portability.md.

At our scale (1-2 person rig, ~$10-30k/year total LLM spend):

  • Nightly golden-suite + regression eval: ~$3-8 × 365 = $1.1-2.9k/year
  • Weekly SWE-bench Pro subset (30 tasks): ~$20-40 × 52 = $1.0-2.1k/year
  • Quarterly LiveCodeBench subset: ~$80 × 4 = $320/year
  • Per-PR LLM-as-judge sampling (10% T0, 100% T2): small, ~$0.10-$1 per sample
  • Property-test generation (label-gated, not every PR): ~$5 per non-trivial change
  • Prompt regression CI: ~$5 per prompt change

Combined quality-measurement budget: ~$2.5-5.5k/year, roughly 15-20% of a small-rig total LLM budget. Significant but sustainable.

This is a floor, not a ceiling

Skimping on measurement erodes trust, which was the entire point. If total LLM spend is tight, cut the quarterly LiveCodeBench first, then the weekly SWE-bench Pro (to biweekly). Never cut the nightly regression gate — that's the cheapest check and the one preventing known-incident re-introductions.

At larger scale

Once total LLM spend crosses ~$60k/year, the same quality-measurement budget is 5-10% of spend — the ratio that earlier drafts incorrectly claimed for all scales. The absolute dollar cost scales roughly linearly with per-provider pricing; the percentage shifts with total spend. Review this budget quarterly against realized spend.

The quality dashboard (public)

Every human on the rig can see:

  • Per-agent goal accuracy (30d trend)
  • Per-agent cost-per-successful-task (30d trend)
  • Per-agent change failure rate (30d)
  • Rollback rate (7d, 30d)
  • New regression cases added this week
  • Open quality-regression alerts

Transparent. No hiding bad numbers.

When quality metrics disagree

Concrete case: LLM-as-judge disagrees with Review-E on a merged PR. Resolution:

  1. Both opinions captured as events
  2. A human reviews a sample of disagreements weekly
  3. If the judge was right, Review-E gets a "training" attestation; if Review-E was right, the judge's accuracy metric is adjusted
  4. Disagreements themselves are a tracked metric (ideally decreasing over time)

The meta-rule: disagreements between quality signals are data, not noise. They refine each other.

See also