Quality and Evaluation — Nightly Harness, SWE-bench Pro, Property Tests, DORA Metrics¶
TL;DR
Quality is measured, not asserted. Nightly eval harness runs SWE-bench Pro subset + internal golden suite + regression cases (budget-sensitive — see caveat below). Property-based testing (Hypothesis) on labeled or high-risk changes, not every PR — the original "every non-trivial change" policy was too expensive. DORA metrics adapted to agents. The measurements feed directly into autonomy-tier promotion, prompt-change regression gates, and model-upgrade reset policy.
"The agents are doing well" is not evidence; a dashboard line is.
Quality signals feeding tier promotion¶
graph LR
classDef sig fill:#e3f2fd,color:#000
classDef gate fill:#fff3e0,color:#000
classDef out fill:#e8f5e9,color:#000
S1[Nightly SWE-bench Pro<br/>30-task subset]:::sig
S2[Internal golden suite<br/>10 tasks]:::sig
S3[Regression cases<br/>per-incident]:::sig
S4[Property-based tests<br/>label-gated]:::sig
S5[LLM-as-judge sampling<br/>T2 100%, T1 10%]:::sig
S6[DORA metrics<br/>CFR, lead time, MTTR]:::sig
S1 & S2 & S3 --> G1[Weekly dashboard<br/>regression gate]:::gate
S4 --> G2[Per-PR gate]:::gate
S5 --> G3[Disagreement flag]:::gate
S6 --> G4[Tier promotion projection]:::gate
G1 & G2 & G3 & G4 --> O[Autonomy tier<br/>raise / hold / demote]:::out
What quality means¶
A trusted rig's output passes five tests:
- It compiles / type-checks / lints clean. Baseline; non-negotiable.
- Tests pass. Unit tests, integration tests, and property-based tests.
- It preserves semantic invariants. A different-model LLM-as-judge agrees the diff matches the TaskSpec intent.
- It survives the canary gate. Production metrics don't regress.
- It survives production for 24+ hours without rollback. The long-tail check.
Code that passes #1 + #2 but fails #3 is the "works but subtly wrong" signal. Code that passes #1-4 but fails #5 is a measurement failure — our canary or tests didn't catch something. Both are tracked.
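The taxonomy above can be sketched as a small classifier. The check names and the `classify` helper are illustrative, not part of the rig's actual code:

```python
# Map the five pass/fail signals onto the tracked failure classes.
# Check names are illustrative shorthand for tests #1-#5 above.
CHECKS = ["static", "tests", "semantic", "canary", "production_24h"]

def classify(results: dict[str, bool]) -> str:
    """Classify a change's outcome from its five quality signals."""
    if all(results[c] for c in CHECKS):
        return "trusted"
    if results["static"] and results["tests"] and not results["semantic"]:
        return "subtly_wrong"          # works but violates stated intent
    if all(results[c] for c in CHECKS[:4]) and not results["production_24h"]:
        return "measurement_failure"   # canary/tests missed something
    return "hard_failure"              # failed a baseline gate

print(classify({"static": True, "tests": True, "semantic": False,
                "canary": True, "production_24h": True}))  # subtly_wrong
```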
The Stanford/NIST AI Agent Standards¶
The February 2026 Stanford/NIST AI Agent Standards consolidate four dimensions:
| Dimension | Definition | Target for trusted rig |
|---|---|---|
| Goal accuracy | % of dispatched tasks ending in the intended outcome (merged PR without human rework) | >85% for T1, >75% for T2 |
| Hallucination rate | % of outputs containing fabricated content (hallucinated APIs, invalid citations, nonexistent files) | <2% |
| Token efficiency | Cost per successful goal completion | Decreasing week-over-week, reviewed weekly |
| Change Failure Rate | % of merged PRs requiring rollback or hotfix within 7d | <5% |
Plus two rig-specific metrics:
| Metric | Definition | Target |
|---|---|---|
| Rework rate | % of commits added to a PR after initial draft, excluding Review-E-requested changes | <10% |
| Refusal accuracy | % of "unanswerable" tasks correctly escalated rather than fabricated | >95% |
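As a minimal sketch, the goal-accuracy metric above can be computed from per-task records. The `TaskRecord` field names and outcome labels are assumptions, not the rig's real schema:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    tier: str     # "T1", "T2", ...
    outcome: str  # e.g. "merged_clean", "merged_reworked", "escalated"

def goal_accuracy(records: list[TaskRecord], tier: str) -> float:
    """Share of tasks at this tier that merged without human rework."""
    pool = [r for r in records if r.tier == tier]
    hits = sum(r.outcome == "merged_clean" for r in pool)
    return hits / len(pool) if pool else 0.0

records = [TaskRecord("T1", "merged_clean")] * 9 + [TaskRecord("T1", "merged_reworked")]
print(f"{goal_accuracy(records, 'T1'):.0%}")  # 90% -> above the >85% T1 target
```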
The evaluation harness — split cadence¶
Earlier drafts had the cost math wrong
An earlier draft proposed "nightly, ~50 tasks, $20-40/night" framed as "5-10% of direct production spend." That ratio is only true if total LLM spend is >$100k/year. For a 1-2 person rig with total annual LLM spend of ~$10-30k, $20-40 × 365 = $7.3-14.6k/year, which is 25-75% of total spend — unsustainable. Corrected to a split cadence below.
Split-cadence target setup¶
Two scheduled K8s Jobs, not one:
Nightly (lightweight) — the regression gate
- Checks out rig-gitops at current main
- Runs agents against the golden suite (10 tasks) + accumulated regression cases
- Uploads results to Langfuse
- Posts Grafana dashboard update
- Fails the pipeline (emits alert) if regression > 10% on any metric
Approximate cost: ~$3-8/night × 365 = $1.1-2.9k/year. Runs fast (~30-60 min wall-clock), catches actual regressions in our own task set.
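The nightly gate's pass/fail logic can be sketched as a per-cohort comparison against a stored baseline; the cohort names and baseline format are illustrative assumptions:

```python
REGRESSION_TOLERANCE = 0.10  # matches the ">10% on any metric" rule above

def regressions(baseline: dict[str, float], tonight: dict[str, float]) -> list[str]:
    """Cohorts whose pass rate dropped more than 10% relative to baseline."""
    return [
        cohort for cohort, base in baseline.items()
        if base > 0 and (base - tonight.get(cohort, 0.0)) / base > REGRESSION_TOLERANCE
    ]

baseline = {"golden_suite": 0.90, "regression_cases": 1.00}
tonight  = {"golden_suite": 0.70, "regression_cases": 1.00}
bad = regressions(baseline, tonight)
print(("FAIL: " + ", ".join(bad)) if bad else "PASS")  # FAIL: golden_suite
```

A real job would exit non-zero on `FAIL`, which is what fails the pipeline and fires the alert.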
Weekly (benchmark) — the trend line
- Runs agents against SWE-bench Pro 30-task subset (the contamination-resistant benchmark)
- Uploads to Langfuse
- Updates weekly trend in the dashboard
- No CI pipeline failure — this is a trend, not a gate
Approximate cost: ~$20-40/week × 52 = $1.0-2.1k/year. Runs overnight once a week, ~8-hour wall-clock.
Total evaluation budget¶
Combined: ~$2.1-5.0k/year. Roughly 15-20% of a $10-30k small-rig LLM budget — expensive but sustainable, not the "25-75%" a nightly-everything design would cost. Budgeted explicitly in the cost framework.
The evaluation suite (same cohorts, different cadences)¶
| Cohort | Size | Cadence | Purpose |
|---|---|---|---|
| Internal golden suite | 10 tasks | Nightly | Catches regressions in our own task distribution |
| Regression cases | N (grows per incident) | Nightly | Prevents re-introducing past bugs |
| SWE-bench Pro subset | 30 tasks | Weekly | Trend vs. general benchmark, compares to published numbers |
| LiveCodeBench | 50-task subset | Quarterly | Contamination-resistant secondary signal |
SWE-bench Verified is contaminated (late 2025+)
Top models across vendors are within 1 point of each other on Verified (Anthropic Opus 4.6: 80.8%, Anthropic Sonnet 4.6: 79.6%, Google Gemini 3.1 Pro: 80.6%, OpenAI GPT-5.2: 80.0%) — the benchmark no longer discriminates. The fact that four-vendor numbers cluster this tightly is itself evidence for the portability thesis in provider-portability.md. SWE-bench Pro drops the same models to 46–57%. We use Pro, not Verified. LiveCodeBench is contamination-resistant but measures raw model quality, not agent-scaffolding quality — we run it quarterly as a secondary signal.
Eval pipeline¶
sequenceDiagram
participant S as Scheduled Job
participant G as Git checkout
participant A as Agent
participant V as Verifier
participant L as Langfuse
participant GR as Grafana
participant AL as Alertmanager
S->>G: Pull main + agent HelmRelease
S->>A: For each task, dispatch
A->>A: Runs task (claims, commits, PR)
A->>V: Report outcome
V->>V: Run tests, lint, type-check
V->>V: Run property tests (Hypothesis)
V->>V: LLM-as-judge semantic check
V->>L: Upload per-task result
S->>GR: Update nightly dashboard
alt regression detected
S->>AL: Fire QualityRegressionAlert
end
Dashboard¶
The nightly dashboard shows:
- Pass rate per cohort per agent (line, 30d)
- Tokens per successful task per agent (line, 30d)
- Wall-clock per successful task per agent (line, 30d)
- Cost per successful task per agent (line, 30d)
- Regression count week-over-week (bar)
- New-regression-case adds per week (bar)
Alerts: >10% regression in any cohort triggers P2 (per-issue thread); >25% triggers P1 (#admin).
Property-based testing¶
From arXiv:2510.09907 (October 2025): LLM-generated property tests find bugs beyond unit-test coverage. Originally the whitepaper proposed running this on every non-trivial agent-authored change. Honest re-evaluation: that is too expensive for our scale — one extra LLM invocation per PR plus CI runtime per PR. Property tests shine for algorithmic code with real invariants, not routine CRUD features.
Revised gating¶
A subagent runs the property-test generator only when a change is explicitly marked or matches high-risk heuristics:
- PR has the property-tests label (explicit author opt-in)
- File touched is in an allowlist (e.g., src/core/**, projections/**, migration scripts)
- Change is a fix for a production bug (regression insurance — always runs)
- Change adds a new pure function (detected by AST: no mutation, no I/O)
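The gating rule above can be sketched as a single predicate. The path globs, label name, and the `is_bugfix` / `adds_pure_function` flags are illustrative inputs; a real implementation would derive the last one from an AST pass:

```python
from fnmatch import fnmatch

ALLOWLIST = ["src/core/**", "projections/**", "migrations/*"]  # illustrative globs

def should_generate_property_tests(labels, touched_files,
                                   is_bugfix, adds_pure_function) -> bool:
    if is_bugfix:                   # regression insurance: always runs
        return True
    if "property-tests" in labels:  # explicit author opt-in
        return True
    if any(fnmatch(f, pat) for f in touched_files for pat in ALLOWLIST):
        return True
    return adds_pure_function       # AST-detected new pure function

print(should_generate_property_tests([], ["docs/readme.md"], False, False))  # False
```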
When it does run, the subagent prompt:
Your task: read the diff, identify invariants, write 5-10 Hypothesis property tests.
Run them. Report any failures. If all pass, write them as permanent regression tests into the repo.
Hypothesis runs bounded (default 100 examples per property, 60s max).
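A generated property test under that bound might look like the following. The sort invariants are an illustrative target, not from the rig; the `settings` decorator is how Hypothesis caps examples and applies a per-example deadline:

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=100, deadline=1000)  # bounded, per the policy above
@given(st.lists(st.integers()))
def test_sort_invariants(xs):
    out = sorted(xs)
    assert len(out) == len(xs)                        # permutation preserves length
    assert all(a <= b for a, b in zip(out, out[1:]))  # result is ordered
    assert sorted(out) == out                         # idempotent

test_sort_invariants()  # runs the bounded example sweep
```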
Trivial changes skip the phase¶
Renames, import reordering, comment updates, doc-only PRs. Most T0 and T1 changes fall here.
Integration with CI¶
Hypothesis tests run in CI alongside regular tests. Failures block merge. The same sanity limit applies (100 examples per property, 60s max) so agent-generated tests don't come to dominate CI runtime.
Adoption trajectory¶
Phase 1: Property tests generated, run locally, reported but not enforced. Collect data on false-positive rate.
Phase 2: Property tests enforced for new files (not yet for legacy files). Lower-risk rollout.
Phase 3: Property tests enforced repo-wide.
DORA metrics adapted to agents¶
DORA (deployment frequency, lead time, MTTR, change-failure rate) adapts directly:
| DORA metric | Agent equivalent | Measured via |
|---|---|---|
| Deployment frequency | PRs merged per week per agent | GitHub API |
| Lead time | Issue-created to PR-merged | Conductor-E event log |
| MTTR | Incident-detected to SLO-restored | Self-healing pipeline |
| Change failure rate | % of merged PRs requiring rollback in 7d | Rollback events ∩ PR list |
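The change-failure-rate cell above (rollback events ∩ PR list) can be sketched directly; the event shapes here are illustrative:

```python
from datetime import datetime, timedelta

def change_failure_rate(merged_prs: dict, rollbacks: list,
                        window=timedelta(days=7)) -> float:
    """merged_prs: {pr_id: merge_time}; rollbacks: [(pr_id, rollback_time)].
    A PR counts as failed if rolled back within `window` of its merge."""
    failed = {
        pr for pr, t in rollbacks
        if pr in merged_prs and timedelta(0) <= t - merged_prs[pr] <= window
    }
    return len(failed) / len(merged_prs) if merged_prs else 0.0

t0 = datetime(2026, 3, 1)
merged = {101: t0, 102: t0, 103: t0, 104: t0}
rolled = [(102, t0 + timedelta(days=2)), (999, t0)]  # 999 never merged: ignored
print(f"CFR: {change_failure_rate(merged, rolled):.0%}")  # CFR: 25%
```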
Target ranges (Google's DORA 2025 "Elite" criteria, adapted):
- Deployment frequency: multiple per day per active agent
- Lead time: < 1 hour for T0, < 1 day for T1, < 1 week for T2
- MTTR: < 1 hour
- Change failure rate: < 15% (Elite), < 5% (our aspirational)
LLM-as-judge for semantic quality¶
LLM-as-judge is useful for one thing: detecting that agent-authored code matches the stated intent. It's not a replacement for execution-based verification.
Pattern (default: a bigger / cross-family model judges the implementer's diff — e.g., Opus 4.7 judging Sonnet 4.6 output, or GPT-5.2 judging Sonnet 4.6 output as a cross-family check; configurable per provider-portability.md):
Judge model reviews implementer's diff:
Given TaskSpec.acceptance_criteria:
- Criterion 1: ...
- Criterion 2: ...
And the diff:
---
[diff content]
---
Does the diff satisfy each acceptance criterion?
Output: JSON { criterion_1_met: bool, criterion_2_met: bool, reasoning: string, overall_confidence: float }
Applied to:
- Every merged PR (sampled 10%)
- Every T2 PR (100% sampled, blocking on disagreement)
- Every Repair-E auto-fix (100% sampled)
Disagreements between Review-E and the judge are flagged for human review. Over time, judge-human disagreement rate is itself a metric (quality of Review-E's judgment).
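The disagreement flag can be sketched as a comparison between the parsed judge verdict and Review-E's approval. Field names mirror the JSON schema above; the confidence floor is an illustrative choice:

```python
import json

def flag_disagreement(judge_json: str, review_e_approved: bool,
                      confidence_floor: float = 0.6) -> bool:
    """True when a confident judge verdict contradicts Review-E."""
    verdict = json.loads(judge_json)
    criteria_met = all(v for k, v in verdict.items() if k.endswith("_met"))
    confident = verdict["overall_confidence"] >= confidence_floor
    return confident and (criteria_met != review_e_approved)

raw = ('{"criterion_1_met": true, "criterion_2_met": false, '
       '"reasoning": "...", "overall_confidence": 0.9}')
print(flag_disagreement(raw, review_e_approved=True))  # True -> human review
```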
Prompt regression testing¶
When an agent's prompt changes, a CI job runs:
- Replay a golden suite of 20 prior tasks with the new prompt
- Compare outcomes against the old-prompt baseline
- Fail the PR if any golden task regresses beyond tolerance (e.g., pass → fail)
Golden suite is small enough to run in CI (~5 minutes, ~$5). Captures the "I tweaked the prompt to fix X but it broke Y" failure mode.
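The comparison step can be sketched as a per-task diff against the old-prompt baseline; task IDs and the zero-tolerance pass→fail rule shown here are illustrative:

```python
def regressed_tasks(baseline: dict[str, bool], replay: dict[str, bool]) -> list[str]:
    """Golden tasks that passed with the old prompt but fail with the new one."""
    return [t for t, passed in baseline.items() if passed and not replay.get(t, False)]

baseline = {"task-01": True, "task-02": True, "task-03": False}
replay   = {"task-01": True, "task-02": False, "task-03": True}
bad = regressed_tasks(baseline, replay)
print(f"{'FAIL' if bad else 'PASS'}: {bad}")  # FAIL: ['task-02']
```

Note that task-03 improving (fail → pass) does not offset task-02's regression; any pass → fail flip fails the PR.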
Braintrust's distinguishing pattern: every production trace → one-click convert to eval case. Our lighter version: a weekly script grep-searches Langfuse for traces Review-E flagged as poor-quality, suggests them as golden-suite additions, human approves.
Integration with autonomy tiers¶
Quality metrics drive autonomy promotion (trust-model.md). Concretely:
- T0 → T1 promotion (for a task class): requires goal_accuracy > 85% over 20 most recent T0 runs of that class, zero rollbacks
- T1 → T2 promotion: requires goal_accuracy > 85% over 20 most recent T1 runs, zero canary aborts, zero SLO-budget depletions
- Demotion: any rollback attributable to agent's work on that class → immediate demotion
Measurable. Automatic. Audit-trailed.
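The rules above reduce to a pure decision function. The record shape and return labels are illustrative, not the real promotion API:

```python
def promotion_decision(recent_runs: list[dict], rollbacks: int,
                       canary_aborts: int = 0, slo_depletions: int = 0) -> str:
    """recent_runs: the most recent runs of this task class, each {"goal_met": bool}.
    Returns 'raise', 'hold', or 'demote' per the rules above."""
    if rollbacks > 0:
        return "demote"   # any attributable rollback demotes immediately
    if len(recent_runs) < 20:
        return "hold"     # not enough evidence yet
    accuracy = sum(r["goal_met"] for r in recent_runs) / len(recent_runs)
    if accuracy > 0.85 and canary_aborts == 0 and slo_depletions == 0:
        return "raise"
    return "hold"

runs = [{"goal_met": True}] * 18 + [{"goal_met": False}] * 2  # 90% accuracy
print(promotion_decision(runs, rollbacks=0))  # raise
```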
The "subtly wrong" signal¶
Tests pass + lint passes + types check + canary analysis passes → but the code is subtly wrong. Known failure class.
Detection signals:
- LLM-as-judge disagrees with Review-E — semantic check flag
- Property tests find failures — bug was in an invariant not checked by unit tests
- Increased bug-report rate on the affected code — signal from production monitoring
- Similar-pattern rollback — incidents tagged with code location + pattern matcher
Each of these is a separate metric. Correlations between them refine the detection.
What we consciously don't measure¶
- Subjective code quality scores (complexity heuristics, smell counts, architectural purity) — too noisy, game-able, doesn't correlate with production outcomes.
- Absolute speed of agents — speed is only meaningful relative to task difficulty; we measure throughput instead (successful task rate per unit time).
- "Agent happiness" metrics — anthropomorphizing leads to misplaced priorities.
- Per-commit AST diffs — the diff-is-correct check is handled by tests + canary; adding AST analysis is operational overhead for marginal signal.
Continuous vs. scheduled¶
- Continuous (per-PR): test pass, lint, type-check, property tests, LLM-as-judge on T2/T3
- Scheduled (nightly): SWE-bench Pro subset, internal golden suite, regression cases
- Scheduled (weekly): DORA aggregates, autonomy-tier review, model drift canary
- Scheduled (quarterly): LiveCodeBench, dashboard audit, eval-case curation
Evaluation runs are attested too¶
Each eval run produces:
- A signed attestation (Sigstore) binding the run to the specific agent-config-commit × model-version
- A Langfuse trace with all per-task outcomes
- A dashboard update
If someone argues "the promotion was unfair," the attestation + traces are replayable proof of exactly what was evaluated and what the score was.
Cost of quality — honest numbers at small-rig scale¶
Quality measurement itself costs tokens. Numbers assume default Anthropic routing at current Sonnet 4.6 / Opus 4.7 pricing; same shape applies under OpenAI or Gemini routing with shifted dollar amounts — see provider-portability.md.
At our scale (1-2 person rig, ~$10-30k/year total LLM spend):
- Nightly golden-suite + regression eval: ~$3-8 × 365 = $1.1-2.9k/year
- Weekly SWE-bench Pro subset (30 tasks): ~$20-40 × 52 = $1.0-2.1k/year
- Quarterly LiveCodeBench subset: ~$80 × 4 = $320/year
- Per-PR LLM-as-judge sampling (10% T0, 100% T2): small, ~$0.10-$1 per sample
- Property-test generation (label-gated, not every PR): ~$5 per non-trivial change
- Prompt regression CI: ~$5 per prompt change
Combined quality-measurement budget: ~$2.5-5.5k/year, roughly 15-20% of a small-rig total LLM budget. Significant but sustainable.
This is a floor, not a ceiling
Skimping on measurement erodes trust, which was the entire point. If total LLM spend is tight, cut the quarterly LiveCodeBench first, then the weekly SWE-bench Pro (to biweekly). Never cut the nightly regression gate — that's the cheapest check and the one preventing known-incident re-introductions.
At larger scale
Once total LLM spend crosses ~$60k/year, the same quality-measurement budget is 5-10% of spend — the ratio that earlier drafts incorrectly claimed for all scales. The absolute dollar cost scales roughly linearly with per-provider pricing; the percentage shifts with total spend. Review this budget quarterly against realized spend.
The quality dashboard (public)¶
Every human on the rig can see:
- Per-agent goal accuracy (30d trend)
- Per-agent cost-per-successful-task (30d trend)
- Per-agent change failure rate (30d)
- Rollback rate (7d, 30d)
- New regression cases added this week
- Open quality-regression alerts
Transparent. No hiding bad numbers.
When quality metrics disagree¶
Concrete case: LLM-as-judge disagrees with Review-E on a merged PR. Resolution:
- Both opinions captured as events
- Sampling review by human (weekly)
- If judge right, Review-E gets a "training" attestation; if Review-E right, judge metric adjusted
- Disagreements themselves are a tracked metric (ideally decreasing over time)
The meta-rule: disagreements between quality signals are data, not noise. They refine each other.
See also¶
- index.md
- principles.md — principle 1 (measurable) enforced here
- trust-model.md — how quality scores drive tier promotion
- observability.md — where quality metrics live
- drift-detection.md — how quality tracks model/prompt drift
- cost-framework.md — the cost of quality measurement itself