Quality and Evaluation — Nightly Harness, SWE-bench Pro, Property Tests, DORA Metrics¶
TL;DR
Quality is measured, not asserted. Nightly eval harness runs SWE-bench Pro subset + internal golden suite + regression cases (budget-sensitive — see caveat below). Property-based testing (Hypothesis) on labeled or high-risk changes, not every PR — the original "every non-trivial change" policy was too expensive. DORA metrics adapted to agents. The measurements feed directly into autonomy-tier promotion, prompt-change regression gates, and model-upgrade reset policy.
"The agents are doing well" is not evidence; a dashboard line is.
Quality signals feeding tier promotion¶
graph LR
classDef sig fill:#e3f2fd,color:#000
classDef gate fill:#fff3e0,color:#000
classDef out fill:#e8f5e9,color:#000
S1[Nightly SWE-bench Pro<br/>30-task subset]:::sig
S2[Internal golden suite<br/>10 tasks]:::sig
S3[Regression cases<br/>per-incident]:::sig
S4[Property-based tests<br/>label-gated]:::sig
S5[LLM-as-judge sampling<br/>T2 100%, T1 10%]:::sig
S6[DORA metrics<br/>CFR, lead time, MTTR]:::sig
S1 & S2 & S3 --> G1[Weekly dashboard<br/>regression gate]:::gate
S4 --> G2[Per-PR gate]:::gate
S5 --> G3[Disagreement flag]:::gate
S6 --> G4[Tier promotion projection]:::gate
G1 & G2 & G3 & G4 --> O[Autonomy tier<br/>raise / hold / demote]:::out
What quality means¶
A trusted rig's output passes five tests:
- It compiles / type-checks / lints clean. Baseline; non-negotiable.
- Tests pass. Unit tests, integration tests, and property-based tests.
- It preserves semantic invariants. A different-model LLM-as-judge agrees the diff matches the TaskSpec intent.
- It survives the canary gate. Production metrics don't regress.
- It survives production for 24+ hours without rollback. The long-tail check.
Code that passes #1 + #2 but fails #3 is the "works but subtly wrong" signal. Code that passes #1-4 but fails #5 is a measurement failure — our canary or tests didn't catch something. Both are tracked.
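The taxonomy above can be sketched as a small classifier. The check names and the `classify` helper are illustrative, not part of the rig's actual code:

```python
# Map the five pass/fail signals onto the tracked failure classes.
# Check names are illustrative shorthand for tests #1-#5 above.
CHECKS = ["static", "tests", "semantic", "canary", "production_24h"]

def classify(results: dict[str, bool]) -> str:
    """Classify a change's outcome from its five quality signals."""
    if all(results[c] for c in CHECKS):
        return "trusted"
    if results["static"] and results["tests"] and not results["semantic"]:
        return "subtly_wrong"          # works but violates stated intent
    if all(results[c] for c in CHECKS[:4]) and not results["production_24h"]:
        return "measurement_failure"   # canary/tests missed something
    return "hard_failure"              # failed a baseline gate

print(classify({"static": True, "tests": True, "semantic": False,
                "canary": True, "production_24h": True}))  # subtly_wrong
```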
The Stanford/NIST AI Agent Standards¶
The February 2026 Stanford/NIST AI Agent Standards consolidate four dimensions:
| Dimension | Definition | Target for trusted rig |
|---|---|---|
| Goal accuracy | % of dispatched tasks ending in the intended outcome (merged PR without human rework) | >85% for T1, >75% for T2 |
| Hallucination rate | % of outputs containing fabricated content (hallucinated APIs, invalid citations, nonexistent files) | <2% |
| Token efficiency | Cost per successful goal completion | Decreasing week-over-week, reviewed weekly |
| Change Failure Rate | % of merged PRs requiring rollback or hotfix within 7d | <5% |
Plus two rig-specific metrics:
| Metric | Definition | Target |
|---|---|---|
| Rework rate | % of commits added to a PR after initial draft, excluding Review-E-requested changes | <10% |
| Refusal accuracy | % of "unanswerable" tasks correctly escalated rather than fabricated | >95% |
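As a minimal sketch, the goal-accuracy metric above can be computed from per-task records. The `TaskRecord` field names and outcome labels are assumptions, not the rig's real schema:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    tier: str     # "T1", "T2", ...
    outcome: str  # e.g. "merged_clean", "merged_reworked", "escalated"

def goal_accuracy(records: list[TaskRecord], tier: str) -> float:
    """Share of tasks at this tier that merged without human rework."""
    pool = [r for r in records if r.tier == tier]
    hits = sum(r.outcome == "merged_clean" for r in pool)
    return hits / len(pool) if pool else 0.0

records = [TaskRecord("T1", "merged_clean")] * 9 + [TaskRecord("T1", "merged_reworked")]
print(f"{goal_accuracy(records, 'T1'):.0%}")  # 90% -> above the >85% T1 target
```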
The evaluation harness — split cadence¶
Earlier drafts had the cost math wrong
An earlier draft proposed "nightly, ~50 tasks, $20-40/night" framed as "5-10% of direct production spend." That ratio is only true if total LLM spend is >$100k/year. For a 1-2 person rig with total annual LLM spend of ~$10-30k, $20-40 × 365 = $7.3-14.6k/year, which is 25-75% of total spend — unsustainable. Corrected to a split cadence below.
Split-cadence target setup¶
Two scheduled K8s Jobs, not one:
Nightly (lightweight) — the regression gate
- Checks out rig-gitops at current main
- Runs agents against the golden suite (10 tasks) + accumulated regression cases
- Uploads results to Langfuse
- Posts Grafana dashboard update
- Fails the pipeline (emits alert) if regression > 10% on any metric
Approximate cost: ~$3-8/night × 365 = $1.1-2.9k/year. Runs fast (~30-60 min wall-clock), catches actual regressions in our own task set.
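The nightly gate's pass/fail logic can be sketched as a per-cohort comparison against a stored baseline; the cohort names and baseline format are illustrative assumptions:

```python
REGRESSION_TOLERANCE = 0.10  # matches the ">10% on any metric" rule above

def regressions(baseline: dict[str, float], tonight: dict[str, float]) -> list[str]:
    """Cohorts whose pass rate dropped more than 10% relative to baseline."""
    return [
        cohort for cohort, base in baseline.items()
        if base > 0 and (base - tonight.get(cohort, 0.0)) / base > REGRESSION_TOLERANCE
    ]

baseline = {"golden_suite": 0.90, "regression_cases": 1.00}
tonight  = {"golden_suite": 0.70, "regression_cases": 1.00}
bad = regressions(baseline, tonight)
print(("FAIL: " + ", ".join(bad)) if bad else "PASS")  # FAIL: golden_suite
```

A real job would exit non-zero on `FAIL`, which is what fails the pipeline and fires the alert.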
Weekly (benchmark) — the trend line
- Runs agents against SWE-bench Pro 30-task subset (the contamination-resistant benchmark)
- Uploads to Langfuse
- Updates weekly trend in the dashboard
- No CI pipeline failure — this is a trend, not a gate
Approximate cost: ~$20-40/week × 52 = $1.0-2.1k/year. Runs overnight once a week, ~8-hour wall-clock.
Total evaluation budget¶
Combined: ~$2.1-5.0k/year. Roughly 15-20% of a $10-30k small-rig LLM budget — expensive but sustainable, not the "25-75%" a nightly-everything design would cost. Budgeted explicitly in the cost framework.
The evaluation suite (same cohorts, different cadences)¶
| Cohort | Size | Cadence | Purpose |
|---|---|---|---|
| Internal golden suite | 10 tasks | Nightly | Catches regressions in our own task distribution |
| Regression cases | N (grows per incident) | Nightly | Prevents re-introducing past bugs |
| SWE-bench Pro subset | 30 tasks | Weekly | Trend vs. general benchmark, compares to published numbers |
| LiveCodeBench | 50-task subset | Quarterly | Contamination-resistant secondary signal |
SWE-bench Verified is contaminated (late 2025+)
Top models across vendors are within 1 point of each other on Verified (Anthropic Opus 4.6: 80.8%, Anthropic Sonnet 4.6: 79.6%, Google Gemini 3.1 Pro: 80.6%, OpenAI GPT-5.2: 80.0%) — the benchmark no longer discriminates. The fact that four-vendor numbers cluster this tightly is itself evidence for the portability thesis in provider-portability.md. SWE-bench Pro drops the same models to 46–57%. We use Pro, not Verified. LiveCodeBench is contamination-resistant but measures raw model quality, not agent-scaffolding quality — we run it quarterly as a secondary signal.
Eval pipeline¶
sequenceDiagram
participant S as Scheduled Job
participant G as Git checkout
participant A as Agent
participant V as Verifier
participant L as Langfuse
participant GR as Grafana
participant AL as Alertmanager
S->>G: Pull main + agent HelmRelease
S->>A: For each task, dispatch
A->>A: Runs task (claims, commits, PR)
A->>V: Report outcome
V->>V: Run tests, lint, type-check
V->>V: Run property tests (Hypothesis)
V->>V: LLM-as-judge semantic check
V->>L: Upload per-task result
S->>GR: Update nightly dashboard
alt regression detected
S->>AL: Fire QualityRegressionAlert
end
Dashboard¶
The nightly dashboard shows:
- Pass rate per cohort per agent (line, 30d)
- Tokens per successful task per agent (line, 30d)
- Wall-clock per successful task per agent (line, 30d)
- Cost per successful task per agent (line, 30d)
- Regression count week-over-week (bar)
- New-regression-case adds per week (bar)
Alerts: >10% regression in any cohort triggers P2 (per-issue thread); >25% triggers P1 (#admin).
Property-based testing¶
From arXiv:2510.09907 (October 2025): LLM-generated property tests find bugs beyond unit-test coverage. Originally the whitepaper proposed running this on every non-trivial agent-authored change. Honest re-evaluation: that is too expensive for our scale — one extra LLM invocation per PR plus CI runtime per PR. Property tests shine for algorithmic code with real invariants, not routine CRUD features.
Revised gating¶
A subagent runs the property-test generator only when a change is explicitly marked or matches high-risk heuristics:
- PR has the property-tests label (explicit author opt-in)
- File touched is in an allowlist (e.g., src/core/**, projections/**, migration scripts)
- Change is a fix for a production bug (regression insurance — always runs)
- Change adds a new pure function (detected by AST: no mutation, no I/O)
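The gating rule above can be sketched as a single predicate. The path globs, label name, and the `is_bugfix` / `adds_pure_function` flags are illustrative inputs; a real implementation would derive the last one from an AST pass:

```python
from fnmatch import fnmatch

ALLOWLIST = ["src/core/**", "projections/**", "migrations/*"]  # illustrative globs

def should_generate_property_tests(labels, touched_files,
                                   is_bugfix, adds_pure_function) -> bool:
    if is_bugfix:                   # regression insurance: always runs
        return True
    if "property-tests" in labels:  # explicit author opt-in
        return True
    if any(fnmatch(f, pat) for f in touched_files for pat in ALLOWLIST):
        return True
    return adds_pure_function       # AST-detected new pure function

print(should_generate_property_tests([], ["docs/readme.md"], False, False))  # False
```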
When it does run, the subagent prompt:
Your task: read the diff, identify invariants, write 5-10 Hypothesis property tests.
Run them. Report any failures. If all pass, write them as permanent regression tests into the repo.
Hypothesis runs bounded (default 100 examples per property, 60s max).
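A generated property test under that bound might look like the following. The sort invariants are an illustrative target, not from the rig; the `settings` decorator is how Hypothesis caps examples and applies a per-example deadline:

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=100, deadline=1000)  # bounded, per the policy above
@given(st.lists(st.integers()))
def test_sort_invariants(xs):
    out = sorted(xs)
    assert len(out) == len(xs)                        # permutation preserves length
    assert all(a <= b for a, b in zip(out, out[1:]))  # result is ordered
    assert sorted(out) == out                         # idempotent

test_sort_invariants()  # runs the bounded example sweep
```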
Trivial changes skip the phase¶
Renames, import reordering, comment updates, doc-only PRs. Most T0 and T1 changes fall here.
Integration with CI¶
Hypothesis tests run in CI alongside regular tests. Failures block merge. The same sanity limit applies (100 examples per property, 60s max) so agent-generated tests don't come to dominate CI runtime.
Adoption trajectory¶
Phase 1: Property tests generated, run locally, reported but not enforced. Collect data on false-positive rate.
Phase 2: Property tests enforced for new files (not yet for legacy files). Lower-risk rollout.
Phase 3: Property tests enforced repo-wide.
DORA metrics adapted to agents¶
DORA (deployment frequency, lead time, MTTR, change-failure rate) adapts directly:
| DORA metric | Agent equivalent | Measured via |
|---|---|---|
| Deployment frequency | PRs merged per week per agent | GitHub API |
| Lead time | Issue-created to PR-merged | Conductor-E event log |
| MTTR | Incident-detected to SLO-restored | Self-healing pipeline |
| Change failure rate | % of merged PRs requiring rollback in 7d | Rollback events ∩ PR list |
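The change-failure-rate cell above (rollback events ∩ PR list) can be sketched directly; the event shapes here are illustrative:

```python
from datetime import datetime, timedelta

def change_failure_rate(merged_prs: dict, rollbacks: list,
                        window=timedelta(days=7)) -> float:
    """merged_prs: {pr_id: merge_time}; rollbacks: [(pr_id, rollback_time)].
    A PR counts as failed if rolled back within `window` of its merge."""
    failed = {
        pr for pr, t in rollbacks
        if pr in merged_prs and timedelta(0) <= t - merged_prs[pr] <= window
    }
    return len(failed) / len(merged_prs) if merged_prs else 0.0

t0 = datetime(2026, 3, 1)
merged = {101: t0, 102: t0, 103: t0, 104: t0}
rolled = [(102, t0 + timedelta(days=2)), (999, t0)]  # 999 never merged: ignored
print(f"CFR: {change_failure_rate(merged, rolled):.0%}")  # CFR: 25%
```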
Target ranges (Google's DORA 2025 "Elite" criteria, adapted):
- Deployment frequency: multiple per day per active agent
- Lead time: < 1 hour for T0, < 1 day for T1, < 1 week for T2
- MTTR: < 1 hour
- Change failure rate: < 15% (Elite), < 5% (our aspirational)
LLM-as-judge for semantic quality¶
LLM-as-judge is useful for one thing: detecting that agent-authored code matches the stated intent. It's not a replacement for execution-based verification.
Pattern (default: a bigger / cross-family model judges the implementer's diff — e.g., Opus 4.7 judging Sonnet 4.6 output, or GPT-5.2 judging Sonnet 4.6 output as a cross-family check; configurable per provider-portability.md):
Judge model reviews implementer's diff:
Given TaskSpec.acceptance_criteria:
- Criterion 1: ...
- Criterion 2: ...
And the diff:
---
[diff content]
---
Does the diff satisfy each acceptance criterion?
Output: JSON { criterion_1_met: bool, criterion_2_met: bool, reasoning: string, overall_confidence: float }
Applied to:
- Every merged PR (sampled 10%)
- Every T2 PR (100% sampled, blocking on disagreement)
- Every Repair-E auto-fix (100% sampled)
Disagreements between Review-E and the judge are flagged for human review. Over time, judge-human disagreement rate is itself a metric (quality of Review-E's judgment).
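The disagreement flag can be sketched as a comparison between the parsed judge verdict and Review-E's approval. Field names mirror the JSON schema above; the confidence floor is an illustrative choice:

```python
import json

def flag_disagreement(judge_json: str, review_e_approved: bool,
                      confidence_floor: float = 0.6) -> bool:
    """True when a confident judge verdict contradicts Review-E."""
    verdict = json.loads(judge_json)
    criteria_met = all(v for k, v in verdict.items() if k.endswith("_met"))
    confident = verdict["overall_confidence"] >= confidence_floor
    return confident and (criteria_met != review_e_approved)

raw = ('{"criterion_1_met": true, "criterion_2_met": false, '
       '"reasoning": "...", "overall_confidence": 0.9}')
print(flag_disagreement(raw, review_e_approved=True))  # True -> human review
```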
Prompt regression testing¶
When an agent's prompt changes, a CI job runs:
- Replay a golden suite of 20 prior tasks with the new prompt
- Compare outcomes against the old-prompt baseline
- Fail the PR if any golden task regresses beyond tolerance (e.g., pass → fail)
Golden suite is small enough to run in CI (~5 minutes, ~$5). Captures the "I tweaked the prompt to fix X but it broke Y" failure mode.
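The comparison step can be sketched as a per-task diff against the old-prompt baseline; task IDs and the zero-tolerance pass→fail rule shown here are illustrative:

```python
def regressed_tasks(baseline: dict[str, bool], replay: dict[str, bool]) -> list[str]:
    """Golden tasks that passed with the old prompt but fail with the new one."""
    return [t for t, passed in baseline.items() if passed and not replay.get(t, False)]

baseline = {"task-01": True, "task-02": True, "task-03": False}
replay   = {"task-01": True, "task-02": False, "task-03": True}
bad = regressed_tasks(baseline, replay)
print(f"{'FAIL' if bad else 'PASS'}: {bad}")  # FAIL: ['task-02']
```

Note that task-03 improving (fail → pass) does not offset task-02's regression; any pass → fail flip fails the PR.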
Braintrust's distinguishing pattern: every production trace → one-click convert to eval case. Our lighter version: a weekly script grep-searches Langfuse for traces Review-E flagged as poor-quality, suggests them as golden-suite additions, human approves.
Integration with autonomy tiers¶
Quality metrics drive autonomy promotion (trust-model.md). Concretely:
- T0 → T1 promotion (for a task class): requires goal_accuracy > 85% over 20 most recent T0 runs of that class, zero rollbacks
- T1 → T2 promotion: requires goal_accuracy > 85% over 20 most recent T1 runs, zero canary aborts, zero SLO-budget depletions
- Demotion: any rollback attributable to agent's work on that class → immediate demotion
Measurable. Automatic. Audit-trailed.
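The rules above reduce to a pure decision function. The record shape and return labels are illustrative, not the real promotion API:

```python
def promotion_decision(recent_runs: list[dict], rollbacks: int,
                       canary_aborts: int = 0, slo_depletions: int = 0) -> str:
    """recent_runs: the most recent runs of this task class, each {"goal_met": bool}.
    Returns 'raise', 'hold', or 'demote' per the rules above."""
    if rollbacks > 0:
        return "demote"   # any attributable rollback demotes immediately
    if len(recent_runs) < 20:
        return "hold"     # not enough evidence yet
    accuracy = sum(r["goal_met"] for r in recent_runs) / len(recent_runs)
    if accuracy > 0.85 and canary_aborts == 0 and slo_depletions == 0:
        return "raise"
    return "hold"

runs = [{"goal_met": True}] * 18 + [{"goal_met": False}] * 2  # 90% accuracy
print(promotion_decision(runs, rollbacks=0))  # raise
```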
The "subtly wrong" signal¶
Tests pass + lint passes + types check + canary analysis passes → but the code is subtly wrong. Known failure class.
Detection signals:
- LLM-as-judge disagrees with Review-E — semantic check flag
- Property tests find failures — bug was in an invariant not checked by unit tests
- Increased bug-report rate on the affected code — signal from production monitoring
- Similar-pattern rollback — incidents tagged with code location + pattern matcher
Each of these is a separate metric. Correlations between them refine the detection.
What we consciously don't measure¶
- Subjective code quality scores (complexity heuristics, smell counts, architectural purity) — too noisy, game-able, doesn't correlate with production outcomes.
- Absolute speed of agents — speed is only meaningful relative to task difficulty; we measure throughput instead (successful task rate per unit time).
- "Agent happiness" metrics — anthropomorphizing leads to misplaced priorities.
- Per-commit AST diffs — the diff-is-correct check is handled by tests + canary; adding AST analysis is operational overhead for marginal signal.
Continuous vs. scheduled¶
- Continuous (per-PR): test pass, lint, type-check, property tests, LLM-as-judge on T2/T3
- Scheduled (nightly): SWE-bench Pro subset, internal golden suite, regression cases
- Scheduled (weekly): DORA aggregates, autonomy-tier review, model drift canary
- Scheduled (quarterly): LiveCodeBench, dashboard audit, eval-case curation
Evaluation runs are attested too¶
Each eval run produces:
- A signed attestation (Sigstore) binding the run to the specific agent-config-commit × model-version
- A Langfuse trace with all per-task outcomes
- A dashboard update
If someone argues "the promotion was unfair," the attestation + traces are replayable proof of exactly what was evaluated and what the score was.
Cost of quality — honest numbers at small-rig scale¶
Quality measurement itself costs tokens. Numbers assume default Anthropic routing at current Sonnet 4.6 / Opus 4.7 pricing; same shape applies under OpenAI or Gemini routing with shifted dollar amounts — see provider-portability.md.
At our scale (1-2 person rig, ~$10-30k/year total LLM spend):
- Nightly golden-suite + regression eval: ~$3-8 × 365 = $1.1-2.9k/year
- Weekly SWE-bench Pro subset (30 tasks): ~$20-40 × 52 = $1.0-2.1k/year
- Quarterly LiveCodeBench subset: ~$80 × 4 = $320/year
- Per-PR LLM-as-judge sampling (10% T0, 100% T2): small, ~$0.10-$1 per sample
- Property-test generation (label-gated, not every PR): ~$5 per non-trivial change
- Prompt regression CI: ~$5 per prompt change
Combined quality-measurement budget: ~$2.5-5.5k/year, roughly 15-20% of a small-rig total LLM budget. Significant but sustainable.
This is a floor, not a ceiling
Skimping on measurement erodes trust, which was the entire point. If total LLM spend is tight, cut the quarterly LiveCodeBench first, then the weekly SWE-bench Pro (to biweekly). Never cut the nightly regression gate — that's the cheapest check and the one preventing known-incident re-introductions.
At larger scale
Once total LLM spend crosses ~$60k/year, the same quality-measurement budget is 5-10% of spend — the ratio that earlier drafts incorrectly claimed for all scales. The absolute dollar cost scales roughly linearly with per-provider pricing; the percentage shifts with total spend. Review this budget quarterly against realized spend.
The quality dashboard (public)¶
Every human on the rig can see:
- Per-agent goal accuracy (30d trend)
- Per-agent cost-per-successful-task (30d trend)
- Per-agent change failure rate (30d)
- Rollback rate (7d, 30d)
- New regression cases added this week
- Open quality-regression alerts
Transparent. No hiding bad numbers.
When quality metrics disagree¶
Concrete case: LLM-as-judge disagrees with Review-E on a merged PR. Resolution:
- Both opinions captured as events
- Sampling review by human (weekly)
- If judge right, Review-E gets a "training" attestation; if Review-E right, judge metric adjusted
- Disagreements themselves are a tracked metric (ideally decreasing over time)
The meta-rule: disagreements between quality signals are data, not noise. They refine each other.
See also¶
- index.md
- principles.md — principle 1 (measurable) enforced here
- trust-model.md — how quality scores drive tier promotion
- observability.md — where quality metrics live
- drift-detection.md — how quality tracks model/prompt drift
- cost-framework.md — the cost of quality measurement itself