rc#947: SelfImprovementService design proposal
Issue: rig-conductor#947
TL;DR
Add a SelfImprovementService to rig-conductor that watches the rate, frequency, and distribution of signals existing detectors are already emitting, and files gap-analysis issues when a recurring intervention has crossed from "incident" to "systemic gap." Observation + intake only — no auto-fixes, no dispatches.
Why now
During the 2026-05-15→16 dashe-website compliance audit, the orchestrator manually filed 7 follow-up issues (rar#420/#421/#423, rc#942/#943/#944/#946) describing rig gaps. Each was directly derivable from event-store signals the conductor was already recording — there is no service that aggregates these into actionable patterns and files them. This service closes that loop.
What changed from the rc#947 issue body
The issue body proposes 7 standalone watchers (EscalatedFlagSticky, ProviderQuotaStuck, SiblingMergeConflict, etc.) implemented as independent projection-pattern matches. This proposal re-frames them as META queries over what existing detectors already record, because:
- `SloEnforcerService` already manages SLO state + the `EscalatedLive` derived property
- `StuckIssueDetectorService` already detects per-issue stuck cycles
- `StuckWatcherService` already aggregates stuck fingerprints
- `MainGuardService` already handles main-branch CI failures + the review-e self-PR class via existing guards
- `AdminBypassAuditService` already records every admin-merge
Standalone watchers would duplicate this detection work. The right level is one layer up: watch how often each detector trips and where.
Goal
When a recurring intervention pattern crosses a configured threshold, file a gap-analysis issue with linked occurrences, a draft fix proposal, and the `self-improvement` label, so the orchestrator's pattern-noticing work is automated.
Non-goals (from the issue body, preserved verbatim):
- NOT auto-merging rig-internal PRs (still orchestrator-only)
- NOT auto-coding fixes (HARD RULE preserved — rig agents don't work on the rig)
- NOT dispatching dev-e on rig-internal repos (operator-only via admin-merge)
Architecture
Files
    src/ConductorE.Core/SelfImprovement/
    ├── ISelfImprovementWatcher.cs   # interface: EvaluateAsync(IServiceProvider, CancellationToken) → Task<IReadOnlyList<SignatureOccurrence>>
    ├── SignatureOccurrence.cs       # value record passed back to the service
    └── Watchers/                    # one file per signature
        ├── EscalatedFlagStickyWatcher.cs
        ├── ProviderQuotaSaturationWatcher.cs
        ├── SiblingMergeConflictWatcher.cs
        ├── DevEStaleDismissalWatcher.cs
        ├── ReviewESpuriousPrWatcher.cs
        ├── PlannerNoLargePrOkWatcher.cs
        ├── AdminBypassRateWatcher.cs
        └── StreamConsumerWithoutHeartbeatWatcher.cs
    src/ConductorE.Api/Services/
    └── SelfImprovementService.cs    # BackgroundService: periodic scan, dedup, file issues
    src/ConductorE.Core/Domain/
    └── ReadModels.cs                # + SelfImprovementSignatureState record (Marten doc)
    tests/ConductorE.Core.Tests/SelfImprovement/
    ├── EscalatedFlagStickyWatcherTests.cs
    ├── ProviderQuotaSaturationWatcherTests.cs
    ├── ... (one per watcher)
    └── SelfImprovementServiceFlowTests.cs
    docs/
    └── self-improvement-service.md  # operator runbook: "how to add a new watcher"
Watcher contract
    using System;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    public interface ISelfImprovementWatcher
    {
        /// <summary>Stable signature name; used as Marten document id and issue title prefix.</summary>
        string SignatureName { get; }

        /// <summary>Default threshold + window, overridable via config.</summary>
        SignatureThreshold DefaultThreshold { get; }

        /// <summary>
        /// Examine the current state and return any occurrences observed in this scan.
        /// Service handles dedup, threshold-check, issue-filing; the watcher just observes.
        /// </summary>
        Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
    }

    public record SignatureOccurrence(
        string Repo,
        int? IssueNumber,
        int? PrNumber,
        string ContextSummary,    // 1-line "why this tripped"
        DateTimeOffset ObservedAt,
        string Permalink          // GitHub permalink to the evidence
    );

    public record SignatureThreshold(int Count, TimeSpan Window);
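Watchers are pure observers: they query Marten and, where a signature needs it, the event store or another read-only source (the table below includes a live GitHub query and an XINFO CONSUMERS check), then return occurrences. The service orchestrates dedup, persistence, threshold checking, and issue filing. This keeps watchers trivial to unit-test.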
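The unit-testing claim can be illustrated with a test double. The following is a minimal sketch, not the proposed implementation: it repeats the contract types so it compiles on its own, and `FixedOccurrencesWatcher`, its signature name, and the sample occurrence values (including the permalink) are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical usage: evaluate a stub watcher and inspect what it reports.
var watcher = new FixedOccurrencesWatcher(new[]
{
    new SignatureOccurrence("dashecorp/rig-conductor", 947, null,
        "sample occurrence", DateTimeOffset.UtcNow, "https://example.invalid/evidence"),
});
var found = await watcher.EvaluateAsync(null!, CancellationToken.None);
Console.WriteLine($"{watcher.SignatureName}: {found.Count} occurrence(s)");

// Contract types, repeated from the proposal so the sketch is self-contained.
public record SignatureOccurrence(
    string Repo, int? IssueNumber, int? PrNumber,
    string ContextSummary, DateTimeOffset ObservedAt, string Permalink);

public record SignatureThreshold(int Count, TimeSpan Window);

public interface ISelfImprovementWatcher
{
    string SignatureName { get; }
    SignatureThreshold DefaultThreshold { get; }
    Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
}

// Hypothetical test double: returns a canned list and touches no storage.
public sealed class FixedOccurrencesWatcher : ISelfImprovementWatcher
{
    private readonly IReadOnlyList<SignatureOccurrence> _canned;

    public FixedOccurrencesWatcher(IReadOnlyList<SignatureOccurrence> canned) => _canned = canned;

    public string SignatureName => "FixedOccurrences";
    public SignatureThreshold DefaultThreshold => new(3, TimeSpan.FromDays(30));

    public Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct)
        => Task.FromResult(_canned);
}
```

A real watcher would resolve its Marten session from `scope`; the stub ignores it, which is why the usage line can pass `null!`.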
Service flow
- Scan interval: 15 minutes (matches `ReconciliationService` cadence)
- Per watcher:
  - `var occurrences = await watcher.EvaluateAsync(scope, ct);`
  - Load `SelfImprovementSignatureState` doc keyed by `SignatureName`
  - Append occurrences, prune those outside the watcher's window
  - If count ≥ threshold AND no open gap-analysis issue exists:
    - Mint App token via `IGitHubAppTokenProvider`
    - Open issue in the watcher's target repo (default: `dashecorp/rig-conductor`)
    - Save the open-issue-number on the state doc
  - If count ≥ threshold AND open gap-analysis issue exists:
    - Add a comment with new occurrences (no re-file)
  - Persist state via Marten session.
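The per-watcher steps above reduce to a small pure function: merge, prune by window, then pick one of three outcomes. The sketch below assumes occurrences are tracked by timestamp only. `ScanStep`, `ScanAction`, and the parameter names are hypothetical, and skipping the comment when a scan finds nothing new is an assumption the proposal does not spell out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var now = DateTimeOffset.UtcNow;
var window = TimeSpan.FromDays(30);
var stored = new List<DateTimeOffset> { now.AddDays(-45), now.AddDays(-10), now.AddDays(-2) };
var fresh = new List<DateTimeOffset> { now };

// The 45-day-old entry falls outside the window; three remain, meeting a threshold of 3.
var (kept, action) = ScanStep.Apply(stored, fresh, now, window, thresholdCount: 3, openIssueNumber: null);
Console.WriteLine($"{kept.Count} in window -> {action}"); // prints: 3 in window -> FileIssue

public enum ScanAction { None, FileIssue, CommentOnOpenIssue }

public static class ScanStep
{
    public static (IReadOnlyList<DateTimeOffset> Kept, ScanAction Action) Apply(
        IEnumerable<DateTimeOffset> stored,
        IEnumerable<DateTimeOffset> fresh,
        DateTimeOffset now,
        TimeSpan window,
        int thresholdCount,
        int? openIssueNumber)
    {
        var cutoff = now - window;

        // Append the new occurrences, then drop everything older than the window.
        var freshInWindow = fresh.Where(t => t >= cutoff).ToList();
        var kept = stored.Where(t => t >= cutoff).Concat(freshInWindow).OrderBy(t => t).ToList();

        if (kept.Count < thresholdCount) return (kept, ScanAction.None);
        if (openIssueNumber is null) return (kept, ScanAction.FileIssue);

        // Assumption: only comment when this scan actually produced something new.
        return (kept, freshInWindow.Count > 0 ? ScanAction.CommentOnOpenIssue : ScanAction.None);
    }
}
```

Keeping the decision free of I/O means the threshold and dedup paths can be tested without Marten or GitHub; the service would then act on the returned `ScanAction`.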
Dedup invariants
- One `SelfImprovementSignatureState` doc per `SignatureName`. Idempotent on restart.
- One open gap-analysis issue per signature at a time. If closed, the next threshold crossing files a new one.
- Occurrences are pruned by window, so long-lived state docs stay small.
- Non-recurrence tracking (forward-compat for auto-close): the state doc records `LastOccurrenceAt` independent of issue lifecycle, so a future auto-close PR can observe "≥7d since last occurrence while issue is open" without a schema change. PR-1 reads this field only for the dashboard; operator handles close.
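For illustration, the state document and the check a later auto-close change might run could look like the sketch below. Only `SignatureName` as the document key and `LastOccurrenceAt` come from the proposal; the record shape, `OpenIssueNumber`, `Convergence.IsQuietWhileOpen`, and the sample values are assumptions. PR-1 itself leaves closing to the operator.

```csharp
using System;

var now = DateTimeOffset.UtcNow;
var state = new SelfImprovementSignatureState(
    Id: "AdminBypassRate",                 // the watcher's SignatureName
    OpenIssueNumber: 123,                  // hypothetical issue number
    LastOccurrenceAt: now.AddDays(-9));

// Nine quiet days with an issue still open satisfies the ">= 7d" condition.
Console.WriteLine(Convergence.IsQuietWhileOpen(state, now, TimeSpan.FromDays(7))); // prints: True

// Assumed shape of the state document; the occurrence list is omitted for brevity.
public record SelfImprovementSignatureState(
    string Id,
    int? OpenIssueNumber,
    DateTimeOffset? LastOccurrenceAt);

public static class Convergence
{
    // True when an issue is open and no occurrence has been seen for at least quietPeriod.
    public static bool IsQuietWhileOpen(
        SelfImprovementSignatureState state, DateTimeOffset now, TimeSpan quietPeriod)
        => state.OpenIssueNumber is not null
           && state.LastOccurrenceAt is { } last
           && now - last >= quietPeriod;
}
```

Because `LastOccurrenceAt` is stored separately from the pruned occurrence list, the check still works after every occurrence has aged out of the window.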
Initial 8 watchers
Each one is re-framed as a query over what already exists:
| Watcher | Signal source | Default threshold |
|---|---|---|
| EscalatedFlagStickyWatcher | IssueStatus rows where SloEscalated=true AND !EscalatedLive AND LastUpdated > escalationTime + 1h. Note: the raw flag is by-design sticky; what we're detecting is downstream consumers that ignore the live invariant. | 3 per 30d |
| ProviderQuotaSaturationWatcher | AgentStatus.QuotaFiveHourPct ≥ 95 AND IssueStatus.state=failed AND last failure event contains "usage_limit_reached" | 3 per 7d |
| SiblingMergeConflictWatcher | Live GitHub query: PR.mergeable=CONFLICTING AND a sibling PR in same repo merged within 30m touching ≥1 overlapping file | 3 per 30d |
| DevEStaleDismissalWatcher | Review.state=CHANGES_REQUESTED events where commit_id is an ancestor of HEAD AND no dismissal event since the HEAD push | 3 per 30d |
| ReviewESpuriousPrWatcher | PullRequestOpened events where author=review-e-bot AND the PR closes an issue currently assigned to dev-e-*. Should already be impossible after rc#946; the watcher persists as a regression detector. Regression canary: this watcher should never trip; if it does, rc#946's server-side reject regressed and the on-call response is "investigate rc#946 immediately." | 1 per 30d (zero-tolerance) |
| PlannerNoLargePrOkWatcher | IssueCreated events with body matching `/every (page\|file)\|site[- ]?wide\|across all/i` AND no large-pr-ok label | 3 per 30d |
| AdminBypassRateWatcher | OperatorOverrideRecord rows of Kind=AdminMerge in the last 30d, grouped by repo. Caveat: admin-bypass is the operator escape hatch by design, so raw count alone is noisy. Watcher groups by OperatorOverrideRecord.Reason and only files when one reason dominates (≥60% of the 5 bypasses share a reason category, e.g. "CI flake" or "review-e bottleneck"); the filed issue then proposes the specific fix for that dominant reason (e.g. "stabilize CI X" or "tune review-e routing for repo Y"). If no reason dominates, the watcher stays quiet: a high admin-merge rate without a single dominant cause is plausibly correct usage. | 5 per 30d per repo |
| StreamConsumerWithoutHeartbeatWatcher | XINFO CONSUMERS assignments:{agentId} shows a consumer with recent activity (idle < 5min) AND /api/agents reports the corresponding agent as liveness=offline for > 1h. Detects the rc#959 codex-stream-black-hole pattern (consumer alive enough to claim stream messages but provider quota-exhausted and not heartbeating, leading to silent XACK without delivery). | 1 per 7d (any occurrence is load-bearing) |
Watchers can be added incrementally — the service registers them via DI scanning, so a new watcher = one file + one DI line.
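The AdminBypassRateWatcher rule in the table (file only when one reason accounts for at least 60% of the bypasses) is the least obvious of the eight, so here is a minimal sketch of that check. It assumes the reasons have already been filtered to one repo and the 30d window; `DominantReason.Find` and its parameters are hypothetical names, and integer arithmetic is used so the 3-of-5 boundary is not subject to floating-point rounding.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Sample reason categories for one repo within the window (illustrative values).
var reasons = new[] { "CI flake", "CI flake", "review-e bottleneck", "CI flake", "hotfix" };

// 3 of 5 bypasses share "CI flake", which is exactly 60%.
Console.WriteLine(DominantReason.Find(reasons, minCount: 5, minSharePercent: 60) ?? "(none)"); // prints: CI flake

public static class DominantReason
{
    // Returns the reason shared by at least minSharePercent of the bypasses,
    // or null when there are too few bypasses or no reason dominates.
    public static string? Find(IReadOnlyCollection<string> reasons, int minCount, int minSharePercent)
    {
        if (reasons.Count < minCount) return null;

        var top = reasons
            .GroupBy(r => r)
            .OrderByDescending(g => g.Count())
            .First();

        return top.Count() * 100 >= minSharePercent * reasons.Count ? top.Key : null;
    }
}
```

When the result is `null` the watcher would report nothing for that repo, which matches the "stays quiet" behaviour described in the table.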
GitHub App write path
Reuses IGitHubAppTokenProvider exactly as ImagePinDispatchService does:
    var token = await _tokenProvider.GetInstallationTokenAsync(ct);
    if (token is null) { /* log + skip this tick */ return; }

    var client = _httpFactory.CreateClient("github");
    client.DefaultRequestHeaders.Authorization = new("Bearer", token);

    var body = new {
        title = $"[self-improvement] {watcher.SignatureName}: {occurrences.Count} occurrences in {threshold.Window.TotalDays}d",
        body = renderedIssueBody,
        labels = new[] { "self-improvement" }
    };

    // PostAsJsonAsync is the System.Net.Http.Json extension method.
    await client.PostAsJsonAsync($"https://api.github.com/repos/{targetRepo}/issues", body, ct);
App identity: dev-e-bot (the existing token producer for ImagePinDispatchService). No new App needed.
Operator triage consequence: because we reuse dev-e-bot, GitHub's UI will show the same author for both image-pin chore PRs and self-improvement issues. Discriminate by label:self-improvement (issues) vs label:image-pin (PRs) when triaging — author alone is ambiguous.
Dashboard surface
- New stat "Self-improvement 🔁": count of tripped signatures (i.e. signatures with ≥threshold occurrences in window); clicking it filters the Issues tab to `label:self-improvement is:open`.
- New tab "Self-Improvement": a table with one row per signature (name, current count, threshold, window, open issue link).
- Endpoint: `GET /api/self-improvement/signatures` → array of `{ name, count, threshold, window, openIssue, lastOccurrence }`.
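The endpoint's response shape can be pinned down with a small DTO. This is a sketch under assumptions: the field names come from the list above, but the `SignatureRow` type, the field types (in particular `window` as a string such as "30d"), and the sample values are hypothetical. `JsonSerializerDefaults.Web` is used because it produces the camelCase names shown in the endpoint description.

```csharp
using System;
using System.Text.Json;

// One sample row; values are illustrative only.
var row = new SignatureRow(
    Name: "AdminBypassRate",
    Count: 5,
    Threshold: 5,
    Window: "30d",
    OpenIssue: null,
    LastOccurrence: new DateTimeOffset(2026, 5, 16, 0, 0, 0, TimeSpan.Zero));

// Web defaults serialize property names in camelCase.
var options = new JsonSerializerOptions(JsonSerializerDefaults.Web);
var json = JsonSerializer.Serialize(new[] { row }, options);
Console.WriteLine(json);

// Assumed DTO for GET /api/self-improvement/signatures.
public record SignatureRow(
    string Name,
    int Count,
    int Threshold,
    string Window,
    int? OpenIssue,
    DateTimeOffset? LastOccurrence);
```

Nullable `OpenIssue` and `LastOccurrence` let the dashboard distinguish "no issue filed yet" and "never observed" from zero values.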
Testing
- Unit: one test file per watcher, mocked Marten + event store fixtures. Coverage: trips at threshold, doesn't trip below, dedups across runs.
- Integration: `SelfImprovementServiceFlowTests.cs` runs the full pipeline (in-memory Marten, mocked GitHub API), verifying the issue-file and comment paths and state doc persistence.
Implementation plan
Single PR. Order within the PR:
- `ISelfImprovementWatcher.cs` + records
- 8 watcher classes (per-watcher tests as I write each)
- `SelfImprovementService.cs` (BackgroundService, flow logic, integration test)
- DI registration in `Program.cs`
- `/api/self-improvement/signatures` endpoint
- Dashboard widget + tab
- `docs/self-improvement-service.md` runbook
Estimated size: ~700–900 LOC including tests. Estimated time: half a day.
CI: standard. Admin-merge after Copilot review (orchestrator-authored — review-e doesn't review user-authored PRs per the recurring halt class).
Out of scope
- Drafting fix code (orchestrator-only, by HARD RULE)
- Dispatching dev-e (orchestrator-only)
- Auto-resolving the existing 4 stuck-pattern issues on dashe-website — those need separate triage
- Cross-repo gap analysis (e.g. "rar + rgo trip the same signature together") — possible follow-up
Open questions (resolved in review)
- Issue body author identity. Resolved (review): keep `dev-e-bot`, label-discriminate. Operator triage uses `label:self-improvement` vs `label:image-pin`. Documented above in the GitHub App write path section.
- Threshold tuning. Resolved (review): hardcoded for PR-1. TODO (follow-up): file an issue to make thresholds configurable via env vars once we have ≥1 month of trip-rate data and at least one watcher needs re-tuning. Cross-link from that follow-up issue back to this proposal.
- Auto-close on convergence. Resolved (review): operator-only for PR-1. State doc still tracks `LastOccurrenceAt` (see Dedup invariants) so a future auto-close PR is a small change rather than a schema migration.
Decision requested
- Approve META framing? (or revert to literal rc#947 body)
- Approve dev-e-bot App identity reuse? (leaning yes per review; see GitHub App write path)
- Approve hardcoded thresholds for PR-1? (leaning yes per review; follow-up TODO captured)
- Greenlight to implement?