rc#947 — SelfImprovementService design proposal¶

TL;DR¶

Add a SelfImprovementService to rig-conductor that watches the frequency and distribution of signals that existing detectors already emit, and files gap-analysis issues when a recurring intervention has crossed from "incident" to "systemic gap." Observation and intake only: no auto-fixes, no dispatches.

Why now¶

During the 2026-05-15→16 dashe-website compliance audit, the orchestrator manually filed 7 follow-up issues (rar#420/#421/#423, rc#942/#943/#944/#946) describing rig gaps. Each was directly derivable from event-store signals the conductor was already recording, but no service aggregates those signals into actionable patterns and files them. This service closes that loop.

What changed from the rc#947 issue body¶

The issue body proposes 7 standalone watchers (EscalatedFlagSticky, ProviderQuotaStuck, SiblingMergeConflict, etc.) implemented as independent projection-pattern matches. This proposal re-frames them as META queries over what existing detectors already record, because:

SloEnforcerService already manages SLO state + the EscalatedLive derived property
StuckIssueDetectorService already detects per-issue stuck cycles
StuckWatcherService already aggregates stuck fingerprints
MainGuardService already handles main-branch CI failures + the review-e self-PR class via existing guards
AdminBypassAuditService already records every admin-merge

Standalone watchers would duplicate this detection work. The right level is one layer up: watch how often each detector trips and where.

Goal¶

When a recurring intervention pattern crosses a configured threshold, file a gap-analysis issue with linked occurrences, a draft fix proposal, and the self-improvement label — so the orchestrator's pattern-noticing work is automated.

Non-goals (from the issue body, preserved verbatim):

NOT auto-merging rig-internal PRs (still orchestrator-only)
NOT auto-coding fixes (HARD RULE preserved — rig agents don't work on the rig)
NOT dispatching dev-e on rig-internal repos (operator-only via admin-merge)

Architecture¶

Files¶

src/ConductorE.Core/SelfImprovement/
├── ISelfImprovementWatcher.cs        # interface: EvaluateAsync(IServiceProvider, CancellationToken) → Task<IReadOnlyList<SignatureOccurrence>>
├── SignatureOccurrence.cs            # value record passed back to the service
└── Watchers/                         # one file per signature
    ├── EscalatedFlagStickyWatcher.cs
    ├── ProviderQuotaSaturationWatcher.cs
    ├── SiblingMergeConflictWatcher.cs
    ├── DevEStaleDismissalWatcher.cs
    ├── ReviewESpuriousPrWatcher.cs
    ├── PlannerNoLargePrOkWatcher.cs
    ├── AdminBypassRateWatcher.cs
    └── StreamConsumerWithoutHeartbeatWatcher.cs

src/ConductorE.Api/Services/
└── SelfImprovementService.cs         # BackgroundService — periodic scan, dedup, file issues

src/ConductorE.Core/Domain/
└── ReadModels.cs                     # + SelfImprovementSignatureState record (Marten doc)

tests/ConductorE.Core.Tests/SelfImprovement/
├── EscalatedFlagStickyWatcherTests.cs
├── ProviderQuotaSaturationWatcherTests.cs
├── ... (one per watcher)
└── SelfImprovementServiceFlowTests.cs

docs/
└── self-improvement-service.md       # operator runbook — "how to add a new watcher"

Watcher contract¶

public interface ISelfImprovementWatcher
{
    /// <summary>Stable signature name — used as Marten document id and issue title prefix.</summary>
    string SignatureName { get; }

    /// <summary>Default threshold + window — overridable via config.</summary>
    SignatureThreshold DefaultThreshold { get; }

    /// <summary>
    /// Examine the current state and return any occurrences observed in this scan.
    /// Service handles dedup, threshold-check, issue-filing — watcher just observes.
    /// </summary>
    Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
}

public record SignatureOccurrence(
    string Repo,
    int? IssueNumber,
    int? PrNumber,
    string ContextSummary,            // 1-line "why this tripped"
    DateTimeOffset ObservedAt,
    string Permalink                  // GitHub permalink to the evidence
);

public record SignatureThreshold(int Count, TimeSpan Window);

Watchers are pure observers: they query Marten (and, where needed, the event store) and return occurrences. The service orchestrates dedup, persistence, threshold checking, and issue filing. This keeps watchers trivial to unit-test.
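To illustrate why a pure-observer watcher is easy to test, here is a minimal sketch of one implementation. The contract types are repeated so the block compiles on its own. `OverrideRow`, `ExampleAdminMergeWatcher`, and the injected `loadRows` delegate are hypothetical names invented for this example; a real watcher would presumably resolve its Marten session from the `scope` argument instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Contract types repeated from the section above so the sketch is self-contained.
public record SignatureThreshold(int Count, TimeSpan Window);
public record SignatureOccurrence(string Repo, int? IssueNumber, int? PrNumber,
    string ContextSummary, DateTimeOffset ObservedAt, string Permalink);

public interface ISelfImprovementWatcher
{
    string SignatureName { get; }
    SignatureThreshold DefaultThreshold { get; }
    Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
}

// Hypothetical row type standing in for whatever read model a real watcher queries.
public record OverrideRow(string Repo, int PrNumber, string Kind, DateTimeOffset At);

// Hypothetical watcher: its only dependency is a delegate that returns rows,
// so a test can feed it fixtures without Marten or GitHub.
public sealed class ExampleAdminMergeWatcher : ISelfImprovementWatcher
{
    private readonly Func<IReadOnlyList<OverrideRow>> _loadRows;

    public ExampleAdminMergeWatcher(Func<IReadOnlyList<OverrideRow>> loadRows) => _loadRows = loadRows;

    public string SignatureName => "ExampleAdminMerge";
    public SignatureThreshold DefaultThreshold => new(5, TimeSpan.FromDays(30));

    public Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        // Observe only: map matching rows to occurrences; no dedup, no thresholds, no issue filing.
        IReadOnlyList<SignatureOccurrence> result = _loadRows()
            .Where(r => r.Kind == "AdminMerge")
            .Select(r => new SignatureOccurrence(
                r.Repo, null, r.PrNumber,
                $"admin-merge on PR #{r.PrNumber}",
                r.At,
                $"https://github.com/{r.Repo}/pull/{r.PrNumber}"))
            .ToList();

        return Task.FromResult(result);
    }
}
```

A unit test can then construct the watcher with a fixed list of rows and assert on the returned occurrences, with no mocks for the service-side concerns.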

Service flow¶

Scan interval: 15 minutes (matches ReconciliationService cadence)
Per watcher:
var occurrences = await watcher.EvaluateAsync(scope, ct);
Load SelfImprovementSignatureState doc keyed by SignatureName
Append occurrences, prune those outside the watcher's window
If count ≥ threshold AND no open gap-analysis issue exists:
- Mint App token via IGitHubAppTokenProvider
- Open issue in the watcher's target repo (default: dashecorp/rig-conductor)
- Save the open-issue-number on the state doc
If count ≥ threshold AND open gap-analysis issue exists:
- Add a comment with new occurrences (no re-file)
Persist state via Marten session.
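
The per-watcher steps above can be sketched as one pure decision function. This is an illustration under assumptions, not the service's actual code: `ScanAction`, `ScanDecision`, and `Decide` are hypothetical names, occurrences are reduced to their timestamps, and the sketch treats every returned occurrence as new (identity-level dedup across scans is left out). It also assumes a comment is only posted when the current scan returned something.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ScanAction { None, FileIssue, CommentOnOpenIssue }

public static class ScanDecision
{
    // Merge this scan's occurrence timestamps into the stored ones, prune everything
    // older than the window, then decide what the service should do on this tick.
    public static (List<DateTimeOffset> Kept, ScanAction Action) Decide(
        IEnumerable<DateTimeOffset> stored,
        IReadOnlyCollection<DateTimeOffset> fresh,
        int thresholdCount,
        TimeSpan window,
        bool hasOpenIssue,
        DateTimeOffset now)
    {
        var cutoff = now - window;
        var kept = stored.Concat(fresh)
            .Where(t => t >= cutoff)
            .OrderBy(t => t)
            .ToList();

        // Below threshold: persist the pruned state and do nothing else.
        if (kept.Count < thresholdCount) return (kept, ScanAction.None);

        // At or above threshold with no open issue: file one.
        if (!hasOpenIssue) return (kept, ScanAction.FileIssue);

        // At or above threshold with an open issue: comment, never re-file.
        // Assumption: stay silent when this scan produced nothing new.
        return (kept, fresh.Count > 0 ? ScanAction.CommentOnOpenIssue : ScanAction.None);
    }
}
```

Keeping the decision free of I/O means the issue-file and comment paths can be covered by plain unit tests, with the Marten session and the GitHub client exercised only in the flow test.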

Dedup invariants¶

One SignatureState doc per SignatureName. Idempotent on restart.
One open gap-analysis issue per signature at a time. If closed, the next threshold crossing files a new one.
Occurrences are pruned by window — long-lived state docs stay small.
Non-recurrence tracking (forward-compat for auto-close): the state doc records LastOccurrenceAt independent of issue lifecycle, so a future auto-close PR can observe "≥7d since last occurrence while issue is open" without a schema change. PR-1 reads this field only for the dashboard; operator handles close.
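
As a rough sketch of the state document these invariants imply, the record below carries the fields the proposal names (the SignatureName key, the saved open-issue number, LastOccurrenceAt). The exact shape is an assumption: `OccurrenceTimes`, `Convergence`, and `IsQuiet` are hypothetical, and `Id` follows Marten's default convention of identifying a document by its `Id` member.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of the per-signature state document.
public record SelfImprovementSignatureState(
    string Id,                                      // = SignatureName, so there is one doc per signature
    IReadOnlyList<DateTimeOffset> OccurrenceTimes,  // pruned to the watcher's window on every scan
    int? OpenIssueNumber,                           // null when no gap-analysis issue is open
    DateTimeOffset? LastOccurrenceAt);              // kept independently of pruning and issue lifecycle

public static class Convergence
{
    // The check a future auto-close change could run: an issue is still open,
    // but nothing has been observed for at least `quietFor`.
    public static bool IsQuiet(SelfImprovementSignatureState state, DateTimeOffset now, TimeSpan quietFor) =>
        state.OpenIssueNumber is not null
        && state.LastOccurrenceAt is { } last
        && now - last >= quietFor;
}
```

Because LastOccurrenceAt is stored separately from the pruned occurrence list, the quiet check keeps working even after every occurrence has aged out of the window.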

Initial 8 watchers¶

Each one is re-framed as a query over what already exists:

Watcher	Signal source	Default threshold
`EscalatedFlagStickyWatcher`	`IssueStatus` rows where `SloEscalated=true AND !EscalatedLive AND LastUpdated > escalationTime + 1h`. Note: the raw flag is by-design sticky — what we're detecting is downstream consumers that ignore the live invariant.	3 per 30d
`ProviderQuotaSaturationWatcher`	`AgentStatus.QuotaFiveHourPct ≥ 95 AND IssueStatus.state=failed AND last failure event contains "usage_limit_reached"`	3 per 7d
`SiblingMergeConflictWatcher`	Live GitHub query: `PR.mergeable=CONFLICTING AND a sibling PR in same repo merged within 30m touching ≥1 overlapping file`	3 per 30d
`DevEStaleDismissalWatcher`	`Review.state=CHANGES_REQUESTED` events where `commit_id` is an ancestor of `HEAD` AND no `dismissal` event since the HEAD push	3 per 30d
`ReviewESpuriousPrWatcher`	`PullRequestOpened` events where `author=review-e-bot` AND the PR closes an issue currently assigned to `dev-e-`. Should already be impossible after rc#946; the watcher persists as a regression detector. Regression canary: this watcher should never trip; if it does, rc#946's server-side reject regressed and the on-call response is "investigate rc#946 immediately."	1 per 30d (zero-tolerance)
`PlannerNoLargePrOkWatcher`	`IssueCreated` events with body matching `/every (page|file)|site[- ]?wide|across all/i` AND no `large-pr-ok` label	3 per 30d
`AdminBypassRateWatcher`	`OperatorOverrideRecord` rows of `Kind=AdminMerge` in the last 30d, grouped by repo. Caveat: admin-bypass is the operator escape hatch by design, so raw count alone is noisy. Once a repo reaches 5 bypasses in the window, the watcher groups them by `OperatorOverrideRecord.Reason` and files only when one reason category dominates (≥60% of the bypasses, i.e. at least 3 of 5, share a category such as "CI flake" or "review-e bottleneck"); the filed issue then proposes the specific fix for that dominant reason (e.g. "stabilize CI X" or "tune review-e routing for repo Y"). If no reason dominates, the watcher stays quiet: a high admin-merge rate without a single dominant cause is plausibly correct usage.	5 per 30d per repo
`StreamConsumerWithoutHeartbeatWatcher`	`XINFO CONSUMERS assignments:{agentId}` shows a consumer with recent activity (idle < 5min) AND `/api/agents` reports the corresponding agent as `liveness=offline` for > 1h. Detects the rc#959 codex-stream-black-hole pattern (consumer alive enough to claim stream messages but provider quota-exhausted and not heartbeating, leading to silent XACK without delivery).	1 per 7d (any occurrence is load-bearing)
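
Two of the predicates in the table are simple enough to sketch as pure functions: the dominant-reason rule of AdminBypassRateWatcher and the body/label match of PlannerNoLargePrOkWatcher. `WatcherPredicates`, `DominantReason`, and `LooksSiteWide` are hypothetical names, and the sketch assumes the 60% share is computed over all bypasses in the window for one repo.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WatcherPredicates
{
    // AdminBypassRateWatcher: returns the dominant reason category, or null when the
    // repo is below the count threshold or no single reason reaches the required share.
    public static string? DominantReason(
        IReadOnlyCollection<string> reasons, int minCount = 5, int minSharePercent = 60)
    {
        if (reasons.Count < minCount) return null;

        var top = reasons
            .GroupBy(r => r)
            .OrderByDescending(g => g.Count())
            .First();

        // Integer arithmetic avoids floating-point edge cases at exactly 60%.
        return top.Count() * 100 >= minSharePercent * reasons.Count ? top.Key : null;
    }

    // PlannerNoLargePrOkWatcher: the table's pattern, case-insensitive.
    private static readonly Regex SiteWide =
        new(@"every (page|file)|site[- ]?wide|across all", RegexOptions.IgnoreCase);

    // True when the issue body reads as site-wide work but lacks the large-pr-ok label.
    public static bool LooksSiteWide(string issueBody, IEnumerable<string> labels) =>
        SiteWide.IsMatch(issueBody) && !labels.Contains("large-pr-ok");
}
```

With five bypasses of which three share the reason "CI flake", `DominantReason` returns that reason; with a 2/2/1 split it returns null and the watcher stays quiet.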

Watchers can be added incrementally — the service registers them via DI scanning, so a new watcher = one file + one DI line.
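
One way DI scanning could work is a reflection pass that collects every concrete implementation of the watcher contract. The sketch below is an assumption about the mechanism, not the service's actual registration code; `IExampleWatcher`, `AlphaWatcher`, `BetaWatcher`, `WatcherBase`, and `WatcherDiscovery` are hypothetical stand-ins so the block runs without the real contract.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-ins; in the proposal the contract is ISelfImprovementWatcher.
public interface IExampleWatcher { string SignatureName { get; } }
public sealed class AlphaWatcher : IExampleWatcher { public string SignatureName => "Alpha"; }
public sealed class BetaWatcher : IExampleWatcher { public string SignatureName => "Beta"; }
public abstract class WatcherBase : IExampleWatcher { public abstract string SignatureName { get; } }

public static class WatcherDiscovery
{
    // Returns every concrete class in the assembly that implements the contract,
    // in a stable order so registration does not depend on reflection ordering.
    public static IReadOnlyList<Type> FindImplementations(Type contract, Assembly assembly) =>
        assembly.GetTypes()
            .Where(t => t.IsClass && !t.IsAbstract && contract.IsAssignableFrom(t))
            .OrderBy(t => t.FullName, StringComparer.Ordinal)
            .ToList();
}
```

Each discovered type could then be registered against the interface in Program.cs, for example with `services.AddScoped(typeof(ISelfImprovementWatcher), type)`; the scoped lifetime is an assumption here.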

GitHub App write path¶

Reuses IGitHubAppTokenProvider exactly as ImagePinDispatchService does:

// Mint an installation token; skip this tick if none is available.
var token = await _tokenProvider.GetInstallationTokenAsync(ct);
if (token is null) { /* log + skip this tick */ return; }

var client = _httpFactory.CreateClient("github");
client.DefaultRequestHeaders.Authorization = new("Bearer", token);
var body = new {
    title = $"[self-improvement] {watcher.SignatureName}: {occurrences.Count} occurrences in {threshold.Window.TotalDays}d",
    body = renderedIssueBody,
    labels = new[] { "self-improvement" }
};
// Check the response so a rejected request is not mistaken for a filed issue.
using var response = await client.PostAsJsonAsync($"https://api.github.com/repos/{targetRepo}/issues", body, ct);
if (!response.IsSuccessStatusCode) { /* log + skip this tick */ return; }
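
The service flow saves the open-issue number on the state doc, so the number has to be read from GitHub's response. GitHub's create-issue response is a JSON object with a `number` field; the helper below is a minimal sketch of extracting it defensively. `GitHubIssueResponse` and `TryReadIssueNumber` are hypothetical names, not part of the existing code.

```csharp
using System;
using System.Text.Json;

public static class GitHubIssueResponse
{
    // Returns the issue number from a create-issue response body,
    // or null if the body is not the expected JSON shape.
    public static int? TryReadIssueNumber(string responseJson)
    {
        try
        {
            using var doc = JsonDocument.Parse(responseJson);
            var root = doc.RootElement;

            if (root.ValueKind == JsonValueKind.Object
                && root.TryGetProperty("number", out var number)
                && number.ValueKind == JsonValueKind.Number
                && number.TryGetInt32(out var value))
            {
                return value;
            }

            return null;
        }
        catch (JsonException)
        {
            // Malformed body: treat it as "no issue number" and let the caller log it.
            return null;
        }
    }
}
```

A null result would be handled like the other skip paths: log, leave the state doc's open-issue number unset, and retry on the next tick.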

App identity: dev-e-bot (the existing token producer for ImagePinDispatchService). No new App needed.

Operator triage consequence: because we reuse dev-e-bot, GitHub's UI will show the same author for both image-pin chore PRs and self-improvement issues. Discriminate by label:self-improvement (issues) vs label:image-pin (PRs) when triaging — author alone is ambiguous.

Dashboard surface¶

New stat: "Self-improvement 🔁", the count of tripped signatures (i.e. signatures with ≥threshold occurrences in window). Clicking it filters the Issues tab to label:self-improvement is:open.
New tab: "Self-Improvement" — table with one row per signature: name, current count, threshold, window, open issue link.
Endpoint: GET /api/self-improvement/signatures → array of { name, count, threshold, window, openIssue, lastOccurrence }.
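
The endpoint's response shape can be sketched as a record serialized with camelCase property names. The field types are assumptions (the proposal only lists the field names): here `window` is a display string such as "30d" and `openIssue` is a nullable issue number. `SignatureRow` and `SignatureJson` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical DTO for one row of GET /api/self-improvement/signatures.
public record SignatureRow(
    string Name,
    int Count,
    int Threshold,
    string Window,                    // assumed display form, e.g. "30d"
    int? OpenIssue,                   // null when no gap-analysis issue is open
    DateTimeOffset? LastOccurrence);

public static class SignatureJson
{
    // camelCase policy so C# property names map onto the documented field names.
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };

    public static string Serialize(IReadOnlyList<SignatureRow> rows) =>
        JsonSerializer.Serialize(rows, Options);
}
```

The dashboard tab can bind its columns directly to these fields, and the stat widget can count the rows where count is at or above threshold.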

Testing¶

Unit: one test file per watcher, mocked Marten + event store fixtures. Coverage: trips at threshold, doesn't trip below, dedups across runs.
Integration: SelfImprovementServiceFlowTests.cs — full pipeline (in-memory Marten, mocked GitHub API) verifying issue-file + comment paths + state doc persistence.

Implementation plan¶

Single PR. Order within the PR:

ISelfImprovementWatcher.cs + records
8 watcher classes (each with its tests written alongside)
SelfImprovementService.cs (BackgroundService, flow logic, integration test)
DI registration in Program.cs
/api/self-improvement/signatures endpoint
Dashboard widget + tab
docs/self-improvement-service.md runbook

Estimated size: ~700–900 LOC including tests. Estimated time: half a day.

CI: standard. Admin-merge after Copilot review (orchestrator-authored — review-e doesn't review user-authored PRs per the recurring halt class).

Out of scope¶

Drafting fix code (orchestrator-only, by HARD RULE)
Dispatching dev-e (orchestrator-only)
Auto-resolving the existing 4 stuck-pattern issues on dashe-website — those need separate triage
Cross-repo gap analysis (e.g. "rar + rgo trip the same signature together") — possible follow-up

Open questions — resolved in review¶

Issue body author identity — Resolved (review): keep dev-e-bot, label-discriminate. Operator triage uses label:self-improvement vs label:image-pin. Documented above in the GitHub App write path section.
Threshold tuning — Resolved (review): hardcoded for PR-1. TODO (follow-up): file an issue to make thresholds configurable via env vars once we have ≥1 month of trip-rate data and at least one watcher needs re-tuning. Cross-link from that follow-up issue back to this proposal.
Auto-close on convergence — Resolved (review): operator-only for PR-1. State doc still tracks LastOccurrenceAt (see Dedup invariants) so a future auto-close PR is a small change rather than a schema migration.

Decision requested¶

Approve META framing? (or revert to literal rc#947 body)
Approve dev-e-bot App identity reuse? (leaning yes per review; see GitHub App write path)
Approve hardcoded thresholds for PR-1? (leaning yes per review; follow-up TODO captured)
Greenlight to implement?