rc#947 — SelfImprovementService design proposal

Issue: rig-conductor#947

TL;DR

Add a SelfImprovementService to rig-conductor that watches the rate, frequency, and distribution of signals existing detectors are already emitting, and files gap-analysis issues when a recurring intervention has crossed from "incident" to "systemic gap." Observation + intake only — no auto-fixes, no dispatches.

Why now

During the 2026-05-15→16 dashe-website compliance audit, the orchestrator manually filed 7 follow-up issues (rar#420/#421/#423, rc#942/#943/#944/#946) describing rig gaps. Each was directly derivable from event-store signals the conductor was already recording, but no service aggregates those signals into actionable patterns and files them. This service closes that loop.

What changed from the rc#947 issue body

The issue body proposes 7 standalone watchers (EscalatedFlagSticky, ProviderQuotaStuck, SiblingMergeConflict, etc.) implemented as independent projection-pattern matches. This proposal re-frames them as META queries over what existing detectors already record, because:

  • SloEnforcerService already manages SLO state + the EscalatedLive derived property
  • StuckIssueDetectorService already detects per-issue stuck cycles
  • StuckWatcherService already aggregates stuck fingerprints
  • MainGuardService already handles main-branch CI failures + the review-e self-PR class via existing guards
  • AdminBypassAuditService already records every admin-merge

Standalone watchers would duplicate this detection work. The right level is one layer up: watch how often each detector trips and where.

Goal

When a recurring intervention pattern crosses a configured threshold, file a gap-analysis issue with linked occurrences, a draft fix proposal, and the self-improvement label — so the orchestrator's pattern-noticing work is automated.

Non-goals (from the issue body, preserved verbatim):

  • NOT auto-merging rig-internal PRs (still orchestrator-only)
  • NOT auto-coding fixes (HARD RULE preserved — rig agents don't work on the rig)
  • NOT dispatching dev-e on rig-internal repos (operator-only via admin-merge)

Architecture

Files

src/ConductorE.Core/SelfImprovement/
├── ISelfImprovementWatcher.cs        # interface: EvaluateAsync(IServiceProvider, CancellationToken) → Task<IReadOnlyList<SignatureOccurrence>>
├── SignatureOccurrence.cs            # value record passed back to the service
└── Watchers/                         # one file per signature
    ├── EscalatedFlagStickyWatcher.cs
    ├── ProviderQuotaSaturationWatcher.cs
    ├── SiblingMergeConflictWatcher.cs
    ├── DevEStaleDismissalWatcher.cs
    ├── ReviewESpuriousPrWatcher.cs
    ├── PlannerNoLargePrOkWatcher.cs
    ├── AdminBypassRateWatcher.cs
    └── StreamConsumerWithoutHeartbeatWatcher.cs

src/ConductorE.Api/Services/
└── SelfImprovementService.cs         # BackgroundService — periodic scan, dedup, file issues

src/ConductorE.Core/Domain/
└── ReadModels.cs                     # + SelfImprovementSignatureState record (Marten doc)

tests/ConductorE.Core.Tests/SelfImprovement/
├── EscalatedFlagStickyWatcherTests.cs
├── ProviderQuotaSaturationWatcherTests.cs
├── ... (one per watcher)
└── SelfImprovementServiceFlowTests.cs

docs/
└── self-improvement-service.md       # operator runbook — "how to add a new watcher"

Watcher contract

public interface ISelfImprovementWatcher
{
    /// <summary>Stable signature name — used as Marten document id and issue title prefix.</summary>
    string SignatureName { get; }

    /// <summary>Default threshold + window. Hardcoded for PR-1; config overrides are a follow-up (see Open questions).</summary>
    SignatureThreshold DefaultThreshold { get; }

    /// <summary>
    /// Examine the current state and return any occurrences observed in this scan.
    /// Service handles dedup, threshold-check, issue-filing — watcher just observes.
    /// </summary>
    Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
}

public record SignatureOccurrence(
    string Repo,
    int? IssueNumber,
    int? PrNumber,
    string ContextSummary,            // 1-line "why this tripped"
    DateTimeOffset ObservedAt,
    string Permalink                  // GitHub permalink to the evidence
);

public record SignatureThreshold(int Count, TimeSpan Window);

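Watchers are pure observers: they query Marten and, where a signature needs it, the event store or a live source (GitHub for SiblingMergeConflictWatcher, Redis stream metadata for StreamConsumerWithoutHeartbeatWatcher), then return occurrences. The service orchestrates dedup, persistence, threshold checking, and issue filing, which keeps watchers trivial to unit-test.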
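
To illustrate how small a watcher can stay under this contract, here is a minimal sketch of PlannerNoLargePrOkWatcher. The contract types are repeated so the snippet compiles on its own. ICreatedIssueSource and CreatedIssue are hypothetical stand-ins for the real Marten/event-store query, and the SignatureName value is an assumption; the regex, the large-pr-ok label, and the 3-per-30d threshold come from this proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

// Contract types repeated from the section above so the sketch is self-contained.
public record SignatureThreshold(int Count, TimeSpan Window);
public record SignatureOccurrence(string Repo, int? IssueNumber, int? PrNumber,
    string ContextSummary, DateTimeOffset ObservedAt, string Permalink);

public interface ISelfImprovementWatcher
{
    string SignatureName { get; }
    SignatureThreshold DefaultThreshold { get; }
    Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(IServiceProvider scope, CancellationToken ct);
}

// HYPOTHETICAL read-side abstraction; the real watcher would query Marten / the event store.
public record CreatedIssue(string Repo, int Number, string Body,
    IReadOnlyList<string> Labels, DateTimeOffset CreatedAt, string Url);

public interface ICreatedIssueSource
{
    Task<IReadOnlyList<CreatedIssue>> RecentAsync(TimeSpan window, CancellationToken ct);
}

public sealed class PlannerNoLargePrOkWatcher : ISelfImprovementWatcher
{
    // Pattern as given for this watcher in the proposal.
    private static readonly Regex WideScope =
        new(@"every (page|file)|site[- ]?wide|across all", RegexOptions.IgnoreCase);

    public string SignatureName => "PlannerNoLargePrOk";   // assumed name
    public SignatureThreshold DefaultThreshold => new(3, TimeSpan.FromDays(30));

    // Pure predicate, so the matching rule can be unit-tested without any store.
    public static bool IsOccurrence(string body, IEnumerable<string> labels) =>
        WideScope.IsMatch(body) && !labels.Contains("large-pr-ok");

    public async Task<IReadOnlyList<SignatureOccurrence>> EvaluateAsync(
        IServiceProvider scope, CancellationToken ct)
    {
        var source = (ICreatedIssueSource?)scope.GetService(typeof(ICreatedIssueSource))
            ?? throw new InvalidOperationException("ICreatedIssueSource is not registered.");

        var issues = await source.RecentAsync(DefaultThreshold.Window, ct);

        // Observe only: no dedup, no threshold check, no issue filing here.
        return issues
            .Where(i => IsOccurrence(i.Body, i.Labels))
            .Select(i => new SignatureOccurrence(i.Repo, i.Number, null,
                "Wide-scope issue body without the large-pr-ok label", i.CreatedAt, i.Url))
            .ToList();
    }
}
```

Because the watcher only maps source rows to occurrences, a unit test can hand it an in-memory ICreatedIssueSource and assert on the returned list.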

Service flow

  1. Scan interval: 15 minutes (matches ReconciliationService cadence)
  2. Per watcher:
    1. var occurrences = await watcher.EvaluateAsync(scope, ct);
    2. Load SelfImprovementSignatureState doc keyed by SignatureName
    3. Append occurrences, prune those outside the watcher's window
    4. If count ≥ threshold AND no open gap-analysis issue exists:
      • Mint App token via IGitHubAppTokenProvider
      • Open issue in the watcher's target repo (default: dashecorp/rig-conductor)
      • Save the open-issue-number on the state doc
    5. If count ≥ threshold AND open gap-analysis issue exists:
      • Add a comment with new occurrences (no re-file)
    6. Persist state via Marten session.
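
The per-watcher part of the flow above (append, prune, then file, comment, or do nothing) can be sketched as a pure function. This is an illustration under assumptions, not the service's implementation: StoredOccurrence, TickResult, TickAction, and SignatureTick are hypothetical names, and identifying an occurrence by its permalink is an assumed dedup rule that the proposal does not spell out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum TickAction { None, FileIssue, CommentOnOpenIssue }

// Trimmed-down occurrence: only the fields the decision needs.
public record StoredOccurrence(string Permalink, DateTimeOffset ObservedAt);

public record TickResult(
    IReadOnlyList<StoredOccurrence> Kept,
    IReadOnlyList<StoredOccurrence> Added,
    TickAction Action);

public static class SignatureTick
{
    public static TickResult Apply(
        IReadOnlyList<StoredOccurrence> stored,
        IReadOnlyList<StoredOccurrence> observed,
        int thresholdCount, TimeSpan window, int? openIssueNumber, DateTimeOffset now)
    {
        var cutoff = now - window;

        // Prune: keep only stored occurrences still inside the window.
        var keptOld = stored.Where(o => o.ObservedAt >= cutoff).ToList();

        // ASSUMPTION: the permalink identifies an occurrence, so evidence that a
        // watcher re-reports on the next scan does not inflate the count.
        var known = keptOld.Select(o => o.Permalink).ToHashSet();
        var added = observed
            .Where(o => o.ObservedAt >= cutoff && known.Add(o.Permalink))
            .ToList();

        var kept = keptOld.Concat(added).ToList();

        TickAction action;
        if (kept.Count < thresholdCount) action = TickAction.None;
        else if (openIssueNumber is null) action = TickAction.FileIssue;
        else action = added.Count > 0 ? TickAction.CommentOnOpenIssue : TickAction.None;

        return new TickResult(kept, added, action);
    }
}
```

Keeping the decision free of I/O means the integration test only has to verify that each TickAction maps to the right GitHub call and that Kept is what gets persisted. Read literally, the flow files a new issue on the first scan after an issue is closed if the count is still at threshold; this sketch follows that reading.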

Dedup invariants

  • One SignatureState doc per SignatureName. Idempotent on restart.
  • One open gap-analysis issue per signature at a time. If closed, the next threshold crossing files a new one.
  • Occurrences are pruned by window — long-lived state docs stay small.
  • Non-recurrence tracking (forward-compat for auto-close): the state doc records LastOccurrenceAt independent of issue lifecycle, so a future auto-close PR can observe "≥7d since last occurrence while issue is open" without a schema change. PR-1 reads this field only for the dashboard; operator handles close.
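
One possible shape for the state document behind these invariants is sketched below. Only the SelfImprovementSignatureState name, the keying by SignatureName, the stored open-issue number, and LastOccurrenceAt come from this proposal; the other member names, StoredOccurrence, and the IsQuiet helper are hypothetical. The sketch relies on Marten's convention of using a public Id member as the document identity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical trimmed-down occurrence kept on the state doc.
public record StoredOccurrence(string Permalink, DateTimeOffset ObservedAt);

public sealed record SelfImprovementSignatureState
{
    // Document identity: holds the watcher's SignatureName, so there is
    // exactly one doc per signature and reloads after a restart are idempotent.
    public string Id { get; init; } = "";

    // Pruned to the watcher's window on every scan, so the doc stays small.
    public List<StoredOccurrence> Occurrences { get; init; } = new();

    // Null when no gap-analysis issue is currently open for this signature.
    public int? OpenIssueNumber { get; init; }

    // Tracked independently of the issue lifecycle (forward-compat for auto-close).
    public DateTimeOffset? LastOccurrenceAt { get; init; }

    // What a future auto-close change could ask: is an issue open while the
    // signature has been silent for at least quietPeriod?
    public bool IsQuiet(TimeSpan quietPeriod, DateTimeOffset now) =>
        OpenIssueNumber is not null
        && LastOccurrenceAt is { } last
        && now - last >= quietPeriod;
}
```

In PR-1 nothing would call IsQuiet for closing issues; it only shows that the "≥7d since last occurrence while issue is open" check needs no fields beyond the ones already stored.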

Initial 8 watchers

Each one is re-framed as a query over what already exists:

  • EscalatedFlagStickyWatcher
    • Signal source: IssueStatus rows where SloEscalated=true AND !EscalatedLive AND LastUpdated > escalationTime + 1h.
    • Note: the raw flag is sticky by design; what we're detecting is downstream consumers that ignore the live invariant.
    • Default threshold: 3 per 30d
  • ProviderQuotaSaturationWatcher
    • Signal source: AgentStatus.QuotaFiveHourPct ≥ 95 AND IssueStatus.state=failed AND last failure event contains "usage_limit_reached"
    • Default threshold: 3 per 7d
  • SiblingMergeConflictWatcher
    • Signal source: live GitHub query: PR.mergeable=CONFLICTING AND a sibling PR in the same repo merged within 30m touching ≥1 overlapping file
    • Default threshold: 3 per 30d
  • DevEStaleDismissalWatcher
    • Signal source: Review.state=CHANGES_REQUESTED events where commit_id is an ancestor of HEAD AND no dismissal event since the HEAD push
    • Default threshold: 3 per 30d
  • ReviewESpuriousPrWatcher
    • Signal source: PullRequestOpened events where author=review-e-bot AND the PR closes an issue currently assigned to dev-e-*.
    • Note: this should already be impossible after rc#946; the watcher persists as a regression canary. It should never trip; if it does, rc#946's server-side reject regressed and the on-call response is "investigate rc#946 immediately."
    • Default threshold: 1 per 30d (zero-tolerance)
  • PlannerNoLargePrOkWatcher
    • Signal source: IssueCreated events with body matching /every (page|file)|site[- ]?wide|across all/i AND no large-pr-ok label
    • Default threshold: 3 per 30d
  • AdminBypassRateWatcher
    • Signal source: OperatorOverrideRecord rows of Kind=AdminMerge in the last 30d, grouped by repo.
    • Caveat: admin-bypass is the operator escape hatch by design, so raw count alone is noisy. The watcher groups by OperatorOverrideRecord.Reason and only files when one reason dominates (≥60% of the 5 bypasses share a reason category, e.g. "CI flake" or "review-e bottleneck"); the filed issue then proposes the specific fix for that dominant reason (e.g. "stabilize CI X" or "tune review-e routing for repo Y"). If no reason dominates, the watcher stays quiet: a high admin-merge rate without a single dominant cause is plausibly correct usage.
    • Default threshold: 5 per 30d per repo
  • StreamConsumerWithoutHeartbeatWatcher
    • Signal source: XINFO CONSUMERS assignments:{agentId} shows a consumer with recent activity (idle < 5min) AND /api/agents reports the corresponding agent as liveness=offline for > 1h.
    • Note: detects the rc#959 codex-stream-black-hole pattern (consumer alive enough to claim stream messages but provider quota-exhausted and not heartbeating, leading to silent XACK without delivery).
    • Default threshold: 1 per 7d (any occurrence is load-bearing)
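
The AdminBypassRateWatcher dominance rule is the least obvious of these, so here is a minimal sketch of it. AdminBypassDominance and DominantReason are hypothetical names. The sketch assumes the share is computed over all bypasses for one repo inside the 30d window, and that reason categories are compared case-insensitively; the proposal does not pin down either detail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AdminBypassDominance
{
    // Returns the dominant reason category for one repo's bypasses, or null
    // when the watcher should stay quiet (too few bypasses, or no dominant cause).
    public static string? DominantReason(
        IReadOnlyCollection<string> reasons, int minCount = 5, int minSharePercent = 60)
    {
        if (reasons.Count == 0 || reasons.Count < minCount) return null;

        var top = reasons
            .GroupBy(r => r.Trim(), StringComparer.OrdinalIgnoreCase)
            .OrderByDescending(g => g.Count())
            .First();

        // Integer arithmetic avoids floating-point edge cases at exactly 60%.
        return top.Count() * 100 >= minSharePercent * reasons.Count ? top.Key : null;
    }
}
```

With five bypasses of which three are "CI flake", the share is exactly 60% and the rule trips; a 2/2/1 split returns null, which matches the "stay quiet when no reason dominates" behaviour described for this watcher.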

Watchers can be added incrementally — the service registers them via DI scanning, so a new watcher = one file + one DI line.
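
As a rough sketch of what "DI scanning" could look like, the helper below finds every concrete implementation of a contract interface in an assembly using plain reflection. WatcherDiscovery and the Demo* types are hypothetical; in the service the contract would be ISelfImprovementWatcher, and how the real registration is wired is not specified in this proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-ins so the sketch runs on its own.
public interface IDemoWatcher { }
public sealed class DemoStickyWatcher : IDemoWatcher { }
public abstract class DemoWatcherBase : IDemoWatcher { }   // abstract: must be skipped

public static class WatcherDiscovery
{
    // Concrete, non-abstract classes assignable to the contract, in a stable order.
    public static IReadOnlyList<Type> FindImplementations(Assembly assembly, Type contract) =>
        assembly.GetTypes()
            .Where(t => t.IsClass && !t.IsAbstract && contract.IsAssignableFrom(t))
            .OrderBy(t => t.FullName, StringComparer.Ordinal)
            .ToList();

    // With Microsoft.Extensions.DependencyInjection, each discovered type could
    // then be registered against the contract, e.g.
    //   foreach (var type in found) services.AddScoped(contract, type);
}
```

Under this approach a new watcher file is picked up without touching Program.cs, as long as it lives in the scanned assembly.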

GitHub App write path

Reuses IGitHubAppTokenProvider exactly as ImagePinDispatchService does:

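// Needs: using System.Net.Http.Json; using System.Text.Json;
var token = await _tokenProvider.GetInstallationTokenAsync(ct);
if (token is null) { /* log + skip this tick */ return; }

// Assumes the named "github" client already sets the User-Agent header the GitHub API requires.
var client = _httpFactory.CreateClient("github");
client.DefaultRequestHeaders.Authorization = new("Bearer", token);
var payload = new {
    title = $"[self-improvement] {watcher.SignatureName}: {occurrences.Count} occurrences in {threshold.Window.TotalDays}d",
    body = renderedIssueBody,
    labels = new[] { "self-improvement" }
};
var response = await client.PostAsJsonAsync($"https://api.github.com/repos/{targetRepo}/issues", payload, ct);
response.EnsureSuccessStatusCode(); // a failed POST must not be recorded as an open issue

// The created issue's number is what the service flow saves on the state doc.
var created = await response.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);
var openIssueNumber = created.GetProperty("number").GetInt32();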

App identity: dev-e-bot (the existing token producer for ImagePinDispatchService). No new App needed.

Operator triage consequence: because we reuse dev-e-bot, GitHub's UI will show the same author for both image-pin chore PRs and self-improvement issues. Discriminate by label:self-improvement (issues) vs label:image-pin (PRs) when triaging — author alone is ambiguous.
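
The renderedIssueBody value used in the write path above has to carry the linked occurrences and the draft fix proposal named in the Goal. A minimal sketch of such a renderer follows; GapIssueBody, its Markdown layout, and the draftProposal parameter are assumptions, and SignatureOccurrence is repeated from the watcher contract so the snippet stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public record SignatureOccurrence(string Repo, int? IssueNumber, int? PrNumber,
    string ContextSummary, DateTimeOffset ObservedAt, string Permalink);

public static class GapIssueBody
{
    public static string Render(string signatureName, int thresholdCount, TimeSpan window,
        IReadOnlyList<SignatureOccurrence> occurrences, string draftProposal)
    {
        var lines = new List<string>
        {
            $"Signature: {signatureName}",
            $"Observed {occurrences.Count} occurrences (threshold: {thresholdCount} per {(int)window.TotalDays}d).",
            "",
            "## Occurrences",
        };

        // One bullet per occurrence, oldest first, each linking to its evidence.
        lines.AddRange(occurrences.OrderBy(o => o.ObservedAt).Select(o =>
            $"- {Day(o.ObservedAt)} {o.Repo}{Ref(o)}: {o.ContextSummary} ([evidence]({o.Permalink}))"));

        lines.Add("");
        lines.Add("## Draft fix proposal");
        lines.Add(draftProposal);
        return string.Join("\n", lines);
    }

    private static string Day(DateTimeOffset t) =>
        t.UtcDateTime.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

    // "owner/repo#123" is auto-linked by GitHub for both issues and PRs.
    private static string Ref(SignatureOccurrence o) =>
        (o.PrNumber ?? o.IssueNumber) is int n ? $"#{n}" : "";
}
```

Rendering is kept separate from the HTTP call so the same function could also produce the follow-up comment body when an issue is already open.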

Dashboard surface

  • New stat "Self-improvement 🔁": count of tripped signatures (i.e. signatures with ≥threshold occurrences in window). Clicking it filters the Issues tab to label:self-improvement is:open.
  • New tab: "Self-Improvement" — table with one row per signature: name, current count, threshold, window, open issue link.
  • Endpoint: GET /api/self-improvement/signatures → array of { name, count, threshold, window, openIssue, lastOccurrence }.
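
The endpoint's response shape could be modelled as below. This is a sketch under assumptions: SignatureSummary and SignatureSummaryJson are hypothetical names, and the field types (window as a string such as "30d", openIssue as a nullable issue number) are guesses, since the proposal lists only the field names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// One row per signature, mirroring { name, count, threshold, window, openIssue, lastOccurrence }.
public record SignatureSummary(
    string Name,
    int Count,
    int Threshold,
    string Window,                 // assumed encoding, e.g. "30d"
    int? OpenIssue,                // assumed: issue number, null when none is open
    DateTimeOffset? LastOccurrence);

public static class SignatureSummaryJson
{
    // JsonSerializerDefaults.Web selects camelCase property names,
    // which matches the field casing listed for the endpoint.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    public static string Serialize(IEnumerable<SignatureSummary> rows) =>
        JsonSerializer.Serialize(rows, Options);
}
```

The dashboard stat can then be derived client-side as the number of rows where count ≥ threshold.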

Testing

  • Unit: one test file per watcher, mocked Marten + event store fixtures. Coverage: trips at threshold, doesn't trip below, dedups across runs.
  • Integration: SelfImprovementServiceFlowTests.cs — full pipeline (in-memory Marten, mocked GitHub API) verifying issue-file + comment paths + state doc persistence.

Implementation plan

Single PR. Order within the PR:

  1. ISelfImprovementWatcher.cs + records
  2. 8 watcher classes (per-watcher tests as I write each)
  3. SelfImprovementService.cs (BackgroundService, flow logic, integration test)
  4. DI registration in Program.cs
  5. /api/self-improvement/signatures endpoint
  6. Dashboard widget + tab
  7. docs/self-improvement-service.md runbook

Estimated size: ~700–900 LOC including tests. Estimated time: half a day.

CI: standard. Admin-merge after Copilot review (orchestrator-authored — review-e doesn't review user-authored PRs per the recurring halt class).

Out of scope

  • Drafting fix code (orchestrator-only, by HARD RULE)
  • Dispatching dev-e (orchestrator-only)
  • Auto-resolving the existing 4 stuck-pattern issues on dashe-website — those need separate triage
  • Cross-repo gap analysis (e.g. "rar + rgo trip the same signature together") — possible follow-up

Open questions — resolved in review

  1. Issue body author identity. Resolved (review): keep dev-e-bot, label-discriminate. Operator triage uses label:self-improvement vs label:image-pin. Documented above in the GitHub App write path section.
  2. Threshold tuning. Resolved (review): hardcoded for PR-1. TODO (follow-up): file an issue to make thresholds configurable via env vars once we have ≥1 month of trip-rate data and at least one watcher needs re-tuning. Cross-link from that follow-up issue back to this proposal.
  3. Auto-close on convergence. Resolved (review): operator-only for PR-1. State doc still tracks LastOccurrenceAt (see Dedup invariants) so a future auto-close PR is a small change rather than a schema migration.

Decision requested

  • Approve META framing? (or revert to literal rc#947 body)
  • Approve dev-e-bot App identity reuse? (leaning yes per review; see GitHub App write path)
  • Approve hardcoded thresholds for PR-1? (leaning yes per review; follow-up TODO captured)
  • Greenlight to implement?