Safety — Guards, Stuck Detection, Hallucination Mitigation, Prompt Injection¶

TL;DR

Runtime safety has five independent, cumulative pillars: (1) dangerous-command guard, (2) StuckGuard loop detection, (3) schema-gated tool use, (4) hallucination mitigation via execute-everything + package allowlist, (5) CaMeL prompt-injection separation. Each one catches a different failure class; all five raise the floor far higher than any single defense.

Safety in this whitepaper is runtime safety: the middleware that sits between agent reasoning and tool execution. Not to be confused with security.md (supply chain + cryptographic integrity) or trust-model.md (who can decide what).

Pillar 1: Dangerous-command guard¶

What it blocks¶

A PreToolUse hook in Claude Code settings reads the tool-call JSON on stdin, matches tool_input.command against a blocklist, and either exits 0 (allow) or exits 2 (block + reason).

Pattern	Rationale
`sudo` (any)	Privilege escalation outside agent context
`rm -rf /` or `rm -rf /*`	Filesystem destruction. Local paths (`rm -rf ./build/`) are allowed.
`git push --force` (without `--force-with-lease` or `--force-if-includes`)	Destroys remote history with no recovery
`git reset --hard`	Loses uncommitted work
`git clean -f`	Loses untracked work
`drop table`, `drop database`, `truncate table`	Data loss
`kubectl delete namespace`, `kubectl delete clusterrolebinding`	Cluster-scope destruction
`apt / apt-get / dnf / yum / pacman / brew install`	Should go through devcontainer image; unknown install = supply chain risk
`chmod 777`, `chmod -R 000`	Security regression
`curl ... \| sh`	Unverified remote execution

Design decisions¶

No override flag. Gastown's tap_guard_dangerous has no override, and we adopt that. Escape hatch: the human runs the command manually outside the agent loop. Prevents the failure mode where an agent learns "just add --confirm-dangerous and it works."

Allow --force-with-lease. Explicitly distinguished from --force. --force-with-lease fails if the remote has moved, which is the exact behavior we want.

Smart path matching. rm -rf /etc blocks; rm -rf ./tmp allows. The blocklist uses anchored regex, not substring.

Best-effort event emission. Every block emits a GuardBlocked event to rig-conductor (non-blocking; hook reliability spool retains on failure). Makes block counts visible in the metrics dashboard — a spike means a prompt-injection attempt or an agent bug.

What it does not catch¶

Destructive actions via MCP servers (not shell tools). Mitigation: per-MCP allowlisting at the admission layer.
Destructive actions expressed as write operations (e.g., writing a YAML file that itself is a destructive Kyverno policy). Mitigation: Kyverno's admission gate catches this at cluster level.
Subtle data corruption (writing to the wrong file, incorrect sed/awk substitution). Mitigation: tests + canary analysis + git-diff review.

Pillar 2: StuckGuard — deterministic loop detection¶

Strongest convergence signal in the multi-agent research

Three independent codebases — OpenHands StuckDetector, Goose RepetitionInspector, Sweep AI visited_set — all converged on cheap deterministic loop detection at the tool-call layer without copying each other. When three independent teams solve the same problem the same way, build it. See research-multi-agent-platforms.md.

The five patterns¶

StuckGuard watches the last N tool calls (N=20 default) and detects:

Pattern	Threshold	Meaning
Identical (tool, args) repeated	4× in last 10 calls	Agent is spinning on the same call
Same tool returning same error	3× in last 10 calls	Agent doesn't understand the failure
Multiple consecutive agent messages with no tool calls	3×	Agent is monologuing, not progressing
ABAB alternation (tool A → tool B → tool A → tool B)	6 consecutive steps	Oscillating without progress
`AgentCondensationObservation` or compaction-marker repeated	2×	Falling out of context window

On any pattern match:

Emit AgentStuck { agentId, repo, issueNumber, pattern, recentCalls[] } to rig-conductor
Exit the agent loop cleanly (do not attempt further tool calls)
Escalation router (see self-healing.md) picks it up

Why deterministic, not LLM-judged¶

LLM-as-judge for stuck detection has two failures:

Cost — every step now costs 2× inference (the worker LLM call + the judge LLM call)
Self-assessment paradox — the agent being asked "are you stuck" is the same agent that is stuck. Judgment may be compromised by the same failure mode.

Pattern-counting is O(N) per tool call with a constant window, runs in the agent's own process, adds no latency, no cost, and cannot be subverted by the LLM's own reasoning.

What it does not catch¶

Making slow progress — agent is doing work, but taking 10× the expected tokens. Mitigation: TaskSpec.expected_effort_tokens budget ceiling enforced by LiteLLM proxy.
Correctly completing the wrong task — no loop to detect. Mitigation: Spec-E refines intent up front; Review-E checks against acceptance criteria.
Subtle semantic-loop (agent writes slightly different wrong code each time) — signature varies enough that the five patterns miss. Mitigation: budget ceiling eventually triggers; StuckGuard raises false-positive-rate concerns as a known limitation.

Pillar 3: Schema-gated tool use¶

The problem¶

LLMs hallucinate tool names when too many tools are loaded into context. Claude Code's empirical pain point is around 50 active tools. Beyond that, tool-name hallucinations rise sharply. Similarly, nested argument objects at 3+ levels deep show compounding JSON-validity failures — even at 0.1% per-token error rate, multi-level JSON objects become frequently invalid.

Mitigations¶

Deferred tools pattern (Claude Code native) — tool schemas fetched on-demand via ToolSearch rather than loaded into context. Keeps the active tool-count under 50 for any given agent step.
Pydantic-validated tool arguments — every tool call's arguments parsed into a strongly-typed model. Invalid argument → retry with validation error as feedback (never a silent fallback to defaults).
Structured output via Instructor — for non-tool-use LLM calls (e.g., Spec-E deciding tier), use Instructor to force JSON-schema-conforming output. Instructor's Anthropic integration is the default; it also supports OpenAI and Gemini backends via LiteLLM. Retries on validation failure, up to 3 attempts.
Reject hallucinated tool names — if the agent emits a tool call with a name not in the current active set, fail fast with an explicit error rather than routing to a fallback.

Known limits¶

Schema-validation catches syntactic errors. Semantic errors ("the agent called the right tool with plausible-but-wrong arguments") pass through. Mitigation: tests + canary + human review for T2/T3.
Forced structured output can reduce model quality on tasks where reasoning benefits from free-form output. For those, we use two-step: free-form reasoning → structured extraction via a second call.

Pillar 4: Hallucination mitigation¶

The four defense layers, ranked by ROI¶

Ranked by empirical effectiveness for code-writing agents:

Execute everything — the only ground truth. Tests, lint, type-check, compile, run. Alone catches the majority of "looks plausible, is wrong" output. If an agent cannot run the code it wrote, it should not be considered to have produced anything trustworthy.
Schema-validated tool calls (pillar 3 above). Catches structural errors.
Execute in sandbox — ephemeral K8s namespace with network limited to the package registry and source repo. An npm install of a hallucinated package fails in the sandbox, not in production. See security.md.
SelfCheckGPT-style N-sample for critical paths — for T2/T3 changes, sample N=3 diffs from the agent, have a reviewer LLM compare them, flag outputs with high divergence for human review. The SWE-agent ScoreRetryLoop pattern. Worth the 3× cost only on high-blast-radius paths.

Slopsquatting — hallucinated packages as attack vector¶

The canonical study: 2.23M LLM-generated package references, 19.7% hallucinated, 205K uniquely fabricated package names. Hallucinations are deterministic enough per-model-per-prompt that attackers pre-register the common fabrications. Multiple documented exploits in 2025 (npm and PyPI).

Our defense layers:

Allowlist registry — agents only install from a mirror we control (pkg.dashecorp.com proxies npm and PyPI with a curated allowlist).
Ephemeral install sandbox — every npm install / pip install runs in a throwaway pod. Hallucinated packages fail there, not on the production image.
SBOM check in CI — Syft generates SBOM; Grype scans; new packages < 30 days old trigger a human review gate.
Socket.dev integration — per-dependency security score in the PR check; scores below threshold block merge.
Package-age policy — Datadog's min-release-age default of 14 days catches most typosquatting account-takeovers (Axios March 2026 compromise, shai-hulud September 2025) without being excessively strict.

Hallucinated file paths and line numbers¶

A subtler failure: agents cite src/Foo.cs:42 when no such line exists. Mitigations:

Post-hoc grep validation — any file:line citation in a commit message or PR body is validated against the actual commit via a GitHub Action. Invalid citations block merge.
Review-E checks citations — character prompt instruction to verify any cited locations.
Integrated in the IDE layer — Claude Code's native Read tool returns line-numbered content; the agent's "I saw this at line X" statements should correspond to actual reads, checkable via OTel span correlation.

Property-based testing¶

From arXiv:2510.09907 (October 2025): agents generating property tests via Hypothesis find bugs across Python ecosystems that unit tests miss. For our rig, the pattern:

After Dev-E writes code, spawn a PropertyTest-E subagent whose only job is "write 5-10 Hypothesis properties for invariants of the code you just wrote, run them, report findings."
If properties fail, the diff is rejected and Dev-E iterates.
If properties pass, they become permanent regression tests checked into the repo.

This is one of the highest-leverage reliability plays available in 2026. Cost: one additional agent invocation per non-trivial change. Benefit: semantic bugs caught before prod.

Pillar 5: Prompt-injection separation (CaMeL)¶

The threat¶

This is not a hypothetical

Prompt injection produced real CVEs in 2025–2026: CVE-2025-54794/95 (InversePrompt against Claude), CVE-2025-59536 / CVE-2026-21852 (Claude Code project-file RCE), CVE-2025-68143/44/45 (Anthropic's own Git MCP server). The named CVEs happen to be Anthropic-specific because those are the attacks that were publicly disclosed at disclosure-time; the defense pattern (CaMeL-style trust separation) applies to any LLM with tool use regardless of provider. A systematic 2025 survey found >85% adaptive-attack success against Claude Code, Copilot, and Cursor in default configs. "Better prompting" does not close this gap.

2025-2026 saw the first wave of agent-specific CVEs:

CVE-2025-54794 / CVE-2025-54795 (InversePrompt against Claude): malicious instructions in user-controlled content hijack the agent.
CVE-2025-59536 / CVE-2026-21852: RCE via Claude Code project files — Hooks, MCP configs, and env vars from an untrusted repo execute on git clone.
CVE-2025-68143 / 68144 / 68145: Anthropic's own Git MCP server had prompt-injection RCE vectors.
Oasis Security "Claudy Day": exfiltrated Claude.ai conversation history via the Files API.
OWASP Top 10 for Agentic Applications (2026): "Agent Goal Hijacking" (ASI01) is #1.

A systematic 2025 survey found >85% adaptive-attack success against Claude Code, Copilot, and Cursor in their default configurations. "Better prompting" does not close this gap.

CaMeL — the only defense with a formal guarantee¶

DeepMind's CaMeL paper (arXiv:2503.18813). The architectural insight is provider-agnostic — it requires only a privileged planner LLM and a quarantined data-processing LLM, both interchangeable via LiteLLM (see provider-portability.md). The named 2025-2026 CVEs happen to target Anthropic systems because that's where disclosures landed first, not because CaMeL is Anthropic-specific:

A privileged LLM plans and decides tool calls. It only sees the user's original trusted query plus prior tool outputs.
A quarantined LLM processes untrusted data (issue bodies, README files, external API responses, code comments in third-party repos). It has no tool access. Its job is to extract values and return them as typed data to the privileged plane.

Untrusted content cannot instruct the privileged plane, because the privileged plane never sees the untrusted content as instructions — only as data extracted by the quarantined plane. Control flow and data flow are extracted from the trusted query only.

CaMeL solves ~77% of AgentDojo tasks with provable security — a meaningfully higher bar than "empirically did not fail our tests."

What our rig does¶

For every agent task that consumes external content:

The trusted Dev-E / Review-E plane receives the TaskSpec from Spec-E (human-shaped, trusted).
Fetches of external content (issue body, README, third-party docs, external API responses) go through a quarantined plane — a separate LLM call with no tool binding.
The quarantined plane extracts typed values ({ bug_description: string, affected_files: string[] }) via Instructor + schema.
The privileged plane consumes the typed values, not the raw text.

Complementary defenses¶

CaMeL is the foundation; layer on top:

Default-deny egress allowlist (partial — Phase 1 live on review-e, 24h burn-in as of 2026-04-22 evening) — agent pod can only reach allowlisted hostnames (api.anthropic.com, api.github.com, github.com, ghcr.io, registry.npmjs.org, discord.com, gateway.discord.gg, cdn.discordapp.com, and rig-conductor). Even if an injection succeeds in instructing the privileged plane to exfil, the network blocks it. The shipped design: an Envoy SNI egress gateway (byte-transparent TLS passthrough, matches on SNI from the ClientHello) deployed in the egress-gw namespace, fronted by a dedicated CoreDNS instance that rewrites each allowlisted public hostname to envoy.egress-gw.svc.cluster.local answer auto. Agent pods opt in via dnsPolicy: None + dnsConfig.nameservers: [egress-dns, kube-dns] (pod-scoped — no cluster-wide rewrite that would catch Flux). Ips-blocks were ruled out on 2026-04-22 morning because api.anthropic.com is Cloudflare-fronted. Cilium L7 was ruled out because the rig is k3s + flannel, not GKE + Dataplane V2. The LiteLLM-based cost/model centralisation story is bundled with Priority 3 — the Envoy gateway handles the safety half today. Dev-e rollout + default-deny NetworkPolicy (allowlist: kube-dns, egress-dns, Envoy 443+8443, rig-conductor 8080+6379+5432) terminate Phase 1.
Tool scoping — Review-E has no shell tool; Dev-E has no cluster-admin access. Minimum-capability principle per role.
Content classification — a heuristic filter (e.g., Anthropic's recommended classifier for Claude-backed agents; equivalents exist for OpenAI and Gemini) runs on untrusted inputs. ~1% bypass on the published Anthropic benchmarks, but stacks with CaMeL for defense-in-depth regardless of provider.
Human-in-the-loop for irreversible actions — the two-attestor Kyverno policy on T3 means even successful injection cannot reach irreversible surfaces without a human signature.
Non-bypassable checks on destructive operations — dangerous-command guard (pillar 1) catches shell-level escape attempts regardless of what the agent was reasoning about.

Specific CVE mitigations¶

CVE	Our mitigation
CVE-2025-54794/95 (InversePrompt)	CaMeL separation
CVE-2025-59536 (Claude Code project-file RCE)	Never `git clone` an untrusted repo into an agent's workspace; clones go into a sandboxed sibling pod with no tool access; Hooks/MCP configs from cloned repos are explicitly stripped before the pod that reads them is instantiated
CVE-2025-68143-45 (Anthropic Git MCP)	Pin MCP server versions; run MCP servers in separate network-policy-restricted pods; no auto-update
Files API exfiltration	Phase 2 hostname allowlist will block `api.anthropic.com/v1/files` from non-Files-role agent pods (Phase 1 CIDR policy cannot path-discriminate)

Known limits¶

CaMeL is secure against today's named attacks. New classes of injection (e.g., side-channel via timing, metadata) are unaddressed until identified.
The quarantined plane itself uses an LLM — if an attacker can coerce the quarantined plane into emitting a malformed typed value that triggers a privileged-plane bug, a new class of attack is opened. Mitigation: strict schema validation on the typed values.
Operational cost: every external-content fetch doubles LLM calls. Mitigate via caching on fetched content with identity-bound TTL.

The safety dashboard¶

Every pillar emits metrics. The safety dashboard (observability.md) shows:

guard_blocks_total{pattern="..."} — dangerous-command guard block counts per pattern
stuck_agent_total{pattern="..."} — StuckGuard trips per pattern
tool_schema_reject_total{tool="..."} — hallucinated-tool and bad-argument counts
hallucinated_package_reject_total{package="..."} — package allowlist rejections (proxy for slopsquatting attempts)
prompt_injection_suspected_total — content-classifier high-confidence flags (provider-specific; Anthropic's classifier on the default path, equivalents per provider — see provider-portability.md)

Steady-state expected values are near-zero. Spikes are the alert signal.

Layered defense illustrated¶

graph LR
    classDef untrusted fill:#ffcccc,color:#000
    classDef quarantined fill:#fff3e0,color:#000
    classDef trusted fill:#e8f5e9,color:#000
    classDef prod fill:#c8e6c9,color:#000

    U[GitHub issue body<br/>Third-party README<br/>External API response]:::untrusted
    Q[Quarantined LLM<br/>no tool access<br/>content classifier pre-filter]:::quarantined
    V[Typed value<br/>schema-validated]:::quarantined
    P[Privileged LLM<br/>plans and acts]:::trusted
    T[Tool call]:::trusted
    G[Guards:<br/>StuckGuard<br/>Dangerous-cmd<br/>Schema validation]:::trusted
    R[Rate limit<br/>LiteLLM proxy]:::trusted
    N[Egress NetworkPolicy<br/>L7 allowlist]:::trusted
    EX[Execution<br/>sandbox pod]:::prod

    U --> Q
    Q --> V
    V --> P
    P --> T
    T --> G
    G -->|allow| R
    G -->|block| STOP[Halt + emit event]
    R --> N
    N -->|egress allowed| EX
    N -->|egress denied| STOP

Six gates between untrusted input and production execution. Each gate is independently testable and deterministic where possible.

What this is not¶

Not a complete security story. See security.md for supply chain, admission policy, attestation chain.
Not a guarantee. Five pillars raise the floor; they do not make the rig uncompromisable. Known limits are enumerated.
Not an excuse for human inattention. Safety layers are risk mitigation, not risk elimination. T3 actions still require humans.

Safety — Guards, Stuck Detection, Hallucination Mitigation, Prompt Injection¶

Pillar 1: Dangerous-command guard¶

What it blocks¶

Design decisions¶

What it does not catch¶

Pillar 2: StuckGuard — deterministic loop detection¶

The five patterns¶

Why deterministic, not LLM-judged¶

What it does not catch¶

Pillar 3: Schema-gated tool use¶

The problem¶

Mitigations¶

Known limits¶

Pillar 4: Hallucination mitigation¶

The four defense layers, ranked by ROI¶

Slopsquatting — hallucinated packages as attack vector¶

Hallucinated file paths and line numbers¶

Property-based testing¶

Pillar 5: Prompt-injection separation (CaMeL)¶

The threat¶

CaMeL — the only defense with a formal guarantee¶

What our rig does¶

Complementary defenses¶

Specific CVE mitigations¶

Known limits¶

The safety dashboard¶

Layered defense illustrated¶

What this is not¶

See also¶