Skip to content

Safety — Guards, Stuck Detection, Hallucination Mitigation, Prompt Injection

TL;DR

Runtime safety has five independent, cumulative pillars: (1) dangerous-command guard, (2) StuckGuard loop detection, (3) schema-gated tool use, (4) hallucination mitigation via execute-everything + package allowlist, (5) CaMeL prompt-injection separation. Each one catches a different failure class; all five raise the floor far higher than any single defense.

Safety in this whitepaper is runtime safety: the middleware that sits between agent reasoning and tool execution. Not to be confused with security.md (supply chain + cryptographic integrity) or trust-model.md (who can decide what).

Pillar 1: Dangerous-command guard

What it blocks

A PreToolUse hook in Claude Code settings reads the tool-call JSON on stdin, matches tool_input.command against a blocklist, and either exits 0 (allow) or exits 2 (block + reason).

Pattern Rationale
sudo (any) Privilege escalation outside agent context
rm -rf / or rm -rf /* Filesystem destruction. Local paths (rm -rf ./build/) are allowed.
git push --force (without --force-with-lease or --force-if-includes) Destroys remote history with no recovery
git reset --hard Loses uncommitted work
git clean -f Loses untracked work
drop table, drop database, truncate table Data loss
kubectl delete namespace, kubectl delete clusterrolebinding Cluster-scope destruction
apt / apt-get / dnf / yum / pacman / brew install Should go through devcontainer image; unknown install = supply chain risk
chmod 777, chmod -R 000 Security regression
curl ... | sh Unverified remote execution

Design decisions

No override flag. Gastown's tap_guard_dangerous has no override, and we adopt that. Escape hatch: the human runs the command manually outside the agent loop. Prevents the failure mode where an agent learns "just add --confirm-dangerous and it works."

Allow --force-with-lease. Explicitly distinguished from --force. --force-with-lease fails if the remote has moved, which is the exact behavior we want.

Smart path matching. rm -rf /etc blocks; rm -rf ./tmp allows. The blocklist uses anchored regex, not substring.

Best-effort event emission. Every block emits a GuardBlocked event to Conductor-E (non-blocking; hook reliability spool retains on failure). Makes block counts visible in the metrics dashboard — a spike means a prompt-injection attempt or an agent bug.

What it does not catch

  • Destructive actions via MCP servers (not shell tools). Mitigation: per-MCP allowlisting at the admission layer.
  • Destructive actions expressed as write operations (e.g., writing a YAML file that itself is a destructive Kyverno policy). Mitigation: Kyverno's admission gate catches this at cluster level.
  • Subtle data corruption (writing to the wrong file, incorrect sed/awk substitution). Mitigation: tests + canary analysis + git-diff review.

Pillar 2: StuckGuard — deterministic loop detection

Strongest convergence signal in the multi-agent research

Three independent codebases — OpenHands StuckDetector, Goose RepetitionInspector, Sweep AI visited_set — all converged on cheap deterministic loop detection at the tool-call layer without copying each other. When three independent teams solve the same problem the same way, build it. See research-multi-agent-platforms.md.

The five patterns

StuckGuard watches the last N tool calls (N=20 default) and detects:

Pattern Threshold Meaning
Identical (tool, args) repeated 4× in last 10 calls Agent is spinning on the same call
Same tool returning same error 3× in last 10 calls Agent doesn't understand the failure
Multiple consecutive agent messages with no tool calls Agent is monologuing, not progressing
ABAB alternation (tool A → tool B → tool A → tool B) 6 consecutive steps Oscillating without progress
AgentCondensationObservation or compaction-marker repeated Falling out of context window

On any pattern match:

  1. Emit AgentStuck { agentId, repo, issueNumber, pattern, recentCalls[] } to Conductor-E
  2. Exit the agent loop cleanly (do not attempt further tool calls)
  3. Escalation router (see self-healing.md) picks it up

Why deterministic, not LLM-judged

LLM-as-judge for stuck detection has two failures:

  • Cost — every step now costs 2× inference (the worker LLM call + the judge LLM call)
  • Self-assessment paradox — the agent being asked "are you stuck" is the same agent that is stuck. Judgment may be compromised by the same failure mode.

Pattern-counting is O(N) per tool call with a constant window, runs in the agent's own process, adds no latency, no cost, and cannot be subverted by the LLM's own reasoning.

What it does not catch

  • Making slow progress — agent is doing work, but taking 10× the expected tokens. Mitigation: TaskSpec.expected_effort_tokens budget ceiling enforced by LiteLLM proxy.
  • Correctly completing the wrong task — no loop to detect. Mitigation: Spec-E refines intent up front; Review-E checks against acceptance criteria.
  • Subtle semantic-loop (agent writes slightly different wrong code each time) — signature varies enough that the five patterns miss. Mitigation: budget ceiling eventually triggers; StuckGuard raises false-positive-rate concerns as a known limitation.

Pillar 3: Schema-gated tool use

The problem

LLMs hallucinate tool names when too many tools are loaded into context. Claude Code's empirical pain point is around 50 active tools. Beyond that, tool-name hallucinations rise sharply. Similarly, nested argument objects at 3+ levels deep show compounding JSON-validity failures — even at 0.1% per-token error rate, multi-level JSON objects become frequently invalid.

Mitigations

  • Deferred tools pattern (Claude Code native) — tool schemas fetched on-demand via ToolSearch rather than loaded into context. Keeps the active tool-count under 50 for any given agent step.
  • Pydantic-validated tool arguments — every tool call's arguments parsed into a strongly-typed model. Invalid argument → retry with validation error as feedback (never a silent fallback to defaults).
  • Structured output via Instructor — for non-tool-use LLM calls (e.g., Spec-E deciding tier), use Instructor to force JSON-schema-conforming output. Instructor's Anthropic integration is the default; it also supports OpenAI and Gemini backends via LiteLLM. Retries on validation failure, up to 3 attempts.
  • Reject hallucinated tool names — if the agent emits a tool call with a name not in the current active set, fail fast with an explicit error rather than routing to a fallback.

Known limits

  • Schema-validation catches syntactic errors. Semantic errors ("the agent called the right tool with plausible-but-wrong arguments") pass through. Mitigation: tests + canary + human review for T2/T3.
  • Forced structured output can reduce model quality on tasks where reasoning benefits from free-form output. For those, we use two-step: free-form reasoning → structured extraction via a second call.

Pillar 4: Hallucination mitigation

The four defense layers, ranked by ROI

Ranked by empirical effectiveness for code-writing agents:

  1. Execute everything — the only ground truth. Tests, lint, type-check, compile, run. Alone catches the majority of "looks plausible, is wrong" output. If an agent cannot run the code it wrote, it should not be considered to have produced anything trustworthy.
  2. Schema-validated tool calls (pillar 3 above). Catches structural errors.
  3. Execute in sandbox — ephemeral K8s namespace with network limited to the package registry and source repo. An npm install of a hallucinated package fails in the sandbox, not in production. See security.md.
  4. SelfCheckGPT-style N-sample for critical paths — for T2/T3 changes, sample N=3 diffs from the agent, have a reviewer LLM compare them, flag outputs with high divergence for human review. The SWE-agent ScoreRetryLoop pattern. Worth the 3× cost only on high-blast-radius paths.

Slopsquatting — hallucinated packages as attack vector

The canonical study: 2.23M LLM-generated package references, 19.7% hallucinated, 205K uniquely fabricated package names. Hallucinations are deterministic enough per-model-per-prompt that attackers pre-register the common fabrications. Multiple documented exploits in 2025 (npm and PyPI).

Our defense layers:

  • Allowlist registry — agents only install from a mirror we control (pkg.dashecorp.com proxies npm and PyPI with a curated allowlist).
  • Ephemeral install sandbox — every npm install / pip install runs in a throwaway pod. Hallucinated packages fail there, not on the production image.
  • SBOM check in CI — Syft generates SBOM; Grype scans; new packages < 30 days old trigger a human review gate.
  • Socket.dev integration — per-dependency security score in the PR check; scores below threshold block merge.
  • Package-age policy — Datadog's min-release-age default of 14 days catches most typosquatting account-takeovers (Axios March 2026 compromise, shai-hulud September 2025) without being excessively strict.

Hallucinated file paths and line numbers

A subtler failure: agents cite src/Foo.cs:42 when no such line exists. Mitigations:

  • Post-hoc grep validation — any file:line citation in a commit message or PR body is validated against the actual commit via a GitHub Action. Invalid citations block merge.
  • Review-E checks citations — character prompt instruction to verify any cited locations.
  • Integrated in the IDE layer — Claude Code's native Read tool returns line-numbered content; the agent's "I saw this at line X" statements should correspond to actual reads, checkable via OTel span correlation.

Property-based testing

From arXiv:2510.09907 (October 2025): agents generating property tests via Hypothesis find bugs across Python ecosystems that unit tests miss. For our rig, the pattern:

  • After Dev-E writes code, spawn a PropertyTest-E subagent whose only job is "write 5-10 Hypothesis properties for invariants of the code you just wrote, run them, report findings."
  • If properties fail, the diff is rejected and Dev-E iterates.
  • If properties pass, they become permanent regression tests checked into the repo.

This is one of the highest-leverage reliability plays available in 2026. Cost: one additional agent invocation per non-trivial change. Benefit: semantic bugs caught before prod.

Pillar 5: Prompt-injection separation (CaMeL)

The threat

This is not a hypothetical

Prompt injection produced real CVEs in 2025–2026: CVE-2025-54794/95 (InversePrompt against Claude), CVE-2025-59536 / CVE-2026-21852 (Claude Code project-file RCE), CVE-2025-68143/44/45 (Anthropic's own Git MCP server). The named CVEs happen to be Anthropic-specific because those are the attacks that were publicly disclosed at disclosure-time; the defense pattern (CaMeL-style trust separation) applies to any LLM with tool use regardless of provider. A systematic 2025 survey found >85% adaptive-attack success against Claude Code, Copilot, and Cursor in default configs. "Better prompting" does not close this gap.

2025-2026 saw the first wave of agent-specific CVEs:

  • CVE-2025-54794 / CVE-2025-54795 (InversePrompt against Claude): malicious instructions in user-controlled content hijack the agent.
  • CVE-2025-59536 / CVE-2026-21852: RCE via Claude Code project files — Hooks, MCP configs, and env vars from an untrusted repo execute on git clone.
  • CVE-2025-68143 / 68144 / 68145: Anthropic's own Git MCP server had prompt-injection RCE vectors.
  • Oasis Security "Claudy Day": exfiltrated Claude.ai conversation history via the Files API.
  • OWASP Top 10 for Agentic Applications (2026): "Agent Goal Hijacking" (ASI01) is #1.

A systematic 2025 survey found >85% adaptive-attack success against Claude Code, Copilot, and Cursor in their default configurations. "Better prompting" does not close this gap.

CaMeL — the only defense with a formal guarantee

DeepMind's CaMeL paper (arXiv:2503.18813). The architectural insight is provider-agnostic — it requires only a privileged planner LLM and a quarantined data-processing LLM, both interchangeable via LiteLLM (see provider-portability.md). The named 2025-2026 CVEs happen to target Anthropic systems because that's where disclosures landed first, not because CaMeL is Anthropic-specific:

  • A privileged LLM plans and decides tool calls. It only sees the user's original trusted query plus prior tool outputs.
  • A quarantined LLM processes untrusted data (issue bodies, README files, external API responses, code comments in third-party repos). It has no tool access. Its job is to extract values and return them as typed data to the privileged plane.

Untrusted content cannot instruct the privileged plane, because the privileged plane never sees the untrusted content as instructions — only as data extracted by the quarantined plane. Control flow and data flow are extracted from the trusted query only.

CaMeL solves ~77% of AgentDojo tasks with provable security — a meaningfully higher bar than "empirically did not fail our tests."

What our rig does

For every agent task that consumes external content:

  1. The trusted Dev-E / Review-E plane receives the TaskSpec from Spec-E (human-shaped, trusted).
  2. Fetches of external content (issue body, README, third-party docs, external API responses) go through a quarantined plane — a separate LLM call with no tool binding.
  3. The quarantined plane extracts typed values ({ bug_description: string, affected_files: string[] }) via Instructor + schema.
  4. The privileged plane consumes the typed values, not the raw text.

Complementary defenses

CaMeL is the foundation; layer on top:

  • L7 egress allowlist (Cilium) — agent pod can only reach the primary LLM provider (default api.anthropic.com; extend to api.openai.com, generativelanguage.googleapis.com, or local Ollama when the agent is configured for a different provider — see provider-portability.md), api.github.com, pkg.dashecorp.com, and Conductor-E. Even if an injection succeeds in instructing the privileged plane to exfil, the network blocks it.
  • Tool scoping — Review-E has no shell tool; Dev-E has no cluster-admin access. Minimum-capability principle per role.
  • Content classification — a heuristic filter (e.g., Anthropic's recommended classifier for Claude-backed agents; equivalents exist for OpenAI and Gemini) runs on untrusted inputs. ~1% bypass on the published Anthropic benchmarks, but stacks with CaMeL for defense-in-depth regardless of provider.
  • Human-in-the-loop for irreversible actions — the two-attestor Kyverno policy on T3 means even successful injection cannot reach irreversible surfaces without a human signature.
  • Non-bypassable checks on destructive operations — dangerous-command guard (pillar 1) catches shell-level escape attempts regardless of what the agent was reasoning about.

Specific CVE mitigations

CVE Our mitigation
CVE-2025-54794/95 (InversePrompt) CaMeL separation
CVE-2025-59536 (Claude Code project-file RCE) Never git clone an untrusted repo into an agent's workspace; clones go into a sandboxed sibling pod with no tool access; Hooks/MCP configs from cloned repos are explicitly stripped before the pod that reads them is instantiated
CVE-2025-68143-45 (Anthropic Git MCP) Pin MCP server versions; run MCP servers in separate network-policy-restricted pods; no auto-update
Files API exfiltration L7 egress policy blocks api.anthropic.com/v1/files from non-Files-role agent pods

Known limits

  • CaMeL is secure against today's named attacks. New classes of injection (e.g., side-channel via timing, metadata) are unaddressed until identified.
  • The quarantined plane itself uses an LLM — if an attacker can coerce the quarantined plane into emitting a malformed typed value that triggers a privileged-plane bug, a new class of attack is opened. Mitigation: strict schema validation on the typed values.
  • Operational cost: every external-content fetch doubles LLM calls. Mitigate via caching on fetched content with identity-bound TTL.

The safety dashboard

Every pillar emits metrics. The safety dashboard (observability.md) shows:

  • guard_blocks_total{pattern="..."} — dangerous-command guard block counts per pattern
  • stuck_agent_total{pattern="..."} — StuckGuard trips per pattern
  • tool_schema_reject_total{tool="..."} — hallucinated-tool and bad-argument counts
  • hallucinated_package_reject_total{package="..."} — package allowlist rejections (proxy for slopsquatting attempts)
  • prompt_injection_suspected_total — content-classifier high-confidence flags (provider-specific; Anthropic's classifier on the default path, equivalents per provider — see provider-portability.md)

Steady-state expected values are near-zero. Spikes are the alert signal.

Layered defense illustrated

graph LR
    classDef untrusted fill:#ffcccc,color:#000
    classDef quarantined fill:#fff3e0,color:#000
    classDef trusted fill:#e8f5e9,color:#000
    classDef prod fill:#c8e6c9,color:#000

    U[GitHub issue body<br/>Third-party README<br/>External API response]:::untrusted
    Q[Quarantined LLM<br/>no tool access<br/>content classifier pre-filter]:::quarantined
    V[Typed value<br/>schema-validated]:::quarantined
    P[Privileged LLM<br/>plans and acts]:::trusted
    T[Tool call]:::trusted
    G[Guards:<br/>StuckGuard<br/>Dangerous-cmd<br/>Schema validation]:::trusted
    R[Rate limit<br/>LiteLLM proxy]:::trusted
    N[Egress NetworkPolicy<br/>L7 allowlist]:::trusted
    EX[Execution<br/>sandbox pod]:::prod

    U --> Q
    Q --> V
    V --> P
    P --> T
    T --> G
    G -->|allow| R
    G -->|block| STOP[Halt + emit event]
    R --> N
    N -->|egress allowed| EX
    N -->|egress denied| STOP

Six gates between untrusted input and production execution. Each gate is independently testable and deterministic where possible.

What this is not

  • Not a complete security story. See security.md for supply chain, admission policy, attestation chain.
  • Not a guarantee. Five pillars raise the floor; they do not make the rig uncompromisable. Known limits are enumerated.
  • Not an excuse for human inattention. Safety layers are risk mitigation, not risk elimination. T3 actions still require humans.

See also