
Multi-Agent Platform Research

This document is the research backing for architecture-proposed-v2.md. It captures findings from a source-level audit of 14 multi-agent and AI-dev-platform projects. The goal is not to summarize each project; the picks in v2 cite specific findings recorded here.

Methodology

Three parallel research streams, each with access to the source code and primary docs (not just READMEs):

  • Stream A — Production GitHub-issue-closing dev agents: OpenHands, SWE-agent, Aider, Sweep AI, Cognition Devin (public material), Block Goose
  • Stream B — Multi-agent orchestration frameworks: CrewAI, AutoGen 0.4, MetaGPT, ChatDev, LangGraph, Camel-AI / AgentVerse
  • Stream C — Adjacent infrastructure: Beads, Backlog.md, Anthropic Claude Agent SDK, GPT Pilot, Goose Recipes, e2b / Daytona / Composio, Codex CLI, Cursor Cloud Agents, GitHub Spec Kit

For each project, the agents recorded: central abstraction, state recovery model, tool guards, multi-agent coordination, one pattern worth stealing, one pattern explicitly not worth stealing.

Convergence signals

When multiple independent projects converge on the same pattern, that's the strongest "we should build this" evidence. Five clusters emerged:

1. Cheap deterministic stuck-detection at the tool-call layer

Three independent codebases converged: OpenHands StuckDetector (controller/stuck.py) tracks 5 patterns — action+observation repeats 4×, action+error repeats 3×, agent monologue, ABAB alternation over 6 steps, AgentCondensationObservation repetition (context-window failure marker). Goose RepetitionInspector (tool_monitor.rs) tracks last_call, increments repeat_count on identical (name+params), denies above max_repetitions. Sweep AI used llm_state['visited_set'] plus an attempt counter, skipping a file after attempt_count >= 3.

None of them ask the LLM to detect its own loops. All three count tool calls and break above a threshold.
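The shared shape can be sketched in a few lines. This is an illustrative composite of the pattern, not any one project's actual API; the class name, threshold, and call encoding are all hypothetical:

```python
class RepetitionGuard:
    """Deterministic stuck-detection at the tool-call layer: count identical
    consecutive tool calls and deny above a threshold. No LLM involved."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.last_call = None
        self.repeat_count = 0

    def check(self, name: str, params: dict) -> bool:
        """Return True if the call may proceed, False if it should be denied."""
        call = (name, tuple(sorted(params.items())))  # identity = name + params
        if call == self.last_call:
            self.repeat_count += 1
        else:
            self.last_call = call
            self.repeat_count = 1
        return self.repeat_count <= self.max_repeats
```

The detector is middleware: it sits in front of tool execution and costs nothing per call, which is exactly why all three projects put it there rather than in the prompt.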

2. Per-consumer cursor on the event log

Two independent codebases converged: LangGraph uses Checkpoint.versions_seen (per-node-per-channel) to drive "which nodes execute next." MetaGPT uses _watch([ActionType, ...]) declarations + a per-role msg_buffer.

Both treat "what events has this agent already consumed" as first-class state, separate from the global event log. Conductor-E doesn't have this — agents pull from assignments/next with no notion of cursor position, which is why the per-pod capacity problem in v1 of this doc was hard to express.
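A minimal sketch of cursor-as-first-class-state, with hypothetical names (this is the shape LangGraph's versions_seen and MetaGPT's msg_buffer both reduce to, not either project's code):

```python
class EventLog:
    """Append-only global log with a per-consumer cursor: each agent tracks
    which events it has already consumed, separately from the log itself."""

    def __init__(self):
        self.events: list[dict] = []
        self.cursors: dict[str, int] = {}  # consumer -> index of next unseen event

    def append(self, event: dict) -> None:
        self.events.append(event)

    def pull(self, consumer: str) -> list[dict]:
        """Return only events this consumer has not yet seen; advance its cursor."""
        start = self.cursors.get(consumer, 0)
        unseen = self.events[start:]
        self.cursors[consumer] = len(self.events)
        return unseen
```

With cursors, "how far behind is this pod" is a subtraction, which is what makes per-pod capacity expressible.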

3. N parallel attempts + single arbitration > multi-agent collaboration

Two independent codebases converged: SWE-agent's ScoreRetryLoop samples N attempts, scores each with a reviewer LLM (5×, averaged), tiebreaks on API call count. Cognition's "Don't Build Multi-Agents" essay argues that fine-grained intra-task multi-agent setups are slower and worse than parallel single-agent attempts with arbitration.

Implication for the rig: when KEDA scales Dev-E to >1 replica on the same issue, the right pattern is parallel-with-reviewer-pick, not coordination protocol.
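A sketch of the arbitration step, assuming a deterministic stand-in for the reviewer LLM (score_fn) and an illustrative attempt record; none of these names come from SWE-agent's source:

```python
import statistics

def pick_best(attempts, score_fn, n_scores: int = 5):
    """N parallel attempts + single arbitration: score each attempt n_scores
    times, average, and tiebreak in favor of the cheaper attempt."""
    best = None
    for attempt in attempts:
        mean = statistics.mean(score_fn(attempt) for _ in range(n_scores))
        key = (mean, -attempt["api_calls"])  # higher score wins; fewer calls breaks ties
        if best is None or key > best[0]:
            best = (key, attempt)
    return best[1]
```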

4. Pre-assignment task refinement

Two independent codebases converged: Camel's TaskSpecifyAgent rewrites a vague brief into a concrete one before any work agent sees it. GPT Pilot's "Spec Writer" asks clarifying questions and refuses to start until requirements are concrete. Both gate work behind clarification.

5. Bounded loop + sentinel + escalation

ChatDev caps inner-phase chats at chat_turn_limit rounds and falls back to self_reflection() if no <INFO> sentinel is emitted. The general pattern (without the magic-string sentinel) shows up repeatedly: no agent loop should run forever; bound the iteration count and emit an escalation event on exhaustion.
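The general pattern, stripped of the magic string, fits in one function. This is a generic sketch with illustrative names, where the sentinel is replaced by a completion predicate and exhaustion emits an escalation event:

```python
def run_bounded(step_fn, is_done, limit: int, escalate):
    """Bounded loop + escalation: run step_fn at most `limit` times; if
    is_done never fires, emit an escalation event instead of spinning."""
    for _ in range(limit):
        result = step_fn()
        if is_done(result):
            return result
    escalate({"type": "LoopExceeded", "limit": limit})
    return None
```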

Specific patterns by project

OpenHands (All-Hands-AI/OpenHands)

Event-stream + AgentController state machine. Sub-agents (is_delegate=True) subscribe to parent's stream; bubble AgentDelegateObservation back via end_delegate(). State recovery via pickle+base64 in StateTracker; State.__getstate__ deliberately drops history because it gets rehydrated from the event stream (events are the source of truth). Hierarchical delegation, not peer-to-peer.

Steal: StuckDetector 5-pattern set as middleware (see Convergence #1). Skip: Pickle-based state — Marten gives typed JSONB.

SWE-agent (Princeton NLP)

Agent-Computer Interface (ACI): tools as a curated bundle (folder of executables + config.yaml) running inside a sandbox. Each tool can declare a state command that runs after every action. ScoreRetryLoop + Reviewer (agent/reviewer.py) — sample N attempts, score, tiebreak on API call count.

Steal: ScoreRetryLoop pattern when KEDA scales to >1 replica (see Convergence #3). Skip: Bundle/state-command DSL — Claude Code's tool surface already covers this.

Aider (Aider-AI/aider)

Repo map = tree-sitter symbols ranked by PageRank over file-dependency graph, truncated to --map-tokens (default 1k). Edit formats: whole, diff, diff-fenced, udiff, editor-*. State recovery: git itself — every edit is a commit; git reflog for recovery.

Steal: PageRank-ranked repo map as a RepoMapBuilt {sha, mapTokens, topSymbols[]} event on WorkStarted. Improves cold-start grounding without context bloat. Skip: Edit-format negotiation per model — Claude Code handles this.
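A toy sketch of the idea, using in-degree as a crude stand-in for PageRank and a 1-token-per-symbol cost model; Aider's real implementation uses tree-sitter symbols and a proper PageRank, so everything below is illustrative:

```python
def build_repo_map(deps: dict[str, set[str]], symbols: dict[str, list[str]],
                   budget: int) -> list[str]:
    """Rank files by how often they are depended on, then emit their symbols
    until the token budget is exhausted."""
    # in-degree = how many other files import this one
    indegree = {f: 0 for f in deps}
    for targets in deps.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    ranked = sorted(indegree, key=indegree.get, reverse=True)
    out: list[str] = []
    for f in ranked:
        for sym in symbols.get(f, []):
            if len(out) >= budget:
                return out
            out.append(f"{f}:{sym}")
    return out
```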

Sweep AI (historical, archived)

Pipeline of 18 small specialised bots chained explicitly. StatefulCodeSuggestion per file with pending/processing/done. Validation loop: post-edit runs format_file() and rejects the edit if formatting reverts it to the original, i.e. the edit is kept only when file_data['original_contents'] != formatted_contents.

Steal: Formatter-reversion check as a Dev-E pre-commit hook. Two lines of code, catches a real failure mode (LLM adding semantically-null whitespace tweaks). Skip: The 18-bot fan-out. Sweep died partly from the debugging cost.
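As a pure function the check really is tiny. The function name is hypothetical and format_fn stands in for running the project's real formatter:

```python
def edit_survives_formatting(original: str, edited: str, format_fn) -> bool:
    """Reject an edit when the formatter maps it back to the original:
    the model only shuffled whitespace/style that the formatter undoes."""
    return format_fn(edited) != original
```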

Cognition / Devin

Public essay "Don't Build Multi-Agents" — "decision-making becomes too dispersed and context isn't shared thoroughly enough between agents." Recommends single-thread linear context with a dedicated context-compression LLM for long tasks.

Steal: ContextCompressed event carrying summary blob, emitted when Dev-E approaches its window. Lets a fresh replica resume. Don't take as dogma: the rig's 3 agents have coarse, well-defined boundaries (assignment → PR-created → review-done). Cognition's warning is against fine-grained intra-task parallelism; coarse role separation is fine.

Block Goose

Recipes (crates/goose/src/recipe/mod.rs) — typed YAML with instructions, prompt, extensions, parameters (typed: string/number/bool/date/file/select), response (with JSON-schema validation), settings, and sub_recipes for composition. Permission system: LLM-as-judge — separate provider classifies each tool call as approved/needs_approval/denied. RepetitionInspector as deterministic backstop.

Steal: Recipe schema for .claude/agents/ overlays — typed parameters + sub_recipes composition is more rigorous than what we have. Skip: LLM-judge for every tool call — doubles latency for marginal safety gain when allowlists + human-in-loop already exist.

CrewAI

Crew carries agents, tasks, optional manager_agent/manager_llm, optional Memory. Process types: sequential and hierarchical (a third, consensual, is marked TODO in the source). Delegation = two tools injected into every agent's toolbelt: DelegateWorkTool(agents=...) and AskQuestionTool(agents=...). Coordination is implicit (manager agent's prompt + the two tools).

Steal: Paired QuestionAsked / QuestionAnswered events for sub-agent clarification, with replyTo correlation. Skip: Hierarchical-as-prompted-tool — we have a real assignment endpoint.

AutoGen 0.4 (Microsoft)

Complete rewrite into actor model. AgentRuntime Protocol with send_message(message, recipient: AgentId, ...) (RPC, awaits response) and publish_message(message, topic_id: TopicId, ...) (pub/sub). RoutedAgent's @message_handler decorator does type-based dispatch with target_types and produces_types declarations.

Steal: Declared consumes / produces event-type contract per agent. Promote Conductor-E event discriminators to a registry that says "Dev-E consumes PRReviewRequested, produces PRReviewSubmitted | ChangesRequested." Static topology check at deploy time. Skip: Actor identity per agent — fights KEDA scale-to-zero. Pods are stateless workers, not addressable actors.
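A sketch of what that registry and deploy-time check could look like. The event names and agent entries below are hypothetical, and the "no producer" case is expected for externally sourced events (e.g. GitHub webhooks):

```python
REGISTRY = {
    # agent -> (consumes, produces); contents are illustrative
    "conductor-e": ({"IssueOpened"}, {"WorkAssigned"}),
    "dev-e":       ({"WorkAssigned", "ChangesRequested"}, {"PRCreated"}),
    "review-e":    ({"PRCreated"}, {"PRReviewSubmitted", "ChangesRequested"}),
}

def check_topology(registry) -> list[str]:
    """Static topology check: flag produced event types nobody consumes and
    consumed types nobody produces (candidates for external sources)."""
    consumed = set().union(*(c for c, _ in registry.values()))
    produced = set().union(*(p for _, p in registry.values()))
    problems = [f"no agent consumes {e}" for e in sorted(produced - consumed)]
    problems += [f"no agent produces {e}" for e in sorted(consumed - produced)]
    return problems
```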

MetaGPT

Each Role runs an _observe() → _think() → _act() loop. _observe() filters the role's msg_buffer against _watch()-registered Action types. Environment is the shared bus, routing via publish_message().

Steal: Explicit _watch([EventType, ...]) declaration per agent. Makes topology auditable; lets you detect "no agent watches event type X" at startup. Skip: RoleZero's runtime tool recommender — character.json + MCP wiring is the right place for tool surface.

ChatDev

Linear phase list over a mutable ChatEnv blackboard. Inner-phase loops bounded by chat_turn_limit; if no <INFO> sentinel, self_reflection() is called.

Steal: Bounded-loop sentinel for review cycles — after N round-trips between Dev-E and Review-E, emit ReviewLoopExceeded and escalate to ATL-E. Skip: <INFO> magic string — use typed tool calls / events.

LangGraph

Checkpoint TypedDict — id, channel_values, channel_versions, versions_seen (per-node-per-channel), pending_writes. Keyed on thread_id. interrupt(value) raises GraphInterrupt; Command(resume=..., update=..., goto=...) for typed resume.

Steal: versions_seen per-consumer cursor model + Command-style typed resume tokens. Skip: Re-executing entire node on resume — agents aren't replayable Python functions.

Camel + AgentVerse

Camel's TaskSpecifyAgent (refines vague briefs before main agents see them). AgentVerse's Visibility as a pluggable component, separate from subscription.

Steal: Pre-assignment task refinement (see Convergence #4). Visibility-as-filter is interesting but lower priority.

Beads (Gastown's underlying issue tracker, Dolt-backed)

Auto-ready task detection: bd ready returns next task whose dependencies are all closed. Typed link relations (relates_to/duplicates/supersedes/replies_to/blocks).

Steal: Add /api/work/ready?agent=dev-e endpoint to Conductor-E that returns ready issues with resolved dependency chains. (Marten projection.) Skip: Dolt as parallel data store — Marten covers it.
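A sketch of what the projection behind that endpoint would compute; the issue schema and field names are illustrative:

```python
def ready_issues(issues: dict[str, dict]) -> list[str]:
    """Beads-style auto-ready detection: an issue is ready when it is open
    and every issue it is blocked by is closed."""
    return sorted(
        iid for iid, issue in issues.items()
        if issue["state"] == "open"
        and all(issues[dep]["state"] == "closed" for dep in issue.get("blocked_by", []))
    )
```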

Backlog.md

Markdown files in repo + frontmatter + MCP. Tasks-as-files with enforced acceptance criteria + Definition of Done in frontmatter.

Steal: Idea of acceptance_criteria and done_when in issue frontmatter; Review-E reads it, gates merge on it. Skip: Parallel markdown task store — GitHub Issues + Conductor-E covers it.
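What that frontmatter might look like on an issue file. Every field and value here is an illustrative sketch, not Backlog.md's actual schema:

```yaml
---
id: 142
title: Add retry to webhook dispatcher
acceptance_criteria:
  - Failed deliveries retry with exponential backoff
  - Retries are capped at 5 attempts
done_when: the webhook retry integration test passes on CI
---
```

Review-E reads acceptance_criteria at review time and refuses to approve a PR that does not address each item.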

Anthropic Claude Agent SDK + "Building Effective Agents"

Five workflow patterns: chaining / routing / parallelization / orchestrator-workers / evaluator-optimizer. Explicit warning: "most multi-agent setups are slower and worse than a single agent with good tools — invest in ACI first." The rig is a "routing + orchestrator-workers" hybrid via Conductor-E.

Steal: Audit how well-documented Dev-E's tools (Conductor-E REST API, gh, build scripts) are vs. how well a public SDK would be documented. That's the ACI investment Anthropic is pointing at. Add an explicit evaluator-optimizer loop between Dev-E and Review-E (today one-shot; bounded refinement loop with N=2 is small).

GPT Pilot (now archived, redirects to Pythagora.ai)

Multi-role pipeline (Spec Writer / Architect / Tech Lead / Dev / Reviewer / Debugger). Spec Writer asks clarifying questions before any code.

Steal: Clarifier gate at issue intake (see Convergence #4). Cautionary: the dead repo is evidence that maintaining 6 specialized agents is a cost trap. Don't grow our 3-agent shape into a Pythagora-style pipeline.

Goose Recipes / Hub

Recipes as versioned YAML config artifacts — git-cloneable, shareable.

Steal: Treat workflow patterns (review-pr.yaml, triage-issue.yaml) as declarative YAML committed to a .rig/recipes/ dir, consumed by Conductor-E when dispatching work. Today these patterns live as prose inside Dev-E/Review-E system prompts. Lifting to declarative YAML gives per-recipe model selection, cost ceilings, auditable history.
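A sketch of what one such recipe file could contain, loosely modeled on Goose's recipe schema; the path, field names, and especially cost_ceiling_usd are assumptions, not fields Goose or Conductor-E actually define:

```yaml
# .rig/recipes/review-pr.yaml  (hypothetical path and schema)
version: "1.0"
title: review-pr
instructions: Review the pull request against the parent issue's acceptance criteria.
parameters:
  - key: pr_number
    input_type: number
    requirement: required
settings:
  model: claude-sonnet      # per-recipe model selection
  cost_ceiling_usd: 2.00    # illustrative field for a per-recipe cost ceiling
```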

e2b / Daytona / Composio

Snapshot-restore microVMs (Firecracker, ~28-200ms cold start) with hardware-level kernel isolation. Composio: centralized OAuth token vault.

Skip: microVMs — wrong threat model. We run our own code on our own infra; pod isolation is sufficient. Already have: Composio's idea (Bitwarden + SealedSecrets).

Codex CLI hooks

hooks.json in .codex/ with layered config — global ~/.codex hooks + repo .codex hooks both load and merge (don't replace). Five events: SessionStart, PreToolUse, PostToolUse, UserPromptSubmit, Stop.

Steal: Hierarchical config merge model for AGENTS.md and recipe loading — repo overrides org defaults, org overrides global.
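A sketch of the merge semantics, with illustrative layer contents (the function is generic, not Codex's implementation):

```python
def merge_layers(*layers: dict) -> dict:
    """Hierarchical config merge: later layers (org, then repo) override
    earlier ones (global) key-by-key; dict values merge recursively
    instead of replacing wholesale."""
    merged: dict = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged
```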

Continue.dev / Cursor Cloud Agents

Cursor 2026: each agent gets its own Ubuntu VM + browser, up to 8 parallel via git worktrees. 35% of Cursor's own merged PRs are agent-authored. Sandboxed terminals with no internet by default.

Steal: Git worktrees for parallel agent isolation. One bare clone + N worktrees = one branch per agent, atomic file ops, no race. Default-deny egress NetworkPolicy for agent shell tool — allowlist GitHub, Anthropic, registry, Conductor-E.

GitHub Spec Kit

Markdown specs (.specify/spec.md, plan.md, tasks/, constitution.md) edited by humans and agents, lives in repo. GitHub-blessed, supports Claude Code / Copilot / Gemini CLI.

Steal: For changes bigger than a single PR, use .specify/ layout — spec.md + plan.md linked from parent issue; sub-issues auto-generated from tasks/. Keeps GitHub Issues as work source while adding markdown specs in the repo.

Anthropic's overarching warning

The "Building Effective Agents" doc is worth quoting verbatim:

"We've found that most multi-agent setups are slower and worse than a single agent with good tools — invest in agent-computer interface first."

The rig has 3 agents (Conductor-E, Dev-E, Review-E) with clean handoff boundaries. That's defensible. The trap to watch is: "let's add Spec-E and Triage-E and Architect-E" until we have GPT Pilot's 6-role pipeline (now archived as unmaintained).

When proposing a new agent, the bar is: "does this role have a clean event-shaped boundary with the existing agents?" If the answer requires shared intra-task context, build a tool instead.

How this informs v2

architecture-proposed-v2.md cites this document for justification. The v2 picks that came directly from this research are flagged in v2 with [research] markers.
