The Council of Many: Multi-Agent Coordination
The trials of the single familiar are behind you. You taught one agent to plan, to act, to be measured and reforged. But the great works of the realm are never raised by one pair of hands — they are raised by a council. Summon many familiars at once and the realm’s work multiplies; summon them carelessly and they trip over one another in the dark, each undoing the next’s labor, no scribe able to say which one broke the bridge. This chapter is the law of the council: how to convene many agents, keep a record every one of them signs, and know exactly which to blame when a wall falls.
The real-world skill beneath the spellcraft is multi-agent system design — and on GitHub it is, almost entirely, a GitHub Actions design problem. The primitives are not exotic: workflow triggers, job dependencies, job outputs, artifacts, and environments. Master orchestration patterns, distributed tracing, failure recovery, and lifecycle management and you have the whole of GH-600 Domain 5 — 17% of the exam, and the moment the certification stops asking “can you build one agent?” and starts asking “can you run a system of them that works reliably together?”
📖 The Legend Behind This Quest
Every guild that scaled past a single craftsperson faced the same reckoning: coordination is harder than craft. One smith is judged by their own hammer. A forge of twenty smiths is judged by whether the swords come out matching, whether two of them reach for the same anvil, and whether the foreman can trace a cracked blade back to the hand that quenched it wrong. Multi-agent systems are that forge. The failure modes are emergent — they live in the seams between agents, not inside any one of them. Agent C fails not because agent C is broken, but because agent B handed it garbage, because agent A drew an ambiguous plan. The council that cannot trace a handoff cannot debug itself, and a system that cannot debug itself cannot be trusted with autonomy. This chapter teaches you to convene the council and to hold it accountable.
🎯 Quest Objectives
Primary Objectives
- Apply the fan-out and chain orchestration patterns to coordinate multiple agents on GitHub Actions
- Configure agent isolation so parallel agents do not collide, and detect/resolve overlapping or contradictory outputs
- Thread a correlation ID through every job so the whole run is one auditable trace
- Choose and encode a failure-recovery strategy (abort, continue, retry, escalate) for a degraded sub-agent
- Maintain an agent registry that governs provisioning, health checks, and graceful deprecation
Mastery Indicators
- You can distinguish sequential, parallel, and hierarchical orchestration and justify which fits a given task
- You can read a multi-agent trace and name the agent that introduced a fault, not just the one that crashed
- You can tell a stalled sub-agent from a conflicting one by its signal, and respond to each differently
- You can add, reconfigure, or retire an agent without disrupting an active workflow
🗺️ Quest Prerequisites
Convening a council assumes you can already command a single familiar. Gather these before you draw the larger circle:
- A tuned single agent (Domain 4) — finish Chapter IV — Evaluation & Tuning so you know how to measure one agent before you measure many.
- Fluency in GitHub Actions YAML: you will write jobs, needs, outputs, and if conditions by hand. Know where .github/workflows/ lives and how a workflow is triggered.
- A GitHub repository with Actions enabled: the arena where your council runs. Settings → Actions → General → allow workflows to run.
- GitHub Copilot coding-agent access — the agent that actually does the sub-tasks. You orchestrate it; Actions schedules it.
- Git + an editor: to author workflows and the _data/agents.yml registry and ship them as small PRs.
🧙♂️ Chapter 1: Convening the Council — Fan-Out, Chain, and Hierarchy
⚔️ Skills You’ll Forge
- Reasoning about when to parallelize agents versus pipe them in sequence
- Wiring a fan-out orchestrator with needs and job outputs on GitHub Actions
- Building a chain where each agent’s output is the next agent’s input
- Isolating parallel agents so they cannot corrupt one another’s work
Two patterns cover most multi-agent work, and the exam wants you to tell them apart on sight.
Fan-out (parallel). An orchestrator job triggers several sub-agents simultaneously, each owning a different slice of the work — frontend tests, backend tests, a security scan. A final job waits for all of them (needs: [agent_a, agent_b]) and evaluates the collected results. Fan-out is for independent subtasks: when no agent needs another’s output, run them at once and reclaim the wall-clock time.
Chain (sequential). Each sub-agent’s output is the next one’s input. A planning agent produces a plan; an implementation agent implements it; a review agent reviews the result. Chain is for dependent subtasks — when agent B genuinely cannot start until agent A finishes.
A third shape, hierarchical, nests the two: an orchestrator fans out to several sub-orchestrators, each of which runs its own chain. The exam tests whether you can name which shape a scenario describes, so anchor the rule: independent → parallel, dependent → sequential, both-at-scale → hierarchical.
Here is fan-out expressed in the real primitives. The orchestrator emits an output that any job listing it in needs can read; two agents run in parallel; a collector waits for both with if: always() so it runs even when one agent fails. The collector also names orchestrate in its own needs, because the needs context exposes only direct dependencies.
# .github/workflows/council-fanout.yml
name: Council - Fan-Out
on:
  workflow_dispatch:
jobs:
  orchestrate:
    runs-on: ubuntu-latest
    outputs:
      run_id: ${{ steps.plan.outputs.run_id }}
    steps:
      - id: plan
        run: echo "run_id=council-${{ github.run_id }}-${{ github.run_attempt }}" >> "$GITHUB_OUTPUT"
  agent_frontend:
    needs: orchestrate
    runs-on: ubuntu-latest
    steps:
      - run: echo "Frontend agent for ${{ needs.orchestrate.outputs.run_id }}"
      # ... invoke the Copilot coding agent against the frontend slice
  agent_backend:
    needs: orchestrate
    runs-on: ubuntu-latest
    steps:
      - run: echo "Backend agent for ${{ needs.orchestrate.outputs.run_id }}"
      # ... invoke the Copilot coding agent against the backend slice
  collect:
    # orchestrate is listed too, so needs.orchestrate.outputs.run_id resolves here
    needs: [orchestrate, agent_frontend, agent_backend]
    if: always()   # run even if a sub-agent failed
    runs-on: ubuntu-latest
    steps:
      - run: echo "Collecting results for ${{ needs.orchestrate.outputs.run_id }}"
      # ... reconcile outputs, detect conflicts, decide pass/fail
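For contrast with the fan-out workflow above, a chain can be sketched with the same primitives. This is a minimal illustration under stated assumptions, not a workflow from the source: the file name, the job names (plan, implement, review), and the plan_summary and change_ref outputs are hypothetical, and the echo steps stand in for real agent invocations.

```yaml
# .github/workflows/council-chain.yml (hypothetical file name)
name: Council Chain
on:
  workflow_dispatch:
jobs:
  plan:
    runs-on: ubuntu-latest
    outputs:
      plan_summary: ${{ steps.plan.outputs.plan_summary }}
    steps:
      - id: plan
        # Placeholder value: a real planning agent would emit its plan here.
        run: echo "plan_summary=split-auth-module" >> "$GITHUB_OUTPUT"
  implement:
    needs: plan              # cannot start until plan has succeeded
    runs-on: ubuntu-latest
    outputs:
      change_ref: ${{ steps.impl.outputs.change_ref }}
    steps:
      - id: impl
        env:
          PLAN: ${{ needs.plan.outputs.plan_summary }}
        run: |
          echo "Implementing plan: $PLAN"
          echo "change_ref=agent/$PLAN" >> "$GITHUB_OUTPUT"
  review:
    needs: implement         # direct dependency, so needs.implement.* is readable
    runs-on: ubuntu-latest
    steps:
      - env:
          CHANGE_REF: ${{ needs.implement.outputs.change_ref }}
        run: echo "Reviewing $CHANGE_REF"
```

Because each job lists only its predecessor in needs, a failure in plan skips implement and review by default, which is usually the behavior a chain of dependent subtasks wants.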
Isolation is the discipline that makes parallelism safe. Two agents editing the same files at once will produce overlapping diffs, duplicated effort, or contradictory outputs. Give each parallel agent its own branch (Copilot’s coding agent opens a PR from its own branch by design), scope each to a non-overlapping path of the repo, and reconcile in the collect job — never inside a racing agent. When two agents do touch the same surface, that is a conflict to detect and resolve at the join, exactly the sub-skill the exam tests under “detect and resolve agent conflicts.”
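One way to detect overlap at the join is to compare the files each agent's branch changed. The fragment below is a sketch of steps that could live in the collect job; the branch names agent/frontend and agent/backend and the main base branch are assumptions made for illustration, not names the source prescribes.

```yaml
    # steps for the collect job (fragment; branch names are hypothetical)
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0     # fetch full history for all branches so they can be diffed
      - name: Detect files touched by both agents
        env:
          BASE: origin/main                      # assumption: default branch is main
          FRONTEND_BRANCH: origin/agent/frontend
          BACKEND_BRANCH: origin/agent/backend
        run: |
          # Files each agent changed since it forked from the base branch
          git diff --name-only "$BASE...$FRONTEND_BRANCH" | sort > frontend.txt
          git diff --name-only "$BASE...$BACKEND_BRANCH" | sort > backend.txt
          # comm -12 prints only the lines present in both sorted lists
          comm -12 frontend.txt backend.txt > overlap.txt
          if [ -s overlap.txt ]; then
            echo "### Conflict: both agents changed these files" >> "$GITHUB_STEP_SUMMARY"
            cat overlap.txt >> "$GITHUB_STEP_SUMMARY"
            exit 1
          fi
```

An empty overlap.txt means the path scopes held. A non-empty one fails the job and surfaces the contested files in the run summary, leaving the resolution to the join or to a human reviewer.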
🔍 Knowledge Check
- Given three subtasks where each depends on the previous one’s output, is fan-out or chain correct — and why?
- Why does the collect job use if: always() instead of the default behavior?
- What single technique keeps two parallel agents from corrupting each other’s file changes?
🧙♂️ Chapter 2: The Scribe of the Council — Correlation IDs and Distributed Tracing
⚔️ Skills You’ll Forge
- Generating a correlation ID that names one entire multi-agent run
- Threading that ID through job outputs, step summaries, and artifact filenames
- Producing review-and-audit artifacts that document handoffs and decisions
- Reading a unified trace to find the agent that caused a fault, not the one that crashed
The defining challenge of a multi-agent system is debugging it. When agent C fails, the cause may be faulty output from B, which traces to an ambiguous plan from A. If each agent logs in isolation, you are left correlating timestamps by hand across three job logs at three in the morning. The cure is distributed tracing: every agent writes structured log entries carrying a shared correlation ID — one unique identifier for the entire run — so you can query every entry for that run, across all agents, in order.
On GitHub Actions the correlation ID is born in the orchestrator and travels as a job output, then gets injected into artifact filenames and step-summary headers so the trace reassembles itself from the run’s own evidence.
# excerpt: every agent stamps the shared correlation ID
  agent_backend:
    needs: orchestrate
    runs-on: ubuntu-latest
    env:
      CORRELATION_ID: ${{ needs.orchestrate.outputs.run_id }}
    steps:
      - name: Run agent and emit a traced log line
        run: |
          # Each line also carries a UTC timestamp so entries from different agents can be ordered
          echo "{\"cid\":\"$CORRELATION_ID\",\"agent\":\"backend\",\"event\":\"start\",\"ts\":\"$(date -u +%FT%TZ)\"}" \
            | tee -a "trace-$CORRELATION_ID.jsonl"
          # ... agent work ...
          echo "{\"cid\":\"$CORRELATION_ID\",\"agent\":\"backend\",\"event\":\"done\",\"ts\":\"$(date -u +%FT%TZ)\"}" \
            | tee -a "trace-$CORRELATION_ID.jsonl"
      - name: Write to the run summary
        run: echo "### backend agent - run \`$CORRELATION_ID\`" >> "$GITHUB_STEP_SUMMARY"
      - uses: actions/upload-artifact@v4
        with:
          name: trace-backend-${{ env.CORRELATION_ID }}
          path: trace-${{ env.CORRELATION_ID }}.jsonl
Because every agent stamps the same cid, the collect job can download all the trace artifacts, concatenate them, and sort by event order to produce one human-readable narrative of the whole council’s work: who started, what each handed off, where the chain broke. That artifact is the audit record — Domain 5 explicitly asks you to “document key decisions, handoffs, and outcomes across agents” and to enable “post-hoc analysis.” A correlation-ID trace is how you satisfy both. The exam’s classic stem — “a trace excerpt is shown; which agent introduced the fault?” — is answerable only because the trace is unified; without the shared ID you can name the agent that crashed but never the one that caused it.
You can merge every agent’s artifacts in the collect job with the Models API or a small script, but the primitive is the same: shared ID in, unified trace out.
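# collect step: stitch every agent's trace into one ordered narrative
cid="$1"  # the run's correlation ID
# Sort on a per-line timestamp rather than on .event: sorting event names
# alphabetically would place "done" before "start".
# Assumes each trace line carries an ISO-8601 "ts" field.
cat trace-*-"$cid"/*.jsonl 2>/dev/null \
  | jq -rs 'sort_by(.ts // "") | .[] | "\(.ts // "-") \(.cid) \(.agent) \(.event)"' \
  > "council-trace-$cid.md"
echo "Wrote council-trace-$cid.md"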
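The stitching script above could be wired into the collect job roughly as follows. This is a sketch under stated assumptions: the script path scripts/stitch-trace.sh is hypothetical, and the pattern assumes every agent names its artifact trace-<agent>-<cid> as in the excerpt earlier. Without a name input, actions/download-artifact@v4 extracts each matching artifact into a directory named after the artifact, which is the layout the script's glob expects.

```yaml
  collect:
    # orchestrate is listed so its run_id output is readable in this job
    needs: [orchestrate, agent_frontend, agent_backend]
    if: always()
    runs-on: ubuntu-latest
    env:
      CORRELATION_ID: ${{ needs.orchestrate.outputs.run_id }}
    steps:
      - uses: actions/checkout@v4              # brings the stitch script into the workspace
      - uses: actions/download-artifact@v4
        with:
          pattern: trace-*-${{ env.CORRELATION_ID }}   # one directory per agent artifact
      - name: Stitch traces into one narrative
        run: bash scripts/stitch-trace.sh "$CORRELATION_ID"   # hypothetical path
      - uses: actions/upload-artifact@v4
        with:
          name: council-trace-${{ env.CORRELATION_ID }}
          path: council-trace-${{ env.CORRELATION_ID }}.md
```

Uploading the stitched file as its own artifact keeps the audit record alongside the per-agent traces it was built from.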
🔍 Knowledge Check
- What is a correlation ID, and what is the one property it must have across a multi-agent run?
- Where does the correlation ID get injected on GitHub Actions so the trace can be reassembled?
- Why can a unified trace name the agent that caused a failure when isolated logs only name the one that crashed?
🧙♂️ Chapter 3: When a Familiar Falters — Recovery and the Living Roster
⚔️ Skills You’ll Forge
- Choosing among abort, continue, retry, escalate when one sub-agent fails
- Encoding each strategy with continue-on-error, if: conditions, and needs
- Telling a stalled sub-agent from a conflicting one by its signal
- Running an agent registry to provision, health-check, and deprecate agents
A single agent that fails simply fails. A council is subtler: when one sub-agent falls, the orchestrator must decide what the failure means for the rest. Four strategies cover the field, and the exam wants you to match the strategy to the scenario:
- Abort — stop all agents and mark the whole run failed. Right when subtasks are interdependent and a partial result is worthless or unsafe.
- Continue — mark the failing agent’s subtask failed, let the others finish. Right when subtasks are independent and partial progress has value.
- Retry — re-run the failing agent, often with modified inputs. Right for transient faults (a flaky network, a rate limit) — not for logic errors, which will fail identically.
- Escalate — open a human-review issue and pause. Right when the failure is ambiguous, irreversible-adjacent, or beyond the agents’ authority to resolve.
The signal tells you which fault you have. A stalled agent produces no progress and eventually times out — recover with a retry or a timeout-and-continue. A conflicting agent produces output that contradicts a peer’s — recover at the join with a rollback or a human-in-the-loop decision, never a blind retry (re-running it just reproduces the conflict). Knowing stalled vs conflicting is exactly what Domain 5 sub-skill 5.3 tests.
On GitHub Actions, continue-on-error: true lets the orchestrator survive a sub-agent’s failure, and if: always() / if: failure() route the recovery:
  agent_security:
    needs: orchestrate
    runs-on: ubuntu-latest
    continue-on-error: true   # CONTINUE strategy: a fail here won't abort the run
    steps:
      - run: ./run-security-agent.sh
  recover:
    # orchestrate is listed so needs.orchestrate.outputs.run_id resolves here
    needs: [orchestrate, agent_frontend, agent_backend, agent_security]
    if: failure()             # only when a sub-agent failed
    runs-on: ubuntu-latest
    permissions:
      issues: write           # the token must be allowed to open issues
    steps:
      - name: Escalate to a human
        run: |
          # No checkout in this job, so name the repository explicitly
          gh issue create \
            --repo "$GITHUB_REPOSITORY" \
            --title "Council run ${{ needs.orchestrate.outputs.run_id }} needs review" \
            --body "A sub-agent failed. Trace artifact attached to the run."
        env:
          GH_TOKEN: ${{ github.token }}
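The workflow above covers continue and escalate. For a stalled agent, the retry strategy might look like the sketch below, which bounds each attempt with the coreutils timeout command and caps the whole step with timeout-minutes. The script name run-backend-agent.sh, the attempt count, and the durations are illustrative assumptions, not values from the source.

```yaml
  agent_backend:
    needs: orchestrate
    runs-on: ubuntu-latest
    timeout-minutes: 20            # hard cap for the whole job
    steps:
      - name: Run agent with a bounded retry
        timeout-minutes: 15        # hard cap for this step
        run: |
          for attempt in 1 2 3; do
            # timeout kills the agent if a single attempt stalls past 4 minutes
            if timeout 4m ./run-backend-agent.sh; then
              exit 0
            fi
            echo "attempt $attempt failed or stalled; backing off"
            sleep $((attempt * 10))
          done
          exit 1                   # all attempts exhausted: let recovery take over
```

This shape suits transient faults and stalls only. A conflicting agent would reproduce the same contradiction on every attempt, so it belongs at the join, not in a retry loop.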
Lifecycle management is the standing-army version of all this. Multi-agent systems have operational demands a lone agent never does: provisioning a new agent and registering it; health monitoring to confirm each agent is responsive and producing expected output; and deprecation — gracefully retiring an agent being replaced without disrupting an active workflow. The Agentic Codex governs this with an agent registry: a single _data/agents.yml that records every agent’s name, role, owner, status, and review date. The registry is the source of truth — you add an agent by adding a row, retire one by flipping its status, and a scheduled health-check workflow reads the file to know who to ping.
# _data/agents.yml: the council's roster (source of truth for lifecycle)
- name: frontend-agent
  role: Runs and repairs frontend tests
  owner: web-platform
  status: active        # active | deprecated | retired
  review_date: 2026-09-01
- name: security-agent
  role: Static + dependency scanning
  owner: appsec
  status: active
  review_date: 2026-09-01
- name: legacy-linter
  role: Old style checks, superseded by security-agent
  owner: appsec
  status: deprecated    # still runs, slated for retirement; do not add new deps on it
  review_date: 2026-07-15
Adding an agent to an existing workflow is a new roster row plus a new job; replacing one is a status flip and a job swap behind the same correlation-ID contract; retiring one preserves auditability because its past traces still carry the shared cid. That is how a council grows and sheds members without ever losing its memory.
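A scheduled workflow that reads the roster could be sketched as below. It only reports agents that are deprecated or past their review date; actually pinging an agent is left out. The file name, the cron schedule, and the awk parsing are assumptions for illustration, and the parsing relies on the roster keeping its flat shape with status listed before review_date in each entry.

```yaml
# .github/workflows/council-health.yml (hypothetical file name)
name: Council Health Check
on:
  schedule:
    - cron: "0 6 * * 1"      # Mondays at 06:00 UTC
  workflow_dispatch:
jobs:
  roster:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: List agents that are deprecated or past review
        run: |
          today="$(date -u +%F)"
          # ISO dates compare correctly as plain strings
          awk -v today="$today" '
            $1 == "-" && $2 == "name:" { name = $3 }
            $1 == "status:"            { status = $2 }
            $1 == "review_date:" && (status == "deprecated" || $2 < today) {
              print "- " name " (" status ", review " $2 ")"
            }
          ' _data/agents.yml >> "$GITHUB_STEP_SUMMARY"
```

Keeping the check read-only means a roster edit is the only way to change an agent's lifecycle state, which preserves the registry as the single source of truth.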
🔍 Knowledge Check
- Which recovery strategy fits a transient network failure, and which fits a logic error — and why are they different?
- What signal distinguishes a stalled sub-agent from a conflicting one, and why is a blind retry wrong for the conflicting case?
- What three lifecycle operations does _data/agents.yml govern, and what does flipping a status to deprecated communicate?
⚔️ The Quests of This Domain
Domain 5 splits into four trials, each a standalone quest that drills one sub-skill to mastery. Play them in order to convert this chapter’s overview into hands-on command:
- The Council of Many: Multi-Agent Orchestration Patterns — build the full fan-out and chain workflows by hand, configure parallel isolation, and resolve agent conflicts at the join (Sub-skill 5.1).
- The Scribe’s Codex: Observability in Multi-Agent Systems — implement the correlation-ID trace writer and produce the review-and-audit artifact that documents every handoff (Sub-skill 5.2).
- When Familiars Fall: Multi-Agent Failure Recovery — build the recovery coordinator that detects stalled versus conflicting agents and routes abort/continue/retry/escalate (Sub-skill 5.3).
- The Agent Pantheon: Multi-Agent Lifecycle Management. Author the agents.yml registry schema and the scheduled health-monitoring workflow that provisions, watches, and retires agents (Sub-skill 5.4).
🎮 Mastery Challenge
Objective: Convene a real council on your own repository and prove it can both coordinate and account for itself.
- A workflow_dispatch workflow fans out to two sub-agent jobs that run in parallel, each on its own branch/path scope, with a collect job that runs if: always()
- A correlation ID minted in the orchestrator reaches every job, every step summary, and every uploaded trace artifact, and the collect job stitches them into one ordered council-trace-<cid>.md
- One sub-agent is marked continue-on-error: true, and a recover job with if: failure() escalates by opening an issue tagged with the run’s correlation ID
- A _data/agents.yml registry lists both agents with status and review_date, and you can describe how you would retire one without breaking the run
🎁 Rewards & Progression
- 🏛️ Council Convener — orchestrated your first multi-agent workflow with fan-out and a collector
- 🧵 Trace Weaver — threaded a single correlation ID through an entire agent run
- 🗂️ Skill unlocked: Fan-out, chain, and hierarchical orchestration on GitHub Actions
- 🔭 Skill unlocked: Distributed tracing across agents with correlation IDs
- 🛟 Skill unlocked: Multi-agent failure recovery and lifecycle management
- +90 XP
🗺️ Quest Network
graph LR
Hub["👑 Epic: The Agentic Codex"] --> Prev["Ch. IV — Evaluation & Tuning"]
Prev --> This["Ch. V — Multi-Agent Coordination"]
This --> Next["Ch. VI — Guardrails & Accountability"]
click Hub "/quests/codex/agentic-codex/"
click Prev "/quests/1010/agentic-codex-04-evaluation-and-tuning/"
click This "/quests/1011/agentic-codex-05-multi-agent-coordination/"
click Next "/quests/1100/agentic-codex-06-guardrails-and-accountability/"
classDef current fill:#1f6feb,stroke:#0b3d91,color:#fff;
class This current;
🔮 Next Adventures
The council convenes, traces its own work, and survives a fallen familiar. But a council that can act in parallel can also do harm in parallel — many hands moving at once need many brakes. The next chapter teaches the law that binds them: autonomy levels, least-privilege scope, and the human-in-the-loop gate that keeps a fast council from becoming a runaway one.
- ➡️ Next chapter: Chapter VI — Guardrails & Accountability
- ⬅️ Previous chapter: Chapter IV — Evaluation & Tuning
- 🏰 Campaign hub: Epic Quest: The Agentic Codex
📚 Resource Codex
- GH-600 Study Guide — Skills Measured — the official Domain 5 breakdown on Microsoft Learn
- GitHub Copilot coding agent — the agent each council member runs, branching and opening PRs autonomously
- GitHub Actions workflow syntax: jobs, needs, outputs, continue-on-error, and if conditions
- Job dependencies with needs: how fan-out and chain are expressed in Actions
- Storing and sharing workflow data with artifacts: where the correlation-ID traces live
- Model Context Protocol (MCP) — the tool-integration standard agents share across a council
- GitHub Models — the Models API for reconciling agent outputs in the collector
🕸️ Knowledge Graph
Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.
Campaign hub: [[Epic Quest: The Agentic Codex]]
Previous: [[Evaluation and Tuning: Reforging the Agent’s Mind]]
Next: [[Guardrails and Accountability: The Warden’s Pact]]
Domain quests: [[The Council of Many: Multi-Agent Orchestration Patterns]] · [[The Scribe’s Codex: Observability in Multi-Agent Systems]] · [[When Familiars Fall: Multi-Agent Failure Recovery]] · [[The Agent Pantheon: Multi-Agent Lifecycle Management]]
Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]
🎁 Rewards
Badges
- 🏛️ Council Convener — orchestrated your first multi-agent workflow
- 🧵 Trace Weaver — threaded a correlation ID through a whole agent run
Skills unlocked
- 🗂️ Fan-out and chain orchestration on GitHub Actions
- 🔭 Distributed tracing across agents with correlation IDs
- 🛟 Multi-agent failure recovery and lifecycle management
Features unlocked
- Domain 6 — Guardrails & Accountability