When Familiars Fall: Multi-Agent Failure Recovery

By IT-Journey Team

Design resilient multi-agent systems on GitHub — detect sub-agent failures, apply compensation strategies, re-delegate failed tasks, and ensure partial progress is never lost.

Estimated reading time: 7 minutes

Edit on GitHub

The Proving Grounds are where the Council sends candidates who believe their systems are perfect. The Trials of the Fallen Familiar begin simply: one sub-agent fails. Does the system collapse? Does it recover? Does it preserve what was already done? Only those who plan for failure earn the right to deploy to production.

🗺️ Quest Network Position

graph LR
    Q15[✅ Q15: Scribe's Codex] --> Q16[🎯 Q16: Fallen Familiars]
    Q16 --> Q17[🔜 Q17: Agent Pantheon]
    style Q16 fill:#4CAF50,stroke:#2E7D32,stroke-width:4px,color:#fff

🎯 Quest Objectives

Classify sub-agent failure types — categorise by recoverability and blast radius
Implement failure detection — orchestrator detects when a sub-agent has failed
Apply a retry strategy — retry idempotent failures with exponential backoff
Implement re-delegation — on permanent failure, reassign task to a different agent
Preserve partial progress — ensure work done before failure is not lost

⚔️ The Quest Begins

Chapter 1 — Sub-Agent Failure Classification

Failure Type	Example	Recovery Strategy
Transient	Network timeout, rate limit	Retry with backoff
Idempotent but failed	Duplicate action, safe to re-run	Retry from checkpoint
Non-idempotent failure	Partial file write	Rollback + re-delegate
Permanent	Invalid input, missing permission	Escalate to human
Cascade	Sub-agent A failure blocks sub-agent B	Compensate + continue with partial

Chapter 2 — Detecting Sub-Agent Failure in the Orchestrator

Exercise 16.1: Configure the orchestrator to continue after sub-agent failure and assess damage.

# .github/workflows/orchestrator-with-recovery.yml
name: Multi-Agent with Failure Recovery

on:
  workflow_dispatch:
    inputs:
      task_id:
        description: "Task identifier"
        required: true

jobs:
  sub-agent-1:
    runs-on: ubuntu-latest
    continue-on-error: true      # Orchestrator must see all outcomes
    outputs:
      status: $
    steps:
      - uses: actions/checkout@v4
      - name: Execute sub-task 1
        id: run
        run: |
          set +e  # Don't fail immediately — capture outcome
          python3 work/gh-600/scripts/subtask.py --task analysis
          EXIT_CODE=$?
          
          if [ $EXIT_CODE -eq 0 ]; then
            echo "status=success" >> "$GITHUB_OUTPUT"
          else
            echo "status=failed" >> "$GITHUB_OUTPUT"
            # Save partial results before exiting
            python3 work/gh-600/scripts/save_checkpoint.py --task analysis
            exit $EXIT_CODE
          fi

      - name: Upload partial/full results
        if: always()   # Upload even on failure
        uses: actions/upload-artifact@v4
        with:
          name: subtask1-result
          path: subtask1-*.json

  sub-agent-2:
    runs-on: ubuntu-latest
    needs: sub-agent-1
    continue-on-error: true
    if: always()     # Run even if sub-agent-1 failed
    steps:
      - uses: actions/checkout@v4
      - name: Run with awareness of upstream status
        run: |
          UPSTREAM_STATUS="$"
          
          if [ "$UPSTREAM_STATUS" = "failed" ]; then
            echo "⚠️ Sub-agent 1 failed — running in degraded mode"
            python3 work/gh-600/scripts/subtask.py \
              --task synthesis \
              --degraded-mode \
              --skip-analysis
          else
            python3 work/gh-600/scripts/subtask.py --task synthesis
          fi

  recover-and-report:
    runs-on: ubuntu-latest
    needs: [sub-agent-1, sub-agent-2]
    if: always()
    steps:
      - uses: actions/checkout@v4

      - name: Download all results
        uses: actions/download-artifact@v4
        with:
          pattern: subtask*-result
          path: ./results/

      - name: Assess and recover
        id: assess
        run: |
          python3 work/gh-600/scripts/recovery_coordinator.py \
            --results-dir ./results/ \
            --task-id "$" \
            --agent1-status "$" \
            --agent2-status "$" \
            --output recovery-plan.json

      - name: Re-delegate failed tasks
        if: fromJSON(steps.assess.outputs.needs_redelegation)
        run: |
          python3 work/gh-600/scripts/redelegate_tasks.py \
            --failed-tasks "$"

Chapter 3 — Compensation Strategy Implementation

Exercise 16.2: Implement the recovery coordinator.

# work/gh-600/scripts/recovery_coordinator.py
"""Coordinates recovery from sub-agent failures in multi-agent workflows."""

import argparse
import json
import os
from pathlib import Path


def assess_and_recover(
    results_dir: str,
    task_id: str,
    agent_statuses: dict[str, str],
    output_file: str
) -> dict:
    """Assess the state of a multi-agent run and produce a recovery plan."""
    
    results = {}
    for result_file in Path(results_dir).rglob("*.json"):
        with open(result_file) as f:
            results[result_file.stem] = json.load(f)
    
    failed_agents = [k for k, v in agent_statuses.items() if v == "failure"]
    succeeded_agents = [k for k, v in agent_statuses.items() if v == "success"]
    
    recovery_plan = {
        "task_id": task_id,
        "failed_agents": failed_agents,
        "succeeded_agents": succeeded_agents,
        "partial_results_preserved": len(results),
        "recovery_actions": []
    }
    
    for agent_id in failed_agents:
        # Determine recovery strategy based on what's available
        agent_result = results.get(f"{agent_id}-result")
        
        if agent_result and agent_result.get("checkpoint_available"):
            recovery_plan["recovery_actions"].append({
                "agent": agent_id,
                "strategy": "retry_from_checkpoint",
                "checkpoint": agent_result["checkpoint_path"]
            })
        else:
            recovery_plan["recovery_actions"].append({
                "agent": agent_id,
                "strategy": "redelegate",
                "task": agent_result.get("original_task") if agent_result else "unknown"
            })
    
    with open(output_file, "w") as f:
        json.dump(recovery_plan, f, indent=2)
    
    print(f"Recovery plan: {len(failed_agents)} failed, {len(succeeded_agents)} succeeded")
    print(f"Recovery actions: {len(recovery_plan['recovery_actions'])}")
    
    # Set GitHub Actions outputs
    needs_redelegation = any(
        a["strategy"] == "redelegate"
        for a in recovery_plan["recovery_actions"]
    )
    print(f"::set-output name=needs_redelegation::{str(needs_redelegation).lower()}")
    
    return recovery_plan


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--task-id", required=True)
    parser.add_argument("--agent1-status", required=True)
    parser.add_argument("--agent2-status", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    
    statuses = {
        "sub-agent-1": args.agent1_status,
        "sub-agent-2": args.agent2_status
    }
    assess_and_recover(args.results_dir, args.task_id, statuses, args.output)

✅ Quest Validation

python3 scripts/validate_quest.py --quest q16
# ✅ Recovery workflow: orchestrator-with-recovery.yml present
# ✅ Recovery coordinator: recovery_coordinator.py present
# ✅ Compensation strategies: all 5 types documented
# 🏆 Quest Q16 complete!

🏆 Quest Rewards

Reward	Details
🛡️ Battle-Tested Architect Badge	Earned on completion
🔄 Compensation Strategies	Skill unlocked
100 XP	Added to Level 1011 total
Unlocks	Q17: The Agent Pantheon

Layout	`default`
Collection	`quests`
Path	`_quests/1011/agentic-multi-agent-failure-recovery.md`
URL	`/quests/gh-600/agentic-multi-agent-failure-recovery/`
Date	`2026-05-17`

Settings

Search

Appearance

About

Page Location

Source Code

Page Info

Theme Skin

SVG Backgrounds

Layer Opacity