Domain 4 of GH-600 (19% of the exam, tied with Domain 3 as the largest) covers evaluation and improvement. It is the domain that separates engineers who deploy agents from engineers who operate them.
Deploying an agent is a one-time act. Operating an agent is continuous work: measuring its success rate, analysing its failures, and iterating on its instructions until it reliably does what you need.
Machine-Verifiable Success Criteria (Sub-skill 4.1)
The most powerful concept in Domain 4 is the distinction between:
- Vague criteria: “The agent should implement the feature correctly.”
- Machine-verifiable criteria: “All CI checks pass, no new security alerts are opened, and the PR receives at least one approving review.”
Machine-verifiable criteria can be checked automatically by a workflow. You can then tell, without human review, whether the agent’s output meets the bar, and the check gives the same reproducible verdict on every agent run.
The implementation pattern is a check-task-completion.yml workflow that runs after the agent creates a PR and evaluates each criterion against GitHub’s API signals.
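As a rough illustration of how the three example criteria could be evaluated, here is a minimal Python sketch. It assumes the workflow has already collected the signals from GitHub's API. The `PullRequestSignals` shape, the function names, and the choice to count `neutral` and `skipped` checks as passing are all assumptions made for this example, not part of the `check-task-completion.yml` pattern itself.

```python
from dataclasses import dataclass

# Assumption: these check conclusions count as "passing" for our purposes.
PASSING_CONCLUSIONS = {"success", "neutral", "skipped"}


@dataclass
class PullRequestSignals:
    """Signals collected for one agent-created PR (hypothetical shape)."""
    check_conclusions: list[str]  # e.g. ["success", "failure"]
    new_security_alerts: int      # alerts opened by this PR
    review_states: list[str]      # e.g. ["COMMENTED", "APPROVED"]


def evaluate_criteria(signals: PullRequestSignals) -> dict[str, bool]:
    """Return a pass/fail verdict for each machine-verifiable criterion."""
    return {
        # An empty check list is treated as a failure: nothing was verified.
        "ci_checks_pass": bool(signals.check_conclusions)
        and all(c in PASSING_CONCLUSIONS for c in signals.check_conclusions),
        "no_new_security_alerts": signals.new_security_alerts == 0,
        "has_approving_review": "APPROVED" in signals.review_states,
    }


def task_completed(signals: PullRequestSignals) -> bool:
    """The task meets the bar only if every criterion passes."""
    return all(evaluate_criteria(signals).values())
```

Returning a verdict per criterion, rather than a single boolean, keeps the result useful for later failure analysis: a failed run records which criterion was missed.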
Root Cause Analysis (Sub-skill 4.2)
When an agent fails, the instinct is to re-run it. But re-running without understanding the failure is how you create a pattern of intermittent failures that are never really fixed.
Sub-skill 4.2 covers a structured root cause analysis (RCA) approach for agent failures. The key artefacts are:
- Failure taxonomy — a classification of failure types (tool failure, context failure, instruction failure, environment failure, etc.)
- 5-Why analysis — a root cause drill-down that asks “why” five times to find the systemic cause
- RCA document — a written record of the findings, the root cause, and the fix applied
In GitHub Actions, forensics are collected with the GitHub CLI: gh run download fetches the artifacts uploaded by a failed run, and gh run view --log-failed prints the logs of its failed steps.
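The three artefacts above could be tied together in a single record. The sketch below is one possible shape, not a template taken from the quests: the `FailureType` enum mirrors the taxonomy listed earlier, while `RcaRecord`, its field names, and the markdown layout are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureType(Enum):
    """Failure taxonomy categories named in the text."""
    TOOL = "tool failure"
    CONTEXT = "context failure"
    INSTRUCTION = "instruction failure"
    ENVIRONMENT = "environment failure"


@dataclass
class RcaRecord:
    """One RCA document (hypothetical shape)."""
    run_id: str
    failure_type: FailureType
    whys: list[str] = field(default_factory=list)  # answers to each "why?"
    fix: str = ""

    @property
    def is_complete(self) -> bool:
        # A 5-Why analysis asks "why" five times before naming a root cause.
        return len(self.whys) >= 5

    @property
    def root_cause(self) -> str:
        if not self.is_complete:
            raise ValueError("5-Why analysis needs at least five answers")
        return self.whys[-1]

    def to_markdown(self) -> str:
        """Render the record as a short written RCA document."""
        lines = [
            f"RCA for run {self.run_id}",
            f"Failure type: {self.failure_type.value}",
            "5-Why analysis:",
        ]
        lines += [f"{i}. {why}" for i, why in enumerate(self.whys, start=1)]
        lines += [f"Root cause: {self.root_cause}", f"Fix applied: {self.fix}"]
        return "\n".join(lines)
```

Refusing to name a root cause until five answers are recorded is a deliberate constraint in this sketch: it discourages stopping at the first plausible symptom.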
Behaviour Tuning (Sub-skill 4.3)
After identifying root causes, you change the agent’s instructions to prevent recurrence. This is sub-skill 4.3: iterative instruction improvement.
The key discipline is treating instructions like code: version them, test changes, and keep a changelog. The Agentic Codex uses a docs/agent-instructions/CHANGELOG.md pattern where every instruction change is recorded with a before/after comparison and the metric it was targeting.
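To make the changelog idea concrete, here is a small sketch of how one entry could be generated. The text only says that each entry records a before/after comparison and the metric being targeted; the `InstructionChange` class, the `format_entry` function, and the entry layout are assumptions for illustration and may differ from the actual CHANGELOG.md format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InstructionChange:
    """One instruction change (hypothetical shape)."""
    changed_on: date
    before: str         # instruction text before the change
    after: str          # instruction text after the change
    target_metric: str  # the metric this change is meant to improve


def format_entry(change: InstructionChange) -> str:
    """Render a changelog entry with the before/after comparison."""
    return "\n".join([
        f"{change.changed_on.isoformat()}",
        f"- Target metric: {change.target_metric}",
        f"- Before: {change.before}",
        f"- After: {change.after}",
    ])


if __name__ == "__main__":
    # Illustrative values only; not taken from a real changelog.
    print(format_entry(InstructionChange(
        changed_on=date(2025, 1, 15),
        before="Run the tests.",
        after="Run the test suite and fix any failures before opening the PR.",
        target_metric="CI pass rate on first push",
    )))
```

Recording the target metric alongside the diff means a later reader can check whether the change achieved what it was meant to, rather than only seeing that the wording changed.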
Domain 4 Quests
| Quest | Skill | Link |
|---|---|---|
| Q11 | Success Criteria & Signals | Success Criteria & Signals |
| Q12 | Failure Root Cause Analysis | Failure Root Cause Analysis |
| Q13 | Behaviour Tuning | Behaviour Tuning |
These quests include the full success criteria schema, RCA template, instruction changelog pattern, and the measure_agent_baseline.sh script for establishing baselines before making changes.