Alerting Systems: Alertmanager, Routing, On-Call & Runbooks

Forge production alerting with Prometheus rules and Alertmanager: routing, grouping, silencing, inhibition, on-call escalation, and runbooks.

IT-Journey Team
Published Nov 29, 2025
Updated Jun 14, 2026

Estimated reading time: 24 minutes


🔥 Lvl 1010 Warrior · 🏰 Main Quest · 🔴 Hard · 75-90 minutes


Turn signals into actionable pages with alert rules, Alertmanager routing, on-call, and runbooks

Primary Tech: 🛠️ alertmanager
Skill Focus: Devops
Series: Observability Mastery
Author: IT-Journey Team
XP Range: ⚡ 4500-5250



Greetings, brave adventurer! From the Watchtower you have learned to see fires in metrics, in logs, and in traces. But sight alone does not save a kingdom; someone must be woken when the fire is real, and not woken for every harmless spark. This final quest of the Watchtower, Alerting Systems, teaches you to forge the signal-horn: alert rules that fire only on what matters, routing that wakes the right defender, and runbooks that tell them exactly what to do when the horn sounds.

Whether you have been paged at 3 a.m. for a problem that fixed itself, or you have watched a real outage go unnoticed because the alert was buried in noise, this adventure forges the discipline every on-call Warrior needs: actionable alert rules, Alertmanager routing, grouping and silencing, escalation, and the humble, life-saving runbook.

📖 The Legend Behind This Quest

In the early ages, an operator watched a dashboard with their own eyes and shouted when something broke. That does not scale past one tired human. The kingdoms that survived learned a hard truth: an alert that does not demand action is noise, and noise trains defenders to ignore the very horn that should save them. This is alert fatigue, and it is a common reason real outages go unnoticed.

Good alerting is therefore as much about what you do not alert on as what you do. This quest teaches you to wire Prometheus alert rules into Alertmanager, route each alert to the right receiver, silence the expected, suppress the redundant, and attach a runbook so the person you wake knows precisely what to do.

🎯 Quest Objectives

By the time you complete this journey, you will have mastered:

Primary Objectives (Required for Quest Completion)

Alert Rules - Write a Prometheus alerting rule with expr, for, labels, and annotations
Alertmanager Routing - Route alerts to receivers by label, with grouping to batch related alerts
Silencing & Inhibition - Mute expected alerts and suppress redundant ones during an outage
On-Call & Runbooks - Connect alerts to escalation and to a runbook the responder can follow

Secondary Objectives (Bonus Achievements)

Symptom-Based Alerting - Alert on user-visible symptoms, not raw resource causes
Burn-Rate Alerts - Tie pages to error-budget burn so they fire only when it matters
Escalation Policies - Page a secondary responder when the primary does not acknowledge

Mastery Indicators

You’ll know you’ve truly mastered this quest when you can:

Write an alert rule that fires only after a condition holds for several minutes
Route a severity: page alert to on-call and a severity: ticket alert elsewhere
Create a silence so a planned maintenance window stays quiet
Explain why every page must link to a runbook

🗺️ Quest Prerequisites

📋 Knowledge Requirements

SLIs, SLOs, and error budgets (complete Monitoring Fundamentals first)
Comfort reading and writing YAML
Basic PromQL, or willingness to read the worked examples carefully

🛠️ System Requirements

Modern operating system (Windows 10+, macOS 10.14+, or Linux)
Docker and Docker Compose installed
A terminal and a text editor or IDE (VS Code recommended)

🧠 Skill Level Indicators

This 🔴 Hard quest expects:

You can run a small monitoring stack and read a dashboard
You are ready to design alerts a human will trust at 3 a.m.
Ready for 75-90 minutes of focused, hands-on building

🌍 Choose Your Adventure Platform

Prometheus and Alertmanager run in containers everywhere; only Docker installation differs. Then everyone meets at the same docker compose up.

🍎 macOS Kingdom Path

```bash
brew install --cask docker
docker --version && docker compose version
# Bring up the lab (compose file below), then open the UIs:
open http://localhost:9090   # Prometheus
open http://localhost:9093   # Alertmanager
```

🪟 Windows Empire Path

```powershell
winget install Docker.DockerDesktop
docker --version; docker compose version
start http://localhost:9090   # Prometheus
start http://localhost:9093   # Alertmanager
```

🐧 Linux Territory Path

```bash
# docker-compose-plugin is published in Docker's official apt repository;
# add that repository first if apt cannot locate the package.
sudo apt update && sudo apt install -y docker.io docker-compose-plugin
sudo systemctl enable --now docker
xdg-open http://localhost:9090   # Prometheus
xdg-open http://localhost:9093   # Alertmanager
```

☁️ Cloud Realms Path

```bash
# In a Codespace or container host, run the same compose file and forward
# ports 9090 (Prometheus) and 9093 (Alertmanager) to your browser.
docker compose up -d
```
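The platform commands above all refer to a compose file for the lab. A minimal sketch of one, for any platform, might look like the following. It assumes that prometheus.yml, rules.yml, and alertmanager.yml sit in the same directory; the file layout and the unpinned image tags are assumptions for illustration, not the quest's official lab files.

```yaml
# docker-compose.yml: a two-container lab (sketch; file names are assumptions)
services:
  prometheus:
    image: prom/prometheus
    volumes:
      # The image's default command reads /etc/prometheus/prometheus.yml
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    volumes:
      # The image's default command reads /etc/alertmanager/alertmanager.yml
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
```

Replace any placeholder values (webhook URLs, integration keys) in the mounted configs before running docker compose up -d, since Alertmanager validates its configuration at startup.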

🧙‍♂️ Chapter 1: Anatomy of a Good Alert Rule

An alert begins as a question asked of your metrics at every rule evaluation (once a minute by default): “is this still true?” Learn to phrase that question so it fires on real problems and stays quiet otherwise.

⚔️ Skills You’ll Forge in This Chapter

The four parts of a Prometheus alerting rule
Why for: prevents flapping
Symptom-based versus cause-based alerting

🏗️ The Four Parts of a Rule

# rules.yml — a symptom-based alert wired into Prometheus
groups:
  - name: checkout-slos
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
          > 0.05
        for: 5m                       # condition must hold 5m before firing (anti-flap)
        labels:
          severity: page              # routing key Alertmanager uses
          team: payments
        annotations:
          summary: "Checkout 5xx error rate above 5%"
          description: "{{ $value | humanizePercentage }} of checkout requests are failing."
          runbook_url: "https://runbooks.example.com/checkout-5xx"

expr - the PromQL condition; every series it returns becomes an alert, which stays pending while the for timer runs.
for - how long the condition must persist before firing. This single line kills most flapping.
labels - metadata Alertmanager routes on (severity, team).
annotations - human-facing text, including the all-important runbook_url.
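For the rule file to take effect, Prometheus has to load it and know where Alertmanager lives. A minimal prometheus.yml sketch is shown below; the rule file path and the alertmanager:9093 hostname are assumptions that match a two-container Docker Compose lab, so adjust them to your own setup.

```yaml
# prometheus.yml: load the rules and point at Alertmanager (sketch)
global:
  scrape_interval: 15s
  evaluation_interval: 15s        # how often rules are evaluated (default is 1m)

rule_files:
  - /etc/prometheus/rules.yml     # assumed mount path for rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed compose service name

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']        # Prometheus scraping itself
```

Once Prometheus has started, the loaded rule groups and their state appear on the Alerts page of the Prometheus UI.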

Alert on symptoms, not causes. Page on “checkout error rate above 5%” (a user feels this), not “CPU at 80%” (a clue, not a problem). Users never notice your CPU; they notice failed checkouts.

🔍 Knowledge Check: Alert Rules

What does the for: clause prevent?
Why route on labels rather than the alert name?
Why is “5xx rate above 5%” a better page than “CPU above 80%”?

⚡ Quick Wins and Checkpoints

Wrote a rule: You can name expr, for, labels, annotations
Chose a symptom: You picked a user-visible condition to alert on
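One way to confirm that the for: clause behaves as intended is a promtool unit test. The sketch below assumes the rule above is saved as rules.yml next to the test file; the synthetic series (10% of requests failing) and the file name rules_test.yml are made up for illustration.

```yaml
# rules_test.yml: run with `promtool test rules rules_test.yml`
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Counters grow by 10 errors and 90 successes per minute -> 10% error rate
      - series: 'http_requests_total{service="checkout",status="500"}'
        values: '0+10x15'
      - series: 'http_requests_total{service="checkout",status="200"}'
        values: '0+90x15'
    alert_rule_test:
      # At 3m the condition is true but has not held for 5m: pending, not firing
      - eval_time: 3m
        alertname: CheckoutHighErrorRate
        exp_alerts: []
      # By 12m the condition has held long enough, so the alert fires
      - eval_time: 12m
        alertname: CheckoutHighErrorRate
        exp_alerts:
          - exp_labels:
              severity: page
              team: payments
            exp_annotations:
              summary: "Checkout 5xx error rate above 5%"
              description: "10% of checkout requests are failing."
              runbook_url: "https://runbooks.example.com/checkout-5xx"
```

promtool only counts firing alerts in exp_alerts, which is why the pending alert at 3m is expected to produce an empty list.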

🧙‍♂️ Chapter 2: Routing, Grouping, and Receivers with Alertmanager

Prometheus decides when an alert fires; Alertmanager decides who hears it and how. Its routing tree sends each alert to the right receiver, grouped so one incident is one notification, not fifty.

⚔️ Skills You’ll Forge in This Chapter

The Alertmanager routing tree
Grouping related alerts into a single notification
Configuring receivers (email, Slack, PagerDuty)

🏗️ A Routing Configuration

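# alertmanager.yml: route by label, group related alerts, wire receivers
global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder; email_configs needs an SMTP relay
  smtp_from: 'alertmanager@example.com'    # placeholder sender address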
route:
  receiver: default-email           # catch-all
  group_by: ['alertname', 'service'] # one notification per service incident
  group_wait: 30s                    # wait to batch alerts that fire together
  group_interval: 5m                 # wait before notifying about new alerts in an existing group
  repeat_interval: 4h                # re-notify if still firing after 4h
  routes:
    - matchers: [ 'severity = page' ] # pages go to on-call (e.g. PagerDuty)
      receiver: oncall-pagerduty
    - matchers: [ 'severity = ticket' ]
      receiver: ticket-slack

receivers:
  - name: default-email
    email_configs:
      - to: 'ops@example.com'
  - name: oncall-pagerduty
    pagerduty_configs:
      - service_key: '<integration-key>'
  - name: ticket-slack
    slack_configs:
      - channel: '#alerts-tickets'
        api_url: '<webhook-url>'

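Grouping is the unsung hero: if a database fails and twenty instances of a service error at once, group_by collapses every alert that shares the same alertname and service into one notification instead of twenty pages. Note that with service in group_by, twenty different failing services still form twenty groups; to batch across services, group by a label they share (such as cluster) or let inhibition mute the downstream alerts. group_wait gives related alerts a moment to arrive together.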
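The routing tree can also nest. As a sketch, the fragment below routes by team first and by severity second, and overrides group_by on one branch; the payments-slack and payments-pagerduty receiver names are hypothetical and would need matching entries under receivers.

```yaml
# alertmanager.yml (fragment): a nested routing tree (sketch)
route:
  receiver: default-email              # used when no child route matches
  group_by: ['alertname', 'service']
  routes:
    - matchers: [ 'team = payments' ]
      receiver: payments-slack         # hypothetical receiver
      group_by: ['cluster']            # batch this team's alerts per cluster
      routes:
        - matchers: [ 'severity = page' ]
          receiver: payments-pagerduty # hypothetical receiver
    - matchers: [ 'severity = ticket' ]
      receiver: ticket-slack
```

An alert walks the tree from the top and stops at the first matching child at each level, unless that route sets continue: true, in which case later siblings are checked as well. The deepest matching route decides the receiver.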

🔍 Knowledge Check: Routing and Grouping

How does Alertmanager decide which receiver an alert goes to?
What problem does group_by solve during a large outage?
What does repeat_interval control?

🧙‍♂️ Chapter 3: Silencing and Inhibition

Not every firing alert deserves a page. Silencing mutes alerts you already know about; inhibition suppresses lesser alerts when a bigger one is already firing.

⚔️ Skills You’ll Forge in This Chapter

Creating time-boxed silences
Writing inhibition rules
Keeping the on-call experience signal-rich

🏗️ Silencing for Planned Work

A silence is a label-matched mute with an expiry - perfect for maintenance windows. Create one from the CLI:

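# Silence all checkout alerts for a 2-hour maintenance window.
# Assumes amtool can reach Alertmanager, e.g. via
# --alertmanager.url=http://localhost:9093 or an amtool config file.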
amtool silence add \
  service=checkout \
  --duration=2h \
  --comment="Planned checkout DB migration — see CHG-4821" \
  --author="oncall@example.com"

# List and later expire active silences
amtool silence query
amtool silence expire <silence-id>
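For maintenance that recurs on a schedule, a one-off silence has to be recreated every time. Alertmanager can instead mute a route during named time intervals. The fragment below is a sketch under assumptions: the interval name, the Sunday 02:00-04:00 window, and the matched service are illustrative only.

```yaml
# alertmanager.yml (fragment): recurring mute for a maintenance window (sketch)
time_intervals:
  - name: weekly-db-maintenance        # hypothetical interval name
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: '02:00'        # times are interpreted as UTC by default
            end_time: '04:00'

route:
  receiver: default-email
  routes:
    - matchers: [ 'service = checkout' ]
      receiver: oncall-pagerduty
      mute_time_intervals: ['weekly-db-maintenance']
```

The mute is attached to a child route because the root route cannot carry mute intervals. Alerts still fire and stay visible in the UI; only their notifications are held back during the window.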

🏗️ Inhibition: Suppress the Redundant

When an entire cluster is down, you do not also want a page for every service inside it. Inhibition says “if the big alert is firing, mute the small ones it implies”:

# alertmanager.yml — suppress per-service alerts when the cluster is down
inhibit_rules:
  - source_matchers: [ 'alertname = ClusterDown' ]   # if this fires...
    target_matchers: [ 'severity = page' ]            # ...mute these
    equal: ['cluster']                                # ...for the same cluster

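Together, silencing (you act ahead of time) and inhibition (Alertmanager acts automatically) keep the pager meaningful. For equal to scope the rule properly, both the source and target alerts need the cluster label; Alertmanager treats a label that is missing on both sides as equal.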
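Inhibition is also useful between severities. As a sketch that reuses this quest's page and ticket severity labels, the rule below mutes a ticket-level alert whenever a page-level alert with the same alertname and service is already firing.

```yaml
# alertmanager.yml (fragment): a page silences its own lower-severity twin (sketch)
inhibit_rules:
  - source_matchers: [ 'severity = page' ]     # if a page is firing...
    target_matchers: [ 'severity = ticket' ]   # ...mute the matching ticket
    equal: ['alertname', 'service']            # ...for the same alert and service
```

This pattern only helps when a rule is defined at both severities, for example a ticket at a low threshold and a page at a higher one.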

🔍 Knowledge Check: Silencing and Inhibition

When would you create a silence rather than edit a rule?
What does inhibition prevent during a large, cascading failure?
Why should every silence have an expiry and a comment?

🧙‍♂️ Chapter 4: On-Call, Escalation, and Runbooks

An alert is only as useful as the human response behind it. On-call schedules decide who is woken; escalation ensures someone is woken; runbooks ensure they know what to do.

⚔️ Skills You’ll Forge in This Chapter

On-call rotations and escalation policies
Writing a runbook a half-asleep responder can follow
Closing the loop with postmortems

🏗️ Escalation Policies

An escalation policy defines what happens if the first responder does not acknowledge:

Escalation policy: payments-oncall
  Level 1: page primary on-call         -> wait 5 min for ack
  Level 2: page secondary on-call       -> wait 5 min for ack
  Level 3: page engineering manager + open an incident bridge

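This keeps a missed page from becoming a missed outage. Alertmanager itself has no concept of acknowledgment; it only delivers and repeats notifications. Its severity = page route therefore feeds an incident tool (PagerDuty, Opsgenie) that tracks acknowledgments and runs this policy.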

🏗️ The Runbook

Every paging alert must link to a runbook via runbook_url. A good runbook is short, specific, and written for 3 a.m.:

# Runbook: CheckoutHighErrorRate

## What this means
Checkout is returning 5xx to real users. Revenue is impacted right now.

## First checks (in order)
1. Open the checkout dashboard: <link>. Is the spike still climbing?
2. Check the payment-gateway status page: <link>. Provider outage?
3. Check recent deploys: `kubectl rollout history deploy/checkout`.

## Likely fixes
- Bad deploy in the last 30 min  -> `kubectl rollout undo deploy/checkout`
- Provider outage                -> flip to backup processor (feature flag PAY_FALLBACK)

## If unresolved in 15 min
Escalate to Level 2 and open an incident. Page #incident-bridge.

A runbook turns a panicked guess into a checklist. After the incident, a blameless postmortem feeds improvements back into both the rule and the runbook - closing the loop.
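The runbook_url annotation only helps if the responder actually sees it in the notification. One possible way to surface it, sketched here for the ticket-slack receiver, is to template the Slack message; the title and text layout are illustrative choices, not a required format.

```yaml
# alertmanager.yml (fragment): show the runbook link in the notification (sketch)
receivers:
  - name: ticket-slack
    slack_configs:
      - channel: '#alerts-tickets'
        api_url: '<webhook-url>'
        send_resolved: true
        title: '{{ .CommonLabels.alertname }} ({{ .Status }})'
        text: |-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
```

The range loop prints one summary and runbook link per alert in the group, so a grouped notification still points to the right runbook for each alert.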

🔍 Knowledge Check: On-Call and Runbooks

What does an escalation policy guarantee that a single page cannot?
What three things make a runbook usable at 3 a.m.?
How does a postmortem improve future alerting?

🎮 Mastery Challenges

🟢 Novice Challenge: Fire Your First Alert

Objective: Write a Prometheus alert rule that fires when a target is down, and watch it reach Alertmanager.

Requirements:

A rule using up == 0 with a for: clause and a severity label
The alert moves from pending to firing in the Prometheus UI
The firing alert appears in the Alertmanager UI

Validation: Stop a scrape target and see the alert fire end to end.
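A possible starting point for this challenge is sketched below. The group name, the 2m hold time, and the ticket severity are assumptions you can change; up is the metric Prometheus records for every scrape target.

```yaml
# rules.yml (fragment): fire when any scrape target stays down (sketch)
groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0                 # 0 means the last scrape of the target failed
        for: 2m                       # ignore a single missed scrape
        labels:
          severity: ticket
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```

After stopping a target, the alert should appear as pending in the Prometheus UI and move to firing once the for: period has elapsed.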

🟡 Intermediate Challenge: Route and Silence

Objective: Route severity: page and severity: ticket to different receivers, then silence one during “maintenance.”

Requirements:

Two routes sending to two receivers based on severity
Grouping configured so related alerts batch into one notification
A time-boxed silence that mutes the page during a window

Validation: The page is suppressed while the silence is active and resumes after it expires.

🔴 Advanced Challenge: Symptom-Based, Runbook-Ready Page

Objective: Build a burn-rate or error-rate alert that fires on a user-visible symptom and links a runbook.

Requirements:

The expr is a symptom (error rate / budget burn), not a resource threshold
The alert carries severity, team, and a runbook_url
An inhibition rule prevents a flood when a parent alert is already firing

Validation: A reviewer agrees the page is actionable, deduplicated, and would not fire on harmless blips.
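For the burn-rate variant, one sketch follows the multi-window pattern described in the SRE Workbook. It assumes a 99.9% availability SLO (an error budget of 0.001) and a fast-burn factor of 14.4, meaning errors arrive 14.4 times faster than the budget allows; the SLO target, alert name, and runbook URL are assumptions for illustration.

```yaml
# rules.yml (fragment): fast-burn page for an assumed 99.9% SLO (sketch)
groups:
  - name: checkout-burn-rate
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
          team: payments
        annotations:
          summary: "Checkout is burning its error budget 14.4x too fast"
          runbook_url: "https://runbooks.example.com/checkout-burn-rate"  # hypothetical URL
```

The long window shows the burn is sustained, while the short window lets the alert clear soon after the errors stop. At this rate an hour of burn consumes about 2% of a 30-day budget.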

🏆 Quest Rewards & Achievements

🎖️ Badges Earned:

🏆 Warden of the Pager - You built alerts humans actually trust
🔔 Master of the Routing - You tamed grouping, silencing, and escalation

🛠️ Skills Unlocked:

Alert Rule & Alertmanager Configuration - Fire, route, group, and mute with intent
On-Call and Runbook-Driven Incident Response - Wake the right person with the right plan

🔓 Unlocked Quests:

Security Fundamentals - Begin the Warrior tier’s next bastion, Security & Compliance

📊 Progression Points: +75 XP

🗺️ Next Steps in Your Journey

Continue the Main Story:

🎯 Security Fundamentals - Cross into the Security & Compliance tier

Explore Side Adventures:

⚔️ Distributed Tracing - Use traces to enrich incident response
⚔️ ELK Stack - Correlate alerts with the logs behind them

Character Class Recommendations

💻 Software Developer: Continue to Security Fundamentals
🏗️ System Engineer: Revisit Monitoring Fundamentals to tighten SLOs
🛡️ Security Specialist: Advance to Security Fundamentals

📚 Resources

Official Documentation

Prometheus Alerting Rules - expr, for, labels, annotations
Alertmanager Documentation - Routing, grouping, silencing, inhibition
amtool - The Alertmanager CLI used for silences
PagerDuty Integration Guide - Wiring Alertmanager to on-call

Community Resources

Google SRE Book - Being On-Call - The on-call chapter
My Philosophy on Alerting (Rob Ewaschuk) - The classic essay on actionable alerts
Awesome SRE - Curated reliability resources

Learning Materials

Multi-Window, Multi-Burn-Rate Alerts (SRE Workbook) - Budget-aware alerting in depth
Writing Runbooks - How to write an actionable runbook

🤝 Quest Completion Checklist

✅ Completed all primary objectives
✅ Wrote and fired a Prometheus alert rule
✅ Configured Alertmanager routing and grouping
✅ Created a silence and an inhibition rule
✅ Linked a paging alert to a runbook
✅ Identified your next quest in the journey

🕸️ Knowledge Graph

Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.

Level hub: [[Level 1010 - Monitoring & Observability]]
Overworld: [[🏰 Overworld - Master Quest Map]]
Requires: [[Monitoring Fundamentals: Metrics, Logs, and Traces for Observability]]
Unlocks: [[Security Fundamentals: CIA Triad and Defense in Depth Strategies]]
Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]

