Monitoring Fundamentals: Metrics, Logs, and Traces
Master the three pillars of observability—metrics, logs, and traces—for production-grade monitoring
Greetings, brave adventurer! You have climbed into the Warrior tier, and before you rises the Watchtower - the highest vantage in the realm of Monitoring & Observability. From here a vigilant Warrior sees fires before they spread, famine before the granaries empty, and invaders long before they breach the walls. This quest, Monitoring Fundamentals, hands you the spyglass, the signal-horn, and the war-map you will carry through every quest that follows.
Whether you are an engineer who has only ever stared at a single dashboard someone else built, or a seasoned operator ready to formalize the instincts you trust at 3 a.m., this adventure forges the mental model every Warrior of the Watchtower needs: the three pillars of observability, the language of SLIs and SLOs, and the discipline that keeps your alerts meaningful instead of maddening.
📖 The Legend Behind This Quest
In the early ages of computing, a single machine could be watched by a single pair of eyes. When the great cities of microservices rose - hundreds of services calling thousands of others across the cloud - no watcher could see it all. The operators who survived the chaos learned a single truth: you cannot fix what you cannot see, and you cannot see a distributed system without deliberate instrumentation.
This quest teaches the “why” beneath every dashboard, every alert, and every trace you will ever read. Master it, and the tool-quests that follow - Prometheus, Grafana, the ELK Stack, distributed tracing, and alerting systems - become instruments you wield with purpose rather than dashboards you merely glance at.
🎯 Quest Objectives
By the time you complete this journey, you will have mastered:
Primary Objectives (Required for Quest Completion)
- The Three Pillars - Explain metrics, logs, and traces and choose the right one for a given question
- SLI / SLO / SLA - Define a Service Level Indicator, turn it into an Objective, and reason about error budgets
- The RED and USE Methods - Apply two complementary frameworks for deciding what to measure
- Alert Fatigue - Recognize why noisy alerting fails and design alerts that humans actually trust
Secondary Objectives (Bonus Achievements)
- Cardinality Awareness - Understand why high-cardinality labels can sink a metrics system
- The Observability vs. Monitoring Distinction - Articulate what observability adds beyond classic monitoring
- Golden Signals - Map Google’s four golden signals onto a service you operate
Mastery Indicators
You’ll know you’ve truly mastered this quest when you can:
- Explain the three pillars to another person without notes
- Write one SLI and one SLO for a real service and compute its error budget
- Decide whether RED or USE is the better lens for a given component
- Critique an existing alert and say whether it is actionable
🗺️ Quest Prerequisites
📋 Knowledge Requirements
- Basic understanding of how web requests, servers, and services interact
- Comfort reading YAML and simple shell commands
- Familiarity with operating at least one running service
🛠️ System Requirements
- Modern operating system (Windows 10+, macOS 10.14+, or Linux)
- A terminal and a text editor or IDE (VS Code recommended)
- Optional: Docker, for the local hands-on lab
🧠 Skill Level Indicators
This 🟡 Medium quest expects:
- You have run or maintained at least one service or application
- You are ready to think about systems in terms of health, not just features
- Ready for 90-120 minutes of focused learning
🌍 Choose Your Adventure Platform
The concepts here are platform-independent, but the optional lab spins up a tiny metrics endpoint you can scrape. Choose the path that fits your setup.
🍎 macOS Kingdom Path
Click to expand macOS instructions
```bash # A one-file way to emit Prometheus-style metrics for inspection. # Install a lightweight HTTP tool and serve a sample metrics page. brew install curl # Run a throwaway node_exporter container to see real metrics docker run --rm -d -p 9100:9100 prom/node-exporter # Inspect the raw metrics exposition format curl -s http://localhost:9100/metrics | head -n 20 ```🪟 Windows Empire Path
Click to expand Windows instructions
```powershell # Run a throwaway metrics exporter and read its output docker run --rm -d -p 9100:9100 prom/node-exporter # Inspect the exposition format curl.exe -s http://localhost:9100/metrics | Select-Object -First 20 ```🐧 Linux Territory Path
Click to expand Linux instructions
```bash # Install Docker via your distribution if needed sudo apt update && sudo apt install -y docker.io # Debian/Ubuntu sudo systemctl enable --now docker # Run the node exporter and read the raw metrics sudo docker run --rm -d -p 9100:9100 prom/node-exporter curl -s http://localhost:9100/metrics | head -n 20 ```☁️ Cloud Realms Path
Click to expand Cloud/Container instructions
```bash # In a Codespace or any container runtime the same image works. docker run --rm -d -p 9100:9100 prom/node-exporter # Forward port 9100 to your browser via your platform's port forwarding, # then open /metrics to read the exposition format. ```🧙♂️ Chapter 1: The Three Pillars of Observability
Every question you will ever ask about a running system can be answered by one of three signal types. Learn to reach for the right one and you have a lens for the entire field.
⚔️ Skills You’ll Forge in This Chapter
- The meaning of metrics, logs, and traces
- Which pillar answers which kind of question
- The trade-offs in cost, cardinality, and detail
🏗️ The Three Pillars
| Pillar | Question it answers | Shape of the data | Strength | Cost driver |
|---|---|---|---|---|
| Metrics | “Is something wrong, and by how much?” | Numeric time series with labels | Cheap, aggregatable, great for alerting | Label cardinality |
| Logs | “What exactly happened in this event?” | Timestamped, structured records | Rich detail, full context | Volume and retention |
| Traces | “Where did the time go across services?” | A tree of timed spans per request | Pinpoints latency in distributed calls | Sampling and storage |
A useful rule of thumb: metrics tell you that something is wrong, logs tell you what, and traces tell you where. You will almost always start at a dashboard (metrics), drill into the offending time window (logs), and follow a slow request across services (traces).
Here is the difference made concrete. A metric is a number over time:
http_requests_total{service="checkout", status="500"} = 1423
A structured log is a single rich event:
{ "ts": "2026-06-14T10:31:02Z", "level": "error", "service": "checkout",
"msg": "payment gateway timeout", "order_id": "ord_91af", "duration_ms": 5012 }
A trace is a tree of spans showing where time was spent:
checkout-request (520ms)
├─ auth-service (12ms)
├─ inventory-check (40ms)
└─ payment-gateway (455ms) <-- the latency lives here
🔍 Knowledge Check: The Three Pillars
- Which pillar would you alert on, and why not the others?
- Why are traces the right tool when latency is spread across many services?
- What makes logs expensive to keep at high volume?
⚡ Quick Wins and Checkpoints
- Read real metrics: You inspected the exposition format from the lab
- Classified a question: You named which pillar answers a given question
🧙♂️ Chapter 2: SLIs, SLOs, SLAs, and Error Budgets
A Warrior who watches everything watches nothing. Service Level discipline tells you which numbers actually matter to your users and how much imperfection you can afford.
⚔️ Skills You’ll Forge in This Chapter
- The difference between an SLI, an SLO, and an SLA
- How to compute an error budget
- Why error budgets create a shared language between developers and operators
🏗️ The Service Level Stack
- SLI (Indicator) - a measured number that reflects user happiness, e.g. the proportion of HTTP requests that succeed.
- SLO (Objective) - a target for an SLI over a window, e.g. 99.9% of requests succeed over 30 days.
- SLA (Agreement) - a contractual promise with consequences, usually looser than the SLO (e.g. 99.5%) so you have internal headroom.
A good availability SLI is a ratio of good events to valid events:
SLI = good_requests / valid_requests
= (total_requests - requests_returning_5xx) / total_requests
An error budget is simply 100% - SLO. At a 99.9% SLO over 30 days:
Total minutes in 30 days = 43,200
Allowed downtime (0.1%) = 43.2 minutes
That 43.2 minutes IS your error budget for the month.
Spend the budget on risky deploys and experiments. When it runs out, you freeze risky changes and pour effort into reliability until it recovers. This turns “should we ship?” from an argument into arithmetic.
# A simple SLO definition you can keep in version control
service: checkout
sli:
type: availability
good_events: requests with status < 500
valid_events: all requests
slo:
target: 99.9
window: 30d
error_budget_minutes: 43.2
🔍 Knowledge Check: Service Levels
- Why is the SLA usually a looser number than the SLO?
- Compute the monthly error budget for a 99.95% SLO
- How does an exhausted error budget change a team’s behavior?
🧙♂️ Chapter 3: The RED Method, the USE Method, and Alert Fatigue
Two famous frameworks tell you what to measure. RED watches request-driven services; USE watches the resources beneath them. Together they cover almost everything you operate.
⚔️ Skills You’ll Forge in This Chapter
- The RED method for services
- The USE method for resources
- How to design alerts that humans trust
🏗️ RED and USE
RED (for request-driven services - APIs, web servers):
- Rate - requests per second
- Errors - failed requests per second
- Duration - distribution of request latency (use percentiles, not averages)
USE (for resources - CPU, memory, disks, queues):
- Utilization - percent of time the resource was busy
- Saturation - how much extra work is queued and waiting
- Errors - count of error events
Google’s four golden signals - latency, traffic, errors, and saturation - are a closely related checklist; if you can see those four for every service, you are in good shape.
Fighting Alert Fatigue
An alert that does not demand human action is noise. Noise trains your team to ignore the very pages that matter. Three rules keep alerts trustworthy:
- Alert on symptoms, not causes. Page on “checkout error rate exceeds the SLO,” not on “CPU is at 80%.” Users feel symptoms; CPU is just a clue.
- Every page must be actionable. If there is nothing a human can do right now, it is a ticket or a dashboard, not a page.
- Tie alerts to error budgets. A multi-window burn-rate alert fires when you are spending budget fast enough to matter, and stays quiet for harmless blips.
# A symptom-based, budget-aware alert (Prometheus-style pseudocode)
alert: CheckoutErrorBudgetBurnFast
expr: |
(
sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))
) > (14.4 * 0.001) # 14.4x burn of a 99.9% SLO over a short window
for: 2m
labels:
severity: page
annotations:
summary: "Checkout is burning its error budget fast"
runbook: "https://runbooks.example.com/checkout-5xx"
🔍 Knowledge Check: RED, USE, and Alerting
- When would you reach for USE instead of RED?
- Why alert on the error rate rather than CPU utilization?
- What makes a page actionable, and why does that matter at 3 a.m.?
🎮 Mastery Challenges
🟢 Novice Challenge: Classify the Questions
Objective: Take five questions you might ask about a service and label each as a metrics, logs, or traces question.
Requirements:
- At least five distinct questions
- A one-line justification for each pillar choice
- At least one question per pillar
Validation: Each choice survives the “that tells me / what / where” rule of thumb.
🟡 Intermediate Challenge: Write a Real SLO
Objective: Pick a service you operate or use daily and write an availability SLO with an error budget.
Requirements:
- A clearly defined SLI (good events over valid events)
- An SLO target and window
- The computed error-budget minutes for that window
Validation: You can defend the target number in one sentence and state what happens when the budget is exhausted.
🔴 Advanced Challenge: Design a Trustworthy Alert
Objective: Write one symptom-based, budget-aware alert for the same service and explain why it will not contribute to alert fatigue.
Requirements:
- The alert fires on a user-visible symptom
- It references an SLO or burn rate, not a raw resource threshold
- It links to a runbook and has a clear severity
Validation: A reviewer agrees the alert is actionable and would not fire on harmless blips.
🏆 Quest Rewards & Achievements
🎖️ Badges Earned:
- 🏆 Watchkeeper - You internalized the three pillars of observability
- 🛡️ Keeper of the Signal - You can define SLIs, SLOs, and error budgets
🛠️ Skills Unlocked:
- Observability Strategy - Choose the right signal for the right question
- SLO and Error-Budget Thinking - Turn reliability into arithmetic
🔓 Unlocked Quests:
- Prometheus & Grafana - Collect and visualize metrics
- ELK Stack - Centralize and search your logs
- Distributed Tracing - Follow a request across services
- Alerting Systems - Route, silence, and respond to the alerts you design
📊 Progression Points: +75 XP
🗺️ Next Steps in Your Journey
Continue the Main Story:
- 🎯 Prometheus & Grafana - Build the metrics pillar for real
Explore Side Adventures:
- ⚔️ ELK Stack - Master the logs pillar
- ⚔️ Distributed Tracing - Master the traces pillar
Character Class Recommendations
💻 Software Developer: Continue to Distributed Tracing
🏗️ System Engineer: Explore Prometheus & Grafana
🛡️ Security Specialist: Advance to Alerting Systems
📚 Resources
Official Documentation
- Google SRE Book - Service Level Objectives - The canonical SLO chapter
- Prometheus - Metric Types - Counters, gauges, histograms, summaries
- OpenTelemetry - Observability Primer - The three pillars in one place
Community Resources
- The RED Method (Weaveworks/Grafana) - Rate, Errors, Duration
- The USE Method (Brendan Gregg) - Utilization, Saturation, Errors
- Google SRE - Monitoring Distributed Systems - The four golden signals
Learning Materials
- Implementing SLOs (Google SRE Workbook) - Error budgets in practice
- My Philosophy on Alerting (Rob Ewaschuk) - The classic essay on actionable alerts
🤝 Quest Completion Checklist
- ✅ Completed all primary objectives
- ✅ Wrote an SLO and computed its error budget
- ✅ Answered all knowledge check questions
- ✅ Completed at least one mastery challenge
- ✅ Explored the resource library
- ✅ Identified your next quest in the journey
🕸️ Knowledge Graph
Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.
Level hub: [[Level 1010 - Monitoring & Observability]] Overworld: [[🏰 Overworld - Master Quest Map]] Unlocks: [[Prometheus & Grafana: Metrics Collection and Visualization]] · [[ELK Stack: Elasticsearch, Logstash, and Kibana for Log Analysis]] · [[Distributed Tracing: OpenTelemetry and Jaeger]] · [[Alerting Systems: Alertmanager, Routing, and On-Call]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]
🎁 Rewards
Badges
- 🏆 Watchkeeper - Internalized the three pillars of observability
- 🛡️ Keeper of the Signal - Can define SLIs, SLOs, and error budgets
Skills unlocked
- 🛠️ Observability Strategy
- 🧠 SLO and Error-Budget Thinking
Features unlocked
- Access to the rest of the Level 1010 Monitoring & Observability quest line
🕸️ Quest Network
Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.
Referenced by
- Loading…