Monitoring Fundamentals: Metrics, Logs, and Traces

Master the three pillars of observability—metrics, logs, and traces—plus SLI/SLO/SLA, the RED and USE methods, and fighting alert fatigue.

IT-Journey Team
Published Nov 29, 2025
Updated Jun 14, 2026
Quests
🟡 medium
1010 monitoring observability main_quest devops +2
View source

Estimated reading time: 23 minutes

Edit on GitHub

🔥 Lvl 1010Warrior 🏰 Main Quest 🟡 Medium 90-120 minutes

Monitoring Fundamentals: Metrics, Logs, and Traces

Master the three pillars of observability—metrics, logs, and traces—for production-grade monitoring

Primary Tech: 🛠️ observability
Skill Focus: Devops
Series: Observability Mastery
Author: IT-Journey Team
XP Range: ⚡ 4500-5250

📈 Your Progress

Not started · 0%

Your progress is stored in your browser only. Use your inventory to back it up.

🗝️ Prerequisites

Knowledge requirements

Basic command line navigation
Comfort reading YAML and a little code (examples use shell and YAML)
General understanding of how web services and HTTP requests work

System requirements

Modern OS (macOS, Windows 10+, Linux)
A terminal and a text editor or IDE (VS Code recommended)
Optional Docker for the hands-on lab

Greetings, brave adventurer! You have climbed into the Warrior tier, and before you rises the Watchtower - the highest vantage in the realm of Monitoring & Observability. From here a vigilant Warrior sees fires before they spread, famine before the granaries empty, and invaders long before they breach the walls. This quest, Monitoring Fundamentals, hands you the spyglass, the signal-horn, and the war-map you will carry through every quest that follows.

Whether you are an engineer who has only ever stared at a single dashboard someone else built, or a seasoned operator ready to formalize the instincts you trust at 3 a.m., this adventure forges the mental model every Warrior of the Watchtower needs: the three pillars of observability, the language of SLIs and SLOs, and the discipline that keeps your alerts meaningful instead of maddening.

📖 The Legend Behind This Quest

In the early ages of computing, a single machine could be watched by a single pair of eyes. When the great cities of microservices rose - hundreds of services calling thousands of others across the cloud - no watcher could see it all. The operators who survived the chaos learned a single truth: you cannot fix what you cannot see, and you cannot see a distributed system without deliberate instrumentation.

This quest teaches the “why” beneath every dashboard, every alert, and every trace you will ever read. Master it, and the tool-quests that follow - Prometheus, Grafana, the ELK Stack, distributed tracing, and alerting systems - become instruments you wield with purpose rather than dashboards you merely glance at.

🎯 Quest Objectives

By the time you complete this journey, you will have mastered:

Primary Objectives (Required for Quest Completion)

The Three Pillars - Explain metrics, logs, and traces and choose the right one for a given question
SLI / SLO / SLA - Define a Service Level Indicator, turn it into an Objective, and reason about error budgets
The RED and USE Methods - Apply two complementary frameworks for deciding what to measure
Alert Fatigue - Recognize why noisy alerting fails and design alerts that humans actually trust

Secondary Objectives (Bonus Achievements)

Cardinality Awareness - Understand why high-cardinality labels can sink a metrics system
The Observability vs. Monitoring Distinction - Articulate what observability adds beyond classic monitoring
Golden Signals - Map Google’s four golden signals onto a service you operate

Mastery Indicators

You’ll know you’ve truly mastered this quest when you can:

Explain the three pillars to another person without notes
Write one SLI and one SLO for a real service and compute its error budget
Decide whether RED or USE is the better lens for a given component
Critique an existing alert and say whether it is actionable

🗺️ Quest Prerequisites

📋 Knowledge Requirements

Basic understanding of how web requests, servers, and services interact
Comfort reading YAML and simple shell commands
Familiarity with operating at least one running service

🛠️ System Requirements

Modern operating system (Windows 10+, macOS 10.14+, or Linux)
A terminal and a text editor or IDE (VS Code recommended)
Optional: Docker, for the local hands-on lab

🧠 Skill Level Indicators

This 🟡 Medium quest expects:

You have run or maintained at least one service or application
You are ready to think about systems in terms of health, not just features
Ready for 90-120 minutes of focused learning

🌍 Choose Your Adventure Platform

The concepts here are platform-independent, but the optional lab spins up a tiny metrics endpoint you can scrape. Choose the path that fits your setup.

🍎 macOS Kingdom Path

Click to expand macOS instructions

```bash # A one-file way to emit Prometheus-style metrics for inspection. # Install a lightweight HTTP tool and serve a sample metrics page. brew install curl # Run a throwaway node_exporter container to see real metrics docker run --rm -d -p 9100:9100 prom/node-exporter # Inspect the raw metrics exposition format curl -s http://localhost:9100/metrics | head -n 20 ```

🪟 Windows Empire Path

Click to expand Windows instructions

```powershell # Run a throwaway metrics exporter and read its output docker run --rm -d -p 9100:9100 prom/node-exporter # Inspect the exposition format curl.exe -s http://localhost:9100/metrics | Select-Object -First 20 ```

🐧 Linux Territory Path

Click to expand Linux instructions

```bash # Install Docker via your distribution if needed sudo apt update && sudo apt install -y docker.io # Debian/Ubuntu sudo systemctl enable --now docker # Run the node exporter and read the raw metrics sudo docker run --rm -d -p 9100:9100 prom/node-exporter curl -s http://localhost:9100/metrics | head -n 20 ```

☁️ Cloud Realms Path

Click to expand Cloud/Container instructions

```bash # In a Codespace or any container runtime the same image works. docker run --rm -d -p 9100:9100 prom/node-exporter # Forward port 9100 to your browser via your platform's port forwarding, # then open /metrics to read the exposition format. ```

🧙‍♂️ Chapter 1: The Three Pillars of Observability

Every question you will ever ask about a running system can be answered by one of three signal types. Learn to reach for the right one and you have a lens for the entire field.

⚔️ Skills You’ll Forge in This Chapter

The meaning of metrics, logs, and traces
Which pillar answers which kind of question
The trade-offs in cost, cardinality, and detail

🏗️ The Three Pillars

Pillar	Question it answers	Shape of the data	Strength	Cost driver
Metrics	“Is something wrong, and by how much?”	Numeric time series with labels	Cheap, aggregatable, great for alerting	Label cardinality
Logs	“What exactly happened in this event?”	Timestamped, structured records	Rich detail, full context	Volume and retention
Traces	“Where did the time go across services?”	A tree of timed spans per request	Pinpoints latency in distributed calls	Sampling and storage

A useful rule of thumb: metrics tell you that something is wrong, logs tell you what, and traces tell you where. You will almost always start at a dashboard (metrics), drill into the offending time window (logs), and follow a slow request across services (traces).

Here is the difference made concrete. A metric is a number over time:

http_requests_total{service="checkout", status="500"} = 1423

A structured log is a single rich event:

{ "ts": "2026-06-14T10:31:02Z", "level": "error", "service": "checkout",
  "msg": "payment gateway timeout", "order_id": "ord_91af", "duration_ms": 5012 }

A trace is a tree of spans showing where time was spent:

checkout-request  (520ms)
├─ auth-service     (12ms)
├─ inventory-check  (40ms)
└─ payment-gateway  (455ms)   <-- the latency lives here

🔍 Knowledge Check: The Three Pillars

Which pillar would you alert on, and why not the others?
Why are traces the right tool when latency is spread across many services?
What makes logs expensive to keep at high volume?

⚡ Quick Wins and Checkpoints

Read real metrics: You inspected the exposition format from the lab
Classified a question: You named which pillar answers a given question

🧙‍♂️ Chapter 2: SLIs, SLOs, SLAs, and Error Budgets

A Warrior who watches everything watches nothing. Service Level discipline tells you which numbers actually matter to your users and how much imperfection you can afford.

⚔️ Skills You’ll Forge in This Chapter

The difference between an SLI, an SLO, and an SLA
How to compute an error budget
Why error budgets create a shared language between developers and operators

🏗️ The Service Level Stack

SLI (Indicator) - a measured number that reflects user happiness, e.g. the proportion of HTTP requests that succeed.
SLO (Objective) - a target for an SLI over a window, e.g. 99.9% of requests succeed over 30 days.
SLA (Agreement) - a contractual promise with consequences, usually looser than the SLO (e.g. 99.5%) so you have internal headroom.

A good availability SLI is a ratio of good events to valid events:

SLI = good_requests / valid_requests
    = (total_requests - requests_returning_5xx) / total_requests

An error budget is simply 100% - SLO. At a 99.9% SLO over 30 days:

Total minutes in 30 days        = 43,200
Allowed downtime (0.1%)         = 43.2 minutes
That 43.2 minutes IS your error budget for the month.

Spend the budget on risky deploys and experiments. When it runs out, you freeze risky changes and pour effort into reliability until it recovers. This turns “should we ship?” from an argument into arithmetic.

# A simple SLO definition you can keep in version control
service: checkout
sli:
  type: availability
  good_events: requests with status < 500
  valid_events: all requests
slo:
  target: 99.9
  window: 30d
error_budget_minutes: 43.2

🔍 Knowledge Check: Service Levels

Why is the SLA usually a looser number than the SLO?
Compute the monthly error budget for a 99.95% SLO
How does an exhausted error budget change a team’s behavior?

🧙‍♂️ Chapter 3: The RED Method, the USE Method, and Alert Fatigue

Two famous frameworks tell you what to measure. RED watches request-driven services; USE watches the resources beneath them. Together they cover almost everything you operate.

⚔️ Skills You’ll Forge in This Chapter

The RED method for services
The USE method for resources
How to design alerts that humans trust

🏗️ RED and USE

RED (for request-driven services - APIs, web servers):

Rate - requests per second
Errors - failed requests per second
Duration - distribution of request latency (use percentiles, not averages)

USE (for resources - CPU, memory, disks, queues):

Utilization - percent of time the resource was busy
Saturation - how much extra work is queued and waiting
Errors - count of error events

Google’s four golden signals - latency, traffic, errors, and saturation - are a closely related checklist; if you can see those four for every service, you are in good shape.

Fighting Alert Fatigue

An alert that does not demand human action is noise. Noise trains your team to ignore the very pages that matter. Three rules keep alerts trustworthy:

Alert on symptoms, not causes. Page on “checkout error rate exceeds the SLO,” not on “CPU is at 80%.” Users feel symptoms; CPU is just a clue.
Every page must be actionable. If there is nothing a human can do right now, it is a ticket or a dashboard, not a page.
Tie alerts to error budgets. A multi-window burn-rate alert fires when you are spending budget fast enough to matter, and stays quiet for harmless blips.

# A symptom-based, budget-aware alert (Prometheus-style pseudocode)
alert: CheckoutErrorBudgetBurnFast
expr: |
  (
    sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m]))
  ) > (14.4 * 0.001)   # 14.4x burn of a 99.9% SLO over a short window
for: 2m
labels:
  severity: page
annotations:
  summary: "Checkout is burning its error budget fast"
  runbook: "https://runbooks.example.com/checkout-5xx"

🔍 Knowledge Check: RED, USE, and Alerting

When would you reach for USE instead of RED?
Why alert on the error rate rather than CPU utilization?
What makes a page actionable, and why does that matter at 3 a.m.?

🎮 Mastery Challenges

🟢 Novice Challenge: Classify the Questions

Objective: Take five questions you might ask about a service and label each as a metrics, logs, or traces question.

Requirements:

At least five distinct questions
A one-line justification for each pillar choice
At least one question per pillar

Validation: Each choice survives the “that tells me / what / where” rule of thumb.

🟡 Intermediate Challenge: Write a Real SLO

Objective: Pick a service you operate or use daily and write an availability SLO with an error budget.

Requirements:

A clearly defined SLI (good events over valid events)
An SLO target and window
The computed error-budget minutes for that window

Validation: You can defend the target number in one sentence and state what happens when the budget is exhausted.

🔴 Advanced Challenge: Design a Trustworthy Alert

Objective: Write one symptom-based, budget-aware alert for the same service and explain why it will not contribute to alert fatigue.

Requirements:

The alert fires on a user-visible symptom
It references an SLO or burn rate, not a raw resource threshold
It links to a runbook and has a clear severity

Validation: A reviewer agrees the alert is actionable and would not fire on harmless blips.

🏆 Quest Rewards & Achievements

🎖️ Badges Earned:

🏆 Watchkeeper - You internalized the three pillars of observability
🛡️ Keeper of the Signal - You can define SLIs, SLOs, and error budgets

🛠️ Skills Unlocked:

Observability Strategy - Choose the right signal for the right question
SLO and Error-Budget Thinking - Turn reliability into arithmetic

🔓 Unlocked Quests:

Prometheus & Grafana - Collect and visualize metrics
ELK Stack - Centralize and search your logs
Distributed Tracing - Follow a request across services
Alerting Systems - Route, silence, and respond to the alerts you design

📊 Progression Points: +75 XP

🗺️ Next Steps in Your Journey

Continue the Main Story:

🎯 Prometheus & Grafana - Build the metrics pillar for real

Explore Side Adventures:

⚔️ ELK Stack - Master the logs pillar
⚔️ Distributed Tracing - Master the traces pillar

Character Class Recommendations

💻 Software Developer: Continue to Distributed Tracing
🏗️ System Engineer: Explore Prometheus & Grafana
🛡️ Security Specialist: Advance to Alerting Systems

📚 Resources

Official Documentation

Google SRE Book - Service Level Objectives - The canonical SLO chapter
Prometheus - Metric Types - Counters, gauges, histograms, summaries
OpenTelemetry - Observability Primer - The three pillars in one place

Community Resources

The RED Method (Weaveworks/Grafana) - Rate, Errors, Duration
The USE Method (Brendan Gregg) - Utilization, Saturation, Errors
Google SRE - Monitoring Distributed Systems - The four golden signals

Learning Materials

Implementing SLOs (Google SRE Workbook) - Error budgets in practice
My Philosophy on Alerting (Rob Ewaschuk) - The classic essay on actionable alerts

🤝 Quest Completion Checklist

✅ Completed all primary objectives
✅ Wrote an SLO and computed its error budget
✅ Answered all knowledge check questions
✅ Completed at least one mastery challenge
✅ Explored the resource library
✅ Identified your next quest in the journey

🕸️ Knowledge Graph

Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.

Level hub: [[Level 1010 - Monitoring & Observability]] Overworld: [[🏰 Overworld - Master Quest Map]] Unlocks: [[Prometheus & Grafana: Metrics Collection and Visualization]] · [[ELK Stack: Elasticsearch, Logstash, and Kibana for Log Analysis]] · [[Distributed Tracing: OpenTelemetry and Jaeger]] · [[Alerting Systems: Alertmanager, Routing, and On-Call]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]

🎁 Rewards

75 XP

Badges

🏆 Watchkeeper - Internalized the three pillars of observability
🛡️ Keeper of the Signal - Can define SLIs, SLOs, and error budgets

Skills unlocked

🛠️ Observability Strategy
🧠 SLO and Error-Budget Thinking

Features unlocked

Access to the rest of the Level 1010 Monitoring & Observability quest line

Unlocks

🕸️ Quest Network

Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.

Referenced by

Loading…

Layout	`quest`
Collection	`quests`
Path	`_quests/1010/monitoring-fundamentals.md`
URL	`/quests/1010/monitoring-fundamentals/`
Date	`2025-11-29`

Settings

Color Mode

Theme Skin

Background

Environment

Theme & Build

Page Location

Page Info

Source Code

Monitoring Fundamentals: Metrics, Logs, and Traces

Table of Contents

📖 The Legend Behind This Quest

🎯 Quest Objectives

Primary Objectives (Required for Quest Completion)

Secondary Objectives (Bonus Achievements)

Mastery Indicators

🗺️ Quest Prerequisites

📋 Knowledge Requirements

🛠️ System Requirements

🧠 Skill Level Indicators

🌍 Choose Your Adventure Platform

🍎 macOS Kingdom Path

🪟 Windows Empire Path

🐧 Linux Territory Path

☁️ Cloud Realms Path

🧙‍♂️ Chapter 1: The Three Pillars of Observability

⚔️ Skills You’ll Forge in This Chapter

🏗️ The Three Pillars

🔍 Knowledge Check: The Three Pillars

⚡ Quick Wins and Checkpoints

🧙‍♂️ Chapter 2: SLIs, SLOs, SLAs, and Error Budgets

⚔️ Skills You’ll Forge in This Chapter

🏗️ The Service Level Stack

🔍 Knowledge Check: Service Levels

🧙‍♂️ Chapter 3: The RED Method, the USE Method, and Alert Fatigue

⚔️ Skills You’ll Forge in This Chapter

🏗️ RED and USE

Fighting Alert Fatigue

🔍 Knowledge Check: RED, USE, and Alerting

🎮 Mastery Challenges

🟢 Novice Challenge: Classify the Questions

🟡 Intermediate Challenge: Write a Real SLO

🔴 Advanced Challenge: Design a Trustworthy Alert

🏆 Quest Rewards & Achievements

🗺️ Next Steps in Your Journey

Character Class Recommendations

📚 Resources

Official Documentation

Community Resources

Learning Materials

🤝 Quest Completion Checklist

🕸️ Knowledge Graph

🎁 Rewards

Badges

Skills unlocked

Features unlocked

Unlocks

🕸️ Quest Network

Referenced by