Stream Processing: Real-Time Data with Apache Kafka & Flink
Build real-time pipelines with Kafka, windowing, event-time processing, and exactly-once delivery
Greetings, brave adventurer! In every prior quest your data sat still - a file, a table, a finished batch - waiting patiently to be processed. But the real kingdom never sleeps: orders, clicks, sensor readings, and log events pour in without end, and your analysts want answers *now, not tomorrow. This quest, Stream Processing, teaches you to build aqueducts for water that never stops flowing - pipelines that compute over an endless river of events in real time.*
Whether you have only ever run nightly batch jobs, or you already pipe events through a queue and wrestle with late-arriving data, this adventure forges the discipline every real-time data engineer needs: the batch-versus-stream mindset, Kafka’s log abstraction, windowing over event time, watermarks for late data, and the holy grail of exactly-once processing.
📖 The Legend Behind This Quest
In the early ages, all data engineering was batch: collect a day of data, process it overnight, report in the morning. But fraud cannot wait until morning; a dashboard of yesterday’s sales is a museum piece. The kingdoms that thrived learned to process events as they arrive, treating data not as a finished pile but as an infinite, ordered log.
Apache Kafka became the backbone of that revolution - a durable, partitioned, replayable log that decouples the producers of events from their consumers. Stream processors like Flink and Kafka Streams compute over that log continuously. But streaming introduces a brutal new challenge absent from batch: events arrive out of order and late, so you must reason carefully about *which clock you trust. This quest teaches you to master that clock.*
🎯 Quest Objectives
By the time you complete this journey, you will have mastered:
Primary Objectives (Required for Quest Completion)
- Batch vs Stream - Explain the differences and choose the right model for a workload
- Kafka Fundamentals - Understand topics, partitions, offsets, producers, and consumer groups
- Event Time vs Processing Time - Distinguish when an event happened from when it was processed
- Windowing - Aggregate an unbounded stream into tumbling, sliding, and session windows
Secondary Objectives (Bonus Achievements)
- Watermarks & Late Data - Handle out-of-order events with watermarks and allowed lateness
- Exactly-Once Semantics - Explain how idempotent producers and transactions prevent duplicates
- Backpressure & Replay - Reason about consumer lag and replaying from an offset
Mastery Indicators
You’ll know you’ve truly mastered this quest when you can:
- Decide whether a problem needs batch or streaming and defend the choice
- Produce to and consume from a partitioned Kafka topic
- Explain why a windowed count differs under event time versus processing time
- Describe the three guarantees - at-most-once, at-least-once, exactly-once
🗺️ Quest Prerequisites
📋 Knowledge Requirements
- Comfortable writing Python functions and using
pippackages - The ETL stages and idempotency (complete ETL Pipeline Design first)
- A mental model of message queues and how services pass messages
🛠️ System Requirements
- Modern operating system (Windows 10+, macOS 10.14+, or Linux)
- Docker and Docker Compose for a local Kafka broker
- Python 3.10+ with
pipandvenv
🧠 Skill Level Indicators
This 🔴 Hard quest expects:
- You can build and run a small Python data script end to end
- You are ready to reason about unbounded, out-of-order data
- Ready for 4-5 hours of focused, hands-on building
🌍 Choose Your Adventure Platform
Kafka runs in a container everywhere; only Docker setup differs. Then everyone meets at the same kafka-python producer.
🍎 macOS Kingdom Path
Click to expand macOS instructions
```bash brew install --cask docker brew install python python3 -m venv .venv && source .venv/bin/activate pip install --upgrade pip kafka-python # Bring up the single-broker Kafka (compose file in Chapter 2), then: docker compose up -d ```🪟 Windows Empire Path
Click to expand Windows instructions
```powershell winget install Docker.DockerDesktop Python.Python.3.12 py -3 -m venv .venv; .\.venv\Scripts\activate pip install --upgrade pip kafka-python docker compose up -d ```🐧 Linux Territory Path
Click to expand Linux instructions
```bash sudo apt update && sudo apt install -y docker.io docker-compose-plugin python3 python3-venv sudo systemctl enable --now docker python3 -m venv .venv && source .venv/bin/activate pip install --upgrade pip kafka-python docker compose up -d ```☁️ Cloud Realms Path
Click to expand Cloud/Container instructions
```bash # In a Codespace or container host, run the same compose file. # Forward port 9092 (broker) so your Python client can connect. docker compose up -d ```🧙♂️ Chapter 1: Batch vs Stream - Two Mental Models
Before a single event flows, decide whether you are processing a finished pile or an endless river. The two models demand different thinking.
⚔️ Skills You’ll Forge in This Chapter
- The defining differences between batch and stream
- When real-time processing is worth its complexity
- The vocabulary of bounded vs unbounded data
🏗️ The Two Models
| Aspect | Batch | Stream |
|---|---|---|
| Data shape | Bounded - a finite, complete dataset | Unbounded - an endless, never-complete sequence |
| Latency | Minutes to hours | Milliseconds to seconds |
| Completeness | You see all the data before computing | You never see “all” the data; you compute as it arrives |
| Typical use | Daily reports, ML training, reconciliation | Fraud detection, live dashboards, alerting, personalization |
| Reprocessing | Re-run the whole job | Replay from an offset in the log |
Rule of thumb: if a delay of hours is acceptable, batch is simpler and cheaper - prefer it. Reach for streaming only when freshness has real value (a fraud alert tomorrow is worthless). Many modern systems run both: a streaming layer for now, a batch layer for correctness and backfills.
🔍 Knowledge Check: Batch vs Stream
- What makes streaming data “unbounded”?
- Name one workload where streaming’s complexity is unjustified
- Why can a stream processor never wait to “see all the data”?
⚡ Quick Wins and Checkpoints
- Chose a model: You can justify batch or stream for a workload you know
- Defined the terms: You can explain bounded versus unbounded data
🧙♂️ Chapter 2: Kafka - The Log at the Heart of Streaming
Kafka is not a queue you drain; it is a durable, replayable log. Understanding that abstraction unlocks everything else.
⚔️ Skills You’ll Forge in This Chapter
- Topics, partitions, offsets, and consumer groups
- Why the log is replayable and durable
- Producing and consuming events in Python
🏗️ The Kafka Mental Model
- Topic - a named stream of events (e.g.
orders). - Partition - a topic is split into ordered, append-only partitions; order is guaranteed within a partition, not across them. Partitions are how Kafka scales and parallelizes.
- Offset - each event’s position in its partition. Consumers track their offset, so they can replay from any point - the log is not deleted on read.
- Consumer group - a set of consumers that split a topic’s partitions among themselves for parallel consumption.
topic: orders (3 partitions) consumer group "billing"
p0: [e0][e1][e2][e3 ...] <─────────── consumer A (reads p0)
p1: [e0][e1][e2 ...] <─────────── consumer B (reads p1)
p2: [e0][e1 ...] <─────────── consumer C (reads p2)
▲ offsets advance → replay by resetting an offset
Bring up a broker, then produce and consume:
# docker-compose.yml — a single-broker Kafka in KRaft mode (no ZooKeeper)
services:
kafka:
image: bitnami/kafka:3.7
ports:
- "9092:9092"
environment:
- KAFKA_CFG_NODE_ID=0
- KAFKA_CFG_PROCESS_ROLES=controller,broker
- KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
- KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
# producer.py — append events to the 'orders' topic
import json, time
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers="localhost:9092",
value_serializer=lambda v: json.dumps(v).encode(),
key_serializer=lambda k: k.encode(),
)
for i in range(5):
event = {"order_id": i, "amount": 10 * i, "event_time": time.time()}
# The key decides the partition, so all events for one customer keep order.
producer.send("orders", key=f"cust-{i % 2}", value=event)
producer.flush()
print("produced 5 events")
# consumer.py — read events from the beginning, tracking offsets per group
from kafka import KafkaConsumer
import json
consumer = KafkaConsumer(
"orders",
bootstrap_servers="localhost:9092",
group_id="billing",
auto_offset_reset="earliest", # start from offset 0 the first time
value_deserializer=lambda b: json.loads(b),
)
for msg in consumer:
print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
🔍 Knowledge Check: Kafka
- Within what scope does Kafka guarantee ordering?
- Why can a Kafka consumer replay old events but a traditional queue cannot?
- What determines which partition an event lands in?
🧙♂️ Chapter 3: Event Time, Processing Time, and Windowing
This is the hardest and most important idea in streaming. Events carry the time they *happened, but you process them at a different, later time - and they may arrive out of order. Choosing the right clock changes your results.*
⚔️ Skills You’ll Forge in This Chapter
- Event time vs processing time vs ingestion time
- Tumbling, sliding, and session windows
- Watermarks and handling late data
🏗️ Three Clocks
- Event time - when the event actually occurred (a timestamp inside the event). What you almost always want to aggregate by.
- Processing time - when your system happened to process it. Easy, but wrong whenever events are delayed.
- Ingestion time - when the event entered Kafka. A compromise between the two.
A mobile order placed at 11:59 but uploaded at 12:03 (after the phone regained signal) belongs in the 11:00-12:00 hour by event time, even though it was processed in the next hour. Use processing time and you mis-bucket it.
🏗️ Windowing an Unbounded Stream
Because a stream never ends, you aggregate over windows - finite slices of the timeline:
Tumbling (fixed, non-overlapping): [00:00-00:05)[00:05-00:10)[00:10-00:15)
Sliding (fixed, overlapping): [00:00-00:05)[00:01-00:06)[00:02-00:07)
Session (gap-defined): [activity...][gap > 5m][new session...]
- Tumbling - “count orders every 5 minutes.” Each event lands in exactly one window.
- Sliding - “average over the last 5 minutes, updated every minute.” Windows overlap.
- Session - “group a user’s activity until they go quiet for 5 minutes.” Windows are data-driven.
🏗️ Watermarks: Deciding When a Window Is “Done”
Since late events can always trickle in, how does the processor know a window is complete enough to emit? A watermark is the stream’s assertion that “no events older than time T will (probably) still arrive.” When the watermark passes a window’s end, the window fires. Events arriving after, within an allowed lateness, can still update it; beyond that, they are dropped or sent to a side output.
window [12:00-12:05) closes when watermark >= 12:05
late event with event_time 12:04 arriving at 12:06:
if watermark < 12:05 + allowed_lateness -> still counted
else -> dropped (or routed to a "late" stream)
🔍 Knowledge Check: Time and Windows
- Why prefer event time over processing time for most aggregations?
- When is a session window the right choice over a tumbling one?
- What question does a watermark answer for the processor?
🧙♂️ Chapter 4: Delivery Guarantees and Exactly-Once
The final mystery: in a distributed stream with retries and failures, how do you avoid counting an event twice - or losing it? The answer is a careful combination of idempotence and transactions.
⚔️ Skills You’ll Forge in This Chapter
- The three delivery guarantees
- How exactly-once is actually achieved
- Why idempotent consumers matter
🏗️ The Three Guarantees
| Guarantee | Meaning | Risk |
|---|---|---|
| At-most-once | Each event delivered zero or one times | May lose events on failure |
| At-least-once | Each event delivered one or more times | May duplicate events on retry |
| Exactly-once | Each event takes effect exactly once | Hardest; needs coordination |
Naive retries give at-least-once: if a producer resends after a timeout, the broker may store the event twice. Exactly-once semantics (EOS) in Kafka combines two mechanisms:
- Idempotent producer (
enable.idempotence=true) - the broker deduplicates retries using a producer ID and sequence number, so a resend does not create a duplicate. - Transactions - a consume-process-produce cycle is wrapped in a transaction so the output write and the consumer-offset commit succeed or fail together (
read_committedisolation downstream).
# Exactly-once flavored producer: idempotence + transactions
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers="localhost:9092",
enable_idempotence=True, # broker dedups retried sends
transactional_id="orders-aggregator-1", # enables atomic transactions
value_serializer=lambda v: v.encode(),
)
producer.init_transactions()
producer.begin_transaction()
try:
producer.send("orders-agg", value="window=12:00 count=42")
# ... also commit the consumed offsets inside this same transaction ...
producer.commit_transaction() # output + offsets land atomically
except Exception:
producer.abort_transaction() # nothing is half-applied
A practical complement: make your consumer idempotent too - key writes on a stable event id and upsert (just like the idempotent load from the ETL quest), so even an at-least-once delivery causes no harm downstream.
🔍 Knowledge Check: Delivery Guarantees
- What does at-least-once risk that exactly-once prevents?
- What two mechanisms combine to give Kafka exactly-once?
- Why does an idempotent consumer make duplicates harmless?
🎮 Mastery Challenges
🟢 Novice Challenge: Produce and Consume
Objective: Run the local Kafka broker, produce several events to a partitioned topic, and consume them in a group.
Requirements:
- A topic with at least two partitions
- A producer that keys events so related ones share a partition
- A consumer group that reads from the earliest offset
Validation: The consumer prints each event with its partition and offset; replaying resets the offset and re-reads.
🟡 Intermediate Challenge: Windowed Count over Event Time
Objective: Aggregate the stream into tumbling windows keyed by the event’s own timestamp.
Requirements:
- Events carry an
event_timefield used for bucketing - A tumbling window count per fixed interval
- An out-of-order event lands in the correct window by event time
Validation: A deliberately late event is counted in its true window, not the current one.
🔴 Advanced Challenge: Exactly-Once Aggregation
Objective: Build a consume-process-produce loop that survives a forced failure without duplicating output.
Requirements:
- Idempotent producer with a transactional id
- Offsets committed inside the same transaction as the output
- An idempotent downstream upsert keyed on a stable id
Validation: Kill the process mid-batch, restart it, and confirm the output totals are correct with no duplicates.
🏆 Quest Rewards & Achievements
🎖️ Badges Earned:
- 🏆 Rider of the Stream - You processed data that never stops arriving
- ⏱️ Keeper of Event Time - You tamed windows, watermarks, and late data
🛠️ Skills Unlocked:
- Kafka Producer/Consumer Engineering - Build durable, replayable event pipelines
- Windowing and Exactly-Once Stream Design - Aggregate unbounded data correctly
🔓 Unlocked Quests:
- Data Quality Engineering - Validate streaming and batch data alike
📊 Progression Points: +75 XP
🗺️ Next Steps in Your Journey
Continue the Main Story:
- 🎯 Data Quality Engineering - Guard the correctness of every event
Explore Side Adventures:
- ⚔️ Apache Spark - Batch-process the same data at scale
Character Class Recommendations
💻 Software Developer: Continue to Data Quality Engineering
🏗️ System Engineer: Revisit Apache Spark for Structured Streaming
📊 Data Scientist: Advance to Data Quality Engineering
📚 Resources
Official Documentation
- Apache Kafka Documentation - Topics, partitions, producers, consumers
- Kafka Exactly-Once Semantics - Idempotence and transactions
- Apache Flink Documentation - Event time, windows, watermarks
- kafka-python - The Python client used in this quest
Community Resources
- The Log: What every engineer should know (Jay Kreps) - The foundational essay
- Streaming 101 & 102 (Tyler Akidau) - Event time and windowing explained
- Awesome Kafka - Curated tools and reading
Learning Materials
- Kafka: The Definitive Guide (Confluent, free) - The standard book
- Designing Data-Intensive Applications - Chapter 11 on stream processing
🤝 Quest Completion Checklist
- ✅ Completed all primary objectives
- ✅ Produced to and consumed from a partitioned Kafka topic
- ✅ Built a windowed aggregation over event time
- ✅ Handled a late event correctly with a watermark
- ✅ Implemented an exactly-once aggregation
- ✅ Identified your next quest in the journey
🕸️ Knowledge Graph
Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.
Level hub: [[Level 1100 - Data Engineering]] Overworld: [[🏰 Overworld - Master Quest Map]] Requires: [[ETL Pipeline Design: Build Scalable Data Pipelines with Python]] Unlocks: [[Data Quality Engineering: Testing, Validation & Monitoring Frameworks]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]
🎁 Rewards
Badges
- 🏆 Rider of the Stream - Processed data that never stops arriving
- ⏱️ Keeper of Event Time - Tamed windows, watermarks, and late data
Skills unlocked
- 🛠️ Kafka Producer/Consumer Engineering
- 🧠 Windowing and Exactly-Once Stream Design
Features unlocked
- Access to the Data Quality Engineering quest
🕸️ Quest Network
Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.
Referenced by
- Loading…