ETL Pipeline Design: Scalable Python Data Pipelines
Build production-grade ETL pipelines with Python, covering extraction, transformation, idempotent loading, and Airflow orchestration
Greetings, brave adventurer! You stand at the gates of the Master tier and the realm of Data Engineering opens before you. Your first trial is the ETL Pipeline - the great aqueduct that carries raw data from scattered springs, purifies it in transformation chambers, and delivers it into the reservoirs your analysts and models depend on. Build it well and the kingdom drinks clean water; build it carelessly and every dashboard downstream is poisoned.
Whether you have hand-fed CSVs into spreadsheets or you are formalizing pipelines you already run by intuition, this quest forges the core discipline of every data engineer: moving data reliably, repeatably, and idempotently from source to destination - then handing the work to an orchestrator that runs it on schedule, forever.
📖 The Legend Behind This Quest
In the early days, data lived in one database and one report. Then came the flood: APIs, event logs, SaaS exports, sensor feeds. Engineers discovered that the value was never in the raw data itself but in the pipeline that reshaped it into something trustworthy. The ones who thrived learned a hard truth - a pipeline that cannot be safely rerun is not a pipeline, it is a time bomb.
This quest teaches the “why” behind every Extract, Transform, and Load. Master it and the rest of the Data Engineering tier - warehousing, Spark, streaming, quality - becomes a set of variations on a theme you already understand.
🎯 Quest Objectives
By the time you complete this journey, you will have mastered:
Primary Objectives (Required for Quest Completion)
- ETL vs ELT - Explain both patterns and choose correctly for a given workload
- The Extract / Transform / Load stages - Build each stage as a clean, testable function
- Idempotency & incremental loads - Write a load that is safe to rerun and only moves new data
- Orchestration with Airflow - Express the pipeline as a DAG of dependent tasks on a schedule
Secondary Objectives (Bonus Achievements)
- Error handling & retries - Make the pipeline survive transient failures
- Watermarking - Track the high-water mark so reruns do not duplicate data
- Backfills - Reprocess a historical window without breaking the present
Mastery Indicators
You’ll know you’ve truly mastered this quest when you can:
- Explain idempotency to another person with a concrete example
- Convert a full-load script into an incremental one using a watermark
- Sketch an Airflow DAG with the correct task dependencies for a real pipeline
- Rerun your pipeline twice and prove the destination is unchanged the second time
🗺️ Quest Prerequisites
📋 Knowledge Requirements
- Comfortable writing Python functions and using
pippackages - Basic SQL:
SELECT,INSERT,JOIN, primary keys - Understanding of JSON, CSV, and how an HTTP API returns data
🛠️ System Requirements
- Modern operating system (Windows 10+, macOS 10.14+, or Linux)
- Python 3.10+ with
pipandvenv - SQLite (ships with Python) and VS Code or your editor of choice
🧠 Skill Level Indicators
This 🔴 Hard quest expects:
- You can build and run a small Python program end to end
- You are ready to think about failure, retries, and safe reruns
- Ready for 4-5 hours of focused, hands-on building
🌍 Choose Your Adventure Platform
The pipeline is plain Python and SQLite, so it runs anywhere. Pick the path that fits your setup, then everyone meets at the same pip install.
🍎 macOS Kingdom Path
Click to expand macOS instructions
```bash brew install python python3 -m venv .venv source .venv/bin/activate python -m pip install --upgrade pip requests pandas python --version ```🪟 Windows Empire Path
Click to expand Windows instructions
```powershell winget install Python.Python.3.12 py -3 -m venv .venv .\.venv\Scripts\activate python -m pip install --upgrade pip requests pandas python --version ```🐧 Linux Territory Path
Click to expand Linux instructions
```bash sudo apt update && sudo apt install -y python3 python3-venv # Debian/Ubuntu python3 -m venv .venv source .venv/bin/activate python -m pip install --upgrade pip requests pandas python --version ```☁️ Cloud Realms Path
Click to expand Cloud/Container instructions
```bash # Any Codespace or container with Python works. For the Airflow chapter: docker run --rm -p 8080:8080 apache/airflow:2.9.3 standalone # Then open http://localhost:8080 (the standalone command prints a login). ```🧙♂️ Chapter 1: ETL vs ELT - Choosing the Shape of Your Pipeline
Before you move a single byte, you must decide where transformation happens. This single choice shapes your whole architecture.
⚔️ Skills You’ll Forge in This Chapter
- The difference between ETL and ELT
- When each pattern is the right tool
- The vocabulary of source, staging, and destination
🏗️ The Two Patterns
| Aspect | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
|---|---|---|
| Where transforms run | In a processing layer before loading | Inside the warehouse, after loading raw data |
| Best for | Legacy targets, heavy cleansing, small/medium volume, compliance filtering before storage | Cloud warehouses (Snowflake, BigQuery, Redshift) with cheap compute and storage |
| Raw data kept? | Often discarded after transform | Raw lands first, so you can re-transform later |
| Tooling | Python, Spark, Informatica, custom code | dbt, SQL, warehouse compute |
Rule of thumb: if your destination is a cloud warehouse with cheap, scalable compute, prefer ELT - load raw, transform with SQL/dbt, and keep the raw layer so you can fix transforms without re-extracting. If you must cleanse, mask, or reshape before data ever touches storage (PII, tiny target, on-prem), use ETL. This quest builds an ETL pipeline because it teaches every stage explicitly; the warehousing quest that follows leans into ELT.
🔍 Knowledge Check: ETL vs ELT
- Why does ELT let you re-run transformations without re-extracting?
- Name one scenario where ETL is mandatory for compliance reasons
- Which pattern keeps a permanent raw layer by default?
⚡ Quick Wins and Checkpoints
- Pattern chosen: You can justify ETL or ELT for a workload you know
- Vocabulary set: You can define source, staging, and destination
🧙♂️ Chapter 2: Extract and Transform - The First Two Stages
Every pipeline begins by pulling data from a source and reshaping it. We’ll build each stage as a small, pure function so it can be tested and reused.
⚔️ Skills You’ll Forge in This Chapter
- Extracting from an API or file into Python structures
- Writing transforms that are deterministic and side-effect-free
- Separating concerns so each stage can fail and retry independently
🏗️ Extract: Pull Raw Data
# etl.py — stage 1: extract
import requests
def extract(url: str) -> list[dict]:
"""Pull raw records from a JSON API. Raises on HTTP errors so the
orchestrator can retry the task rather than load garbage."""
resp = requests.get(url, timeout=30)
resp.raise_for_status()
payload = resp.json()
# Normalize to a list of records regardless of API envelope shape.
return payload if isinstance(payload, list) else payload.get("data", [])
if __name__ == "__main__":
rows = extract("https://jsonplaceholder.typicode.com/users")
print(f"Extracted {len(rows)} records")
🏗️ Transform: Reshape and Clean
A good transform is deterministic: the same input always yields the same output, with no hidden state. That property is what makes the whole pipeline rerunnable.
# etl.py — stage 2: transform
from datetime import datetime, timezone
def transform(rows: list[dict]) -> list[dict]:
"""Select fields, normalize types, and stamp an ingestion time.
Pure function: no I/O, no globals — easy to unit test."""
cleaned = []
for r in rows:
cleaned.append({
"user_id": int(r["id"]),
"name": r["name"].strip(),
"email": r["email"].strip().lower(),
"city": r.get("address", {}).get("city", "").strip(),
"ingested_at": datetime.now(timezone.utc).isoformat(),
})
return cleaned
🔍 Knowledge Check: Extract & Transform
- Why should
extractraise on an HTTP error instead of returning[]? - What makes a transform function “pure,” and why does that matter for reruns?
- Where would you add field validation - and what should happen on a bad row?
🧙♂️ Chapter 3: Idempotent Loading and Incremental Pipelines
Loading is where pipelines live or die. The golden rule: running the pipeline twice must not create two copies of the same data. That property is called idempotency.
⚔️ Skills You’ll Forge in This Chapter
- Idempotent upserts with a natural/primary key
- Incremental loads using a watermark
- Designing loads that are safe to retry mid-failure
🏗️ Load: Idempotent Upsert into SQLite
The naive approach - INSERT every row - duplicates data on every rerun. Instead, use an upsert keyed on a stable identifier so a rerun overwrites rather than duplicates.
# etl.py — stage 3: load (idempotent)
import sqlite3
def load(rows: list[dict], db_path: str = "warehouse.db") -> int:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS users (
user_id INTEGER PRIMARY KEY,
name TEXT,
email TEXT,
city TEXT,
ingested_at TEXT
)
""")
# ON CONFLICT makes this idempotent: rerunning UPDATES the existing row
# instead of inserting a duplicate. The PRIMARY KEY is the conflict target.
conn.executemany("""
INSERT INTO users (user_id, name, email, city, ingested_at)
VALUES (:user_id, :name, :email, :city, :ingested_at)
ON CONFLICT(user_id) DO UPDATE SET
name=excluded.name,
email=excluded.email,
city=excluded.city,
ingested_at=excluded.ingested_at
""", rows)
conn.commit()
count = conn.total_changes
conn.close()
return count
🏗️ Wire the Full Pipeline
# etl.py — orchestrate the three stages with retry on extract
import time
def run_pipeline(url: str) -> None:
for attempt in range(3):
try:
raw = extract(url)
break
except requests.RequestException as e:
wait = 2 ** attempt
print(f"Extract failed ({e}); retrying in {wait}s")
time.sleep(wait)
else:
raise RuntimeError("Extract failed after 3 attempts")
clean = transform(raw)
changed = load(clean)
print(f"Loaded/updated {changed} rows idempotently")
if __name__ == "__main__":
run_pipeline("https://jsonplaceholder.typicode.com/users")
run_pipeline("https://jsonplaceholder.typicode.com/users") # run twice — still no duplicates
Run it twice. The second run updates the same rows instead of duplicating them - that is idempotency in action.
🏗️ Incremental Loads with a Watermark
Full loads waste effort once data grows. An incremental load moves only records newer than the last successful run, tracked by a watermark (a high-water mark such as a timestamp or monotonic id).
-- Find the watermark: the newest record we have already loaded.
SELECT COALESCE(MAX(updated_at), '1970-01-01') AS watermark FROM users;
-- Then extract only newer rows from the source:
-- SELECT * FROM source_users WHERE updated_at > :watermark;
-- After a successful load, the new MAX(updated_at) becomes the next watermark.
This makes reruns cheap and correct: combined with the idempotent upsert, even an overlapping window cannot create duplicates.
🔍 Knowledge Check: Idempotency & Incrementals
- What does
ON CONFLICT ... DO UPDATEguarantee on a rerun? - How does a watermark prevent re-processing old data?
- Why is “idempotent upsert + watermark” safer than either alone?
🧙♂️ Chapter 4: Orchestration with Airflow DAGs
A pipeline you run by hand is a chore. A pipeline an orchestrator runs on a schedule, retries on failure, and alerts you when it breaks is infrastructure. Apache Airflow models your pipeline as a DAG - a Directed Acyclic Graph of tasks with dependencies.
⚔️ Skills You’ll Forge in This Chapter
- DAGs, tasks, dependencies, and schedules
- Expressing extract → transform → load as ordered tasks
- Built-in retries and idempotent task design
🏗️ The Same Pipeline as an Airflow DAG
# dags/users_etl.py — drop into your Airflow dags/ folder
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from etl import extract, transform, load # your functions from Chapter 2-3
default_args = {
"retries": 3, # Airflow retries failed tasks for you
"retry_delay": timedelta(minutes=2),
}
with DAG(
dag_id="users_etl",
start_date=datetime(2025, 1, 1),
schedule="@daily", # run once per day
catchup=False,
default_args=default_args,
tags=["data-engineering", "etl"],
) as dag:
def _extract(**ctx):
rows = extract("https://jsonplaceholder.typicode.com/users")
ctx["ti"].xcom_push(key="raw", value=rows)
def _transform(**ctx):
raw = ctx["ti"].xcom_pull(key="raw", task_ids="extract")
ctx["ti"].xcom_push(key="clean", value=transform(raw))
def _load(**ctx):
clean = ctx["ti"].xcom_pull(key="clean", task_ids="transform")
load(clean)
t_extract = PythonOperator(task_id="extract", python_callable=_extract)
t_transform = PythonOperator(task_id="transform", python_callable=_transform)
t_load = PythonOperator(task_id="load", python_callable=_load)
# The dependency chain — this is the "directed acyclic" part of the DAG:
t_extract >> t_transform >> t_load
The >> operator declares dependencies: load cannot start until transform succeeds, which cannot start until extract succeeds. Because each task is idempotent, Airflow can safely retry any single task without corrupting your data.
🔍 Knowledge Check: Orchestration
- What do the nodes and edges of an Airflow DAG represent?
- Why must a DAG be acyclic?
- How do Airflow retries rely on your tasks being idempotent?
🎮 Mastery Challenges
🟢 Novice Challenge: Run It Twice
Objective: Run etl.py twice and prove the row count is identical both times.
Requirements:
- Pipeline executes extract, transform, and load
SELECT COUNT(*) FROM usersreturns the same number after both runs- No
UNIQUE constraint failederrors
Validation: Run python etl.py, then re-run it; the table is unchanged in size.
🟡 Intermediate Challenge: Make It Incremental
Objective: Convert the full load into an incremental one driven by a watermark.
Requirements:
- Compute the current watermark from the destination
- Extract only records newer than the watermark
- Keep the idempotent upsert so overlap is harmless
Validation: A second run with no new source data loads zero rows.
🔴 Advanced Challenge: Orchestrate and Backfill
Objective: Run the DAG in Airflow and perform a backfill over a historical window.
Requirements:
- DAG appears and runs in the Airflow UI on a schedule
- Tasks have retries configured and survive a simulated extract failure
- A backfill reprocesses past dates without duplicating present data
Validation: Trigger the DAG, kill the extract once, and confirm the retry completes the run cleanly.
🏆 Quest Rewards & Achievements
🎖️ Badges Earned:
- 🏆 Pipeline Architect - You designed an idempotent, orchestrated ETL flow
- ⚙️ Idempotency Adept - Your loads are safe to rerun, every time
🛠️ Skills Unlocked:
- Python ETL Engineering - Build extract, transform, and load stages cleanly
- Workflow Orchestration - Express pipelines as scheduled Airflow DAGs
🔓 Unlocked Quests:
- Data Warehousing - Model the destination your pipeline feeds
- Apache Spark - Scale your transforms to distributed compute
- Data Quality Engineering - Guard the data your pipeline moves
📊 Progression Points: +75 XP
🗺️ Next Steps in Your Journey
Continue the Main Story:
- 🎯 Data Warehousing - Design the analytical destination
Explore Side Adventures:
- ⚔️ Apache Spark - Distribute your transformations
- ⚔️ Data Quality Engineering - Validate every load
Character Class Recommendations
💻 Software Developer: Continue to Data Warehousing
🏗️ System Engineer: Explore Apache Spark
📊 Data Scientist: Advance to Data Quality Engineering
📚 Resources
Official Documentation
- Apache Airflow Documentation - DAGs, operators, scheduling
- Python sqlite3 module - The standard-library database used here
- SQLite UPSERT syntax -
ON CONFLICT DO UPDATE
Community Resources
- dbt - the analytics engineering standard for ELT - Transform-in-warehouse tooling
- Awesome Data Engineering - Curated tools and reading
- r/dataengineering - Practitioner community
Learning Materials
- The Data Engineering Cookbook - Patterns and interview prep
- Airflow Tutorials - Hands-on DAG building
🤝 Quest Completion Checklist
- ✅ Completed all primary objectives
- ✅ Built a runnable, idempotent ETL script
- ✅ Answered all knowledge check questions
- ✅ Completed at least one mastery challenge
- ✅ Explored the resource library
- ✅ Identified your next quest in the journey
🕸️ Knowledge Graph
Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.
Level hub: [[Level 1100 - Data Engineering]] Overworld: [[🏰 Overworld - Master Quest Map]] Unlocks: [[Data Warehousing: Design Star Schema & Build Modern Analytics Architecture]] · [[Apache Spark Mastery: Big Data Processing with PySpark & Scala]] · [[Data Quality Engineering: Testing, Validation & Monitoring Frameworks]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]
🎁 Rewards
Badges
- 🏆 Pipeline Architect - Designed an idempotent, orchestrated ETL flow
- ⚙️ Idempotency Adept - Built loads that are safe to rerun
Skills unlocked
- 🛠️ Python ETL Engineering
- 🧠 Workflow Orchestration with Airflow
Features unlocked
- Access to the Data Warehousing and Apache Spark quests
🕸️ Quest Network
Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.
Referenced by
- Loading…