Skip to main content
Settings
Search
Appearance
Theme Mode
About
Jekyll v3.10.0
Environment Production
Last Build
2026-07-02 02:06 UTC
Current Environment Production
Build Time Jul 02, 02:06
Jekyll v3.10.0
Build env (JEKYLL_ENV) production
Quick Links
Page Location
Page Info
Layout quest
Collection quests
Path _quests/1100/data-warehousing.md
URL /quests/1100/data-warehousing/
Date 2025-11-29
Theme Skin
SVG Backgrounds
Layer Opacity
0.6
0.04
0.08

Data Warehousing: Build a Dimensional Star Schema in SQL

Build the Analytical Citadel: model OLTP vs OLAP, design star and snowflake schemas, implement slowly changing dimensions, and explore columnar storage.

Table of Contents

Lvl 1100Master 🏰 Main Quest 🔴 Hard 5-6 hours

Data Warehousing: Build a Dimensional Star Schema in SQL

Design data warehouses with dimensional modeling, star schemas, slowly changing dimensions, and columnar storage

Primary Tech
🛠️ sql
Skill Focus
Data engineering
Series
Data Engineering Mastery
Author
IT-Journey Team
XP Range
⚡ 6000-7000

Greetings, brave adventurer! Your aqueduct from the ETL quest now delivers clean water - but where does it pool? You stand before the Analytical Citadel, the great reservoir where data is shaped not for fast single-row transactions but for sweeping questions across millions of rows: “What were sales by region by quarter for the last five years?” Build the citadel with the wrong layout and every report crawls; build it as a dimensional model and queries fly.

Whether you have only ever queried application databases or you are formalizing the analytics layer your company half-built, this quest forges the discipline of the warehouse: OLTP versus OLAP, star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and the columnar engines that make it all fast.

📖 The Legend Behind This Quest

Application databases were built for speed at the level of a single order, a single user, a single click. When the kingdom’s strategists demanded answers across all orders, all users, all time, those same databases buckled. A new architecture arose - the data warehouse - optimized not to write one row quickly but to read billions efficiently. Its master plan was dimensional modeling, devised by Ralph Kimball, and it remains the blueprint analysts trust today.

This quest teaches the “why” behind every fact table and every dimension. Master it and the rest of the tier - distributed Spark jobs, streaming aggregations, quality contracts - all have a place to land.

🎯 Quest Objectives

By the time you complete this journey, you will have mastered:

Primary Objectives (Required for Quest Completion)

  • OLTP vs OLAP - Explain transactional versus analytical workloads and why they need different designs
  • Star schema modeling - Design a central fact table surrounded by dimension tables
  • Snowflake schemas - Normalize dimensions and weigh the trade-offs
  • Slowly Changing Dimensions - Track history correctly as dimension attributes change over time

Secondary Objectives (Bonus Achievements)

  • Columnar storage - Understand why analytics engines store data by column
  • Grain definition - Pin down exactly what one fact row represents
  • Surrogate keys - Decouple the warehouse from source system identifiers

Mastery Indicators

You’ll know you’ve truly mastered this quest when you can:

  • Describe why an OLTP schema is a poor fit for analytics
  • Design a star schema directly from a list of business questions
  • Implement a Type 2 SCD that preserves the full history of a record
  • Explain why columnar storage accelerates aggregate queries

🗺️ Quest Prerequisites

📋 Knowledge Requirements

  • Solid SQL: SELECT, JOIN, GROUP BY, aggregate functions
  • Comfort with primary keys and foreign keys
  • Completion of ETL Pipeline Design (recommended)

🛠️ System Requirements

  • Modern operating system (Windows 10+, macOS 10.14+, or Linux)
  • SQLite (ships with Python) for the schema work
  • Optional: DuckDB for the columnar storage demo
  • A text editor or IDE (VS Code recommended)

🧠 Skill Level Indicators

This 🔴 Hard quest expects:

  • You can write multi-table JOIN and GROUP BY queries
  • You are ready to think in analytics, not single-row transactions
  • Ready for 5-6 hours of focused, hands-on modeling

🌍 Choose Your Adventure Platform

All the SQL runs in SQLite, which ships with Python everywhere. The optional columnar demo uses DuckDB. Choose your setup path.

🍎 macOS Kingdom Path

Click to expand macOS instructions ```bash brew install python duckdb python3 -m venv .venv && source .venv/bin/activate python -m pip install --upgrade pip duckdb sqlite3 --version ```

🪟 Windows Empire Path

Click to expand Windows instructions ```powershell winget install Python.Python.3.12 DuckDB.cli py -3 -m venv .venv .\.venv\Scripts\activate python -m pip install --upgrade pip duckdb ```

🐧 Linux Territory Path

Click to expand Linux instructions ```bash sudo apt update && sudo apt install -y python3 python3-venv sqlite3 # Debian/Ubuntu python3 -m venv .venv && source .venv/bin/activate python -m pip install --upgrade pip duckdb ```

☁️ Cloud Realms Path

Click to expand Cloud/Container instructions ```bash # DuckDB needs no server and runs in any container or notebook: pip install duckdb python -c "import duckdb; print(duckdb.sql('SELECT 42 AS answer'))" ```

🧙‍♂️ Chapter 1: OLTP vs OLAP - Two Worlds of Data

The same data can be stored two completely different ways depending on the questions you ask of it.

⚔️ Skills You’ll Forge in This Chapter

  • The distinction between transactional and analytical workloads
  • Why one schema cannot serve both well
  • The vocabulary of facts, dimensions, and grain

🏗️ The Two Workloads

Aspect OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Typical query Insert/update one order; read one customer Aggregate millions of rows across time
Optimized for Many small, concurrent writes Few large, read-heavy scans
Schema Highly normalized (3NF), no redundancy Denormalized star schema, redundancy embraced
Example system PostgreSQL backing an app Snowflake, BigQuery, Redshift, DuckDB
Row vs column Row-oriented storage Column-oriented storage

An OLTP schema spreads a sale across orders, order_items, customers, products, and addresses to avoid duplication. Answering “total revenue by city by month” then demands five-way joins over millions of rows. The warehouse denormalizes that into a shape built for exactly such questions.

🔍 Knowledge Check: OLTP vs OLAP

  • Why is heavy normalization great for OLTP but painful for OLAP?
  • What does “denormalized for analytics” buy you, and what does it cost?
  • Name two OLAP engines and two OLTP databases

🧙‍♂️ Chapter 2: Star Schemas - Facts and Dimensions

The star schema is the heart of dimensional modeling: one central fact table of measurements, surrounded by dimension tables of descriptive context. Drawn out, it looks like a star.

⚔️ Skills You’ll Forge in This Chapter

  • Identifying facts (measures) versus dimensions (context)
  • Defining the grain - what one fact row means
  • Building a star schema in SQL with surrogate keys

🏗️ Anatomy of a Star

  • Fact table - the numbers you measure: quantities, amounts, counts. One row per event at a defined grain (e.g., “one row per product per order”).
  • Dimension tables - the who, what, where, when: customers, products, dates, stores. These give your numbers meaning.
  • Surrogate keys - integer keys minted by the warehouse (date_key, product_key) instead of reusing source ids, so the warehouse stays stable when sources change.
-- Dimension tables: descriptive context, one row per entity.
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,   -- e.g. 20250613
    full_date  TEXT,                  -- '2025-06-13'
    year       INTEGER,
    quarter    INTEGER,
    month      INTEGER,
    day_name   TEXT
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY, -- surrogate key
    product_id   TEXT,                -- natural key from the source system
    name         TEXT,
    category     TEXT
);

CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,
    name         TEXT,
    city         TEXT
);

-- Fact table: measures + foreign keys to every dimension.
-- Grain: one row per product per order line.
CREATE TABLE fact_sales (
    sale_id       INTEGER PRIMARY KEY,
    date_key      INTEGER REFERENCES dim_date(date_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    customer_key  INTEGER REFERENCES dim_customer(customer_key),
    quantity      INTEGER,            -- additive measure
    amount        REAL                -- additive measure
);

Now the dreaded “revenue by city by quarter” is a clean, fast query:

SELECT c.city, d.year, d.quarter, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date     d ON d.date_key     = f.date_key
GROUP BY c.city, d.year, d.quarter
ORDER BY revenue DESC;

🔍 Knowledge Check: Star Schemas

  • What is the “grain” of fact_sales, and why must you define it first?
  • Why use a surrogate product_key instead of the source product_id?
  • Which columns are measures and which are context?

🧙‍♂️ Chapter 3: Snowflake Schemas and Slowly Changing Dimensions

Two refinements separate a working warehouse from a robust one: deciding how far to normalize dimensions, and deciding how to record history when a dimension changes.

⚔️ Skills You’ll Forge in This Chapter

  • Star versus snowflake normalization trade-offs
  • The three classic Slowly Changing Dimension (SCD) types
  • Implementing a Type 2 SCD that preserves full history

🏗️ Snowflake: Normalizing the Dimensions

A snowflake schema normalizes dimensions into sub-tables (e.g., dim_productdim_categorydim_department). It saves storage and removes redundancy but adds joins and complexity.

-- Snowflaked: category extracted into its own table.
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY,
    category     TEXT,
    department   TEXT
);
-- dim_product now references dim_category instead of repeating the text.
ALTER TABLE dim_product ADD COLUMN category_key INTEGER REFERENCES dim_category(category_key);

Rule of thumb: prefer star (denormalized) for query speed and simplicity - storage is cheap and analysts love fewer joins. Reach for snowflake only when a dimension is huge, frequently reused, or governed centrally.

🏗️ Slowly Changing Dimensions

When a customer moves city, what should history show? The SCD types answer this:

  • Type 1 - overwrite the old value. Simple, but history is lost.
  • Type 2 - add a new row with effective dates, keeping the old one. Full history preserved.
  • Type 3 - add a “previous value” column. Limited history (one prior value).
-- Type 2 SCD: track history with effective dates and a current flag.
CREATE TABLE dim_customer_scd2 (
    customer_key   INTEGER PRIMARY KEY,  -- surrogate; a new key per version
    customer_id    TEXT,                 -- natural key, stable across versions
    name           TEXT,
    city           TEXT,
    valid_from     TEXT,
    valid_to       TEXT,                 -- NULL while current
    is_current     INTEGER               -- 1 for the active version
);

-- When customer C1 moves from Austin to Denver:
-- 1) Close the old version.
UPDATE dim_customer_scd2
SET valid_to = '2025-06-13', is_current = 0
WHERE customer_id = 'C1' AND is_current = 1;

-- 2) Insert the new current version.
INSERT INTO dim_customer_scd2 (customer_id, name, city, valid_from, valid_to, is_current)
VALUES ('C1', 'Ada Lovelace', 'Denver', '2025-06-13', NULL, 1);

Now a fact row joined on the surrogate customer_key forever reflects the city the customer lived in at the time of the sale - history stays correct.

🔍 Knowledge Check: Snowflake & SCDs

  • When is a snowflake schema worth the extra joins?
  • Which SCD type loses history, and which preserves it fully?
  • Why does a Type 2 SCD require a new surrogate key per version?

🧙‍♂️ Chapter 4: Columnar Storage - Why Analytics Engines Are Fast

The final secret of the warehouse is physical, not logical: analytical engines store data by column, not by row.

⚔️ Skills You’ll Forge in This Chapter

  • Row-oriented versus column-oriented storage
  • Why columnar layout accelerates aggregates and compresses well
  • Seeing it in action with DuckDB

🏗️ Rows vs Columns

A row store keeps (sale_id, date, product, amount) together on disk - perfect for reading one whole order. But SUM(amount) over a billion rows must then touch every column of every row. A column store keeps all amount values contiguously, so the engine reads only that column - far less I/O, and columns of similar values compress dramatically.

# columnar_demo.py — DuckDB is an embedded columnar OLAP engine
import duckdb

duckdb.sql("CREATE TABLE sales AS SELECT range AS id, random() AS amount FROM range(5_000_000)")
# Only the 'amount' column is scanned — columnar engines shine on aggregates.
print(duckdb.sql("SELECT COUNT(*), SUM(amount), AVG(amount) FROM sales"))

This is why Snowflake, BigQuery, Redshift, Parquet files, and DuckDB are all columnar: aggregate analytics over wide tables is exactly the workload columns are built for.

🔍 Knowledge Check: Columnar Storage

  • Why does SUM(amount) run faster on a column store than a row store?
  • Why do columns compress better than rows?
  • Name three columnar systems you might meet in production

🎮 Mastery Challenges

🟢 Novice Challenge: Build the Star

Objective: Create the fact_sales star schema with dim_date, dim_product, and dim_customer, then load a few rows.

Requirements:

  • One fact table and at least two dimension tables
  • Surrogate keys on all dimensions
  • A working “revenue by dimension” aggregate query

Validation: Your GROUP BY query returns sensible totals.

🟡 Intermediate Challenge: Track History

Objective: Implement a Type 2 SCD for customers and record a city change.

Requirements:

  • valid_from, valid_to, and is_current columns
  • Old version closed, new version inserted on change
  • A query that shows both historical and current versions

Validation: A sale before the move still maps to the old city.

🔴 Advanced Challenge: Star to Snowflake and Columnar

Objective: Snowflake the product dimension and benchmark an aggregate in DuckDB.

Requirements:

  • Extract category into dim_category with a foreign key
  • Load a multi-million-row fact table into DuckDB
  • Compare the aggregate query against a row-store equivalent and note the difference

Validation: You can explain the speed difference in terms of columnar I/O.

🏆 Quest Rewards & Achievements

🎖️ Badges Earned:

  • 🏆 Citadel Architect - You designed a dimensional star schema
  • 🌟 Dimension Keeper - You preserve history with slowly changing dimensions

🛠️ Skills Unlocked:

  • Dimensional Modeling - Facts, dimensions, grain, and surrogate keys
  • OLAP Schema Design - Star, snowflake, and columnar trade-offs

🔓 Unlocked Quests:

  • Apache Spark - Build these tables at petabyte scale
  • Data Quality Engineering - Enforce contracts on your warehouse tables

📊 Progression Points: +80 XP

🗺️ Next Steps in Your Journey

Continue the Main Story:

  • 🎯 Apache Spark - Scale dimensional builds with distributed compute

Explore Side Adventures:

Character Class Recommendations

💻 Software Developer: Continue to Apache Spark
🏗️ System Engineer: Explore Stream Processing
📊 Data Scientist: Advance to Data Quality Engineering

📚 Resources

Official Documentation

Community Resources

Learning Materials

🤝 Quest Completion Checklist

  • ✅ Completed all primary objectives
  • ✅ Built a working star schema with facts and dimensions
  • ✅ Answered all knowledge check questions
  • ✅ Completed at least one mastery challenge
  • ✅ Explored the resource library
  • ✅ Identified your next quest in the journey

🕸️ Knowledge Graph

Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.

Level hub: [[Level 1100 - Data Engineering]] Overworld: [[🏰 Overworld - Master Quest Map]] Prerequisites: [[ETL Pipeline Design: Build Scalable Data Pipelines with Python]] Unlocks: [[Apache Spark Mastery: Big Data Processing with PySpark & Scala]] · [[Data Quality Engineering: Testing, Validation & Monitoring Frameworks]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]

🎁 Rewards

80 XP

Badges

  • 🏆 Citadel Architect - Designed a dimensional star schema
  • 🌟 Dimension Keeper - Mastered slowly changing dimensions

Skills unlocked

  • 🛠️ Dimensional Modeling
  • 🧠 OLAP Schema Design

Features unlocked

  • Access to the Apache Spark and Data Quality quests

🕸️ Quest Network

Loading quest graph…

Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.