Data Warehousing: Build a Dimensional Star Schema in SQL

Build the Analytical Citadel: model OLTP vs OLAP, design star and snowflake schemas, implement slowly changing dimensions, and explore columnar storage.

IT-Journey Team
Published Nov 29, 2025
Updated Jun 14, 2026
Quests
🔴 hard
1100 sql data-warehouse main_quest data-engineering +2
View source

Estimated reading time: 25 minutes

Edit on GitHub

⚡ Lvl 1100Master 🏰 Main Quest 🔴 Hard 5-6 hours

Data Warehousing: Build a Dimensional Star Schema in SQL

Design data warehouses with dimensional modeling, star schemas, slowly changing dimensions, and columnar storage

Primary Tech: 🛠️ sql
Skill Focus: Data engineering
Series: Data Engineering Mastery
Author: IT-Journey Team
XP Range: ⚡ 6000-7000

📈 Your Progress

Not started · 0%

Your progress is stored in your browser only. Use your inventory to back it up.

🗝️ Prerequisites

Required quests

ETL Pipeline Design: Scalable Python Data Pipelines

Recommended quests

Data Modeling: Schema Design and Database Relationships

Knowledge requirements

Solid SQL - SELECT, JOIN, GROUP BY, aggregate functions
Understanding of primary and foreign keys
Completion of the ETL Pipeline Design quest (recommended)

System requirements

Modern OS (macOS, Windows 10+, Linux)
SQLite (bundled with Python) or DuckDB for the columnar demo
A text editor or IDE (VS Code recommended)

Greetings, brave adventurer! Your aqueduct from the ETL quest now delivers clean water - but where does it pool? You stand before the Analytical Citadel, the great reservoir where data is shaped not for fast single-row transactions but for sweeping questions across millions of rows: “What were sales by region by quarter for the last five years?” Build the citadel with the wrong layout and every report crawls; build it as a dimensional model and queries fly.

Whether you have only ever queried application databases or you are formalizing the analytics layer your company half-built, this quest forges the discipline of the warehouse: OLTP versus OLAP, star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and the columnar engines that make it all fast.

📖 The Legend Behind This Quest

Application databases were built for speed at the level of a single order, a single user, a single click. When the kingdom’s strategists demanded answers across all orders, all users, all time, those same databases buckled. A new architecture arose - the data warehouse - optimized not to write one row quickly but to read billions efficiently. Its master plan was dimensional modeling, devised by Ralph Kimball, and it remains the blueprint analysts trust today.

This quest teaches the “why” behind every fact table and every dimension. Master it and the rest of the tier - distributed Spark jobs, streaming aggregations, quality contracts - all have a place to land.

🎯 Quest Objectives

By the time you complete this journey, you will have mastered:

Primary Objectives (Required for Quest Completion)

OLTP vs OLAP - Explain transactional versus analytical workloads and why they need different designs
Star schema modeling - Design a central fact table surrounded by dimension tables
Snowflake schemas - Normalize dimensions and weigh the trade-offs
Slowly Changing Dimensions - Track history correctly as dimension attributes change over time

Secondary Objectives (Bonus Achievements)

Columnar storage - Understand why analytics engines store data by column
Grain definition - Pin down exactly what one fact row represents
Surrogate keys - Decouple the warehouse from source system identifiers

Mastery Indicators

You’ll know you’ve truly mastered this quest when you can:

Describe why an OLTP schema is a poor fit for analytics
Design a star schema directly from a list of business questions
Implement a Type 2 SCD that preserves the full history of a record
Explain why columnar storage accelerates aggregate queries

🗺️ Quest Prerequisites

📋 Knowledge Requirements

Solid SQL: SELECT, JOIN, GROUP BY, aggregate functions
Comfort with primary keys and foreign keys
Completion of ETL Pipeline Design (recommended)

🛠️ System Requirements

Modern operating system (Windows 10+, macOS 10.14+, or Linux)
SQLite (ships with Python) for the schema work
Optional: DuckDB for the columnar storage demo
A text editor or IDE (VS Code recommended)

🧠 Skill Level Indicators

This 🔴 Hard quest expects:

You can write multi-table JOIN and GROUP BY queries
You are ready to think in analytics, not single-row transactions
Ready for 5-6 hours of focused, hands-on modeling

🌍 Choose Your Adventure Platform

All the SQL runs in SQLite, which ships with Python everywhere. The optional columnar demo uses DuckDB. Choose your setup path.

🍎 macOS Kingdom Path

Click to expand macOS instructions

```bash brew install python duckdb python3 -m venv .venv && source .venv/bin/activate python -m pip install --upgrade pip duckdb sqlite3 --version ```

🪟 Windows Empire Path

Click to expand Windows instructions

```powershell winget install Python.Python.3.12 DuckDB.cli py -3 -m venv .venv .\.venv\Scripts\activate python -m pip install --upgrade pip duckdb ```

🐧 Linux Territory Path

Click to expand Linux instructions

```bash sudo apt update && sudo apt install -y python3 python3-venv sqlite3 # Debian/Ubuntu python3 -m venv .venv && source .venv/bin/activate python -m pip install --upgrade pip duckdb ```

☁️ Cloud Realms Path

Click to expand Cloud/Container instructions

```bash # DuckDB needs no server and runs in any container or notebook: pip install duckdb python -c "import duckdb; print(duckdb.sql('SELECT 42 AS answer'))" ```

🧙‍♂️ Chapter 1: OLTP vs OLAP - Two Worlds of Data

The same data can be stored two completely different ways depending on the questions you ask of it.

⚔️ Skills You’ll Forge in This Chapter

The distinction between transactional and analytical workloads
Why one schema cannot serve both well
The vocabulary of facts, dimensions, and grain

🏗️ The Two Workloads

Aspect	OLTP (Online Transaction Processing)	OLAP (Online Analytical Processing)
Typical query	Insert/update one order; read one customer	Aggregate millions of rows across time
Optimized for	Many small, concurrent writes	Few large, read-heavy scans
Schema	Highly normalized (3NF), no redundancy	Denormalized star schema, redundancy embraced
Example system	PostgreSQL backing an app	Snowflake, BigQuery, Redshift, DuckDB
Row vs column	Row-oriented storage	Column-oriented storage

An OLTP schema spreads a sale across orders, order_items, customers, products, and addresses to avoid duplication. Answering “total revenue by city by month” then demands five-way joins over millions of rows. The warehouse denormalizes that into a shape built for exactly such questions.

🔍 Knowledge Check: OLTP vs OLAP

Why is heavy normalization great for OLTP but painful for OLAP?
What does “denormalized for analytics” buy you, and what does it cost?
Name two OLAP engines and two OLTP databases

🧙‍♂️ Chapter 2: Star Schemas - Facts and Dimensions

The star schema is the heart of dimensional modeling: one central fact table of measurements, surrounded by dimension tables of descriptive context. Drawn out, it looks like a star.

⚔️ Skills You’ll Forge in This Chapter

Identifying facts (measures) versus dimensions (context)
Defining the grain - what one fact row means
Building a star schema in SQL with surrogate keys

🏗️ Anatomy of a Star

Fact table - the numbers you measure: quantities, amounts, counts. One row per event at a defined grain (e.g., “one row per product per order”).
Dimension tables - the who, what, where, when: customers, products, dates, stores. These give your numbers meaning.
Surrogate keys - integer keys minted by the warehouse (date_key, product_key) instead of reusing source ids, so the warehouse stays stable when sources change.

-- Dimension tables: descriptive context, one row per entity.
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,   -- e.g. 20250613
    full_date  TEXT,                  -- '2025-06-13'
    year       INTEGER,
    quarter    INTEGER,
    month      INTEGER,
    day_name   TEXT
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY, -- surrogate key
    product_id   TEXT,                -- natural key from the source system
    name         TEXT,
    category     TEXT
);

CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,
    name         TEXT,
    city         TEXT
);

-- Fact table: measures + foreign keys to every dimension.
-- Grain: one row per product per order line.
CREATE TABLE fact_sales (
    sale_id       INTEGER PRIMARY KEY,
    date_key      INTEGER REFERENCES dim_date(date_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    customer_key  INTEGER REFERENCES dim_customer(customer_key),
    quantity      INTEGER,            -- additive measure
    amount        REAL                -- additive measure
);

Now the dreaded “revenue by city by quarter” is a clean, fast query:

SELECT c.city, d.year, d.quarter, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date     d ON d.date_key     = f.date_key
GROUP BY c.city, d.year, d.quarter
ORDER BY revenue DESC;

🔍 Knowledge Check: Star Schemas

What is the “grain” of fact_sales, and why must you define it first?
Why use a surrogate product_key instead of the source product_id?
Which columns are measures and which are context?

🧙‍♂️ Chapter 3: Snowflake Schemas and Slowly Changing Dimensions

Two refinements separate a working warehouse from a robust one: deciding how far to normalize dimensions, and deciding how to record history when a dimension changes.

⚔️ Skills You’ll Forge in This Chapter

Star versus snowflake normalization trade-offs
The three classic Slowly Changing Dimension (SCD) types
Implementing a Type 2 SCD that preserves full history

🏗️ Snowflake: Normalizing the Dimensions

A snowflake schema normalizes dimensions into sub-tables (e.g., dim_product → dim_category → dim_department). It saves storage and removes redundancy but adds joins and complexity.

-- Snowflaked: category extracted into its own table.
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY,
    category     TEXT,
    department   TEXT
);
-- dim_product now references dim_category instead of repeating the text.
ALTER TABLE dim_product ADD COLUMN category_key INTEGER REFERENCES dim_category(category_key);

Rule of thumb: prefer star (denormalized) for query speed and simplicity - storage is cheap and analysts love fewer joins. Reach for snowflake only when a dimension is huge, frequently reused, or governed centrally.

🏗️ Slowly Changing Dimensions

When a customer moves city, what should history show? The SCD types answer this:

Type 1 - overwrite the old value. Simple, but history is lost.
Type 2 - add a new row with effective dates, keeping the old one. Full history preserved.
Type 3 - add a “previous value” column. Limited history (one prior value).

-- Type 2 SCD: track history with effective dates and a current flag.
CREATE TABLE dim_customer_scd2 (
    customer_key   INTEGER PRIMARY KEY,  -- surrogate; a new key per version
    customer_id    TEXT,                 -- natural key, stable across versions
    name           TEXT,
    city           TEXT,
    valid_from     TEXT,
    valid_to       TEXT,                 -- NULL while current
    is_current     INTEGER               -- 1 for the active version
);

-- When customer C1 moves from Austin to Denver:
-- 1) Close the old version.
UPDATE dim_customer_scd2
SET valid_to = '2025-06-13', is_current = 0
WHERE customer_id = 'C1' AND is_current = 1;

-- 2) Insert the new current version.
INSERT INTO dim_customer_scd2 (customer_id, name, city, valid_from, valid_to, is_current)
VALUES ('C1', 'Ada Lovelace', 'Denver', '2025-06-13', NULL, 1);

Now a fact row joined on the surrogate customer_key forever reflects the city the customer lived in at the time of the sale - history stays correct.

🔍 Knowledge Check: Snowflake & SCDs

When is a snowflake schema worth the extra joins?
Which SCD type loses history, and which preserves it fully?
Why does a Type 2 SCD require a new surrogate key per version?

🧙‍♂️ Chapter 4: Columnar Storage - Why Analytics Engines Are Fast

The final secret of the warehouse is physical, not logical: analytical engines store data by column, not by row.

⚔️ Skills You’ll Forge in This Chapter

Row-oriented versus column-oriented storage
Why columnar layout accelerates aggregates and compresses well
Seeing it in action with DuckDB

🏗️ Rows vs Columns

A row store keeps (sale_id, date, product, amount) together on disk - perfect for reading one whole order. But SUM(amount) over a billion rows must then touch every column of every row. A column store keeps all amount values contiguously, so the engine reads only that column - far less I/O, and columns of similar values compress dramatically.

# columnar_demo.py — DuckDB is an embedded columnar OLAP engine
import duckdb

duckdb.sql("CREATE TABLE sales AS SELECT range AS id, random() AS amount FROM range(5_000_000)")
# Only the 'amount' column is scanned — columnar engines shine on aggregates.
print(duckdb.sql("SELECT COUNT(*), SUM(amount), AVG(amount) FROM sales"))

This is why Snowflake, BigQuery, Redshift, Parquet files, and DuckDB are all columnar: aggregate analytics over wide tables is exactly the workload columns are built for.

🔍 Knowledge Check: Columnar Storage

Why does SUM(amount) run faster on a column store than a row store?
Why do columns compress better than rows?
Name three columnar systems you might meet in production

🎮 Mastery Challenges

🟢 Novice Challenge: Build the Star

Objective: Create the fact_sales star schema with dim_date, dim_product, and dim_customer, then load a few rows.

Requirements:

One fact table and at least two dimension tables
Surrogate keys on all dimensions
A working “revenue by dimension” aggregate query

Validation: Your GROUP BY query returns sensible totals.

🟡 Intermediate Challenge: Track History

Objective: Implement a Type 2 SCD for customers and record a city change.

Requirements:

valid_from, valid_to, and is_current columns
Old version closed, new version inserted on change
A query that shows both historical and current versions

Validation: A sale before the move still maps to the old city.

🔴 Advanced Challenge: Star to Snowflake and Columnar

Objective: Snowflake the product dimension and benchmark an aggregate in DuckDB.

Requirements:

Extract category into dim_category with a foreign key
Load a multi-million-row fact table into DuckDB
Compare the aggregate query against a row-store equivalent and note the difference

Validation: You can explain the speed difference in terms of columnar I/O.

🏆 Quest Rewards & Achievements

🎖️ Badges Earned:

🏆 Citadel Architect - You designed a dimensional star schema
🌟 Dimension Keeper - You preserve history with slowly changing dimensions

🛠️ Skills Unlocked:

Dimensional Modeling - Facts, dimensions, grain, and surrogate keys
OLAP Schema Design - Star, snowflake, and columnar trade-offs

🔓 Unlocked Quests:

Apache Spark - Build these tables at petabyte scale
Data Quality Engineering - Enforce contracts on your warehouse tables

📊 Progression Points: +80 XP

🗺️ Next Steps in Your Journey

Continue the Main Story:

🎯 Apache Spark - Scale dimensional builds with distributed compute

Explore Side Adventures:

⚔️ Data Quality Engineering - Validate every dimension and fact
⚔️ Stream Processing - Feed the warehouse in real time

Character Class Recommendations

💻 Software Developer: Continue to Apache Spark
🏗️ System Engineer: Explore Stream Processing
📊 Data Scientist: Advance to Data Quality Engineering

📚 Resources

Official Documentation

DuckDB Documentation - Embedded columnar OLAP engine
SQLite SQL Reference - Used for the schema work here
Kimball Dimensional Modeling Techniques - The canonical reference

Community Resources

dbt Star Schema guidance - Modeling marts in modern warehouses
Awesome Data Engineering - Tools and reading
Snowflake vs star schema discussions - Practitioner debates

Learning Materials

The Data Warehouse Toolkit (Kimball) overview - Dimensional modeling bible
Apache Parquet format - The columnar file format behind modern lakes

🤝 Quest Completion Checklist

✅ Completed all primary objectives
✅ Built a working star schema with facts and dimensions
✅ Answered all knowledge check questions
✅ Completed at least one mastery challenge
✅ Explored the resource library
✅ Identified your next quest in the journey

🕸️ Knowledge Graph

Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.

Level hub: [[Level 1100 - Data Engineering]] Overworld: [[🏰 Overworld - Master Quest Map]] Prerequisites: [[ETL Pipeline Design: Build Scalable Data Pipelines with Python]] Unlocks: [[Apache Spark Mastery: Big Data Processing with PySpark & Scala]] · [[Data Quality Engineering: Testing, Validation & Monitoring Frameworks]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]

🎁 Rewards

80 XP

Badges

🏆 Citadel Architect - Designed a dimensional star schema
🌟 Dimension Keeper - Mastered slowly changing dimensions

Skills unlocked

🛠️ Dimensional Modeling
🧠 OLAP Schema Design

Features unlocked

Access to the Apache Spark and Data Quality quests

Unlocks

🕸️ Quest Network

Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.

Referenced by

Loading…

Layout	`quest`
Collection	`quests`
Path	`_quests/1100/data-warehousing.md`
URL	`/quests/1100/data-warehousing/`
Date	`2025-11-29`

Settings

Color Mode

Theme Skin

Background

Environment

Theme & Build

Page Location

Page Info

Source Code

Data Warehousing: Build a Dimensional Star Schema in SQL

Table of Contents

📖 The Legend Behind This Quest

🎯 Quest Objectives

Primary Objectives (Required for Quest Completion)

Secondary Objectives (Bonus Achievements)

Mastery Indicators

🗺️ Quest Prerequisites

📋 Knowledge Requirements

🛠️ System Requirements

🧠 Skill Level Indicators

🌍 Choose Your Adventure Platform

🍎 macOS Kingdom Path

🪟 Windows Empire Path

🐧 Linux Territory Path

☁️ Cloud Realms Path

🧙‍♂️ Chapter 1: OLTP vs OLAP - Two Worlds of Data

⚔️ Skills You’ll Forge in This Chapter

🏗️ The Two Workloads

🔍 Knowledge Check: OLTP vs OLAP

🧙‍♂️ Chapter 2: Star Schemas - Facts and Dimensions

⚔️ Skills You’ll Forge in This Chapter

🏗️ Anatomy of a Star

🔍 Knowledge Check: Star Schemas

🧙‍♂️ Chapter 3: Snowflake Schemas and Slowly Changing Dimensions

⚔️ Skills You’ll Forge in This Chapter

🏗️ Snowflake: Normalizing the Dimensions

🏗️ Slowly Changing Dimensions

🔍 Knowledge Check: Snowflake & SCDs

🧙‍♂️ Chapter 4: Columnar Storage - Why Analytics Engines Are Fast

⚔️ Skills You’ll Forge in This Chapter

🏗️ Rows vs Columns

🔍 Knowledge Check: Columnar Storage

🎮 Mastery Challenges

🟢 Novice Challenge: Build the Star

🟡 Intermediate Challenge: Track History

🔴 Advanced Challenge: Star to Snowflake and Columnar

🏆 Quest Rewards & Achievements

🗺️ Next Steps in Your Journey

Character Class Recommendations

📚 Resources

Official Documentation

Community Resources

Learning Materials

🤝 Quest Completion Checklist

🕸️ Knowledge Graph

🎁 Rewards

Badges

Skills unlocked

Features unlocked

Unlocks

🕸️ Quest Network

Referenced by