Python for Data Science: NumPy, Pandas & Matplotlib
Master NumPy, Pandas, Matplotlib, and EDA for Python data science workflows
Greetings, brave adventurer! Before the Oracle can teach a machine to learn, you must first master the Data Artisan’s Toolkit. Raw data arrives messy, missing, and mute. This quest, Python for Data Science, forges the three legendary instruments every practitioner carries: NumPy for raw numerical power, pandas for shaping tabular data, and Matplotlib for making data speak in pictures.
Whether you have never touched a DataFrame or you wrangle spreadsheets daily and want the real tools, this adventure will turn chaos into clarity.
📖 The Legend Behind This Quest
In the early days, analysts toiled over numbers one cell at a time. Then the artisans of the scientific Python guild forged NumPy - arrays that compute on millions of values at once through vectorization. Upon that foundation rose pandas, which gave data a name, an index, and a shape. With Matplotlib, the numbers finally became visible. Together they form the bedrock beneath every machine learning model you will ever train.
Learn these tools well, and the messiest dataset becomes a story you can read.
🎯 Quest Objectives
By the time you complete this journey, you will have mastered:
Primary Objectives (Required for Quest Completion)
- NumPy Arrays - Create, index, and compute on arrays using vectorized operations
- Pandas DataFrames - Load, filter, group, and aggregate tabular data
- Exploratory Data Analysis - Summarize a dataset and handle missing values deliberately
- Visualization - Build a clearly labeled chart that answers a real question
Secondary Objectives (Bonus Achievements)
- GroupBy Aggregations - Split, apply, and combine to compute group statistics
- Merging Data - Join two DataFrames on a shared key
- First scikit-learn Model - Feed a cleaned DataFrame into a simple estimator
Mastery Indicators
You’ll know you’ve truly mastered this quest when you can:
- Explain why a vectorized NumPy operation beats a Python loop
- Decide whether to drop or impute missing values for a given column
- Produce a chart that a non-technical person can read at a glance
- Troubleshoot a
SettingWithCopyWarningwithout panic
🗺️ Quest Prerequisites
📋 Knowledge Requirements
- Basic Python: variables, functions, loops, lists, and dictionaries
- Familiarity with importing modules
- No prior data science experience required
🛠️ System Requirements
- Modern operating system (Windows 10+, macOS 10.14+, or Linux)
- Python 3.10 or newer installed and on your PATH
- A Jupyter environment or VS Code with the Python extension
- Internet connection for installing packages
🧠 Skill Level Indicators
This 🟡 Medium quest expects:
- You can write and run a small Python program
- You are comfortable experimenting in a notebook or REPL
- Ready for 4-5 hours of hands-on learning
🌍 Choose Your Adventure Platform
Create an isolated environment so your data tools stay tidy.
🍎 macOS Kingdom Path
Click to expand macOS instructions
```bash python3 -m venv ~/ds-quest && source ~/ds-quest/bin/activate pip install --upgrade pip pip install numpy pandas matplotlib seaborn scikit-learn jupyter python -c "import pandas as pd; print('pandas', pd.__version__)" ```🪟 Windows Empire Path
Click to expand Windows instructions
```powershell python -m venv $HOME\ds-quest & $HOME\ds-quest\Scripts\Activate.ps1 pip install --upgrade pip pip install numpy pandas matplotlib seaborn scikit-learn jupyter python -c "import pandas as pd; print('pandas', pd.__version__)" ```🐧 Linux Territory Path
Click to expand Linux instructions
```bash sudo apt update && sudo apt install -y python3-venv python3-pip python3 -m venv ~/ds-quest && source ~/ds-quest/bin/activate pip install --upgrade pip pip install numpy pandas matplotlib seaborn scikit-learn jupyter python -c "import pandas as pd; print('pandas', pd.__version__)" ```☁️ Cloud Realms Path
Click to expand Cloud/Container instructions
```bash # Google Colab and most data science images ship these preinstalled. pip install "numpy>=1.26" "pandas>=2.2" matplotlib seaborn scikit-learn ```🧙♂️ Chapter 1: NumPy - The Engine of Vectorized Power
NumPy is the foundation. Its ndarray stores numbers contiguously in memory and operates on the whole array at once - this is vectorization, and it is dramatically faster than Python loops.
⚔️ Skills You’ll Forge in This Chapter
- Creating and reshaping arrays
- Vectorized arithmetic and broadcasting
- Boolean masking to filter data
🏗️ Arrays in Action
import numpy as np
# Create arrays
a = np.array([1, 2, 3, 4, 5])
grid = np.arange(12).reshape(3, 4) # 3 rows, 4 columns
# Vectorized math: one operation applies to every element
print(a * 2) # [ 2 4 6 8 10] — no loop needed
print(a.mean(), a.std())
# Broadcasting: a scalar (or smaller array) stretches across a larger one
print(grid + 100)
# Boolean masking: select elements that satisfy a condition
print(a[a > 2]) # [3 4 5]
A loop summing a million numbers in pure Python is slow; arr.sum() runs the same work in optimized C. Vectorize whenever you can.
🔍 Knowledge Check: NumPy
- What does
reshape(3, 4)require about the original array size? - Why is
a * 2faster than a Pythonforloop? - What does
a[a > 2]return?
⚡ Quick Wins and Checkpoints
- Imported NumPy:
import numpy as npworks - Vectorized a calculation: Replaced a loop with an array operation
🧙♂️ Chapter 2: Pandas - Giving Data a Name and Shape
A pandas DataFrame is a labeled table: rows have an index, columns have names. It is the workhorse of data science.
⚔️ Skills You’ll Forge in This Chapter
- Loading data into a DataFrame
- Selecting, filtering, and creating columns
- Grouping and aggregating
🏗️ DataFrames in Action
import pandas as pd
from sklearn.datasets import load_iris
# Build a DataFrame from a real dataset
data = load_iris(as_frame=True)
df = data.frame # features + target as named columns
df["species"] = df["target"].map(dict(enumerate(data.target_names)))
# Inspect
print(df.head())
print(df.shape) # (rows, columns)
print(df.describe()) # summary statistics per column
# Filter rows and select columns
big_petals = df[df["petal length (cm)"] > 4.0]
print(big_petals[["species", "petal length (cm)"]].head())
# Group and aggregate: average measurements per species
summary = df.groupby("species").mean(numeric_only=True)
print(summary)
The split-apply-combine pattern (groupby) is the workhorse of analysis in pandas: split rows into groups, apply a function to each, and combine the results into one table.
🔍 Knowledge Check: Pandas
- What does
df.describe()compute? - How do you select all rows where a column exceeds a threshold?
- What three steps does
groupbyperform?
🧙♂️ Chapter 3: Exploratory Data Analysis and Cleaning
Real data is dirty. EDA is the detective work of understanding a dataset before modeling: its shape, its distributions, and its gaps.
⚔️ Skills You’ll Forge in This Chapter
- Detecting and handling missing values
- Spotting outliers and distributions
- Deciding to drop versus impute
🏗️ Cleaning in Action
import pandas as pd
import numpy as np
# Simulate a messy dataset
df = pd.DataFrame({
"age": [25, 32, np.nan, 47, 51, np.nan, 29],
"income": [40000, 52000, 61000, np.nan, 88000, 47000, 50000],
"city": ["NYC", "LA", "NYC", "SF", np.nan, "LA", "SF"],
})
# 1. Find missing values
print(df.isna().sum())
# 2. Impute numeric columns with the median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
# 3. Fill categorical with the most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df.isna().sum()) # all zeros now
print(df)
When more than half a column is missing, dropping it often beats imputing fabricated values. Always document the choice - silent imputation hides assumptions.
🔍 Knowledge Check: EDA
- Why prefer the median over the mean when imputing skewed data?
- When might you drop a column instead of filling it?
- How do you count missing values per column?
🧙♂️ Chapter 4: Visualization - Making Data Speak
A chart turns a table into an insight. Matplotlib is the foundational plotting library; every label and axis you add increases its honesty.
⚔️ Skills You’ll Forge in This Chapter
- Building line, bar, scatter, and histogram plots
- Labeling axes and titles for clarity
🏗️ Charts in Action
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
df = data.frame
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Histogram of one feature
ax[0].hist(df["sepal length (cm)"], bins=15, color="steelblue", edgecolor="white")
ax[0].set_title("Distribution of Sepal Length")
ax[0].set_xlabel("Sepal length (cm)")
ax[0].set_ylabel("Count")
# Scatter colored by species
scatter = ax[1].scatter(
df["petal length (cm)"], df["petal width (cm)"],
c=df["target"], cmap="viridis"
)
ax[1].set_title("Petal Length vs. Width")
ax[1].set_xlabel("Petal length (cm)")
ax[1].set_ylabel("Petal width (cm)")
plt.tight_layout()
plt.savefig("iris_eda.png", dpi=120) # or plt.show() in a notebook
print("Saved iris_eda.png")
A chart without axis labels is a riddle. Always title it and label both axes.
🔍 Knowledge Check: Visualization
- Which plot reveals the distribution of a single variable?
- Which plot reveals the relationship between two variables?
- Why must every axis carry a label?
🎮 Mastery Challenges
🟢 Novice Challenge: Summarize a Dataset
Objective: Load a built-in dataset and produce a one-screen summary.
Requirements:
- Load
load_wine(as_frame=True).frameinto a DataFrame - Print
.shape,.head(), and.describe() - Report which feature has the largest spread (std)
Validation: Your script prints the summary without errors.
🟡 Intermediate Challenge: Clean and Group
Objective: Handle missing data and compute group statistics.
Requirements:
- Inject some
np.nanvalues into a copy of the wine DataFrame - Impute them with a documented strategy
- Group by the target class and report mean alcohol content
Validation: No missing values remain and the grouped table prints.
🔴 Advanced Challenge: Full EDA + First Model
Objective: Connect data wrangling to machine learning.
Requirements:
- Clean a dataset and visualize at least two features
- Split into features
Xand targety - Fit a
LogisticRegressionand print its.score()on a test split
Validation: A trained model reports an accuracy on held-out data.
🏆 Quest Rewards & Achievements
🎖️ Badges Earned:
- 🏆 Data Artisan - You wrangled, cleaned, and visualized real data
- 📊 Insight Seeker - You completed a full exploratory data analysis
🛠️ Skills Unlocked:
- NumPy & Pandas Mastery - Vectorized computing and tabular wrangling
- Matplotlib Visualization - Turning numbers into readable pictures
🔓 Unlocked Quests:
- Machine Learning Fundamentals - Feed your clean data into real models
- Neural Networks Deep Dive - Build the machinery behind deep learning
📊 Progression Points: +60 XP
🗺️ Next Steps in Your Journey
Continue the Main Story:
- 🎯 Machine Learning Fundamentals - Train your first models
Explore Side Adventures:
- ⚔️ Computer Vision - Work with image data
- ⚔️ Natural Language Processing - Work with text data
Character Class Recommendations
💻 Software Developer: Continue to Machine Learning Fundamentals
🏗️ System Engineer: Explore MLOps Engineering
📊 Data Scientist: Advance to Machine Learning Fundamentals
📚 Resources
Official Documentation
- NumPy User Guide - The array library reference
- Pandas User Guide - DataFrame operations
- Matplotlib Tutorials - Plotting fundamentals
Community Resources
- Kaggle Learn: Pandas - Hands-on micro-course
- Stack Overflow: pandas tag - Q&A
- 10 Minutes to Pandas - Quick start
Learning Materials
- Python Data Science Handbook - Free online book
- Seaborn Gallery - Statistical visualization examples
🤝 Quest Completion Checklist
- ✅ Completed all primary objectives
- ✅ Cleaned and visualized a real dataset
- ✅ Answered all knowledge check questions
- ✅ Completed at least one mastery challenge
- ✅ Explored the resource library
- ✅ Identified your next quest in the journey
🕸️ Knowledge Graph
Structured wiki-links connect this quest to the IT-Journey knowledge graph. Open the Obsidian Graph View to explore connections.
Level hub: [[Level 1101 - Machine Learning & AI]] Overworld: [[🏰 Overworld - Master Quest Map]] Unlocks: [[Machine Learning Fundamentals: Supervised & Unsupervised Learning with Scikit-Learn]] Obsidian docs: [[Obsidian Knowledge Graph and Wiki Links]]
🎁 Rewards
Badges
- 🏆 Data Artisan - Wrangled, cleaned, and visualized real data
- 📊 Insight Seeker - Completed a full exploratory data analysis
Skills unlocked
- 🛠️ NumPy & Pandas Mastery
- 📈 Matplotlib Visualization
Features unlocked
- Readiness for the Machine Learning Fundamentals quest
🕸️ Quest Network
Click a node to open the quest · ⌘/Ctrl-click for a new tab · drag to reposition · scroll to zoom.
Referenced by
- Loading…