Pandas for ML Data Prep
Pandas is the Swiss Army knife of data manipulation in Python. Before any data reaches your model, it usually goes through Pandas. In real-world ML, you'll spend 60-80% of your time on data preparation, and Pandas is how you do it.
Why Pandas?
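The "why" is easiest to show in a few lines. Here is a toy example of my own (synthetic records, not the lesson's dataset) contrasting a plain-Python aggregation with the one-line Pandas equivalent:

```python
import pandas as pd

# Hypothetical mini-dataset for illustration
records = [
    {"city": "Austin", "income": 52000},
    {"city": "Austin", "income": 61000},
    {"city": "Chicago", "income": 58000},
]

# Plain Python: accumulate totals and counts by hand
totals, counts = {}, {}
for r in records:
    totals[r["city"]] = totals.get(r["city"], 0) + r["income"]
    counts[r["city"]] = counts.get(r["city"], 0) + 1
manual_means = {c: totals[c] / counts[c] for c in totals}

# Pandas: one readable line
df = pd.DataFrame(records)
pandas_means = df.groupby("city")["income"].mean()

print(manual_means)            # {'Austin': 56500.0, 'Chicago': 58000.0}
print(pandas_means.to_dict())  # same result
```

The same gap shows up for joins, reshaping, and missing-value handling: operations that take loops and bookkeeping in plain Python are single, well-tested method calls in Pandas.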
Loading Data
The most common way to get data into Pandas is from a CSV file:
```python
import pandas as pd
import numpy as np

# Load a CSV file
# df = pd.read_csv("data/housing.csv")

# For this lesson, let's create a realistic dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    "age": np.random.randint(18, 75, n),
    "income": np.random.normal(55000, 20000, n).round(2),
    "education": np.random.choice(["high_school", "bachelors", "masters", "phd"], n,
                                  p=[0.3, 0.4, 0.2, 0.1]),
    "city": np.random.choice(["New York", "San Francisco", "Austin", "Chicago"], n),
    "experience_years": np.random.randint(0, 30, n),
    "purchased": np.random.choice([0, 1], n, p=[0.6, 0.4]),
})

# Introduce some missing values (realistic!)
mask_income = np.random.random(n) < 0.05
mask_education = np.random.random(n) < 0.08
df.loc[mask_income, "income"] = np.nan
df.loc[mask_education, "education"] = np.nan

print(df.head())
```

First-Look Methods
Every time you load a new dataset, run these five commands immediately. They give you a complete picture of what you're working with.
```python
# 1. head() — see the first few rows
print(df.head())
# Shows the first 5 rows — a quick sanity check

# 2. shape — how big is the data?
print(df.shape)
# (200, 6) — 200 rows, 6 columns

# 3. info() — column types and null counts
df.info()  # info() prints directly and returns None, so don't wrap it in print()
# Shows each column's name, non-null count, and data type
# This is where you spot missing values and wrong types

# 4. describe() — summary statistics for numeric columns
print(df.describe())
# count, mean, std, min, 25%, 50%, 75%, max
# Look for: outliers, impossible values, unexpected ranges

# 5. isnull().sum() — count missing values per column
print(df.isnull().sum())
# Quick view: which columns have nulls and how many?
```

The Five-Command First Look
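The five-command first look is worth bundling into a small helper so every new dataset gets inspected the same way. The function name `first_look` is my own, not a Pandas API:

```python
import pandas as pd
import numpy as np

def first_look(df: pd.DataFrame) -> None:
    """Run the standard first-look checks on a freshly loaded DataFrame."""
    print(df.head())          # 1. first rows
    print(df.shape)           # 2. dimensions
    df.info()                 # 3. dtypes + non-null counts (prints directly)
    print(df.describe())      # 4. numeric summary statistics
    print(df.isnull().sum())  # 5. missing values per column

# Example usage on a tiny frame with one missing value
demo = pd.DataFrame({"a": [1, 2, np.nan], "b": ["x", "y", "z"]})
first_look(demo)
```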
Selecting and Filtering Data
Pandas gives you many ways to select subsets of your data:
```python
# Select a single column (returns a Series)
ages = df["age"]

# Select multiple columns (returns a DataFrame)
subset = df[["age", "income", "purchased"]]

# Filter rows with boolean conditions
high_income = df[df["income"] > 70000]
young_buyers = df[(df["age"] < 30) & (df["purchased"] == 1)]

# .loc — label-based selection (rows and columns; both endpoints included)
first_ten = df.loc[0:9, ["age", "income"]]

# .iloc — integer position-based selection (end excluded, like Python slices)
first_five_rows_three_cols = df.iloc[0:5, 0:3]

# Filter with isin()
target_cities = df[df["city"].isin(["Austin", "Chicago"])]

# Filter with string methods
# phd_holders = df[df["education"].str.contains("phd", na=False)]

# Value counts — distribution of a categorical column
print(df["education"].value_counts())
print(df["purchased"].value_counts(normalize=True))  # as proportions (0-1)
```

Handling Missing Values
Missing data is everywhere in real datasets. How you handle it directly impacts model performance.
```python
import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())

# Strategy 1: Drop rows with any null
df_dropped = df.dropna()
print(f"Before: {len(df)}, After: {len(df_dropped)}")

# Strategy 2: Drop rows where a specific column is null
df_no_null_target = df.dropna(subset=["purchased"])

# Strategy 3: Fill with a constant
df["education"] = df["education"].fillna("unknown")

# Strategy 4: Fill numeric columns with median (robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Strategy 5: Fill with mean
# df["income"] = df["income"].fillna(df["income"].mean())

# Strategy 6: Forward/backward fill (for time series)
# df["price"] = df["price"].ffill()  # fillna(method="ffill") is deprecated in pandas 2.x

# Verify no more nulls
print(df.isnull().sum())
```

When to Drop vs Fill
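There is no universal answer to drop vs fill, but a common heuristic is: drop rows when only a small fraction are null (little data lost), fill when dropping would discard too much. The helper below is my own sketch, and the 5% cutoff is a convention to tune, not a hard rule:

```python
import pandas as pd
import numpy as np

def drop_or_fill(df: pd.DataFrame, col: str, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows if the column's null fraction is small; otherwise fill.

    The 5% threshold is illustrative -- adjust it for your dataset.
    """
    null_frac = df[col].isnull().mean()
    if null_frac <= threshold:
        return df.dropna(subset=[col])  # cheap: few rows lost
    if pd.api.types.is_numeric_dtype(df[col]):
        return df.assign(**{col: df[col].fillna(df[col].median())})
    return df.assign(**{col: df[col].fillna("unknown")})

# 4 of 10 values null (40%) -> dropping loses too much, so we fill
demo = pd.DataFrame({"income": [50000, np.nan, 62000, np.nan, 58000,
                                np.nan, 71000, np.nan, 47000, 55000]})
cleaned = drop_or_fill(demo, "income")
print(len(cleaned), cleaned["income"].isnull().sum())  # 10 0
```

Whatever rule you use, also ask *why* the values are missing: if missingness itself is informative (e.g. income withheld by high earners), consider adding an `is_missing` indicator column before filling.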
Encoding Categorical Variables
ML models need numbers, not strings. There are two main encoding strategies:
```python
import pandas as pd
import numpy as np

# --- One-Hot Encoding (pd.get_dummies) ---
# Best for: nominal categories (no natural order)
# Example: city names, colors, product types

df_encoded = pd.get_dummies(df, columns=["city"], prefix="city", dtype=int)
print(df_encoded.head())
# Creates: city_Austin, city_Chicago, city_New York, city_San Francisco
# Each is 0 or 1 (without dtype=int, pandas >= 2.0 returns booleans)

# --- Label Encoding ---
# Best for: ordinal categories (natural order exists)
# Example: education level, size (S/M/L), satisfaction rating

education_order = {"high_school": 0, "bachelors": 1, "masters": 2, "phd": 3, "unknown": -1}
df["education_encoded"] = df["education"].map(education_order)
print(df[["education", "education_encoded"]].head(10))

# Alternative: use sklearn's LabelEncoder for arbitrary mappings
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# df["city_encoded"] = le.fit_transform(df["city"])
```

One-Hot vs Label Encoding
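To see why the distinction matters, here is a small contrast on toy data: label-encoding a nominal column like city invents an ordering (Austin < Chicago < New York) that a linear model will treat as meaningful, while one-hot encoding keeps the categories independent.

```python
import pandas as pd

demo = pd.DataFrame({"city": ["Austin", "Chicago", "New York", "Austin"]})

# One-hot: each category gets its own 0/1 column -- no implied order
one_hot = pd.get_dummies(demo, columns=["city"], prefix="city", dtype=int)
print(one_hot.columns.tolist())
# ['city_Austin', 'city_Chicago', 'city_New York']

# Label codes on the same column imply Austin < Chicago < New York,
# which is meaningless for cities -- acceptable only for truly ordinal data
codes = demo["city"].astype("category").cat.codes
print(codes.tolist())  # [0, 1, 2, 0]
```

A practical caveat: one-hot encoding a high-cardinality column (thousands of categories) explodes the feature count, which is one reason alternatives like target or frequency encoding exist.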
Feature Engineering
Feature engineering is the art of creating new columns from existing ones to help models learn better patterns.
```python
import pandas as pd
import numpy as np

# Ratio features
df["income_per_year_exp"] = df["income"] / (df["experience_years"] + 1)
# Adding 1 avoids division by zero

# Binning continuous variables
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 100],
                         labels=["young", "early_career", "mid_career", "senior"])
print(df["age_group"].value_counts())

# Interaction features
df["income_x_experience"] = df["income"] * df["experience_years"]

# Log transform (for skewed distributions)
df["log_income"] = np.log1p(df["income"])  # log1p handles 0 values safely

# Boolean features
df["is_high_income"] = (df["income"] > 70000).astype(int)
df["is_experienced"] = (df["experience_years"] > 10).astype(int)

print(df.head())
```

Preparing Data for ML
The final step: converting your cleaned, engineered DataFrame into arrays that ML models can consume.
```python
import pandas as pd
import numpy as np

# Assume df is fully cleaned and encoded at this point

# 1. Separate features (X) and target (y)
target = "purchased"
feature_cols = ["age", "income", "experience_years", "education_encoded",
                "income_per_year_exp", "is_high_income"]

X = df[feature_cols].values  # Convert to NumPy array
y = df[target].values        # Convert to NumPy array

print(f"Features shape: {X.shape}")  # (200, 6)
print(f"Target shape: {y.shape}")    # (200,)
print(f"Features dtype: {X.dtype}")  # float64
print(f"Target dtype: {y.dtype}")    # int64

# 2. Train/test split (using sklearn)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

# Manual split (without sklearn)
np.random.seed(42)
indices = np.random.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (160, 6), Test: (40, 6)

# 3. Scale features (using our StandardScaler from Lesson 1!)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # Use TRAIN stats on test!

print("Ready for ML!")
print(f"Train mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Train std: {X_train_scaled.std(axis=0).round(4)}")
```
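To see why test data must be scaled with the train statistics, consider a small sketch with synthetic numbers (my own, not the lesson's dataset). Giving the test split its own mean and std forces its scaled mean to exactly zero, hiding any genuine distribution shift and leaking test information into preprocessing:

```python
import numpy as np

rng = np.random.default_rng(0)
sim_train = rng.normal(50, 10, size=(160, 1))
sim_test = rng.normal(50, 10, size=(40, 1))

mu, sigma = sim_train.mean(axis=0), sim_train.std(axis=0)

# Correct: one fixed transformation, learned from the train split only
test_ok = (sim_test - mu) / sigma

# Leaky: the test split gets its own statistics -- its scaled mean is
# forced to exactly 0 by construction, regardless of any real shift
test_leaky = (sim_test - sim_test.mean(axis=0)) / sim_test.std(axis=0)

print(float(test_leaky.mean()))  # ~0 by construction
print(float(test_ok.mean()))     # near 0 here, but free to differ
```

The same logic applies to any fitted preprocessing step (imputation medians, encoders, scalers): fit on train, apply to test.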