
Pandas for ML Data Prep

Learn to load, explore, clean, and transform data for machine learning with Pandas


Pandas is the Swiss Army knife of data manipulation in Python. Before any data reaches your model, it goes through Pandas. In real-world ML, you'll spend 60-80% of your time on data preparation — and Pandas is how you do it.

Why Pandas?

  • DataFrames: Labeled, tabular data with mixed types (like a spreadsheet)
  • I/O: Read CSV, JSON, SQL, Excel, Parquet, and more
  • Cleaning: Handle missing values, duplicates, and inconsistencies
  • Transformation: Filter, group, merge, pivot, and reshape data
  • Integration: Converts easily to NumPy arrays for ML models

Loading Data

The most common way to get data into Pandas is from a CSV file:

```python
import pandas as pd
import numpy as np

# Load a CSV file
# df = pd.read_csv("data/housing.csv")

# For this lesson, let's create a realistic dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    "age": np.random.randint(18, 75, n),
    "income": np.random.normal(55000, 20000, n).round(2),
    "education": np.random.choice(["high_school", "bachelors", "masters", "phd"], n,
                                  p=[0.3, 0.4, 0.2, 0.1]),
    "city": np.random.choice(["New York", "San Francisco", "Austin", "Chicago"], n),
    "experience_years": np.random.randint(0, 30, n),
    "purchased": np.random.choice([0, 1], n, p=[0.6, 0.4]),
})

# Introduce some missing values (realistic!)
mask_income = np.random.random(n) < 0.05
mask_education = np.random.random(n) < 0.08
df.loc[mask_income, "income"] = np.nan
df.loc[mask_education, "education"] = np.nan

print(df.head())
```
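The `pd.read_csv` call is commented out above because this lesson generates its data in memory. When you do load a real file, a few of `read_csv`'s parameters do most of the work. A minimal sketch, using an in-memory CSV via `io.StringIO` so it runs without a file on disk (the column names here are invented for illustration):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file on disk
csv_text = """age,income,city
34,51000.0,Austin
29,,Chicago
61,88000.5,New York
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),     # a path like "data/housing.csv" works the same way
    dtype={"city": "string"},  # pin a column's dtype instead of letting pandas guess
    na_values=["", "NA"],      # extra strings to treat as missing
)

print(sample_df.shape)                   # (3, 3)
print(sample_df["income"].isna().sum())  # 1 — the empty field became NaN
```

Pinning dtypes up front catches parsing surprises (like ZIP codes silently becoming integers) at load time rather than mid-pipeline.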

First-Look Methods

Every time you load a new dataset, run these five commands immediately. They give you a complete picture of what you're working with.

```python
# 1. head() — see the first few rows
print(df.head())
# Shows the first 5 rows — a quick sanity check

# 2. shape — how big is the data?
print(df.shape)
# (200, 6) — 200 rows, 6 columns

# 3. info() — column types and null counts
df.info()  # prints directly — wrapping it in print() would show "None"
# Shows each column's name, non-null count, and data type
# This is where you spot missing values and wrong types

# 4. describe() — summary statistics for numeric columns
print(df.describe())
# count, mean, std, min, 25%, 50%, 75%, max
# Look for: outliers, impossible values, unexpected ranges

# 5. isnull().sum() — count missing values per column
print(df.isnull().sum())
# Quick view: which columns have nulls and how many?
```

The Five-Command First Look

Always run these in order on a new dataset:

1. `df.head()` — see actual values
2. `df.shape` — dimensions
3. `df.info()` — types and null counts
4. `df.describe()` — summary statistics
5. `df.isnull().sum()` — missing value counts

This takes 30 seconds and prevents hours of debugging later.

Selecting and Filtering Data

Pandas gives you many ways to select subsets of your data:

```python
# Select a single column (returns a Series)
ages = df["age"]

# Select multiple columns (returns a DataFrame)
subset = df[["age", "income", "purchased"]]

# Filter rows with boolean conditions
high_income = df[df["income"] > 70000]
young_buyers = df[(df["age"] < 30) & (df["purchased"] == 1)]

# .loc — label-based selection (rows and columns)
first_ten = df.loc[0:9, ["age", "income"]]

# .iloc — integer position-based selection
first_five_rows_three_cols = df.iloc[0:5, 0:3]

# Filter with isin()
target_cities = df[df["city"].isin(["Austin", "Chicago"])]

# Filter with string methods
# phd_holders = df[df["education"].str.contains("phd", na=False)]

# Value counts — distribution of a categorical column
print(df["education"].value_counts())
print(df["purchased"].value_counts(normalize=True))  # as percentages
```
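The boolean-mask filters above can also be written with `df.query`, which some people find easier to read for compound conditions. A small standalone sketch (with its own toy DataFrame, invented here for illustration, rather than the lesson's `df`):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [22, 45, 31, 28],
    "income": [40000, 90000, 65000, 52000],
    "purchased": [1, 0, 1, 1],
})

# Boolean-mask version, as above
young_buyers_mask = toy[(toy["age"] < 30) & (toy["purchased"] == 1)]

# Equivalent query() version — column names appear directly in the expression string
young_buyers_query = toy.query("age < 30 and purchased == 1")

print(young_buyers_query)
assert young_buyers_mask.equals(young_buyers_query)  # same rows either way
```

`query` avoids the repeated `df[...]` prefixes and the easy-to-forget parentheses around each condition, at the cost of hiding the logic inside a string.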

Handling Missing Values

Missing data is everywhere in real datasets. How you handle it directly impacts model performance.

```python
import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())

# Strategy 1: Drop rows with any null
df_dropped = df.dropna()
print(f"Before: {len(df)}, After: {len(df_dropped)}")

# Strategy 2: Drop rows where a specific column is null
df_no_null_target = df.dropna(subset=["purchased"])

# Strategy 3: Fill with a constant
df["education"] = df["education"].fillna("unknown")

# Strategy 4: Fill numeric columns with median (robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Strategy 5: Fill with mean
# df["income"] = df["income"].fillna(df["income"].mean())

# Strategy 6: Forward/backward fill (for time series)
# df["price"] = df["price"].ffill()  # .fillna(method="ffill") is deprecated

# Verify no more nulls
print(df.isnull().sum())
```

When to Drop vs Fill

**Use dropna** when:

- The target column has missing values (you can't predict with unknown labels)
- The row is missing most of its values
- You have plenty of data

**Use fillna** when:

- The feature can be reasonably imputed (median for numeric, mode for categorical)
- Dropping would lose too much data
- The missingness itself might be informative (consider adding an "is_missing" column)

Never fill the target/label column — if you don't know the answer, drop the row.
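The tip above mentions that missingness itself can be informative. One common pattern is to record an indicator column *before* imputing, so the model can still see which rows were originally blank. A minimal sketch on a toy DataFrame (invented here for illustration):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({"income": [52000.0, np.nan, 71000.0, np.nan, 48000.0]})

# Record which rows were missing BEFORE imputing
demo["income_is_missing"] = demo["income"].isna().astype(int)

# Then fill with the median as usual
demo["income"] = demo["income"].fillna(demo["income"].median())

print(demo)
# income_is_missing is [0, 1, 0, 1, 0], and income has no NaNs left
```

The order matters: impute first and the indicator would be all zeros.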

Encoding Categorical Variables

ML models need numbers, not strings. There are two main encoding strategies:

```python
import pandas as pd
import numpy as np

# --- One-Hot Encoding (pd.get_dummies) ---
# Best for: nominal categories (no natural order)
# Example: city names, colors, product types

df_encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(df_encoded.head())
# Creates: city_Austin, city_Chicago, city_New York, city_San Francisco
# Each is 0 or 1

# --- Label Encoding ---
# Best for: ordinal categories (natural order exists)
# Example: education level, size (S/M/L), satisfaction rating

education_order = {"high_school": 0, "bachelors": 1, "masters": 2, "phd": 3, "unknown": -1}
df["education_encoded"] = df["education"].map(education_order)
print(df[["education", "education_encoded"]].head(10))

# Alternative: sklearn's LabelEncoder assigns integers automatically
# (note: it is designed for target labels, not ordinal features)
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# df["city_encoded"] = le.fit_transform(df["city"])
```

One-Hot vs Label Encoding

**One-Hot Encoding** (pd.get_dummies):

- Creates one binary column per category
- Use for nominal (unordered) categories
- Downside: can create many columns with high cardinality

**Label Encoding** (map/replace):

- Assigns an integer to each category
- Use for ordinal (ordered) categories
- Warning: don't use for nominal data — the model may assume 2 > 1 > 0 has meaning

Rule of thumb: if the categories have a natural order, use label encoding. Otherwise, use one-hot.
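To see the column-count behavior concretely, here is a small sketch on a toy column; `drop_first=True` is one way to shave off one redundant column per feature, since an all-zeros row implies the dropped category:

```python
import pandas as pd

cities = pd.DataFrame({"city": ["Austin", "Chicago", "Austin", "New York"]})

full = pd.get_dummies(cities, columns=["city"])
print(full.columns.tolist())
# ['city_Austin', 'city_Chicago', 'city_New York'] — one column per category

reduced = pd.get_dummies(cities, columns=["city"], drop_first=True)
print(reduced.columns.tolist())
# ['city_Chicago', 'city_New York'] — an all-zeros row now means "Austin"
```

With a genuinely high-cardinality column (say, thousands of ZIP codes), even `drop_first` won't save you; techniques like frequency or target encoding are the usual escape hatch.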

Feature Engineering

Feature engineering is the art of creating new columns from existing ones to help models learn better patterns.

```python
import pandas as pd
import numpy as np

# Ratio features
df["income_per_year_exp"] = df["income"] / (df["experience_years"] + 1)
# Adding 1 avoids division by zero

# Binning continuous variables
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 100],
                         labels=["young", "early_career", "mid_career", "senior"])
print(df["age_group"].value_counts())

# Interaction features
df["income_x_experience"] = df["income"] * df["experience_years"]

# Log transform (for skewed distributions)
df["log_income"] = np.log1p(df["income"])  # log1p handles 0 values safely

# Boolean features
df["is_high_income"] = (df["income"] > 70000).astype(int)
df["is_experienced"] = (df["experience_years"] > 10).astype(int)

print(df.head())
```
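One caveat worth noting: `pd.cut` returns a Categorical column, which is still not numeric. If you want `age_group` to go into a model, convert it to integer codes with `.cat.codes` — and because the bin labels are ordered, the codes preserve that order. A sketch on a toy column:

```python
import pandas as pd

ages = pd.DataFrame({"age": [23, 30, 44, 67]})
ages["age_group"] = pd.cut(ages["age"], bins=[0, 25, 35, 50, 100],
                           labels=["young", "early_career", "mid_career", "senior"])

# Ordered bins → ordered integer codes: 0 < 1 < 2 < 3
ages["age_group_code"] = ages["age_group"].cat.codes
print(ages)
# codes: 0 (young), 1 (early_career), 2 (mid_career), 3 (senior)
```

This is effectively the label-encoding strategy from the previous section, applied to bins instead of raw categories.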

Preparing Data for ML

The final step: converting your cleaned, engineered DataFrame into arrays that ML models can consume.

```python
import pandas as pd
import numpy as np

# Assume df is fully cleaned and encoded at this point

# 1. Separate features (X) and target (y)
target = "purchased"
feature_cols = ["age", "income", "experience_years", "education_encoded",
                "income_per_year_exp", "is_high_income"]

X = df[feature_cols].values   # Convert to NumPy array
y = df[target].values         # Convert to NumPy array

print(f"Features shape: {X.shape}")   # (200, 6)
print(f"Target shape: {y.shape}")     # (200,)
print(f"Features dtype: {X.dtype}")   # float64
print(f"Target dtype: {y.dtype}")     # int64

# 2. Train/test split (using sklearn)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

# Manual split (without sklearn)
np.random.seed(42)
indices = np.random.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (160, 6), Test: (40, 6)

# 3. Scale features (using our StandardScaler from Lesson 1!)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # Use TRAIN stats on test!

print("Ready for ML!")
print(f"Train mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Train std: {X_train_scaled.std(axis=0).round(4)}")
```

Data Leakage Warning

Always fit your scaler (and any transformations) on the **training data only**, then apply the same transformation to test data.

**Wrong:** `scaler.fit(X_all)`, then split
**Right:** Split first, then `scaler.fit(X_train)` and `scaler.transform(X_test)`

If you use statistics from the test set during training, your model has "seen" test data — this is data leakage, and your evaluation metrics will be misleadingly optimistic.
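The safe ordering can be sketched end to end with plain NumPy (no sklearn required, using randomly generated stand-in data): split first, compute statistics on the training portion only, then reuse those same statistics on the test portion:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(100.0, 15.0, size=(50, 2))  # stand-in feature matrix

# 1. Split FIRST
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# 2. Fit scaler statistics on the training portion only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 3. Apply the SAME statistics to both portions
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# The train portion is standardized exactly; the test portion only
# approximately — that's expected, and it's the honest setup
print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_test_scaled.mean(axis=0).round(2))   # close to, but not exactly, 0
```

If the test mean came out exactly zero, that would itself be a red flag: it would mean test rows influenced the scaler.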