
Pandas for ML Data Prep

Learn to load, explore, clean, and transform data for machine learning with Pandas


Pandas is the Swiss Army knife of data manipulation in Python. Before any data reaches your model, it goes through Pandas. In real-world ML, you'll spend 60-80% of your time on data preparation — and Pandas is how you do it.

Why Pandas?

  • DataFrames: Labeled, tabular data with mixed types (like a spreadsheet)
  • I/O: Read CSV, JSON, SQL, Excel, Parquet, and more
  • Cleaning: Handle missing values, duplicates, and inconsistencies
  • Transformation: Filter, group, merge, pivot, and reshape data
  • Integration: Converts easily to NumPy arrays for ML models

Loading Data

The most common way to get data into Pandas is from a CSV file:

```python
import pandas as pd
import numpy as np

# Load a CSV file
# df = pd.read_csv("data/housing.csv")

# For this lesson, let's create a realistic dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    "age": np.random.randint(18, 75, n),
    "income": np.random.normal(55000, 20000, n).round(2),
    "education": np.random.choice(["high_school", "bachelors", "masters", "phd"], n,
                                  p=[0.3, 0.4, 0.2, 0.1]),
    "city": np.random.choice(["New York", "San Francisco", "Austin", "Chicago"], n),
    "experience_years": np.random.randint(0, 30, n),
    "purchased": np.random.choice([0, 1], n, p=[0.6, 0.4]),
})

# Introduce some missing values (realistic!)
mask_income = np.random.random(n) < 0.05
mask_education = np.random.random(n) < 0.08
df.loc[mask_income, "income"] = np.nan
df.loc[mask_education, "education"] = np.nan

print(df.head())
```
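The `pd.read_csv` call is commented out above because this lesson generates its data in memory. When you do load a real file, a few of `read_csv`'s parameters do most of the work. A minimal sketch, using an in-memory CSV via `io.StringIO` so it runs without a file on disk (the column names here are invented for illustration):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file on disk
csv_text = """age,income,city
34,51000.0,Austin
29,,Chicago
61,88000.5,New York
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),     # a path like "data/housing.csv" works the same way
    dtype={"city": "string"},  # pin a column's dtype instead of letting pandas guess
    na_values=["", "NA"],      # extra strings to treat as missing
)

print(sample_df.shape)                   # (3, 3)
print(sample_df["income"].isna().sum())  # 1 — the empty field became NaN
```

Pinning dtypes up front catches parsing surprises (like ZIP codes silently becoming integers) at load time rather than mid-pipeline.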

First-Look Methods

Every time you load a new dataset, run these five commands immediately. They give you a complete picture of what you're working with.

```python
# 1. head() — see the first few rows
print(df.head())
# Shows the first 5 rows — a quick sanity check

# 2. shape — how big is the data?
print(df.shape)
# (200, 6) — 200 rows, 6 columns

# 3. info() — column types and null counts
df.info()  # prints directly — wrapping it in print() would show "None"
# Shows each column's name, non-null count, and data type
# This is where you spot missing values and wrong types

# 4. describe() — summary statistics for numeric columns
print(df.describe())
# count, mean, std, min, 25%, 50%, 75%, max
# Look for: outliers, impossible values, unexpected ranges

# 5. isnull().sum() — count missing values per column
print(df.isnull().sum())
# Quick view: which columns have nulls and how many?
```

The Five-Command First Look

Always run these in order on a new dataset:

1. `df.head()` — see actual values
2. `df.shape` — dimensions
3. `df.info()` — types and null counts
4. `df.describe()` — summary statistics
5. `df.isnull().sum()` — missing value counts

This takes 30 seconds and prevents hours of debugging later.

Selecting and Filtering Data

Pandas gives you many ways to select subsets of your data:

```python
# Select a single column (returns a Series)
ages = df["age"]

# Select multiple columns (returns a DataFrame)
subset = df[["age", "income", "purchased"]]

# Filter rows with boolean conditions
high_income = df[df["income"] > 70000]
young_buyers = df[(df["age"] < 30) & (df["purchased"] == 1)]

# .loc — label-based selection (rows and columns)
first_ten = df.loc[0:9, ["age", "income"]]

# .iloc — integer position-based selection
first_five_rows_three_cols = df.iloc[0:5, 0:3]

# Filter with isin()
target_cities = df[df["city"].isin(["Austin", "Chicago"])]

# Filter with string methods
# phd_holders = df[df["education"].str.contains("phd", na=False)]

# Value counts — distribution of a categorical column
print(df["education"].value_counts())
print(df["purchased"].value_counts(normalize=True))  # as percentages
```
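The boolean-mask filters above can also be written with `df.query`, which some people find easier to read for compound conditions. A small standalone sketch (with its own toy DataFrame, invented here for illustration, rather than the lesson's `df`):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [22, 45, 31, 28],
    "income": [40000, 90000, 65000, 52000],
    "purchased": [1, 0, 1, 1],
})

# Boolean-mask version, as above
young_buyers_mask = toy[(toy["age"] < 30) & (toy["purchased"] == 1)]

# Equivalent query() version — column names appear directly in the expression string
young_buyers_query = toy.query("age < 30 and purchased == 1")

print(young_buyers_query)
assert young_buyers_mask.equals(young_buyers_query)  # same rows either way
```

`query` avoids the repeated `df[...]` prefixes and the easy-to-forget parentheses around each condition, at the cost of hiding the logic inside a string.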

Handling Missing Values

Missing data is everywhere in real datasets. How you handle it directly impacts model performance.

```python
import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())

# Strategy 1: Drop rows with any null
df_dropped = df.dropna()
print(f"Before: {len(df)}, After: {len(df_dropped)}")

# Strategy 2: Drop rows where a specific column is null
df_no_null_target = df.dropna(subset=["purchased"])

# Strategy 3: Fill with a constant
df["education"] = df["education"].fillna("unknown")

# Strategy 4: Fill numeric columns with median (robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Strategy 5: Fill with mean
# df["income"] = df["income"].fillna(df["income"].mean())

# Strategy 6: Forward/backward fill (for time series)
# df["price"] = df["price"].ffill()  # .fillna(method="ffill") is deprecated

# Verify no more nulls
print(df.isnull().sum())
```

When to Drop vs Fill

**Use dropna** when:

- The target column has missing values (you can't predict with unknown labels)
- The row is missing most of its values
- You have plenty of data

**Use fillna** when:

- The feature can be reasonably imputed (median for numeric, mode for categorical)
- Dropping would lose too much data
- The missingness itself might be informative (consider adding an "is_missing" column)

Never fill the target/label column — if you don't know the answer, drop the row.
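The tip above mentions that missingness itself can be informative. One common pattern is to record an indicator column *before* imputing, so the model can still see which rows were originally blank. A minimal sketch on a toy DataFrame (invented here for illustration):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({"income": [52000.0, np.nan, 71000.0, np.nan, 48000.0]})

# Record which rows were missing BEFORE imputing
demo["income_is_missing"] = demo["income"].isna().astype(int)

# Then fill with the median as usual
demo["income"] = demo["income"].fillna(demo["income"].median())

print(demo)
# income_is_missing is [0, 1, 0, 1, 0], and income has no NaNs left
```

The order matters: impute first and the indicator would be all zeros.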

Encoding Categorical Variables

ML models need numbers, not strings. There are two main encoding strategies:

```python
import pandas as pd
import numpy as np

# --- One-Hot Encoding (pd.get_dummies) ---
# Best for: nominal categories (no natural order)
# Example: city names, colors, product types

df_encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(df_encoded.head())
# Creates: city_Austin, city_Chicago, city_New York, city_San Francisco
# Each is 0 or 1

# --- Label Encoding ---
# Best for: ordinal categories (natural order exists)
# Example: education level, size (S/M/L), satisfaction rating

education_order = {"high_school": 0, "bachelors": 1, "masters": 2, "phd": 3, "unknown": -1}
df["education_encoded"] = df["education"].map(education_order)
print(df[["education", "education_encoded"]].head(10))

# Alternative: sklearn's LabelEncoder assigns integers automatically
# (note: it is designed for target labels, not ordinal features)
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# df["city_encoded"] = le.fit_transform(df["city"])
```

One-Hot vs Label Encoding

**One-Hot Encoding** (pd.get_dummies):

- Creates one binary column per category
- Use for nominal (unordered) categories
- Downside: can create many columns with high cardinality

**Label Encoding** (map/replace):

- Assigns an integer to each category
- Use for ordinal (ordered) categories
- Warning: don't use for nominal data — the model may assume 2 > 1 > 0 has meaning

Rule of thumb: if the categories have a natural order, use label encoding. Otherwise, use one-hot.
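To see the column-count behavior concretely, here is a small sketch on a toy column; `drop_first=True` is one way to shave off one redundant column per feature, since an all-zeros row implies the dropped category:

```python
import pandas as pd

cities = pd.DataFrame({"city": ["Austin", "Chicago", "Austin", "New York"]})

full = pd.get_dummies(cities, columns=["city"])
print(full.columns.tolist())
# ['city_Austin', 'city_Chicago', 'city_New York'] — one column per category

reduced = pd.get_dummies(cities, columns=["city"], drop_first=True)
print(reduced.columns.tolist())
# ['city_Chicago', 'city_New York'] — an all-zeros row now means "Austin"
```

With a genuinely high-cardinality column (say, thousands of ZIP codes), even `drop_first` won't save you; techniques like frequency or target encoding are the usual escape hatch.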

Feature Engineering

Feature engineering is the art of creating new columns from existing ones to help models learn better patterns.

```python
import pandas as pd
import numpy as np

# Ratio features
df["income_per_year_exp"] = df["income"] / (df["experience_years"] + 1)
# Adding 1 avoids division by zero

# Binning continuous variables
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 100],
                         labels=["young", "early_career", "mid_career", "senior"])
print(df["age_group"].value_counts())

# Interaction features
df["income_x_experience"] = df["income"] * df["experience_years"]

# Log transform (for skewed distributions)
df["log_income"] = np.log1p(df["income"])  # log1p handles 0 values safely

# Boolean features
df["is_high_income"] = (df["income"] > 70000).astype(int)
df["is_experienced"] = (df["experience_years"] > 10).astype(int)

print(df.head())
```
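One caveat worth noting: `pd.cut` returns a Categorical column, which is still not numeric. If you want `age_group` to go into a model, convert it to integer codes with `.cat.codes` — and because the bin labels are ordered, the codes preserve that order. A sketch on a toy column:

```python
import pandas as pd

ages = pd.DataFrame({"age": [23, 30, 44, 67]})
ages["age_group"] = pd.cut(ages["age"], bins=[0, 25, 35, 50, 100],
                           labels=["young", "early_career", "mid_career", "senior"])

# Ordered bins → ordered integer codes: 0 < 1 < 2 < 3
ages["age_group_code"] = ages["age_group"].cat.codes
print(ages)
# codes: 0 (young), 1 (early_career), 2 (mid_career), 3 (senior)
```

This is effectively the label-encoding strategy from the previous section, applied to bins instead of raw categories.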

Preparing Data for ML

The final step: converting your cleaned, engineered DataFrame into arrays that ML models can consume.

```python
import pandas as pd
import numpy as np

# Assume df is fully cleaned and encoded at this point

# 1. Separate features (X) and target (y)
target = "purchased"
feature_cols = ["age", "income", "experience_years", "education_encoded",
                "income_per_year_exp", "is_high_income"]

X = df[feature_cols].values   # Convert to NumPy array
y = df[target].values         # Convert to NumPy array

print(f"Features shape: {X.shape}")   # (200, 6)
print(f"Target shape: {y.shape}")     # (200,)
print(f"Features dtype: {X.dtype}")   # float64
print(f"Target dtype: {y.dtype}")     # int64

# 2. Train/test split (using sklearn)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

# Manual split (without sklearn)
np.random.seed(42)
indices = np.random.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (160, 6), Test: (40, 6)

# 3. Scale features (using our StandardScaler from Lesson 1!)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # Use TRAIN stats on test!

print("Ready for ML!")
print(f"Train mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Train std: {X_train_scaled.std(axis=0).round(4)}")
```

Data Leakage Warning

Always fit your scaler (and any transformations) on the **training data only**, then apply the same transformation to test data.

**Wrong:** `scaler.fit(X_all)`, then split
**Right:** Split first, then `scaler.fit(X_train)` and `scaler.transform(X_test)`

If you use statistics from the test set during training, your model has "seen" test data — this is data leakage, and your evaluation metrics will be misleadingly optimistic.
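The safe ordering can be sketched end to end with plain NumPy (no sklearn required, using randomly generated stand-in data): split first, compute statistics on the training portion only, then reuse those same statistics on the test portion:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(100.0, 15.0, size=(50, 2))  # stand-in feature matrix

# 1. Split FIRST
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# 2. Fit scaler statistics on the training portion only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 3. Apply the SAME statistics to both portions
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# The train portion is standardized exactly; the test portion only
# approximately — that's expected, and it's the honest setup
print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_test_scaled.mean(axis=0).round(2))   # close to, but not exactly, 0
```

If the test mean came out exactly zero, that would itself be a red flag: it would mean test rows influenced the scaler.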