
Model Evaluation & Hyperparameter Tuning

Master metrics, diagnostic tools, and advanced tuning strategies for production-ready models


Building a model is only half the battle. Evaluating it properly and tuning it systematically are what separate amateur from professional ML work. In this lesson, you'll learn the metrics, diagnostics, and tuning strategies used in industry.

The Confusion Matrix

For binary classification, the confusion matrix shows the four possible outcomes:

                     Predicted Positive     Predicted Negative
Actually Positive    True Positive (TP)     False Negative (FN)
Actually Negative    False Positive (FP)    True Negative (TN)
From these four numbers, we derive all classification metrics.

python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_score, recall_score, f1_score, accuracy_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTP={cm[1,1]}, FP={cm[0,1]}, FN={cm[1,0]}, TN={cm[0,0]}")

# Detailed classification report
print(f"\n{classification_report(y_test, y_pred, target_names=['malignant', 'benign'])}")

Precision, Recall, and F1 Score

These three metrics address different aspects of classification quality:

Precision = TP / (TP + FP) -- "Of all positive predictions, how many were correct?"

  • High precision means few false alarms
  • Optimize for precision when false positives are costly (spam filter: don't send real email to junk)

Recall = TP / (TP + FN) -- "Of all actual positives, how many were found?"

  • High recall means few missed cases
  • Optimize for recall when false negatives are costly (cancer detection: don't miss a real case)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

  • Harmonic mean of precision and recall
  • Use when you need a balance between precision and recall
  • Better than accuracy for imbalanced datasets

Precision vs Recall Trade-off

You can always increase recall by predicting positive more often (at the expense of precision), and vice versa. The key question is: what is more costly in your application -- a false positive or a false negative? For medical diagnosis, a false negative (missing cancer) is much worse than a false positive (an unnecessary biopsy). For spam filtering, a false positive (blocking a real email) may be worse than a false negative (letting some spam through).
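To make the trade-off concrete, here is a self-contained sketch that refits the same pipeline used in the confusion-matrix example and sweeps the decision threshold; the 0.1-0.9 threshold grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42)),
])
pipe.fit(X_train, y_train)

# Sweep the decision threshold: lowering it predicts positive more often,
# which raises recall at the expense of precision.
y_proba = pipe.predict_proba(X_test)[:, 1]
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t)
    r = recall_score(y_test, y_pred_t)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Note that recall is monotonically non-increasing as the threshold rises: a higher threshold can only shrink the set of positive predictions, so true positives can only be lost, never gained.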

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.

  • AUC (Area Under the Curve) summarizes the ROC curve as a single number (0 to 1)
  • AUC = 1.0: perfect classifier
  • AUC = 0.5: random guessing (the diagonal line)
  • AUC is threshold-independent -- it evaluates the model across all possible thresholds

python
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

# Get predicted probabilities
y_proba = pipe.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

print(f"AUC Score: {auc:.4f}")

# Show key points on the ROC curve
print(f"\n{'Threshold':>12} {'FPR':>8} {'TPR':>8}")
print("-" * 30)
for i in range(0, len(thresholds), max(1, len(thresholds) // 8)):
    print(f"{thresholds[i]:12.4f} {fpr[i]:8.4f} {tpr[i]:8.4f}")

# Find optimal threshold (closest to top-left corner)
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best_idx = np.argmin(distances)
print(f"\nOptimal threshold: {thresholds[best_idx]:.4f}")
print(f"  TPR: {tpr[best_idx]:.4f}, FPR: {fpr[best_idx]:.4f}")

Learning Curves: Diagnosing Bias vs Variance

Learning curves show how training and validation scores change as you add more training data. They're the best tool for diagnosing whether your model suffers from:

  • High Bias (Underfitting): Both train and validation scores are low and converge. More data won't help -- you need a more complex model.
  • High Variance (Overfitting): Train score is high but validation score is much lower. More data might help, or you need regularization.

python
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=50, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

print(f"{'Train Size':>12} {'Train Acc':>12} {'Val Acc':>12} {'Gap':>8}")
print("-" * 48)
for size, train_s, val_s in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    gap = train_s - val_s
    print(f"{size:12d} {train_s:12.4f} {val_s:12.4f} {gap:8.4f}")

The Bias-Variance Tradeoff

Bias is error from overly simple models (underfitting). Variance is error from overly complex models (overfitting). The goal is to find the sweet spot between the two. Simple models have high bias and low variance; complex models have low bias and high variance. Regularization and cross-validation help you find the right balance.
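One way to see this tradeoff numerically is a validation curve over model complexity. The standalone sketch below sweeps a decision tree's `max_depth` on the same breast-cancer data; the choice of a decision tree and this particular depth grid are illustrative, not part of the lesson's code. Shallow trees show high bias (low train and validation scores), while deep trees show high variance (a growing train/validation gap).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sweep model complexity: depth controls how closely the tree can fit
# the training data.
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy", n_jobs=-1,
)

# The train/validation gap is a direct read-out of variance.
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

As depth increases, training accuracy approaches 1.0 while the gap widens, which is the variance half of the tradeoff in miniature.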

Hyperparameter Tuning Strategies

GridSearchCV (Exhaustive Search)

Tries every combination. Guaranteed to find the best combination in the grid, but slow for large grids.

RandomizedSearchCV

Samples random combinations. Much faster for large parameter spaces. Often finds a good solution with far fewer iterations.

Bayesian Optimization (Optuna, Hyperopt)

Uses past evaluations to intelligently choose the next set of parameters to try. Much more efficient than grid or random search.

python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
import time

# GridSearchCV: tries every combination
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [1, 2, 5]
}

start = time.time()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
grid.fit(X_train, y_train)
grid_time = time.time() - start

print(f"GridSearchCV ({len(grid.cv_results_['params'])} combos)")
print(f"  Best params: {grid.best_params_}")
print(f"  Best CV score: {grid.best_score_:.4f}")
print(f"  Time: {grid_time:.2f}s")

# RandomizedSearchCV: samples random combinations
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 30),
    "min_samples_leaf": randint(1, 10)
}

start = time.time()
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist, n_iter=20, cv=5, scoring="accuracy",
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
rand_time = time.time() - start

print("\nRandomizedSearchCV (20 iterations)")
print(f"  Best params: {random_search.best_params_}")
print(f"  Best CV score: {random_search.best_score_:.4f}")
print(f"  Time: {rand_time:.2f}s")

Bayesian Optimization with Optuna

Optuna is a modern hyperparameter optimization framework that uses Bayesian methods to efficiently search the parameter space. Instead of searching blindly, it builds a probabilistic model of which parameters are likely to work well.

python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Suppress Optuna's verbose output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """Optuna objective function: returns the metric to maximize."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 30),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
    }

    model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    return scores.mean()

# Run optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print("Best trial:")
print(f"  Score: {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

# Train final model with best params
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
print(f"  Test accuracy: {best_model.score(X_test, y_test):.4f}")

When to Use Each Tuning Strategy

Use GridSearchCV when you have a small parameter space (< 100 combinations) and want to be thorough. Use RandomizedSearchCV when the space is large and you want a good-enough result quickly. Use Optuna/Bayesian optimization for large spaces where you need the best possible result efficiently. In practice, start with RandomizedSearch to narrow the range, then refine with GridSearch or Optuna.
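The coarse-then-refine recipe can be sketched as follows. This is a standalone illustration on the same dataset; the parameter ranges, the small `cv=3`/`n_iter=10` budgets, and the ±50/±2 refinement windows are arbitrary choices to keep the example fast, not recommended defaults.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: broad random search to locate a promising region.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(50, 300), "max_depth": randint(3, 30)},
    n_iter=10, cv=3, scoring="accuracy", random_state=42, n_jobs=-1,
)
coarse.fit(X, y)
best = coarse.best_params_

# Stage 2: exhaustive grid over a narrow window around the stage-1 winner.
fine_grid = {
    "n_estimators": [max(50, best["n_estimators"] - 50),
                     best["n_estimators"],
                     best["n_estimators"] + 50],
    "max_depth": [max(3, best["max_depth"] - 2),
                  best["max_depth"],
                  best["max_depth"] + 2],
}
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    fine_grid, cv=3, scoring="accuracy", n_jobs=-1,
)
fine.fit(X, y)

print(f"Coarse best:  {coarse.best_score_:.4f}  {best}")
print(f"Refined best: {fine.best_score_:.4f}  {fine.best_params_}")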

Putting It All Together: The Complete Evaluation Workflow

1. Split data into train and test sets (and optionally a validation set)
2. Baseline: Establish a baseline with a simple model (DummyClassifier)
3. Train your model with default parameters
4. Evaluate with appropriate metrics (accuracy, F1, AUC depending on the problem)
5. Diagnose with learning curves and confusion matrix
6. Tune hyperparameters with cross-validated search
7. Final evaluation on the held-out test set (only once!)
8. Report results with confidence intervals
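The first three steps of the workflow can be sketched like this. It is a standalone example; the logistic-regression pipeline mirrors the one used earlier in this lesson, and `strategy="most_frequent"` is one of several DummyClassifier strategies you could pick for a baseline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: baseline -- always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.4f}")

# Step 3: a real model with default-ish parameters must beat the baseline
# to be worth tuning at all
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42)),
])
model.fit(X_train, y_train)
print(f"Model accuracy:    {model.score(X_test, y_test):.4f}")
```

On this dataset the majority-class baseline lands around the positive-class prevalence, so any trained model that doesn't clearly beat it has learned nothing useful.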

Only Touch the Test Set Once

The test set should be used exactly once -- for your final evaluation. If you tune hyperparameters by checking test set performance repeatedly, you're implicitly overfitting to the test set. Use cross-validation on the training set for all model selection decisions, and save the test set for the very end.
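A minimal sketch of this discipline: every candidate comparison happens via cross-validation on the training split only, and the test set is scored exactly once, for the winner. The two candidate models here are illustrative choices, not a prescription.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# All model selection uses cross-validation on the TRAINING set only.
candidates = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=5000, random_state=42)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv_means = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
            for name, m in candidates.items()}
winner = max(cv_means, key=cv_means.get)
print("CV scores:", {k: round(v, 4) for k, v in cv_means.items()})

# The test set is touched exactly once, for the chosen model.
final = candidates[winner].fit(X_train, y_train)
print(f"Winner: {winner}, test accuracy: {final.score(X_test, y_test):.4f}")
```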