Model Evaluation & Hyperparameter Tuning
Building a model is only half the battle. Evaluating it properly and tuning it systematically are what separate amateur from professional ML work. In this lesson, you'll learn the metrics, diagnostics, and tuning strategies used in industry.
The Confusion Matrix
For binary classification, the confusion matrix shows the four possible outcomes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_score, recall_score, f1_score, accuracy_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTP={cm[1,1]}, FP={cm[0,1]}, FN={cm[1,0]}, TN={cm[0,0]}")

# Detailed classification report
print(f"\n{classification_report(y_test, y_pred, target_names=['malignant', 'benign'])}")
```

Precision, Recall, and F1 Score
These three metrics address different aspects of classification quality:
Precision = TP / (TP + FP) -- "Of all positive predictions, how many were correct?"
Recall = TP / (TP + FN) -- "Of all actual positives, how many were found?"
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
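As a quick sanity check, the three formulas above can be verified by hand against scikit-learn's implementations. This sketch uses a small synthetic label vector (not the breast-cancer split) so the counts are easy to follow:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Synthetic example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 3 TP, 1 FN, 1 FP, 5 TN

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# The hand-computed values match sklearn's metric functions
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert f1 == f1_score(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```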
Precision vs Recall Trade-off
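The trade-off comes from the decision threshold: most classifiers output a probability, and the default cutoff of 0.5 is just one operating point. Raising the threshold makes the model more conservative, so precision tends to rise while recall falls; lowering it does the opposite. A minimal sketch of a threshold sweep, using a self-contained copy of the pipeline from above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]

# Sweep the decision threshold instead of using the default 0.5
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred)
    r = recall_score(y_test, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Recall is guaranteed to be non-increasing as the threshold rises (raising the cutoff can only remove positive predictions); precision usually rises but is not strictly monotone.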
ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.
```python
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

# Get predicted probabilities
y_proba = pipe.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

print(f"AUC Score: {auc:.4f}")

# Show key points on the ROC curve
print(f"\n{'Threshold':>12} {'FPR':>8} {'TPR':>8}")
print("-" * 30)
for i in range(0, len(thresholds), max(1, len(thresholds) // 8)):
    print(f"{thresholds[i]:12.4f} {fpr[i]:8.4f} {tpr[i]:8.4f}")

# Find optimal threshold (closest to top-left corner)
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best_idx = np.argmin(distances)
print(f"\nOptimal threshold: {thresholds[best_idx]:.4f}")
print(f"  TPR: {tpr[best_idx]:.4f}, FPR: {fpr[best_idx]:.4f}")
```

Learning Curves: Diagnosing Bias vs Variance
Learning curves show how training and validation scores change as you add more training data. They're the best tool for diagnosing whether your model suffers from:

- High bias (underfitting): both curves converge to a low score; adding data won't help, but a more expressive model might.
- High variance (overfitting): the training score stays high while the validation score lags well below it; more data or regularization typically helps.
```python
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=50, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

print(f"{'Train Size':>12} {'Train Acc':>12} {'Val Acc':>12} {'Gap':>8}")
print("-" * 48)
for size, train_s, val_s in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    gap = train_s - val_s
    print(f"{size:12d} {train_s:12.4f} {val_s:12.4f} {gap:8.4f}")
```

The Bias-Variance Tradeoff
A high-bias model is too simple to capture the signal: training and validation accuracy are both low and close together, so more data won't help. A high-variance model memorizes noise: training accuracy is high while validation accuracy lags behind. The Gap column in the output above makes the diagnosis direct: low scores with a small gap point to bias; a large gap points to variance.
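The two failure modes can be made concrete by comparing an intentionally underfit and an intentionally overfit model. This sketch (my illustration, not part of the lesson's original code) contrasts a depth-1 decision stump with an unconstrained tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# depth=1: high bias (too simple); depth=None: high variance (fits noise)
for depth in [1, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    res = cross_validate(model, X, y, cv=5, return_train_score=True)
    train_acc = res["train_score"].mean()
    val_acc = res["test_score"].mean()
    print(f"max_depth={depth}: train={train_acc:.3f}  "
          f"val={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

The unconstrained tree reaches perfect training accuracy but a lower validation score (a large gap, i.e. variance), while the stump's scores sit close together at a lower level (bias).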
Hyperparameter Tuning Strategies
GridSearchCV (Exhaustive Search)
Tries every combination. Guaranteed to find the best combo in the grid, but slow for large grids.

RandomizedSearchCV
Samples random combinations. Much faster for large parameter spaces. Often finds a good solution with far fewer iterations.

Bayesian Optimization (Optuna, Hyperopt)
Uses past evaluations to intelligently choose the next set of parameters to try. Much more efficient than grid or random search.

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
import time

# GridSearchCV: tries every combination
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [1, 2, 5]
}

start = time.time()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
grid.fit(X_train, y_train)
grid_time = time.time() - start

print(f"GridSearchCV ({len(grid.cv_results_['params'])} combos)")
print(f"  Best params: {grid.best_params_}")
print(f"  Best CV score: {grid.best_score_:.4f}")
print(f"  Time: {grid_time:.2f}s")

# RandomizedSearchCV: samples random combinations
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 30),
    "min_samples_leaf": randint(1, 10)
}

start = time.time()
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist, n_iter=20, cv=5, scoring="accuracy",
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
rand_time = time.time() - start

print(f"\nRandomizedSearchCV (20 iterations)")
print(f"  Best params: {random_search.best_params_}")
print(f"  Best CV score: {random_search.best_score_:.4f}")
print(f"  Time: {rand_time:.2f}s")
```

Bayesian Optimization with Optuna
Optuna is a modern hyperparameter optimization framework that uses Bayesian methods to efficiently search the parameter space. Instead of blind search, it builds a probabilistic model of which parameters are likely to work well.
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Suppress Optuna's verbose output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """Optuna objective function: returns the metric to maximize."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 30),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
    }

    model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    return scores.mean()

# Run optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(f"Best trial:")
print(f"  Score: {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

# Train final model with best params
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
print(f"  Test accuracy: {best_model.score(X_test, y_test):.4f}")
```

When to Use Each Tuning Strategy
- GridSearchCV: small, discrete search spaces (a few dozen combinations) where exhaustive coverage is affordable.
- RandomizedSearchCV: larger or continuous spaces where you want a good answer quickly.
- Bayesian optimization (Optuna, Hyperopt): expensive models or large spaces, where each trial is costly and sample efficiency matters most.
Putting It All Together: The Complete Evaluation Workflow
1. Split data into train and test sets (and optionally a validation set)
2. Baseline: establish a baseline with a simple model (DummyClassifier)
3. Train your model with default parameters
4. Evaluate with appropriate metrics (accuracy, F1, AUC, depending on the problem)
5. Diagnose with learning curves and a confusion matrix
6. Tune hyperparameters with cross-validated search
7. Final evaluation on the held-out test set (only once!)
8. Report results with confidence intervals
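Steps 1-3 of the workflow can be sketched as follows, reusing the breast-cancer data from earlier in the lesson. The DummyClassifier baseline tells you the score any real model must beat before it is worth tuning at all:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.4f}")

# Step 3: real model with default parameters
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])
model.fit(X_train, y_train)
print(f"Model accuracy:    {model.score(X_test, y_test):.4f}")
```

On an imbalanced dataset the majority-class baseline can be deceptively high, which is exactly why establishing it first keeps accuracy numbers honest.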