
Model Evaluation & Hyperparameter Tuning

Master metrics, diagnostic tools, and advanced tuning strategies for production-ready models


Building a model is only half the battle. Evaluating it properly and tuning it systematically are what separate amateur from professional ML work. In this lesson, you'll learn the metrics, diagnostics, and tuning strategies used in industry.

The Confusion Matrix

For binary classification, the confusion matrix shows the four possible outcomes:

                     Predicted Positive     Predicted Negative
Actually Positive    True Positive (TP)     False Negative (FN)
Actually Negative    False Positive (FP)    True Negative (TN)
From these four numbers, we derive all classification metrics.

python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_score, recall_score, f1_score, accuracy_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTP={cm[1,1]}, FP={cm[0,1]}, FN={cm[1,0]}, TN={cm[0,0]}")

# Detailed classification report
print(f"\n{classification_report(y_test, y_pred, target_names=['malignant', 'benign'])}")

Precision, Recall, and F1 Score

These three metrics address different aspects of classification quality:

Precision = TP / (TP + FP) -- "Of all positive predictions, how many were correct?"

  • High precision means few false alarms
  • Optimize for precision when false positives are costly (spam filter: don't send real email to junk)

Recall = TP / (TP + FN) -- "Of all actual positives, how many were found?"

  • High recall means few missed cases
  • Optimize for recall when false negatives are costly (cancer detection: don't miss a real case)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

  • Harmonic mean of precision and recall
  • Use when you need a balance between precision and recall
  • Better than accuracy for imbalanced datasets

Precision vs Recall Trade-off

You can always increase recall by predicting positive more often (at the expense of precision), and vice versa. The key question is: what is more costly in your application -- a false positive or a false negative? For medical diagnosis, a false negative (missing cancer) is much worse than a false positive (an unnecessary biopsy). For spam filtering, a false positive (blocking a real email) may be worse than a false negative (letting some spam through).
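To make the trade-off concrete, here is a self-contained sketch that refits the same pipeline used in the confusion-matrix example and sweeps the decision threshold; the 0.1-0.9 threshold grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42)),
])
pipe.fit(X_train, y_train)

# Sweep the decision threshold: lowering it predicts positive more often,
# which raises recall at the expense of precision.
y_proba = pipe.predict_proba(X_test)[:, 1]
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t)
    r = recall_score(y_test, y_pred_t)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Note that recall is monotonically non-increasing as the threshold rises: a higher threshold can only shrink the set of positive predictions, so true positives can only be lost, never gained.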

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.

  • AUC (Area Under the Curve) summarizes the ROC curve as a single number (0 to 1)
  • AUC = 1.0: perfect classifier
  • AUC = 0.5: random guessing (the diagonal line)
  • AUC is threshold-independent -- it evaluates the model across all possible thresholds

python
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

# Get predicted probabilities
y_proba = pipe.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

print(f"AUC Score: {auc:.4f}")

# Show key points on the ROC curve
print(f"\n{'Threshold':>12} {'FPR':>8} {'TPR':>8}")
print("-" * 30)
for i in range(0, len(thresholds), max(1, len(thresholds) // 8)):
    print(f"{thresholds[i]:12.4f} {fpr[i]:8.4f} {tpr[i]:8.4f}")

# Find optimal threshold (closest to top-left corner)
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best_idx = np.argmin(distances)
print(f"\nOptimal threshold: {thresholds[best_idx]:.4f}")
print(f"  TPR: {tpr[best_idx]:.4f}, FPR: {fpr[best_idx]:.4f}")

Learning Curves: Diagnosing Bias vs Variance

Learning curves show how training and validation scores change as you add more training data. They're the best tool for diagnosing whether your model suffers from:

  • High Bias (Underfitting): Both train and validation scores are low and converge. More data won't help -- you need a more complex model.
  • High Variance (Overfitting): Train score is high but validation score is much lower. More data might help, or you need regularization.

python
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=50, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

print(f"{'Train Size':>12} {'Train Acc':>12} {'Val Acc':>12} {'Gap':>8}")
print("-" * 48)
for size, train_s, val_s in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    gap = train_s - val_s
    print(f"{size:12d} {train_s:12.4f} {val_s:12.4f} {gap:8.4f}")

The Bias-Variance Tradeoff

Bias is error from overly simple models (underfitting). Variance is error from overly complex models (overfitting). The goal is to find the sweet spot between the two. Simple models have high bias and low variance; complex models have low bias and high variance. Regularization and cross-validation help you find the right balance.
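One way to see this tradeoff numerically is a validation curve over model complexity. The standalone sketch below sweeps a decision tree's `max_depth` on the same breast-cancer data; the choice of a decision tree and this particular depth grid are illustrative, not part of the lesson's code. Shallow trees show high bias (low train and validation scores), while deep trees show high variance (a growing train/validation gap).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sweep model complexity: depth controls how closely the tree can fit
# the training data.
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy", n_jobs=-1,
)

# The train/validation gap is a direct read-out of variance.
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

As depth increases, training accuracy approaches 1.0 while the gap widens, which is the variance half of the tradeoff in miniature.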

Hyperparameter Tuning Strategies

GridSearchCV (Exhaustive Search)

Tries every combination. Guaranteed to find the best combination in the grid, but slow for large grids.

RandomizedSearchCV

Samples random combinations. Much faster for large parameter spaces. Often finds a good solution with far fewer iterations.

Bayesian Optimization (Optuna, Hyperopt)

Uses past evaluations to intelligently choose the next set of parameters to try. Much more efficient than grid or random search.

python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
import time

# GridSearchCV: tries every combination
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [1, 2, 5]
}

start = time.time()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
grid.fit(X_train, y_train)
grid_time = time.time() - start

print(f"GridSearchCV ({len(grid.cv_results_['params'])} combos)")
print(f"  Best params: {grid.best_params_}")
print(f"  Best CV score: {grid.best_score_:.4f}")
print(f"  Time: {grid_time:.2f}s")

# RandomizedSearchCV: samples random combinations
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 30),
    "min_samples_leaf": randint(1, 10)
}

start = time.time()
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist, n_iter=20, cv=5, scoring="accuracy",
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
rand_time = time.time() - start

print("\nRandomizedSearchCV (20 iterations)")
print(f"  Best params: {random_search.best_params_}")
print(f"  Best CV score: {random_search.best_score_:.4f}")
print(f"  Time: {rand_time:.2f}s")

Bayesian Optimization with Optuna

Optuna is a modern hyperparameter optimization framework that uses Bayesian methods to efficiently search the parameter space. Instead of searching blindly, it builds a probabilistic model of which parameters are likely to work well.

python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Suppress Optuna's verbose output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """Optuna objective function: returns the metric to maximize."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 30),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
    }

    model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    return scores.mean()

# Run optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print("Best trial:")
print(f"  Score: {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

# Train final model with best params
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
print(f"  Test accuracy: {best_model.score(X_test, y_test):.4f}")

When to Use Each Tuning Strategy

Use GridSearchCV when you have a small parameter space (< 100 combinations) and want to be thorough. Use RandomizedSearchCV when the space is large and you want a good-enough result quickly. Use Optuna/Bayesian optimization for large spaces where you need the best possible result efficiently. In practice, start with RandomizedSearch to narrow the range, then refine with GridSearch or Optuna.
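The coarse-then-refine recipe can be sketched as follows. This is a standalone illustration on the same dataset; the parameter ranges, the small `cv=3`/`n_iter=10` budgets, and the ±50/±2 refinement windows are arbitrary choices to keep the example fast, not recommended defaults.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: broad random search to locate a promising region.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(50, 300), "max_depth": randint(3, 30)},
    n_iter=10, cv=3, scoring="accuracy", random_state=42, n_jobs=-1,
)
coarse.fit(X, y)
best = coarse.best_params_

# Stage 2: exhaustive grid over a narrow window around the stage-1 winner.
fine_grid = {
    "n_estimators": [max(50, best["n_estimators"] - 50),
                     best["n_estimators"],
                     best["n_estimators"] + 50],
    "max_depth": [max(3, best["max_depth"] - 2),
                  best["max_depth"],
                  best["max_depth"] + 2],
}
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    fine_grid, cv=3, scoring="accuracy", n_jobs=-1,
)
fine.fit(X, y)

print(f"Coarse best:  {coarse.best_score_:.4f}  {best}")
print(f"Refined best: {fine.best_score_:.4f}  {fine.best_params_}")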

Putting It All Together: The Complete Evaluation Workflow

1. Split data into train and test sets (and optionally a validation set)
2. Baseline: Establish a baseline with a simple model (DummyClassifier)
3. Train your model with default parameters
4. Evaluate with appropriate metrics (accuracy, F1, AUC depending on the problem)
5. Diagnose with learning curves and confusion matrix
6. Tune hyperparameters with cross-validated search
7. Final evaluation on the held-out test set (only once!)
8. Report results with confidence intervals
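The first three steps of the workflow can be sketched like this. It is a standalone example; the logistic-regression pipeline mirrors the one used earlier in this lesson, and `strategy="most_frequent"` is one of several DummyClassifier strategies you could pick for a baseline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: baseline -- always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.4f}")

# Step 3: a real model with default-ish parameters must beat the baseline
# to be worth tuning at all
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, random_state=42)),
])
model.fit(X_train, y_train)
print(f"Model accuracy:    {model.score(X_test, y_test):.4f}")
```

On this dataset the majority-class baseline lands around the positive-class prevalence, so any trained model that doesn't clearly beat it has learned nothing useful.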

Only Touch the Test Set Once

The test set should be used exactly once -- for your final evaluation. If you tune hyperparameters by checking test set performance repeatedly, you're implicitly overfitting to the test set. Use cross-validation on the training set for all model selection decisions, and save the test set for the very end.
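A minimal sketch of this discipline: every candidate comparison happens via cross-validation on the training split only, and the test set is scored exactly once, for the winner. The two candidate models here are illustrative choices, not a prescription.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# All model selection uses cross-validation on the TRAINING set only.
candidates = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=5000, random_state=42)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv_means = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
            for name, m in candidates.items()}
winner = max(cv_means, key=cv_means.get)
print("CV scores:", {k: round(v, 4) for k, v in cv_means.items()})

# The test set is touched exactly once, for the chosen model.
final = candidates[winner].fit(X_train, y_train)
print(f"Winner: {winner}, test accuracy: {final.score(X_test, y_test):.4f}")
```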