Model Monitoring & Observability
A deployed model is a living system. Unlike traditional software, ML models silently degrade — they keep returning predictions, but those predictions become increasingly wrong as the world changes around them.
Types of Drift
Understanding drift is fundamental to monitoring ML systems:
Data Drift (Covariate Shift)
The input data distribution changes, but the relationship between inputs and outputs remains the same.
*Example*: A spam classifier trained on 2023 emails encounters 2024 emails with new slang and formats. The emails look different, but what makes something spam hasn't changed.
Concept Drift
The relationship between inputs and outputs changes — the "rules" of the problem shift.
*Example*: A fraud model learns that transactions over $5,000 are suspicious. After a policy change, the fraud threshold shifts to $10,000. The concept of "fraud" itself has changed.
Prediction Drift
The distribution of model outputs changes, even if inputs look similar.
*Example*: A model that used to predict 30% positive / 70% negative suddenly shifts to 50/50. This could indicate upstream data issues or concept drift.
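Prediction drift is often quantified with the Population Stability Index (PSI), which compares binned output (or feature) distributions between a reference window and the current window. A minimal sketch — the 0.1/0.25 thresholds are a common rule of thumb, not a universal standard:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index: sum((cur% - ref%) * ln(cur% / ref%)).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    # Clip current values into the reference range so every point lands in a bin
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero / log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
print(f"stable:  {psi(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)):.3f}")
print(f"shifted: {psi(rng.normal(0, 1, 5000), rng.normal(0.8, 1, 5000)):.3f}")
```

Unlike the KS test below, PSI has no p-value; teams alert directly on the score, which makes it easy to export as a monitoring gauge.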
The Silent Failure Problem
Traditional software fails loudly: exceptions, 500s, crashed processes. A degraded model fails silently — it keeps returning well-formed predictions with healthy latency and error rates, so infrastructure dashboards stay green while prediction quality erodes. Without ground-truth labels or drift monitoring, the damage can go unnoticed for weeks.
Detecting Drift: Statistical Tests
The most common approach is comparing the distribution of recent data against a reference (training) distribution.
Kolmogorov-Smirnov (KS) Test
Compares two distributions by measuring the maximum difference between their cumulative distribution functions (CDFs). Works for continuous features.

```python
import numpy as np
from scipy import stats

# Reference (training) distribution
np.random.seed(42)
reference_data = np.random.normal(loc=50, scale=10, size=1000)

# Scenario 1: No drift (similar distribution)
current_no_drift = np.random.normal(loc=50, scale=10, size=500)

# Scenario 2: Drift detected (mean shifted)
current_with_drift = np.random.normal(loc=58, scale=12, size=500)

# Run KS tests
stat1, p_value1 = stats.ks_2samp(reference_data, current_no_drift)
stat2, p_value2 = stats.ks_2samp(reference_data, current_with_drift)

print("=== No Drift Scenario ===")
print(f"KS Statistic: {stat1:.4f}, p-value: {p_value1:.4f}")
print(f"Drift detected: {p_value1 < 0.05}")

print("\n=== Drift Scenario ===")
print(f"KS Statistic: {stat2:.4f}, p-value: {p_value2:.4f}")
print(f"Drift detected: {p_value2 < 0.05}")

# For categorical features, use the Chi-squared test
from scipy.stats import chi2_contingency

# Reference category counts vs current
observed = np.array([[300, 200, 500],   # reference
                     [400, 100, 500]])  # current (category shift)
chi2, p_val, dof, expected = chi2_contingency(observed)
print(f"\nChi-squared test p-value: {p_val:.4f}")
print(f"Categorical drift detected: {p_val < 0.05}")
```

Evidently AI
Evidently is an open-source library purpose-built for ML monitoring. It generates rich reports and dashboards for data drift, model quality, and target drift.
```python
import numpy as np
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Reference and current datasets
reference = pd.DataFrame({
    "feature_1": np.random.normal(0, 1, 1000),
    "feature_2": np.random.normal(5, 2, 1000),
    "prediction": np.random.choice([0, 1], 1000, p=[0.7, 0.3]),
})

current = pd.DataFrame({
    "feature_1": np.random.normal(0.5, 1.2, 500),               # shifted!
    "feature_2": np.random.normal(5, 2, 500),                   # stable
    "prediction": np.random.choice([0, 1], 500, p=[0.5, 0.5]),  # shifted!
})

# Generate a data drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(reference_data=reference, current_data=current)

# Save as HTML
report.save_html("drift_report.html")

# Or get results as a dictionary
result = report.as_dict()
for feature, info in result['metrics'][0]['result']['drift_by_columns'].items():
    print(f"{feature}: drift={info['drift_detected']}, "
          f"p-value={info.get('p_value', 'N/A')}")
```

Operational Monitoring Metrics
Beyond data drift, you need to monitor the operational health of your serving infrastructure:
| Metric | What It Measures | Alert Threshold Example |
|---|---|---|
| Latency (p50, p95, p99) | Response time distribution | p99 > 200ms |
| Throughput | Requests per second | < 100 rps (under capacity) |
| Error rate | Failed predictions / total | > 1% |
| Memory usage | Model server RAM | > 85% |
| GPU utilization | GPU compute usage | < 20% (over-provisioned) |
| Queue depth | Pending requests | > 1000 (backlog) |
| Model version | Which model is serving | Mismatch with expected |
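As a quick sketch of where the latency rows come from: percentiles are computed straight from the raw per-request timings, and the p99 row feeds the alert check. The simulated latencies here are hypothetical, not from a real service:

```python
import numpy as np

# Simulated per-request latencies in seconds (hypothetical workload)
rng = np.random.default_rng(7)
latencies = rng.lognormal(mean=-3.2, sigma=0.6, size=10_000)

# p50/p95/p99 straight from the raw timings
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50 * 1000:.1f}ms  p95={p95 * 1000:.1f}ms  p99={p99 * 1000:.1f}ms")

# Check against the example alert threshold from the table (p99 > 200ms)
print(f"p99 alert firing: {p99 > 0.2}")
```

Note that tail percentiles (p99) need far more samples than the median to estimate reliably, which is why latency alerts usually evaluate over windows of minutes rather than seconds.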
Prometheus + Grafana for ML
Prometheus scrapes metrics from your serving endpoints. Grafana visualizes them as dashboards.
1from prometheus_client import (
2 Counter, Histogram, Gauge, start_http_server, Summary
3)
4import time
5import random
6
7# Define metrics
8PREDICTION_COUNTER = Counter(
9 'ml_predictions_total',
10 'Total number of predictions',
11 ['model_name', 'model_version', 'predicted_class']
12)
13
14PREDICTION_LATENCY = Histogram(
15 'ml_prediction_latency_seconds',
16 'Time to generate a prediction',
17 ['model_name'],
18 buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
19)
20
21PREDICTION_CONFIDENCE = Summary(
22 'ml_prediction_confidence',
23 'Model prediction confidence scores',
24 ['model_name']
25)
26
27DATA_DRIFT_SCORE = Gauge(
28 'ml_data_drift_score',
29 'Current data drift score (0=no drift, 1=full drift)',
30 ['model_name', 'feature_name']
31)
32
33# Start the Prometheus metrics server
34start_http_server(8000)
35
36# In your prediction endpoint:
37def predict(features):
38 start_time = time.time()
39
40 # Run model inference
41 prediction = model.predict(features)
42 predicted_class = int(prediction.argmax())
43 confidence = float(prediction.max())
44
45 # Record metrics
46 latency = time.time() - start_time
47 PREDICTION_COUNTER.labels(
48 model_name='fraud-detector',
49 model_version='v3',
50 predicted_class=str(predicted_class)
51 ).inc()
52 PREDICTION_LATENCY.labels(model_name='fraud-detector').observe(latency)
53 PREDICTION_CONFIDENCE.labels(model_name='fraud-detector').observe(confidence)
54
55 return {"class": predicted_class, "confidence": confidence}Alerting Strategies
Set up alerts at multiple levels:
1. Immediate (PagerDuty): Error rate > 5%, service down
2. Urgent (Slack): Latency p99 > 500ms, drift score > 0.7
3. Informational (Email): Weekly drift report, accuracy trend
Example Grafana Alert Rules:
```
# High latency alert
ALERT: ml_prediction_latency_seconds{quantile="0.99"} > 0.5
FOR: 5m
SEVERITY: warning

# Data drift alert
ALERT: ml_data_drift_score > 0.7
FOR: 1h
SEVERITY: critical
ACTION: trigger_retraining_pipeline
```
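The rules above use a simplified pseudo-syntax. In actual Prometheus alerting-rules YAML (assuming the metric names from the instrumentation example; the `histogram_quantile` aggregation is how p99 is derived from a `Histogram` metric), the equivalent would look roughly like:

```yaml
groups:
  - name: ml-serving
    rules:
      - alert: HighPredictionLatency
        expr: histogram_quantile(0.99, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
      - alert: DataDriftHigh
        expr: ml_data_drift_score > 0.7
        for: 1h
        labels:
          severity: critical
```

Prometheus itself only fires alerts; routing to PagerDuty, Slack, or a retraining webhook is configured separately in Alertmanager.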
Automated Retraining Triggers
Monitoring should close the loop by triggering retraining when needed:
Monitor → Detect Drift → Alert → Trigger Pipeline → Retrain → Validate → Deploy
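A drift-gated check at the head of that loop might look like the sketch below. The threshold values and trigger reasons are illustrative, not prescriptive:

```python
from typing import Optional

DRIFT_THRESHOLD = 0.7   # hypothetical, matching the drift alert example
ACCURACY_FLOOR = 0.90   # hypothetical

def should_retrain(drift_score: float, rolling_accuracy: Optional[float]) -> Optional[str]:
    """Return a human-readable trigger reason, or None if no retraining is needed."""
    if drift_score > DRIFT_THRESHOLD:
        return f"drift score {drift_score:.2f} exceeds {DRIFT_THRESHOLD}"
    # Ground-truth labels often arrive late, so accuracy may be unknown
    if rolling_accuracy is not None and rolling_accuracy < ACCURACY_FLOOR:
        return f"rolling accuracy {rolling_accuracy:.2f} below {ACCURACY_FLOOR}"
    return None

reason = should_retrain(drift_score=0.82, rolling_accuracy=0.93)
if reason:
    print(f"Triggering retraining pipeline: {reason}")
    # Here you would kick off your orchestrator (Airflow DAG, Kubeflow pipeline, ...)
```

Keeping the decision in a pure function like this makes the trigger logic easy to unit-test and audit, independent of the orchestrator that acts on it.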
Common triggers:
- A drift score on key features or predictions exceeds a threshold
- Online accuracy (or a proxy metric) drops below an agreed floor
- A fixed schedule elapses (e.g. weekly retraining regardless of drift)
- A significant batch of fresh labeled data becomes available
Shadow Deployments and Canary Releases
Shadow Deployment
Route 100% of traffic to the current model, but also send a copy of each request to the new model. Compare predictions without affecting users.

```
                  ┌──── Production Model ──── Response to User
User Request ────►│
                  └──── Shadow Model ──────── Log Only (not returned)
```
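In code, shadowing is just mirroring the request off the hot path. A minimal sketch with stand-in callables for the two models (a real service would use its framework's background-task machinery):

```python
import concurrent.futures
import logging

log = logging.getLogger("shadow")
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(features, production_model, shadow_model):
    """Serve from production; mirror the request to the shadow model off the hot path."""
    response = production_model(features)            # user-facing answer

    def _shadow():
        try:
            shadow_pred = shadow_model(features)     # logged only, never returned
            log.info("shadow=%s prod=%s", shadow_pred, response)
        except Exception:
            log.exception("shadow model failed")     # must never affect the user

    _pool.submit(_shadow)                            # fire-and-forget
    return response

# Usage with stand-in "models":
print(handle_request([1, 2], production_model=sum, shadow_model=max))  # prints 3
```

The key invariant: shadow failures and shadow latency are invisible to the caller, which is what makes this deployment pattern safe.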
Canary Deployment
Route a small percentage of real traffic to the new model. If metrics are good, gradually increase the percentage.

```
Phase 1: 95% → v1,   5% → v2  (monitor for 1 hour)
Phase 2: 75% → v1,  25% → v2  (monitor for 4 hours)
Phase 3: 50% → v1,  50% → v2  (monitor for 24 hours)
Phase 4:  0% → v1, 100% → v2  (full rollout)
```
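Percentage splits are usually handled at the load balancer or service mesh, but a deterministic in-app router is easy to sketch. Hashing a stable request or user ID keeps each caller pinned to the same model version across the phase:

```python
import hashlib

def canary_route(request_id: str, canary_pct: float) -> str:
    """Deterministically route ~canary_pct% of traffic to v2, the rest to v1."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_pct else "v1"

# Check the split over a batch of synthetic request IDs
routes = [canary_route(f"req-{i}", canary_pct=5) for i in range(10_000)]
share = routes.count("v2") / len(routes)
print(f"v2 share: {share:.1%}")  # close to 5%
```

Using a hash rather than `random.random()` means the same user never flips between versions mid-session, which keeps canary metrics clean and user experience consistent.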
Shadow is safer (no user impact) but more expensive (double compute). Canary is cheaper but carries some risk (a small percentage of users get the new model).