Model Monitoring & Observability
A deployed model is a living system. Unlike traditional software, ML models silently degrade — they keep returning predictions, but those predictions become increasingly wrong as the world changes around them.
Types of Drift
Understanding drift is fundamental to monitoring ML systems:
Data Drift (Covariate Shift)
The input data distribution changes, but the relationship between inputs and outputs remains the same.
*Example*: A spam classifier trained on 2023 emails encounters 2024 emails with new slang and formats. The emails look different, but what makes something spam hasn't changed.
Concept Drift
The relationship between inputs and outputs changes — the "rules" of the problem shift.
*Example*: A fraud model learns that transactions over $5,000 are suspicious. After a policy change, the fraud threshold shifts to $10,000. The concept of "fraud" itself has changed.
Prediction Drift
The distribution of model outputs changes, even if inputs look similar.
*Example*: A model that used to predict 30% positive / 70% negative suddenly shifts to 50/50. This could indicate upstream data issues or concept drift.
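Prediction drift is often quantified with the Population Stability Index (PSI), which compares binned output (or feature) distributions between a reference window and the current window. A minimal sketch — the 0.1/0.25 thresholds are a common rule of thumb, not a universal standard:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index: sum((cur% - ref%) * ln(cur% / ref%)).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    # Clip current values into the reference range so every point lands in a bin
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero / log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
print(f"stable:  {psi(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)):.3f}")
print(f"shifted: {psi(rng.normal(0, 1, 5000), rng.normal(0.8, 1, 5000)):.3f}")
```

Unlike the KS test below, PSI has no p-value; teams alert directly on the score, which makes it easy to export as a monitoring gauge.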
The Silent Failure Problem
Traditional software fails loudly: exceptions, 500s, crashed processes. A degraded model fails silently — it keeps returning well-formed predictions with healthy latency and error rates, so infrastructure dashboards stay green while prediction quality erodes. Without ground-truth labels or drift monitoring, the damage can go unnoticed for weeks.
Detecting Drift: Statistical Tests
The most common approach is comparing the distribution of recent data against a reference (training) distribution.
Kolmogorov-Smirnov (KS) Test
Compares two distributions by measuring the maximum difference between their cumulative distribution functions (CDFs). Works for continuous features.

```python
import numpy as np
from scipy import stats

# Reference (training) distribution
np.random.seed(42)
reference_data = np.random.normal(loc=50, scale=10, size=1000)

# Scenario 1: No drift (similar distribution)
current_no_drift = np.random.normal(loc=50, scale=10, size=500)

# Scenario 2: Drift detected (mean shifted)
current_with_drift = np.random.normal(loc=58, scale=12, size=500)

# Run KS tests
stat1, p_value1 = stats.ks_2samp(reference_data, current_no_drift)
stat2, p_value2 = stats.ks_2samp(reference_data, current_with_drift)

print("=== No Drift Scenario ===")
print(f"KS Statistic: {stat1:.4f}, p-value: {p_value1:.4f}")
print(f"Drift detected: {p_value1 < 0.05}")

print("\n=== Drift Scenario ===")
print(f"KS Statistic: {stat2:.4f}, p-value: {p_value2:.4f}")
print(f"Drift detected: {p_value2 < 0.05}")

# For categorical features, use the Chi-squared test
from scipy.stats import chi2_contingency

# Reference category counts vs current
observed = np.array([[300, 200, 500],   # reference
                     [400, 100, 500]])  # current (category shift)
chi2, p_val, dof, expected = chi2_contingency(observed)
print(f"\nChi-squared test p-value: {p_val:.4f}")
print(f"Categorical drift detected: {p_val < 0.05}")
```

Evidently AI
Evidently is an open-source library purpose-built for ML monitoring. It generates rich reports and dashboards for data drift, model quality, and target drift.
```python
import numpy as np
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Reference and current datasets
reference = pd.DataFrame({
    "feature_1": np.random.normal(0, 1, 1000),
    "feature_2": np.random.normal(5, 2, 1000),
    "prediction": np.random.choice([0, 1], 1000, p=[0.7, 0.3]),
})

current = pd.DataFrame({
    "feature_1": np.random.normal(0.5, 1.2, 500),               # shifted!
    "feature_2": np.random.normal(5, 2, 500),                   # stable
    "prediction": np.random.choice([0, 1], 500, p=[0.5, 0.5]),  # shifted!
})

# Generate a data drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(reference_data=reference, current_data=current)

# Save as HTML
report.save_html("drift_report.html")

# Or get results as a dictionary
result = report.as_dict()
for feature, info in result['metrics'][0]['result']['drift_by_columns'].items():
    print(f"{feature}: drift={info['drift_detected']}, "
          f"p-value={info.get('p_value', 'N/A')}")
```

Operational Monitoring Metrics
Beyond data drift, you need to monitor the operational health of your serving infrastructure:
| Metric | What It Measures | Alert Threshold Example |
|---|---|---|
| Latency (p50, p95, p99) | Response time distribution | p99 > 200ms |
| Throughput | Requests per second | < 100 rps (under capacity) |
| Error rate | Failed predictions / total | > 1% |
| Memory usage | Model server RAM | > 85% |
| GPU utilization | GPU compute usage | < 20% (over-provisioned) |
| Queue depth | Pending requests | > 1000 (backlog) |
| Model version | Which model is serving | Mismatch with expected |
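As a quick sketch of where the latency rows come from: percentiles are computed straight from the raw per-request timings, and the p99 row feeds the alert check. The simulated latencies here are hypothetical, not from a real service:

```python
import numpy as np

# Simulated per-request latencies in seconds (hypothetical workload)
rng = np.random.default_rng(7)
latencies = rng.lognormal(mean=-3.2, sigma=0.6, size=10_000)

# p50/p95/p99 straight from the raw timings
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50 * 1000:.1f}ms  p95={p95 * 1000:.1f}ms  p99={p99 * 1000:.1f}ms")

# Check against the example alert threshold from the table (p99 > 200ms)
print(f"p99 alert firing: {p99 > 0.2}")
```

Note that tail percentiles (p99) need far more samples than the median to estimate reliably, which is why latency alerts usually evaluate over windows of minutes rather than seconds.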
Prometheus + Grafana for ML
Prometheus scrapes metrics from your serving endpoints. Grafana visualizes them as dashboards.
1from prometheus_client import (
2 Counter, Histogram, Gauge, start_http_server, Summary
3)
4import time
5import random
6
7# Define metrics
8PREDICTION_COUNTER = Counter(
9 'ml_predictions_total',
10 'Total number of predictions',
11 ['model_name', 'model_version', 'predicted_class']
12)
13
14PREDICTION_LATENCY = Histogram(
15 'ml_prediction_latency_seconds',
16 'Time to generate a prediction',
17 ['model_name'],
18 buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
19)
20
21PREDICTION_CONFIDENCE = Summary(
22 'ml_prediction_confidence',
23 'Model prediction confidence scores',
24 ['model_name']
25)
26
27DATA_DRIFT_SCORE = Gauge(
28 'ml_data_drift_score',
29 'Current data drift score (0=no drift, 1=full drift)',
30 ['model_name', 'feature_name']
31)
32
33# Start the Prometheus metrics server
34start_http_server(8000)
35
36# In your prediction endpoint:
37def predict(features):
38 start_time = time.time()
39
40 # Run model inference
41 prediction = model.predict(features)
42 predicted_class = int(prediction.argmax())
43 confidence = float(prediction.max())
44
45 # Record metrics
46 latency = time.time() - start_time
47 PREDICTION_COUNTER.labels(
48 model_name='fraud-detector',
49 model_version='v3',
50 predicted_class=str(predicted_class)
51 ).inc()
52 PREDICTION_LATENCY.labels(model_name='fraud-detector').observe(latency)
53 PREDICTION_CONFIDENCE.labels(model_name='fraud-detector').observe(confidence)
54
55 return {"class": predicted_class, "confidence": confidence}Alerting Strategies
Set up alerts at multiple levels:
1. Immediate (PagerDuty): Error rate > 5%, service down
2. Urgent (Slack): Latency p99 > 500ms, drift score > 0.7
3. Informational (Email): Weekly drift report, accuracy trend
Example Grafana Alert Rules:
```
# High latency alert
ALERT: ml_prediction_latency_seconds{quantile="0.99"} > 0.5
FOR: 5m
SEVERITY: warning

# Data drift alert
ALERT: ml_data_drift_score > 0.7
FOR: 1h
SEVERITY: critical
ACTION: trigger_retraining_pipeline
```
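The rules above use a simplified pseudo-syntax. In actual Prometheus alerting-rules YAML (assuming the metric names from the instrumentation example; the `histogram_quantile` aggregation is how p99 is derived from a `Histogram` metric), the equivalent would look roughly like:

```yaml
groups:
  - name: ml-serving
    rules:
      - alert: HighPredictionLatency
        expr: histogram_quantile(0.99, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
      - alert: DataDriftHigh
        expr: ml_data_drift_score > 0.7
        for: 1h
        labels:
          severity: critical
```

Prometheus itself only fires alerts; routing to PagerDuty, Slack, or a retraining webhook is configured separately in Alertmanager.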
Automated Retraining Triggers
Monitoring should close the loop by triggering retraining when needed:
Monitor → Detect Drift → Alert → Trigger Pipeline → Retrain → Validate → Deploy
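A drift-gated check at the head of that loop might look like the sketch below. The threshold values and trigger reasons are illustrative, not prescriptive:

```python
from typing import Optional

DRIFT_THRESHOLD = 0.7   # hypothetical, matching the drift alert example
ACCURACY_FLOOR = 0.90   # hypothetical

def should_retrain(drift_score: float, rolling_accuracy: Optional[float]) -> Optional[str]:
    """Return a human-readable trigger reason, or None if no retraining is needed."""
    if drift_score > DRIFT_THRESHOLD:
        return f"drift score {drift_score:.2f} exceeds {DRIFT_THRESHOLD}"
    # Ground-truth labels often arrive late, so accuracy may be unknown
    if rolling_accuracy is not None and rolling_accuracy < ACCURACY_FLOOR:
        return f"rolling accuracy {rolling_accuracy:.2f} below {ACCURACY_FLOOR}"
    return None

reason = should_retrain(drift_score=0.82, rolling_accuracy=0.93)
if reason:
    print(f"Triggering retraining pipeline: {reason}")
    # Here you would kick off your orchestrator (Airflow DAG, Kubeflow pipeline, ...)
```

Keeping the decision in a pure function like this makes the trigger logic easy to unit-test and audit, independent of the orchestrator that acts on it.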
Common triggers:
- A drift score on key features or predictions exceeds a threshold
- Online accuracy (or a proxy metric) drops below an agreed floor
- A fixed schedule elapses (e.g. weekly retraining regardless of drift)
- A significant batch of fresh labeled data becomes available
Shadow Deployments and Canary Releases
Shadow Deployment
Route 100% of traffic to the current model, but also send a copy of each request to the new model. Compare predictions without affecting users.

```
                  ┌──── Production Model ──── Response to User
User Request ────►│
                  └──── Shadow Model ──────── Log Only (not returned)
```
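In code, shadowing is just mirroring the request off the hot path. A minimal sketch with stand-in callables for the two models (a real service would use its framework's background-task machinery):

```python
import concurrent.futures
import logging

log = logging.getLogger("shadow")
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(features, production_model, shadow_model):
    """Serve from production; mirror the request to the shadow model off the hot path."""
    response = production_model(features)            # user-facing answer

    def _shadow():
        try:
            shadow_pred = shadow_model(features)     # logged only, never returned
            log.info("shadow=%s prod=%s", shadow_pred, response)
        except Exception:
            log.exception("shadow model failed")     # must never affect the user

    _pool.submit(_shadow)                            # fire-and-forget
    return response

# Usage with stand-in "models":
print(handle_request([1, 2], production_model=sum, shadow_model=max))  # prints 3
```

The key invariant: shadow failures and shadow latency are invisible to the caller, which is what makes this deployment pattern safe.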
Canary Deployment
Route a small percentage of real traffic to the new model. If metrics are good, gradually increase the percentage.

```
Phase 1: 95% → v1,   5% → v2  (monitor for 1 hour)
Phase 2: 75% → v1,  25% → v2  (monitor for 4 hours)
Phase 3: 50% → v1,  50% → v2  (monitor for 24 hours)
Phase 4:  0% → v1, 100% → v2  (full rollout)
```
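Percentage splits are usually handled at the load balancer or service mesh, but a deterministic in-app router is easy to sketch. Hashing a stable request or user ID keeps each caller pinned to the same model version across the phase:

```python
import hashlib

def canary_route(request_id: str, canary_pct: float) -> str:
    """Deterministically route ~canary_pct% of traffic to v2, the rest to v1."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_pct else "v1"

# Check the split over a batch of synthetic request IDs
routes = [canary_route(f"req-{i}", canary_pct=5) for i in range(10_000)]
share = routes.count("v2") / len(routes)
print(f"v2 share: {share:.1%}")  # close to 5%
```

Using a hash rather than `random.random()` means the same user never flips between versions mid-session, which keeps canary metrics clean and user experience consistent.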
Shadow is safer (no user impact) but more expensive (double compute). Canary is cheaper but carries some risk (a small percentage of users get the new model).