Model Monitoring & Observability

Detect drift, track performance, and keep production models healthy

~45 min

A deployed model is a living system. Unlike traditional software, ML models silently degrade — they keep returning predictions, but those predictions become increasingly wrong as the world changes around them.

Types of Drift

Understanding drift is fundamental to monitoring ML systems:

Data Drift (Covariate Shift)

The input data distribution changes, but the relationship between inputs and outputs remains the same.

*Example*: A spam classifier trained on 2023 emails encounters 2024 emails with new slang and formats. The emails look different, but what makes something spam hasn't changed.

Concept Drift

The relationship between inputs and outputs changes — the "rules" of the problem shift.

*Example*: A fraud model learns that transactions over $5,000 are suspicious. After a policy change, the fraud threshold shifts to $10,000. The concept of "fraud" itself has changed.

Prediction Drift

The distribution of model outputs changes, even if inputs look similar.

*Example*: A model that used to predict 30% positive / 70% negative suddenly shifts to 50/50. This could indicate upstream data issues or concept drift.
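A common way to quantify this kind of output shift is the Population Stability Index (PSI), a standard drift statistic over binned distributions. Below is a minimal sketch applied to the 30/70 → 50/50 shift from the example above; the interpretation thresholds are widely used rules of thumb, not universal constants:

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    expected = np.asarray(expected, dtype=float) + eps  # eps avoids log(0)
    actual = np.asarray(actual, dtype=float) + eps
    expected /= expected.sum()
    actual /= actual.sum()
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Prediction distribution at training time vs. now (30/70 -> 50/50)
baseline = [0.30, 0.70]
current = [0.50, 0.50]
print(f"PSI: {psi(baseline, current):.3f}")  # ~0.169, a moderate shift
```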

The Silent Failure Problem

Traditional software fails loudly — exceptions, crashes, error codes. ML models fail silently — they keep returning predictions with high confidence even when those predictions are wrong. This is why monitoring is not optional; it's essential.

Detecting Drift: Statistical Tests

The most common approach is comparing the distribution of recent data against a reference (training) distribution.

Kolmogorov-Smirnov (KS) Test

Compares two distributions by measuring the maximum difference between their cumulative distribution functions (CDFs). Works for continuous features.

  • Null hypothesis: The two samples come from the same distribution
  • Low p-value (< 0.05): Significant drift detected
  • High p-value (>= 0.05): No significant drift

```python
import numpy as np
from scipy import stats

# Reference (training) distribution
np.random.seed(42)
reference_data = np.random.normal(loc=50, scale=10, size=1000)

# Scenario 1: No drift (similar distribution)
current_no_drift = np.random.normal(loc=50, scale=10, size=500)

# Scenario 2: Drift detected (mean shifted)
current_with_drift = np.random.normal(loc=58, scale=12, size=500)

# Run KS tests
stat1, p_value1 = stats.ks_2samp(reference_data, current_no_drift)
stat2, p_value2 = stats.ks_2samp(reference_data, current_with_drift)

print("=== No Drift Scenario ===")
print(f"KS Statistic: {stat1:.4f}, p-value: {p_value1:.4f}")
print(f"Drift detected: {p_value1 < 0.05}")

print("\n=== Drift Scenario ===")
print(f"KS Statistic: {stat2:.4f}, p-value: {p_value2:.4f}")
print(f"Drift detected: {p_value2 < 0.05}")

# For categorical features, use Chi-squared test
from scipy.stats import chi2_contingency

# Reference category counts vs current
observed = np.array([[300, 200, 500],   # reference
                     [400, 100, 500]])  # current (category shift)
chi2, p_val, dof, expected = chi2_contingency(observed)
print(f"\nChi-squared test p-value: {p_val:.4f}")
print(f"Categorical drift detected: {p_val < 0.05}")
```

Evidently AI

Evidently is an open-source library purpose-built for ML monitoring. It generates rich reports and dashboards for data drift, model quality, and target drift.

```python
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
)

# Reference and current datasets
reference = pd.DataFrame({
    "feature_1": np.random.normal(0, 1, 1000),
    "feature_2": np.random.normal(5, 2, 1000),
    "prediction": np.random.choice([0, 1], 1000, p=[0.7, 0.3]),
})

current = pd.DataFrame({
    "feature_1": np.random.normal(0.5, 1.2, 500),   # shifted!
    "feature_2": np.random.normal(5, 2, 500),        # stable
    "prediction": np.random.choice([0, 1], 500, p=[0.5, 0.5]),  # shifted!
})

# Generate a data drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(reference_data=reference, current_data=current)

# Save as HTML
report.save_html("drift_report.html")

# Or get results as a dictionary
result = report.as_dict()
for feature, info in result['metrics'][0]['result']['drift_by_columns'].items():
    print(f"{feature}: drift={info['drift_detected']}, "
          f"p-value={info.get('p_value', 'N/A')}")
```

Operational Monitoring Metrics

Beyond data drift, you need to monitor the operational health of your serving infrastructure:

| Metric | What It Measures | Alert Threshold Example |
| --- | --- | --- |
| Latency (p50, p95, p99) | Response time distribution | p99 > 200ms |
| Throughput | Requests per second | < 100 rps (under capacity) |
| Error rate | Failed predictions / total | > 1% |
| Memory usage | Model server RAM | > 85% |
| GPU utilization | GPU compute usage | < 20% (over-provisioned) |
| Queue depth | Pending requests | > 1000 (backlog) |
| Model version | Which model is serving | Mismatch with expected |

Prometheus + Grafana for ML

Prometheus scrapes metrics from your serving endpoints. Grafana visualizes them as dashboards.

```python
import time

from prometheus_client import (
    Counter, Histogram, Gauge, start_http_server, Summary
)

# Define metrics
PREDICTION_COUNTER = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_name', 'model_version', 'predicted_class']
)

PREDICTION_LATENCY = Histogram(
    'ml_prediction_latency_seconds',
    'Time to generate a prediction',
    ['model_name'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

PREDICTION_CONFIDENCE = Summary(
    'ml_prediction_confidence',
    'Model prediction confidence scores',
    ['model_name']
)

DATA_DRIFT_SCORE = Gauge(
    'ml_data_drift_score',
    'Current data drift score (0=no drift, 1=full drift)',
    ['model_name', 'feature_name']
)

# Start the Prometheus metrics server
start_http_server(8000)

# In your prediction endpoint (`model` is your loaded model object):
def predict(features):
    start_time = time.time()

    # Run model inference
    prediction = model.predict(features)
    predicted_class = int(prediction.argmax())
    confidence = float(prediction.max())

    # Record metrics
    latency = time.time() - start_time
    PREDICTION_COUNTER.labels(
        model_name='fraud-detector',
        model_version='v3',
        predicted_class=str(predicted_class)
    ).inc()
    PREDICTION_LATENCY.labels(model_name='fraud-detector').observe(latency)
    PREDICTION_CONFIDENCE.labels(model_name='fraud-detector').observe(confidence)

    return {"class": predicted_class, "confidence": confidence}
```

Alerting Strategies

Set up alerts at multiple levels:

1. Immediate (PagerDuty): Error rate > 5%, service down
2. Urgent (Slack): Latency p99 > 500ms, drift score > 0.7
3. Informational (Email): Weekly drift report, accuracy trend
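One way to wire up tiered alerting like this is a simple threshold table that maps each metric breach to a channel. The metric names, limits, and channels below are illustrative stand-ins, not a real PagerDuty or Slack integration:

```python
# Hypothetical severity routing: (metric, threshold, severity, channel)
THRESHOLDS = [
    ("error_rate",  0.05, "immediate", "pagerduty"),
    ("latency_p99", 0.5,  "urgent",    "slack"),
    ("drift_score", 0.7,  "urgent",    "slack"),
]

def route_alerts(metrics: dict) -> list[tuple[str, str]]:
    """Return (channel, message) pairs for every breached threshold."""
    alerts = []
    for name, limit, severity, channel in THRESHOLDS:
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append((channel, f"[{severity}] {name}={value} exceeds {limit}"))
    return alerts

# Error rate and drift score breach; latency is within bounds
print(route_alerts({"error_rate": 0.08, "latency_p99": 0.3, "drift_score": 0.9}))
```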

Example Grafana Alert Rules:

```
# High latency alert
ALERT: ml_prediction_latency_seconds{quantile="0.99"} > 0.5
FOR: 5m
SEVERITY: warning

# Data drift alert
ALERT: ml_data_drift_score > 0.7
FOR: 1h
SEVERITY: critical
ACTION: trigger_retraining_pipeline
```

Automated Retraining Triggers

Monitoring should close the loop by triggering retraining when needed:

```
Monitor → Detect Drift → Alert → Trigger Pipeline → Retrain → Validate → Deploy
```
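The decision step of this loop can be sketched as a periodic check; the thresholds and function name below are illustrative:

```python
# Minimal sketch of a retraining-trigger check (thresholds are examples)
DRIFT_THRESHOLD = 0.7
ACCURACY_BASELINE = 0.90

def should_retrain(drift_score: float, current_accuracy: float,
                   new_labeled_rows: int, min_rows: int = 10_000) -> bool:
    """Trigger retraining on drift, performance drop, or enough new labels."""
    if drift_score > DRIFT_THRESHOLD:        # data drift trigger
        return True
    if current_accuracy < ACCURACY_BASELINE:  # performance degradation trigger
        return True
    if new_labeled_rows >= min_rows:          # data volume trigger
        return True
    return False

print(should_retrain(drift_score=0.75, current_accuracy=0.93, new_labeled_rows=500))  # True (drift)
print(should_retrain(drift_score=0.20, current_accuracy=0.93, new_labeled_rows=500))  # False (healthy)
```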

Common triggers:

  • Data drift: Statistical drift score exceeds threshold
  • Performance degradation: Accuracy/F1 drops below baseline
  • Schedule: Weekly/monthly regardless of drift
  • Data volume: When enough new labeled data accumulates

Shadow Deployments and Canary Releases

Shadow Deployment

Route 100% of traffic to the current model, but also send a copy of each request to the new model. Compare predictions without affecting users.

```
                  ┌──── Production Model ──── Response to User
User Request ────►┤
                  └──── Shadow Model ──────── Log Only (not returned)
```

Canary Deployment

Route a small percentage of real traffic to the new model. If metrics are good, gradually increase the percentage.

```
Phase 1:  95% → v1,  5% → v2   (monitor for 1 hour)
Phase 2:  75% → v1, 25% → v2   (monitor for 4 hours)
Phase 3:  50% → v1, 50% → v2   (monitor for 24 hours)
Phase 4:   0% → v1, 100% → v2  (full rollout)
```
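Deterministic hash-based bucketing is one common way to implement the split, so the same user always sees the same version within a phase. A sketch, with an illustrative routing function:

```python
import hashlib

def canary_route(user_id: str, v2_percent: int) -> str:
    """Deterministically route a stable v2_percent slice of users to v2."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < v2_percent else "v1"

# Phase 1: roughly 5% of users get v2, and routing is stable per user
routes = [canary_route(f"user-{i}", v2_percent=5) for i in range(10_000)]
share = routes.count("v2") / len(routes)
print(f"v2 share: {share:.1%}")
```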

Shadow is safer (no user impact) but more expensive (double compute). Canary is cheaper but carries some risk (a small percentage of users get the new model).

The Monitoring Stack

A complete ML monitoring setup typically includes:

  • **Prometheus** for metrics collection
  • **Grafana** for dashboards and visualization
  • **Evidently** or **WhyLabs** for ML-specific drift detection
  • **PagerDuty/OpsGenie** for alerting
  • **MLflow** for model versioning and rollback