
ML Infrastructure & Platforms

Feature stores, model registries, vector databases, and infrastructure as code for ML



Production ML systems rely on specialized infrastructure beyond standard web services. In this lesson, you'll learn about the key infrastructure components that make ML systems reliable, consistent, and scalable: feature stores, model registries, vector databases, and infrastructure as code.

Feature Stores

A feature store is a centralized system for defining, computing, storing, and serving ML features. It solves one of the most common and insidious problems in production ML: training-serving skew.

The Training-Serving Skew Problem

Training time:
  raw_data → pandas → feature_engineering.py → model.fit()

Serving time:
  raw_data → Java backend → different_feature_code → model.predict()

→ Subtle bugs! Features computed differently → degraded model performance

A feature store ensures the exact same feature definitions are used for both training and serving.
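The fix in miniature: define each feature once, as a single function, and import that same function from both the training pipeline and the serving path. A minimal sketch (the function name and fields are illustrative — in a real system a feature store plays this role):

```python
# One feature definition, imported by BOTH the training job and the serving API.
# Illustrative sketch — a feature store generalizes this idea.

def transaction_features(transactions: list[dict]) -> dict:
    """Compute features from a customer's raw transaction records."""
    amounts = [t["amount"] for t in transactions]
    return {
        "transaction_count": len(amounts),
        "avg_amount": sum(amounts) / len(amounts) if amounts else 0.0,
        "max_amount": max(amounts, default=0.0),
    }

# Training: applied to historical data
train_row = transaction_features([{"amount": 20.0}, {"amount": 80.0}])

# Serving: applied to a live request — identical logic, so no skew
serve_row = transaction_features([{"amount": 50.0}])

print(train_row["avg_amount"])        # 50.0
print(serve_row["transaction_count"])  # 1
```

Because both paths call the same code, a change to the feature definition automatically applies to training and serving together.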

Feast (Feature Store)

Feast is the most popular open-source feature store. It provides:

  • Offline store: For training — batch access to historical features (BigQuery, Parquet, etc.)
  • Online store: For serving — low-latency feature lookup (Redis, DynamoDB, etc.)
  • Feature consistency: Same definition for both stores
  • Point-in-time correct joins: No data leakage in training
```python
# feature_repo/features.py — Feast feature definitions

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define the entity (the "who" for feature lookups)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Unique customer identifier"
)

# Define a data source
customer_transactions_source = FileSource(
    path="data/customer_transactions.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp"
)

# Define a feature view (a group of related features)
customer_transaction_features = FeatureView(
    name="customer_transactions",
    entities=[customer],
    ttl=timedelta(days=90),  # Features expire after 90 days
    schema=[
        Field(name="total_transactions_30d", dtype=Int64),
        Field(name="avg_transaction_amount_30d", dtype=Float32),
        Field(name="max_transaction_amount_30d", dtype=Float32),
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="unique_merchants_30d", dtype=Int64),
    ],
    source=customer_transactions_source,
    online=True,  # Materialize to online store for serving
)
```
```python
# Using Feast for training and serving
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")

# --- OFFLINE: Get features for training ---
# Point-in-time correct join — no data leakage!
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime([
        "2024-01-15", "2024-01-15", "2024-01-15"
    ])
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_transactions:total_transactions_30d",
        "customer_transactions:avg_transaction_amount_30d",
        "customer_transactions:transaction_count_7d",
    ]
).to_df()

print("Training features:")
print(training_df)

# --- ONLINE: Get features for serving ---
# Materialize features to the online store first
store.materialize_incremental(end_date=datetime.now())

# Then retrieve for a single customer at serving time (low latency)
online_features = store.get_online_features(
    features=[
        "customer_transactions:total_transactions_30d",
        "customer_transactions:avg_transaction_amount_30d",
        "customer_transactions:transaction_count_7d",
    ],
    entity_rows=[{"customer_id": 1001}]
).to_dict()

print("\nOnline features for customer 1001:")
print(online_features)
```

Model Registries

A model registry is a central catalog for managing the lifecycle of trained models. It provides versioning, staging, approval workflows, and deployment tracking.

MLflow Model Registry

Developer trains model
      │
      ▼
Register in MLflow ──► Version 1 (None)
      │
      ▼
Promote to Staging ──► Version 1 (Staging)
      │
      ▼
Run validation tests
      │
      ▼
Promote to Production ──► Version 1 (Production)
      │
      ▼ (new model trained)
Register new version ──► Version 2 (None)
      │
      ▼
Version 1 still in Production
Version 2 in Staging for testing
    

Vertex AI Model Registry (Google Cloud)

Google's managed offering adds:

  • Endpoints: Deploy models with traffic splitting
  • Model evaluation: Built-in evaluation metrics
  • Explainability: Integrated SHAP/IG
  • Monitoring: Automatic drift detection
```python
# MLflow Model Registry — promotion workflow
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a training run
model_name = "fraud-detector"
run_id = "abc123def456"  # ID of the MLflow run that logged the model

# Register version 1
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=model_name
)
print(f"Registered {model_name} v{result.version}")

# Add description and tags
client.update_model_version(
    name=model_name,
    version=result.version,
    description="XGBoost fraud classifier trained on 2024-Q1 data"
)
client.set_model_version_tag(
    name=model_name, version=result.version,
    key="training_dataset", value="fraud_2024_q1"
)
client.set_model_version_tag(
    name=model_name, version=result.version,
    key="accuracy", value="0.956"
)

# Promote through stages
client.transition_model_version_stage(
    name=model_name, version=result.version, stage="Staging"
)
print(f"Moved v{result.version} to Staging")

# After validation...
client.transition_model_version_stage(
    name=model_name, version=result.version, stage="Production"
)
print(f"Promoted v{result.version} to Production!")

# Load the production model for serving
prod_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
prediction = prod_model.predict(test_features)  # test_features: a DataFrame of input features
```

Vector Databases

A vector database stores and indexes high-dimensional vectors (embeddings) for fast similarity search. They are essential for:

  • Semantic search (find documents by meaning, not keywords)
  • Retrieval-Augmented Generation (RAG for LLMs)
  • Recommendation systems
  • Image/audio similarity search

When and Why to Use a Vector Database

| Use Case | Why Vectors? |
|---|---|
| Semantic search | Query "budget travel tips" matches "affordable vacation ideas" |
| RAG (LLM context) | Find relevant documents to include in an LLM prompt |
| Recommendations | Find items with similar embeddings to user preferences |
| Deduplication | Find near-duplicate documents or images |
| Anomaly detection | Find data points far from any cluster |
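To make "similarity" concrete: each text is mapped to a vector, and nearness is usually measured by cosine similarity. A minimal sketch with hand-made toy vectors (real embeddings have hundreds of dimensions and come from a model such as a sentence transformer):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" — two travel docs point one way, a tax doc another
docs = {
    "affordable vacation ideas": [0.9, 0.1, 0.0],
    "budget travel tips":        [0.8, 0.2, 0.1],
    "quarterly tax filing":      [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of a travel-related query
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # both travel docs rank above the tax doc
```

A vector database does exactly this ranking, but over millions of vectors, using approximate-nearest-neighbor indexes so queries stay fast.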

Popular Vector Databases

| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production RAG, scale without ops |
| Chroma | Open-source, lightweight | Prototyping, small-medium scale |
| pgvector | PostgreSQL extension | Already using Postgres, moderate scale |
| Weaviate | Open-source, full-featured | Hybrid search (vector + keyword) |
| Qdrant | Open-source, high-performance | Large-scale, filtering + search |

```python
# --- Chroma: Lightweight vector database ---
import chromadb

client = chromadb.Client()

# Create a collection
collection = client.create_collection(
    name="documents",
    metadata={"description": "Government policy documents"}
)

# Add documents (Chroma auto-embeds with a default model)
collection.add(
    documents=[
        "The housing assistance program provides subsidies for low-income families.",
        "Veterans are eligible for enhanced healthcare benefits.",
        "The SNAP program provides food assistance to qualifying households.",
        "Section 8 vouchers help families afford rental housing.",
        "Medicare covers hospital stays and medical services for seniors.",
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    metadatas=[
        {"department": "housing", "year": 2024},
        {"department": "veterans", "year": 2024},
        {"department": "agriculture", "year": 2024},
        {"department": "housing", "year": 2024},
        {"department": "health", "year": 2024},
    ]
)

# Semantic search — finds relevant documents by meaning
results = collection.query(
    query_texts=["affordable housing for families"],
    n_results=3
)

print("Query: 'affordable housing for families'")
for doc, dist in zip(results['documents'][0], results['distances'][0]):
    print(f"  [{dist:.3f}] {doc}")

# Filter + search
results = collection.query(
    query_texts=["healthcare coverage"],
    n_results=2,
    where={"department": "health"}
)
```
```python
# --- pgvector: Vector search in PostgreSQL ---
# SQL to set up pgvector
"""
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(384),   -- 384-dim embeddings
    department TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create an index for fast similarity search
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Insert a document with its embedding
INSERT INTO documents (content, embedding, department)
VALUES (
    'Housing assistance for low-income families',
    '[0.1, 0.2, 0.3, ...]'::vector,
    'housing'
);

-- Similarity search (cosine distance)
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
"""

# Python with psycopg2 and pgvector
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mydb")
register_vector(conn)

cur = conn.cursor()

# Search for similar documents
# (`model` is an embedding model producing 384-dim vectors,
#  e.g. a sentence-transformers model)
query_embedding = model.encode("affordable housing programs")
cur.execute(
    "SELECT content, 1 - (embedding <=> %s) AS similarity "
    "FROM documents ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding, query_embedding)
)

for content, similarity in cur.fetchall():
    print(f"[{similarity:.3f}] {content}")
```

Metadata Stores

A metadata store tracks the lineage and provenance of all artifacts in your ML system:

  • Which dataset was used to train which model?
  • Which features were computed from which raw data?
  • Which model version is currently serving?
  • Who approved the deployment?

ML Metadata (MLMD), used by TFX, is the most common open-source metadata store. It tracks:

  • Artifacts: Datasets, models, metrics, schemas
  • Executions: Training runs, evaluation runs, transformations
  • Contexts: Experiments, pipelines, projects
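The core idea is simple enough to sketch without MLMD: record artifacts and the executions that connect them, then walk the links to answer lineage questions. A toy in-memory version (the record shapes here are illustrative, not MLMD's actual API — MLMD persists these in a database with a typed schema):

```python
# Toy lineage tracker — illustrates what a metadata store records.
artifacts = {}    # id -> {"type": ..., "name": ...}
executions = []   # {"type": ..., "inputs": [...], "outputs": [...]}

def log_artifact(aid: str, atype: str, name: str) -> str:
    artifacts[aid] = {"type": atype, "name": name}
    return aid

def log_execution(etype: str, inputs: list[str], outputs: list[str]) -> None:
    executions.append({"type": etype, "inputs": inputs, "outputs": outputs})

# A training run links a dataset artifact to a model artifact
log_artifact("ds1", "Dataset", "fraud_2024_q1.parquet")
log_artifact("m1", "Model", "fraud-detector-v1")
log_execution("TrainingRun", inputs=["ds1"], outputs=["m1"])

def lineage_of(aid: str) -> list[str]:
    """Which artifacts fed into producing this one?"""
    return [i for e in executions if aid in e["outputs"] for i in e["inputs"]]

print(lineage_of("m1"))  # ['ds1'] — the model traces back to its training dataset
```

"Which dataset trained this model?" becomes a graph walk over these links, which is exactly the query a real metadata store answers at scale.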
Infrastructure as Code for ML

Treat ML infrastructure like software infrastructure — define it in code, version it, and automate it.

```python
# Terraform-style infrastructure as code for ML (pseudocode)
# In practice, this would be HCL (.tf files) or Pulumi (Python)

ml_infrastructure = {
    "feature_store": {
        "provider": "feast",
        "offline_store": {"type": "bigquery", "project": "my-project"},
        "online_store": {"type": "redis", "host": "redis.internal:6379"},
        "registry": {"type": "gcs", "path": "gs://ml-registry/feast/"},
    },
    "model_registry": {
        "provider": "mlflow",
        "backend_store": "postgresql://mlflow:pass@db:5432/mlflow",
        "artifact_store": "gs://ml-artifacts/mlflow/",
    },
    "serving": {
        "provider": "kubernetes",
        "gpu_type": "nvidia-t4",
        "min_replicas": 2,
        "max_replicas": 20,
        "autoscaling": {
            "target_cpu": 70,
            "target_latency_ms": 200,
        },
    },
    "monitoring": {
        "prometheus": {"retention_days": 30},
        "grafana": {"dashboards": ["model-performance", "data-drift"]},
        "evidently": {"drift_check_schedule": "0 * * * *"},  # Hourly
    },
    "vector_database": {
        "provider": "pgvector",
        "host": "postgres.internal:5432",
        "index_type": "ivfflat",
        "dimensions": 384,
    },
}

# In a real project, this would be:
# terraform apply
# or
# pulumi up
```
Start Simple, Scale Later

You don't need all of this infrastructure on day one. Start with:

1. **MLflow** for experiment tracking and model registry
2. **PostgreSQL + pgvector** for vector search (if you already have Postgres)
3. **Simple feature pipeline** in Python

Add Feast, Kubernetes, and dedicated vector DBs when scale demands it.