Vector Databases & Semantic Search

Embeddings, vector search concepts, ANN algorithms (HNSW, IVF), Pinecone/Chroma/Weaviate/pgvector comparison, metadata filtering


Vector databases are purpose-built storage systems that index and query high-dimensional vectors (embeddings). They power semantic search, recommendation systems, and most critically, Retrieval-Augmented Generation (RAG) pipelines for LLM applications.

Why Vector Databases?

Traditional databases search by exact keyword matches. Vector databases search by meaning. Instead of asking "which documents contain the word 'dog'?", you ask "which documents are about things similar to dogs?" --- and the database returns results about puppies, canines, pets, and veterinary care.

```python
# Traditional keyword search
query = "How do I fix a flat tire?"
# Only matches documents containing "fix", "flat", "tire"

# Semantic vector search
query_embedding = embed("How do I fix a flat tire?")
# Also matches: "Changing a punctured wheel", "Tire repair guide",
# "Steps to replace a blown tyre" --- same meaning, different words
```

What Are Embeddings?

Embeddings are dense numerical vectors (arrays of floats) that represent the semantic meaning of text, images, or other data. Similar meanings produce vectors that are close together in the embedding space. A sentence like 'The cat sat on the mat' might become a 1536-dimensional vector where each dimension captures some aspect of meaning.

Generating Embeddings

The first step in any vector search pipeline is converting your data into embeddings using an embedding model.

OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate an embedding for a single text string."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Single embedding
embedding = get_embedding("Vector databases are essential for RAG")
print(f"Dimensions: {len(embedding)}")  # 1536 for text-embedding-3-small

# Batch embeddings (more efficient)
texts = [
    "Python is a programming language",
    "JavaScript runs in the browser",
    "Rust focuses on memory safety",
]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [item.embedding for item in response.data]
print(f"Generated {len(embeddings)} embeddings of dimension {len(embeddings[0])}")
```

Open-Source Embeddings with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from numpy.linalg import norm

# Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions, fast
# model = SentenceTransformer("all-mpnet-base-v2")  # 768 dimensions, more accurate

# Generate embeddings
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Similar sentences should have high similarity
sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])
print(f"'weather' vs 'sunny': {sim_01:.4f}")    # ~0.80 (similar meaning)
print(f"'weather' vs 'stadium': {sim_02:.4f}")  # ~0.15 (different topics)
```

Popular Embedding Models

OpenAI text-embedding-3-small (1536d, fast, cheap) and text-embedding-3-large (3072d, most accurate) are popular commercial options. For open-source, consider all-MiniLM-L6-v2 (384d, very fast), BGE-large-en-v1.5 (1024d, top MTEB scores), and Nomic-embed-text-v1.5 (768d, long context). Check the MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard for up-to-date rankings.

Vector Similarity Metrics

Vector databases use distance/similarity metrics to find the closest vectors to a query:

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# 1. Cosine Similarity --- most common for text embeddings
# Measures the angle between vectors (ignores magnitude)
cosine_sim = np.dot(a, b) / (norm(a) * norm(b))
print(f"Cosine Similarity: {cosine_sim:.4f}")  # ~0.9989

# 2. Euclidean Distance (L2) --- good for normalized embeddings
euclidean_dist = norm(a - b)
print(f"Euclidean Distance: {euclidean_dist:.4f}")  # ~0.1732

# 3. Dot Product (Inner Product) --- fast, works when embeddings are normalized
dot_product = np.dot(a, b)
print(f"Dot Product: {dot_product:.4f}")  # ~14.00
```
| Metric | Best For | Range | Notes |
| --- | --- | --- | --- |
| Cosine Similarity | Text embeddings | [-1, 1] | Invariant to vector magnitude |
| Euclidean (L2) | Normalized embeddings | [0, inf) | Lower = more similar |
| Dot Product | Pre-normalized vectors | (-inf, inf) | Fastest computation |
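For unit-length vectors the three metrics agree on ranking: the dot product equals the cosine similarity, and squared Euclidean distance is 2 - 2 * cosine. A quick NumPy check (illustrative values, not from any real embedding model):

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# Normalize both vectors to unit length
a_n = a / norm(a)
b_n = b / norm(b)

# For unit vectors, dot product and cosine similarity are identical
cos = np.dot(a_n, b_n)
l2_sq = norm(a_n - b_n) ** 2

# Identity: ||a - b||^2 = 2 - 2 * cos(a, b) for unit vectors
print(f"cosine={cos:.4f}, l2^2={l2_sq:.4f}, 2-2cos={2 - 2 * cos:.4f}")
```

This is why many systems normalize embeddings at ingestion time and then use the cheapest metric (dot product) at query time.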

Approximate Nearest Neighbor (ANN) Algorithms

Exact nearest neighbor search is O(n) --- checking every vector. With millions of vectors, this is too slow. ANN algorithms trade a small amount of accuracy for massive speedups.
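To make the O(n) baseline concrete, here is a minimal brute-force exact search in NumPy (random data and sizes chosen purely for illustration): every query scores every stored vector before taking the top-k.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 384)).astype("float32")  # the "database"
query = rng.random(384).astype("float32")

def exact_top_k(query, vectors, k=5):
    """Exact search: score every vector, then take the top-k (O(n) per query)."""
    # Cosine similarity via normalized dot products
    v_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = v_norm @ q_norm
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return top, scores[top]

indices, scores = exact_top_k(query, vectors)
print(indices, scores)
```

At 10,000 vectors this is instant; at hundreds of millions, the full scan per query is what ANN indexes exist to avoid.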

HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm. It builds a multi-layered graph where:

  • The top layer has very few nodes (coarse navigation)
  • Each lower layer adds more nodes (finer navigation)
  • Search starts at the top and "walks" down to find the nearest neighbors
```
Layer 3 (coarsest):   A -------- D
                      |          |
Layer 2:              A --- C -- D --- F
                      |    |    |     |
Layer 1:              A-B--C--D-E--F--G
                      | |  |  | |  |  |
Layer 0 (finest):     A-B--C--D-E--F--G-H-I-J
```

Pros: Very fast queries, good recall (often 95%+)
Cons: High memory usage (stores the graph in RAM), slow index builds
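The "walk" on a single layer can be sketched as a greedy best-first search over a neighbor graph. This toy version (hypothetical 2-D points and a hand-built adjacency list, with no hierarchy, pruning, or candidate beam) illustrates the idea only and is not a real HNSW implementation:

```python
import numpy as np

# Toy data: six 2-D points and a hand-built neighbor graph (node id -> neighbor ids)
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
graph = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
         3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}

def greedy_search(query, entry=0):
    """Hop to whichever neighbor is closest to the query until none improves."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = np.linalg.norm(points[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # local minimum: no neighbor is closer
            return current, current_dist
        current, current_dist = best, best_dist

node, dist = greedy_search(np.array([1.9, 0.9]))
print(node, dist)  # converges to node 5 at (2.0, 1.0)
```

Real HNSW runs this greedy descent on the sparse upper layers to find a good entry point, then a beam search (`ef` candidates) on layer 0 for recall.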

IVF (Inverted File Index)

IVF partitions the vector space into clusters using k-means, then only searches the relevant clusters:

1. Indexing: Run k-means to create N centroids (cluster centers)
2. Querying: Find the closest centroids, then search only those clusters

```python
# Conceptual IVF (using FAISS)
import faiss
import numpy as np

dimension = 384
num_vectors = 100000
num_clusters = 256  # Number of Voronoi cells

# Create IVF index
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, num_clusters)

# Train on data (learns cluster centroids)
training_data = np.random.random((num_vectors, dimension)).astype("float32")
index.train(training_data)

# Add vectors
index.add(training_data)

# Search --- nprobe controls the accuracy/speed tradeoff
index.nprobe = 16  # Search 16 out of 256 clusters
query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, k=10)
print(f"Top 10 nearest neighbor indices: {indices[0]}")
```

Pros: Lower memory than HNSW, fast with tuned nprobe
Cons: Requires a training step, lower recall at low nprobe values

Choosing an ANN Algorithm

Use HNSW when you need the best query latency and can afford higher memory. Use IVF when working with very large datasets where memory is a concern. At very large scale the two are often combined: an HNSW graph built over the IVF centroids speeds up the coarse quantization step (in FAISS, via index factory strings such as "IVF65536_HNSW32,Flat").

Vector Database Comparison

ChromaDB --- Best for Prototyping & Local Development

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client (in-memory or persistent)
client = chromadb.Client()  # in-memory
# client = chromadb.PersistentClient(path="./chroma_db")  # persistent

# Use a sentence-transformer embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection
collection = client.create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "Python is great for data science",
        "JavaScript powers the modern web",
        "Rust ensures memory safety without a garbage collector",
        "Go is designed for concurrent programming",
    ],
    metadatas=[
        {"category": "data", "year": 2024},
        {"category": "web", "year": 2024},
        {"category": "systems", "year": 2024},
        {"category": "systems", "year": 2024},
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
)

# Query with automatic embedding
results = collection.query(
    query_texts=["Which language is best for machine learning?"],
    n_results=3,
    where={"category": "data"},  # Metadata filter
)

print(results["documents"][0])  # Retrieved documents
print(results["distances"][0])  # Distances (lower = more similar)
print(results["metadatas"][0])  # Associated metadata
```

Pinecone --- Best for Production / Managed Service

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create an index
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("my-index")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_1,  # 1536-dim list of floats
        "metadata": {"source": "wiki", "topic": "AI", "date": "2024-01-15"},
    },
    {
        "id": "doc2",
        "values": embedding_2,
        "metadata": {"source": "arxiv", "topic": "ML", "date": "2024-02-20"},
    },
])

# Query with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={
        "topic": {"$eq": "AI"},
        "date": {"$gte": "2024-01-01"},
    },
)

for match in results["matches"]:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}")
    print(f"Metadata: {match['metadata']}")
```

Weaviate --- Best for Hybrid Search (Vector + Keyword)

```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

# Create a collection with a vectorizer
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
)

# Add objects (auto-vectorized)
collection.data.insert({"content": "RAG improves LLM accuracy", "source": "paper"})

# Hybrid search (combines vector + BM25 keyword search)
results = collection.query.hybrid(
    query="retrieval augmented generation",
    alpha=0.75,  # 0 = pure keyword, 1 = pure vector
    limit=5,
    return_metadata=MetadataQuery(score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content']}")
```

pgvector --- Best for Existing PostgreSQL Users

```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    metadata JSONB,
    embedding vector(1536)  -- 1536-dimensional vector
);

-- Create an HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert a document
INSERT INTO documents (content, metadata, embedding)
VALUES (
    'RAG pipelines improve LLM accuracy',
    '{"source": "paper", "year": 2024}',
    '[0.1, 0.2, ...]'  -- 1536 floats
);

-- Semantic search with metadata filtering
SELECT content, metadata,
       1 - (embedding <=> '[0.15, 0.22, ...]') AS similarity
FROM documents
WHERE metadata->>'year' = '2024'
ORDER BY embedding <=> '[0.15, 0.22, ...]'
LIMIT 5;
```

Vector Database Comparison Summary

  • Chroma: Open-source, easy setup, great for prototyping; limited scalability.
  • Pinecone: Fully managed, scales to billions of vectors, pay-per-use; no self-hosting option.
  • Weaviate: Open-source, built-in hybrid search, GraphQL API; good for production.
  • pgvector: PostgreSQL extension, no new infrastructure needed, ACID transactions; great if you already use Postgres.
  • FAISS: Not a database (an in-memory library by Meta); fastest raw performance, no built-in metadata filtering.

Metadata Filtering

Metadata filtering lets you narrow vector search results using structured attributes. This is critical for production systems where you need to enforce access control, recency, or category constraints.

```python
# ChromaDB metadata filtering examples
collection.query(
    query_texts=["machine learning"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$eq": "research"}},
            {"year": {"$gte": 2023}},
            {"access_level": {"$in": ["public", "internal"]}},
        ]
    },
)

# Pinecone metadata filtering
index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "$and": [
            {"department": {"$eq": "engineering"}},
            {"date": {"$gte": "2024-01-01"}},
            {"status": {"$ne": "archived"}},
        ]
    },
)
```

Pre-filtering vs Post-filtering

  • Pre-filtering: Filter metadata BEFORE vector search (faster, but can reduce recall)
  • Post-filtering: Run vector search first, then filter results (better recall, but wastes compute)
  • Most modern vector databases use pre-filtering with optimized indices for best results
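The recall difference is easy to see with a brute-force toy example (made-up vectors and metadata): pre-filtering restricts the candidate set before scoring, while post-filtering scores everything and may return fewer than k matches once the filter is applied.

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((100, 8)).astype("float32")
categories = np.array(["public", "private"])[rng.integers(0, 2, 100)]
query = rng.random(8).astype("float32")

scores = vectors @ query  # dot-product similarity against every vector
k = 5

# Pre-filtering: score only vectors whose metadata matches, then take top-k
allowed = np.where(categories == "public")[0]
pre = allowed[np.argsort(-scores[allowed])[:k]]

# Post-filtering: take top-k overall, then drop non-matching results
top = np.argsort(-scores)[:k]
post = top[categories[top] == "public"]

print(f"pre-filter hits: {len(pre)}, post-filter hits: {len(post)}")
```

Pre-filtering fills all k slots as long as enough matching vectors exist; post-filtering can come back short, which is why databases that post-filter often over-fetch (e.g. request 3-5x k) before filtering.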