Vector Databases & Semantic Search

Embeddings, vector search concepts, ANN algorithms (HNSW, IVF), Pinecone/Chroma/Weaviate/pgvector comparison, metadata filtering


Vector databases are purpose-built storage systems that index and query high-dimensional vectors (embeddings). They power semantic search, recommendation systems, and most critically, Retrieval-Augmented Generation (RAG) pipelines for LLM applications.

Why Vector Databases?

Traditional databases search by exact keyword matches. Vector databases search by meaning. Instead of asking "which documents contain the word 'dog'?", you ask "which documents are about things similar to dogs?" --- and the database returns results about puppies, canines, pets, and veterinary care.

```python
# Traditional keyword search
query = "How do I fix a flat tire?"
# Only matches documents containing "fix", "flat", "tire"

# Semantic vector search
query_embedding = embed("How do I fix a flat tire?")
# Also matches: "Changing a punctured wheel", "Tire repair guide",
# "Steps to replace a blown tyre" --- same meaning, different words
```

What Are Embeddings?

Embeddings are dense numerical vectors (arrays of floats) that represent the semantic meaning of text, images, or other data. Similar meanings produce vectors that are close together in the embedding space. A sentence like 'The cat sat on the mat' might become a 1536-dimensional vector where each dimension captures some aspect of meaning.

Generating Embeddings

The first step in any vector search pipeline is converting your data into embeddings using an embedding model.

OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate an embedding for a single text string."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Single embedding
embedding = get_embedding("Vector databases are essential for RAG")
print(f"Dimensions: {len(embedding)}")  # 1536 for text-embedding-3-small

# Batch embeddings (more efficient)
texts = [
    "Python is a programming language",
    "JavaScript runs in the browser",
    "Rust focuses on memory safety",
]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [item.embedding for item in response.data]
print(f"Generated {len(embeddings)} embeddings of dimension {len(embeddings[0])}")
```

Open-Source Embeddings with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from numpy.linalg import norm

# Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions, fast
# model = SentenceTransformer("all-mpnet-base-v2")  # 768 dimensions, more accurate

# Generate embeddings
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Similar sentences should have high similarity
sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])
print(f"'weather' vs 'sunny': {sim_01:.4f}")    # ~0.80 (similar meaning)
print(f"'weather' vs 'stadium': {sim_02:.4f}")  # ~0.15 (different topics)
```

Popular Embedding Models

OpenAI text-embedding-3-small (1536d, fast, cheap) and text-embedding-3-large (3072d, most accurate) are popular commercial options. For open-source, consider all-MiniLM-L6-v2 (384d, very fast), BGE-large-en-v1.5 (1024d, top MTEB scores), and Nomic-embed-text-v1.5 (768d, long context). Check the MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard for up-to-date rankings.

Vector Similarity Metrics

Vector databases use distance/similarity metrics to find the closest vectors to a query:

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# 1. Cosine Similarity --- most common for text embeddings
# Measures the angle between vectors (ignores magnitude)
cosine_sim = np.dot(a, b) / (norm(a) * norm(b))
print(f"Cosine Similarity: {cosine_sim:.4f}")  # ~0.9989

# 2. Euclidean Distance (L2) --- good for normalized embeddings
euclidean_dist = norm(a - b)
print(f"Euclidean Distance: {euclidean_dist:.4f}")  # ~0.1732

# 3. Dot Product (Inner Product) --- fast, works when embeddings are normalized
dot_product = np.dot(a, b)
print(f"Dot Product: {dot_product:.4f}")  # ~14.00
```
| Metric | Best For | Range | Notes |
| --- | --- | --- | --- |
| Cosine Similarity | Text embeddings | [-1, 1] | Invariant to vector magnitude |
| Euclidean (L2) | Normalized embeddings | [0, inf) | Lower = more similar |
| Dot Product | Pre-normalized vectors | (-inf, inf) | Fastest computation |
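For unit-length vectors the three metrics agree on ranking: the dot product equals the cosine similarity, and squared Euclidean distance is 2 - 2 * cosine. A quick NumPy check (illustrative values, not from any real embedding model):

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# Normalize both vectors to unit length
a_n = a / norm(a)
b_n = b / norm(b)

# For unit vectors, dot product and cosine similarity are identical
cos = np.dot(a_n, b_n)
l2_sq = norm(a_n - b_n) ** 2

# Identity: ||a - b||^2 = 2 - 2 * cos(a, b) for unit vectors
print(f"cosine={cos:.4f}, l2^2={l2_sq:.4f}, 2-2cos={2 - 2 * cos:.4f}")
```

This is why many systems normalize embeddings at ingestion time and then use the cheapest metric (dot product) at query time.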

Approximate Nearest Neighbor (ANN) Algorithms

Exact nearest neighbor search is O(n) --- checking every vector. With millions of vectors, this is too slow. ANN algorithms trade a small amount of accuracy for massive speedups.
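To make the O(n) baseline concrete, here is a minimal brute-force exact search in NumPy (random data and sizes chosen purely for illustration): every query scores every stored vector before taking the top-k.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 384)).astype("float32")  # the "database"
query = rng.random(384).astype("float32")

def exact_top_k(query, vectors, k=5):
    """Exact search: score every vector, then take the top-k (O(n) per query)."""
    # Cosine similarity via normalized dot products
    v_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = v_norm @ q_norm
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return top, scores[top]

indices, scores = exact_top_k(query, vectors)
print(indices, scores)
```

At 10,000 vectors this is instant; at hundreds of millions, the full scan per query is what ANN indexes exist to avoid.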

HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm. It builds a multi-layered graph where:

  • The top layer has very few nodes (coarse navigation)
  • Each lower layer adds more nodes (finer navigation)
  • Search starts at the top and "walks" down to find the nearest neighbors
```
Layer 3 (coarsest):   A -------- D
                      |          |
Layer 2:              A --- C -- D --- F
                      |    |    |     |
Layer 1:              A-B--C--D-E--F--G
                      | |  |  | |  |  |
Layer 0 (finest):     A-B--C--D-E--F--G-H-I-J
```

Pros: Very fast queries, good recall (often 95%+)
Cons: High memory usage (stores the graph in RAM), slow index builds
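The "walk" on a single layer can be sketched as a greedy best-first search over a neighbor graph. This toy version (hypothetical 2-D points and a hand-built adjacency list, with no hierarchy, pruning, or candidate beam) illustrates the idea only and is not a real HNSW implementation:

```python
import numpy as np

# Toy data: six 2-D points and a hand-built neighbor graph (node id -> neighbor ids)
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
graph = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
         3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}

def greedy_search(query, entry=0):
    """Hop to whichever neighbor is closest to the query until none improves."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = np.linalg.norm(points[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # local minimum: no neighbor is closer
            return current, current_dist
        current, current_dist = best, best_dist

node, dist = greedy_search(np.array([1.9, 0.9]))
print(node, dist)  # converges to node 5 at (2.0, 1.0)
```

Real HNSW runs this greedy descent on the sparse upper layers to find a good entry point, then a beam search (`ef` candidates) on layer 0 for recall.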

IVF (Inverted File Index)

IVF partitions the vector space into clusters using k-means, then only searches the relevant clusters:

1. Indexing: Run k-means to create N centroids (cluster centers)
2. Querying: Find the closest centroids, then search only those clusters

```python
# Conceptual IVF (using FAISS)
import faiss
import numpy as np

dimension = 384
num_vectors = 100000
num_clusters = 256  # Number of Voronoi cells

# Create IVF index
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, num_clusters)

# Train on data (learns cluster centroids)
training_data = np.random.random((num_vectors, dimension)).astype("float32")
index.train(training_data)

# Add vectors
index.add(training_data)

# Search --- nprobe controls the accuracy/speed tradeoff
index.nprobe = 16  # Search 16 out of 256 clusters
query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, k=10)
print(f"Top 10 nearest neighbor indices: {indices[0]}")
```

Pros: Lower memory than HNSW, fast with tuned nprobe
Cons: Requires a training step, lower recall at low nprobe values

Choosing an ANN Algorithm

Use HNSW when you need the best query latency and can afford higher memory. Use IVF when working with very large datasets where memory is a concern. At very large scale the two are often combined: an HNSW graph built over the IVF centroids speeds up the coarse quantization step (in FAISS, via index factory strings such as "IVF65536_HNSW32,Flat").

Vector Database Comparison

ChromaDB --- Best for Prototyping & Local Development

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client (in-memory or persistent)
client = chromadb.Client()  # in-memory
# client = chromadb.PersistentClient(path="./chroma_db")  # persistent

# Use a sentence-transformer embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection
collection = client.create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "Python is great for data science",
        "JavaScript powers the modern web",
        "Rust ensures memory safety without a garbage collector",
        "Go is designed for concurrent programming",
    ],
    metadatas=[
        {"category": "data", "year": 2024},
        {"category": "web", "year": 2024},
        {"category": "systems", "year": 2024},
        {"category": "systems", "year": 2024},
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
)

# Query with automatic embedding
results = collection.query(
    query_texts=["Which language is best for machine learning?"],
    n_results=3,
    where={"category": "data"},  # Metadata filter
)

print(results["documents"][0])  # Retrieved documents
print(results["distances"][0])  # Distances (lower = more similar)
print(results["metadatas"][0])  # Associated metadata
```

Pinecone --- Best for Production / Managed Service

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create an index
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("my-index")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_1,  # 1536-dim list of floats
        "metadata": {"source": "wiki", "topic": "AI", "date": "2024-01-15"},
    },
    {
        "id": "doc2",
        "values": embedding_2,
        "metadata": {"source": "arxiv", "topic": "ML", "date": "2024-02-20"},
    },
])

# Query with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={
        "topic": {"$eq": "AI"},
        "date": {"$gte": "2024-01-01"},
    },
)

for match in results["matches"]:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}")
    print(f"Metadata: {match['metadata']}")
```

Weaviate --- Best for Hybrid Search (Vector + Keyword)

```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

# Create a collection with a vectorizer
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
)

# Add objects (auto-vectorized)
collection.data.insert({"content": "RAG improves LLM accuracy", "source": "paper"})

# Hybrid search (combines vector + BM25 keyword search)
results = collection.query.hybrid(
    query="retrieval augmented generation",
    alpha=0.75,  # 0 = pure keyword, 1 = pure vector
    limit=5,
    return_metadata=MetadataQuery(score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content']}")
```

pgvector --- Best for Existing PostgreSQL Users

```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    metadata JSONB,
    embedding vector(1536)  -- 1536-dimensional vector
);

-- Create an HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert a document
INSERT INTO documents (content, metadata, embedding)
VALUES (
    'RAG pipelines improve LLM accuracy',
    '{"source": "paper", "year": 2024}',
    '[0.1, 0.2, ...]'  -- 1536 floats
);

-- Semantic search with metadata filtering
SELECT content, metadata,
       1 - (embedding <=> '[0.15, 0.22, ...]') AS similarity
FROM documents
WHERE metadata->>'year' = '2024'
ORDER BY embedding <=> '[0.15, 0.22, ...]'
LIMIT 5;
```

Vector Database Comparison Summary

  • Chroma: Open-source, easy setup, great for prototyping; limited scalability.
  • Pinecone: Fully managed, scales to billions of vectors, pay-per-use; no self-hosting option.
  • Weaviate: Open-source, built-in hybrid search, GraphQL API; good for production.
  • pgvector: PostgreSQL extension, no new infrastructure needed, ACID transactions; great if you already use Postgres.
  • FAISS: Not a database (an in-memory library by Meta); fastest raw performance, no built-in metadata filtering.

Metadata Filtering

Metadata filtering lets you narrow vector search results using structured attributes. This is critical for production systems where you need to enforce access control, recency, or category constraints.

```python
# ChromaDB metadata filtering examples
collection.query(
    query_texts=["machine learning"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$eq": "research"}},
            {"year": {"$gte": 2023}},
            {"access_level": {"$in": ["public", "internal"]}},
        ]
    },
)

# Pinecone metadata filtering
index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "$and": [
            {"department": {"$eq": "engineering"}},
            {"date": {"$gte": "2024-01-01"}},
            {"status": {"$ne": "archived"}},
        ]
    },
)
```

Pre-filtering vs Post-filtering

  • Pre-filtering: Filter metadata BEFORE vector search (faster, but can reduce recall)
  • Post-filtering: Run vector search first, then filter results (better recall, but wastes compute)
  • Most modern vector databases use pre-filtering with optimized indices for best results
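The recall difference is easy to see with a brute-force toy example (made-up vectors and metadata): pre-filtering restricts the candidate set before scoring, while post-filtering scores everything and may return fewer than k matches once the filter is applied.

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((100, 8)).astype("float32")
categories = np.array(["public", "private"])[rng.integers(0, 2, 100)]
query = rng.random(8).astype("float32")

scores = vectors @ query  # dot-product similarity against every vector
k = 5

# Pre-filtering: score only vectors whose metadata matches, then take top-k
allowed = np.where(categories == "public")[0]
pre = allowed[np.argsort(-scores[allowed])[:k]]

# Post-filtering: take top-k overall, then drop non-matching results
top = np.argsort(-scores)[:k]
post = top[categories[top] == "public"]

print(f"pre-filter hits: {len(pre)}, post-filter hits: {len(post)}")
```

Pre-filtering fills all k slots as long as enough matching vectors exist; post-filtering can come back short, which is why databases that post-filter often over-fetch (e.g. request 3-5x k) before filtering.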