# Vector Databases & Semantic Search
Vector databases are purpose-built storage systems that index and query high-dimensional vectors (embeddings). They power semantic search, recommendation systems, and most critically, Retrieval-Augmented Generation (RAG) pipelines for LLM applications.
## Why Vector Databases?
Traditional databases search by exact keyword matches. Vector databases search by meaning. Instead of asking "which documents contain the word 'dog'?", you ask "which documents are about things similar to dogs?" --- and the database returns results about puppies, canines, pets, and veterinary care.
```python
# Traditional keyword search
query = "How do I fix a flat tire?"
# Only matches documents containing "fix", "flat", "tire"

# Semantic vector search
query_embedding = embed("How do I fix a flat tire?")
# Also matches: "Changing a punctured wheel", "Tire repair guide",
# "Steps to replace a blown tyre" --- same meaning, different words
```
## What Are Embeddings?

Embeddings are dense numeric vectors that encode the meaning of data. An embedding model maps text (or images, audio, and other modalities) to a point in high-dimensional space such that semantically similar inputs land close together: "puppy" and "dog" end up near each other, while "puppy" and "spreadsheet" are far apart. Vector databases store these points and answer "what is nearby?" queries efficiently.

## Generating Embeddings

The first step in any vector search pipeline is converting your data into embeddings using an embedding model.
### OpenAI Embeddings
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate an embedding for a single text string."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Single embedding
embedding = get_embedding("Vector databases are essential for RAG")
print(f"Dimensions: {len(embedding)}")  # 1536 for text-embedding-3-small

# Batch embeddings (more efficient: one API call for many texts)
texts = [
    "Python is a programming language",
    "JavaScript runs in the browser",
    "Rust focuses on memory safety",
]
response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"
)
embeddings = [item.embedding for item in response.data]
print(f"Generated {len(embeddings)} embeddings of dimension {len(embeddings[0])}")
```
### Open-Source Embeddings with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
import numpy as np
from numpy.linalg import norm

# Load a pre-trained embedding model (pick one)
model = SentenceTransformer("all-MiniLM-L6-v2")    # 384 dimensions, fast
# model = SentenceTransformer("all-mpnet-base-v2") # 768 dimensions, more accurate

# Generate embeddings
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Similar sentences should have high similarity
sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])
print(f"'weather' vs 'sunny': {sim_01:.4f}")   # ~0.80 (similar meaning)
print(f"'weather' vs 'stadium': {sim_02:.4f}") # ~0.15 (different topics)
```
### Popular Embedding Models

| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI API, accessed via the `openai` client |
| all-MiniLM-L6-v2 | 384 | Open source (Sentence Transformers); small and fast |
| all-mpnet-base-v2 | 768 | Open source (Sentence Transformers); more accurate, slower |
## Vector Similarity Metrics
Vector databases use distance/similarity metrics to find the closest vectors to a query:
```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# 1. Cosine Similarity --- most common for text embeddings
#    Measures the angle between vectors (ignores magnitude)
cosine_sim = np.dot(a, b) / (norm(a) * norm(b))
print(f"Cosine Similarity: {cosine_sim:.4f}")  # ~0.9989

# 2. Euclidean Distance (L2) --- good for normalized embeddings
euclidean_dist = norm(a - b)
print(f"Euclidean Distance: {euclidean_dist:.4f}")  # ~0.1732

# 3. Dot Product (Inner Product) --- fast, works when embeddings are normalized
dot_product = np.dot(a, b)
print(f"Dot Product: {dot_product:.4f}")  # ~14.0000
```
| Metric | Best For | Range | Notes |
|---|---|---|---|
| Cosine Similarity | Text embeddings | [-1, 1] | Invariant to vector magnitude |
| Euclidean (L2) | Normalized embeddings | [0, inf) | Lower = more similar |
| Dot Product | Pre-normalized vectors | (-inf, inf) | Fastest computation |
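These metrics coincide once vectors are L2-normalized: the dot product equals cosine similarity, and squared Euclidean distance reduces to `2 - 2*cos`. A quick NumPy check (a minimal sketch using the same two vectors as above):

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])

# L2-normalize both vectors to unit length
a_hat = a / norm(a)
b_hat = b / norm(b)

cos = np.dot(a_hat, b_hat)        # on unit vectors, dot product == cosine similarity
l2_sq = norm(a_hat - b_hat) ** 2  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(f"cosine:     {cos:.4f}")
print(f"2 - 2*cos:  {2 - 2 * cos:.4f}")
print(f"L2 squared: {l2_sq:.4f}")
```

This is why many databases recommend dot product on pre-normalized embeddings: it produces the same ranking as cosine similarity with less computation.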
## Approximate Nearest Neighbor (ANN) Algorithms
Exact nearest neighbor search is O(n) --- checking every vector. With millions of vectors, this is too slow. ANN algorithms trade a small amount of accuracy for massive speedups.
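To see what the exact O(n) baseline looks like, here is a minimal NumPy sketch (random stand-in data) that scores a query against every stored vector:

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 384)).astype("float32")
query = rng.normal(size=(384,)).astype("float32")

# Normalize so the dot product equals cosine similarity
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Exact search: score the query against all 10,000 vectors, take the top k
scores = vectors @ query           # one cosine similarity per stored vector
k = 10
top_k = np.argsort(-scores)[:k]    # indices of the k most similar vectors
print(top_k)
```

Every query touches all n vectors; ANN indexes exist precisely to avoid this full scan.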
### HNSW (Hierarchical Navigable Small World)
HNSW is the most popular ANN algorithm. It builds a multi-layered graph: upper layers contain few nodes with long-range links for coarse navigation, and each lower layer adds more nodes and shorter links. A query enters at the top layer, greedily walks toward the nearest node, then drops down a layer and repeats until it reaches layer 0, where the final nearest neighbors are collected:

```
Layer 3 (coarsest): A -------- D
                    |          |
Layer 2:            A --- C -- D --- F
                    |     |    |     |
Layer 1:            A-B---C----D-E---F--G
                    | |   |    | |   |  |
Layer 0 (finest):   A-B---C----D-E---F--G-H-I-J
```
**Pros:** Very fast queries, good recall (often 95%+)
**Cons:** High memory usage (stores the graph in RAM), slow index builds
### IVF (Inverted File Index)
IVF partitions the vector space into clusters using k-means, then only searches relevant clusters:
1. **Indexing:** Run k-means to create N centroids (cluster centers)
2. **Querying:** Find the closest centroids, then search only those clusters
```python
# Conceptual IVF (using FAISS)
import faiss
import numpy as np

dimension = 384
num_vectors = 100000
num_clusters = 256  # Number of Voronoi cells

# Create IVF index
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, num_clusters)

# Train on data (learns cluster centroids)
training_data = np.random.random((num_vectors, dimension)).astype("float32")
index.train(training_data)

# Add vectors
index.add(training_data)

# Search --- nprobe controls accuracy/speed tradeoff
index.nprobe = 16  # Search 16 out of 256 clusters
query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, k=10)
print(f"Top 10 nearest neighbor indices: {indices[0]}")
```
**Pros:** Lower memory than HNSW, fast with tuned nprobe
**Cons:** Requires training step, lower recall at low nprobe values
### Choosing an ANN Algorithm

- **HNSW**: the default choice when RAM allows; lowest query latency and high recall, with no training step.
- **IVF**: better for memory-constrained or very large collections; requires a training pass and tuning of `nprobe`.
- **Flat (exact) search**: for small collections, a brute-force scan is often fast enough and gives perfect recall.
## Vector Database Comparison
### ChromaDB --- Best for Prototyping & Local Development

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client (in-memory or persistent; pick one)
client = chromadb.Client()  # in-memory
# client = chromadb.PersistentClient(path="./chroma_db")  # persistent

# Use a sentence-transformer embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection
collection = client.create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "Python is great for data science",
        "JavaScript powers the modern web",
        "Rust ensures memory safety without a garbage collector",
        "Go is designed for concurrent programming",
    ],
    metadatas=[
        {"category": "data", "year": 2024},
        {"category": "web", "year": 2024},
        {"category": "systems", "year": 2024},
        {"category": "systems", "year": 2024},
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
)

# Query with automatic embedding
results = collection.query(
    query_texts=["Which language is best for machine learning?"],
    n_results=3,
    where={"category": "data"},  # Metadata filter
)
print(results["documents"][0])  # Retrieved documents
print(results["distances"][0])  # Distances (lower = more similar)
print(results["metadatas"][0])  # Associated metadata
```
### Pinecone --- Best for Production / Managed Service

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create an index
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("my-index")

# Upsert vectors with metadata (embedding_1 and embedding_2 are
# precomputed embeddings, e.g. from the OpenAI example above)
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_1,  # 1536-dim list of floats
        "metadata": {"source": "wiki", "topic": "AI", "date": "2024-01-15"},
    },
    {
        "id": "doc2",
        "values": embedding_2,
        "metadata": {"source": "arxiv", "topic": "ML", "date": "2024-02-20"},
    },
])

# Query with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={
        "topic": {"$eq": "AI"},
        "date": {"$gte": "2024-01-01"},
    },
)
for match in results["matches"]:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}")
    print(f"Metadata: {match['metadata']}")
```
### Weaviate --- Best for Hybrid Search (Vector + Keyword)

```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

# Create a collection with vectorizer
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
)

# Add objects (auto-vectorized)
collection.data.insert({"content": "RAG improves LLM accuracy", "source": "paper"})

# Hybrid search (combines vector + BM25 keyword search)
results = collection.query.hybrid(
    query="retrieval augmented generation",
    alpha=0.75,  # 0 = pure keyword, 1 = pure vector
    limit=5,
    return_metadata=MetadataQuery(score=True),
)
for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content']}")
```
### pgvector --- Best for Existing PostgreSQL Users

```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    metadata JSONB,
    embedding vector(1536)  -- 1536-dimensional vector
);

-- Create an HNSW index for fast ANN search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert a document
INSERT INTO documents (content, metadata, embedding)
VALUES ('RAG pipelines improve LLM accuracy',
        '{"source": "paper", "year": 2024}',
        '[0.1, 0.2, ...]');  -- 1536 floats

-- Semantic search with metadata filtering
SELECT content, metadata,
       1 - (embedding <=> '[0.15, 0.22, ...]') AS similarity
FROM documents
WHERE metadata->>'year' = '2024'
ORDER BY embedding <=> '[0.15, 0.22, ...]'
LIMIT 5;
```
## Vector Database Comparison Summary

| Database | Best For | Hosting | Notes |
|---|---|---|---|
| ChromaDB | Prototyping & local development | Embedded / local | Automatic embedding generation, simple API |
| Pinecone | Production workloads | Fully managed (serverless) | Rich metadata filtering, no infrastructure to operate |
| Weaviate | Hybrid search (vector + keyword) | Self-hosted or cloud | Combines vector search with BM25 |
| pgvector | Existing PostgreSQL users | PostgreSQL extension | Plain SQL, JSONB metadata, HNSW indexing |
## Metadata Filtering
Metadata filtering lets you narrow vector search results using structured attributes. This is critical for production systems where you need to enforce access control, recency, or category constraints.
```python
# ChromaDB metadata filtering examples
collection.query(
    query_texts=["machine learning"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$eq": "research"}},
            {"year": {"$gte": 2023}},
            {"access_level": {"$in": ["public", "internal"]}},
        ]
    },
)

# Pinecone metadata filtering
index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "$and": [
            {"department": {"$eq": "engineering"}},
            {"date": {"$gte": "2024-01-01"}},
            {"status": {"$ne": "archived"}},
        ]
    },
)
```