Content-Based Recommendations
Content-based filtering recommends items similar to those a user has liked before, based on item features. Unlike collaborative filtering (CF), it does not need other users' data -- it builds a profile of each user's preferences from the items they have interacted with.
How It Works
1. Represent items as feature vectors (genre, keywords, actors, description, etc.)
2. Build a user profile by aggregating the features of items the user liked
3. Recommend items whose feature vectors are most similar to the user profile
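These three steps can be sketched in a few lines of NumPy. This is a minimal illustration with made-up 3-dimensional feature vectors; the full worked example appears later in this section:

```python
import numpy as np

# Step 1: items as feature vectors (one row per item), e.g. multi-hot genres
items = np.array([[1, 0, 1],   # item 0
                  [1, 1, 0],   # item 1
                  [0, 1, 1]],  # item 2
                 dtype=float)

# Step 2: user profile = average of the liked items' feature vectors
liked = [0, 1]
profile = items[liked].mean(axis=0)

# Step 3: rank the remaining items by cosine similarity to the profile
candidates = [i for i in range(len(items)) if i not in liked]
sims = {i: items[i] @ profile / (np.linalg.norm(items[i]) * np.linalg.norm(profile))
        for i in candidates}
ranked = sorted(sims, key=sims.get, reverse=True)
print(ranked)  # -> [2]
```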
Content-Based vs. Collaborative Filtering
| Content-Based | Collaborative Filtering |
|---|---|
| No cold start for new items (features are known) | Cold start for new items |
| Can explain why (feature overlap) | Hard to explain |
| No need for other users' data | Needs community data |
| Limited to known features | Discovers unexpected items |
| Cannot leverage community trends | Serendipitous recommendations |
Feature Representations
For structured data (e.g., movies): genre, year, director, cast, runtime, language -- one-hot or multi-hot encoding.
For text data (e.g., product descriptions, articles): TF-IDF, word embeddings, or transformer embeddings.
```python
import numpy as np

# Movie catalog with features
movies = {
    0: {"title": "The Matrix", "genres": ["Action", "SciFi"], "year": 1999,
        "description": "A hacker discovers reality is a simulation"},
    1: {"title": "Inception", "genres": ["Action", "SciFi", "Thriller"], "year": 2010,
        "description": "A thief enters dreams to plant ideas"},
    2: {"title": "The Notebook", "genres": ["Romance", "Drama"], "year": 2004,
        "description": "A love story spanning decades"},
    3: {"title": "Interstellar", "genres": ["SciFi", "Drama"], "year": 2014,
        "description": "Astronauts travel through a wormhole to save humanity"},
    4: {"title": "Titanic", "genres": ["Romance", "Drama"], "year": 1997,
        "description": "A love story aboard a doomed ship"},
    5: {"title": "Blade Runner", "genres": ["SciFi", "Thriller"], "year": 1982,
        "description": "A detective hunts replicants in a dystopian future"},
}

# Build multi-hot genre encoding
all_genres = sorted(set(g for m in movies.values() for g in m["genres"]))
print(f"Genres: {all_genres}")

def encode_genres(movie, genre_list):
    return np.array([1 if g in movie["genres"] else 0 for g in genre_list], dtype=float)

# Encode all movies
item_features = np.array([encode_genres(movies[i], all_genres) for i in range(len(movies))])
print(f"\nItem feature matrix shape: {item_features.shape}")
for i, m in movies.items():
    print(f"  {m['title']}: {item_features[i].astype(int)}")
```

TF-IDF for Text-Based Profiles
When items have text descriptions, TF-IDF (Term Frequency-Inverse Document Frequency) creates feature vectors that capture the importance of words:
idf_t = log(N / df_t)

Words that are frequent in a document but rare overall (low document frequency df_t out of N total documents) get high TF-IDF scores, making them good discriminators.
Building a User Profile
Given the items a user has liked, the user profile is typically the weighted average of those item vectors:
user_profile = sum(w_i * item_vector_i) / sum(w_i)
Weights can be: equal, recency-based, or based on explicit ratings.
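As a sketch of recency-based weighting, an exponential decay over interaction order gives recent likes more influence on the profile (the decay rate 0.5 and the example vectors here are arbitrary choices):

```python
import numpy as np

# Items the user liked, oldest first, with their feature vectors
liked_vectors = np.array([[1, 0, 0, 1, 0],   # liked long ago
                          [0, 1, 0, 1, 0],
                          [0, 0, 1, 1, 1]],  # liked most recently
                         dtype=float)

# Exponential recency weights: the most recent interaction gets weight 1.0
decay = 0.5
n = len(liked_vectors)
weights = np.array([decay ** (n - 1 - i) for i in range(n)])  # [0.25, 0.5, 1.0]

# Weighted-average profile: recent likes dominate
profile = (weights[:, None] * liked_vectors).sum(axis=0) / weights.sum()
print(profile.round(3))
```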
Scoring and Ranking
To score a candidate item for a user, compute cosine similarity between the user profile vector and the item vector. Rank all unrated items by this score.
```python
import numpy as np
from collections import Counter
import re

def simple_tokenize(text):
    """Lowercase and split into words, removing punctuation."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def compute_tfidf(documents):
    """
    Compute TF-IDF matrix from a list of text documents.
    Returns: tfidf_matrix (n_docs x n_terms), vocabulary list
    """
    # Tokenize
    tokenized = [simple_tokenize(doc) for doc in documents]

    # Build vocabulary
    vocab = sorted(set(word for doc in tokenized for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n_docs = len(documents)
    n_terms = len(vocab)

    # Term frequency
    tf = np.zeros((n_docs, n_terms))
    for d, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        total = len(tokens)
        for word, count in counts.items():
            tf[d, word_to_idx[word]] = count / total

    # Document frequency
    df = np.zeros(n_terms)
    for d, tokens in enumerate(tokenized):
        for word in set(tokens):
            df[word_to_idx[word]] += 1

    # Smoothed IDF
    idf = np.log(n_docs / (df + 1)) + 1

    # TF-IDF, then L2-normalize each document vector
    tfidf = tf * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    norms[norms == 0] = 1
    tfidf = tfidf / norms

    return tfidf, vocab


# Movie descriptions
descriptions = [
    "A hacker discovers reality is a simulation controlled by machines",
    "A thief enters dreams to plant ideas in the subconscious mind",
    "A love story spanning decades of devotion and memory",
    "Astronauts travel through a wormhole to save humanity from extinction",
    "A love story aboard a doomed ship crossing the Atlantic",
    "A detective hunts replicants in a dystopian cyberpunk future",
]

tfidf_matrix, vocab = compute_tfidf(descriptions)
print(f"TF-IDF shape: {tfidf_matrix.shape} ({len(descriptions)} movies x {len(vocab)} terms)")

# Show top terms per movie
movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]
for i, title in enumerate(movie_titles):
    top_idx = np.argsort(-tfidf_matrix[i])[:5]
    top_terms = [(vocab[j], tfidf_matrix[i, j]) for j in top_idx if tfidf_matrix[i, j] > 0]
    print(f"  {title}: {[(t, round(s, 3)) for t, s in top_terms]}")
```

```python
import numpy as np

def content_based_recommend(user_ratings, item_features, top_n=3):
    """
    Content-based recommendation using item feature vectors.

    Args:
        user_ratings: dict of {item_id: rating}
        item_features: (n_items, n_features) matrix
        top_n: number of recommendations

    Returns:
        List of (item_id, score) tuples
    """
    n_items = item_features.shape[0]

    # Build user profile: weighted average of rated item features
    profile = np.zeros(item_features.shape[1])
    total_weight = 0
    for item_id, rating in user_ratings.items():
        profile += rating * item_features[item_id]
        total_weight += rating

    if total_weight > 0:
        profile /= total_weight

    # Score all unrated items by cosine similarity with the user profile
    scores = []
    for i in range(n_items):
        if i in user_ratings:
            continue
        dot = np.dot(profile, item_features[i])
        norm = np.linalg.norm(profile) * np.linalg.norm(item_features[i])
        sim = dot / norm if norm > 0 else 0
        scores.append((i, sim))

    scores.sort(key=lambda x: -x[1])
    return scores[:top_n]


# Genre-based features (from the previous example)
all_genres = ["Action", "Drama", "Romance", "SciFi", "Thriller"]
item_features = np.array([
    [1, 0, 0, 1, 0],  # Matrix: Action, SciFi
    [1, 0, 0, 1, 1],  # Inception: Action, SciFi, Thriller
    [0, 1, 1, 0, 0],  # Notebook: Drama, Romance
    [0, 1, 0, 1, 0],  # Interstellar: SciFi, Drama
    [0, 1, 1, 0, 0],  # Titanic: Drama, Romance
    [0, 0, 0, 1, 1],  # Blade Runner: SciFi, Thriller
], dtype=float)

movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]

# User who liked SciFi/Action
user_a = {0: 5, 1: 4}  # Liked Matrix (5) and Inception (4)
recs = content_based_recommend(user_a, item_features, top_n=4)
print("User A (likes SciFi/Action):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")

# User who liked Romance/Drama
user_b = {2: 5, 4: 4}  # Liked Notebook (5) and Titanic (4)
recs = content_based_recommend(user_b, item_features, top_n=4)
print("\nUser B (likes Romance/Drama):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")
```

Hybrid Systems
In practice, the best recommender systems combine multiple approaches. Hybrid methods address the limitations of any single technique.
Common Hybrid Strategies
1. Weighted Hybrid: Combine scores from CF and content-based with a weight parameter:
score = alpha * CF_score + (1 - alpha) * content_score
2. Switching Hybrid: Use content-based for new users/items (cold start), switch to CF once enough data is available
3. Feature Augmentation: Use content-based features as input to a CF model (e.g., add genre embeddings to matrix factorization)
4. Cascade: First use one method to generate candidates, then re-rank with another
5. Meta-Level: The output of one model becomes the input features for another
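A weighted hybrid (strategy 1) is straightforward to sketch once both models produce scores on a comparable scale. The CF and content scores below are made-up values for three candidate items, and the 0.7 weight is an arbitrary choice:

```python
def weighted_hybrid(cf_scores, content_scores, alpha=0.7):
    """Blend two score dicts: alpha weights the CF score, (1 - alpha) the content score."""
    items = set(cf_scores) | set(content_scores)
    return {i: alpha * cf_scores.get(i, 0.0) + (1 - alpha) * content_scores.get(i, 0.0)
            for i in items}

# Hypothetical scores for three candidate items
cf_scores = {0: 0.9, 1: 0.4, 2: 0.1}
content_scores = {0: 0.2, 1: 0.8, 2: 0.6}

blended = weighted_hybrid(cf_scores, content_scores, alpha=0.7)
ranking = sorted(blended, key=blended.get, reverse=True)  # -> [0, 1, 2]
print(blended, ranking)
```

Note that scores from the two models should be normalized to the same range before blending, or alpha loses its intended meaning.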
Knowledge-Based Recommendations
For domains where preferences are explicit and well-understood (e.g., buying a house, choosing a laptop):
Session-Based Recommendations
When user identity is unavailable (anonymous browsing):