Content-Based Recommendations

Feature engineering for items, TF-IDF profiles, hybrid systems, and knowledge-based approaches


Content-based filtering recommends items similar to what a user has liked before, based on item features. Unlike collaborative filtering (CF), it does not need other users' data -- it builds a profile of each user's preferences from the items they have interacted with.

How It Works

1. Represent items as feature vectors (genre, keywords, actors, description, etc.)
2. Build a user profile by aggregating the features of items the user liked
3. Recommend items whose feature vectors are most similar to the user profile
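These three steps can be sketched end to end on a toy catalog (the items and feature names below are illustrative, not part of the movie example used later):

```python
import numpy as np

# Step 1: represent items as binary feature vectors
# (columns might be genres such as Action, SciFi, Romance)
items = np.array([
    [1, 1, 0],  # item 0: Action, SciFi
    [0, 1, 0],  # item 1: SciFi
    [0, 0, 1],  # item 2: Romance
], dtype=float)

# Step 2: the user liked item 0, so the profile is the
# (trivial) average of one liked item's features
profile = items[[0]].mean(axis=0)

# Step 3: rank unseen items by cosine similarity to the profile
def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom > 0 else 0.0

scores = {i: cosine(profile, items[i]) for i in (1, 2)}
best = max(scores, key=scores.get)
print(best)  # 1 -- item 1 shares SciFi with the liked item
```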

Advantages Over CF

| Content-Based | Collaborative Filtering |
| --- | --- |
| No cold start for new items (features are known) | Cold start for new items |
| Can explain why (feature overlap) | Hard to explain |
| No need for other users' data | Needs community data |
| Limited to known features | Discovers unexpected items |
| Cannot leverage community trends | Serendipitous recommendations |

Feature Representations

For structured data (e.g., movies): genre, year, director, cast, runtime, language -- one-hot or multi-hot encoding.

For text data (e.g., product descriptions, articles): TF-IDF, word embeddings, or transformer embeddings.

```python
import numpy as np

# Movie catalog with features
movies = {
    0: {"title": "The Matrix", "genres": ["Action", "SciFi"], "year": 1999,
        "description": "A hacker discovers reality is a simulation"},
    1: {"title": "Inception", "genres": ["Action", "SciFi", "Thriller"], "year": 2010,
        "description": "A thief enters dreams to plant ideas"},
    2: {"title": "The Notebook", "genres": ["Romance", "Drama"], "year": 2004,
        "description": "A love story spanning decades"},
    3: {"title": "Interstellar", "genres": ["SciFi", "Drama"], "year": 2014,
        "description": "Astronauts travel through a wormhole to save humanity"},
    4: {"title": "Titanic", "genres": ["Romance", "Drama"], "year": 1997,
        "description": "A love story aboard a doomed ship"},
    5: {"title": "Blade Runner", "genres": ["SciFi", "Thriller"], "year": 1982,
        "description": "A detective hunts replicants in a dystopian future"},
}

# Build multi-hot genre encoding
all_genres = sorted(set(g for m in movies.values() for g in m["genres"]))
print(f"Genres: {all_genres}")

def encode_genres(movie, genre_list):
    return np.array([1.0 if g in movie["genres"] else 0.0 for g in genre_list])

# Encode all movies
item_features = np.array([encode_genres(movies[i], all_genres) for i in range(len(movies))])
print(f"\nItem feature matrix shape: {item_features.shape}")
for i, m in movies.items():
    print(f"  {m['title']}: {item_features[i].astype(int)}")
```

TF-IDF for Text-Based Profiles

When items have text descriptions, TF-IDF (Term Frequency-Inverse Document Frequency) creates feature vectors that capture the importance of words:

  • TF(t, d): How often term t appears in document d (normalized by document length)
  • IDF(t): How rare term t is across all documents: log(N / df_t)
  • TF-IDF(t, d) = TF(t, d) * IDF(t)

Words that are frequent in a document but rare overall get high TF-IDF scores, making them good discriminators.
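As a quick arithmetic check of the definitions above (using the plain, unsmoothed IDF), suppose a term appears once in a 10-word description and in 2 of 6 documents:

```python
import numpy as np

# TF: 1 occurrence in a 10-word document, length-normalized
tf = 1 / 10
# IDF: log(N / df_t) with N=6 documents, df_t=2
idf = np.log(6 / 2)
tfidf = tf * idf
print(round(tfidf, 4))  # 0.1099
```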

Building a User Profile

Given items a user has liked, the user profile is typically the **weighted average** of those item vectors:

    user_profile = sum(w_i * item_vector_i) / sum(w_i)
    

Weights can be equal, recency-based, or based on explicit ratings.
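As a small worked check of the formula, the profile for a user who rated two items 5 and 3 (the feature vectors here are illustrative):

```python
import numpy as np

# Two liked items with binary genre features (illustrative),
# weighted by the user's ratings of 5 and 3
item_a = np.array([1.0, 1.0, 0.0])   # e.g. Action, SciFi
item_b = np.array([1.0, 0.0, 1.0])   # e.g. Action, Thriller

profile = (5 * item_a + 3 * item_b) / (5 + 3)
print(profile)  # [1.    0.625 0.375]
```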

Scoring and Ranking

To score a candidate item for a user, compute cosine similarity between the user profile vector and the item vector. Rank all unrated items by this score.

```python
import numpy as np
from collections import Counter
import re

def simple_tokenize(text):
    """Lowercase and split into words, removing punctuation."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def compute_tfidf(documents):
    """
    Compute TF-IDF matrix from a list of text documents.
    Returns: tfidf_matrix (n_docs x n_terms), vocabulary list
    """
    # Tokenize
    tokenized = [simple_tokenize(doc) for doc in documents]

    # Build vocabulary
    vocab = sorted(set(word for doc in tokenized for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n_docs = len(documents)
    n_terms = len(vocab)

    # Term frequency
    tf = np.zeros((n_docs, n_terms))
    for d, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        total = len(tokens)
        for word, count in counts.items():
            tf[d, word_to_idx[word]] = count / total

    # Document frequency
    df = np.zeros(n_terms)
    for tokens in tokenized:
        for word in set(tokens):
            df[word_to_idx[word]] += 1

    # Smoothed IDF
    idf = np.log(n_docs / (df + 1)) + 1

    # TF-IDF, L2-normalized per document
    tfidf = tf * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    norms[norms == 0] = 1
    return tfidf / norms, vocab


# Movie descriptions
descriptions = [
    "A hacker discovers reality is a simulation controlled by machines",
    "A thief enters dreams to plant ideas in the subconscious mind",
    "A love story spanning decades of devotion and memory",
    "Astronauts travel through a wormhole to save humanity from extinction",
    "A love story aboard a doomed ship crossing the Atlantic",
    "A detective hunts replicants in a dystopian cyberpunk future",
]

tfidf_matrix, vocab = compute_tfidf(descriptions)
print(f"TF-IDF shape: {tfidf_matrix.shape} ({len(descriptions)} movies x {len(vocab)} terms)")

# Show top terms per movie
movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]
for i, title in enumerate(movie_titles):
    top_idx = np.argsort(-tfidf_matrix[i])[:5]
    top_terms = [(vocab[j], tfidf_matrix[i, j]) for j in top_idx if tfidf_matrix[i, j] > 0]
    print(f"  {title}: {[(t, round(s, 3)) for t, s in top_terms]}")
```
Putting profile-building and scoring together on the genre features:

```python
import numpy as np

def content_based_recommend(user_ratings, item_features, top_n=3):
    """
    Content-based recommendation using item feature vectors.

    Args:
        user_ratings: dict of {item_id: rating}
        item_features: (n_items, n_features) matrix
        top_n: number of recommendations

    Returns:
        List of (item_id, score) tuples
    """
    n_items = item_features.shape[0]

    # Build user profile: weighted average of rated item features
    profile = np.zeros(item_features.shape[1])
    total_weight = 0
    for item_id, rating in user_ratings.items():
        profile += rating * item_features[item_id]
        total_weight += rating

    if total_weight > 0:
        profile /= total_weight

    # Score all unrated items by cosine similarity with the user profile
    scores = []
    for i in range(n_items):
        if i in user_ratings:
            continue
        dot = np.dot(profile, item_features[i])
        norm = np.linalg.norm(profile) * np.linalg.norm(item_features[i])
        scores.append((i, dot / norm if norm > 0 else 0))

    scores.sort(key=lambda x: -x[1])
    return scores[:top_n]


# Genre-based features (from the previous example)
all_genres = ["Action", "Drama", "Romance", "SciFi", "Thriller"]
item_features = np.array([
    [1, 0, 0, 1, 0],  # Matrix: Action, SciFi
    [1, 0, 0, 1, 1],  # Inception: Action, SciFi, Thriller
    [0, 1, 1, 0, 0],  # Notebook: Drama, Romance
    [0, 1, 0, 1, 0],  # Interstellar: Drama, SciFi
    [0, 1, 1, 0, 0],  # Titanic: Drama, Romance
    [0, 0, 0, 1, 1],  # Blade Runner: SciFi, Thriller
], dtype=float)

movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]

# User who liked SciFi/Action
user_a = {0: 5, 1: 4}  # Liked Matrix (5) and Inception (4)
recs = content_based_recommend(user_a, item_features, top_n=4)
print("User A (likes SciFi/Action):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")

# User who liked Romance/Drama
user_b = {2: 5, 4: 4}  # Liked Notebook (5) and Titanic (4)
recs = content_based_recommend(user_b, item_features, top_n=4)
print("\nUser B (likes Romance/Drama):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")
```

Hybrid Systems

In practice, the best recommender systems combine multiple approaches. Hybrid methods address the limitations of any single technique.

Common Hybrid Strategies

1. Weighted Hybrid: Combine scores from CF and content-based with a weight parameter: score = alpha * CF_score + (1 - alpha) * content_score

2. Switching Hybrid: Use content-based for new users/items (cold start), then switch to CF once enough data is available

3. Feature Augmentation: Use content-based features as input to a CF model (e.g., add genre embeddings to matrix factorization)

4. Cascade: First use one method to generate candidates, then re-rank with another

5. Meta-Level: The output of one model becomes the input features for another
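The first two strategies are a few lines each; the sketch below assumes both input scores are already normalized to [0, 1], and the `min_ratings` threshold is an illustrative choice:

```python
def weighted_hybrid(cf_score, content_score, alpha=0.7):
    """Strategy 1: blend CF and content-based scores; alpha weights CF."""
    return alpha * cf_score + (1 - alpha) * content_score

def switching_hybrid(cf_score, content_score, n_ratings, min_ratings=5, alpha=0.7):
    """Strategy 2: fall back to content-based when CF lacks data."""
    if n_ratings < min_ratings:   # cold start: trust content features
        return content_score
    return weighted_hybrid(cf_score, content_score, alpha)

print(round(weighted_hybrid(0.8, 0.4), 2))       # 0.68
print(switching_hybrid(0.8, 0.4, n_ratings=2))   # 0.4 (cold start)
```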

Knowledge-Based Recommendations

For domains where preferences are explicit and well understood (e.g., buying a house, choosing a laptop):

  • Constraint-based: User specifies constraints (price < $500, RAM >= 16GB), and the system returns matching items
  • Case-based: User provides an example item, and the system finds similar items and allows iterative refinement
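A constraint-based recommender can be as simple as a filter over structured attributes; the catalog below is made up for illustration:

```python
# Hypothetical laptop catalog
laptops = [
    {"name": "A", "price": 450, "ram_gb": 8},
    {"name": "B", "price": 480, "ram_gb": 16},
    {"name": "C", "price": 700, "ram_gb": 16},
]

# User constraints: price < $500 and RAM >= 16GB
matches = [m for m in laptops if m["price"] < 500 and m["ram_gb"] >= 16]
print([m["name"] for m in matches])  # ['B']
```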

Session-Based Recommendations

When user identity is unavailable (anonymous browsing):

  • Recommend based on the current session (items viewed/clicked in this visit)
  • Models: Markov chains, recurrent neural networks (GRU4Rec), or attention-based models
  • Key challenge: limited context (only a few interactions per session)
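The simplest of the models above, a first-order Markov chain, just counts item-to-item transitions in past sessions; the session data here is made up:

```python
from collections import defaultdict, Counter

# Hypothetical past sessions (sequences of item ids)
sessions = [[1, 2, 3], [1, 2, 4], [2, 3], [1, 2, 3]]

# Count transitions item -> next item
transitions = defaultdict(Counter)
for s in sessions:
    for a, b in zip(s, s[1:]):
        transitions[a][b] += 1

# Recommend the most likely next item after the current click
current_item = 2
next_item = transitions[current_item].most_common(1)[0][0]
print(next_item)  # 3 (follows item 2 in three of four sessions)
```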

The Filter Bubble Problem

Content-based systems tend to recommend items very similar to what the user already likes, creating a "filter bubble" where the user never discovers new types of content. To combat this: (1) inject diversity by including items from underrepresented categories, (2) use exploration strategies like epsilon-greedy or Thompson sampling, (3) add a novelty bonus to recommendation scores, (4) combine with CF, which naturally surfaces serendipitous items from similar users' diverse tastes.
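Epsilon-greedy exploration (option 2) can be sketched in a few lines: with probability epsilon, swap a random out-of-list candidate into the last recommendation slot. The function and item names are illustrative:

```python
import random

def epsilon_greedy_rerank(ranked_items, candidate_pool, epsilon=0.1, rng=None):
    """With probability epsilon, replace the last slot with a random
    unranked candidate to break out of the filter bubble."""
    rng = rng or random.Random()
    recs = list(ranked_items)
    unranked = [c for c in candidate_pool if c not in recs]
    if unranked and rng.random() < epsilon:
        recs[-1] = rng.choice(unranked)
    return recs

ranked = ["scifi_1", "scifi_2", "scifi_3"]
pool = ranked + ["romance_1", "documentary_1"]
print(epsilon_greedy_rerank(ranked, pool, epsilon=1.0, rng=random.Random(0)))
```

In production, epsilon is usually small (e.g. 0.05-0.1) so exploration only occasionally displaces an exploit slot.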

Modern Embeddings vs TF-IDF

While TF-IDF is a solid baseline, modern systems increasingly use **learned embeddings** from pretrained transformers (e.g., Sentence-BERT) to represent item text. These embeddings capture semantic similarity ("astronaut" is close to "space") that TF-IDF misses. For images, use CNN embeddings (ResNet, CLIP). The rest of the content-based pipeline stays the same -- compute cosine similarity between the user profile and item embeddings.