Content-Based Recommendations

Feature engineering for items, TF-IDF profiles, hybrid systems, and knowledge-based approaches


Content-based filtering recommends items similar to what a user has liked before, based on item features. Unlike collaborative filtering (CF), it does not need other users' data -- it builds a profile of each user's preferences from the items they have interacted with.

How It Works

1. Represent items as feature vectors (genre, keywords, actors, description, etc.)
2. Build a user profile by aggregating the features of items the user liked
3. Recommend items whose feature vectors are most similar to the user profile
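These three steps can be sketched end to end on a toy catalog (the items and feature names below are illustrative, not part of the movie example used later):

```python
import numpy as np

# Step 1: represent items as binary feature vectors
# (columns might be genres such as Action, SciFi, Romance)
items = np.array([
    [1, 1, 0],  # item 0: Action, SciFi
    [0, 1, 0],  # item 1: SciFi
    [0, 0, 1],  # item 2: Romance
], dtype=float)

# Step 2: the user liked item 0, so the profile is the
# (trivial) average of one liked item's features
profile = items[[0]].mean(axis=0)

# Step 3: rank unseen items by cosine similarity to the profile
def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom > 0 else 0.0

scores = {i: cosine(profile, items[i]) for i in (1, 2)}
best = max(scores, key=scores.get)
print(best)  # 1 -- item 1 shares SciFi with the liked item
```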

Advantages Over CF

| Content-Based | Collaborative Filtering |
| --- | --- |
| No cold start for new items (features are known) | Cold start for new items |
| Can explain why (feature overlap) | Hard to explain |
| No need for other users' data | Needs community data |
| Limited to known features | Discovers unexpected items |
| Cannot leverage community trends | Serendipitous recommendations |

Feature Representations

For structured data (e.g., movies): genre, year, director, cast, runtime, language -- one-hot or multi-hot encoding.

For text data (e.g., product descriptions, articles): TF-IDF, word embeddings, or transformer embeddings.

```python
import numpy as np

# Movie catalog with features
movies = {
    0: {"title": "The Matrix", "genres": ["Action", "SciFi"], "year": 1999,
        "description": "A hacker discovers reality is a simulation"},
    1: {"title": "Inception", "genres": ["Action", "SciFi", "Thriller"], "year": 2010,
        "description": "A thief enters dreams to plant ideas"},
    2: {"title": "The Notebook", "genres": ["Romance", "Drama"], "year": 2004,
        "description": "A love story spanning decades"},
    3: {"title": "Interstellar", "genres": ["SciFi", "Drama"], "year": 2014,
        "description": "Astronauts travel through a wormhole to save humanity"},
    4: {"title": "Titanic", "genres": ["Romance", "Drama"], "year": 1997,
        "description": "A love story aboard a doomed ship"},
    5: {"title": "Blade Runner", "genres": ["SciFi", "Thriller"], "year": 1982,
        "description": "A detective hunts replicants in a dystopian future"},
}

# Build multi-hot genre encoding
all_genres = sorted(set(g for m in movies.values() for g in m["genres"]))
print(f"Genres: {all_genres}")

def encode_genres(movie, genre_list):
    return np.array([1.0 if g in movie["genres"] else 0.0 for g in genre_list])

# Encode all movies
item_features = np.array([encode_genres(movies[i], all_genres) for i in range(len(movies))])
print(f"\nItem feature matrix shape: {item_features.shape}")
for i, m in movies.items():
    print(f"  {m['title']}: {item_features[i].astype(int)}")
```

TF-IDF for Text-Based Profiles

When items have text descriptions, TF-IDF (Term Frequency-Inverse Document Frequency) creates feature vectors that capture the importance of words:

  • TF(t, d): How often term t appears in document d (normalized by document length)
  • IDF(t): How rare term t is across all documents: log(N / df_t)
  • TF-IDF(t, d) = TF(t, d) * IDF(t)

Words that are frequent in a document but rare overall get high TF-IDF scores, making them good discriminators.
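As a quick arithmetic check of the definitions above (using the plain, unsmoothed IDF), suppose a term appears once in a 10-word description and in 2 of 6 documents:

```python
import numpy as np

# TF: 1 occurrence in a 10-word document, length-normalized
tf = 1 / 10
# IDF: log(N / df_t) with N=6 documents, df_t=2
idf = np.log(6 / 2)
tfidf = tf * idf
print(round(tfidf, 4))  # 0.1099
```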

Building a User Profile

Given items a user has liked, the user profile is typically the **weighted average** of those item vectors:

    user_profile = sum(w_i * item_vector_i) / sum(w_i)
    

Weights can be equal, recency-based, or based on explicit ratings.
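As a small worked check of the formula, the profile for a user who rated two items 5 and 3 (the feature vectors here are illustrative):

```python
import numpy as np

# Two liked items with binary genre features (illustrative),
# weighted by the user's ratings of 5 and 3
item_a = np.array([1.0, 1.0, 0.0])   # e.g. Action, SciFi
item_b = np.array([1.0, 0.0, 1.0])   # e.g. Action, Thriller

profile = (5 * item_a + 3 * item_b) / (5 + 3)
print(profile)  # [1.    0.625 0.375]
```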

Scoring and Ranking

To score a candidate item for a user, compute cosine similarity between the user profile vector and the item vector. Rank all unrated items by this score.

```python
import numpy as np
from collections import Counter
import re

def simple_tokenize(text):
    """Lowercase and split into words, removing punctuation."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def compute_tfidf(documents):
    """
    Compute TF-IDF matrix from a list of text documents.
    Returns: tfidf_matrix (n_docs x n_terms), vocabulary list
    """
    # Tokenize
    tokenized = [simple_tokenize(doc) for doc in documents]

    # Build vocabulary
    vocab = sorted(set(word for doc in tokenized for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n_docs = len(documents)
    n_terms = len(vocab)

    # Term frequency
    tf = np.zeros((n_docs, n_terms))
    for d, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        total = len(tokens)
        for word, count in counts.items():
            tf[d, word_to_idx[word]] = count / total

    # Document frequency
    df = np.zeros(n_terms)
    for tokens in tokenized:
        for word in set(tokens):
            df[word_to_idx[word]] += 1

    # Smoothed IDF
    idf = np.log(n_docs / (df + 1)) + 1

    # TF-IDF, L2-normalized per document
    tfidf = tf * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    norms[norms == 0] = 1
    return tfidf / norms, vocab


# Movie descriptions
descriptions = [
    "A hacker discovers reality is a simulation controlled by machines",
    "A thief enters dreams to plant ideas in the subconscious mind",
    "A love story spanning decades of devotion and memory",
    "Astronauts travel through a wormhole to save humanity from extinction",
    "A love story aboard a doomed ship crossing the Atlantic",
    "A detective hunts replicants in a dystopian cyberpunk future",
]

tfidf_matrix, vocab = compute_tfidf(descriptions)
print(f"TF-IDF shape: {tfidf_matrix.shape} ({len(descriptions)} movies x {len(vocab)} terms)")

# Show top terms per movie
movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]
for i, title in enumerate(movie_titles):
    top_idx = np.argsort(-tfidf_matrix[i])[:5]
    top_terms = [(vocab[j], tfidf_matrix[i, j]) for j in top_idx if tfidf_matrix[i, j] > 0]
    print(f"  {title}: {[(t, round(s, 3)) for t, s in top_terms]}")
```
Putting profile-building and scoring together on the genre features:

```python
import numpy as np

def content_based_recommend(user_ratings, item_features, top_n=3):
    """
    Content-based recommendation using item feature vectors.

    Args:
        user_ratings: dict of {item_id: rating}
        item_features: (n_items, n_features) matrix
        top_n: number of recommendations

    Returns:
        List of (item_id, score) tuples
    """
    n_items = item_features.shape[0]

    # Build user profile: weighted average of rated item features
    profile = np.zeros(item_features.shape[1])
    total_weight = 0
    for item_id, rating in user_ratings.items():
        profile += rating * item_features[item_id]
        total_weight += rating

    if total_weight > 0:
        profile /= total_weight

    # Score all unrated items by cosine similarity with the user profile
    scores = []
    for i in range(n_items):
        if i in user_ratings:
            continue
        dot = np.dot(profile, item_features[i])
        norm = np.linalg.norm(profile) * np.linalg.norm(item_features[i])
        scores.append((i, dot / norm if norm > 0 else 0))

    scores.sort(key=lambda x: -x[1])
    return scores[:top_n]


# Genre-based features (from the previous example)
all_genres = ["Action", "Drama", "Romance", "SciFi", "Thriller"]
item_features = np.array([
    [1, 0, 0, 1, 0],  # Matrix: Action, SciFi
    [1, 0, 0, 1, 1],  # Inception: Action, SciFi, Thriller
    [0, 1, 1, 0, 0],  # Notebook: Drama, Romance
    [0, 1, 0, 1, 0],  # Interstellar: Drama, SciFi
    [0, 1, 1, 0, 0],  # Titanic: Drama, Romance
    [0, 0, 0, 1, 1],  # Blade Runner: SciFi, Thriller
], dtype=float)

movie_titles = ["Matrix", "Inception", "Notebook", "Interstellar", "Titanic", "Blade Runner"]

# User who liked SciFi/Action
user_a = {0: 5, 1: 4}  # Liked Matrix (5) and Inception (4)
recs = content_based_recommend(user_a, item_features, top_n=4)
print("User A (likes SciFi/Action):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")

# User who liked Romance/Drama
user_b = {2: 5, 4: 4}  # Liked Notebook (5) and Titanic (4)
recs = content_based_recommend(user_b, item_features, top_n=4)
print("\nUser B (likes Romance/Drama):")
for item_id, score in recs:
    print(f"  {movie_titles[item_id]}: score={score:.3f}")
```

Hybrid Systems

In practice, the best recommender systems combine multiple approaches. Hybrid methods address the limitations of any single technique.

Common Hybrid Strategies

1. Weighted Hybrid: Combine scores from CF and content-based with a weight parameter: score = alpha * CF_score + (1 - alpha) * content_score

2. Switching Hybrid: Use content-based for new users/items (cold start), then switch to CF once enough data is available

3. Feature Augmentation: Use content-based features as input to a CF model (e.g., add genre embeddings to matrix factorization)

4. Cascade: First use one method to generate candidates, then re-rank with another

5. Meta-Level: The output of one model becomes the input features for another
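The first two strategies are a few lines each; the sketch below assumes both input scores are already normalized to [0, 1], and the `min_ratings` threshold is an illustrative choice:

```python
def weighted_hybrid(cf_score, content_score, alpha=0.7):
    """Strategy 1: blend CF and content-based scores; alpha weights CF."""
    return alpha * cf_score + (1 - alpha) * content_score

def switching_hybrid(cf_score, content_score, n_ratings, min_ratings=5, alpha=0.7):
    """Strategy 2: fall back to content-based when CF lacks data."""
    if n_ratings < min_ratings:   # cold start: trust content features
        return content_score
    return weighted_hybrid(cf_score, content_score, alpha)

print(round(weighted_hybrid(0.8, 0.4), 2))       # 0.68
print(switching_hybrid(0.8, 0.4, n_ratings=2))   # 0.4 (cold start)
```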

Knowledge-Based Recommendations

For domains where preferences are explicit and well understood (e.g., buying a house, choosing a laptop):

  • Constraint-based: User specifies constraints (price < $500, RAM >= 16GB), and the system returns matching items
  • Case-based: User provides an example item, and the system finds similar items and allows iterative refinement
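A constraint-based recommender can be as simple as a filter over structured attributes; the catalog below is made up for illustration:

```python
# Hypothetical laptop catalog
laptops = [
    {"name": "A", "price": 450, "ram_gb": 8},
    {"name": "B", "price": 480, "ram_gb": 16},
    {"name": "C", "price": 700, "ram_gb": 16},
]

# User constraints: price < $500 and RAM >= 16GB
matches = [m for m in laptops if m["price"] < 500 and m["ram_gb"] >= 16]
print([m["name"] for m in matches])  # ['B']
```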

Session-Based Recommendations

When user identity is unavailable (anonymous browsing):

  • Recommend based on the current session (items viewed/clicked in this visit)
  • Models: Markov chains, recurrent neural networks (GRU4Rec), or attention-based models
  • Key challenge: limited context (only a few interactions per session)
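The simplest of the models above, a first-order Markov chain, just counts item-to-item transitions in past sessions; the session data here is made up:

```python
from collections import defaultdict, Counter

# Hypothetical past sessions (sequences of item ids)
sessions = [[1, 2, 3], [1, 2, 4], [2, 3], [1, 2, 3]]

# Count transitions item -> next item
transitions = defaultdict(Counter)
for s in sessions:
    for a, b in zip(s, s[1:]):
        transitions[a][b] += 1

# Recommend the most likely next item after the current click
current_item = 2
next_item = transitions[current_item].most_common(1)[0][0]
print(next_item)  # 3 (follows item 2 in three of four sessions)
```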

The Filter Bubble Problem

Content-based systems tend to recommend items very similar to what the user already likes, creating a "filter bubble" where the user never discovers new types of content. To combat this: (1) inject diversity by including items from underrepresented categories, (2) use exploration strategies like epsilon-greedy or Thompson sampling, (3) add a novelty bonus to recommendation scores, (4) combine with CF, which naturally surfaces serendipitous items from similar users' diverse tastes.
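Epsilon-greedy exploration (option 2) can be sketched in a few lines: with probability epsilon, swap a random out-of-list candidate into the last recommendation slot. The function and item names are illustrative:

```python
import random

def epsilon_greedy_rerank(ranked_items, candidate_pool, epsilon=0.1, rng=None):
    """With probability epsilon, replace the last slot with a random
    unranked candidate to break out of the filter bubble."""
    rng = rng or random.Random()
    recs = list(ranked_items)
    unranked = [c for c in candidate_pool if c not in recs]
    if unranked and rng.random() < epsilon:
        recs[-1] = rng.choice(unranked)
    return recs

ranked = ["scifi_1", "scifi_2", "scifi_3"]
pool = ranked + ["romance_1", "documentary_1"]
print(epsilon_greedy_rerank(ranked, pool, epsilon=1.0, rng=random.Random(0)))
```

In production, epsilon is usually small (e.g. 0.05-0.1) so exploration only occasionally displaces an exploit slot.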

Modern Embeddings vs TF-IDF

While TF-IDF is a solid baseline, modern systems increasingly use **learned embeddings** from pretrained transformers (e.g., Sentence-BERT) to represent item text. These embeddings capture semantic similarity ("astronaut" is close to "space") that TF-IDF misses. For images, use CNN embeddings (ResNet, CLIP). The rest of the content-based pipeline stays the same -- compute cosine similarity between the user profile and item embeddings.