Collaborative Filtering

User-based and item-based CF, similarity metrics, matrix factorization, and handling cold start

~45 min

Collaborative filtering (CF) is the most widely used recommendation technique. It predicts a user's interests by collecting preferences from many users -- the core assumption is that if users agreed in the past, they will agree in the future.

Two Flavors of CF

1. User-Based CF: Find users similar to the target user, then recommend items those similar users liked
2. Item-Based CF: Find items similar to items the target user liked, then recommend those similar items

Explicit vs Implicit Feedback

| Explicit Feedback | Implicit Feedback |
| --- | --- |
| Star ratings, thumbs up/down, reviews | Clicks, views, purchases, time spent |
| Clear signal of preference | Must infer preference from behavior |
| Sparse (users rarely rate) | Dense (every interaction is data) |
| May contain bias (rating inflation) | No negative signal (absence != dislike) |
Most real-world systems rely heavily on implicit feedback because it is far more abundant.

```python
import numpy as np

# Example user-item rating matrix (0 = unrated)
# Rows = users, Columns = items (movies)
ratings = np.array([
    [5, 3, 0, 1, 4],  # User 0
    [4, 0, 0, 1, 2],  # User 1
    [1, 1, 0, 5, 4],  # User 2
    [0, 0, 5, 4, 0],  # User 3
    [0, 3, 4, 0, 3],  # User 4
])

movie_names = ["Action_1", "Comedy_1", "Drama_1", "SciFi_1", "Comedy_2"]
n_users, n_items = ratings.shape

print(f"Rating matrix: {n_users} users x {n_items} items")
print(f"Total possible ratings: {n_users * n_items}")
print(f"Observed ratings: {np.count_nonzero(ratings)}")
print(f"Sparsity: {1 - np.count_nonzero(ratings) / (n_users * n_items):.1%}")
```

Similarity Metrics

The heart of collaborative filtering is measuring how similar two users (or items) are. The two most common metrics:

Cosine Similarity

Measures the cosine of the angle between two vectors. Ranges from -1 to 1 (or 0 to 1 for non-negative ratings):

sim(a, b) = (a . b) / (||a|| * ||b||)

  • Ignores magnitude differences (a user who rates everything 5 vs one who rates 3 can still be similar)
  • Works well with sparse data

Pearson Correlation

Centers the vectors by subtracting each user's mean rating before computing cosine similarity:

sim(a, b) = sum((a_i - mean_a)(b_i - mean_b)) / (sqrt(sum((a_i - mean_a)^2)) * sqrt(sum((b_i - mean_b)^2)))

  • Accounts for different rating scales (generous vs strict raters)
  • More robust for explicit ratings

Adjusted Cosine (for Item-Based)

Subtracts the user mean from each rating before computing item similarity. This accounts for the fact that different users have different rating baselines.

```python
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors, considering only co-rated items."""
    # Only consider items both users have rated
    mask = (a > 0) & (b > 0)
    if mask.sum() == 0:
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    dot = np.dot(a_filtered, b_filtered)
    norm = np.linalg.norm(a_filtered) * np.linalg.norm(b_filtered)
    return dot / norm if norm > 0 else 0.0


def pearson_correlation(a, b):
    """Compute Pearson correlation between two users on co-rated items."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:  # Need at least 2 co-rated items
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    a_centered = a_filtered - a_filtered.mean()
    b_centered = b_filtered - b_filtered.mean()
    num = np.dot(a_centered, b_centered)
    den = np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    return num / den if den > 0 else 0.0


# Compute user-user similarity matrix
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])
n_users = ratings.shape[0]

print("=== Cosine Similarity ===")
cos_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        cos_sim[i, j] = cosine_similarity(ratings[i], ratings[j])
print(np.round(cos_sim, 3))

print("\n=== Pearson Correlation ===")
pear_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        pear_sim[i, j] = pearson_correlation(ratings[i], ratings[j])
print(np.round(pear_sim, 3))
```
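The adjusted cosine variant is not covered by the block above; here is a minimal sketch (the function name `adjusted_cosine` is illustrative). Note that it compares item columns rather than user rows, centering each rating by the rating user's mean:

```python
import numpy as np

def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine similarity between two item columns.

    Subtracts each user's mean rating (over that user's rated items)
    before comparing, so generous and strict raters are put on the
    same scale.
    """
    # Per-user means over rated items only (0.0 if a user rated nothing)
    user_means = np.array([
        row[row > 0].mean() if (row > 0).any() else 0.0
        for row in ratings
    ])
    # Users who rated both items
    mask = (ratings[:, item_i] > 0) & (ratings[:, item_j] > 0)
    if mask.sum() == 0:
        return 0.0
    a = ratings[mask, item_i] - user_means[mask]
    b = ratings[mask, item_j] - user_means[mask]
    den = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / den) if den > 0 else 0.0

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

print(adjusted_cosine(ratings, 0, 4))  # Action_1 vs Comedy_2
```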

User-Based CF: Predicting Ratings

Once we have a similarity matrix, we predict ratings using a weighted average of similar users' ratings:

pred(u, i) = mean_u + sum(sim(u, v) * (r_vi - mean_v)) / sum(|sim(u, v)|)

where the sum is over neighbors v who have rated item i.

Steps:

1. Compute similarity between target user and all other users
2. Select top-K most similar users (neighbors) who have rated the target item
3. Compute weighted average of their ratings, adjusting for each user's mean

Item-Based CF

Instead of finding similar users, we find similar items. Item-based CF is often preferred in production because:

  • Item similarities are more stable over time (items don't change; user tastes do)
  • They can be precomputed offline since the item catalog changes slowly
  • It scales better when there are more users than items

pred(u, i) = sum(sim(i, j) * r_uj) / sum(|sim(i, j)|)

where the sum is over items j that user u has rated and that are similar to item i.
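A minimal sketch of item-based prediction under this formula, building an illustrative item-item cosine matrix from co-rating users (function and variable names are ours, not from a library):

```python
import numpy as np

def predict_item_based(ratings, item_sim, user, target_item, k=2):
    """Predict a rating from the user's own ratings of similar items."""
    rated = np.where(ratings[user] > 0)[0]      # items this user rated
    rated = rated[rated != target_item]
    if len(rated) == 0:
        return 0.0
    sims = item_sim[target_item, rated]
    top = rated[np.argsort(-np.abs(sims))[:k]]  # k most similar rated items
    num = sum(item_sim[target_item, j] * ratings[user, j] for j in top)
    den = sum(abs(item_sim[target_item, j]) for j in top)
    return num / den if den > 0 else 0.0

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# Item-item cosine similarity over users who co-rated each pair
n_items = ratings.shape[1]
item_sim = np.zeros((n_items, n_items))
for i in range(n_items):
    for j in range(n_items):
        mask = (ratings[:, i] > 0) & (ratings[:, j] > 0)
        if mask.any():
            a, b = ratings[mask, i], ratings[mask, j]
            d = np.linalg.norm(a) * np.linalg.norm(b)
            item_sim[i, j] = np.dot(a, b) / d if d > 0 else 0.0

print(f"User 1 on Drama_1: {predict_item_based(ratings, item_sim, 1, 2):.2f}")
```

On this toy matrix very few users co-rate each item pair, so the similarities are coarse; with real data the item-item matrix would typically be computed once offline and reused.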

```python
import numpy as np

def predict_user_based(ratings, user_sim, target_user, target_item, k=3):
    """
    Predict rating for target_user on target_item using user-based CF.

    Uses top-k most similar users who have rated the target item.
    """
    # User means (only over rated items)
    user_means = np.array([
        ratings[u][ratings[u] > 0].mean() if (ratings[u] > 0).any() else 0
        for u in range(ratings.shape[0])
    ])

    target_mean = user_means[target_user]

    # Find users who rated this item
    rated_mask = ratings[:, target_item] > 0
    rated_mask[target_user] = False  # Exclude the target user

    candidates = np.where(rated_mask)[0]
    if len(candidates) == 0:
        return target_mean  # Fallback to user mean

    # Get similarities and select top-k
    sims = user_sim[target_user, candidates]
    top_k_idx = np.argsort(-np.abs(sims))[:k]
    neighbors = candidates[top_k_idx]

    # Weighted average
    numerator = 0.0
    denominator = 0.0
    for v in neighbors:
        sim = user_sim[target_user, v]
        numerator += sim * (ratings[v, target_item] - user_means[v])
        denominator += abs(sim)

    if denominator == 0:
        return target_mean

    return target_mean + numerator / denominator


# Using our previous ratings and similarity
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# Compute Pearson similarity
n_users = ratings.shape[0]
user_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        if i == j:
            user_sim[i, j] = 1.0
        else:
            mask = (ratings[i] > 0) & (ratings[j] > 0)
            if mask.sum() >= 2:
                a = ratings[i][mask] - ratings[i][ratings[i] > 0].mean()
                b = ratings[j][mask] - ratings[j][ratings[j] > 0].mean()
                d = np.linalg.norm(a) * np.linalg.norm(b)
                user_sim[i, j] = np.dot(a, b) / d if d > 0 else 0

# Predict: What would User 1 rate Movie 2 (Drama_1)?
pred = predict_user_based(ratings, user_sim, target_user=1, target_item=2, k=3)
print(f"Predicted rating for User 1 on Drama_1: {pred:.2f}")

# Predict all missing ratings for User 1
print("\nUser 1 predicted ratings:")
for item in range(ratings.shape[1]):
    if ratings[1, item] == 0:
        p = predict_user_based(ratings, user_sim, 1, item, k=3)
        print(f"  Item {item}: {p:.2f}")
    else:
        print(f"  Item {item}: {ratings[1, item]:.1f} (actual)")
```

Matrix Factorization (SVD & ALS)

Memory-based CF (user-based/item-based) does not scale well to millions of users and items. Matrix factorization offers a scalable alternative by learning latent factors that explain the observed ratings.

The idea: decompose the user-item rating matrix R into two lower-rank matrices:

R ≈ U * V^T

where:

  • U is a (n_users x k) matrix of user latent factors
  • V is a (n_items x k) matrix of item latent factors
  • k is the number of latent dimensions (typically 20-200)

SVD (Singular Value Decomposition)

Classic linear algebra approach. For a complete matrix: R = U * Σ * V^T

But rating matrices are sparse (mostly zeros), so we use truncated SVD or regularized SVD that fits only the observed entries.
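As a rough illustration (not the regularized variant), plain truncated SVD after imputing missing entries with each user's mean, a common but crude baseline:

```python
import numpy as np

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

# Fill unobserved entries with each user's mean (crude imputation;
# assumes every user has rated at least one item)
filled = ratings.copy()
for u in range(filled.shape[0]):
    row = filled[u]
    row[row == 0] = row[row > 0].mean()

# Truncated SVD: keep only the top-k singular values
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))
```

Regularized approaches (like the SGD and ALS variants below) avoid the imputation step entirely by fitting only the observed entries.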

ALS (Alternating Least Squares)

1. Initialize U and V randomly
2. Fix V, solve for U (least squares problem)
3. Fix U, solve for V (least squares problem)
4. Repeat until convergence

ALS is popular because:

  • Each step has a closed-form solution (fast)
  • It is easily parallelizable (each user/item can be updated independently)
  • It works well with implicit feedback (weighted ALS)
The example below uses stochastic gradient descent (SGD), another common way to fit the factors, updating them one observed rating at a time:

```python
import numpy as np

def matrix_factorization_sgd(R, k=2, lr=0.01, reg=0.02, epochs=100):
    """
    Learn latent factors using Stochastic Gradient Descent.

    Minimizes: sum over observed (u,i) of (R_ui - U_u . V_i)^2
               + reg * (||U||^2 + ||V||^2)
    """
    n_users, n_items = R.shape
    # Initialize latent factors
    U = np.random.normal(0, 0.1, (n_users, k))
    V = np.random.normal(0, 0.1, (n_items, k))
    # Bias terms
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    mu = R[R > 0].mean()  # Global mean

    # Get observed entries
    observed = list(zip(*np.where(R > 0)))

    losses = []
    for epoch in range(epochs):
        np.random.shuffle(observed)
        total_loss = 0

        for u, i in observed:
            # Prediction
            pred = mu + bu[u] + bi[i] + np.dot(U[u], V[i])
            error = R[u, i] - pred

            # Update biases
            bu[u] += lr * (error - reg * bu[u])
            bi[i] += lr * (error - reg * bi[i])

            # Update latent factors
            U_old = U[u].copy()
            U[u] += lr * (error * V[i] - reg * U[u])
            V[i] += lr * (error * U_old - reg * V[i])

            total_loss += error ** 2 + reg * (
                np.sum(U[u] ** 2) + np.sum(V[i] ** 2) +
                bu[u] ** 2 + bi[i] ** 2
            )

        losses.append(total_loss / len(observed))
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}: loss = {losses[-1]:.4f}")

    return U, V, bu, bi, mu


# Example
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V, bu, bi, mu = matrix_factorization_sgd(R, k=3, lr=0.01, reg=0.02, epochs=100)

# Predict all ratings
R_pred = mu + bu[:, None] + bi[None, :] + U @ V.T
print("\nOriginal matrix:")
print(R.astype(int))
print("\nPredicted matrix (rounded):")
print(np.round(R_pred, 1))
print("\nPredicted missing entries:")
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if R[u, i] == 0:
            print(f"  User {u}, Item {i}: {R_pred[u, i]:.2f}")
```
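The ALS procedure described earlier can be sketched as follows (unweighted, fitting only observed entries, with ridge regularization; the `reg`, `k`, and iteration-count values are illustrative choices):

```python
import numpy as np

def als(R, k=3, reg=0.1, iters=20):
    """Alternate closed-form least-squares solves for U and V."""
    n_users, n_items = R.shape
    mask = R > 0
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (n_users, k))
    V = rng.normal(0, 0.1, (n_items, k))
    I = reg * np.eye(k)  # ridge term keeps each solve well-posed
    for _ in range(iters):
        # Fix V, solve for each user's factors
        for u in range(n_users):
            j = mask[u]                        # items user u rated
            if j.any():
                A = V[j].T @ V[j] + I
                b = V[j].T @ R[u, j]
                U[u] = np.linalg.solve(A, b)
        # Fix U, solve for each item's factors
        for i in range(n_items):
            rated_by = mask[:, i]              # users who rated item i
            if rated_by.any():
                A = U[rated_by].T @ U[rated_by] + I
                b = U[rated_by].T @ R[rated_by, i]
                V[i] = np.linalg.solve(A, b)
    return U, V

R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V = als(R)
print(np.round(U @ V.T, 1))
```

Each inner solve is an independent k x k linear system, which is what makes ALS so easy to parallelize across users and items.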

The Cold Start Problem

CF struggles when there is insufficient interaction data. **New user cold start**: a user with no history has no basis for similarity computation. **New item cold start**: an item with no ratings cannot be recommended.

Solutions include:

1. Asking new users for initial preferences
2. Using content-based features as a fallback
3. Popularity-based defaults
4. Bandits for exploration

Hybrid systems that combine CF with content-based methods are the most robust solution.
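A minimal sketch of fallback logic for new users, assuming a hypothetical `predict_fn` (any CF predictor, such as the user-based one above) for users with history, and a popularity default otherwise:

```python
import numpy as np

def recommend(user_id, ratings, predict_fn, top_n=2):
    """Use CF when the user has history; popularity fallback otherwise."""
    n_items = ratings.shape[1]
    has_history = user_id < ratings.shape[0] and (ratings[user_id] > 0).any()
    if has_history:
        unseen = [i for i in range(n_items) if ratings[user_id, i] == 0]
        scores = {i: predict_fn(user_id, i) for i in unseen}
    else:
        # Cold start: rank by mean rating among users who rated the item
        scores = {
            i: ratings[:, i][ratings[:, i] > 0].mean()
            for i in range(n_items) if (ratings[:, i] > 0).any()
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# New user (id beyond the matrix) falls back to popularity
print(recommend(99, ratings, predict_fn=lambda u, i: 0.0))  # → [2, 0]
```

In a production hybrid, the fallback branch would typically use content-based features rather than raw popularity, but the dispatch structure is the same.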

Implicit Feedback: Weighted Matrix Factorization

For implicit feedback (clicks, views), we treat all interactions as positive signals with varying **confidence** levels. The classic approach (Hu et al., 2008) defines:

preference: p_ui = 1 if r_ui > 0, else 0
confidence: c_ui = 1 + alpha * r_ui

The optimization then minimizes:

sum(c_ui * (p_ui - U_u . V_i)^2) + regularization

This handles the asymmetry: we are more confident that a user likes an item they interacted with many times than one they never saw.
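Constructing the preference and confidence matrices from raw counts might look like this (the interaction counts and the `alpha` value are illustrative; alpha is a tunable hyperparameter):

```python
import numpy as np

# Raw interaction counts (e.g. number of plays), not ratings
counts = np.array([
    [3, 0, 1, 0],
    [0, 5, 0, 2],
    [1, 1, 0, 0],
])

alpha = 40.0
P = (counts > 0).astype(float)  # binary preference: did the user interact?
C = 1.0 + alpha * counts        # confidence grows with interaction count

# Every (u, i) pair now enters the loss, weighted by confidence:
#   loss = sum(C * (P - U @ V.T)**2) + regularization
print(P)
print(C)
```

Note that unobserved pairs get p_ui = 0 with low confidence c_ui = 1, so they still contribute to the loss as weak negatives; this is what weighted ALS exploits to train on "missing" data without treating absence as dislike.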