Collaborative Filtering

User-based and item-based CF, similarity metrics, matrix factorization, and handling cold start

~45 min

Collaborative filtering (CF) is the most widely used recommendation technique. It predicts a user's interests by collecting preferences from many users -- the core assumption is that if users agreed in the past, they will agree in the future.

Two Flavors of CF

1. User-Based CF: Find users similar to the target user, then recommend items those similar users liked
2. Item-Based CF: Find items similar to items the target user liked, then recommend those similar items

Explicit vs Implicit Feedback

| Explicit Feedback | Implicit Feedback |
| --- | --- |
| Star ratings, thumbs up/down, reviews | Clicks, views, purchases, time spent |
| Clear signal of preference | Must infer preference from behavior |
| Sparse (users rarely rate) | Dense (every interaction is data) |
| May contain bias (rating inflation) | No negative signal (absence != dislike) |
Most real-world systems rely heavily on implicit feedback because it is far more abundant.

```python
import numpy as np

# Example user-item rating matrix (0 = unrated)
# Rows = users, Columns = items (movies)
ratings = np.array([
    [5, 3, 0, 1, 4],  # User 0
    [4, 0, 0, 1, 2],  # User 1
    [1, 1, 0, 5, 4],  # User 2
    [0, 0, 5, 4, 0],  # User 3
    [0, 3, 4, 0, 3],  # User 4
])

movie_names = ["Action_1", "Comedy_1", "Drama_1", "SciFi_1", "Comedy_2"]
n_users, n_items = ratings.shape

print(f"Rating matrix: {n_users} users x {n_items} items")
print(f"Total possible ratings: {n_users * n_items}")
print(f"Observed ratings: {np.count_nonzero(ratings)}")
print(f"Sparsity: {1 - np.count_nonzero(ratings) / (n_users * n_items):.1%}")
```

Similarity Metrics

The heart of collaborative filtering is measuring how similar two users (or items) are. The two most common metrics:

Cosine Similarity

Measures the cosine of the angle between two vectors. Ranges from -1 to 1 (or 0 to 1 for non-negative ratings):

sim(a, b) = (a . b) / (||a|| * ||b||)

  • Ignores magnitude differences (a user who rates everything 5 vs one who rates 3 can still be similar)
  • Works well with sparse data

Pearson Correlation

Centers the vectors by subtracting each user's mean rating before computing cosine similarity:

sim(a, b) = sum((a_i - mean_a)(b_i - mean_b)) / (sqrt(sum((a_i - mean_a)^2)) * sqrt(sum((b_i - mean_b)^2)))

  • Accounts for different rating scales (generous vs strict raters)
  • More robust for explicit ratings

Adjusted Cosine (for Item-Based)

Subtracts the user mean from each rating before computing item similarity. This accounts for the fact that different users have different rating baselines.

```python
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors, considering only co-rated items."""
    # Only consider items both users have rated
    mask = (a > 0) & (b > 0)
    if mask.sum() == 0:
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    dot = np.dot(a_filtered, b_filtered)
    norm = np.linalg.norm(a_filtered) * np.linalg.norm(b_filtered)
    return dot / norm if norm > 0 else 0.0


def pearson_correlation(a, b):
    """Compute Pearson correlation between two users on co-rated items."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:  # Need at least 2 co-rated items
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    a_centered = a_filtered - a_filtered.mean()
    b_centered = b_filtered - b_filtered.mean()
    num = np.dot(a_centered, b_centered)
    den = np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    return num / den if den > 0 else 0.0


# Compute user-user similarity matrix
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])
n_users = ratings.shape[0]

print("=== Cosine Similarity ===")
cos_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        cos_sim[i, j] = cosine_similarity(ratings[i], ratings[j])
print(np.round(cos_sim, 3))

print("\n=== Pearson Correlation ===")
pear_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        pear_sim[i, j] = pearson_correlation(ratings[i], ratings[j])
print(np.round(pear_sim, 3))
```
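The adjusted cosine variant is not covered by the block above; here is a minimal sketch (the function name `adjusted_cosine` is illustrative). Note that it compares item columns rather than user rows, centering each rating by the rating user's mean:

```python
import numpy as np

def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine similarity between two item columns.

    Subtracts each user's mean rating (over that user's rated items)
    before comparing, so generous and strict raters are put on the
    same scale.
    """
    # Per-user means over rated items only (0.0 if a user rated nothing)
    user_means = np.array([
        row[row > 0].mean() if (row > 0).any() else 0.0
        for row in ratings
    ])
    # Users who rated both items
    mask = (ratings[:, item_i] > 0) & (ratings[:, item_j] > 0)
    if mask.sum() == 0:
        return 0.0
    a = ratings[mask, item_i] - user_means[mask]
    b = ratings[mask, item_j] - user_means[mask]
    den = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / den) if den > 0 else 0.0

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

print(adjusted_cosine(ratings, 0, 4))  # Action_1 vs Comedy_2
```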

User-Based CF: Predicting Ratings

Once we have a similarity matrix, we predict ratings using a weighted average of similar users' ratings:

pred(u, i) = mean_u + sum(sim(u, v) * (r_vi - mean_v)) / sum(|sim(u, v)|)

where the sum is over neighbors v who have rated item i.

Steps:

1. Compute similarity between target user and all other users
2. Select top-K most similar users (neighbors) who have rated the target item
3. Compute weighted average of their ratings, adjusting for each user's mean

Item-Based CF

Instead of finding similar users, we find similar items. Item-based CF is often preferred in production because:

  • Item similarities are more stable over time (items don't change; user tastes do)
  • They can be precomputed offline since the item catalog changes slowly
  • It scales better when there are more users than items

pred(u, i) = sum(sim(i, j) * r_uj) / sum(|sim(i, j)|)

where the sum is over items j that user u has rated and that are similar to item i.
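A minimal sketch of item-based prediction under this formula, building an illustrative item-item cosine matrix from co-rating users (function and variable names are ours, not from a library):

```python
import numpy as np

def predict_item_based(ratings, item_sim, user, target_item, k=2):
    """Predict a rating from the user's own ratings of similar items."""
    rated = np.where(ratings[user] > 0)[0]      # items this user rated
    rated = rated[rated != target_item]
    if len(rated) == 0:
        return 0.0
    sims = item_sim[target_item, rated]
    top = rated[np.argsort(-np.abs(sims))[:k]]  # k most similar rated items
    num = sum(item_sim[target_item, j] * ratings[user, j] for j in top)
    den = sum(abs(item_sim[target_item, j]) for j in top)
    return num / den if den > 0 else 0.0

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# Item-item cosine similarity over users who co-rated each pair
n_items = ratings.shape[1]
item_sim = np.zeros((n_items, n_items))
for i in range(n_items):
    for j in range(n_items):
        mask = (ratings[:, i] > 0) & (ratings[:, j] > 0)
        if mask.any():
            a, b = ratings[mask, i], ratings[mask, j]
            d = np.linalg.norm(a) * np.linalg.norm(b)
            item_sim[i, j] = np.dot(a, b) / d if d > 0 else 0.0

print(f"User 1 on Drama_1: {predict_item_based(ratings, item_sim, 1, 2):.2f}")
```

On this toy matrix very few users co-rate each item pair, so the similarities are coarse; with real data the item-item matrix would typically be computed once offline and reused.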

```python
import numpy as np

def predict_user_based(ratings, user_sim, target_user, target_item, k=3):
    """
    Predict rating for target_user on target_item using user-based CF.

    Uses top-k most similar users who have rated the target item.
    """
    # User means (only over rated items)
    user_means = np.array([
        ratings[u][ratings[u] > 0].mean() if (ratings[u] > 0).any() else 0
        for u in range(ratings.shape[0])
    ])

    target_mean = user_means[target_user]

    # Find users who rated this item
    rated_mask = ratings[:, target_item] > 0
    rated_mask[target_user] = False  # Exclude the target user

    candidates = np.where(rated_mask)[0]
    if len(candidates) == 0:
        return target_mean  # Fallback to user mean

    # Get similarities and select top-k
    sims = user_sim[target_user, candidates]
    top_k_idx = np.argsort(-np.abs(sims))[:k]
    neighbors = candidates[top_k_idx]

    # Weighted average
    numerator = 0.0
    denominator = 0.0
    for v in neighbors:
        sim = user_sim[target_user, v]
        numerator += sim * (ratings[v, target_item] - user_means[v])
        denominator += abs(sim)

    if denominator == 0:
        return target_mean

    return target_mean + numerator / denominator


# Using our previous ratings and similarity
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# Compute Pearson similarity
n_users = ratings.shape[0]
user_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        if i == j:
            user_sim[i, j] = 1.0
        else:
            mask = (ratings[i] > 0) & (ratings[j] > 0)
            if mask.sum() >= 2:
                a = ratings[i][mask] - ratings[i][ratings[i] > 0].mean()
                b = ratings[j][mask] - ratings[j][ratings[j] > 0].mean()
                d = np.linalg.norm(a) * np.linalg.norm(b)
                user_sim[i, j] = np.dot(a, b) / d if d > 0 else 0

# Predict: What would User 1 rate Movie 2 (Drama_1)?
pred = predict_user_based(ratings, user_sim, target_user=1, target_item=2, k=3)
print(f"Predicted rating for User 1 on Drama_1: {pred:.2f}")

# Predict all missing ratings for User 1
print("\nUser 1 predicted ratings:")
for item in range(ratings.shape[1]):
    if ratings[1, item] == 0:
        p = predict_user_based(ratings, user_sim, 1, item, k=3)
        print(f"  Item {item}: {p:.2f}")
    else:
        print(f"  Item {item}: {ratings[1, item]:.1f} (actual)")
```

Matrix Factorization (SVD & ALS)

Memory-based CF (user-based/item-based) does not scale well to millions of users and items. Matrix factorization offers a scalable alternative by learning latent factors that explain the observed ratings.

The idea: decompose the user-item rating matrix R into two lower-rank matrices:

R ≈ U * V^T

where:

  • U is a (n_users x k) matrix of user latent factors
  • V is a (n_items x k) matrix of item latent factors
  • k is the number of latent dimensions (typically 20-200)

SVD (Singular Value Decomposition)

Classic linear algebra approach. For a complete matrix: R = U * Σ * V^T

But rating matrices are sparse (mostly zeros), so we use truncated SVD or regularized SVD that fits only the observed entries.
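As a rough illustration (not the regularized variant), plain truncated SVD after imputing missing entries with each user's mean, a common but crude baseline:

```python
import numpy as np

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

# Fill unobserved entries with each user's mean (crude imputation;
# assumes every user has rated at least one item)
filled = ratings.copy()
for u in range(filled.shape[0]):
    row = filled[u]
    row[row == 0] = row[row > 0].mean()

# Truncated SVD: keep only the top-k singular values
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))
```

Regularized approaches (like the SGD and ALS variants below) avoid the imputation step entirely by fitting only the observed entries.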

ALS (Alternating Least Squares)

1. Initialize U and V randomly
2. Fix V, solve for U (least squares problem)
3. Fix U, solve for V (least squares problem)
4. Repeat until convergence

ALS is popular because:

  • Each step has a closed-form solution (fast)
  • It is easily parallelizable (each user/item can be updated independently)
  • It works well with implicit feedback (weighted ALS)
The example below uses stochastic gradient descent (SGD), another common way to fit the factors, updating them one observed rating at a time:

```python
import numpy as np

def matrix_factorization_sgd(R, k=2, lr=0.01, reg=0.02, epochs=100):
    """
    Learn latent factors using Stochastic Gradient Descent.

    Minimizes: sum over observed (u,i) of (R_ui - U_u . V_i)^2
               + reg * (||U||^2 + ||V||^2)
    """
    n_users, n_items = R.shape
    # Initialize latent factors
    U = np.random.normal(0, 0.1, (n_users, k))
    V = np.random.normal(0, 0.1, (n_items, k))
    # Bias terms
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    mu = R[R > 0].mean()  # Global mean

    # Get observed entries
    observed = list(zip(*np.where(R > 0)))

    losses = []
    for epoch in range(epochs):
        np.random.shuffle(observed)
        total_loss = 0

        for u, i in observed:
            # Prediction
            pred = mu + bu[u] + bi[i] + np.dot(U[u], V[i])
            error = R[u, i] - pred

            # Update biases
            bu[u] += lr * (error - reg * bu[u])
            bi[i] += lr * (error - reg * bi[i])

            # Update latent factors
            U_old = U[u].copy()
            U[u] += lr * (error * V[i] - reg * U[u])
            V[i] += lr * (error * U_old - reg * V[i])

            total_loss += error ** 2 + reg * (
                np.sum(U[u] ** 2) + np.sum(V[i] ** 2) +
                bu[u] ** 2 + bi[i] ** 2
            )

        losses.append(total_loss / len(observed))
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}: loss = {losses[-1]:.4f}")

    return U, V, bu, bi, mu


# Example
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V, bu, bi, mu = matrix_factorization_sgd(R, k=3, lr=0.01, reg=0.02, epochs=100)

# Predict all ratings
R_pred = mu + bu[:, None] + bi[None, :] + U @ V.T
print("\nOriginal matrix:")
print(R.astype(int))
print("\nPredicted matrix (rounded):")
print(np.round(R_pred, 1))
print("\nPredicted missing entries:")
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if R[u, i] == 0:
            print(f"  User {u}, Item {i}: {R_pred[u, i]:.2f}")
```
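The ALS procedure described earlier can be sketched as follows (unweighted, fitting only observed entries, with ridge regularization; the `reg`, `k`, and iteration-count values are illustrative choices):

```python
import numpy as np

def als(R, k=3, reg=0.1, iters=20):
    """Alternate closed-form least-squares solves for U and V."""
    n_users, n_items = R.shape
    mask = R > 0
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (n_users, k))
    V = rng.normal(0, 0.1, (n_items, k))
    I = reg * np.eye(k)  # ridge term keeps each solve well-posed
    for _ in range(iters):
        # Fix V, solve for each user's factors
        for u in range(n_users):
            j = mask[u]                        # items user u rated
            if j.any():
                A = V[j].T @ V[j] + I
                b = V[j].T @ R[u, j]
                U[u] = np.linalg.solve(A, b)
        # Fix U, solve for each item's factors
        for i in range(n_items):
            rated_by = mask[:, i]              # users who rated item i
            if rated_by.any():
                A = U[rated_by].T @ U[rated_by] + I
                b = U[rated_by].T @ R[rated_by, i]
                V[i] = np.linalg.solve(A, b)
    return U, V

R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V = als(R)
print(np.round(U @ V.T, 1))
```

Each inner solve is an independent k x k linear system, which is what makes ALS so easy to parallelize across users and items.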

The Cold Start Problem

CF struggles when there is insufficient interaction data. **New user cold start**: a user with no history has no basis for similarity computation. **New item cold start**: an item with no ratings cannot be recommended.

Solutions include:

1. Asking new users for initial preferences
2. Using content-based features as a fallback
3. Popularity-based defaults
4. Bandits for exploration

Hybrid systems that combine CF with content-based methods are the most robust solution.
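A minimal sketch of fallback logic for new users, assuming a hypothetical `predict_fn` (any CF predictor, such as the user-based one above) for users with history, and a popularity default otherwise:

```python
import numpy as np

def recommend(user_id, ratings, predict_fn, top_n=2):
    """Use CF when the user has history; popularity fallback otherwise."""
    n_items = ratings.shape[1]
    has_history = user_id < ratings.shape[0] and (ratings[user_id] > 0).any()
    if has_history:
        unseen = [i for i in range(n_items) if ratings[user_id, i] == 0]
        scores = {i: predict_fn(user_id, i) for i in unseen}
    else:
        # Cold start: rank by mean rating among users who rated the item
        scores = {
            i: ratings[:, i][ratings[:, i] > 0].mean()
            for i in range(n_items) if (ratings[:, i] > 0).any()
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# New user (id beyond the matrix) falls back to popularity
print(recommend(99, ratings, predict_fn=lambda u, i: 0.0))  # → [2, 0]
```

In a production hybrid, the fallback branch would typically use content-based features rather than raw popularity, but the dispatch structure is the same.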

Implicit Feedback: Weighted Matrix Factorization

For implicit feedback (clicks, views), we treat all interactions as positive signals with varying **confidence** levels. The classic approach (Hu et al., 2008) defines:

preference: p_ui = 1 if r_ui > 0, else 0
confidence: c_ui = 1 + alpha * r_ui

The optimization then minimizes:

sum(c_ui * (p_ui - U_u . V_i)^2) + regularization

This handles the asymmetry: we are more confident that a user likes an item they interacted with many times than one they never saw.
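Constructing the preference and confidence matrices from raw counts might look like this (the interaction counts and the `alpha` value are illustrative; alpha is a tunable hyperparameter):

```python
import numpy as np

# Raw interaction counts (e.g. number of plays), not ratings
counts = np.array([
    [3, 0, 1, 0],
    [0, 5, 0, 2],
    [1, 1, 0, 0],
])

alpha = 40.0
P = (counts > 0).astype(float)  # binary preference: did the user interact?
C = 1.0 + alpha * counts        # confidence grows with interaction count

# Every (u, i) pair now enters the loss, weighted by confidence:
#   loss = sum(C * (P - U @ V.T)**2) + regularization
print(P)
print(C)
```

Note that unobserved pairs get p_ui = 0 with low confidence c_ui = 1, so they still contribute to the loss as weak negatives; this is what weighted ALS exploits to train on "missing" data without treating absence as dislike.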