Deep Learning for Recommender Systems
Traditional collaborative filtering (CF) methods such as SVD and ALS learn linear interactions between user and item latent factors. Deep learning allows us to model nonlinear relationships, incorporate diverse features (text, images, sequences), and build end-to-end systems.
Why Deep Learning for RecSys?
1. Nonlinear interactions: A neural network can learn complex preference patterns that a dot product cannot capture
2. Feature fusion: Easily combine heterogeneous inputs (ID embeddings + text + images + context)
3. Sequential modeling: RNNs and transformers capture the order of user interactions
4. Representation learning: Embeddings are learned end-to-end, optimized for the recommendation task
The Deep RecSys Landscape
| Model | Key Idea | Use Case |
|---|---|---|
| Neural CF (NCF) | Replace dot product with MLP | General rating/click prediction |
| Two-Tower | Separate user & item encoders | Large-scale retrieval |
| Deep FM | Combine factorization with DNN | Feature-rich CTR prediction |
| GRU4Rec | RNN over session sequences | Session-based recommendation |
| SASRec | Self-attention over sequences | Sequential recommendation |
| DLRM | Embedding + interaction + MLP | Industrial-scale systems |
Neural Collaborative Filtering (NCF)
NCF (He et al., 2017) replaces the dot product of matrix factorization with a neural network that can learn arbitrary interaction functions.
Architecture
1. Embedding layers: Map user IDs and item IDs to dense vectors
2. Interaction layer: Concatenate (or element-wise multiply) user and item embeddings
3. MLP layers: Multiple fully connected layers with nonlinear activations
4. Output layer: Sigmoid for implicit feedback (click/no-click), linear for explicit ratings
GMF + MLP = NeuMF
The full NCF model (NeuMF) combines two pathways:

```
GMF:    p_gmf = user_emb_gmf * item_emb_gmf        (element-wise product)
MLP:    p_mlp = MLP(concat(user_emb_mlp, item_emb_mlp))
Output: sigmoid(W * concat(p_gmf, p_mlp) + b)
```

The outputs of both pathways are concatenated and fed to a final prediction layer.
```python
import numpy as np

class NeuralCF:
    """
    Neural Collaborative Filtering (simplified NumPy implementation).
    Combines GMF and MLP pathways.
    """

    def __init__(self, n_users, n_items, emb_dim=8, mlp_dims=[16, 8], lr=0.01):
        self.n_users = n_users
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.lr = lr

        # GMF embeddings
        self.user_emb_gmf = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_gmf = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP embeddings
        self.user_emb_mlp = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_mlp = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP layers
        self.mlp_layers = []
        input_dim = emb_dim * 2
        for dim in mlp_dims:
            W = np.random.normal(0, 0.01, (input_dim, dim))
            b = np.zeros(dim)
            self.mlp_layers.append((W, b))
            input_dim = dim

        # Final output layer: concat GMF + MLP output -> scalar
        self.output_W = np.random.normal(0, 0.01, (emb_dim + mlp_dims[-1], 1))
        self.output_b = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def forward(self, user_id, item_id):
        """Forward pass for a single (user, item) pair."""
        # GMF pathway
        gmf_user = self.user_emb_gmf[user_id]
        gmf_item = self.item_emb_gmf[item_id]
        gmf_out = gmf_user * gmf_item  # Element-wise product

        # MLP pathway
        mlp_input = np.concatenate([
            self.user_emb_mlp[user_id],
            self.item_emb_mlp[item_id]
        ])
        h = mlp_input
        for W, b in self.mlp_layers:
            h = self.relu(h @ W + b)

        # Combine GMF and MLP
        combined = np.concatenate([gmf_out, h])
        logit = combined @ self.output_W + self.output_b
        return self.sigmoid(logit[0])

    def predict_all(self, user_id):
        """Predict scores for all items."""
        scores = np.array([self.forward(user_id, i) for i in range(self.n_items)])
        return scores


# Demo
n_users, n_items = 100, 50
model = NeuralCF(n_users, n_items, emb_dim=8, mlp_dims=[16, 8])

# Before training: random predictions
user_id = 0
scores = model.predict_all(user_id)
print(f"NCF model: {n_users} users, {n_items} items")
print(f"User {user_id} top-5 items (untrained): {np.argsort(-scores)[:5]}")
print(f"Score range: [{scores.min():.4f}, {scores.max():.4f}]")
print("\nParameters:")
print(f"  GMF embeddings: {model.user_emb_gmf.size + model.item_emb_gmf.size:,}")
print(f"  MLP embeddings: {model.user_emb_mlp.size + model.item_emb_mlp.size:,}")
print(f"  MLP weights: {sum(W.size + b.size for W, b in model.mlp_layers):,}")
print(f"  Output layer: {model.output_W.size + model.output_b.size}")
```

Two-Tower Models
Two-tower (or dual encoder) models are the workhorse of large-scale retrieval. They independently encode users and items into embeddings, enabling efficient nearest-neighbor search at serving time.
Architecture
```
User Tower:                    Item Tower:

  user_id                        item_id
  + user features                + item features
        |                              |
   [DNN layers]                   [DNN layers]
        |                              |
  user_embedding                 item_embedding
            \                       /
             cosine similarity
                     |
                   score
```
Key Advantages
1. Decoupled encoding: Item embeddings can be precomputed and indexed offline
2. Approximate nearest neighbor (ANN): At serving time, find top-K items using efficient ANN search (FAISS, ScaNN, HNSW) instead of scoring all items
3. Scalable: Handles millions of items -- only need to compute the user embedding at query time
Training
Trained with contrastive loss or sampled softmax:
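The model code below only defines the two encoders, not the objective. As a sketch of the training signal, here is a minimal NumPy version of an in-batch sampled-softmax (contrastive) loss; the function name and temperature value are illustrative, and the embeddings are assumed to be L2-normalized, as the towers below produce.

```python
import numpy as np

def in_batch_softmax_loss(user_embs, item_embs, temperature=0.05):
    """In-batch sampled softmax: each user's positive item is the
    matching row of item_embs; the other items in the batch serve
    as negatives. Both inputs: (batch, dim), L2-normalized."""
    logits = user_embs @ item_embs.T / temperature     # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds log P(positive item | user)
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 positive (user, item) pairs with 8-dim embeddings
rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8)); u /= np.linalg.norm(u, axis=1, keepdims=True)
v = rng.normal(size=(4, 8)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"in-batch softmax loss: {in_batch_softmax_loss(u, v):.4f}")
```

Minimizing this loss pulls each user embedding toward its positive item and pushes it away from the other items in the batch, which is what makes the inner-product retrieval geometry work.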
```python
import numpy as np

class TwoTowerModel:
    """
    Simplified two-tower model for recommendation.
    User and item towers are separate MLPs producing embeddings.
    """

    def __init__(self, n_users, n_items, n_user_features, n_item_features,
                 emb_dim=16, hidden_dim=32):
        self.emb_dim = emb_dim

        # User tower
        self.user_emb = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.user_W1 = np.random.normal(0, 0.01, (emb_dim + n_user_features, hidden_dim))
        self.user_b1 = np.zeros(hidden_dim)
        self.user_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.user_b2 = np.zeros(emb_dim)

        # Item tower
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))
        self.item_W1 = np.random.normal(0, 0.01, (emb_dim + n_item_features, hidden_dim))
        self.item_b1 = np.zeros(hidden_dim)
        self.item_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.item_b2 = np.zeros(emb_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def l2_normalize(self, x):
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    def encode_user(self, user_id, user_features):
        """Encode user into embedding space."""
        x = np.concatenate([self.user_emb[user_id], user_features])
        h = self.relu(x @ self.user_W1 + self.user_b1)
        emb = h @ self.user_W2 + self.user_b2
        return self.l2_normalize(emb)

    def encode_item(self, item_id, item_features):
        """Encode item into embedding space."""
        x = np.concatenate([self.item_emb[item_id], item_features])
        h = self.relu(x @ self.item_W1 + self.item_b1)
        emb = h @ self.item_W2 + self.item_b2
        return self.l2_normalize(emb)

    def score(self, user_emb, item_emb):
        """Cosine similarity between user and item embeddings."""
        return np.dot(user_emb, item_emb)

    def retrieve_top_k(self, user_id, user_features, item_ids, item_features_list, k=5):
        """Retrieve top-k items for a user."""
        user_emb = self.encode_user(user_id, user_features)

        # In production, item embeddings are precomputed and stored in an ANN index
        scores = []
        for item_id, item_feat in zip(item_ids, item_features_list):
            item_emb = self.encode_item(item_id, item_feat)
            s = self.score(user_emb, item_emb)
            scores.append((item_id, s))

        scores.sort(key=lambda x: -x[1])
        return scores[:k]


# Demo
n_users, n_items = 1000, 5000
n_user_feat, n_item_feat = 5, 10

model = TwoTowerModel(n_users, n_items, n_user_feat, n_item_feat,
                      emb_dim=16, hidden_dim=32)

# Simulate retrieval
user_feat = np.random.randn(n_user_feat)
item_ids = list(range(20))  # Score 20 candidate items
item_feats = [np.random.randn(n_item_feat) for _ in item_ids]

top_items = model.retrieve_top_k(0, user_feat, item_ids, item_feats, k=5)
print("Two-Tower Retrieval (untrained):")
for item_id, score in top_items:
    print(f"  Item {item_id}: score={score:.4f}")

print(f"\nModel scale: {n_users:,} users, {n_items:,} items")
print(f"User embedding: {model.encode_user(0, user_feat).shape}")
print(f"Item embedding: {model.encode_item(0, item_feats[0]).shape}")
```

Deep Factorization Machines (DeepFM)
DeepFM combines the strengths of:

1. Factorization Machines (FM): efficiently model low-order (pairwise) feature interactions
2. Deep neural networks (DNN): capture higher-order, nonlinear feature interactions
This is especially powerful for click-through rate (CTR) prediction where you have many sparse categorical features (user demographics, item attributes, context).
Architecture
```
Input: [user_id, item_id, gender, city, device, time_of_day, ...]
                        |
                 Embedding Layer
                /               \
     FM Component                DNN Component
     (pairwise                   (higher-order
      interactions)               interactions)
                \               /
                Addition/Concat
                        |
                     Sigmoid
                        |
               Click probability
```
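The two components in the diagram above can be sketched end to end. This is a minimal NumPy sketch, not DeepFM's actual implementation: field values, layer sizes, and variable names are illustrative, and the FM pairwise term uses the standard identity sum_{i<j} <v_i, v_j> = 0.5 * ((sum_i v_i)^2 - sum_i v_i^2) to avoid an explicit double loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 categorical fields, one active ID per field,
# all drawing embeddings from a shared table.
n_fields, vocab_size, emb_dim = 4, 50, 8
field_values = np.array([3, 17, 42, 8])

V = rng.normal(0, 0.01, (vocab_size, emb_dim))   # shared embedding table
w = rng.normal(0, 0.01, vocab_size)              # first-order FM weights
w0 = 0.0                                         # global bias
W1 = rng.normal(0, 0.01, (n_fields * emb_dim, 16))  # DNN hidden layer
b1 = np.zeros(16)
W2 = rng.normal(0, 0.01, (16, 1))                # DNN output layer

embs = V[field_values]                            # (n_fields, emb_dim)

# FM component: bias + first-order + pairwise interactions
sum_sq = embs.sum(axis=0) ** 2                    # (sum_i v_i)^2, per dim
sq_sum = (embs ** 2).sum(axis=0)                  # sum_i v_i^2, per dim
y_fm = w0 + w[field_values].sum() + 0.5 * (sum_sq - sq_sum).sum()

# DNN component: concatenated embeddings through dense layers
h = np.maximum(0, embs.reshape(-1) @ W1 + b1)
y_dnn = (h @ W2).item()

# Combine both components and squash to a click probability
p_click = 1 / (1 + np.exp(-(y_fm + y_dnn)))
print(f"P(click) = {p_click:.4f}")
```

Note that both components share the same embedding table, which is the point of DeepFM: the FM and DNN parts regularize each other through the shared representation.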
FM Component
Captures all pairwise interactions between features:

```
y_FM = w0 + sum_i(w_i * x_i) + sum_i sum_{j>i}(<v_i, v_j> * x_i * x_j)
```

DNN Component
Concatenates all embeddings and passes them through multiple dense layers to capture complex feature interactions that are hard to express as pairwise products.

Sequential Recommendation: GRU4Rec
For session-based or sequential recommendation, the order of user interactions matters. GRU4Rec (Hidasi et al., 2016) models the sequence of items a user interacted with using a Gated Recurrent Unit (GRU).
How It Works
1. Each item is embedded into a dense vector
2. The sequence of item embeddings is fed to a GRU
3. The GRU's hidden state captures the evolving user intent
4. The final hidden state is used to score candidate items
Attention-Based Alternative: SASRec
SASRec (Self-Attentive Sequential Recommendation) uses a transformer-style self-attention mechanism instead of an RNN, allowing it to attend to relevant past items regardless of their position in the sequence.
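The attention step at the heart of SASRec can be sketched as a single causal self-attention layer over the session's item embeddings. This is a minimal single-head NumPy sketch under simplifying assumptions: weights and dimensions are illustrative, and a real SASRec adds positional embeddings, residual connections, layer normalization, and multiple blocks.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """One causal self-attention layer over a session.
    X: (seq_len, d) item embeddings in interaction order."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (seq_len, seq_len)
    # Causal mask: position t may only attend to positions <= t
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (seq_len, d)

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(4, d))                  # a 4-item session
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
H = causal_self_attention(X, Wq, Wk, Wv)
print("output shape:", H.shape)              # last row summarizes the session
```

Unlike the GRU below, which compresses history into one hidden state step by step, each attention output can weight any earlier item directly, however far back it occurred.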
```python
import numpy as np

class GRU4Rec:
    """
    Simplified GRU-based sequential recommender.
    Processes a sequence of item interactions and predicts the next item.
    """

    def __init__(self, n_items, emb_dim=16, hidden_dim=32):
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim

        # Item embeddings
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))

        # GRU parameters (simplified single-layer)
        # Update gate
        self.Wz = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bz = np.zeros(hidden_dim)
        # Reset gate
        self.Wr = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.br = np.zeros(hidden_dim)
        # Candidate hidden state
        self.Wh = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bh = np.zeros(hidden_dim)

        # Output projection
        self.Wo = np.random.normal(0, 0.01, (hidden_dim, n_items))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def tanh(self, x):
        return np.tanh(np.clip(x, -10, 10))

    def gru_step(self, x, h_prev):
        """Single GRU step."""
        combined = np.concatenate([x, h_prev])
        z = self.sigmoid(combined @ self.Wz + self.bz)  # Update gate
        r = self.sigmoid(combined @ self.Wr + self.br)  # Reset gate
        combined_r = np.concatenate([x, r * h_prev])
        h_candidate = self.tanh(combined_r @ self.Wh + self.bh)
        h_new = (1 - z) * h_prev + z * h_candidate
        return h_new

    def forward(self, item_sequence):
        """
        Process a sequence of item IDs and return scores for the next item.

        Args:
            item_sequence: list of item IDs (interaction history)

        Returns:
            scores: (n_items,) array of scores for each possible next item
        """
        h = np.zeros(self.hidden_dim)

        for item_id in item_sequence:
            x = self.item_emb[item_id]
            h = self.gru_step(x, h)

        # Score all items
        scores = h @ self.Wo
        return scores

    def recommend(self, item_sequence, top_k=5, exclude_seen=True):
        """Get top-k recommendations given a session."""
        scores = self.forward(item_sequence)

        if exclude_seen:
            for item_id in item_sequence:
                scores[item_id] = -np.inf

        top_items = np.argsort(-scores)[:top_k]
        return [(item_id, scores[item_id]) for item_id in top_items]


# Demo
n_items = 100
model = GRU4Rec(n_items, emb_dim=16, hidden_dim=32)

# Simulate a user session
session = [5, 12, 7, 23]  # User viewed items 5, 12, 7, 23 in order
recs = model.recommend(session, top_k=5)

print(f"Session: {session}")
print("Recommendations (untrained):")
for item_id, score in recs:
    print(f"  Item {item_id}: score={score:.4f}")

# Show how recommendations change as the session evolves
print("\nEvolving session recommendations:")
for length in range(1, len(session) + 1):
    sub_session = session[:length]
    recs = model.recommend(sub_session, top_k=3)
    items = [r[0] for r in recs]
    print(f"  After viewing {sub_session}: recommend {items}")
```

Embedding-Based Retrieval
In production systems with millions of items, you cannot score every item for every request. Embedding-based retrieval uses approximate nearest neighbor (ANN) search to efficiently find the most relevant items.
The Pipeline
1. Offline: Train a two-tower model. Compute and index all item embeddings using an ANN index (FAISS, ScaNN, Annoy, HNSW)
2. Online: Compute the user embedding from the user tower, query the ANN index for top-K nearest items
3. Re-rank: Score the top-K candidates with a more expensive model
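The offline and online steps above can be sketched with exact (brute-force) inner-product search standing in for the ANN index; the sizes below are illustrative, and the `item_index @ user_emb` scan is exactly the step that libraries like FAISS or ScaNN approximate at much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, k = 100_000, 64, 10

# Offline: precompute and L2-normalize all item embeddings
item_index = rng.normal(size=(n_items, dim))
item_index /= np.linalg.norm(item_index, axis=1, keepdims=True)

# Online: one user embedding per request, from the user tower
user_emb = rng.normal(size=dim)
user_emb /= np.linalg.norm(user_emb)

# Exact top-K by inner product; argpartition avoids a full sort
scores = item_index @ user_emb
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]     # order the K winners
print("top-10 item ids:", top_k)
```

With normalized embeddings, the inner product equals cosine similarity, so the same trained two-tower model serves both exact and approximate retrieval; the re-ranking stage then refines only these K candidates.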
ANN Algorithms
| Algorithm | Approach | Trade-off |
|---|---|---|
| HNSW | Hierarchical navigable small world graph | Best recall, more memory |
| IVF-PQ | Inverted file with product quantization | Good compression, fast |
| ScaNN | Anisotropic quantization | Google's production choice |
| Annoy | Random projection trees | Simple, good for moderate scale |
Key Consideration: Embedding Freshness
Item embeddings can be refreshed periodically (daily/weekly), but user embeddings should be computed at request time to capture recent interactions.