
Deep Learning Recommenders

Neural CF, two-tower models, deep factorization machines, and sequential recommendations


Deep Learning for Recommender Systems

Traditional CF methods (SVD, ALS) learn linear interactions between user and item latent factors. Deep learning allows us to model nonlinear relationships, incorporate diverse features (text, images, sequences), and build end-to-end systems.

Why Deep Learning for RecSys?

1. Nonlinear interactions: A neural network can learn complex preference patterns that a dot product cannot capture
2. Feature fusion: Easily combine heterogeneous inputs (ID embeddings + text + images + context)
3. Sequential modeling: RNNs and transformers capture the order of user interactions
4. Representation learning: Embeddings are learned end-to-end, optimized for the recommendation task

The Deep RecSys Landscape

| Model | Key Idea | Use Case |
|---|---|---|
| Neural CF (NCF) | Replace dot product with MLP | General rating/click prediction |
| Two-Tower | Separate user & item encoders | Large-scale retrieval |
| DeepFM | Combine factorization with DNN | Feature-rich CTR prediction |
| GRU4Rec | RNN over session sequences | Session-based recommendation |
| SASRec | Self-attention over sequences | Sequential recommendation |
| DLRM | Embedding + interaction + MLP | Industrial-scale systems |

Neural Collaborative Filtering (NCF)

NCF (He et al., 2017) replaces the dot product of matrix factorization with a neural network that can learn arbitrary interaction functions.

Architecture

1. Embedding layers: Map user IDs and item IDs to dense vectors
2. Interaction layer: Concatenate (or element-wise multiply) user and item embeddings
3. MLP layers: Multiple fully connected layers with nonlinear activations
4. Output layer: Sigmoid for implicit feedback (click/no-click), linear for explicit ratings

GMF + MLP = NeuMF

The full NCF model (NeuMF) combines two pathways:

  • GMF (Generalized Matrix Factorization): Element-wise product of user and item embeddings (like classical MF, but learned end-to-end)
  • MLP: Concatenated embeddings through hidden layers
The outputs of both pathways are concatenated and fed to a final prediction layer:

    GMF: p_gmf = user_emb_gmf * item_emb_gmf  (element-wise)
    MLP: p_mlp = MLP(concat(user_emb_mlp, item_emb_mlp))
    Output: sigmoid(W * concat(p_gmf, p_mlp) + b)
    

```python
import numpy as np

class NeuralCF:
    """
    Neural Collaborative Filtering (simplified NumPy implementation).
    Combines GMF and MLP pathways.
    """

    def __init__(self, n_users, n_items, emb_dim=8, mlp_dims=(16, 8), lr=0.01):
        self.n_users = n_users
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.lr = lr

        # GMF embeddings
        self.user_emb_gmf = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_gmf = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP embeddings
        self.user_emb_mlp = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_mlp = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP layers
        self.mlp_layers = []
        input_dim = emb_dim * 2
        for dim in mlp_dims:
            W = np.random.normal(0, 0.01, (input_dim, dim))
            b = np.zeros(dim)
            self.mlp_layers.append((W, b))
            input_dim = dim

        # Final output layer: concat GMF + MLP output -> scalar
        self.output_W = np.random.normal(0, 0.01, (emb_dim + mlp_dims[-1], 1))
        self.output_b = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def forward(self, user_id, item_id):
        """Forward pass for a single (user, item) pair."""
        # GMF pathway
        gmf_user = self.user_emb_gmf[user_id]
        gmf_item = self.item_emb_gmf[item_id]
        gmf_out = gmf_user * gmf_item  # Element-wise product

        # MLP pathway
        mlp_input = np.concatenate([
            self.user_emb_mlp[user_id],
            self.item_emb_mlp[item_id]
        ])
        h = mlp_input
        for W, b in self.mlp_layers:
            h = self.relu(h @ W + b)

        # Combine GMF and MLP
        combined = np.concatenate([gmf_out, h])
        logit = combined @ self.output_W + self.output_b
        return self.sigmoid(logit[0])

    def predict_all(self, user_id):
        """Predict scores for all items."""
        return np.array([self.forward(user_id, i) for i in range(self.n_items)])


# Demo
n_users, n_items = 100, 50
model = NeuralCF(n_users, n_items, emb_dim=8, mlp_dims=(16, 8))

# Before training: random predictions
user_id = 0
scores = model.predict_all(user_id)
print(f"NCF model: {n_users} users, {n_items} items")
print(f"User {user_id} top-5 items (untrained): {np.argsort(-scores)[:5]}")
print(f"Score range: [{scores.min():.4f}, {scores.max():.4f}]")
print("\nParameters:")
print(f"  GMF embeddings: {(n_users + n_items) * model.emb_dim:,}")
print(f"  MLP embeddings: {(n_users + n_items) * model.emb_dim:,}")
print(f"  MLP weights: {sum(W.size + b.size for W, b in model.mlp_layers):,}")
print(f"  Output layer: {model.output_W.size + model.output_b.size}")
```

Two-Tower Models

Two-tower (or dual encoder) models are the workhorse of large-scale retrieval. They independently encode users and items into embeddings, enabling efficient nearest-neighbor search at serving time.

Architecture

    User Tower:          Item Tower:
      user_id              item_id
      + user features      + item features
          |                    |
      [DNN layers]         [DNN layers]
          |                    |
      user_embedding       item_embedding
          \                  /
           cosine similarity
                |
              score

Key Advantages

1. Decoupled encoding: Item embeddings can be precomputed and indexed offline
2. Approximate nearest neighbor (ANN): At serving time, find top-K items using efficient ANN search (FAISS, ScaNN, HNSW) instead of scoring all items
3. Scalable: Handles millions of items -- only the user embedding needs to be computed at query time

Training

Trained with contrastive loss or sampled softmax:

  • Positive pairs: (user, item) from actual interactions
  • Negative pairs: (user, random_item) -- items the user did not interact with
  • Loss pushes positive pairs closer and negative pairs further apart in embedding space
```python
import numpy as np

class TwoTowerModel:
    """
    Simplified two-tower model for recommendation.
    User and item towers are separate MLPs producing embeddings.
    """

    def __init__(self, n_users, n_items, n_user_features, n_item_features,
                 emb_dim=16, hidden_dim=32):
        self.emb_dim = emb_dim

        # User tower
        self.user_emb = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.user_W1 = np.random.normal(0, 0.01, (emb_dim + n_user_features, hidden_dim))
        self.user_b1 = np.zeros(hidden_dim)
        self.user_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.user_b2 = np.zeros(emb_dim)

        # Item tower
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))
        self.item_W1 = np.random.normal(0, 0.01, (emb_dim + n_item_features, hidden_dim))
        self.item_b1 = np.zeros(hidden_dim)
        self.item_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.item_b2 = np.zeros(emb_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def l2_normalize(self, x):
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    def encode_user(self, user_id, user_features):
        """Encode user into embedding space."""
        x = np.concatenate([self.user_emb[user_id], user_features])
        h = self.relu(x @ self.user_W1 + self.user_b1)
        emb = h @ self.user_W2 + self.user_b2
        return self.l2_normalize(emb)

    def encode_item(self, item_id, item_features):
        """Encode item into embedding space."""
        x = np.concatenate([self.item_emb[item_id], item_features])
        h = self.relu(x @ self.item_W1 + self.item_b1)
        emb = h @ self.item_W2 + self.item_b2
        return self.l2_normalize(emb)

    def score(self, user_emb, item_emb):
        """Cosine similarity between user and item embeddings."""
        return np.dot(user_emb, item_emb)

    def retrieve_top_k(self, user_id, user_features, item_ids, item_features_list, k=5):
        """Retrieve top-k items for a user."""
        user_emb = self.encode_user(user_id, user_features)

        # In production, item embeddings are precomputed and stored in an ANN index
        scores = []
        for item_id, item_feat in zip(item_ids, item_features_list):
            item_emb = self.encode_item(item_id, item_feat)
            scores.append((item_id, self.score(user_emb, item_emb)))

        scores.sort(key=lambda x: -x[1])
        return scores[:k]


# Demo
n_users, n_items = 1000, 5000
n_user_feat, n_item_feat = 5, 10

model = TwoTowerModel(n_users, n_items, n_user_feat, n_item_feat,
                      emb_dim=16, hidden_dim=32)

# Simulate retrieval
user_feat = np.random.randn(n_user_feat)
item_ids = list(range(20))  # Score 20 candidate items
item_feats = [np.random.randn(n_item_feat) for _ in item_ids]

top_items = model.retrieve_top_k(0, user_feat, item_ids, item_feats, k=5)
print("Two-Tower Retrieval (untrained):")
for item_id, score in top_items:
    print(f"  Item {item_id}: score={score:.4f}")

print(f"\nModel scale: {n_users:,} users, {n_items:,} items")
print(f"User embedding: {model.encode_user(0, user_feat).shape}")
print(f"Item embedding: {model.encode_item(0, item_feats[0]).shape}")
```
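The contrastive objective described under Training is often implemented with in-batch negatives: each user's positive item doubles as a negative for every other user in the batch. A minimal NumPy sketch of this idea (the `user_embs`/`item_embs` arrays stand in for tower outputs; this is an illustration under those assumptions, not the exact loss from any particular paper):

```python
import numpy as np

def in_batch_softmax_loss(user_embs, item_embs, temperature=0.05):
    """In-batch sampled-softmax loss for a two-tower model.

    user_embs, item_embs: (batch, dim) L2-normalized tower outputs,
    where row i of item_embs is the positive item for row i of user_embs.
    Every other item in the batch acts as a negative.
    """
    logits = (user_embs @ item_embs.T) / temperature  # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 users, their 4 positive items, 8-dim embeddings
rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8)); u /= np.linalg.norm(u, axis=1, keepdims=True)
v = rng.normal(size=(4, 8)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"loss on random embeddings:  {in_batch_softmax_loss(u, v):.4f}")

# If each user's embedding matches its positive item's embedding,
# the diagonal dominates and the loss drops sharply
print(f"loss on aligned embeddings: {in_batch_softmax_loss(v, v):.4f}")
```

Minimizing this loss pulls each positive pair together while pushing the user away from the other items in the batch, which is exactly the geometry the bullet points above describe.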

Deep Factorization Machines (DeepFM)

DeepFM combines the strengths of:

  • Factorization Machines: Capture pairwise feature interactions efficiently
  • Deep Neural Networks: Capture higher-order feature interactions

This is especially powerful for click-through rate (CTR) prediction, where you have many sparse categorical features (user demographics, item attributes, context).

Architecture

    Input: [user_id, item_id, gender, city, device, time_of_day, ...]
                        |
                  Embedding Layer
                  /            \
             FM Component    DNN Component
             (pairwise       (higher-order
              interactions)   interactions)
                  \            /
                  Addition/Concat
                        |
                     Sigmoid
                        |
                  Click probability

FM Component

Captures all pairwise interactions between features:

    y_FM = w0 + sum_i(w_i * x_i) + sum_i sum_{j>i} <v_i, v_j> * x_i * x_j

where <v_i, v_j> is the dot product of the learned factor vectors for features i and j, so the interaction weight of every feature pair is factorized rather than stored explicitly.
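Although the pairwise sum naively costs O(n^2), the factorized weights <v_i, v_j> admit an O(k*n) reformulation: sum_{i<j} <v_i, v_j> x_i x_j = 1/2 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]. A minimal sketch (variable names are illustrative), checked against the explicit double sum:

```python
import numpy as np

def fm_forward(x, w0, w, V):
    """Second-order factorization machine output.

    x:  (n,) feature vector (typically sparse one-hot/multi-hot)
    w0: scalar bias; w: (n,) linear weights
    V:  (n, k) factor matrix; <V[i], V[j]> weights the x_i * x_j interaction
    """
    linear = w0 + w @ x
    # O(n*k) pairwise term via the sum-of-squares identity
    s = V.T @ x                  # (k,)  sum_i v_{i,f} * x_i
    s_sq = (V * V).T @ (x * x)   # (k,)  sum_i v_{i,f}^2 * x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + pairwise

# Verify against the explicit O(n^2) double sum
rng = np.random.default_rng(1)
n, k = 6, 3
x = rng.normal(size=n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
print(np.isclose(fm_forward(x, w0, w, V), naive))  # True
```

This linear-time trick is what makes the FM component cheap enough to pair with the DNN component in a single model.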

DNN Component

Concatenates all embeddings and passes them through multiple dense layers to capture complex feature interactions that are hard to express as pairwise products.

Sequential Recommendation: GRU4Rec

For session-based or sequential recommendation, the order of user interactions matters. GRU4Rec (Hidasi et al., 2016) models the sequence of items a user interacted with using a Gated Recurrent Unit (GRU).

How It Works

1. Each item is embedded into a dense vector
2. The sequence of item embeddings is fed to a GRU
3. The GRU's hidden state captures the evolving user intent
4. The final hidden state is used to score candidate items

Attention-Based Alternative: SASRec

SASRec (Self-Attentive Sequential Recommendation) uses a transformer-style self-attention mechanism instead of an RNN, allowing it to attend to relevant past items regardless of their position in the sequence.

```python
import numpy as np

class GRU4Rec:
    """
    Simplified GRU-based sequential recommender.
    Processes a sequence of item interactions and predicts the next item.
    """

    def __init__(self, n_items, emb_dim=16, hidden_dim=32):
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim

        # Item embeddings
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))

        # GRU parameters (simplified single-layer)
        # Update gate
        self.Wz = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bz = np.zeros(hidden_dim)
        # Reset gate
        self.Wr = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.br = np.zeros(hidden_dim)
        # Candidate hidden state
        self.Wh = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bh = np.zeros(hidden_dim)

        # Output projection
        self.Wo = np.random.normal(0, 0.01, (hidden_dim, n_items))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def tanh(self, x):
        return np.tanh(np.clip(x, -10, 10))

    def gru_step(self, x, h_prev):
        """Single GRU step."""
        combined = np.concatenate([x, h_prev])
        z = self.sigmoid(combined @ self.Wz + self.bz)  # Update gate
        r = self.sigmoid(combined @ self.Wr + self.br)  # Reset gate
        combined_r = np.concatenate([x, r * h_prev])
        h_candidate = self.tanh(combined_r @ self.Wh + self.bh)
        return (1 - z) * h_prev + z * h_candidate

    def forward(self, item_sequence):
        """
        Process a sequence of item IDs and return scores for the next item.

        Args:
            item_sequence: list of item IDs (interaction history)

        Returns:
            scores: (n_items,) array of scores for each possible next item
        """
        h = np.zeros(self.hidden_dim)

        for item_id in item_sequence:
            h = self.gru_step(self.item_emb[item_id], h)

        # Score all items
        return h @ self.Wo

    def recommend(self, item_sequence, top_k=5, exclude_seen=True):
        """Get top-k recommendations given a session."""
        scores = self.forward(item_sequence)

        if exclude_seen:
            for item_id in item_sequence:
                scores[item_id] = -np.inf

        top_items = np.argsort(-scores)[:top_k]
        return [(item_id, scores[item_id]) for item_id in top_items]


# Demo
n_items = 100
model = GRU4Rec(n_items, emb_dim=16, hidden_dim=32)

# Simulate a user session
session = [5, 12, 7, 23]  # User viewed items 5, 12, 7, 23 in order
recs = model.recommend(session, top_k=5)

print(f"Session: {session}")
print("Recommendations (untrained):")
for item_id, score in recs:
    print(f"  Item {item_id}: score={score:.4f}")

# Show how recommendations change as the session evolves
print("\nEvolving session recommendations:")
for length in range(1, len(session) + 1):
    sub_session = session[:length]
    recs = model.recommend(sub_session, top_k=3)
    items = [r[0] for r in recs]
    print(f"  After viewing {sub_session}: recommend {items}")
```
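The self-attention alternative mentioned above can be sketched in the same style. A single causal attention layer replaces the GRU recurrence: each position attends to all earlier positions, and a mask keeps it from looking ahead. This is a minimal, untrained sketch, not the full SASRec architecture (which stacks layers and adds positional embeddings, feed-forward blocks, and layer normalization):

```python
import numpy as np

class TinySelfAttentionRec:
    """Single-head causal self-attention over an item sequence (SASRec-style sketch)."""

    def __init__(self, n_items, emb_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.item_emb = rng.normal(0, 0.01, (n_items, emb_dim))
        # Query/key/value projections
        self.Wq = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.Wk = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.Wv = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.emb_dim = emb_dim

    def forward(self, item_sequence):
        """Return next-item scores over all items."""
        X = self.item_emb[item_sequence]            # (T, d) sequence embeddings
        Q, K, V = X @ self.Wq, X @ self.Wk, X @ self.Wv
        att = Q @ K.T / np.sqrt(self.emb_dim)       # (T, T) attention logits
        T = len(item_sequence)
        att[np.triu_indices(T, k=1)] = -np.inf      # causal mask: no peeking ahead
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)       # row-wise softmax
        H = att @ V                                 # (T, d) contextualized states
        # Score candidates from the last position, tying output to item embeddings
        return H[-1] @ self.item_emb.T              # (n_items,)

model = TinySelfAttentionRec(n_items=100)
scores = model.forward([5, 12, 7, 23])
print(f"top-3 next items (untrained): {np.argsort(-scores)[:3]}")
```

Unlike the GRU, attention reaches any past item in one step, which is why SASRec handles long-range dependencies within a session more directly.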

Embedding-Based Retrieval

In production systems with millions of items, you cannot score every item for every request. Embedding-based retrieval uses approximate nearest neighbor (ANN) search to efficiently find the most relevant items.

The Pipeline

1. Offline: Train a two-tower model; compute and index all item embeddings with an ANN index (FAISS, ScaNN, Annoy, HNSW)
2. Online: Compute the user embedding from the user tower and query the ANN index for the top-K nearest items
3. Re-rank: Score the top-K candidates with a more expensive model
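A minimal version of this pipeline, using brute-force (exact) nearest-neighbor search in NumPy as a stand-in for a real ANN index such as FAISS or ScaNN (the embeddings here are random placeholders for trained tower outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 32

# Offline: "index" all item embeddings (rows are L2-normalized,
# so a dot product equals cosine similarity)
item_index = rng.normal(size=(n_items, dim))
item_index /= np.linalg.norm(item_index, axis=1, keepdims=True)

# Online: embed the user, then search the index
user_emb = rng.normal(size=dim)
user_emb /= np.linalg.norm(user_emb)

k = 10
scores = item_index @ user_emb              # one matrix-vector product
top_k = np.argpartition(-scores, k)[:k]     # O(n) candidate selection
top_k = top_k[np.argsort(-scores[top_k])]   # sort only the k candidates

print(f"top-{k} item ids: {top_k}")
print(f"best score: {scores[top_k[0]]:.4f}")
# Re-rank: these k candidates would now go to a heavier scoring model
```

A real ANN index replaces the exact matrix-vector scan with an approximate structure, trading a little recall for sublinear query time.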

ANN Algorithms

| Algorithm | Approach | Trade-off |
|---|---|---|
| HNSW | Hierarchical navigable small world graph | Best recall, more memory |
| IVF-PQ | Inverted file with product quantization | Good compression, fast |
| ScaNN | Anisotropic quantization | Google's production choice |
| Annoy | Random projection trees | Simple, good for moderate scale |

Key Consideration: Embedding Freshness

Item embeddings can be refreshed periodically (daily/weekly), but user embeddings should be computed at request time to capture recent interactions.

The Dot Product vs. MLP Debate

Simple dot-product scoring (as in two-tower models) is fast but limited in expressiveness. MLP-based scoring (as in NCF) can model complex interactions but cannot be used for ANN retrieval, since it requires both user and item embeddings simultaneously. The practical solution: use a simple dot-product model for candidate retrieval (fast, ANN-compatible), then re-rank the top candidates with a more expressive model (MLP, cross-attention, etc.).
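This two-stage pattern can be sketched end to end; here `mlp_score` is a random-weight stand-in for any trained, heavier ranking model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, k = 1_000, 16, 20

item_embs = rng.normal(size=(n_items, dim))
user_emb = rng.normal(size=dim)

# Stage 1: cheap dot-product retrieval (ANN-compatible) -> top-k candidates
candidates = np.argsort(-(item_embs @ user_emb))[:k]

# Stage 2: an expressive scorer runs over only k candidates.
# A tiny random-weight MLP over the concatenated (user, item) pair
# stands in for a trained ranking model.
W1 = rng.normal(size=(2 * dim, 32))
W2 = rng.normal(size=32)

def mlp_score(u, v):
    h = np.maximum(0, np.concatenate([u, v]) @ W1)  # ReLU hidden layer
    return h @ W2

reranked = sorted(candidates, key=lambda i: -mlp_score(user_emb, item_embs[i]))
print(f"retrieval order: {list(candidates[:5])}")
print(f"after re-rank:   {reranked[:5]}")
```

The expensive model touches only k items instead of all n, which is what makes the combination practical at scale.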

Feature Interactions Matter

In CTR prediction, first-order features (user liked action movies) are less predictive than interactions (user liked action movies AND it is Friday evening AND user is on mobile). DeepFM and similar models explicitly capture these cross-feature interactions. In practice, careful feature engineering of interaction features can sometimes outperform deep models.
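As a concrete illustration of engineered interaction features, a cross feature can be built by hashing the concatenated category values into a shared bucket space; the model then learns a weight or embedding per bucket. The bucket count and feature strings below are made up for the example:

```python
import hashlib

def cross_feature(*values, n_buckets=100_000):
    """Hash a tuple of categorical values into one stable cross-feature id."""
    key = "_x_".join(map(str, values)).encode()
    # A stable hash (unlike Python's salted hash()) so ids persist across runs
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

# "liked action AND Friday evening AND mobile" becomes a single
# categorical feature the model can learn a weight/embedding for
fid = cross_feature("genre=action", "daypart=friday_evening", "device=mobile")
print(f"cross-feature bucket: {fid}")
```

Hashing keeps the feature space bounded at the cost of occasional bucket collisions, which is the usual trade-off in hand-crafted CTR pipelines.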