
Deep Learning Recommenders

Neural CF, two-tower models, deep factorization machines, and sequential recommendations


Deep Learning for Recommender Systems

Traditional CF methods (SVD, ALS) learn linear interactions between user and item latent factors. Deep learning allows us to model nonlinear relationships, incorporate diverse features (text, images, sequences), and build end-to-end systems.

Why Deep Learning for RecSys?

1. Nonlinear interactions: A neural network can learn complex preference patterns that a dot product cannot capture
2. Feature fusion: Easily combine heterogeneous inputs (ID embeddings + text + images + context)
3. Sequential modeling: RNNs and transformers capture the order of user interactions
4. Representation learning: Embeddings are learned end-to-end, optimized for the recommendation task

The Deep RecSys Landscape

| Model | Key Idea | Use Case |
|---|---|---|
| Neural CF (NCF) | Replace dot product with MLP | General rating/click prediction |
| Two-Tower | Separate user & item encoders | Large-scale retrieval |
| DeepFM | Combine factorization with DNN | Feature-rich CTR prediction |
| GRU4Rec | RNN over session sequences | Session-based recommendation |
| SASRec | Self-attention over sequences | Sequential recommendation |
| DLRM | Embedding + interaction + MLP | Industrial-scale systems |

Neural Collaborative Filtering (NCF)

NCF (He et al., 2017) replaces the dot product of matrix factorization with a neural network that can learn arbitrary interaction functions.

Architecture

1. Embedding layers: Map user IDs and item IDs to dense vectors
2. Interaction layer: Concatenate (or element-wise multiply) user and item embeddings
3. MLP layers: Multiple fully connected layers with nonlinear activations
4. Output layer: Sigmoid for implicit feedback (click/no-click), linear for explicit ratings

GMF + MLP = NeuMF

The full NCF model (NeuMF) combines two pathways:

  • GMF (Generalized Matrix Factorization): Element-wise product of user and item embeddings (like classical MF, but learned end-to-end)
  • MLP: Concatenated embeddings through hidden layers
The outputs of both pathways are concatenated and fed to a final prediction layer:

    GMF: p_gmf = user_emb_gmf * item_emb_gmf  (element-wise)
    MLP: p_mlp = MLP(concat(user_emb_mlp, item_emb_mlp))
    Output: sigmoid(W * concat(p_gmf, p_mlp) + b)
    

```python
import numpy as np

class NeuralCF:
    """
    Neural Collaborative Filtering (simplified NumPy implementation).
    Combines GMF and MLP pathways.
    """

    def __init__(self, n_users, n_items, emb_dim=8, mlp_dims=(16, 8), lr=0.01):
        self.n_users = n_users
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.lr = lr

        # GMF embeddings
        self.user_emb_gmf = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_gmf = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP embeddings
        self.user_emb_mlp = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.item_emb_mlp = np.random.normal(0, 0.01, (n_items, emb_dim))

        # MLP layers
        self.mlp_layers = []
        input_dim = emb_dim * 2
        for dim in mlp_dims:
            W = np.random.normal(0, 0.01, (input_dim, dim))
            b = np.zeros(dim)
            self.mlp_layers.append((W, b))
            input_dim = dim

        # Final output layer: concat GMF + MLP output -> scalar
        self.output_W = np.random.normal(0, 0.01, (emb_dim + mlp_dims[-1], 1))
        self.output_b = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def forward(self, user_id, item_id):
        """Forward pass for a single (user, item) pair."""
        # GMF pathway
        gmf_user = self.user_emb_gmf[user_id]
        gmf_item = self.item_emb_gmf[item_id]
        gmf_out = gmf_user * gmf_item  # Element-wise product

        # MLP pathway
        mlp_input = np.concatenate([
            self.user_emb_mlp[user_id],
            self.item_emb_mlp[item_id]
        ])
        h = mlp_input
        for W, b in self.mlp_layers:
            h = self.relu(h @ W + b)

        # Combine GMF and MLP
        combined = np.concatenate([gmf_out, h])
        logit = combined @ self.output_W + self.output_b
        return self.sigmoid(logit[0])

    def predict_all(self, user_id):
        """Predict scores for all items."""
        return np.array([self.forward(user_id, i) for i in range(self.n_items)])


# Demo
n_users, n_items = 100, 50
model = NeuralCF(n_users, n_items, emb_dim=8, mlp_dims=(16, 8))

# Before training: random predictions
user_id = 0
scores = model.predict_all(user_id)
print(f"NCF model: {n_users} users, {n_items} items")
print(f"User {user_id} top-5 items (untrained): {np.argsort(-scores)[:5]}")
print(f"Score range: [{scores.min():.4f}, {scores.max():.4f}]")
print("\nParameters:")
print(f"  GMF embeddings: {(n_users + n_items) * model.emb_dim:,}")
print(f"  MLP embeddings: {(n_users + n_items) * model.emb_dim:,}")
print(f"  MLP weights: {sum(W.size + b.size for W, b in model.mlp_layers):,}")
print(f"  Output layer: {model.output_W.size + model.output_b.size}")
```

Two-Tower Models

Two-tower (or dual encoder) models are the workhorse of large-scale retrieval. They independently encode users and items into embeddings, enabling efficient nearest-neighbor search at serving time.

Architecture

    User Tower:          Item Tower:
      user_id              item_id
      + user features      + item features
          |                    |
      [DNN layers]         [DNN layers]
          |                    |
      user_embedding       item_embedding
          \                  /
           cosine similarity
                |
              score

Key Advantages

1. Decoupled encoding: Item embeddings can be precomputed and indexed offline
2. Approximate nearest neighbor (ANN): At serving time, find top-K items using efficient ANN search (FAISS, ScaNN, HNSW) instead of scoring all items
3. Scalable: Handles millions of items -- only the user embedding needs to be computed at query time

Training

Trained with contrastive loss or sampled softmax:

  • Positive pairs: (user, item) from actual interactions
  • Negative pairs: (user, random_item) -- items the user did not interact with
  • Loss pushes positive pairs closer and negative pairs further apart in embedding space
```python
import numpy as np

class TwoTowerModel:
    """
    Simplified two-tower model for recommendation.
    User and item towers are separate MLPs producing embeddings.
    """

    def __init__(self, n_users, n_items, n_user_features, n_item_features,
                 emb_dim=16, hidden_dim=32):
        self.emb_dim = emb_dim

        # User tower
        self.user_emb = np.random.normal(0, 0.01, (n_users, emb_dim))
        self.user_W1 = np.random.normal(0, 0.01, (emb_dim + n_user_features, hidden_dim))
        self.user_b1 = np.zeros(hidden_dim)
        self.user_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.user_b2 = np.zeros(emb_dim)

        # Item tower
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))
        self.item_W1 = np.random.normal(0, 0.01, (emb_dim + n_item_features, hidden_dim))
        self.item_b1 = np.zeros(hidden_dim)
        self.item_W2 = np.random.normal(0, 0.01, (hidden_dim, emb_dim))
        self.item_b2 = np.zeros(emb_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def l2_normalize(self, x):
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    def encode_user(self, user_id, user_features):
        """Encode user into embedding space."""
        x = np.concatenate([self.user_emb[user_id], user_features])
        h = self.relu(x @ self.user_W1 + self.user_b1)
        emb = h @ self.user_W2 + self.user_b2
        return self.l2_normalize(emb)

    def encode_item(self, item_id, item_features):
        """Encode item into embedding space."""
        x = np.concatenate([self.item_emb[item_id], item_features])
        h = self.relu(x @ self.item_W1 + self.item_b1)
        emb = h @ self.item_W2 + self.item_b2
        return self.l2_normalize(emb)

    def score(self, user_emb, item_emb):
        """Cosine similarity between user and item embeddings."""
        return np.dot(user_emb, item_emb)

    def retrieve_top_k(self, user_id, user_features, item_ids, item_features_list, k=5):
        """Retrieve top-k items for a user."""
        user_emb = self.encode_user(user_id, user_features)

        # In production, item embeddings are precomputed and stored in an ANN index
        scores = []
        for item_id, item_feat in zip(item_ids, item_features_list):
            item_emb = self.encode_item(item_id, item_feat)
            scores.append((item_id, self.score(user_emb, item_emb)))

        scores.sort(key=lambda x: -x[1])
        return scores[:k]


# Demo
n_users, n_items = 1000, 5000
n_user_feat, n_item_feat = 5, 10

model = TwoTowerModel(n_users, n_items, n_user_feat, n_item_feat,
                      emb_dim=16, hidden_dim=32)

# Simulate retrieval
user_feat = np.random.randn(n_user_feat)
item_ids = list(range(20))  # Score 20 candidate items
item_feats = [np.random.randn(n_item_feat) for _ in item_ids]

top_items = model.retrieve_top_k(0, user_feat, item_ids, item_feats, k=5)
print("Two-Tower Retrieval (untrained):")
for item_id, score in top_items:
    print(f"  Item {item_id}: score={score:.4f}")

print(f"\nModel scale: {n_users:,} users, {n_items:,} items")
print(f"User embedding: {model.encode_user(0, user_feat).shape}")
print(f"Item embedding: {model.encode_item(0, item_feats[0]).shape}")
```
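The contrastive objective described under Training is often implemented with in-batch negatives: each user's positive item doubles as a negative for every other user in the batch. A minimal NumPy sketch of this idea (the `user_embs`/`item_embs` arrays stand in for tower outputs; this is an illustration under those assumptions, not the exact loss from any particular paper):

```python
import numpy as np

def in_batch_softmax_loss(user_embs, item_embs, temperature=0.05):
    """In-batch sampled-softmax loss for a two-tower model.

    user_embs, item_embs: (batch, dim) L2-normalized tower outputs,
    where row i of item_embs is the positive item for row i of user_embs.
    Every other item in the batch acts as a negative.
    """
    logits = (user_embs @ item_embs.T) / temperature  # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 users, their 4 positive items, 8-dim embeddings
rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8)); u /= np.linalg.norm(u, axis=1, keepdims=True)
v = rng.normal(size=(4, 8)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"loss on random embeddings:  {in_batch_softmax_loss(u, v):.4f}")

# If each user's embedding matches its positive item's embedding,
# the diagonal dominates and the loss drops sharply
print(f"loss on aligned embeddings: {in_batch_softmax_loss(v, v):.4f}")
```

Minimizing this loss pulls each positive pair together while pushing the user away from the other items in the batch, which is exactly the geometry the bullet points above describe.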

Deep Factorization Machines (DeepFM)

DeepFM combines the strengths of:

  • Factorization Machines: Capture pairwise feature interactions efficiently
  • Deep Neural Networks: Capture higher-order feature interactions

This is especially powerful for click-through rate (CTR) prediction, where you have many sparse categorical features (user demographics, item attributes, context).

Architecture

    Input: [user_id, item_id, gender, city, device, time_of_day, ...]
                        |
                  Embedding Layer
                  /            \
             FM Component    DNN Component
             (pairwise       (higher-order
              interactions)   interactions)
                  \            /
                  Addition/Concat
                        |
                     Sigmoid
                        |
                  Click probability

FM Component

Captures all pairwise interactions between features:

    y_FM = w0 + sum_i(w_i * x_i) + sum_i sum_{j>i} <v_i, v_j> * x_i * x_j

where <v_i, v_j> is the dot product of the learned factor vectors for features i and j, so the interaction weight of every feature pair is factorized rather than stored explicitly.
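Although the pairwise sum naively costs O(n^2), the factorized weights <v_i, v_j> admit an O(k*n) reformulation: sum_{i<j} <v_i, v_j> x_i x_j = 1/2 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]. A minimal sketch (variable names are illustrative), checked against the explicit double sum:

```python
import numpy as np

def fm_forward(x, w0, w, V):
    """Second-order factorization machine output.

    x:  (n,) feature vector (typically sparse one-hot/multi-hot)
    w0: scalar bias; w: (n,) linear weights
    V:  (n, k) factor matrix; <V[i], V[j]> weights the x_i * x_j interaction
    """
    linear = w0 + w @ x
    # O(n*k) pairwise term via the sum-of-squares identity
    s = V.T @ x                  # (k,)  sum_i v_{i,f} * x_i
    s_sq = (V * V).T @ (x * x)   # (k,)  sum_i v_{i,f}^2 * x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + pairwise

# Verify against the explicit O(n^2) double sum
rng = np.random.default_rng(1)
n, k = 6, 3
x = rng.normal(size=n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
print(np.isclose(fm_forward(x, w0, w, V), naive))  # True
```

This linear-time trick is what makes the FM component cheap enough to pair with the DNN component in a single model.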

DNN Component

Concatenates all embeddings and passes them through multiple dense layers to capture complex feature interactions that are hard to express as pairwise products.

Sequential Recommendation: GRU4Rec

For session-based or sequential recommendation, the order of user interactions matters. GRU4Rec (Hidasi et al., 2016) models the sequence of items a user interacted with using a Gated Recurrent Unit (GRU).

How It Works

1. Each item is embedded into a dense vector
2. The sequence of item embeddings is fed to a GRU
3. The GRU's hidden state captures the evolving user intent
4. The final hidden state is used to score candidate items

Attention-Based Alternative: SASRec

SASRec (Self-Attentive Sequential Recommendation) uses a transformer-style self-attention mechanism instead of an RNN, allowing it to attend to relevant past items regardless of their position in the sequence.

```python
import numpy as np

class GRU4Rec:
    """
    Simplified GRU-based sequential recommender.
    Processes a sequence of item interactions and predicts the next item.
    """

    def __init__(self, n_items, emb_dim=16, hidden_dim=32):
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim

        # Item embeddings
        self.item_emb = np.random.normal(0, 0.01, (n_items, emb_dim))

        # GRU parameters (simplified single-layer)
        # Update gate
        self.Wz = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bz = np.zeros(hidden_dim)
        # Reset gate
        self.Wr = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.br = np.zeros(hidden_dim)
        # Candidate hidden state
        self.Wh = np.random.normal(0, 0.01, (emb_dim + hidden_dim, hidden_dim))
        self.bh = np.zeros(hidden_dim)

        # Output projection
        self.Wo = np.random.normal(0, 0.01, (hidden_dim, n_items))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def tanh(self, x):
        return np.tanh(np.clip(x, -10, 10))

    def gru_step(self, x, h_prev):
        """Single GRU step."""
        combined = np.concatenate([x, h_prev])
        z = self.sigmoid(combined @ self.Wz + self.bz)  # Update gate
        r = self.sigmoid(combined @ self.Wr + self.br)  # Reset gate
        combined_r = np.concatenate([x, r * h_prev])
        h_candidate = self.tanh(combined_r @ self.Wh + self.bh)
        return (1 - z) * h_prev + z * h_candidate

    def forward(self, item_sequence):
        """
        Process a sequence of item IDs and return scores for the next item.

        Args:
            item_sequence: list of item IDs (interaction history)

        Returns:
            scores: (n_items,) array of scores for each possible next item
        """
        h = np.zeros(self.hidden_dim)

        for item_id in item_sequence:
            h = self.gru_step(self.item_emb[item_id], h)

        # Score all items
        return h @ self.Wo

    def recommend(self, item_sequence, top_k=5, exclude_seen=True):
        """Get top-k recommendations given a session."""
        scores = self.forward(item_sequence)

        if exclude_seen:
            for item_id in item_sequence:
                scores[item_id] = -np.inf

        top_items = np.argsort(-scores)[:top_k]
        return [(item_id, scores[item_id]) for item_id in top_items]


# Demo
n_items = 100
model = GRU4Rec(n_items, emb_dim=16, hidden_dim=32)

# Simulate a user session
session = [5, 12, 7, 23]  # User viewed items 5, 12, 7, 23 in order
recs = model.recommend(session, top_k=5)

print(f"Session: {session}")
print("Recommendations (untrained):")
for item_id, score in recs:
    print(f"  Item {item_id}: score={score:.4f}")

# Show how recommendations change as the session evolves
print("\nEvolving session recommendations:")
for length in range(1, len(session) + 1):
    sub_session = session[:length]
    recs = model.recommend(sub_session, top_k=3)
    items = [r[0] for r in recs]
    print(f"  After viewing {sub_session}: recommend {items}")
```
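The self-attention alternative mentioned above can be sketched in the same style. A single causal attention layer replaces the GRU recurrence: each position attends to all earlier positions, and a mask keeps it from looking ahead. This is a minimal, untrained sketch, not the full SASRec architecture (which stacks layers and adds positional embeddings, feed-forward blocks, and layer normalization):

```python
import numpy as np

class TinySelfAttentionRec:
    """Single-head causal self-attention over an item sequence (SASRec-style sketch)."""

    def __init__(self, n_items, emb_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.item_emb = rng.normal(0, 0.01, (n_items, emb_dim))
        # Query/key/value projections
        self.Wq = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.Wk = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.Wv = rng.normal(0, 0.01, (emb_dim, emb_dim))
        self.emb_dim = emb_dim

    def forward(self, item_sequence):
        """Return next-item scores over all items."""
        X = self.item_emb[item_sequence]            # (T, d) sequence embeddings
        Q, K, V = X @ self.Wq, X @ self.Wk, X @ self.Wv
        att = Q @ K.T / np.sqrt(self.emb_dim)       # (T, T) attention logits
        T = len(item_sequence)
        att[np.triu_indices(T, k=1)] = -np.inf      # causal mask: no peeking ahead
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)       # row-wise softmax
        H = att @ V                                 # (T, d) contextualized states
        # Score candidates from the last position, tying output to item embeddings
        return H[-1] @ self.item_emb.T              # (n_items,)

model = TinySelfAttentionRec(n_items=100)
scores = model.forward([5, 12, 7, 23])
print(f"top-3 next items (untrained): {np.argsort(-scores)[:3]}")
```

Unlike the GRU, attention reaches any past item in one step, which is why SASRec handles long-range dependencies within a session more directly.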

Embedding-Based Retrieval

In production systems with millions of items, you cannot score every item for every request. Embedding-based retrieval uses approximate nearest neighbor (ANN) search to efficiently find the most relevant items.

The Pipeline

1. Offline: Train a two-tower model; compute and index all item embeddings with an ANN index (FAISS, ScaNN, Annoy, HNSW)
2. Online: Compute the user embedding from the user tower and query the ANN index for the top-K nearest items
3. Re-rank: Score the top-K candidates with a more expensive model
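A minimal version of this pipeline, using brute-force (exact) nearest-neighbor search in NumPy as a stand-in for a real ANN index such as FAISS or ScaNN (the embeddings here are random placeholders for trained tower outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 32

# Offline: "index" all item embeddings (rows are L2-normalized,
# so a dot product equals cosine similarity)
item_index = rng.normal(size=(n_items, dim))
item_index /= np.linalg.norm(item_index, axis=1, keepdims=True)

# Online: embed the user, then search the index
user_emb = rng.normal(size=dim)
user_emb /= np.linalg.norm(user_emb)

k = 10
scores = item_index @ user_emb              # one matrix-vector product
top_k = np.argpartition(-scores, k)[:k]     # O(n) candidate selection
top_k = top_k[np.argsort(-scores[top_k])]   # sort only the k candidates

print(f"top-{k} item ids: {top_k}")
print(f"best score: {scores[top_k[0]]:.4f}")
# Re-rank: these k candidates would now go to a heavier scoring model
```

A real ANN index replaces the exact matrix-vector scan with an approximate structure, trading a little recall for sublinear query time.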

ANN Algorithms

| Algorithm | Approach | Trade-off |
|---|---|---|
| HNSW | Hierarchical navigable small world graph | Best recall, more memory |
| IVF-PQ | Inverted file with product quantization | Good compression, fast |
| ScaNN | Anisotropic quantization | Google's production choice |
| Annoy | Random projection trees | Simple, good for moderate scale |

Key Consideration: Embedding Freshness

Item embeddings can be refreshed periodically (daily/weekly), but user embeddings should be computed at request time to capture recent interactions.

The Dot Product vs. MLP Debate

Simple dot-product scoring (as in two-tower models) is fast but limited in expressiveness. MLP-based scoring (as in NCF) can model complex interactions but cannot be used for ANN retrieval, since it requires both user and item embeddings simultaneously. The practical solution: use a simple dot-product model for candidate retrieval (fast, ANN-compatible), then re-rank the top candidates with a more expressive model (MLP, cross-attention, etc.).
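This two-stage pattern can be sketched end to end; here `mlp_score` is a random-weight stand-in for any trained, heavier ranking model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, k = 1_000, 16, 20

item_embs = rng.normal(size=(n_items, dim))
user_emb = rng.normal(size=dim)

# Stage 1: cheap dot-product retrieval (ANN-compatible) -> top-k candidates
candidates = np.argsort(-(item_embs @ user_emb))[:k]

# Stage 2: an expressive scorer runs over only k candidates.
# A tiny random-weight MLP over the concatenated (user, item) pair
# stands in for a trained ranking model.
W1 = rng.normal(size=(2 * dim, 32))
W2 = rng.normal(size=32)

def mlp_score(u, v):
    h = np.maximum(0, np.concatenate([u, v]) @ W1)  # ReLU hidden layer
    return h @ W2

reranked = sorted(candidates, key=lambda i: -mlp_score(user_emb, item_embs[i]))
print(f"retrieval order: {list(candidates[:5])}")
print(f"after re-rank:   {reranked[:5]}")
```

The expensive model touches only k items instead of all n, which is what makes the combination practical at scale.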

Feature Interactions Matter

In CTR prediction, first-order features (user liked action movies) are less predictive than interactions (user liked action movies AND it is Friday evening AND user is on mobile). DeepFM and similar models explicitly capture these cross-feature interactions. In practice, careful feature engineering of interaction features can sometimes outperform deep models.
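As a concrete illustration of engineered interaction features, a cross feature can be built by hashing the concatenated category values into a shared bucket space; the model then learns a weight or embedding per bucket. The bucket count and feature strings below are made up for the example:

```python
import hashlib

def cross_feature(*values, n_buckets=100_000):
    """Hash a tuple of categorical values into one stable cross-feature id."""
    key = "_x_".join(map(str, values)).encode()
    # A stable hash (unlike Python's salted hash()) so ids persist across runs
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

# "liked action AND Friday evening AND mobile" becomes a single
# categorical feature the model can learn a weight/embedding for
fid = cross_feature("genre=action", "daypart=friday_evening", "device=mobile")
print(f"cross-feature bucket: {fid}")
```

Hashing keeps the feature space bounded at the cost of occasional bucket collisions, which is the usual trade-off in hand-crafted CTR pipelines.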