Text Processing & Embeddings

Learn how raw text is transformed into numerical representations that neural networks can understand — from simple bag-of-words to powerful contextual embeddings.


Machines don't understand words — they understand numbers. The entire field of NLP revolves around one central challenge: how do we convert human language into numerical representations that preserve meaning?

Over the past two decades, the answer to that question has evolved dramatically:

| Era | Technique | Key Idea |
|---|---|---|
| ~2000 | Bag of Words (BoW) | Count word occurrences |
| ~2005 | TF-IDF | Weight words by importance |
| 2013 | Word2Vec / GloVe | Learn dense vector representations |
| 2018+ | Contextual embeddings (BERT, GPT) | Same word gets different vectors based on context |

Each step brought us closer to representations that capture the richness of human language. Let's walk through each one.

Why Numbers, Not Text?

Neural networks perform matrix multiplication, gradient descent, and activation functions — all mathematical operations that require numerical inputs. A string like "The cat sat on the mat" is meaningless to a GPU. We need to convert it into a tensor of floating-point numbers while preserving as much semantic information as possible.
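As a minimal sketch of that conversion, here is a toy vocabulary lookup in plain Python (the names `vocab` and `ids` are illustrative, not from any library): each unique word gets an integer index, and the sentence becomes a list of integers that can be fed into a tensor.

```python
# Minimal sketch: mapping words to integer IDs (a toy example, not a real tokenizer)
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Build a vocabulary: each unique word gets an integer index, in order of appearance
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(tokens))}
ids = [vocab[word] for word in tokens]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

These integer IDs are still arbitrary labels; the techniques below are all about turning them into vectors that carry meaning.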

Bag of Words (BoW)

The simplest approach: create a vocabulary of all unique words, then represent each document as a vector of word counts.

Example:

  • Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "ran"]
  • "The cat sat on the mat" → [2, 1, 1, 1, 1, 0, 0]
  • "The dog ran" → [1, 0, 0, 0, 0, 1, 1]
Problems with BoW:

  • Loses all word order ("dog bites man" = "man bites dog")
  • Vectors are extremely sparse (most entries are zero)
  • No notion of word similarity ("happy" and "joyful" are as different as "happy" and "table")
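The counting scheme above can be sketched in a few lines of plain Python (in practice, scikit-learn's CountVectorizer does this job; the `bow_vector` helper here is just illustrative):

```python
# Minimal bag-of-words sketch over a fixed vocabulary
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def bow_vector(text):
    """Count how often each vocabulary term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

print(bow_vector("The cat sat on the mat"))  # [2, 1, 1, 1, 1, 0, 0]
print(bow_vector("The dog ran"))             # [1, 0, 0, 0, 0, 1, 1]
```

Note how the second vector is mostly zeros, and how nothing in the representation tells us that "ran" and "chased" are related: exactly the sparsity and similarity problems listed above.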
TF-IDF: Term Frequency–Inverse Document Frequency

TF-IDF improves on raw counts by asking: how important is this word to this specific document, relative to the whole corpus?

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

  • TF(t, d) = frequency of term *t* in document *d*
  • IDF(t) = log(total documents / documents containing *t*)

Words like "the" appear in every document, so their IDF is near zero, while domain-specific terms like "embedding" have high IDF in a general corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A bird flew over the mat",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF shape:", tfidf_matrix.shape)
print("Document 0 vector:\n", tfidf_matrix[0].toarray())
# Notice "the" has low weight (appears everywhere)
# while "sat" has high weight (unique to document 0)
```

Word2Vec: Dense Learned Embeddings

The breakthrough came in 2013 when Mikolov et al. showed that you could learn vector representations by training a shallow neural network on a simple task: predict a word from its context (CBOW) or predict the context from a word (Skip-gram).

The magic is that the learned vectors capture semantic relationships:

king - man + woman ≈ queen
paris - france + italy ≈ rome

Each word gets a dense vector (typically 100–300 dimensions) where similar words are close together in the vector space.
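The analogy arithmetic can be demonstrated with hand-picked toy vectors. To be clear, these 3-dimensional vectors are invented for illustration; real Word2Vec embeddings are learned from data and live in a much higher-dimensional space:

```python
import math

# Toy 3-d vectors chosen by hand to make the analogy work;
# real Word2Vec vectors are learned and have 100-300 dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compute king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Which vocabulary word is closest to the result?
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

With trained embeddings (e.g. via the gensim library) the same nearest-neighbor search over the real vocabulary recovers the famous king/queen and paris/rome analogies.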

Static vs. Contextual Embeddings

Word2Vec and GloVe produce **static** embeddings — the word "bank" always gets the same vector whether it means a river bank or a financial institution. Contextual embeddings (BERT, GPT) produce different vectors depending on surrounding words, which is a massive improvement for ambiguous language.

Tokenization in Practice

Before any of these techniques work, we need to tokenize the raw text — split it into individual units. There are several strategies:

| Strategy | Example: "unhappiness" | Pros | Cons |
|---|---|---|---|
| Word-level | ["unhappiness"] | Simple, intuitive | Large vocab, can't handle unknown words |
| Character-level | ["u","n","h","a","p","p","i","n","e","s","s"] | Tiny vocab, no OOV | Very long sequences, harder to learn |
| Subword (BPE) | ["un", "happiness"] | Balanced vocab, handles OOV | Requires training a tokenizer |

Modern models almost universally use subword tokenization (Byte-Pair Encoding or SentencePiece) because it strikes the best balance.
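The core of BPE training is simple enough to sketch directly: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. This is a simplified sketch over a tiny hand-made corpus (the helper names and the three-word corpus are illustrative, not from a real tokenizer library):

```python
from collections import Counter

# Sketch of BPE training: repeatedly merge the most frequent adjacent symbol pair.
def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: words as character sequences with their frequencies
words = {tuple("unhappy"): 3, tuple("happiness"): 2, tuple("happy"): 5}

for _ in range(4):  # perform a few merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)

print(list(words))  # "happy" has emerged as a single subword unit
```

After four merges, "happy" has been fused into one symbol, and "unhappy" is segmented as ["un...", "happy"]-style pieces — exactly the behavior the table above describes for subword tokenization. Real tokenizers run thousands of merges over gigabytes of text.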

Building Text Models in Keras

Keras provides two key tools for text processing:

1. TextVectorization — a preprocessing layer that tokenizes and indexes text
2. Embedding — a trainable layer that maps integer token IDs to dense vectors

Let's build a complete text classification pipeline.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# --- TextVectorization: converts raw strings → integer sequences ---
max_tokens = 10000
max_length = 200

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt the layer to your training data (builds the vocabulary)
train_texts = ["This movie was great!", "Terrible film, waste of time.", ...]
vectorize_layer.adapt(train_texts)

# See the vocabulary
vocab = vectorize_layer.get_vocabulary()
print(f"Vocabulary size: {len(vocab)}")
print(f"First 10 tokens: {vocab[:10]}")

# Vectorize a sentence
sample = tf.constant(["This movie was great!"])
print(vectorize_layer(sample))  # e.g., [  12   45    8  203    0    0 ...]
```
```python
# --- Full text classification model ---
embedding_dim = 128

model = models.Sequential([
    # Input: raw strings
    layers.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,

    # Embedding: (batch, seq_len) → (batch, seq_len, embedding_dim)
    layers.Embedding(
        input_dim=max_tokens,
        output_dim=embedding_dim,
        mask_zero=True,  # Ignore padding tokens
    ),

    # GlobalAveragePooling1D: (batch, seq_len, 128) → (batch, 128)
    # Averages across the sequence dimension — simple but effective
    layers.GlobalAveragePooling1D(),

    # Classification head
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # Binary classification
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

GlobalAveragePooling1D vs. Flatten

GlobalAveragePooling1D averages all token embeddings into a single vector. It's much more parameter-efficient than Flatten (which would create a huge dense layer) and acts as a form of regularization. Think of it as creating a "sentence embedding" by averaging all word embeddings.
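The parameter savings are easy to verify with back-of-the-envelope arithmetic, using the shapes from the model above (seq_len = 200, embedding_dim = 128, first Dense layer = 64 units):

```python
# Parameter count of the first Dense(64) layer, depending on what feeds it
seq_len, embedding_dim, hidden = 200, 128, 64

# After Flatten: the Dense layer sees a (seq_len * embedding_dim)-long vector,
# so every (position, embedding-channel) pair gets its own weight, plus biases
flatten_params = seq_len * embedding_dim * hidden + hidden

# After GlobalAveragePooling1D: the Dense layer sees one embedding_dim-long vector
pooling_params = embedding_dim * hidden + hidden

print(flatten_params)  # 1638464
print(pooling_params)  # 8256
```

Roughly 1.6 million weights versus about eight thousand — a ~200× reduction in that layer, which is why pooling is the default choice for this kind of baseline.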

IMDB Sentiment Classification — End to End

Let's put it all together with the classic IMDB movie review dataset:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow_datasets as tfds

# Load IMDB dataset (25k train, 25k test)
(train_data, test_data), info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

# Prepare batches
train_ds = train_data.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
test_ds = test_data.batch(64).prefetch(tf.data.AUTOTUNE)

# Build the TextVectorization layer
max_tokens = 20000
max_length = 500

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt on training text only
train_text = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_text)

# Build model
model = models.Sequential([
    vectorize_layer,
    layers.Embedding(max_tokens, 128, mask_zero=True),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
# Expect ~87-89% accuracy with this simple architecture
```

What's Next?

This simple Embedding + GlobalAveragePooling model is a strong baseline, but it still ignores word order. In the next lesson, we'll see how Transformers use **self-attention** to consider every word in relation to every other word — achieving much higher accuracy.