Text Processing & Embeddings

Learn how raw text is transformed into numerical representations that neural networks can understand — from simple bag-of-words to powerful contextual embeddings.


Machines don't understand words — they understand numbers. The entire field of NLP revolves around one central challenge: how do we convert human language into numerical representations that preserve meaning?

Over the past two decades, the answer to that question has evolved dramatically:

| Era | Technique | Key Idea |
|---|---|---|
| ~2000 | Bag of Words (BoW) | Count word occurrences |
| ~2005 | TF-IDF | Weight words by importance |
| 2013 | Word2Vec / GloVe | Learn dense vector representations |
| 2018+ | Contextual embeddings (BERT, GPT) | Same word gets different vectors based on context |

Each step brought us closer to representations that capture the richness of human language. Let's walk through each one.

Why Numbers, Not Text?

Neural networks perform matrix multiplication, gradient descent, and activation functions — all mathematical operations that require numerical inputs. A string like "The cat sat on the mat" is meaningless to a GPU. We need to convert it into a tensor of floating-point numbers while preserving as much semantic information as possible.
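As a minimal sketch of that conversion, here is a toy vocabulary lookup in plain Python (the names `vocab` and `ids` are illustrative, not from any library): each unique word gets an integer index, and the sentence becomes a list of integers that can be fed into a tensor.

```python
# Minimal sketch: mapping words to integer IDs (a toy example, not a real tokenizer)
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Build a vocabulary: each unique word gets an integer index, in order of appearance
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(tokens))}
ids = [vocab[word] for word in tokens]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

These integer IDs are still arbitrary labels; the techniques below are all about turning them into vectors that carry meaning.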

Bag of Words (BoW)

The simplest approach: create a vocabulary of all unique words, then represent each document as a vector of word counts.

Example:

  • Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "ran"]
  • "The cat sat on the mat" → [2, 1, 1, 1, 1, 0, 0]
  • "The dog ran" → [1, 0, 0, 0, 0, 1, 1]
Problems with BoW:

  • Loses all word order ("dog bites man" = "man bites dog")
  • Vectors are extremely sparse (most entries are zero)
  • No notion of word similarity ("happy" and "joyful" are as different as "happy" and "table")
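The counting scheme above can be sketched in a few lines of plain Python (in practice, scikit-learn's CountVectorizer does this job; the `bow_vector` helper here is just illustrative):

```python
# Minimal bag-of-words sketch over a fixed vocabulary
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def bow_vector(text):
    """Count how often each vocabulary term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

print(bow_vector("The cat sat on the mat"))  # [2, 1, 1, 1, 1, 0, 0]
print(bow_vector("The dog ran"))             # [1, 0, 0, 0, 0, 1, 1]
```

Note how the second vector is mostly zeros, and how nothing in the representation tells us that "ran" and "chased" are related: exactly the sparsity and similarity problems listed above.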
TF-IDF: Term Frequency–Inverse Document Frequency

TF-IDF improves on raw counts by asking: how important is this word to this specific document, relative to the whole corpus?

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

  • TF(t, d) = frequency of term *t* in document *d*
  • IDF(t) = log(total documents / documents containing *t*)

Words like "the" appear in every document, so their IDF is near zero, while domain-specific terms like "embedding" have high IDF in a general corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A bird flew over the mat",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF shape:", tfidf_matrix.shape)
print("Document 0 vector:\n", tfidf_matrix[0].toarray())
# Notice "the" has low weight (appears everywhere)
# while "sat" has high weight (unique to document 0)
```

Word2Vec: Dense Learned Embeddings

The breakthrough came in 2013 when Mikolov et al. showed that you could learn vector representations by training a shallow neural network on a simple task: predict a word from its context (CBOW) or predict the context from a word (Skip-gram).

The magic is that the learned vectors capture semantic relationships:

king - man + woman ≈ queen
paris - france + italy ≈ rome

Each word gets a dense vector (typically 100–300 dimensions) where similar words are close together in the vector space.
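The analogy arithmetic can be demonstrated with hand-picked toy vectors. To be clear, these 3-dimensional vectors are invented for illustration; real Word2Vec embeddings are learned from data and live in a much higher-dimensional space:

```python
import math

# Toy 3-d vectors chosen by hand to make the analogy work;
# real Word2Vec vectors are learned and have 100-300 dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compute king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Which vocabulary word is closest to the result?
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

With trained embeddings (e.g. via the gensim library) the same nearest-neighbor search over the real vocabulary recovers the famous king/queen and paris/rome analogies.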

Static vs. Contextual Embeddings

Word2Vec and GloVe produce **static** embeddings — the word "bank" always gets the same vector whether it means a river bank or a financial institution. Contextual embeddings (BERT, GPT) produce different vectors depending on surrounding words, which is a massive improvement for ambiguous language.

Tokenization in Practice

Before any of these techniques work, we need to tokenize the raw text — split it into individual units. There are several strategies:

| Strategy | Example: "unhappiness" | Pros | Cons |
|---|---|---|---|
| Word-level | ["unhappiness"] | Simple, intuitive | Large vocab, can't handle unknown words |
| Character-level | ["u","n","h","a","p","p","i","n","e","s","s"] | Tiny vocab, no OOV | Very long sequences, harder to learn |
| Subword (BPE) | ["un", "happiness"] | Balanced vocab, handles OOV | Requires training a tokenizer |

Modern models almost universally use subword tokenization (Byte-Pair Encoding or SentencePiece) because it strikes the best balance.
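The core of BPE training is simple enough to sketch directly: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. This is a simplified sketch over a tiny hand-made corpus (the helper names and the three-word corpus are illustrative, not from a real tokenizer library):

```python
from collections import Counter

# Sketch of BPE training: repeatedly merge the most frequent adjacent symbol pair.
def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: words as character sequences with their frequencies
words = {tuple("unhappy"): 3, tuple("happiness"): 2, tuple("happy"): 5}

for _ in range(4):  # perform a few merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)

print(list(words))  # "happy" has emerged as a single subword unit
```

After four merges, "happy" has been fused into one symbol, and "unhappy" is segmented as ["un...", "happy"]-style pieces — exactly the behavior the table above describes for subword tokenization. Real tokenizers run thousands of merges over gigabytes of text.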

Building Text Models in Keras

Keras provides two key tools for text processing:

1. TextVectorization — a preprocessing layer that tokenizes and indexes text
2. Embedding — a trainable layer that maps integer token IDs to dense vectors

Let's build a complete text classification pipeline.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# --- TextVectorization: converts raw strings → integer sequences ---
max_tokens = 10000
max_length = 200

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt the layer to your training data (builds the vocabulary)
train_texts = ["This movie was great!", "Terrible film, waste of time.", ...]
vectorize_layer.adapt(train_texts)

# See the vocabulary
vocab = vectorize_layer.get_vocabulary()
print(f"Vocabulary size: {len(vocab)}")
print(f"First 10 tokens: {vocab[:10]}")

# Vectorize a sentence
sample = tf.constant(["This movie was great!"])
print(vectorize_layer(sample))  # e.g., [  12   45    8  203    0    0 ...]
```
```python
# --- Full text classification model ---
embedding_dim = 128

model = models.Sequential([
    # Input: raw strings
    layers.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,

    # Embedding: (batch, seq_len) → (batch, seq_len, embedding_dim)
    layers.Embedding(
        input_dim=max_tokens,
        output_dim=embedding_dim,
        mask_zero=True,  # Ignore padding tokens
    ),

    # GlobalAveragePooling1D: (batch, seq_len, 128) → (batch, 128)
    # Averages across the sequence dimension — simple but effective
    layers.GlobalAveragePooling1D(),

    # Classification head
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # Binary classification
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

GlobalAveragePooling1D vs. Flatten

GlobalAveragePooling1D averages all token embeddings into a single vector. It's much more parameter-efficient than Flatten (which would create a huge dense layer) and acts as a form of regularization. Think of it as creating a "sentence embedding" by averaging all word embeddings.
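The parameter savings are easy to verify with back-of-the-envelope arithmetic, using the shapes from the model above (seq_len = 200, embedding_dim = 128, first Dense layer = 64 units):

```python
# Parameter count of the first Dense(64) layer, depending on what feeds it
seq_len, embedding_dim, hidden = 200, 128, 64

# After Flatten: the Dense layer sees a (seq_len * embedding_dim)-long vector,
# so every (position, embedding-channel) pair gets its own weight, plus biases
flatten_params = seq_len * embedding_dim * hidden + hidden

# After GlobalAveragePooling1D: the Dense layer sees one embedding_dim-long vector
pooling_params = embedding_dim * hidden + hidden

print(flatten_params)  # 1638464
print(pooling_params)  # 8256
```

Roughly 1.6 million weights versus about eight thousand — a ~200× reduction in that layer, which is why pooling is the default choice for this kind of baseline.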

IMDB Sentiment Classification — End to End

Let's put it all together with the classic IMDB movie review dataset:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow_datasets as tfds

# Load IMDB dataset (25k train, 25k test)
(train_data, test_data), info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

# Prepare batches
train_ds = train_data.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
test_ds = test_data.batch(64).prefetch(tf.data.AUTOTUNE)

# Build the TextVectorization layer
max_tokens = 20000
max_length = 500

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt on training text only
train_text = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_text)

# Build model
model = models.Sequential([
    vectorize_layer,
    layers.Embedding(max_tokens, 128, mask_zero=True),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
# Expect ~87-89% accuracy with this simple architecture
```

What's Next?

This simple Embedding + GlobalAveragePooling model is a strong baseline, but it still ignores word order. In the next lesson, we'll see how Transformers use **self-attention** to consider every word in relation to every other word — achieving much higher accuracy.