# Tokenizers Deep Dive
Tokenization is the critical first step in any NLP pipeline. It converts raw text into numerical tokens that models can process. The choice of tokenizer significantly affects model performance, vocabulary efficiency, and handling of rare words.
## Why Tokenization Matters
Models don't see text — they see numbers. Tokenization bridges this gap:
```
"Hello, world!" → [101, 7592, 1010, 2088, 999, 102]
```
A poor tokenizer means:

- a bloated vocabulary (word-level) or very long sequences (character-level)
- frequent `[UNK]` tokens for rare, misspelled, or novel words
- related word forms ("run", "runs", "running") treated as unrelated tokens

Modern tokenizers solve these problems with subword tokenization.
## The Tokenization Spectrum

Tokenizers sit on a spectrum between characters and whole words:

- Character-level: tiny vocabulary and no unknown tokens, but very long sequences
- Word-level: short sequences, but a huge vocabulary and constant out-of-vocabulary problems
- Subword-level: the middle ground, where common words stay whole and rare words split into meaningful pieces
## Byte-Pair Encoding (BPE)
BPE is the most common subword algorithm, used by GPT-2, GPT-3, GPT-4, LLaMA, and many others.
### How BPE Works
1. Start with a character-level vocabulary
2. Count the most frequent pair of adjacent tokens
3. Merge that pair into a new token
4. Repeat until the desired vocabulary size is reached
### Example

```
Corpus: "low lower newest widest"

Step 0 (characters): l o w   l o w e r   n e w e s t   w i d e s t
Step 1: Most frequent pair is (e, s) → merge to "es"
Step 2: Most frequent pair is (es, t) → merge to "est"
Step 3: Most frequent pair is (l, o) → merge to "lo"
Step 4: Most frequent pair is (lo, w) → merge to "low"
...
```
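The merge loop above can be sketched in plain Python. This is a toy character-level version (real BPE implementations such as GPT-2's work on bytes), and ties between equally frequent pairs are broken arbitrarily, so the merge order can differ from the walkthrough:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word starts as a tuple of single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "newest", "widest"], 4))
# → [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

Here ties resolve in first-seen order, so (l, o) is merged before (e, s); a production tokenizer would make the same kind of arbitrary choice.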
```python
from transformers import AutoTokenizer

# GPT-2 uses BPE
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['Token', 'ization', 'Ġis', 'Ġfascinating', '!']
# (Ġ marks a leading space in GPT-2's byte-level BPE)

ids = tokenizer.encode(text)
print(f"IDs: {ids}")

# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```
## WordPiece
WordPiece is used by BERT and its variants. It's similar to BPE but uses a different merge criterion based on likelihood rather than frequency.
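The difference in merge criterion can be illustrated with made-up counts: BPE merges the most frequent pair, while WordPiece scores a pair as count(ab) / (count(a) * count(b)), favoring pairs whose parts rarely appear apart:

```python
# Hypothetical symbol and pair counts from a training corpus
unit_counts = {"a": 100, "b": 100, "x": 10, "y": 10}
pair_counts = {("a", "b"): 50, ("x", "y"): 9}

# BPE criterion: raw pair frequency
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece criterion: count(ab) / (count(a) * count(b)), i.e. the merge
# that most increases the likelihood of the training data
wordpiece_choice = max(
    pair_counts,
    key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]),
)

print(bpe_choice)        # ('a', 'b'): 50 occurrences beat 9
print(wordpiece_choice)  # ('x', 'y'): 9/100 = 0.09 beats 50/10,000 = 0.005
```

So even though "ab" is far more frequent, WordPiece prefers "xy" because "x" and "y" almost always occur together.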
```python
# BERT uses WordPiece
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['token', '##ization', 'is', 'fascinating', '!']
```

The ## prefix in WordPiece indicates that the token is a continuation of the previous token (not a new word).
## SentencePiece
SentencePiece treats the input as a raw stream of text, so no pre-tokenization (such as whitespace splitting) is needed. Used by T5, ALBERT, XLNet, and many multilingual models.
```python
# T5 uses SentencePiece
tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['▁Token', 'ization', '▁is', '▁fasci', 'nating', '!']
# Note: ▁ (U+2581) represents a space
```

SentencePiece uses the ▁ character to represent spaces, making it language-agnostic and excellent for multilingual models.
## Unigram
The Unigram algorithm starts with a large vocabulary and iteratively removes tokens, keeping those that minimize the overall loss. Used in SentencePiece as an alternative to BPE.
Key difference from BPE: Unigram is probabilistic — it can produce multiple valid tokenizations and picks the most likely one.
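A sketch of that idea with an invented vocabulary and probabilities: a Viterbi-style dynamic program scores every valid segmentation by the product of its token probabilities and keeps the most likely one:

```python
import math

# Invented unigram vocabulary: token -> probability
vocab = {
    "un": 0.1, "i": 0.05, "gram": 0.08, "unigram": 0.02,
    "u": 0.01, "n": 0.01, "g": 0.01, "ram": 0.03,
}

def best_segmentation(text):
    # best[i] = (log-probability, tokens) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][1] is not None:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

# "un" + "i" + "gram" is also a valid segmentation, but less likely:
print(best_segmentation("unigram"))  # → ['unigram']
print(best_segmentation("ungram"))   # → ['un', 'gram']
```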
## Special Tokens
Every model uses special tokens for structure:
| Token | Purpose | Used By |
|---|---|---|
| `[CLS]` | Classification token (start of input) | BERT |
| `[SEP]` | Separator between segments | BERT |
| `[PAD]` | Padding to uniform length | All |
| `[MASK]` | Masked token for MLM training | BERT |
| `[UNK]` | Unknown/out-of-vocabulary token | All |
| `<s>`, `</s>` | Start/end of sequence | RoBERTa, GPT |
| `<pad>` | Padding | T5, GPT |
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# See all special tokens
print(f"CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"MASK: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")
print(f"UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

# When you call tokenizer(), special tokens are added automatically
encoded = tokenizer("Hello world")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(f"Tokens with specials: {tokens}")
# ['[CLS]', 'hello', 'world', '[SEP]']
```
## The tokenizer.json Format
Fast tokenizers store their full configuration in a tokenizer.json file:
```json
{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, ...},
    "merges": ["a b", "ab c", ...]
  },
  "pre_tokenizer": {"type": "ByteLevel"},
  "post_processor": {
    "type": "RobertaProcessing",
    "sep": ["</s>", 2],
    "cls": ["<s>", 0]
  },
  "decoder": {"type": "ByteLevel"}
}
```
## Fast vs Slow Tokenizers
```python
from transformers import AutoTokenizer

# Fast tokenizer (Rust-based, default)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(f"Fast: {fast_tokenizer.is_fast}")  # True

# Slow tokenizer (Python-based)
slow_tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", use_fast=False
)
print(f"Fast: {slow_tokenizer.is_fast}")  # False
```
| Feature | Fast (Rust) | Slow (Python) |
|---|---|---|
| Speed | 10-100x faster | Baseline |
| Offset mapping | Yes | No |
| Batch encoding | Parallel | Sequential |
| Custom training | Yes (tokenizers lib) | Limited |
| Compatibility | Most models | All models |
## Offset Mapping (Fast Tokenizers Only)
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, world!"
encoding = tokenizer(text, return_offsets_mapping=True)

for token_id, (start, end) in zip(
    encoding["input_ids"], encoding["offset_mapping"]
):
    token = tokenizer.decode([token_id])
    original = text[start:end]
    print(f"Token: {token:10s} | Original: {original:10s} | Span: ({start}, {end})")
```
This is essential for tasks like NER where you need to map tokens back to original character positions.
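The mapping itself is a simple overlap test. A sketch with hand-written offsets (illustrative, not produced by a real tokenizer; special tokens get the empty span (0, 0)):

```python
def tokens_covering(offsets, char_start, char_end):
    """Indices of tokens whose character span overlaps [char_start, char_end)."""
    return [
        i for i, (start, end) in enumerate(offsets)
        if start < char_end and end > char_start
    ]

# Illustrative offsets for "Hello, world!" as [CLS] hello , world ! [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]

# Character span of the entity "world" is (7, 12)
print(tokens_covering(offsets, 7, 12))  # → [3]
```

The half-open overlap test also handles entities that span multiple subword tokens.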
## Multilingual Tokenization Pitfalls

- A tokenizer trained mostly on English splits other languages into far more tokens, inflating sequence length and cost
- Scripts that were rare in the training data can fragment into bytes or fall back to `[UNK]`
- Whitespace-based pre-tokenization breaks for languages written without spaces (e.g., Chinese, Japanese); SentencePiece sidesteps this by not requiring pre-tokenization
## Training a Tokenizer from Scratch
Use the tokenizers library to train custom tokenizers:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Create a BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Define the trainer
trainer = BpeTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
)

# Train on files
tokenizer.train(files=["data.txt"], trainer=trainer)

# Or train on an iterable of strings
tokenizer.train_from_iterator(
    ["line one", "line two", "line three"],
    trainer=trainer,
)

# Test
output = tokenizer.encode("Hello world")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")

# Save
tokenizer.save("my-tokenizer.json")
```