
Tokenizers Deep Dive

BPE, WordPiece, SentencePiece, special tokens, fast vs slow tokenizers

~45 min

Tokenization is the critical first step in any NLP pipeline. It converts raw text into numerical tokens that models can process. The choice of tokenizer significantly affects model performance, vocabulary efficiency, and handling of rare words.

Why Tokenization Matters

Models don't see text — they see numbers. Tokenization bridges this gap:

"Hello, world!" → [101, 7592, 1010, 2088, 999, 102]

A poor tokenizer means:

  • Out-of-vocabulary (OOV) words → lost information
  • Huge vocabularies → large embedding matrices, slow training
  • Inefficient encoding → wasted context window

Modern tokenizers solve these problems with subword tokenization.

The Tokenization Spectrum

  • Character-level: Maximum coverage, minimum efficiency (long sequences)
  • Word-level: Fast, but can't handle unknown words
  • Subword-level: The sweet spot — handles any word by decomposing rare words into known subword units

All modern LLMs use subword tokenization.

Byte-Pair Encoding (BPE)

BPE is the most common subword algorithm, used by GPT-2, GPT-3, GPT-4, LLaMA, and many others.

How BPE Works

  1. Start with a character-level vocabulary
  2. Count the most frequent pair of adjacent tokens
  3. Merge that pair into a new token
  4. Repeat until the desired vocabulary size is reached

Example

Corpus: "low lower newest widest"

Step 0 (characters): l o w   l o w e r   n e w e s t   w i d e s t
Step 1: Most frequent pair is (e, s) → merge to "es"
Step 2: Most frequent pair is (es, t) → merge to "est"
Step 3: Most frequent pair is (l, o) → merge to "lo"
Step 4: Most frequent pair is (lo, w) → merge to "low"
...
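The merge loop above is small enough to sketch in plain Python. This is a toy illustration of the procedure, not the optimized byte-level implementation real tokenizers use; note that several pairs in this corpus tie on frequency, and with ties the winning pair is arbitrary, so the merge order may differ from the walkthrough.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a list of single-character symbols
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across all words
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol
        merged = best[0] + best[1]
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges, words

merges, words = bpe_merges("low lower newest widest", 4)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

After four merges, "low" has become a single symbol and "newest"/"widest" share the "es"-based suffix units, which is exactly why BPE compresses common substrings so well.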

```python
from transformers import AutoTokenizer

# GPT-2 uses BPE
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['Token', 'ization', 'Ġis', 'Ġfascinating', '!']
# (Ġ marks a leading space in GPT-2's byte-level alphabet)

ids = tokenizer.encode(text)
print(f"IDs: {ids}")

# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```

WordPiece

WordPiece is used by BERT and its variants. It is similar to BPE but uses a different merge criterion: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data.

```python
# BERT uses WordPiece
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['token', '##ization', 'is', 'fascinating', '!']
```

The ## prefix in WordPiece indicates that the token is a continuation of the previous token (not the start of a new word).
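At inference time, WordPiece segments each word greedily, always taking the longest vocabulary match first and falling back to [UNK] when nothing matches. A minimal sketch of that matching loop, using a tiny hand-picked vocabulary (BERT's real vocabulary has ~30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate piece until it appears in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry covers this position
        tokens.append(match)
        start = end
    return tokens

# Tiny hand-picked vocabulary for illustration
vocab = {"token", "##ization", "is", "fascinating"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```

This covers only segmentation; learning which pieces belong in the vocabulary is the likelihood-based training step described above.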

SentencePiece

SentencePiece treats the input as a raw stream of Unicode characters — no language-specific pre-tokenization (such as splitting on whitespace) is needed. It is used by T5, ALBERT, XLNet, and many multilingual models.

```python
# T5 uses SentencePiece
tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['▁Token', 'ization', '▁is', '▁fasci', 'nating', '!']
```

SentencePiece uses the ▁ character (U+2581) to represent spaces, making it language-agnostic and excellent for multilingual models.

Unigram

The Unigram algorithm starts with a large vocabulary and iteratively removes tokens, keeping those that minimize the overall loss. It is available in SentencePiece as an alternative to BPE.

Key difference from BPE: Unigram is probabilistic — it can produce multiple valid tokenizations of the same text and picks the most likely one.
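The "most likely" tokenization can be found with dynamic programming (a Viterbi search over all ways to split the word). A sketch with made-up token log-probabilities — real Unigram models learn these values during training:

```python
import math

def unigram_best_split(word, logp):
    """Viterbi: find the segmentation with the highest total log-probability."""
    n = len(word)
    # best[i] = (score of best segmentation of word[:i], start of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no combination of pieces covers the whole word
    # Backtrack from the end to recover the winning pieces
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Made-up log-probabilities for illustration
logp = {"un": -2.0, "igram": -3.0, "unig": -5.0, "ram": -3.5, "unigram": -7.5}
print(unigram_best_split("unigram", logp))  # ['un', 'igram'] (total -5.0 beats 'unigram' at -7.5)
```

Because every piece has a probability, alternative splits like "unig" + "ram" (total -8.5) are valid too — the algorithm simply ranks them, which is also what enables subword-regularization sampling during training.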

Special Tokens

Every model uses special tokens for structure:

Token        Purpose                                  Used By
[CLS]        Classification token (start of input)    BERT
[SEP]        Separator between segments               BERT
[PAD]        Padding to uniform length                All
[MASK]       Masked token for MLM training            BERT
[UNK]        Unknown/out-of-vocabulary token          All
<s>, </s>    Start/end of sequence                    RoBERTa, GPT
<pad>        Padding                                  T5, GPT
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# See all special tokens
print(f"CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"MASK: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")
print(f"UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

# When you call tokenizer(), special tokens are added automatically
encoded = tokenizer("Hello world")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(f"Tokens with specials: {tokens}")
# ['[CLS]', 'hello', 'world', '[SEP]']
```

The tokenizer.json Format

Fast tokenizers store their full configuration in a tokenizer.json file:

```json
{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, ...},
    "merges": ["a b", "ab c", ...]
  },
  "pre_tokenizer": {"type": "ByteLevel"},
  "post_processor": {
    "type": "RobertaProcessing",
    "sep": ["</s>", 2],
    "cls": ["<s>", 0]
  },
  "decoder": {"type": "ByteLevel"}
}
```
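Because the file is plain JSON, it can be inspected with standard tooling. A quick sketch using the standard library on a hypothetical minimal config in the shape shown above (the ... placeholders replaced with a complete toy vocabulary):

```python
import json

# Hypothetical minimal config mirroring the structure above
config = """
{
  "version": "1.0",
  "model": {"type": "BPE", "vocab": {"a": 0, "b": 1, "ab": 2}, "merges": ["a b"]},
  "pre_tokenizer": {"type": "ByteLevel"},
  "decoder": {"type": "ByteLevel"}
}
"""
tok = json.loads(config)
print(tok["model"]["type"])        # BPE
print(len(tok["model"]["vocab"]))  # 3 vocabulary entries
```

This is handy for auditing a model's vocabulary size or merge rules without loading the tokenizer itself.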
    

Fast vs Slow Tokenizers

```python
from transformers import AutoTokenizer

# Fast tokenizer (Rust-based, default)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(f"Fast: {fast_tokenizer.is_fast}")  # True

# Slow tokenizer (Python-based)
slow_tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast=False,
)
print(f"Fast: {slow_tokenizer.is_fast}")  # False
```

Feature           Fast (Rust)             Slow (Python)
Speed             10-100x faster          Baseline
Offset mapping    Yes                     No
Batch encoding    Parallel                Sequential
Custom training   Yes (tokenizers lib)    Limited
Compatibility     Most models             All models

Offset Mapping (Fast Tokenizers Only)

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, world!"
encoding = tokenizer(text, return_offsets_mapping=True)

for token_id, (start, end) in zip(
    encoding["input_ids"], encoding["offset_mapping"]
):
    token = tokenizer.decode([token_id])
    original = text[start:end]
    print(f"Token: {token:10s} | Original: {original:10s} | Span: ({start}, {end})")
```

This is essential for tasks like NER where you need to map tokens back to original character positions.
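To see why this matters for NER, here is a sketch of mapping token-level predictions back to character spans. The offsets and labels are hard-coded stand-ins for what a fast tokenizer and a tagging model would produce for this sentence; (0, 0) marks special tokens like [CLS] and [SEP]:

```python
text = "Paris is in France"

# Hard-coded stand-ins: one offset pair and one label per token
offsets = [(0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
labels  = ["O",    "B-LOC", "O",   "O",     "B-LOC",  "O"]

# Slice the original text with each entity token's character span
entities = [text[s:e] for (s, e), lab in zip(offsets, labels) if lab != "O"]
print(entities)  # ['Paris', 'France']
```

Without offset mapping you would have to reconstruct these spans from decoded subwords, which breaks whenever tokenization changes casing or whitespace.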

Multilingual Tokenization Pitfalls

Tokenizers trained primarily on English will produce many more tokens for non-English text, wasting context window and reducing performance. A single Chinese character might become 3-4 tokens with a GPT-2 tokenizer. Always use models with tokenizers trained on your target language(s).
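The byte arithmetic behind that claim: each CJK character is 3 bytes in UTF-8, and a byte-level BPE that has learned no merges for a script falls back to roughly one token per byte. A quick check with the standard library:

```python
# UTF-8 length of short English vs. Chinese text
en = "hello"
zh = "你好"  # "hello" in Chinese: only 2 characters

print(len(en), len(en.encode("utf-8")))  # 5 characters, 5 bytes
print(len(zh), len(zh.encode("utf-8")))  # 2 characters, 6 bytes (3 per character)
```

So without learned merges, those 2 characters can cost up to 6 tokens of context, while frequent English words typically collapse to a single merged token.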

Training a Tokenizer from Scratch

Use the tokenizers library to train custom tokenizers:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Create a BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Define the trainer
trainer = BpeTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
)

# Train on files
tokenizer.train(files=["data.txt"], trainer=trainer)

# Or train on a list of strings
tokenizer.train_from_iterator(
    ["line one", "line two", "line three"],
    trainer=trainer,
)

# Test
output = tokenizer.encode("Hello world")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")

# Save
tokenizer.save("my-tokenizer.json")
```