
Tokenizers Deep Dive

BPE, WordPiece, SentencePiece, special tokens, fast vs slow tokenizers

~45 min

Tokenization is the critical first step in any NLP pipeline. It converts raw text into numerical tokens that models can process. The choice of tokenizer significantly affects model performance, vocabulary efficiency, and handling of rare words.

Why Tokenization Matters

Models don't see text — they see numbers. Tokenization bridges this gap:

"Hello, world!" → [101, 7592, 1010, 2088, 999, 102]

A poor tokenizer means:

  • Out-of-vocabulary (OOV) words → lost information
  • Huge vocabularies → large embedding matrices, slow training
  • Inefficient encoding → wasted context window

Modern tokenizers solve these problems with subword tokenization.

The Tokenization Spectrum

  • Character-level: Maximum coverage, minimum efficiency (long sequences)
  • Word-level: Fast, but can't handle unknown words
  • Subword-level: The sweet spot — handles any word by decomposing rare words into known subword units

All modern LLMs use subword tokenization.

Byte-Pair Encoding (BPE)

BPE is the most common subword algorithm, used by GPT-2, GPT-3, GPT-4, LLaMA, and many others.

How BPE Works

  1. Start with a character-level vocabulary
  2. Count the most frequent pair of adjacent tokens
  3. Merge that pair into a new token
  4. Repeat until the desired vocabulary size is reached

Example

Corpus: "low lower newest widest"

Step 0 (characters): l o w   l o w e r   n e w e s t   w i d e s t
Step 1: Most frequent pair is (e, s) → merge to "es"
Step 2: Most frequent pair is (es, t) → merge to "est"
Step 3: Most frequent pair is (l, o) → merge to "lo"
Step 4: Most frequent pair is (lo, w) → merge to "low"
...
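The merge loop above is small enough to sketch in plain Python. This is a toy illustration of the procedure, not the optimized byte-level implementation real tokenizers use; note that several pairs in this corpus tie on frequency, and with ties the winning pair is arbitrary, so the merge order may differ from the walkthrough.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a list of single-character symbols
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across all words
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol
        merged = best[0] + best[1]
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges, words

merges, words = bpe_merges("low lower newest widest", 4)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

After four merges, "low" has become a single symbol and "newest"/"widest" share the "es"-based suffix units, which is exactly why BPE compresses common substrings so well.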

```python
from transformers import AutoTokenizer

# GPT-2 uses BPE
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['Token', 'ization', 'Ġis', 'Ġfascinating', '!']
# (Ġ marks a leading space in GPT-2's byte-level alphabet)

ids = tokenizer.encode(text)
print(f"IDs: {ids}")

# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```

WordPiece

WordPiece is used by BERT and its variants. It is similar to BPE but uses a different merge criterion: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data.

```python
# BERT uses WordPiece
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['token', '##ization', 'is', 'fascinating', '!']
```

The ## prefix in WordPiece indicates that the token is a continuation of the previous token (not the start of a new word).
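At inference time, WordPiece segments each word greedily, always taking the longest vocabulary match first and falling back to [UNK] when nothing matches. A minimal sketch of that matching loop, using a tiny hand-picked vocabulary (BERT's real vocabulary has ~30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate piece until it appears in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry covers this position
        tokens.append(match)
        start = end
    return tokens

# Tiny hand-picked vocabulary for illustration
vocab = {"token", "##ization", "is", "fascinating"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```

This covers only segmentation; learning which pieces belong in the vocabulary is the likelihood-based training step described above.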

SentencePiece

SentencePiece treats the input as a raw stream of Unicode characters — no language-specific pre-tokenization (such as splitting on whitespace) is needed. It is used by T5, ALBERT, XLNet, and many multilingual models.

```python
# T5 uses SentencePiece
tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['▁Token', 'ization', '▁is', '▁fasci', 'nating', '!']
```

SentencePiece uses the ▁ character (U+2581) to represent spaces, making it language-agnostic and excellent for multilingual models.

Unigram

The Unigram algorithm starts with a large vocabulary and iteratively removes tokens, keeping those that minimize the overall loss. It is available in SentencePiece as an alternative to BPE.

Key difference from BPE: Unigram is probabilistic — it can produce multiple valid tokenizations of the same text and picks the most likely one.
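The "most likely" tokenization can be found with dynamic programming (a Viterbi search over all ways to split the word). A sketch with made-up token log-probabilities — real Unigram models learn these values during training:

```python
import math

def unigram_best_split(word, logp):
    """Viterbi: find the segmentation with the highest total log-probability."""
    n = len(word)
    # best[i] = (score of best segmentation of word[:i], start of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no combination of pieces covers the whole word
    # Backtrack from the end to recover the winning pieces
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Made-up log-probabilities for illustration
logp = {"un": -2.0, "igram": -3.0, "unig": -5.0, "ram": -3.5, "unigram": -7.5}
print(unigram_best_split("unigram", logp))  # ['un', 'igram'] (total -5.0 beats 'unigram' at -7.5)
```

Because every piece has a probability, alternative splits like "unig" + "ram" (total -8.5) are valid too — the algorithm simply ranks them, which is also what enables subword-regularization sampling during training.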

Special Tokens

Every model uses special tokens for structure:

Token        Purpose                                  Used By
[CLS]        Classification token (start of input)    BERT
[SEP]        Separator between segments               BERT
[PAD]        Padding to uniform length                All
[MASK]       Masked token for MLM training            BERT
[UNK]        Unknown/out-of-vocabulary token          All
<s>, </s>    Start/end of sequence                    RoBERTa, GPT
<pad>        Padding                                  T5, GPT
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# See all special tokens
print(f"CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"MASK: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")
print(f"UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

# When you call tokenizer(), special tokens are added automatically
encoded = tokenizer("Hello world")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(f"Tokens with specials: {tokens}")
# ['[CLS]', 'hello', 'world', '[SEP]']
```

The tokenizer.json Format

Fast tokenizers store their full configuration in a tokenizer.json file:

```json
{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, ...},
    "merges": ["a b", "ab c", ...]
  },
  "pre_tokenizer": {"type": "ByteLevel"},
  "post_processor": {
    "type": "RobertaProcessing",
    "sep": ["</s>", 2],
    "cls": ["<s>", 0]
  },
  "decoder": {"type": "ByteLevel"}
}
```
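Because the file is plain JSON, it can be inspected with standard tooling. A quick sketch using the standard library on a hypothetical minimal config in the shape shown above (the ... placeholders replaced with a complete toy vocabulary):

```python
import json

# Hypothetical minimal config mirroring the structure above
config = """
{
  "version": "1.0",
  "model": {"type": "BPE", "vocab": {"a": 0, "b": 1, "ab": 2}, "merges": ["a b"]},
  "pre_tokenizer": {"type": "ByteLevel"},
  "decoder": {"type": "ByteLevel"}
}
"""
tok = json.loads(config)
print(tok["model"]["type"])        # BPE
print(len(tok["model"]["vocab"]))  # 3 vocabulary entries
```

This is handy for auditing a model's vocabulary size or merge rules without loading the tokenizer itself.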
    

Fast vs Slow Tokenizers

```python
from transformers import AutoTokenizer

# Fast tokenizer (Rust-based, default)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(f"Fast: {fast_tokenizer.is_fast}")  # True

# Slow tokenizer (Python-based)
slow_tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast=False,
)
print(f"Fast: {slow_tokenizer.is_fast}")  # False
```

Feature           Fast (Rust)             Slow (Python)
Speed             10-100x faster          Baseline
Offset mapping    Yes                     No
Batch encoding    Parallel                Sequential
Custom training   Yes (tokenizers lib)    Limited
Compatibility     Most models             All models

Offset Mapping (Fast Tokenizers Only)

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, world!"
encoding = tokenizer(text, return_offsets_mapping=True)

for token_id, (start, end) in zip(
    encoding["input_ids"], encoding["offset_mapping"]
):
    token = tokenizer.decode([token_id])
    original = text[start:end]
    print(f"Token: {token:10s} | Original: {original:10s} | Span: ({start}, {end})")
```

This is essential for tasks like NER where you need to map tokens back to original character positions.
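To see why this matters for NER, here is a sketch of mapping token-level predictions back to character spans. The offsets and labels are hard-coded stand-ins for what a fast tokenizer and a tagging model would produce for this sentence; (0, 0) marks special tokens like [CLS] and [SEP]:

```python
text = "Paris is in France"

# Hard-coded stand-ins: one offset pair and one label per token
offsets = [(0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
labels  = ["O",    "B-LOC", "O",   "O",     "B-LOC",  "O"]

# Slice the original text with each entity token's character span
entities = [text[s:e] for (s, e), lab in zip(offsets, labels) if lab != "O"]
print(entities)  # ['Paris', 'France']
```

Without offset mapping you would have to reconstruct these spans from decoded subwords, which breaks whenever tokenization changes casing or whitespace.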

Multilingual Tokenization Pitfalls

Tokenizers trained primarily on English will produce many more tokens for non-English text, wasting context window and reducing performance. A single Chinese character might become 3-4 tokens with a GPT-2 tokenizer. Always use models with tokenizers trained on your target language(s).
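The byte arithmetic behind that claim: each CJK character is 3 bytes in UTF-8, and a byte-level BPE that has learned no merges for a script falls back to roughly one token per byte. A quick check with the standard library:

```python
# UTF-8 length of short English vs. Chinese text
en = "hello"
zh = "你好"  # "hello" in Chinese: only 2 characters

print(len(en), len(en.encode("utf-8")))  # 5 characters, 5 bytes
print(len(zh), len(zh.encode("utf-8")))  # 2 characters, 6 bytes (3 per character)
```

So without learned merges, those 2 characters can cost up to 6 tokens of context, while frequent English words typically collapse to a single merged token.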

Training a Tokenizer from Scratch

Use the tokenizers library to train custom tokenizers:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Create a BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Define the trainer
trainer = BpeTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
)

# Train on files
tokenizer.train(files=["data.txt"], trainer=trainer)

# Or train on a list of strings
tokenizer.train_from_iterator(
    ["line one", "line two", "line three"],
    trainer=trainer,
)

# Test
output = tokenizer.encode("Hello world")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")

# Save
tokenizer.save("my-tokenizer.json")
```