Chapter 4. How Text Becomes Numbers

Every language model operates on numbers, not text. Before a model can process a single word of your prompt, that text must be converted into a sequence of integers. This conversion process is called tokenization, and it determines everything from how well the model understands your input to how much your API call costs. Get tokenization wrong, and the model sees garbled nonsense. Get it right, and you unlock the full power of the architecture we’ll build up in later chapters.


Why Not Just Use Characters?

The simplest approach to turning text into numbers would be to assign each character a number. The letter “a” becomes 97, “b” becomes 98, and so on (these are ASCII codes). The sentence “Hello world” would become a sequence of 11 numbers, one per character.
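As a quick illustration, Python's built-in `ord` returns a character's code point, which for plain ASCII text is the same as its ASCII code. This is only a sketch of the character-level idea, and the variable names are illustrative:

```python
# Character-level "tokenization": one integer per character.
text = "Hello world"
char_ids = [ord(ch) for ch in text]

print(char_ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(char_ids))  # 11
```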

This approach has a serious problem: sequences become extremely long. A typical English word is 5 characters. A 1,000-word document would produce a sequence of roughly 5,000 characters. A 10,000-word document would produce 50,000 characters. Language models process sequences using attention (Chapter 7), which has a computational cost that grows with the square of the sequence length. Doubling the sequence length quadruples the compute. Character-level tokenization would make sequences so long that processing them would be prohibitively slow and expensive.
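The quadratic growth is easy to check with a few lines of arithmetic. The helper below is a hypothetical illustration that ignores constant factors and assumes cost scales purely with the square of the sequence length:

```python
def relative_attention_cost(seq_len, baseline_len):
    """Cost relative to a baseline, assuming cost scales with length squared."""
    return (seq_len / baseline_len) ** 2

# Doubling the sequence length quadruples the cost.
print(relative_attention_cost(2_000, 1_000))  # 4.0

# A 1,000-word document: about 1,000 word-sized units vs. about 5,000 characters.
print(relative_attention_cost(5_000, 1_000))  # 25.0
```

Under that assumption, moving from word-sized units to characters makes the attention computation for the same document roughly 25 times more expensive.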

There’s a second problem: individual characters carry almost no meaning. The letter “t” by itself tells the model nothing. The model would need to learn, from scratch, that “t” followed by “h” followed by “e” means “the.” That’s a lot of work for the model to do at every layer, for every word, in every sentence. It wastes the model’s capacity on reconstructing basic words instead of understanding meaning.

Why Not Just Use Whole Words?

The opposite extreme is to assign each word a unique number. “Hello” becomes 1, “world” becomes 2, “the” becomes 3, and so on. This keeps sequences short, but it creates a different problem: the vocabulary becomes enormous and can never be complete.

The English language alone has over 170,000 words in current use (according to the Oxford English Dictionary), plus technical terms, proper nouns, slang, misspellings, and words from other languages. A word-level vocabulary would need hundreds of thousands of entries just for English. Add code (variable names, function names), mathematical notation, URLs, and the hundreds of other languages that modern models support, and you’d need millions of entries.

Every entry in the vocabulary requires a corresponding row in the model’s embedding table (Chapter 5). If the vocabulary has 1 million entries and the embedding dimension is 5,120 (as in LLaMA 4 Maverick), the embedding table alone would contain 5.12 billion numbers. That’s an enormous number of parameters dedicated just to the lookup table.

Worse, word-level tokenization can’t handle words the model has never seen. If someone types “ChatGPT” and that exact string isn’t in the vocabulary, the model has no way to represent it. This is called the out-of-vocabulary (OOV) problem, and it’s a dealbreaker for any practical system.
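A toy word-level encoder makes the failure mode concrete. The vocabulary and the `<unk>` placeholder below are assumptions for illustration; mapping unknown words to one catch-all ID is a common workaround, but it discards the word entirely:

```python
# A toy word-level vocabulary (hypothetical IDs, for illustration only).
word_vocab = {"<unk>": 0, "Hello": 1, "world": 2, "the": 3}

def encode_words(text, vocab):
    """Map each whitespace-separated word to its ID, or to <unk> if missing."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(encode_words("Hello world", word_vocab))    # [1, 2]
print(encode_words("Hello ChatGPT", word_vocab))  # [1, 0]  <- "ChatGPT" is lost
```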

The Sweet Spot: Subword Tokenization

Modern language models use a middle ground called subword tokenization. Instead of splitting text into characters or whole words, they split it into pieces that are somewhere in between: common words stay whole, while rare words get broken into smaller, reusable pieces.

For example, the word “tokenization” might be split into [“token”, “ization”]. The word “unhappiness” might become [“un”, “happiness”]. A rare technical term like “backpropagation” might become [“back”, “prop”, “agation”]. Common words like “the”, “is”, and “hello” stay as single tokens.

This approach solves all three problems:

  1. Sequences stay reasonably short. Common words are single tokens, so a typical English sentence of 15 words might produce 20 tokens instead of 75 characters.
  2. The vocabulary stays manageable. A vocabulary of 100,000 to 200,000 subword tokens can represent any text in any language, because rare words are composed from common pieces.
  3. No out-of-vocabulary problem. Even a word the model has never seen can be broken into known subword pieces. In the worst case, it falls back to individual bytes, which can represent any character in any language.

The dominant algorithm for building these subword vocabularies is called Byte Pair Encoding (BPE).


Byte Pair Encoding: The Algorithm Step by Step

Byte Pair Encoding was originally invented by Philip Gage in 1994 as a data compression algorithm. In 2016, Rico Sennrich, Barry Haddow, and Alexandra Birch adapted it for neural machine translation, and it has since become the standard tokenization method for large language models.

Source: Gage, “A New Algorithm for Data Compression,” C Users Journal, 1994; Sennrich, Haddow, and Birch, “Neural Machine Translation of Rare Words with Subword Units,” ACL 2016.

The core idea is simple: start with individual characters (or bytes), then repeatedly merge the most frequent pair of adjacent tokens into a new token, until you reach your desired vocabulary size.

Let’s walk through BPE on a small corpus to see exactly how it works.

The Training Corpus

Suppose our entire training corpus consists of these words, with their frequencies:

"low"     : 5 times
"lower"   : 2 times
"newest"  : 6 times
"widest"  : 3 times

Step 0: Start with Characters

First, we split every word into individual characters and add a special end-of-word marker (often written as </w> or _). The end-of-word marker is important because it lets the tokenizer distinguish between “est” at the end of a word (like “newest”) and “est” at the beginning (like “estimate”).

"low"     → ['l', 'o', 'w', '</w>']           × 5
"lower"   → ['l', 'o', 'w', 'e', 'r', '</w>'] × 2
"newest"  → ['n', 'e', 'w', 'e', 's', 't', '</w>'] × 6
"widest"  → ['w', 'i', 'd', 'e', 's', 't', '</w>'] × 3

Our initial vocabulary is just the set of individual characters plus the end-of-word marker: {l, o, w, e, r, n, s, t, i, d, </w>}.

Step 1: Count All Adjacent Pairs

We count how many times each pair of adjacent tokens appears across the entire corpus:

('e', 's')  → 6 (from "newest") + 3 (from "widest") = 9
('s', 't')  → 6 + 3 = 9
('t', '</w>') → 6 + 3 = 9
('l', 'o')  → 5 + 2 = 7
('o', 'w')  → 5 + 2 = 7
('n', 'e')  → 6
('e', 'w')  → 6
('w', 'e')  → 2 (from "lower") + 6 (from "newest") = 8
('w', '</w>') → 5 (from "low")
('e', 'r')  → 2
('r', '</w>') → 2
('w', 'i')  → 3
('i', 'd')  → 3
('d', 'e')  → 3

Step 2: Merge the Most Frequent Pair

The most frequent pairs are ('e', 's'), ('s', 't'), and ('t', '</w>'), all appearing 9 times. We pick one (typically the first encountered). Let’s merge ('e', 's') into a new token 'es'.

Now our corpus becomes:

"low"     → ['l', 'o', 'w', '</w>']           × 5
"lower"   → ['l', 'o', 'w', 'e', 'r', '</w>'] × 2
"newest"  → ['n', 'e', 'w', 'es', 't', '</w>'] × 6
"widest"  → ['w', 'i', 'd', 'es', 't', '</w>'] × 3

Our vocabulary now includes: {l, o, w, e, r, n, s, t, i, d, </w>, es}.

Step 3: Repeat

We count pairs again with the updated corpus and merge the next most frequent pair. Now ('es', 't') appears 9 times (6 from “newest” + 3 from “widest”), so we merge it into 'est':

"low"     → ['l', 'o', 'w', '</w>']           × 5
"lower"   → ['l', 'o', 'w', 'e', 'r', '</w>'] × 2
"newest"  → ['n', 'e', 'w', 'est', '</w>']     × 6
"widest"  → ['w', 'i', 'd', 'est', '</w>']     × 3

Vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>, es, est}.

Next, ('est', '</w>') appears 9 times, so we merge it into 'est</w>':

"low"     → ['l', 'o', 'w', '</w>']           × 5
"lower"   → ['l', 'o', 'w', 'e', 'r', '</w>'] × 2
"newest"  → ['n', 'e', 'w', 'est</w>']         × 6
"widest"  → ['w', 'i', 'd', 'est</w>']         × 3

Next, ('l', 'o') appears 7 times, so we merge it into 'lo':

"low"     → ['lo', 'w', '</w>']               × 5
"lower"   → ['lo', 'w', 'e', 'r', '</w>']     × 2
"newest"  → ['n', 'e', 'w', 'est</w>']         × 6
"widest"  → ['w', 'i', 'd', 'est</w>']         × 3

Then ('lo', 'w') appears 7 times, so we merge it into 'low':

"low"     → ['low', '</w>']                   × 5
"lower"   → ['low', 'e', 'r', '</w>']         × 2
"newest"  → ['n', 'e', 'w', 'est</w>']         × 6
"widest"  → ['w', 'i', 'd', 'est</w>']         × 3

And so on. Each merge creates a new token and adds it to the vocabulary. We keep merging until we reach our target vocabulary size.

The Key Insight

After enough merges, the vocabulary contains a mix of:

  • Individual characters (the starting point)
  • Common subwords like “est”, “ing”, “tion”, “un”
  • Whole common words like “the”, “low”, “new”

This is exactly the subword vocabulary we wanted. Common patterns get their own tokens (efficient), while rare words can always be decomposed into known pieces (no OOV problem). The algorithm automatically discovers the right granularity based on what appears frequently in the training data.


Byte-Level BPE: What Modern Models Actually Use

The BPE example above started with characters. Modern language models use a variant called byte-level BPE, which starts with individual bytes instead of characters.

Why bytes? Because bytes are universal. Every piece of text, in every language, in every encoding, is ultimately a sequence of bytes (numbers from 0 to 255). By starting with bytes, the tokenizer can handle any input: English, Chinese, Arabic, emoji, code, mathematical symbols, or even binary data. There are only 256 possible starting tokens (one per byte value), which is a small, fixed base vocabulary.

A single English character like “A” is one byte (65 in UTF-8 encoding). But characters from other scripts may use multiple bytes. The Chinese character “中” is three bytes in UTF-8 (228, 184, 173). The emoji “😀” is four bytes (240, 159, 152, 128). Byte-level BPE handles all of these uniformly: it starts with the raw bytes and merges frequent pairs, regardless of what language or script they come from.
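These byte values can be checked directly with Python's `str.encode`. The snippet is a small sketch, and the dictionary name is just illustrative:

```python
# UTF-8 byte values for characters from different scripts.
utf8_bytes = {ch: list(ch.encode("utf-8")) for ch in ["A", "中", "😀"]}

for ch, raw in utf8_bytes.items():
    print(f"{ch!r}: {len(raw)} byte(s) -> {raw}")
# 'A': 1 byte(s) -> [65]
# '中': 3 byte(s) -> [228, 184, 173]
# '😀': 4 byte(s) -> [240, 159, 152, 128]
```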

This is the approach used by GPT-2, GPT-4o, LLaMA 3, LLaMA 4, and most other modern models. The BPE merges are learned from the training corpus, so languages and scripts that appear more frequently in the training data get more efficient tokenization (fewer tokens per word), while less common languages may require more tokens for the same amount of text. We’ll see the real-world impact of this later in the chapter.


The Vocabulary: A Simple Lookup Table

The result of BPE training is a vocabulary: a fixed list of tokens, each assigned a unique integer ID. This vocabulary is stored as a simple lookup table, and it never changes after training. Every time the model processes text, it uses this same table to convert text to token IDs and back.

Here’s what a small slice of a real vocabulary looks like (from GPT-2’s 50,257-token vocabulary):

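| Token ID | Token |
|---|---|
| 0 | ! |
| 1 | " |
| 262 | the (with a leading space) |
| 286 | of (with a leading space) |
| 287 | in (with a leading space) |
| 15496 | Hello |
| 50256 | <|endoftext|> |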

Source: GPT-2 vocabulary, 50,257 tokens. OpenAI, 2019.

The vocabulary includes individual bytes, common subwords, whole words, and special tokens (like the end-of-text marker). The order reflects the BPE merge history: tokens with lower IDs were created earlier in the merge process (more common), while tokens with higher IDs were created later (less common or composed of more pieces).
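The lookup-table idea can be sketched with two dictionaries, one for each direction. The tokens and IDs here are made up for illustration and do not come from any real vocabulary:

```python
# A toy vocabulary: token string -> integer ID (hypothetical IDs).
token_to_id = {"low": 0, "er": 1, "est": 2, "new": 3}
id_to_token = {i: t for t, i in token_to_id.items()}

tokens = ["low", "est"]
ids = [token_to_id[t] for t in tokens]          # text pieces -> IDs
decoded = "".join(id_to_token[i] for i in ids)  # IDs -> text

print(ids)      # [0, 2]
print(decoded)  # lowest
```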

Vocabulary Sizes Across Real Models

Different models use different vocabulary sizes. Here’s how they compare as of March 2026:

| Model | Vocabulary Size | Tokenizer Type | Source |
|---|---|---|---|
| GPT-2 | 50,257 | BPE (byte-level) | OpenAI, 2019 |
| GPT-4 | 100,256 | BPE (cl100k_base) | OpenAI, 2023 |
| GPT-4o | ~200,000 | BPE (o200k_base) | OpenAI, May 2024 |
| GPT-5 | ~200,000 | BPE (o200k_base Harmony) | OpenAI, August 2025 |
| LLaMA 3 | 128,256 | BPE (byte-level, via tiktoken) | Meta, 2024 |
| LLaMA 4 Maverick | 202,048 | BPE (byte-level) | Meta, April 2025 |
| DeepSeek-V3 | 128,000 | BPE | DeepSeek, December 2024 |
| Mistral 7B | 32,000 | SentencePiece BPE | Mistral AI, 2023 |
| Mistral Small 3.1 | 131,072 | Tekken (BPE via tiktoken) | Mistral AI, March 2025 |
| Qwen 3 | 151,936 | BPE | Alibaba, April 2025 |

Sources: GPT-2 from OpenAI (2019); GPT-4 cl100k_base from tiktoken library; GPT-4o o200k_base from OpenAI (May 2024); GPT-5 o200k_base Harmony variant from OpenAI and tiktoken documentation (August 2025); LLaMA 3 from Meta (2024); LLaMA 4 Maverick from Ollama model registry and Meta AI (April 2025); DeepSeek-V3 from DeepSeek technical report (December 2024); Mistral 7B from Mistral AI (2023); Mistral Small 3.1 from Hugging Face model card (Tekken tokenizer, 131K vocabulary); Qwen 3 from Hugging Face transformers Qwen3Config (vocab_size: 151,936).

The trend is clear: vocabulary sizes have grown significantly over time. GPT-2 used about 50,000 tokens in 2019. By 2024-2025, most frontier models use 100,000 to 200,000 tokens. The Qwen 3.5 series (released late 2025) pushed even further to 250,000 tokens. Larger vocabularies mean more common words and subwords get their own dedicated token, which makes tokenization more efficient (fewer tokens per sentence) at the cost of a larger embedding table.

Source: Qwen 3.5 vocabulary size from qwen-ai.com (250,000 tokens, described as 69% larger than Qwen 3).

Why Vocabulary Size Matters

The vocabulary size affects three things directly:

  1. Tokenization efficiency. A larger vocabulary means more words and subwords have their own token, so text gets compressed into fewer tokens. This means more text fits into the model’s context window and each API call processes more content per token.

  2. Embedding table size. The embedding table has one row per vocabulary entry, with each row being the model’s embedding dimension in length. For LLaMA 4 Maverick with a vocabulary of 202,048 and an embedding dimension of 5,120, the embedding table contains 202,048 × 5,120 ≈ 1.03 billion numbers. In 16-bit floating point, that’s about 2 GB just for the embedding table. We’ll cover this in detail in Chapter 5.

  3. Output layer size. At the end of the model, there’s a matrix that converts the final hidden state into a score for every token in the vocabulary (one score per vocabulary entry). A larger vocabulary means more scores to compute at every generation step, which adds to inference cost.

There’s a tradeoff: larger vocabularies improve tokenization efficiency but increase the model’s parameter count and the cost of computing the final softmax (Chapter 2). The vocabulary sizes in the table above represent each team’s judgment about where that tradeoff is optimal.
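The embedding-table arithmetic from the list above can be reproduced in a few lines. The helper name and its default of 2 bytes per parameter (16-bit floats) are assumptions made for this sketch:

```python
def embedding_table_size(vocab_size, embed_dim, bytes_per_param=2):
    """Return (parameter count, size in GB) for an embedding table."""
    params = vocab_size * embed_dim
    return params, params * bytes_per_param / 1e9

params, gb = embedding_table_size(202_048, 5_120)
print(f"{params:,} parameters, about {gb:.2f} GB in 16-bit floats")
# 1,034,485,760 parameters, about 2.07 GB in 16-bit floats
```

Changing `vocab_size` while holding `embed_dim` fixed shows how directly the vocabulary choice drives this part of the parameter budget.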


SentencePiece, tiktoken, and Tekken: The Tokenizer Libraries

BPE is the algorithm. But to actually use it, you need a software library that implements the algorithm efficiently. Three libraries dominate the landscape:

SentencePiece

SentencePiece is an open-source tokenization library developed by Google. It treats the input as a raw stream of Unicode characters (no pre-tokenization step like splitting on spaces), which makes it language-agnostic. SentencePiece supports both BPE and another algorithm called the Unigram Language Model.

SentencePiece was used by the original LLaMA (vocabulary of 32,000), Mistral 7B, and many other models. It’s particularly popular for multilingual models because it doesn’t assume any particular language’s word boundaries.

Source: Kudo and Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,” EMNLP 2018.

tiktoken

tiktoken is OpenAI’s tokenization library, written in Rust for speed with Python bindings. It implements byte-level BPE and is used by all OpenAI models (GPT-2, GPT-4, GPT-4o, GPT-5). tiktoken is also used by LLaMA 3, LLaMA 4, and Mistral’s newer Tekken tokenizer.

tiktoken defines several encoding schemes:

  • r50k_base: Used by GPT-3 (50,000 tokens)
  • cl100k_base: Used by GPT-4 and GPT-3.5-Turbo (~100,000 tokens)
  • o200k_base: Used by GPT-4o, o1, o3-mini, and other recent models (~200,000 tokens)
  • o200k_base Harmony: A variant of o200k_base used by GPT-5, with additional tokens for tool use and structured output

Source: tiktoken GitHub repository (github.com/openai/tiktoken); OpenAI documentation; Modal.com analysis of o200k Harmony.

Tekken

Tekken is Mistral AI’s tokenizer, introduced with their newer models (Mistral Small 3.1, Ministral). It uses BPE via tiktoken under the hood but was trained on a corpus of over 100 languages, making it more efficient for multilingual text than earlier Mistral tokenizers. Tekken uses a vocabulary of 131,072 tokens (approximately 130,000 regular tokens plus about 1,000 control tokens).

Source: Mistral AI documentation; Hugging Face model cards for Mistral Small 3.1; Restack.io Tekken tokenizer overview.

The choice of tokenizer library matters less than the vocabulary it produces. Two different libraries implementing BPE with the same training data and vocabulary size would produce similar (though not identical) results. What matters most is the vocabulary itself: which tokens exist, how they were learned, and how efficiently they represent the text your model will process.


How Tokenization Works in Practice

Let’s see tokenization in action using real tools. We’ll use OpenAI’s tiktoken library to tokenize text with the same tokenizer used by GPT-4o.

Installing and Using tiktoken

# Install first (from a shell): pip install tiktoken
import tiktoken

# Load the tokenizer used by GPT-4o
enc = tiktoken.get_encoding("o200k_base")

text = "Hello, world! How are you?"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

# Decode back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")

Output:

Text: Hello, world! How are you?
Tokens: [13225, 11, 2375, 0, 3253, 553, 481, 30]
Number of tokens: 8
Decoded: Hello, world! How are you?

The sentence “Hello, world! How are you?” becomes 8 tokens. Each token is an integer that maps to an entry in the vocabulary. Let’s see what each token represents:

for token_id in tokens:
    token_bytes = enc.decode_single_token_bytes(token_id)
    print(f"  ID {token_id:>6d} → {token_bytes}")

Output:

  ID  13225 → b'Hello'
  ID     11 → b','
  ID   2375 → b' world'
  ID      0 → b'!'
  ID   3253 → b' How'
  ID    553 → b' are'
  ID    481 → b' you'
  ID     30 → b'?'

Notice several things:

  • Common words like “Hello”, “world”, “How”, “are”, and “you” are each a single token.
  • Punctuation marks like “,” “!” and “?” are their own tokens.
  • Spaces are attached to the beginning of the following word, not treated as separate tokens. " world" (with a leading space) is one token, not two. This is a design choice in byte-level BPE that avoids wasting tokens on whitespace.

Tokenizing Different Types of Text

Let’s see how the tokenizer handles different kinds of input:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

examples = [
    "The capital of France is Paris.",
    "def fibonacci(n):\n    if n <= 1:\n        return n",
    "E = mc²",
    "こんにちは世界",          # "Hello world" in Japanese
    "مرحبا بالعالم",          # "Hello world" in Arabic
    "Hello 😀 World 🌍",
]

for text in examples:
    tokens = enc.encode(text)
    print(f"Text: {text}")
    print(f"  Tokens: {len(tokens)}")
    print()

Running this produces results like:

Text: The capital of France is Paris.
  Tokens: 7

Text: def fibonacci(n):
    if n <= 1:
        return n
  Tokens: 14

Text: E = mc²
  Tokens: 4

Text: こんにちは世界
  Tokens: 2

Text: مرحبا بالعالم
  Tokens: 4

Text: Hello 😀 World 🌍
  Tokens: 5

A simple English sentence takes 7 tokens. Python code takes more tokens because of special characters, indentation, and syntax. Japanese text is remarkably efficient here: “こんにちは世界” (meaning “Hello world”) compresses to just 2 tokens because the o200k_base tokenizer was trained on multilingual data and has dedicated tokens for common Japanese character sequences. Arabic text takes 4 tokens for a similar greeting. The emoji example shows that emoji are handled as tokens too, though they may take 1-2 tokens each depending on the specific emoji.

(Note: Exact token counts may vary slightly across tiktoken versions. The numbers above were produced with tiktoken 0.12.0 using the o200k_base encoding.)


Edge Cases: Where Tokenization Gets Tricky

Tokenization seems straightforward for simple English text, but real-world input is messy. Here are the edge cases that matter.

Multilingual Text

Tokenizers trained primarily on English text are less efficient for other languages. The same meaning expressed in different languages can produce very different token counts. This happens because the BPE merges are learned from the training corpus, and languages that appear more frequently get more dedicated tokens.

Research by Petrov et al. (2023) found that tokenizers can produce dramatically different token counts across languages for semantically equivalent text. According to their NeurIPS 2023 paper, the same text translated into different languages can have tokenization lengths differing by up to 15 times, and these disparities persist even for tokenizers intentionally trained for multilingual support. Latin-script languages like English, Spanish, and French tend to be tokenized efficiently, while languages using non-Latin scripts (Arabic, Hindi, Thai, Chinese) often require significantly more tokens for the same semantic content, with typical ratios of 2 to 5 times more tokens and extreme cases reaching even higher. A related study by Ahia et al. (2023) confirmed that some languages require up to 5 times as many tokens as English, and that this disparity is not solely caused by data imbalance but is also rooted in inherent properties of languages and their writing scripts.

Sources: Petrov et al., “Language Model Tokenizers Introduce Unfairness Between Languages,” arXiv:2305.15425, NeurIPS 2023; Ahia et al., “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models,” arXiv:2305.13707, EMNLP 2023.

This has a direct financial impact. Since API providers charge per token, users writing in less-efficiently-tokenized languages pay more for the same amount of content. At GPT-4o’s pricing of $2.50 per million input tokens, a document that takes 1,000 tokens in English might take 3,000 to 5,000 tokens in a less-efficiently-tokenized language, multiplying the cost by the same factor.

Source: GPT-4o pricing from OpenAI (as of March 2026): $2.50 per million input tokens, $10.00 per million output tokens.

Newer tokenizers have improved on this. OpenAI’s o200k_base tokenizer (used by GPT-4o and GPT-5) was specifically trained on a more multilingual corpus than its predecessor cl100k_base, reducing the efficiency gap for non-English languages. Similarly, Mistral’s Tekken tokenizer was trained on over 100 languages to improve multilingual compression.

Code

Programming languages present unique challenges. Code contains:

  • Indentation (spaces or tabs that carry syntactic meaning in Python)
  • Special characters (brackets, semicolons, operators)
  • Variable names (which can be anything: myVariableName, x, calculate_total_revenue)
  • String literals (which contain natural language text)
  • Numbers (which may be tokenized digit by digit or as multi-digit chunks)

Most modern tokenizers handle code reasonably well because their training corpora include large amounts of source code. The o200k_base tokenizer, for example, has dedicated tokens for common code patterns like def , return , import , and common indentation patterns.

Emoji

Emoji are encoded as multi-byte UTF-8 sequences. A simple emoji like “😀” is 4 bytes in UTF-8. The tokenizer may represent it as a single token (if it appears frequently enough in the training data to have been merged into one) or as multiple tokens (if it’s rare).

Compound emoji are even more complex. The “family” emoji “👨‍👩‍👧‍👦” is actually composed of four individual emoji joined by invisible zero-width joiner (ZWJ) characters, totaling 25 bytes. This can easily become 5 or more tokens.
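The byte count for the family emoji can be verified by building the sequence from its code points. This is a small sketch that uses escape sequences so the invisible joiners are explicit:

```python
# Man, woman, girl, boy joined by zero-width joiners (U+200D).
ZWJ = "\u200d"
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

print(len(family))                  # 7 code points: 4 emoji + 3 joiners
print(len(family.encode("utf-8")))  # 25 bytes: 4 * 4 + 3 * 3
```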

Numbers and Math

Numbers are a well-known weakness of tokenization. The number “123456” might be tokenized as [“123”, “456”] or [“12”, “345”, “6”] or even [“1”, “2”, “3”, “4”, “5”, “6”], depending on the tokenizer. This inconsistency makes arithmetic harder for language models, because the model sees different token boundaries for different numbers.

This is one reason why language models struggle with precise arithmetic. When the model sees the tokens [“123”, “456”], it doesn’t inherently know that this represents the number 123,456. It has to learn the relationship between token sequences and numerical values, which is much harder than if every digit were its own token (but that would make sequences very long for large numbers).

Whitespace and Formatting

Different tokenizers handle whitespace differently. Some tokenizers:

  • Attach leading spaces to the following token (" world" is one token)
  • Treat spaces as separate tokens
  • Collapse multiple spaces into special tokens
  • Have dedicated tokens for newlines, tabs, and indentation patterns

The GPT-family tokenizers (using tiktoken) attach leading spaces to the following word. This means “Hello world” tokenizes differently from “Helloworld”: the first produces [“Hello”, " world"] (2 tokens), while the second might produce [“Hello”, “world”] or [“Hellow”, “orld”] depending on the vocabulary.


The Tokenization-Cost Connection

Understanding tokenization is essential for understanding API costs. Every major LLM provider charges by the token, not by the word or character. This means the efficiency of the tokenizer directly affects how much you pay.

Let’s make this concrete. Suppose you’re building an application that processes customer support emails. Each email averages 200 words. In English, with the o200k_base tokenizer, 200 words typically produce about 250-300 tokens (roughly 1.3-1.5 tokens per word). At GPT-4o’s pricing of $2.50 per million input tokens:

Cost per email = 275 tokens × ($2.50 / 1,000,000) = $0.000688
Cost for 100,000 emails = $68.75

That’s about $69 for 100,000 emails. But if those emails are in a language that tokenizes less efficiently, say producing roughly 2.5 tokens per word instead of 1.4:

Cost per email = 500 tokens × ($2.50 / 1,000,000) = $0.00125
Cost for 100,000 emails = $125.00

Nearly double the cost for the same semantic content, purely because of tokenization efficiency.

The same logic applies to context windows. If a model has a 128,000-token context window, and your language tokenizes at 1.4 tokens per word, you can fit about 91,000 words. If your language tokenizes at 2.5 tokens per word, you can only fit about 51,000 words in the same context window. The model’s effective capacity depends on the tokenizer’s efficiency for your specific language.
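The cost and context-window arithmetic in this section can be worked through with two small helpers. The function names are hypothetical, and the price is the GPT-4o input price quoted earlier in this section:

```python
PRICE_PER_MILLION_INPUT = 2.50  # USD per million input tokens (figure quoted in the text)

def input_cost(tokens_per_item, num_items, price_per_million=PRICE_PER_MILLION_INPUT):
    """Total input cost in USD for num_items items of tokens_per_item tokens each."""
    return tokens_per_item * num_items * price_per_million / 1_000_000

def words_in_context(context_tokens, tokens_per_word):
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / tokens_per_word)

print(f"${input_cost(275, 100_000):.2f}")  # $68.75
print(f"${input_cost(500, 100_000):.2f}")  # $125.00
print(words_in_context(128_000, 1.4))      # 91428
print(words_in_context(128_000, 2.5))      # 51200
```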


Hands-On: Build a BPE Tokenizer from Scratch

Now let’s implement BPE from scratch in Python. This implementation follows the same algorithm we walked through earlier, but on real text. No external libraries needed beyond Python’s standard library.

from collections import Counter

def get_pair_counts(token_sequences):
    """Count frequency of each adjacent pair across all sequences."""
    pairs = Counter()
    for seq, freq in token_sequences.items():
        for i in range(len(seq) - 1):
            pairs[(seq[i], seq[i + 1])] += freq
    return pairs

def merge_pair(token_sequences, pair, new_token):
    """Replace all occurrences of pair with new_token in all sequences."""
    merged = {}
    for seq, freq in token_sequences.items():
        new_seq = []
        i = 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == pair[0] and seq[i + 1] == pair[1]:
                new_seq.append(new_token)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        merged[tuple(new_seq)] = freq
    return merged

def train_bpe(text, num_merges):
    """Train a BPE tokenizer on the given text."""
    # Split text into words and count frequencies
    words = text.split()
    word_freqs = Counter(words)

    # Initialize: split each word into characters + end-of-word marker
    token_sequences = {}
    for word, freq in word_freqs.items():
        chars = tuple(list(word) + ["</w>"])
        token_sequences[chars] = freq

    # Collect initial vocabulary
    vocab = set()
    for seq in token_sequences:
        vocab.update(seq)

    merges = []

    for i in range(num_merges):
        pairs = get_pair_counts(token_sequences)
        if not pairs:
            break

        # Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        best_count = pairs[best_pair]

        # Create new token by joining the pair
        new_token = best_pair[0] + best_pair[1]

        print(f"Merge {i+1}: {best_pair} → '{new_token}' (count: {best_count})")

        # Apply the merge
        token_sequences = merge_pair(token_sequences, best_pair, new_token)
        vocab.add(new_token)
        merges.append(best_pair)

    return vocab, merges, token_sequences

# --- Train on a small corpus ---
corpus = ("low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3).strip()
print(f"Corpus: {corpus}\n")

vocab, merges, final_sequences = train_bpe(corpus, num_merges=10)

print(f"\nFinal vocabulary ({len(vocab)} tokens):")
# Sort by length, then alphabetically, so the output is reproducible
print(sorted(vocab, key=lambda t: (len(t), t)))

print(f"\nFinal tokenization:")
for seq, freq in final_sequences.items():
    print(f"  {list(seq)} × {freq}")

Running this produces:

Corpus: low low low low low lower lower newest newest newest newest newest newest widest widest widest

Merge 1: ('e', 's') → 'es' (count: 9)
Merge 2: ('es', 't') → 'est' (count: 9)
Merge 3: ('est', '</w>') → 'est</w>' (count: 9)
Merge 4: ('l', 'o') → 'lo' (count: 7)
Merge 5: ('lo', 'w') → 'low' (count: 7)
Merge 6: ('n', 'e') → 'ne' (count: 6)
Merge 7: ('ne', 'w') → 'new' (count: 6)
Merge 8: ('new', 'est</w>') → 'newest</w>' (count: 6)
Merge 9: ('low', '</w>') → 'low</w>' (count: 5)
Merge 10: ('w', 'i') → 'wi' (count: 3)

Final vocabulary (21 tokens):
['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', 'es', 'lo', 'ne', 'wi', 'est', 'low', 'new', '</w>', 'est</w>', 'low</w>', 'newest</w>']

Final tokenization:
  ['low</w>'] × 5
  ['low', 'e', 'r', '</w>'] × 2
  ['newest</w>'] × 6
  ['wi', 'd', 'est</w>'] × 3

After 10 merges, the tokenizer has learned that:

  • “newest” is so common it became a single token
  • “low” is common enough to be its own token
  • “est” (a common English suffix) got merged early
  • “widest” is split into [“wi”, “d”, “est</w>”] because “wi” and “d” weren’t frequent enough to merge further in just 10 steps

This is exactly how production tokenizers work, just at a much larger scale. Instead of 10 merges on a tiny corpus, GPT-4o’s tokenizer performed roughly 200,000 merges on a corpus of billions of words.

Tokenizing New Text with Our BPE

Once we have the merge rules, we can tokenize any new text by applying the merges in order:

def tokenize(word, merges):
    """Tokenize a word using learned BPE merges."""
    tokens = list(word) + ["</w>"]

    for pair in merges:
        new_token = pair[0] + pair[1]
        i = 0
        new_tokens = []
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                new_tokens.append(new_token)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    return tokens

# Tokenize words using our learned merges
test_words = ["low", "lower", "newest", "widest", "lowest", "newer"]
for word in test_words:
    tokens = tokenize(word, merges)
    print(f"  '{word}' → {tokens}")

Output:

  'low' → ['low</w>']
  'lower' → ['low', 'e', 'r', '</w>']
  'newest' → ['newest</w>']
  'widest' → ['wi', 'd', 'est</w>']
  'lowest' → ['low', 'est</w>']
  'newer' → ['new', 'e', 'r', '</w>']

Notice how “lowest” (a word that wasn’t in our training corpus) gets tokenized as [“low”, “est”]. The tokenizer has never seen “lowest” before, but it can represent it by combining “low” (learned from “low” and “lower”) with “est” (learned from “newest” and “widest”). This is the power of subword tokenization: it generalizes to unseen words by composing known pieces.

Similarly, “newer” becomes [“new”, “e”, “r”, “</w>”]. The tokenizer knows “new” as a unit (from “newest”) and falls back to individual characters for the unfamiliar suffix “er” in this context.
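Going the other way is simpler. The following is a minimal sketch, assuming the same </w> end-of-word convention used above; the detokenize helper is an illustrative name, not part of any library. Concatenating the pieces and removing the marker recovers the original word, which is why splitting an unseen word into subwords loses no information.

```python
def detokenize(tokens, end_marker="</w>"):
    """Rebuild a word from its BPE pieces (illustrative helper)."""
    word = "".join(tokens)
    # Drop the end-of-word marker that was appended before tokenization.
    if word.endswith(end_marker):
        word = word[: -len(end_marker)]
    return word


print(detokenize(["low", "est</w>"]))         # lowest
print(detokenize(["new", "e", "r", "</w>"]))  # newer
```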


Tokenizing Real Text with tiktoken

Let’s use tiktoken to explore how a production tokenizer handles real-world text. This code lets you see exactly how any text gets broken into tokens:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def show_tokens(text):
    """Display the tokenization of a text string."""
    token_ids = enc.encode(text)
    print(f"Text: {text}")
    print(f"Token count: {len(token_ids)}")
    print("Tokens:")
    for tid in token_ids:
        # Decode one ID at a time; repr() makes spaces and newlines visible.
        token_str = enc.decode([tid])
        print(f"  [{tid:>6d}] {token_str!r}")
    print()

show_tokens("The capital of France is Paris.")
show_tokens("Tokenization is fundamental to LLMs.")
show_tokens("def hello():\n    print('Hello, world!')")

This will show you the exact token boundaries for each piece of text. You can experiment with different inputs to build intuition for how the tokenizer splits text.
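When you try non-English text or emoji, it helps to remember that a byte-level tokenizer starts from UTF-8 bytes rather than characters. The sketch below uses only the Python standard library (the utf8_bytes helper is an illustrative name) to show how many bytes a single character can occupy. A character that spans several bytes may be split across more than one token, in which case decoding one of those tokens on its own can print a replacement character instead of readable text.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


for ch in ["a", "é", "日", "😀"]:
    raw = utf8_bytes(ch)
    print(f"{ch!r}: {len(raw)} byte(s) -> {raw}")
```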

Comparing Tokenizer Efficiency

Here’s a practical comparison showing how different text types produce different token-to-word ratios:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

texts = {
    "Simple English": "The quick brown fox jumps over the lazy dog near the river bank.",
    "Technical English": "Backpropagation computes gradients via the chain rule through differentiable layers.",
    "Python code": "def fibonacci(n):\n    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "JSON data": '{"name": "Alice", "age": 30, "city": "New York"}',
    "URL": "https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct",
}

print(f"{'Type':<20} {'Words':>6} {'Tokens':>7} {'Ratio':>6}")
print("-" * 42)
for label, text in texts.items():
    words = len(text.split())
    tokens = len(enc.encode(text))
    ratio = tokens / words if words > 0 else 0
    print(f"{label:<20} {words:>6} {tokens:>7} {ratio:>6.2f}")

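On longer samples, English prose typically produces about 1.2-1.4 tokens per word; a short sentence built from very common words, like the first example here, may land closer to one token per word. Code and structured data produce higher ratios because of special characters, punctuation, and formatting.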
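One caveat when reading these ratios: the comparison counts words with text.split(), which splits on whitespace only. The sketch below (standard library only; whitespace_word_count is an illustrative name) shows what that denominator looks like for three of the samples. The URL counts as a single word, so its tokens-per-word ratio is inflated far more by the word count than by anything the tokenizer does.

```python
samples = {
    "Simple English": "The quick brown fox jumps over the lazy dog near the river bank.",
    "JSON data": '{"name": "Alice", "age": 30, "city": "New York"}',
    "URL": "https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct",
}


def whitespace_word_count(text):
    """Count words the same way the comparison does: split on whitespace."""
    return len(text.split())


for label, text in samples.items():
    print(f"{label:<16} words={whitespace_word_count(text):>3} chars={len(text):>3}")
```

For structured inputs like these, tokens per character is often a steadier measure than tokens per word.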


Special Tokens: Beyond Regular Text

Every tokenizer includes a set of special tokens that serve specific purposes in the model’s operation. These tokens don’t represent text from the input; they’re control signals that tell the model about the structure of the input.

Common special tokens include:

  • Beginning of sequence (<|begin_of_text|> or <s>): Marks the start of a new input.
  • End of sequence (<|end_of_text|> or </s> or <|endoftext|>): Tells the model the input is complete. During generation, the model outputs this token to signal it’s done.
  • Padding (<pad>): Used to fill sequences to a uniform length when processing multiple inputs in a batch.
  • System/User/Assistant markers: In chat models, special tokens mark the boundaries between system instructions, user messages, and assistant responses. For example, <|start_header_id|>user<|end_header_id|> in LLaMA 3’s chat format.

These special tokens are assigned IDs in the vocabulary just like regular tokens, but they’re never produced by the BPE algorithm. They’re manually added to the vocabulary for specific purposes.
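To make this concrete, here is a small sketch with a hypothetical four-entry vocabulary. The token strings, the IDs, and the pad_batch helper are all assumptions for illustration, not taken from any real tokenizer. The special tokens are appended by hand after the learned tokens, and the padding ID is then used to bring a batch to a uniform length.

```python
# Hypothetical toy vocabulary: learned tokens first, special tokens appended by hand.
vocab = {"low": 0, "est</w>": 1, "new": 2, "er</w>": 3}
for special in ["<pad>", "<s>", "</s>"]:
    vocab[special] = len(vocab)  # next free ID


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [
    [vocab["<s>"], vocab["low"], vocab["est</w>"], vocab["</s>"]],
    [vocab["<s>"], vocab["new"], vocab["</s>"]],
]
padded = pad_batch(batch, vocab["<pad>"])
print(padded)  # [[5, 0, 1, 6], [5, 2, 6, 4]]
```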

When you send a message to a chat API, the system wraps your text in special tokens before the model sees it. A conversation like:

System: You are a helpful assistant.
User: What is the capital of France?

Gets converted into something like:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

These special tokens consume part of the context window. The overhead depends on the number of messages rather than their length: each message adds a few tokens for its role header and end-of-turn marker, so a conversation with several turns can spend 20-30 tokens on formatting alone. This is why the effective context available for your actual content is always slightly less than the model’s advertised context window.
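The wrapping step can be sketched as a plain string template. This is an illustration that mirrors the marker layout shown above, not an official chat template; render_chat and count_markers are hypothetical helper names, and the blank line after each header is assumed from the example. Real APIs apply their own template on the server side.

```python
import re


def render_chat(messages):
    """Wrap each message in role headers and end-of-turn markers (illustrative)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(f"{msg['content']}<|eot_id|>")
    # Open an assistant header so the model knows it should respond next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def count_markers(prompt):
    """Count the <|...|> control markers in a rendered prompt."""
    return len(re.findall(r"<\|[a-z_]+\|>", prompt))


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = render_chat(messages)
print(prompt)
print("control markers:", count_markers(prompt))  # 9
```

Two short messages already carry nine control markers, and each additional turn adds three more.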


From Tokens to the Model: What Happens Next

Once text is tokenized into a sequence of integer IDs, those IDs are used to look up vectors in the embedding table (Chapter 5). Each token ID is an index into a table with one row per vocabulary entry. The token ID 13225 (“Hello” in o200k_base) maps to row 13225 of the embedding table, and that row is a vector representing the meaning of “Hello.” In LLaMA 4 Maverick, for example, each row holds 5,120 numbers (LLaMA uses its own tokenizer, so the same text generally maps to different IDs there).

This is the bridge between text and the mathematical operations from Chapters 2 and 3. Tokenization converts text to integers. The embedding table converts integers to vectors. And from that point on, everything is matrix multiplications, dot products, and softmax, exactly as we covered in the previous chapters.
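The lookup itself is nothing more than row indexing. Here is a minimal sketch with toy sizes and random values, using only the standard library; the table contents, the sizes, and the embed helper are assumptions for illustration, since a real table is learned during training.

```python
import random

VOCAB_SIZE = 8  # toy size; production vocabularies have around 200,000 entries
EMBED_DIM = 4   # toy size; the text cites 5,120 for LLaMA 4 Maverick

random.seed(0)
# One row per vocabulary entry, filled with random numbers for the demo.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids, table):
    """Replace each token ID with the corresponding row of the table."""
    return [table[tid] for tid in token_ids]


token_ids = [3, 1, 3]
vectors = embed(token_ids, embedding_table)
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same ID always gives the same row
```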

The full pipeline looks like this:

"The capital of France is Paris."
        ↓ tokenization
[976, 9029, 328, 10128, 382, 12650, 13]
        ↓ embedding lookup
[vector_976, vector_9029, vector_328, vector_10128, vector_382, vector_12650, vector_13]
        ↓ Transformer layers (Chapters 7-10)
[transformed vectors...]
        ↓ output projection + softmax
probability distribution over 202,048 tokens

Every step after tokenization operates on numbers. The tokenizer is the only component that touches raw text.
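The last arrow in the pipeline can be sketched in a few lines of plain Python. The hidden vector and the four-row output matrix below are made-up toy values, and project is an illustrative name. In a real model the matrix has one row per vocabulary entry, and the hidden vector comes out of the Transformer layers.

```python
import math


def project(vector, output_matrix):
    """Dot the final hidden vector with each vocabulary row to get one logit per token."""
    return [sum(v * w for v, w in zip(vector, row)) for row in output_matrix]


def softmax(logits):
    """Turn logits into probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [0.5, -1.0, 2.0]  # toy final hidden vector
output_matrix = [          # toy projection: 4 "vocabulary" rows
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

logits = project(hidden, output_matrix)  # [0.5, -1.0, 2.0, 1.5]
probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)
print(next_id)  # 2: the token ID with the highest probability
```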


Key Takeaways

  • Language models operate on numbers, not text. Tokenization is the process of converting text into a sequence of integer IDs that the model can process. It’s the first step in every LLM pipeline.

  • Character-level tokenization produces sequences that are too long (expensive to process). Word-level tokenization produces vocabularies that are too large and can’t handle unseen words. Subword tokenization is the sweet spot: common words stay whole, rare words are broken into reusable pieces.

  • Byte Pair Encoding (BPE) is the dominant tokenization algorithm. It starts with individual bytes (or characters), then repeatedly merges the most frequent adjacent pair into a new token until the desired vocabulary size is reached. It was originally a compression algorithm (Gage, 1994) adapted for NLP (Sennrich et al., 2016).

  • Modern models use byte-level BPE, which starts from raw bytes (0-255) instead of characters. This makes the tokenizer universal: it can handle any language, any script, any emoji, and any special character without an out-of-vocabulary problem.

  • Vocabulary sizes have grown over time: from 50,000 (GPT-2, 2019) to 100,000 (GPT-4, 2023) to 200,000+ (GPT-4o, LLaMA 4 Maverick, 2024-2025) and even 250,000 (Qwen 3.5, late 2025). Larger vocabularies improve tokenization efficiency (fewer tokens per sentence) but increase the model’s embedding table size and output computation cost.

  • The three major tokenizer libraries are SentencePiece (Google, language-agnostic), tiktoken (OpenAI, fast byte-level BPE), and Tekken (Mistral, multilingual BPE). The algorithm matters less than the vocabulary: what tokens exist and how efficiently they represent your text.

  • Tokenization efficiency varies by language. English text typically produces 1.2-1.5 tokens per word, while some non-Latin-script languages may produce many more tokens for equivalent semantic content, with disparities of up to 15 times documented in research (Petrov et al., NeurIPS 2023; Ahia et al., EMNLP 2023). This directly affects API costs and effective context window size.

  • Special tokens (beginning/end of sequence, role markers, padding) are manually added to the vocabulary for control purposes. They consume part of the context window and are invisible to the user but essential for the model’s operation.

  • The tokenizer is the bridge between human-readable text and the mathematical world of vectors and matrices. After tokenization, each token ID is looked up in the embedding table to produce a vector, and from that point on, everything is the math from Chapters 2 and 3.


What’s Next

You now know how text becomes a sequence of integer IDs. But an integer by itself carries no meaning; the model needs a rich numerical representation of each token. In Chapter 5, we’ll cover embeddings: how each token ID maps to a high-dimensional vector, what those vectors represent, how they’re learned during training, and why the embedding table in a production model can be several gigabytes in size.