Tokens and Tokenization
Before any text reaches an LLM, it goes through a step that’s easy to overlook but critical to understand: tokenization. It affects everything from model cost to context window limits to why LLMs sometimes make surprisingly basic spelling mistakes.
What Is a Token?
A token is the basic unit of text that an LLM processes. But tokens aren’t the same as characters, syllables, or words — they’re something in between, determined by a vocabulary learned during training.
Text: "Tokenization is fascinating!"
Tokens: ["Token", "ization", " is", " fascinat", "ing", "!"]
Count: 6 tokens
As a rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. But this varies significantly by language.
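To make the rule of thumb concrete, here is a minimal sketch that applies both ratios to a string. The function name `estimate_tokens` is made up for illustration, and the result is only a ballpark for English text, not a substitute for a real tokenizer.

```python
def estimate_tokens(text: str) -> dict:
    """Rough English-only estimates from the two rules of thumb."""
    chars = len(text)
    words = len(text.split())
    return {
        "from_chars": round(chars / 4),     # 1 token ≈ 4 characters
        "from_words": round(words / 0.75),  # 1 token ≈ 0.75 words
    }


sample = "Tokenization is fascinating!"
print(estimate_tokens(sample))  # {'from_chars': 7, 'from_words': 4}
```

The two estimates (7 and 4) bracket the 6 tokens shown in the example, which is about as much precision as the heuristic offers on short strings.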
Why Not Just Use Words or Characters?
Character-level models: Each character is one token. Great for handling any input, terrible efficiency — “internationalization” needs 20 tokens. Sequences get very long, and the model has to work harder to learn that c, a, t means something specific.
Word-level models: Each word is one token. Nice and intuitive, but a vocabulary of 500K+ words is unwieldy. Rare words, misspellings, and new proper nouns cause “out-of-vocabulary” failures.
Subword tokenization (what LLMs use): A learned vocabulary of common subword units. Frequent words become single tokens; rare words are split into recognizable pieces. The sweet spot.
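The trade-off between the first two approaches can be seen in a few lines. This is a toy sketch: the four-entry `vocab` and the `word_level_encode` helper are hypothetical and far smaller than anything used in practice.

```python
# Character-level: every character is its own token.
word = "internationalization"
char_tokens = list(word)
print(len(char_tokens))  # 20

# Word-level: anything outside the vocabulary collapses to <unk>.
# This tiny vocabulary is hypothetical, for illustration only.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}


def word_level_encode(text, vocab):
    """Map each whitespace-separated word to its id, or to <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]


ids = word_level_encode("The cat tweeted", vocab)
print(ids)  # [0, 1, 3]
```

The unknown word loses all of its information at the word level, while the character level keeps everything at the price of a much longer sequence.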
Byte Pair Encoding (BPE)
The most common tokenization algorithm. The idea is elegant:
- Start with individual characters as the vocabulary
- Find the most frequent pair of adjacent tokens in the training corpus
- Merge that pair into a new single token
- Repeat until the vocabulary reaches the desired size (e.g., 50,000 tokens)
Training data contains "low" frequently:
Initial: l, o, w → (l,o) is common → merge
Step 1: lo, w → (lo,w) is common → merge
Step 2: low → "low" is now a single token
GPT-4 uses a BPE vocabulary of ~100,000 tokens. LLaMA 3 uses ~128,000.
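The merge loop described above can be sketched in plain Python. This is a simplified illustration, not the training code of any production tokenizer: it works on a hypothetical word-frequency table, skips byte-level handling and end-of-word markers, and the name `train_bpe` is made up for the example.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` BPE merges from a {word: frequency} table."""
    # Start with every word split into individual characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus in which "low" is frequent.
merges, corpus = train_bpe({"low": 10, "lower": 4, "slow": 3, "lot": 2}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)
```

After two merges, "low" is a single token and "lower" has become `low, e, r`, which mirrors the walkthrough above.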
SentencePiece and Unigram Tokenization
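SentencePiece is a tokenizer library used by T5, LLaMA 1 and 2, Gemma, and others. It implements both BPE and an alternative algorithm, unigram tokenization. Instead of merging bottom-up, the unigram approach starts from a large candidate vocabulary, fits a unigram language model over it, and repeatedly prunes the tokens whose removal hurts the corpus likelihood least, until the vocabulary reaches the target size.
Key advantage: language-agnostic. SentencePiece treats the input as a raw character stream and encodes spaces as an ordinary symbol, so it handles Japanese, Arabic, Chinese, or code without a whitespace pre-tokenization step. Many BPE implementations split on whitespace first, which assumes spaces separate words; SentencePiece doesn’t.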
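Once a unigram model has assigned a probability to each token, encoding a string means picking the segmentation with the highest total probability. The sketch below shows only that segmentation step, using dynamic programming (Viterbi); it does not cover training or pruning. The probabilities are invented for illustration and `viterbi_segment` is a hypothetical helper, not the SentencePiece API.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the most probable split of `text` under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Made-up token probabilities; single characters act as a low-probability fallback.
probs = {"token": 0.05, "ization": 0.02, "tok": 0.01, "en": 0.03, "iz": 0.005, "ation": 0.02}
for ch in set("tokenization"):
    probs.setdefault(ch, 0.001)
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces = viterbi_segment("tokenization", log_probs)
print(pieces)  # ['token', 'ization']
```

The two-piece split wins because every extra piece multiplies in another probability below one, so shorter segmentations built from likely tokens are favored.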
Tokenization in Practice
Here are some examples of how GPT-4’s tokenizer splits text (exact splits can vary between tokenizer versions):
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"ChatGPT" → ["Chat", "G", "PT"] = 3 tokens
"Supercalifragilistic" → ["Super", "cal", "if", "ragil", "istic"] = 5 tokens
# Code tokenizes differently:
"def hello_world():" → ["def", " hello", "_world", "():"] = 4 tokens
# Non-English is less efficient:
"こんにちは" → ["こん", "にち", "は"] = 3 tokens for 5 characters, versus roughly 4 characters per token in English
The Non-English Efficiency Gap
This is a real issue. English averages roughly 1.3 tokens per word (the flip side of the 0.75 words-per-token heuristic). Many other languages need 2–4 tokens per word. This means:
- Non-English users pay more per equivalent amount of text
- Models are less capable in languages with less training data AND worse token efficiency
- Arabic, Hebrew, and some Asian languages are particularly affected
Newer models (LLaMA 3, Gemini, Qwen 2.5) have invested in larger, more multilingual vocabularies to address this.
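One way to measure the gap for your own text is to compute tokens per word with whatever tokenizer you use. The sketch below is an assumption-laden illustration: `tokens_per_word` is a hypothetical helper, and `byte_tokenize` is a stand-in that emits one token per UTF-8 byte, which is the worst case for a byte-level tokenizer before any merges apply. Swap in a real tokenizer's encode function for meaningful numbers.

```python
def tokens_per_word(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def byte_tokenize(text):
    """Stand-in tokenizer: one token per UTF-8 byte (no merges)."""
    return list(text.encode("utf-8"))


english = "hello world"
hindi = "नमस्ते दुनिया"  # "hello world" in Hindi

print(tokens_per_word(english, byte_tokenize))  # 5.5
print(tokens_per_word(hindi, byte_tokenize))
```

Devanagari characters take three bytes each in UTF-8, so the Hindi phrase starts from a much higher count, and a vocabulary has to spend merges just to catch up.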
Why Tokenization Matters for Developers
Cost
Every major LLM API charges per token (input + output). By the 0.75 words-per-token heuristic, a 1,000-word English document is roughly 1,300 tokens; the same content in Hindi could take two or more times as many.
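Per-token billing is simple arithmetic, sketched here. The function name `estimate_cost_usd` is hypothetical, and the rates passed in ($5.00 input and $15.00 output per 1M tokens) are example figures; substitute your provider's current prices.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Cost in USD given token counts and prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# 1M English words ≈ 1.33M tokens, using the 0.75 words-per-token heuristic.
input_only = estimate_cost_usd(1_330_000, 0, 5.00, 15.00)
print(f"${input_only:.2f}")  # $6.65

# A single request with a 2,000-token prompt and a 500-token reply.
one_call = estimate_cost_usd(2_000, 500, 5.00, 15.00)
print(f"${one_call:.4f}")  # $0.0175
```

Output tokens usually cost more than input tokens, so long generations can dominate the bill even when prompts are short.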
OpenAI GPT-4o pricing (at launch in May 2024; check the current price list, since rates change):
Input: $5.00 per 1M tokens
Output: $15.00 per 1M tokens
1M words ≈ 1.33M tokens ≈ $6.65 for input
Context Window Limits
Your model has a maximum context window (e.g., 128K tokens for GPT-4 Turbo and GPT-4o). If your prompt plus conversation history exceeds this, the request is rejected or older content has to be dropped. Tracking token counts isn’t optional for production systems.
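A common way to stay inside the window is to drop the oldest messages first. The following is a minimal sketch of that policy under stated assumptions: `trim_history` and `rough_count` are hypothetical helpers, and `rough_count` uses the 4-characters-per-token heuristic in place of a real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the total fits within max_tokens."""
    kept = list(messages)
    total = sum(count_tokens(m) for m in kept)
    while kept and total > max_tokens:
        total -= count_tokens(kept.pop(0))  # remove the oldest message
    return kept


def rough_count(text):
    """Placeholder counter: about 4 characters per token."""
    return max(1, len(text) // 4)


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
trimmed = trim_history(history, max_tokens=25, count_tokens=rough_count)
print(len(trimmed))  # 2
```

Real systems often pin the system prompt and summarize dropped turns instead of discarding them, but the budget check itself looks much like this.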
Unexpected Behaviors
Because LLMs see tokens, not characters, some things break in surprising ways:
- Counting letters: “How many ‘r’s in ‘strawberry’?” — the model never sees individual letters
- Spelling: The model generates token by token, so it can misspell words that span token boundaries awkwardly
- Arithmetic with long numbers: “12345678” might be tokenized as [“123”, “456”, “78”] — three separate entities to reason about
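The gap between what code sees and what a model sees can be illustrated directly. In this sketch, `chunk_digits` simply mirrors the three-digit grouping from the example above, and the `pieces` split of "strawberry" is hypothetical; neither is a claim about how a specific tokenizer behaves.

```python
def chunk_digits(number: str, size: int = 3):
    """Split a digit string left to right into fixed-size chunks."""
    return [number[i:i + size] for i in range(0, len(number), size)]


print(chunk_digits("12345678"))  # ['123', '456', '78']

# At the character level, counting letters is trivial.
print("strawberry".count("r"))  # 3

# A model instead receives opaque token ids for pieces like these
# (hypothetical split), so the letters are never directly visible.
pieces = ["str", "aw", "berry"]
print([p.count("r") for p in pieces])  # [1, 0, 2]
```

The answer is recoverable only if the model has learned which letters each token contains, which is why such questions are harder than they look.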
Counting Tokens in Code
# Using tiktoken (OpenAI's tokenizer library)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
text = "Hello, how are you today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")  # 7
print(f"Tokens: {tokens}")  # [9906, 11, 1268, 527, 499, 3432, 30] with cl100k_base
# Decode back
print(enc.decode(tokens))  # "Hello, how are you today?"
# Using Hugging Face tokenizers (return_tensors="pt" requires PyTorch)
from transformers import AutoTokenizer
# Gated model: requires accepting Meta's license on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, how are you?", return_tensors="pt")
print(tokens.input_ids.shape)  # torch.Size([1, 7]), including the <|begin_of_text|> token
Special Tokens
Tokenizers include special tokens that mark important boundaries. These aren’t in the natural text — they’re added by the tokenizer.
GPT-4 special tokens:
<|endoftext|> ← end of document
<|fim_prefix|> ← fill-in-the-middle prefix
<|fim_middle|> ← fill-in-the-middle middle
<|fim_suffix|> ← fill-in-the-middle suffix
LLaMA 3 special tokens:
<|begin_of_text|> ← start of sequence
<|start_header_id|>system<|end_header_id|> ← system message
<|eot_id|> ← end of turn
Understanding special tokens matters when you’re building chat systems or calling APIs at a lower level, because they define the structure of multi-turn conversations.
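To show how these markers structure a conversation, here is a sketch that assembles a LLaMA 3 style prompt by hand. The helper `build_llama3_prompt` is hypothetical and written for illustration; in practice the tokenizer's own chat template (for example `apply_chat_template` in Hugging Face Transformers) is the safer choice, since it tracks the model's exact format.

```python
def build_llama3_prompt(messages):
    """Assemble a LLaMA 3 style chat prompt from role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open an assistant header so the model continues with its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
])
print(prompt)
```

Each turn is wrapped in a role header and closed with `<|eot_id|>`, which is how the model tells where one speaker stops and the next begins.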
The Evolving Picture
Tokenization is an active area of research. A few directions worth watching:
Byte-level BPE: BPE run over UTF-8 bytes rather than Unicode characters, so any input can be encoded without an unknown-token fallback. GPT-2 introduced this approach, and OpenAI’s tiktoken encodings and LLaMA 3 use it.
Larger vocabularies: LLaMA 3’s 128K vocabulary vs. LLaMA 2’s 32K makes it meaningfully more efficient for multilingual text.
Character- and byte-level models: Researchers are revisiting models that read characters or bytes directly, using modern architectures. Non-Transformer sequence models such as Mamba and Hyena show promise for learning from raw characters or bytes without the “token lottery.”
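The appeal of working at the byte level is that 256 base symbols cover every possible input. A short sketch using only Python's built-in UTF-8 codec (the sample string is arbitrary):

```python
text = "naïve café 🙂"

# Every string maps to a sequence of byte values in the range 0-255.
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))

# The mapping is lossless: decoding the bytes recovers the original string.
decoded = bytes(byte_ids).decode("utf-8")
print(decoded == text)  # True
```

The cost is sequence length: accented letters take two bytes and the emoji takes four, which is why byte-level BPE layers merges on top and why pure byte-level models need architectures that handle long sequences efficiently.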
For now, subword BPE/SentencePiece remains the standard — but it’s worth knowing its limitations, because they directly affect what your LLM application can and can’t do well.