Tokens and Tokenization

Before any text reaches an LLM, it goes through a step that’s easy to overlook but critical to understand: tokenization. It affects everything from model cost to context window limits to why LLMs sometimes make surprisingly basic spelling mistakes.

What Is a Token?

A token is the basic unit of text that an LLM processes. But tokens aren’t the same as characters, syllables, or words — they’re something in between, determined by a vocabulary learned during training.

Text:   "Tokenization is fascinating!"
Tokens: ["Token", "ization", " is", " fascinat", "ing", "!"]
Count:  6 tokens

As a rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. But this varies significantly by language.
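
To make the heuristic concrete, here is a minimal sketch that applies both rules of thumb to a string. The function name `estimate_tokens` is made up for illustration, and the result is only a ballpark figure for English text, not a substitute for running a real tokenizer.

```python
def estimate_tokens(text: str) -> dict:
    """Ballpark token estimates for English text from the two rules of thumb."""
    by_chars = len(text) / 4              # 1 token is roughly 4 characters
    by_words = len(text.split()) / 0.75   # 1 token is roughly 0.75 words
    return {"by_chars": round(by_chars), "by_words": round(by_words)}


sample = "Tokenization is fascinating!"
print(estimate_tokens(sample))  # {'by_chars': 7, 'by_words': 4}
```

The two estimates disagree with each other (and with the 6 tokens shown above), which is a reminder of how rough the heuristic is on short inputs.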

Why Not Just Use Words or Characters?

Character-level models: Each character is one token. Great for handling any input, terrible efficiency — “internationalization” needs 20 tokens. Sequences get very long, and the model has to work harder to learn that c, a, t means something specific.

Word-level models: Each word is one token. Nice and intuitive, but a vocabulary of 500K+ words is unwieldy. Rare words, misspellings, and new proper nouns cause “out-of-vocabulary” failures.

Subword tokenization (what LLMs use): A learned vocabulary of common subword units. Frequent words become single tokens; rare words are split into recognizable pieces. The sweet spot.
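
The trade-off between the first two approaches can be seen in a few lines. This is a toy sketch: the three-word vocabulary and the `<unk>` marker are invented for the example and do not come from any real model.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level: every character is its own token."""
    return list(text)


# Toy word-level vocabulary, purely illustrative.
VOCAB = {"the", "cat", "sat"}


def word_tokenize(text: str) -> list[str]:
    """Word-level: words outside the vocabulary collapse to an unknown marker."""
    return [w if w in VOCAB else "<unk>" for w in text.lower().split()]


print(len(char_tokenize("internationalization")))  # 20
print(word_tokenize("The cat sat"))                # ['the', 'cat', 'sat']
print(word_tokenize("The catt sat"))               # ['the', '<unk>', 'sat']
```

The character tokenizer never fails but produces long sequences. The word tokenizer is compact but loses the misspelled word entirely.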

Byte Pair Encoding (BPE)

The most common tokenization algorithm. The idea is elegant:

Start with individual characters as the vocabulary
Count every pair of adjacent tokens in the training data and pick the most frequent one
Merge that pair into a new single token
Repeat until the vocabulary reaches the desired size (e.g., 50,000 tokens)

Training data contains "low" frequently:
Initial:  l, o, w         → (l,o) is common → merge
Step 1:   lo, w           → (lo,w) is common → merge
Step 2:   low             → "low" is now a single token

GPT-4 uses a BPE vocabulary of ~100,000 tokens. LLaMA 3 uses ~128,000.
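
The merge loop described above can be sketched in plain Python. This is a toy trainer under simplifying assumptions: the corpus is a small word-frequency table, merges never cross word boundaries, and ties are broken by first occurrence. Production tokenizers add byte-level handling, pre-tokenization rules, and far more efficient data structures. The name `train_bpe` and the sample corpus are made up for the example.

```python
from collections import Counter


def train_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a {word: frequency} table (toy sketch)."""
    # Each word starts out as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


merges = train_bpe({"low": 5, "lower": 2, "lowest": 2, "new": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

The learned merge list is the tokenizer: encoding new text means replaying these merges in order.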

SentencePiece and Unigram Tokenization

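SentencePiece is a tokenizer library rather than a single algorithm: it implements both BPE and unigram tokenization. T5 uses its unigram mode, LLaMA 1 and 2 use its BPE mode (LLaMA 3 moved to a tiktoken-style BPE), and Gemma also uses SentencePiece. Instead of merging bottom-up, unigram tokenization starts from a large candidate vocabulary, fits a unigram language model over it, and repeatedly prunes the tokens whose removal lowers the training-corpus likelihood least, until the vocabulary reaches the target size.

Key advantage: language-agnostic. SentencePiece treats the input as a raw character stream and encodes spaces as an ordinary symbol (▁), so it needs no whitespace pre-tokenization and handles Japanese, Arabic, Chinese, or code with the same pipeline. Many BPE implementations first split text on whitespace; SentencePiece doesn’t.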
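
Once a unigram vocabulary exists, encoding means finding the most probable way to split the input into known pieces. The sketch below shows that search as a small dynamic program. It is not the SentencePiece implementation: the vocabulary, its probabilities, the single-character fallback score, and the names `segment` and `detokenize` are all invented for illustration. The one real convention it borrows is writing spaces as the "▁" symbol (U+2581).

```python
import math

SPACE = "\u2581"  # the "▁" symbol that stands in for a space

# Toy unigram vocabulary with made-up probabilities (illustrative only).
LOG_PROBS = {
    SPACE: math.log(0.05),
    SPACE + "token": math.log(0.10),
    "token": math.log(0.02),
    "ization": math.log(0.05),
    "iz": math.log(0.01),
    "ation": math.log(0.03),
    SPACE + "is": math.log(0.10),
    "is": math.log(0.04),
}


def segment(text: str, log_probs: dict[str, float], unk_log_prob: float = -20.0) -> list[str]:
    """Return the highest-probability split of text into vocabulary pieces."""
    s = SPACE + text.replace(" ", SPACE)
    n = len(s)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i] = best score for the prefix s[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = s[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif len(piece) == 1:
                score = unk_log_prob  # unknown single characters stay encodable
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    pieces, i = [], n
    while i > 0:
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1]


def detokenize(pieces: list[str]) -> str:
    """Join the pieces and turn the space symbol back into real spaces."""
    return "".join(pieces).replace(SPACE, " ")[1:]


pieces = segment("tokenization is", LOG_PROBS)
print(pieces)              # ['▁token', 'ization', '▁is']
print(detokenize(pieces))  # tokenization is
```

Because the space is just another symbol inside the pieces, detokenizing is a plain string join, with no language-specific rules about where spaces belong.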

Tokenization in Practice

Let’s look at some real examples using GPT-4’s tokenizer:

"Hello, world!"          → ["Hello", ",", " world", "!"]               = 4 tokens
"ChatGPT"                → ["Chat", "G", "PT"]                         = 3 tokens
"Supercalifragilistic"   → ["Super", "cal", "if", "ragil", "istic"]   = 5 tokens

# Code tokenizes differently:
"def hello_world():"     → ["def", " hello", "_world", "():"]         = 4 tokens

# Non-English is less efficient:
"こんにちは"              → ["こん", "にち", "は"]                      = 3 tokens for just 5 characters

The Non-English Efficiency Gap

This is a real issue. English averages roughly 1.3 tokens per word. Many other languages need 2–4 tokens per word. This means:

Non-English users pay more per equivalent amount of text
Models are less capable in languages with less training data AND worse token efficiency
Arabic, Hebrew, and some Asian languages are particularly affected

Newer models (LLaMA 3, Gemini, Qwen 2.5) have invested in larger, more multilingual vocabularies to address this.
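
One contributing factor is easy to measure without any tokenizer. Assuming a byte-level tokenizer, text in non-Latin scripts starts from more bytes per character, so it needs more learned merges to reach the same compression. The sketch below only counts UTF-8 bytes. It says nothing about how a specific model splits these strings, and the helper name `utf8_stats` is made up.

```python
def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


print(utf8_stats("Hello"))       # (5, 5)
print(utf8_stats("こんにちは"))    # (5, 15)
```

Both greetings are five characters long, but the Japanese one is three times as many bytes before any merging has happened.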

Why Tokenization Matters for Developers

Cost

Most major LLM APIs charge per token (input + output). A document that is 1,000 words in English is roughly 1,300 tokens; the same content in Hindi can easily be two or three times that, even though it carries the same information.

OpenAI GPT-4o launch pricing (May 2024; rates have changed since, so check the current pricing page):
Input:  $5.00 per 1M tokens
Output: $15.00 per 1M tokens

1M words ≈ 1.33M tokens ≈ $6.65 for input
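
The arithmetic above is easy to wrap in a helper. This is a minimal sketch: the function name and the default of 1.33 tokens per word are assumptions taken from the English heuristic, and the price is whatever rate you pass in, not a live figure.

```python
def estimate_cost(words: float, price_per_million: float, tokens_per_word: float = 1.33) -> float:
    """Estimate the cost in dollars of sending `words` words at a per-1M-token price."""
    tokens = words * tokens_per_word
    return tokens / 1_000_000 * price_per_million


# 1M English words at the $5.00 per 1M input-token rate quoted above.
print(round(estimate_cost(1_000_000, 5.00), 2))  # 6.65

# The same word count in a language that needs about 3 tokens per word.
print(round(estimate_cost(1_000_000, 5.00, tokens_per_word=3.0), 2))  # 15.0
```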

Context Window Limits

Your model has a maximum context window (e.g., 128K tokens for GPT-4 Turbo and GPT-4o). If your prompt + conversation history exceeds it, the request fails or older content has to be dropped, depending on how your application handles the overflow. Tracking token counts isn’t optional for production systems.
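
One common way to stay inside the window is to keep only the most recent messages that fit a token budget. The sketch below shows that idea under simple assumptions: messages are plain strings, and the token counter is passed in as a function. `trim_history` and `rough_count` are hypothetical names, and `rough_count` is a crude stand-in for a real tokenizer.

```python
from typing import Callable


def trim_history(messages: list[str], max_tokens: int,
                 count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                        # this and all older messages are dropped
        kept.append(message)
        used += cost
    return kept[::-1]                    # restore chronological order


def rough_count(text: str) -> int:
    """Crude stand-in counter based on the 4-characters-per-token heuristic."""
    return max(1, len(text) // 4)


history = ["a" * 40, "b" * 40, "c" * 40]   # 10 tokens each under rough_count
trimmed = trim_history(history, 25, rough_count)
print(len(trimmed))  # 2
```

In a real system you would pass a counter backed by the model's actual tokenizer and reserve part of the budget for the system prompt and the response.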

Unexpected Behaviors

Because LLMs see tokens, not characters, some things break in surprising ways:

Counting letters: “How many ‘r’s in ‘strawberry’?” — the model never sees individual letters
Spelling: The model generates token by token, so it can misspell words that span token boundaries awkwardly
Arithmetic with long numbers: “12345678” might be tokenized as [“123”, “456”, “78”] — three separate entities to reason about
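
A small sketch makes these failure modes easier to picture. The three-digit chunking mirrors the example above, and the split of "strawberry" into three pieces is hypothetical, chosen only to show how letters end up hidden inside tokens. Real tokenizers may split both strings differently.

```python
def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string left to right into fixed-size chunks."""
    return [number[i:i + size] for i in range(0, len(number), size)]


print(chunk_digits("12345678"))  # ['123', '456', '78']

# Hypothetical token pieces for "strawberry" (not taken from a real tokenizer).
pieces = ["str", "aw", "berry"]
letter_counts = [piece.count("r") for piece in pieces]
print(letter_counts)       # [1, 0, 2]
print(sum(letter_counts))  # 3
```

The model receives one ID per piece, so answering the letter-counting question requires it to have learned what is inside each piece, and then to add the counts up.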

Counting Tokens in Code

# Using tiktoken (OpenAI's tokenizer library): pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
text = "Hello, how are you today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")  # 7
print(f"Tokens: {tokens}")            # [9906, 11, 1268, 527, 499, 3432, 30]

# Decode back to the original string
print(enc.decode(tokens))             # Hello, how are you today?

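# Using Hugging Face tokenizers (requires transformers and torch;
# the LLaMA 3 repository is gated, so you need to accept its license first)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, how are you?", return_tensors="pt")
# 6 text tokens plus the <|begin_of_text|> token added at the start
print(tokens.input_ids.shape)  # torch.Size([1, 7])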

Special Tokens

Tokenizers include special tokens that mark important boundaries. These aren’t in the natural text — they’re added by the tokenizer.

GPT-4 special tokens:
<|endoftext|>    ← end of document
<|fim_prefix|>   ← fill-in-the-middle prefix
<|fim_middle|>   ← fill-in-the-middle middle
<|fim_suffix|>   ← fill-in-the-middle suffix

LLaMA 3 special tokens:
<|begin_of_text|>            ← start of sequence
<|start_header_id|>system<|end_header_id|>  ← role header (here: system)
<|eot_id|>                   ← end of turn

Understanding special tokens matters when you’re building chat systems or calling APIs at a lower level — they define the structure of multi-turn conversations.
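
To show how these tokens structure a conversation, here is a minimal sketch that assembles a LLaMA 3 style chat prompt by hand from the special tokens listed above. The function name `format_llama3_prompt` is made up. In practice you would normally let the tokenizer's own chat template (for example `apply_chat_template` in Hugging Face Transformers) build this string, so treat the exact layout here as an illustration.

```python
def format_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Assemble a LLaMA 3 style chat prompt from role/content messages."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn marker.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Finish with an open assistant header so the model writes the next turn.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
]
print(format_llama3_prompt(chat))
```

The model stops generating when it emits `<|eot_id|>`, which is how the serving code knows the assistant's turn is over.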

The Evolving Picture

Tokenization is an active area of research. A few directions worth watching:

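Byte-level BPE: BPE whose starting alphabet is the 256 possible byte values rather than Unicode characters, so any input, in any script, can be encoded without unknown tokens. GPT-2 popularized the approach, and tiktoken-based tokenizers such as GPT-4’s and LLaMA 3’s use it.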
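
The base layer of a byte-level scheme can be sketched in a few lines: every string maps to a sequence of integers between 0 and 255, and the mapping is lossless. This shows only the starting alphabet, before any merges are learned. The helper names are made up for the example.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode a list of byte values back into text."""
    return bytes(ids).decode("utf-8")


ids = to_byte_ids("caf\u00e9")   # the word "café"
print(ids)                       # [99, 97, 102, 195, 169]
print(from_byte_ids(ids) == "caf\u00e9")  # True
```

Because the alphabet covers every possible byte, there is no input the tokenizer cannot represent. The cost is that rare characters start out as several tokens each.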

Larger vocabularies: LLaMA 3’s 128K vocabulary vs. LLaMA 2’s 32K makes it meaningfully more efficient for multilingual text.

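Character-level models: Researchers are revisiting character- and byte-level modeling with modern architectures. Mamba and Hyena, which are not Transformers but state-space and long-convolution architectures that scale better to long sequences, show promise for learning directly from characters without the “token lottery.”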

For now, subword BPE/SentencePiece remains the standard — but it’s worth knowing its limitations, because they directly affect what your LLM application can and can’t do well.