GenAI Tokens and Tokenization
Every large language model works with numbers — not words. Before any text reaches the model, it gets broken down into smaller units called tokens. The process of breaking text into tokens is called tokenization. Understanding tokens explains many behaviors of LLMs, including why they sometimes misspell words, how pricing works, and why certain prompts consume more context space than others.
What Is a Token?
A token is a chunk of text — usually a word, part of a word, or a punctuation mark. Tokens are not always full words: common words like "the" or "is" are single tokens, while longer or rarer words may be split into two or more tokens.
Example sentence: "Tokenization is fascinating!"

Tokens:
─────────────────────────────────────────────
"Token" | "ization" | " is" | " fas" | "cin" | "ating" | "!"
   1          2         3       4       5        6       7
─────────────────────────────────────────────
Total: 7 tokens
As a rough rule of thumb, one token equals about 3–4 characters of English text, or approximately 0.75 words.
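That rule of thumb is easy to turn into a quick estimator. The sketch below is a heuristic only — real counts come from the model's actual tokenizer — and `estimate_tokens` is a hypothetical helper name, not a library function:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rules of thumb above.

    Heuristic only; a real tokenizer gives exact counts.
    """
    by_chars = len(text) / 4              # ~3-4 characters per token
    by_words = len(text.split()) / 0.75   # ~0.75 words per token
    # Take the larger estimate so budgeting errs on the safe side.
    return round(max(by_chars, by_words))

print(estimate_tokens("Tokenization is fascinating!"))  # → 7
```

For the example sentence above, the character-based rule happens to land on the same count (7) as the actual split shown in the diagram.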
Why Not Just Use Words?
Using whole words as units creates several problems:
- Vocabulary becomes too large — millions of unique words across all languages
- New words, names, and technical terms would not exist in the vocabulary
- Compound words, prefixes, and suffixes would need separate entries
Token-based approaches solve this. By splitting words into common sub-word pieces, a vocabulary of 50,000–100,000 tokens can represent virtually any text in any language, including code.
How Tokenization Works — BPE Algorithm
Most modern LLMs use an algorithm called Byte Pair Encoding (BPE) to build their token vocabulary. The process works like this:
Step 1: Start with individual characters
        "l", "o", "w", "e", "r"

Step 2: Count most frequent character pairs in the training data
        "lo" appears many times → merge into one token: "lo"

Step 3: Repeat — find next most frequent pair
        "low" appears many times → merge: "low"

Step 4: Continue until vocabulary reaches target size (e.g., 50,000 tokens)

Result: Common words are single tokens, rare words are split into sub-parts
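The merge loop above can be sketched in a few lines of Python. This is a toy character-level version on a tiny made-up corpus; production tokenizers operate on bytes, train on enormous corpora, and handle tie-breaking between equally frequent pairs in implementation-specific ways:

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus (character-level sketch)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus (Step 2).
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into one symbol (Step 3).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

corpus = ["low", "low", "low", "lower", "lowest", "newest"]
print(bpe_merges(corpus, 2))  # → [('l', 'o'), ('lo', 'w')]
```

On this corpus the first two merges are exactly the "lo" then "low" sequence from the walkthrough, because those pairs occur most often.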
Token Examples Across Different Content Types
| Text | Approximate Token Count | Notes |
|---|---|---|
| "Hello!" | 2 | "Hello" + "!" |
| "Artificial intelligence" | 3 | "Artificial" + " intell" + "igence" |
| 1 paragraph (~100 words) | ~130 tokens | English text averages ~0.75 words per token |
| A 300-page novel | ~100,000 tokens | May exceed some model context windows |
| Python code snippet (20 lines) | ~80–120 tokens | Code tokenizes differently than prose |
Tokens and Pricing
When using LLM APIs, pricing is based on token count — not word count. API providers charge separately for input tokens (the prompt sent to the model) and output tokens (the response generated by the model).
API Call Example
─────────────────────────────────────────────────────────
Prompt sent: "Write a 100-word summary of photosynthesis."
→ ~11 input tokens
Response: 100-word summary
→ ~130 output tokens
Total billed: 141 tokens
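The billing arithmetic is straightforward to model. In the sketch below, `api_cost` is a hypothetical helper and the per-1k-token prices are placeholders — real rates vary by provider and model, so check the pricing page:

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one API call when billed per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# The call above: ~11 input tokens, ~130 output tokens,
# at hypothetical rates of $0.50 / $1.50 per 1k tokens.
print(f"${api_cost(11, 130, 0.50, 1.50):.4f}")  # → $0.2005
```

Note that output tokens are typically priced higher than input tokens, which is why verbose responses dominate the bill in many applications.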
Knowing token counts helps optimize costs, especially in high-volume production applications.
How Token Limits Affect Prompts
Every model has a maximum context window measured in tokens. Both the input and the output count toward this limit.
Model Context Window: 8,000 tokens
──────────────────────────────────────────────────────
│ System Prompt │ User Message │ History    │ → Input tokens used
│   ~200 tok    │   ~500 tok   │ ~1,000 tok │
──────────────────────────────────────────────────────
Output space remaining: 8000 - 1700 = 6300 tokens max for response
──────────────────────────────────────────────────────
If the input alone fills the entire context window, the model has no room to generate a response. Managing token usage is an important skill for building efficient applications.
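This budgeting arithmetic is worth automating in any application that assembles prompts from multiple parts. A minimal sketch, with `output_budget` as a hypothetical helper name:

```python
def output_budget(context_window: int, *input_parts: int) -> int:
    """Tokens left for the response after all input parts are counted.

    Returns 0 when the input alone fills (or overflows) the window.
    """
    used = sum(input_parts)
    return max(context_window - used, 0)

# The diagram above: 8,000-token window; system prompt,
# user message, and history use 200 + 500 + 1,000 tokens.
print(output_budget(8000, 200, 500, 1000))  # → 6300
```

A real implementation would count each part with the model's tokenizer rather than passing in estimates, but the clamping-to-zero logic is the important part: it makes overflow explicit instead of silently truncating.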
Tokens and Non-English Languages
English text tokenizes efficiently because many common words map to a single token. Non-English languages and non-Latin scripts (like Hindi, Arabic, Chinese, or Japanese) often use more tokens per word because fewer sub-word patterns exist in the training vocabulary.
English: "Hello, how are you?"   → ~6 tokens
Hindi:   "नमस्ते, आप कैसे हैं?"     → ~15–20 tokens (more byte-level splits)
This means non-English prompts can consume more of the context window and cost more when billed by token count.
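One contributing factor is visible with nothing but the standard library: non-Latin scripts take several bytes per character in UTF-8, and byte-level tokenizers fall back to byte pieces for sequences that never became merged tokens. This check only compares character and byte counts; the token estimates above come from the text, not from this code:

```python
# Compare character vs UTF-8 byte counts for the two greetings above.
# Byte-level tokenizers see the bytes, not the characters.
english = "Hello, how are you?"
hindi = "नमस्ते, आप कैसे हैं?"

print(len(english), len(english.encode("utf-8")))  # ASCII: 1 byte per char
print(len(hindi), len(hindi.encode("utf-8")))      # Devanagari: 3 bytes per char
```

More bytes per character means more raw material for the tokenizer to split, which is part of why the Hindi greeting costs more tokens than the English one.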
Special Tokens
Beyond regular text tokens, models use special tokens for specific purposes:
| Special Token | Purpose |
|---|---|
| <|endoftext|> | Signals the end of a document or response |
| <|system|> | Marks the beginning of system instructions |
| <|user|> | Marks the beginning of user input |
| <|assistant|> | Marks the beginning of the model's response |
| [PAD] | Padding token used during batch processing |
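As an illustration, a raw chat prompt might be assembled from markers like these. The exact special tokens and layout differ across model families (the table's markers are one common style, not a universal standard), and chat APIs normally do this formatting for you; `build_chat_prompt` is a hypothetical helper:

```python
def build_chat_prompt(system: str, user: str) -> str:
    """Assemble a raw prompt string using chat-style special tokens.

    Illustrative only: real formats vary by model family, and each
    marker is encoded as a single reserved token by the tokenizer.
    """
    return (
        f"<|system|>{system}<|endoftext|>"
        f"<|user|>{user}<|endoftext|>"
        f"<|assistant|>"  # left open: the model generates from here
    )

print(build_chat_prompt("You are a helpful assistant.", "What is a token?"))
```

Ending the string at `<|assistant|>` is the key trick: the model continues from that marker, so its completion naturally becomes the assistant's turn.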
Why This Matters for Prompt Writing
Because models work with tokens:
- Typos and unusual spellings may tokenize differently and confuse the model
- Repeating the same information multiple times in a prompt wastes tokens
- Very long prompts may push important instructions outside what the model focuses on
- Code, numbers, and special characters often use more tokens than expected
With tokenization understood, the next critical topic is training data — where the model's knowledge actually comes from, and why data quality shapes every aspect of model behavior.
