GenAI Tokens and Tokenization
Every large language model works with numbers — not words. Before any text reaches the model, it gets broken down into smaller units called tokens. The process of breaking text into tokens is called tokenization. Understanding tokens explains many behaviors of LLMs, including why they sometimes misspell words, how pricing works, and why certain prompts consume more context space than others.
What Is a Token?
A token is a chunk of text — usually a word, part of a word, or a punctuation mark. Tokens are not always full words: common words like "the" or "is" are single tokens, while longer or rarer words may be split into two or more tokens.
Example sentence: "Tokenization is fascinating!"

Tokens:
─────────────────────────────────────────────
"Token" | "ization" | " is" | " fas" | "cin" | "ating" | "!"
   1          2         3       4       5        6       7
─────────────────────────────────────────────
Total: 7 tokens
As a rough rule of thumb, one token equals about 3–4 characters of English text, or approximately 0.75 words.
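That rule of thumb is easy to turn into a quick estimator. The sketch below is a heuristic only — real counts come from the model's actual tokenizer — and `estimate_tokens` is a hypothetical helper name, not a library function:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rules of thumb above.

    Heuristic only; a real tokenizer gives exact counts.
    """
    by_chars = len(text) / 4              # ~3-4 characters per token
    by_words = len(text.split()) / 0.75   # ~0.75 words per token
    # Take the larger estimate so budgeting errs on the safe side.
    return round(max(by_chars, by_words))

print(estimate_tokens("Tokenization is fascinating!"))  # → 7
```

For the example sentence above, the character-based rule happens to land on the same count (7) as the actual split shown in the diagram.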
Why Not Just Use Words?
Using whole words as units creates several problems:
- Vocabulary becomes too large — millions of unique words across all languages
- New words, names, and technical terms would not exist in the vocabulary
- Compound words, prefixes, and suffixes would need separate entries
Token-based approaches solve this. By splitting words into common sub-word pieces, a vocabulary of 50,000–100,000 tokens can represent virtually any text in any language, including code.
How Tokenization Works — BPE Algorithm
Most modern LLMs use an algorithm called Byte Pair Encoding (BPE) to build their token vocabulary. The process works like this:
Step 1: Start with individual characters
        "l", "o", "w", "e", "r"

Step 2: Count most frequent character pairs in the training data
        "lo" appears many times → merge into one token: "lo"

Step 3: Repeat — find next most frequent pair
        "low" appears many times → merge: "low"

Step 4: Continue until vocabulary reaches target size (e.g., 50,000 tokens)

Result: Common words are single tokens, rare words are split into sub-parts
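The merge loop above can be sketched in a few lines of Python. This is a toy character-level version on a tiny made-up corpus; production tokenizers operate on bytes, train on enormous corpora, and handle tie-breaking between equally frequent pairs in implementation-specific ways:

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus (character-level sketch)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus (Step 2).
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into one symbol (Step 3).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

corpus = ["low", "low", "low", "lower", "lowest", "newest"]
print(bpe_merges(corpus, 2))  # → [('l', 'o'), ('lo', 'w')]
```

On this corpus the first two merges are exactly the "lo" then "low" sequence from the walkthrough, because those pairs occur most often.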
Token Examples Across Different Content Types
| Text | Approximate Token Count | Notes |
|---|---|---|
| "Hello!" | 2 | "Hello" + "!" |
| "Artificial intelligence" | 3 | "Artificial" + " intell" + "igence" |
| 1 paragraph (~100 words) | ~130 tokens | English text averages ~0.75 words per token |
| A 300-page novel | ~100,000 tokens | May exceed some model context windows |
| Python code snippet (20 lines) | ~80–120 tokens | Code tokenizes differently than prose |
Tokens and Pricing
When using LLM APIs, pricing is based on token count — not word count. API providers charge separately for input tokens (the prompt sent to the model) and output tokens (the response generated by the model).
API Call Example
─────────────────────────────────────────────────────────
Prompt sent: "Write a 100-word summary of photosynthesis."
→ ~11 input tokens
Response: 100-word summary
→ ~130 output tokens
Total billed: 141 tokens
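The billing arithmetic is straightforward to model. In the sketch below, `api_cost` is a hypothetical helper and the per-1k-token prices are placeholders — real rates vary by provider and model, so check the pricing page:

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one API call when billed per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# The call above: ~11 input tokens, ~130 output tokens,
# at hypothetical rates of $0.50 / $1.50 per 1k tokens.
print(f"${api_cost(11, 130, 0.50, 1.50):.4f}")  # → $0.2005
```

Note that output tokens are typically priced higher than input tokens, which is why verbose responses dominate the bill in many applications.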
Knowing token counts helps optimize costs, especially in high-volume production applications.
How Token Limits Affect Prompts
Every model has a maximum context window measured in tokens. Both the input and the output count toward this limit.
Model Context Window: 8,000 tokens
──────────────────────────────────────────────────────
│ System Prompt │ User Message │ History    │ → Input tokens used
│   ~200 tok    │   ~500 tok   │ ~1,000 tok │
──────────────────────────────────────────────────────
Output space remaining: 8000 - 1700 = 6300 tokens max for response
──────────────────────────────────────────────────────
If the input alone fills the entire context window, the model has no room to generate a response. Managing token usage is an important skill for building efficient applications.
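This budgeting arithmetic is worth automating in any application that assembles prompts from multiple parts. A minimal sketch, with `output_budget` as a hypothetical helper name:

```python
def output_budget(context_window: int, *input_parts: int) -> int:
    """Tokens left for the response after all input parts are counted.

    Returns 0 when the input alone fills (or overflows) the window.
    """
    used = sum(input_parts)
    return max(context_window - used, 0)

# The diagram above: 8,000-token window; system prompt,
# user message, and history use 200 + 500 + 1,000 tokens.
print(output_budget(8000, 200, 500, 1000))  # → 6300
```

A real implementation would count each part with the model's tokenizer rather than passing in estimates, but the clamping-to-zero logic is the important part: it makes overflow explicit instead of silently truncating.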
Tokens and Non-English Languages
English text tokenizes efficiently because many common words map to a single token. Non-English languages and non-Latin scripts (like Hindi, Arabic, Chinese, or Japanese) often use more tokens per word because fewer sub-word patterns exist in the training vocabulary.
English: "Hello, how are you?"   → ~6 tokens
Hindi:   "नमस्ते, आप कैसे हैं?"     → ~15–20 tokens (more byte-level splits)
This means non-English prompts can consume more of the context window and cost more when billed by token count.
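One contributing factor is visible with nothing but the standard library: non-Latin scripts take several bytes per character in UTF-8, and byte-level tokenizers fall back to byte pieces for sequences that never became merged tokens. This check only compares character and byte counts; the token estimates above come from the text, not from this code:

```python
# Compare character vs UTF-8 byte counts for the two greetings above.
# Byte-level tokenizers see the bytes, not the characters.
english = "Hello, how are you?"
hindi = "नमस्ते, आप कैसे हैं?"

print(len(english), len(english.encode("utf-8")))  # ASCII: 1 byte per char
print(len(hindi), len(hindi.encode("utf-8")))      # Devanagari: 3 bytes per char
```

More bytes per character means more raw material for the tokenizer to split, which is part of why the Hindi greeting costs more tokens than the English one.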
Special Tokens
Beyond regular text tokens, models use special tokens for specific purposes:
| Special Token | Purpose |
|---|---|
| <|endoftext|> | Signals the end of a document or response |
| <|system|> | Marks the beginning of system instructions |
| <|user|> | Marks the beginning of user input |
| <|assistant|> | Marks the beginning of the model's response |
| [PAD] | Padding token used during batch processing |
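As an illustration, a raw chat prompt might be assembled from markers like these. The exact special tokens and layout differ across model families (the table's markers are one common style, not a universal standard), and chat APIs normally do this formatting for you; `build_chat_prompt` is a hypothetical helper:

```python
def build_chat_prompt(system: str, user: str) -> str:
    """Assemble a raw prompt string using chat-style special tokens.

    Illustrative only: real formats vary by model family, and each
    marker is encoded as a single reserved token by the tokenizer.
    """
    return (
        f"<|system|>{system}<|endoftext|>"
        f"<|user|>{user}<|endoftext|>"
        f"<|assistant|>"  # left open: the model generates from here
    )

print(build_chat_prompt("You are a helpful assistant.", "What is a token?"))
```

Ending the string at `<|assistant|>` is the key trick: the model continues from that marker, so its completion naturally becomes the assistant's turn.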
Why This Matters for Prompt Writing
Because models work with tokens:
- Typos and unusual spellings may tokenize differently and confuse the model
- Repeating the same information multiple times in a prompt wastes tokens
- Very long prompts may push important instructions outside what the model focuses on
- Code, numbers, and special characters often use more tokens than expected
With tokenization understood, the next critical topic is training data — where the model's knowledge actually comes from, and why data quality shapes every aspect of model behavior.
