Elasticsearch Analyzers and Tokenizers
Analyzers determine how text is transformed before Elasticsearch indexes it, and how your search terms are transformed before they are matched against the index. Understanding analyzers helps you control exactly what matches and what does not.
What Happens When You Index Text
Original text: "The Quick Brown Fox Jumps!"
Analysis pipeline:
Step 1 — Character Filter: remove or replace characters
"The Quick Brown Fox Jumps!"
|
v
Step 2 — Tokenizer: split into words (tokens)
["The", "Quick", "Brown", "Fox", "Jumps"]
|
v
Step 3 — Token Filters: transform tokens
lowercase: ["the", "quick", "brown", "fox", "jumps"]
stop words removed: ["quick", "brown", "fox", "jumps"]
Stored tokens: ["quick", "brown", "fox", "jumps"]
When you search for "QUICK FOX," the same pipeline transforms your query into ["quick", "fox"] — so it matches even though the casing is different.
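The three steps above can be imitated in a few lines of Python. This is only a toy sketch of the idea, not Elasticsearch's implementation: the function names, the regular expression, and the tiny stop-word list are all assumptions made for illustration.

```python
import re

# Illustrative stop list only; a real analyzer's list is configurable.
STOP_WORDS = {"the", "a", "an", "is", "at"}

def char_filter(text):
    # Step 1: nothing needs replacing in this example, so pass the text through.
    return text

def tokenize(text):
    # Step 2: keep runs of letters and digits; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filters(tokens):
    # Step 3: lowercase first, then remove stop words.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("The Quick Brown Fox Jumps!"))  # ['quick', 'brown', 'fox', 'jumps']
print(analyze("QUICK FOX"))                   # ['quick', 'fox']
```

Because the document and the query go through the same `analyze` function, both end up as lowercase tokens, and every query token is found among the stored ones.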
Built-in Analyzers
| Analyzer | What It Does | Best For |
|---|---|---|
| standard | Splits on word boundaries, drops most punctuation, lowercases; does not remove stop words unless configured to | General-purpose text |
| simple | Splits on non-letter characters, lowercases | Text without numbers |
| whitespace | Splits on whitespace only (spaces, tabs, newlines), no lowercasing | Codes, usernames |
| keyword | No analysis — stores the whole string as-is | IDs, categories, exact values |
| english | Standard + stemming (running → run) + English stop words | English articles and blogs |
| language-specific | Hindi, French, German etc. analyzers | Non-English content |
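To make the differences in the table concrete, here is a rough Python approximation of three of the simpler analyzers. The function names are hypothetical and the regular expression only approximates "letter" in the Unicode sense; treat the output as an illustration of the behaviour described above rather than a byte-for-byte match with Elasticsearch.

```python
import re

def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def keyword_analyzer(text):
    # No analysis: the whole input becomes a single token.
    return [text]

sample = "Order-ID AB12 shipped"
print(simple_analyzer(sample))      # ['order', 'id', 'ab', 'shipped']
print(whitespace_analyzer(sample))  # ['Order-ID', 'AB12', 'shipped']
print(keyword_analyzer(sample))     # ['Order-ID AB12 shipped']
```

Note how the simple analyzer loses the digits in "AB12", which is why the table recommends it only for text without numbers.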
Testing an Analyzer
Use the Analyze API to see exactly what tokens an analyzer produces — without indexing any data:
GET /_analyze
{
  "analyzer": "english",
  "text": "The runners are running quickly through forests"
}
Tokens produced:
["runner", "run", "quickli", "through", "forest"]
Note: "The", "are" removed (stop words); "through" is kept because it is not in the default English stop list
"runners" → "runner" (stemming)
"running" → "run" (stemming)
"forests" → "forest" (stemming)
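Stemming means that searching for "run" also matches "running" and "runs," because all three reduce to the stem "run." It does not match "runner" (whose stem is "runner") or the irregular form "ran," which the stemmer leaves unchanged.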
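The same Analyze API call can be prepared from a script. The sketch below uses only the Python standard library and assumes an unsecured local node at `http://localhost:9200`; the helper names `build_analyze_request` and `token_strings` are made up for this example. It uses POST, which the endpoint accepts as well as GET.

```python
import json
import urllib.request

def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    # base_url is an assumption; adjust host, port and authentication as needed.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_strings(response_body):
    # The response holds a "tokens" list; each entry carries a "token" string.
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]

request = build_analyze_request(
    "english", "The runners are running quickly through forests"
)

# Sending it requires a running cluster, so the call is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(token_strings(response.read()))
```

Keeping request construction separate from sending makes it easy to inspect the body before it goes over the wire.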
Tokenizers
A tokenizer is the component that splits text into individual tokens. Every analyzer contains exactly one tokenizer, optionally preceded by character filters and followed by token filters.
| Tokenizer | Splits On | Example Output for "hello-world 123" |
|---|---|---|
| standard | Word boundaries and punctuation | ["hello", "world", "123"] |
| whitespace | Whitespace only | ["hello-world", "123"] |
| keyword | Nothing — whole string is one token | ["hello-world 123"] |
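| ngram | Generates every character sequence from min_gram to max_gram long (defaults 1 and 2) | ["h","he","e","el","l","ll", ...] |
| edge_ngram | Generates prefixes only, anchored at the start of each token | ["h","he","hel","hell","hello", ...] with min_gram 1, max_gram 5, token_chars letter; the defaults produce only ["h","he"] |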
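The ngram row is easier to follow with a small sketch. The function below is a plain-Python illustration of sliding-window n-grams, assuming the tokenizer's default `min_gram` of 1 and `max_gram` of 2 and ignoring the `token_chars` setting; it is not the Lucene implementation.

```python
def ngrams(text, min_gram=1, max_gram=2):
    grams = []
    for start in range(len(text)):
        # From each start position, emit every allowed length that still fits.
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

print(ngrams("hello"))
# ['h', 'he', 'e', 'el', 'l', 'll', 'l', 'lo', 'o']

print(ngrams("hello", min_gram=2, max_gram=2))
# ['he', 'el', 'll', 'lo']
```

The number of tokens grows quickly with the input length and the gram range, which is why n-gram fields take noticeably more index space than ordinary text fields.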
edge_ngram — Powering Autocomplete
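The edge n-gram tokenizer generates the prefixes of each word, from min_gram up to max_gram characters long. With min_gram 1 and a max_gram of at least 6 (the default max_gram is only 2), indexing "laptop" creates the tokens l, la, lap, lapt, lapto, laptop. Any typed prefix then matches as an ordinary term lookup: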
Index: "laptop"
Tokens: ["l","la","lap","lapt","lapto","laptop"]
User types "lap" → matches instantly
User types "lapto" → matches instantly
No wildcard query needed → very fast
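One possible way to set this up is sketched below as a Python dictionary that could be sent as the `settings` of a new index, followed by a helper that mimics the prefixes such a tokenizer would emit for a single word. The names `autocomplete` and `autocomplete_tokenizer` and the `max_gram` of 10 are assumptions chosen for the example, not values from this tutorial.

```python
# Hypothetical index settings; names and max_gram are illustrative choices.
AUTOCOMPLETE_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "autocomplete_tokenizer": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "autocomplete_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}

def edge_ngrams(word, min_gram=1, max_gram=10):
    # Prefixes of one word, capped at max_gram characters.
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]

print(edge_ngrams("laptop"))
# ['l', 'la', 'lap', 'lapt', 'lapto', 'laptop']

print(edge_ngrams("laptop", max_gram=2))
# ['l', 'la']
```

The second call shows what the default `max_gram` of 2 would leave you with: prefixes too short to be useful for autocomplete.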
Custom Analyzer
Build your own analyzer by combining a tokenizer with token filters:
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
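Token filters run in the order they are listed, and the order matters. The stop filter compares tokens case-sensitively unless its `ignore_case` option is turned on, so placing it before `lowercase` lets a capitalised "The" slip through. The toy Python sketch below illustrates the effect; the helper names, the `\w+` stand-in for the standard tokenizer, and the short stop list are assumptions for this example.

```python
import re

STOP_WORDS = {"the", "is", "at", "which"}  # illustrative subset

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    # Case-sensitive comparison, so "The" is not treated as "the".
    return [t for t in tokens if t not in STOP_WORDS]

FILTERS = {"lowercase": lowercase, "stop": stop}

def run_analyzer(text, filter_names):
    tokens = re.findall(r"\w+", text)  # rough stand-in for the standard tokenizer
    for name in filter_names:          # apply filters in the listed order
        tokens = FILTERS[name](tokens)
    return tokens

print(run_analyzer("The Fox", ["lowercase", "stop"]))  # ['fox']
print(run_analyzer("The Fox", ["stop", "lowercase"]))  # ['the', 'fox']
```

Listing `lowercase` first, as the custom analyzer above does, avoids the problem.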
Using Different Analyzers at Index and Search Time
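You can use one analyzer when indexing (to build autocomplete prefixes) and a different one when searching (to match the full token). In the mapping below, "autocomplete" is assumed to be a custom edge n-gram analyzer defined in the index settings: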
"description": {
  "type": "text",
  "analyzer": "autocomplete",        <-- at index time: build prefixes
  "search_analyzer": "standard"      <-- at search time: exact token match
}
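Without a separate search analyzer, the query "laptop" would itself be split into the prefixes l, la, lap, and so on, and would match any document that shares even one of them, such as "lamp." Analyzing the query with the standard analyzer keeps it as a single token: typing "lap" matches the indexed prefix "lap" of "laptop," while typing "laptop" matches only documents containing a word that starts with "laptop."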
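The effect of the search analyzer can be simulated with sets of tokens. The sketch below is a simplified model, assuming single-word documents and a match that succeeds when any query token is also an indexed token (the behaviour of a match query with its default OR operator); the function names are hypothetical.

```python
def edge_ngrams(word, max_gram=10):
    # All prefixes of the word, up to max_gram characters.
    return {word[:n] for n in range(1, min(len(word), max_gram) + 1)}

def matches(doc_word, query_word, search_uses_ngrams):
    indexed = edge_ngrams(doc_word.lower())
    if search_uses_ngrams:
        # Same analyzer at search time: the query is split into prefixes too.
        query_tokens = edge_ngrams(query_word.lower())
    else:
        # Separate search analyzer: the query stays a single token.
        query_tokens = {query_word.lower()}
    return bool(indexed & query_tokens)

print(matches("lamp", "laptop", search_uses_ngrams=True))    # True (false positive)
print(matches("lamp", "laptop", search_uses_ngrams=False))   # False
print(matches("laptop", "lap", search_uses_ngrams=False))    # True
```

With prefixes on both sides, "lamp" and "laptop" share the tokens "l" and "la", which is exactly the unwanted match that `search_analyzer` removes.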
Common Token Filters
| Filter | Effect |
|---|---|
| lowercase | Makes all tokens lowercase |
| stop | Removes common words (the, is, at, which) |
| stemmer | Reduces words to their root form |
| synonym | Maps one word to its synonyms (car → automobile) |
| trim | Removes leading and trailing whitespace |
| unique | Removes duplicate tokens |
Synonym Filter in Action
When users search for "car," they should also find listings that say "automobile" or "vehicle":
"filter": {
  "my_synonyms": {
    "type": "synonym",
    "synonyms": [
      "car, automobile, vehicle",
      "mobile, cellphone, smartphone"
    ]
  }
}
Search: "buy a car"
Also matches: "buy an automobile", "vehicle for sale"
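The comma-separated rule format can be read as "every term in this group is interchangeable." A minimal Python sketch of that expansion follows; it assumes all rules are simple equivalence groups of single words, and the function names are invented for the example.

```python
def build_synonym_map(rules):
    mapping = {}
    for rule in rules:
        group = [term.strip() for term in rule.split(",")]
        for term in group:
            # Each term maps to the full group, itself included.
            mapping.setdefault(term, set()).update(group)
    return mapping

def expand(tokens, mapping):
    # One set of alternatives per token position.
    return [mapping.get(token, {token}) for token in tokens]

RULES = [
    "car, automobile, vehicle",
    "mobile, cellphone, smartphone",
]
SYNONYMS = build_synonym_map(RULES)

for alternatives in expand(["buy", "car"], SYNONYMS):
    print(sorted(alternatives))
# ['buy']
# ['automobile', 'car', 'vehicle']
```

Each position holds a set rather than a flat list because the synonyms occupy the same position as the original word, so phrase order is preserved.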
