Elasticsearch Analyzers and Tokenizers

Analyzers determine how text is turned into tokens, both when Elasticsearch indexes a document and when it processes your search terms. Understanding analyzers helps you control exactly what matches and what does not.

What Happens When You Index Text

Original text: "The Quick Brown Fox Jumps!"

Analysis pipeline:
  Step 1 — Character Filter:  remove or replace characters
            "The Quick Brown Fox Jumps!"
                     |
                     v
  Step 2 — Tokenizer:  split into words (tokens)
            ["The", "Quick", "Brown", "Fox", "Jumps"]
                     |
                     v
  Step 3 — Token Filters:  transform tokens
            lowercase: ["the", "quick", "brown", "fox", "jumps"]
            stop words removed: ["quick", "brown", "fox", "jumps"]

Stored tokens: ["quick", "brown", "fox", "jumps"]

When you search for "QUICK FOX," the same pipeline transforms your query into ["quick", "fox"] — so it matches even though the casing is different.
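The pipeline above can be reproduced with the Analyze API, which this lesson covers in more detail under "Testing an Analyzer." The request below is a minimal sketch: it names the tokenizer and token filters explicitly instead of using a named analyzer, and it leaves out the character filter because the sample text needs no character replacement.

```console
# Sketch: rebuild the example pipeline step by step (no index required)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox Jumps!"
}
```

The response should list the tokens quick, brown, fox, and jumps. Changing the text to "QUICK FOX" should return quick and fox, which is why the query matches the stored tokens.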

Built-in Analyzers

Analyzer	What It Does	Best For
standard	Splits on word boundaries (Unicode text segmentation), drops most punctuation, lowercases; stop-word removal is off by default	General-purpose text in most languages
simple	Splits on non-letter characters, lowercases	Text without numbers
whitespace	Splits on whitespace only, no lowercasing	Codes, usernames
keyword	No analysis — stores the whole string as-is	IDs, categories, exact values
english	Standard + stemming (running → run) + English stop words	English articles and blogs
language-specific	Hindi, French, German etc. analyzers	Non-English content
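To see how the built-in analyzers differ, the same text can be sent through each of them with the Analyze API. The sample text here is an assumption chosen to contain mixed case, a hyphen, a common word, and a number.

```console
# Sketch: one sample text, four built-in analyzers
GET /_analyze
{ "analyzer": "standard", "text": "The QUICK-brown fox 42" }

GET /_analyze
{ "analyzer": "simple", "text": "The QUICK-brown fox 42" }

GET /_analyze
{ "analyzer": "whitespace", "text": "The QUICK-brown fox 42" }

GET /_analyze
{ "analyzer": "keyword", "text": "The QUICK-brown fox 42" }
```

With default settings, standard should return the, quick, brown, fox, 42 (note that "the" is kept); simple should return the, quick, brown, fox (the number is dropped); whitespace should return The, QUICK-brown, fox, 42; and keyword should return the whole string as a single token.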

Testing an Analyzer

Use the Analyze API to see exactly what tokens an analyzer produces — without indexing any data:

GET /_analyze
{
  "analyzer": "english",
  "text": "The runners are running quickly through forests"
}

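Tokens produced:
["runner", "run", "quickli", "through", "forest"]

Note: "The" and "are" removed (stop words); "through" is kept because it is not in the default English stop word list
      "runners" → "runner"  (stemming)
      "running" → "run"     (stemming)
      "quickly" → "quickli" (stemming)
      "forests" → "forest"  (stemming)

Stemming means that searching for "running" also matches documents containing "run" or "runs," because all three reduce to the stem "run." The stemmer is rule-based, so "runner" (stem "runner") and the irregular form "ran" are not matched by a search for "run."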
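A quick way to check which word forms share a stem is to analyze them together. This is a sketch using the same english analyzer; the word list is just an illustration.

```console
# Sketch: compare the stems of related word forms
GET /_analyze
{
  "analyzer": "english",
  "text": "run runs running runner ran"
}
```

The response should contain run three times, followed by runner and ran. Only forms that produce the same token can match each other at search time.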

Tokenizers

A tokenizer is the component that splits text into individual tokens. Every analyzer contains exactly one tokenizer, which runs after any character filters and before the token filters.

Tokenizer	Splits On	Example Output for "hello-world 123"
standard	Word boundaries and punctuation	["hello", "world", "123"]
whitespace	Whitespace only	["hello-world", "123"]
keyword	Nothing — whole string is one token	["hello-world 123"]
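ngram	Generates sliding character sequences of min_gram to max_gram characters (defaults 1 and 2)	With min_gram 2, max_gram 2, token_chars [letter, digit]: ["he","el","ll","lo","wo","or","rl","ld","12","23"]
edge_ngram	Generates prefixes only, from min_gram to max_gram characters (defaults 1 and 2)	With min_gram 1, max_gram 5, token_chars [letter, digit]: ["h","he","hel","hell","hello","w","wo","wor","worl","world","1","12","123"]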
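Tokenizers that take parameters can be tried out by passing an inline definition to the Analyze API. The sketch below configures an ngram tokenizer with trigrams; the min_gram, max_gram, and token_chars values are assumptions picked for the demonstration, not recommended settings.

```console
# Sketch: inline ngram tokenizer producing 3-character sequences
GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": ["letter", "digit"]
  },
  "text": "hello-world 123"
}
```

Because token_chars limits tokens to letters and digits, the hyphen and the space act as separators, and the response should contain hel, ell, llo, wor, orl, rld, and 123.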

edge_ngram — Powering Autocomplete

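The edge_ngram tokenizer emits prefixes of the text (or of each word, when token_chars is set), from min_gram up to max_gram characters. The defaults are 1 and 2, so raise max_gram for autocomplete. With max_gram set to 6 or more, indexing "laptop" creates the tokens l, la, lap, lapt, lapto, laptop, and any typed prefix then matches as an ordinary term: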

Index:  "laptop"
Tokens: ["l","la","lap","lapt","lapto","laptop"]

User types "lap" → matches the stored token "lap"
User types "lapto" → matches the stored token "lapto"
No wildcard or prefix query needed → each lookup is a plain term match, at the cost of a larger index
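The prefix tokens can be inspected without creating an index by passing an inline edge_ngram definition to the Analyze API. This is a sketch: the max_gram of 10 and the added lowercase filter are assumptions for the demonstration.

```console
# Sketch: inline edge_ngram tokenizer, lowercased afterwards
GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 10,
    "token_chars": ["letter"]
  },
  "filter": ["lowercase"],
  "text": "Laptop"
}
```

The response should contain l, la, lap, lapt, lapto, and laptop. Leaving max_gram at its default of 2 would stop after l and la.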

Custom Analyzer

Build your own analyzer by combining optional character filters, one tokenizer, and any number of token filters. Token filters run in the order listed:

PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
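Once the index exists, the custom analyzer can be checked by calling the Analyze API on that index. The sketch below assumes the products index was created with the request above; the sample text is an illustration.

```console
# Sketch: run the custom analyzer defined on the products index
GET /products/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "The runners are running"
}
```

With the default settings of the stop and stemmer filters (both English), the response should contain runner and run. The words "The" and "are" are lowercased first and then removed as stop words.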

Using Different Analyzers at Index and Search Time

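You can use one analyzer when indexing (to build autocomplete prefixes) and a different one when searching (to keep the typed text as whole tokens). In the mapping below, "autocomplete" is assumed to be a custom analyzer built on an edge_ngram tokenizer and defined in the index settings. The "analyzer" setting applies at index time and "search_analyzer" applies at search time:

"description": {
  "type": "text",
  "analyzer": "autocomplete",
  "search_analyzer": "standard"
}

Without the separate search analyzer, the query "laptop" would itself be broken into l, la, lap, and so on, and would match unrelated documents such as "lamp" that share only the first letter or two.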
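A complete version of this setup might look like the sketch below. The index name autocomplete_demo, the tokenizer name prefix_tokenizer, the max_gram of 10, and the sample document are all assumptions made for illustration.

```console
# Sketch: edge_ngram at index time, standard analyzer at search time
PUT /autocomplete_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "prefix_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "prefix_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

# Index one sample document and make it searchable immediately
PUT /autocomplete_demo/_doc/1?refresh=true
{ "description": "Laptop stand" }

# The query text is analyzed with the standard analyzer, so it stays "lap"
GET /autocomplete_demo/_search
{
  "query": { "match": { "description": "lap" } }
}
```

The search should return the sample document, because "lap" is one of the prefix tokens stored for "Laptop." A search for "lamp" should return nothing, since the query is not expanded into its own prefixes.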

Common Token Filters

Filter	Effect
lowercase	Makes all tokens lowercase
stop	Removes common words (the, is, at, and)
stemmer	Reduces words to their stem (running → run)
synonym	Maps one word to its synonyms (car → automobile)
trim	Removes leading and trailing whitespace from each token
unique	Removes duplicate tokens
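Token filters can be chained and tested directly in the Analyze API. The sketch below combines lowercase and unique; the whitespace tokenizer and the sample text are assumptions for the demonstration.

```console
# Sketch: lowercase first, then drop duplicate tokens
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "unique"],
  "text": "Fox fox FOX jumps"
}
```

The response should contain only fox and jumps. Reversing the filter order would keep all three spellings of "fox," because unique would run before they are lowercased into the same token.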

Synonym Filter in Action

When users search for "car," they should also find listings that say "automobile" or "vehicle":

"filter": {
  "my_synonyms": {
    "type": "synonym",
    "synonyms": [
      "car, automobile, vehicle",
      "mobile, cellphone, smartphone"
    ]
  }
}

Search: "buy a car"
Also matches: "buy an automobile", "vehicle for sale"
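The filter fragment above only takes effect once it is part of an analyzer. A minimal sketch of the full wiring follows; the index name listings, the analyzer name synonym_analyzer, and the title field are hypothetical names chosen for illustration.

```console
# Sketch: register the synonym filter and use it in a custom analyzer
PUT /listings
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "car, automobile, vehicle",
            "mobile, cellphone, smartphone"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "synonym_analyzer" }
    }
  }
}

# Inspect how the analyzer expands the word "car"
GET /listings/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "buy a car"
}
```

The response should include automobile and vehicle at the same position as car. Placing lowercase before the synonym filter means the synonym rules only need to be written in lowercase.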
