Elasticsearch Analyzers and Tokenizers

Analyzers determine how text is transformed before Elasticsearch indexes it, and how your search terms are transformed before they are matched against the index. Understanding analyzers helps you control exactly what matches and what does not.

What Happens When You Index Text

Original text: "The Quick Brown Fox Jumps!"

Analysis pipeline:
  Step 1 — Character Filter:  remove or replace characters
            "The Quick Brown Fox Jumps!"
                     |
                     v
  Step 2 — Tokenizer:  split into words (tokens)
            ["The", "Quick", "Brown", "Fox", "Jumps"]
                     |
                     v
  Step 3 — Token Filters:  transform tokens
            lowercase: ["the", "quick", "brown", "fox", "jumps"]
            stop words removed: ["quick", "brown", "fox", "jumps"]

Stored tokens: ["quick", "brown", "fox", "jumps"]

When you search for "QUICK FOX," the same pipeline transforms your query into ["quick", "fox"] — so it matches even though the casing is different.
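The three steps above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how Elasticsearch is implemented: the function names, the tag-stripping character filter, and the tiny stop-word set are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word set (an assumption, not the real default list).
STOP_WORDS = {"the", "a", "an", "and", "is"}

def char_filter(text: str) -> str:
    # Step 1: character-level cleanup; here we only strip HTML-like tags.
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text: str) -> list[str]:
    # Step 2: split into word tokens, dropping punctuation.
    return re.findall(r"\w+", text)

def token_filters(tokens: list[str]) -> list[str]:
    # Step 3: lowercase first, then remove stop words.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text: str) -> list[str]:
    return token_filters(tokenize(char_filter(text)))

indexed = analyze("The Quick Brown Fox Jumps!")
query = analyze("QUICK FOX")
print(indexed)  # ['quick', 'brown', 'fox', 'jumps']
print(query)    # ['quick', 'fox']
print(all(term in indexed for term in query))  # True
```

Because the document and the query pass through the same analyze function, the uppercase query ends up as tokens that already exist in the indexed list.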

Built-in Analyzers

Analyzer            What It Does                                                      Best For
standard            Splits on word boundaries, lowercases; no stop words by default   General text
simple              Splits on non-letter characters, lowercases                       Text without numbers
whitespace          Splits on whitespace only, no lowercasing                         Codes, usernames
keyword             No analysis; keeps the whole string as one token                  IDs, categories, exact values
english             Standard + English stop words + stemming (running → run)          English articles and blogs
language-specific   Analyzers such as hindi, french, german                           Non-English content

Testing an Analyzer

Use the Analyze API to see exactly what tokens an analyzer produces — without indexing any data:

GET /_analyze
{
  "analyzer": "english",
  "text": "The runners are running quickly through forests"
}
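Tokens produced (simplified; the actual response is a JSON list of token objects):
["runner", "run", "quickli", "through", "forest"]

Note: "The", "are" removed (stop words)
      "through" kept (not in the default English stop list)
      "runners" → "runner"  (stemming)
      "running" → "run"     (stemming)
      "quickly" → "quickli" (stemming)
      "forests" → "forest"  (stemming)

Stemming means that searching for "running" also matches documents containing "run" or "runs," because all three reduce to the token "run." It does not unify every related word: "runner" stems to "runner," and irregular forms such as "ran" are left unchanged.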
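To see which words of the sample sentence count as stop words, the check can be reproduced in plain Python. The word list below is the default English stop list as documented for Lucene and Elasticsearch; treat it as an assumption and verify it against your version. The helper name split_stop_words is made up for this sketch, and no stemming is applied.

```python
# Default English stop words as documented for Lucene/Elasticsearch
# (the `_english_` list); verify against the version you run.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def split_stop_words(text: str) -> tuple[list[str], list[str]]:
    """Return (kept, removed) lowercase words; no stemming is applied."""
    words = text.lower().split()
    kept = [w for w in words if w not in ENGLISH_STOP_WORDS]
    removed = [w for w in words if w in ENGLISH_STOP_WORDS]
    return kept, removed

kept, removed = split_stop_words(
    "The runners are running quickly through forests"
)
print(kept)     # ['runners', 'running', 'quickly', 'through', 'forests']
print(removed)  # ['the', 'are']
```

Note that "through" survives: the default list is short, so add your own stop words if you need more aggressive filtering.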

Tokenizers

A tokenizer is the component that splits text into individual tokens. Every analyzer contains exactly one tokenizer, optionally preceded by character filters and followed by token filters.

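Tokenizer     Splits On                                  Example Output for "hello-world 123"
standard      Word boundaries; drops most punctuation    ["hello", "world", "123"]
whitespace    Whitespace only                            ["hello-world", "123"]
keyword       Nothing; the whole string is one token     ["hello-world 123"]
ngram         Sliding character windows                  ["he","el","ll","lo","wo","or","rl","ld","12","23"]
edge_ngram    Prefixes from the start of each token      ["h","he","hel","hell","hello"]

Note: the ngram row assumes min_gram 2, max_gram 2, and token_chars set to letter and digit; the edge_ngram row assumes min_gram 1 and max_gram 5. Both tokenizers default to min_gram 1 and max_gram 2, so set these parameters explicitly.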
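The differences between these tokenizers are easier to see side by side. The functions below are rough plain-Python stand-ins, not the real implementations: standard_like only approximates the standard tokenizer with a regular expression, and the ngram and edge_ngram sketches use illustrative defaults (2/2 and 1/5) rather than Elasticsearch's own defaults.

```python
import re

def standard_like(text: str) -> list[str]:
    # Rough stand-in for the standard tokenizer: runs of letters/digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def whitespace(text: str) -> list[str]:
    # Split on whitespace only; punctuation stays attached.
    return text.split()

def keyword(text: str) -> list[str]:
    # The whole input is a single token.
    return [text]

def ngram(text: str, min_gram: int = 2, max_gram: int = 2) -> list[str]:
    # Character windows inside each run of letters/digits, similar in
    # spirit to token_chars = ["letter", "digit"].
    grams = []
    for chunk in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    grams.append(chunk[start:start + size])
    return grams

def edge_ngram(text: str, min_gram: int = 1, max_gram: int = 5) -> list[str]:
    # Prefixes of the whole input (no token_chars splitting).
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]

text = "hello-world 123"
print(standard_like(text))  # ['hello', 'world', '123']
print(whitespace(text))     # ['hello-world', '123']
print(keyword(text))        # ['hello-world 123']
print(ngram(text))          # ['he', 'el', 'll', 'lo', 'wo', 'or', 'rl', 'ld', '12', '23']
print(edge_ngram(text))     # ['h', 'he', 'hel', 'hell', 'hello']
```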

edge_ngram — Powering Autocomplete

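The edge_ngram tokenizer generates prefixes of each word, from min_gram up to max_gram characters long. The defaults (1 and 2) only produce "l" and "la," so raise max_gram for autocomplete. With max_gram set to 6 or more, indexing "laptop" creates the tokens l, la, lap, lapt, lapto, laptop. Typing any of these prefixes then matches with a plain term lookup: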

Index:  "laptop"
Tokens: ["l","la","lap","lapt","lapto","laptop"]

User types "lap" → matches instantly
User types "lapto" → matches instantly
No wildcard query needed → very fast
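The reason this is fast can be sketched with a toy inverted index in plain Python: every prefix is stored as its own key at index time, so a lookup for the typed text is a single dictionary access. The product names "lamp" and "tablet," the max_gram value of 10, and the helper names are assumptions made for this illustration.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    # All prefixes of `word` between min_gram and max_gram characters long.
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

# Toy inverted index: prefix token -> set of product names containing it.
index: dict[str, set[str]] = {}
for name in ["laptop", "lamp", "tablet"]:
    for token in edge_ngrams(name):
        index.setdefault(token, set()).add(name)

def suggest(typed: str) -> list[str]:
    # One dictionary lookup; no scan over all terms, no wildcard.
    return sorted(index.get(typed.lower(), set()))

print(edge_ngrams("laptop"))  # ['l', 'la', 'lap', 'lapt', 'lapto', 'laptop']
print(suggest("la"))          # ['lamp', 'laptop']
print(suggest("lap"))         # ['laptop']
```

The trade-off is index size: each word is stored once per prefix, so a sensible max_gram keeps the index from growing more than necessary.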

Custom Analyzer

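Build your own analyzer by combining optional character filters, one tokenizer, and any number of token filters. Token filters run in the order listed, so "lowercase" should come before "stop":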

PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
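The order of the entries in "filter" matters because each filter receives the output of the previous one. The plain-Python sketch below imitates that chaining with two made-up filter functions and a tiny illustrative stop-word set; like the stop filter with its default settings, the stop step here is case-sensitive.

```python
# Tiny illustrative stop-word set (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is"}

def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def stop(tokens: list[str]) -> list[str]:
    # Case-sensitive comparison: "The" is not the same as "the".
    return [t for t in tokens if t not in STOP_WORDS]

def run_filters(tokens: list[str], filters) -> list[str]:
    # Apply each filter in order, feeding the result to the next one.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["The", "Quick", "Fox"]
print(run_filters(tokens, [lowercase, stop]))  # ['quick', 'fox']
print(run_filters(tokens, [stop, lowercase]))  # ['the', 'quick', 'fox']
```

With "stop" first, the capitalized "The" slips past the stop list and is only lowercased afterwards, which is why the analyzer above lists "lowercase" first.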

Using Different Analyzers at Index and Search Time

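You can use one analyzer when indexing (to build autocomplete prefixes) and a different one when searching (to keep the typed text as whole tokens). In the mapping below, "analyzer" applies at index time and "search_analyzer" applies at search time; "autocomplete" is assumed to be a custom edge_ngram-based analyzer defined in the index settings:

"description": {
  "type": "text",
  "analyzer": "autocomplete",
  "search_analyzer": "standard"
}

Without a separate search analyzer, the query itself would be split into prefixes: typing "laptop" would search for "l," "la," "lap," and so on, and would match any document that shares even a one-letter prefix. With "standard" as the search analyzer, the query stays a single token, "laptop," which is compared against the indexed prefixes.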
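The effect of the search analyzer can be imitated in plain Python. The sketch below indexes three made-up one-word documents as edge n-grams and then runs the same query twice: once split into prefixes (as if the index-time analyzer were reused) and once as a whole token. The any-token-matches rule mirrors the default OR behavior of a match query; the document contents and function names are assumptions for illustration.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    # All prefixes of `word` between min_gram and max_gram characters long.
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

# Index time: every document is stored as its set of prefix tokens.
docs = {1: "laptop", 2: "lamp", 3: "lion"}
index = {doc_id: set(edge_ngrams(word)) for doc_id, word in docs.items()}

def search(query_tokens: list[str]) -> list[int]:
    # A document matches if it shares at least one token with the query.
    wanted = set(query_tokens)
    return sorted(doc_id for doc_id, tokens in index.items() if tokens & wanted)

# Query analyzed with edge n-grams: "l" alone matches every document.
print(search(edge_ngrams("laptop")))  # [1, 2, 3]

# Query kept as one token, as a standard search analyzer would do.
print(search(["laptop"]))  # [1]
print(search(["la"]))      # [1, 2]
```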

Common Token Filters

Filter      Effect
lowercase   Makes all tokens lowercase
stop        Removes stop words (by default the English list: the, is, at, and, ...)
stemmer     Reduces words to their root form
synonym     Maps a word to its synonyms (car → automobile)
trim        Removes leading and trailing whitespace from each token
unique      Removes duplicate tokens

Synonym Filter in Action

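When users search for "car," they should also find listings that say "automobile" or "vehicle." The fragment below defines a custom synonym filter; it belongs under "settings.analysis" in the index definition and takes effect once an analyzer lists "my_synonyms" in its "filter" array: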

"filter": {
  "my_synonyms": {
    "type": "synonym",
    "synonyms": [
      "car, automobile, vehicle",
      "mobile, cellphone, smartphone"
    ]
  }
}

Search: "buy a car"
Also matches: "buy an automobile", "vehicle for sale"
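The expansion behind this can be sketched in plain Python, using the same comma-separated rule format as the filter definition above. This is a simplification under stated assumptions: it handles single-word synonyms only, treats each rule as a group of equivalents, and returns a flat list, whereas the real filter places synonyms at the same token position. The function names are made up for this sketch.

```python
SYNONYM_RULES = [
    "car, automobile, vehicle",
    "mobile, cellphone, smartphone",
]

def build_synonym_map(rules: list[str]) -> dict[str, list[str]]:
    # Each word in a rule maps to the full group of equivalent words.
    mapping: dict[str, list[str]] = {}
    for rule in rules:
        group = [word.strip() for word in rule.split(",")]
        for word in group:
            mapping[word] = group
    return mapping

def expand(tokens: list[str], mapping: dict[str, list[str]]) -> list[str]:
    # Replace each token with its synonym group, or keep it unchanged.
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded

mapping = build_synonym_map(SYNONYM_RULES)
print(expand(["buy", "car"], mapping))
# ['buy', 'car', 'automobile', 'vehicle']
print(expand(["cheap", "smartphone"], mapping))
# ['cheap', 'mobile', 'cellphone', 'smartphone']
```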
