Elasticsearch Shards and Replicas
Shards and replicas are how Elasticsearch splits data across multiple nodes, handles failures, and scales to billions of documents. Getting shard settings right is one of the most important production decisions.
What a Shard Is
An index is too large to store on a single node. Elasticsearch divides it into smaller pieces called shards. Each shard is a fully functional Lucene index — it can store documents and handle searches independently.
Index: products (10 million documents) Without sharding: Node1: [ ALL 10 million documents ] ← slow, limited by one machine With 5 shards: Node1: [ Shard 1: 2M docs ] [ Shard 4: 2M docs ] Node2: [ Shard 2: 2M docs ] Node3: [ Shard 3: 2M docs ] [ Shard 5: 2M docs ] Search: all 5 shards searched in parallel → 5× faster
Primary Shards vs Replica Shards
| Type | Role | Can Serve Searches? |
|---|---|---|
| Primary Shard | Receives writes (new documents), authoritative copy | Yes |
| Replica Shard | Exact copy of a primary shard for backup and read scaling | Yes |
3 Primary Shards, 1 Replica each: Node1: [P1] [R2] Node2: [P2] [R3] Node3: [P3] [R1] P = primary, R = replica (copy of the matching primary) Elasticsearch ensures: P1 and R1 are NEVER on the same node
If Node1 goes down, R1 on Node3 gets promoted to primary. No data is lost and search continues without interruption.
Replica Failure Recovery — Step by Step
Normal state: Node1: [P1][R2] Node2: [P2][R3] Node3: [P3][R1] Node1 goes down: Node2: [P2][R3] Node3: [P3][R1] Master notices P1 is missing Master promotes R1 (on Node3) to Primary: Node2: [P2][R3] Node3: [P3][P1] Master creates a new replica for P2 on Node3: Node2: [P2][R3] Node3: [P3][P1][R2] Cluster returns to GREEN status ✔
Setting Shard Count When Creating an Index
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
Primary shard count cannot be changed after the index is created. Replica count can be changed at any time.
How to Choose the Right Shard Count
A commonly used rule is to keep each shard between 10 GB and 50 GB. If your index will hold 150 GB of data, use 3 to 5 primary shards.
Data size guide: Expected data: 30 GB → 1 primary shard Expected data: 150 GB → 3–5 primary shards Expected data: 1 TB → 20–25 primary shards Expected data: 5 TB → 100+ primary shards (consider multiple indexes)
Having too many small shards wastes memory. Each shard uses about 10 MB of heap memory just to exist, regardless of how much data it holds.
Oversharding Problem
Bad design: 1,000 indexes × 5 shards × 2 replicas = 15,000 shards Each shard: ~10 MB heap = 150 GB heap overhead → Cluster runs out of memory before storing real data Good design: Time-based indexes (logs-2024-01, logs-2024-02) 1 or 2 shards per monthly index → Manageable overhead, easy to delete old data
Changing Replica Count Live
PUT /products/_settings
{
"number_of_replicas": 2
}
Increasing replicas improves read throughput and redundancy. Decreasing replicas reduces storage costs when data loss risk is acceptable (for example, logs that can be rebuilt).
Shard Allocation Awareness
Tell Elasticsearch which physical rack or availability zone each node sits in, so it distributes replicas across zones instead of putting both a primary and its replica on the same physical hardware:
In elasticsearch.yml on each node: node.attr.zone: zone-a (or zone-b, zone-c) In cluster settings: cluster.routing.allocation.awareness.attributes: zone Result: zone-a: [P1] [P2] zone-b: [R1] [R2] If the entire zone-a fails, zone-b still has all replicas ✔
Shrink and Split APIs
If an index has too many shards after data shrinks, use the Shrink API to combine them. If an index has too few shards for growing data, use the Split API to divide them without reindexing all documents.
# Shrink 5 shards → 1 shard
POST /products/_shrink/products_shrunk
{
"settings": {
"number_of_shards": 1
}
}
# Split 1 shard → 5 shards
POST /products/_split/products_split
{
"settings": {
"number_of_shards": 5
}
}
