Elasticsearch Monitoring and the ELK Stack
Elasticsearch does not operate in isolation. The ELK Stack — Elasticsearch, Logstash, and Kibana — forms a complete pipeline for collecting, processing, storing, and visualizing data. Monitoring keeps your cluster healthy and alerts you before problems become outages.
The ELK Stack Architecture
Data Sources                Collection              Storage + Search            Visualization
(servers, apps, devices)
                          +------------+
Web Server Logs  -------> |            |
Application Logs -------> |  Logstash  | ---------> [ Elasticsearch ] ---------> [ Kibana ]
System Metrics   -------> |  or Beats  |            (stores + indexes)           (dashboards
Database Logs    -------> |            |                                          charts
                          +------------+                                          alerts)
Each Component's Role
| Component | Job |
|---|---|
| Beats | Lightweight agents that collect data from servers (CPU, logs, network) |
| Logstash | Heavy-duty pipeline: receive, transform, and forward data |
| Elasticsearch | Store, index, and search all ingested data |
| Kibana | Browser-based UI for search, dashboards, alerting, and management |
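To make the "receive, transform, and forward" step concrete, here is a toy sketch in Python of the transform stage: it turns one raw access-log line into a structured document of the kind Elasticsearch would index. This is not Logstash itself; the sample line, the regular expression, and the field names (`client_ip`, `status_code`, and so on) are assumptions chosen for illustration.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical access-log line in the common log format.
SAMPLE_LINE = (
    '203.0.113.7 - - [15/Jun/2024:10:12:01 +0000] '
    '"GET /checkout HTTP/1.1" 500 1043'
)

# Named groups become the field names of the structured document.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status_code>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log(line: str) -> Optional[Dict[str, Any]]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc: Dict[str, Any] = match.groupdict()
    # Convert numeric fields so they can be indexed as numbers, not strings.
    doc["status_code"] = int(doc["status_code"])
    doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
    return doc


print(parse_access_log(SAMPLE_LINE))
```

In a real pipeline this parsing would normally be done by a Logstash filter or an ingest pipeline rather than hand-written code; the point is only that raw text goes in and typed fields come out.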
Beats — Lightweight Data Shippers
Beats are small programs you install on servers. They collect specific types of data and forward them to Elasticsearch or Logstash:
| Beat | Collects |
|---|---|
| Filebeat | Log files from disk |
| Metricbeat | CPU, memory, disk, network stats |
| Packetbeat | Network traffic and protocols |
| Auditbeat | Linux audit framework events |
| Heartbeat | Uptime monitoring for URLs and services |
Monitoring Elasticsearch Itself
The Stack Monitoring feature in Kibana shows real-time health of your Elasticsearch cluster:
Kibana Stack Monitoring shows:
  Cluster Health: GREEN / YELLOW / RED
  Nodes: 3 active nodes
  Shards: 24 total, 0 unassigned
  JVM Heap Used: 62%
  Index Rate: 1,240 docs/sec
  Search Rate: 342 queries/sec
  Disk Used: 340 GB / 500 GB (68%)
For production, send the metrics to a monitoring cluster: a separate small Elasticsearch cluster that stores the metrics from your main cluster. Avoid storing monitoring data in the cluster you are monitoring, because an outage there would take the monitoring data down with it.
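Several of the headline values are also available from the cluster health API (`GET _cluster/health`). The sketch below, in Python, condenses such a response body into a one-line summary. The `sample` dictionary is a hypothetical response trimmed to the fields used here, and the function name is an assumption for illustration.

```python
from typing import Any, Mapping


def summarize_cluster_health(health: Mapping[str, Any]) -> str:
    """Condense a _cluster/health response body into a one-line summary."""
    status = health["status"].upper()
    nodes = health["number_of_nodes"]
    active = health["active_shards"]
    unassigned = health["unassigned_shards"]
    return f"{status}: {nodes} nodes, {active} active shards, {unassigned} unassigned"


# Hypothetical response body, trimmed to the fields used above.
sample = {
    "cluster_name": "demo",
    "status": "green",
    "number_of_nodes": 3,
    "active_shards": 24,
    "unassigned_shards": 0,
}

print(summarize_cluster_health(sample))
# GREEN: 3 nodes, 24 active shards, 0 unassigned
```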
Key Metrics to Watch
| Metric | Warning Threshold | What it Signals |
|---|---|---|
| JVM Heap Used | Over 75% | Risk of garbage collection pauses |
| Disk Usage | Over 85% | At 85% (default low watermark) no new shards are allocated to the node; at 95% (default flood-stage watermark) its indices become read-only |
| Unassigned Shards | Any | Data may be unavailable |
| Search Latency (p99) | Over 1 second | Queries are slow — check field mappings and shard count |
| Indexing Latency | Spikes | Possible merge pressure or I/O bottleneck |
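The fixed thresholds in the table can be expressed as a small check. This is a minimal sketch in Python, assuming the metric values have already been collected; the function name and parameters are hypothetical, and indexing latency is left out because "spikes" has no single threshold.

```python
from typing import List


def check_metrics(
    heap_pct: float,
    disk_pct: float,
    unassigned_shards: int,
    search_p99_ms: float,
) -> List[str]:
    """Return one warning per metric that is over its threshold."""
    warnings: List[str] = []
    if heap_pct > 75:
        warnings.append(f"JVM heap at {heap_pct}%: risk of garbage collection pauses")
    if disk_pct > 85:
        warnings.append(f"Disk at {disk_pct}%: approaching the write-blocking limit")
    if unassigned_shards > 0:
        warnings.append(f"{unassigned_shards} unassigned shards: data may be unavailable")
    if search_p99_ms > 1000:
        warnings.append(f"Search p99 at {search_p99_ms} ms: check mappings and shard count")
    return warnings


# Values from the Stack Monitoring example: nothing is over a threshold.
healthy = check_metrics(heap_pct=62, disk_pct=68, unassigned_shards=0, search_p99_ms=400)

# Hypothetical stressed cluster: every check trips.
stressed = check_metrics(heap_pct=80, disk_pct=90, unassigned_shards=2, search_p99_ms=1500)

print(healthy)
for warning in stressed:
    print(warning)
```

The comparisons are strict, so a value sitting exactly on a threshold (75% heap, for example) does not raise a warning, matching the "Over" wording in the table.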
Kibana Dashboards
Kibana dashboards let you build visual reports from your Elasticsearch data without writing code. Drag and drop charts onto a dashboard:
Example: Web Traffic Dashboard
  [Line Chart: Requests per minute over last 24 hours]
  [Bar Chart: Top 10 pages by visits]
  [Pie Chart: Traffic by country]
  [Data Table: Top error codes with count and % share]
  [Metric: Total unique visitors today]
Save dashboards and share them with your team via URL. Kibana respects Elasticsearch's role-based access control — a user only sees data their role permits.
Kibana Alerting
Alerting rules check conditions on a schedule and fire actions — email, Slack, webhook — when conditions are met:
Example Alert Rule:
  Name: "High Error Rate Alert"
  Check: Every 5 minutes
  Condition: HTTP 500 errors > 50 in last 10 minutes
  Action: Send email to ops-team@company.com

Example Alert Rule:
  Name: "Disk Space Warning"
  Check: Every 1 hour
  Condition: Any node disk usage > 80%
  Action: Post to Slack #alerts channel
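The condition in the first rule is a count over a sliding time window. The Python sketch below shows that logic in isolation. It is not how Kibana implements rules; the function name and the in-memory list of error timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_fire(
    error_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=10),
    threshold: int = 50,
) -> bool:
    """Return True when more than `threshold` errors fall inside the window."""
    cutoff = now - window
    recent = sum(1 for t in error_times if cutoff < t <= now)
    return recent > threshold


# Hypothetical data: one HTTP 500 per second for the last 51 seconds.
now = datetime(2024, 6, 15, 12, 0, 0)
errors = [now - timedelta(seconds=i) for i in range(51)]

print(should_fire(errors, now))       # True: 51 errors in the window
print(should_fire(errors[:50], now))  # False: exactly 50 is not "> 50"
```

Running the check every 5 minutes over a 10-minute window means consecutive checks overlap, so a short burst of errors is seen by two evaluations rather than possibly falling between them.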
Snapshot and Restore — Backup Strategy
Replicas protect against node failures but not against accidental deletion. Snapshots back up indexes to external storage:
# Register a repository (S3 example)
PUT /_snapshot/my_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "ap-south-1"
  }
}

# Take a snapshot of two indices
PUT /_snapshot/my_s3_backup/snapshot_2024_06_15
{
  "indices": ["products", "orders"],
  "ignore_unavailable": true
}

# Restore one index from the snapshot
# (an existing "products" index must be closed or deleted first)
POST /_snapshot/my_s3_backup/snapshot_2024_06_15/_restore
{
  "indices": ["products"]
}
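If snapshots are triggered from a script, the date-stamped name can be generated instead of typed by hand. The Python sketch below builds the request path and body for the create-snapshot call shown above. It only constructs the request and sends nothing; the function name is hypothetical.

```python
from datetime import date
from typing import Any, Dict, List, Tuple


def snapshot_request(
    repository: str, day: date, indices: List[str]
) -> Tuple[str, Dict[str, Any]]:
    """Build the path and JSON body for a create-snapshot request."""
    name = f"snapshot_{day:%Y_%m_%d}"  # e.g. snapshot_2024_06_15
    path = f"/_snapshot/{repository}/{name}"
    body = {"indices": indices, "ignore_unavailable": True}
    return path, body


path, body = snapshot_request("my_s3_backup", date(2024, 6, 15), ["products", "orders"])
print("PUT", path)
print(body)
```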
Automate daily snapshots using Snapshot Lifecycle Management (SLM), which applies the idea behind Index Lifecycle Management (ILM) to backups. SLM creates, retains, and deletes snapshots on a schedule without manual work.
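As a rough illustration, the Python sketch below assembles a request body for an SLM policy (sent with `PUT _slm/policy/<policy-id>`). The schedule, retention values, and function name are assumptions picked for the example, not recommendations; the repository and index names reuse those from the snapshot example.

```python
import json
from typing import Any, Dict, List


def build_slm_policy(
    repository: str, indices: List[str], keep_days: int = 30
) -> Dict[str, Any]:
    """Build a request body for a daily SLM policy (illustrative values)."""
    return {
        # Cron expression: every day at 01:30.
        "schedule": "0 30 1 * * ?",
        # Date math gives each snapshot a name containing the current day.
        "name": "<daily-snap-{now/d}>",
        "repository": repository,
        "config": {
            "indices": indices,
            "ignore_unavailable": True,
        },
        # Delete snapshots older than keep_days, but always keep at least 5
        # and never more than 50.
        "retention": {
            "expire_after": f"{keep_days}d",
            "min_count": 5,
            "max_count": 50,
        },
    }


policy = build_slm_policy("my_s3_backup", ["products", "orders"])
print(json.dumps(policy, indent=2))
```

The `retention` block is what lets SLM delete old snapshots on its own; without it the policy only creates snapshots and they accumulate.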
Logs in Kibana Discover
Kibana Discover lets engineers search log data interactively:
Workflow:
  1. Select time range: "Last 1 hour"
  2. Search: "error AND service:payment-api"
  3. Filter: status_code is 500
  4. View matching log lines with timestamps
  5. Click a log line to expand all fields
  6. Spot the root cause: a null pointer in checkout flow
  7. Share the search URL with the developer

Total time to investigate: 2 minutes
vs. searching raw log files with grep: 30+ minutes
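The first three steps of the workflow correspond roughly to a search request body like the one built below. This is a sketch in Python under assumptions: the time field is taken to be `@timestamp`, the `status_code` field name comes from the filter step, and the function name is hypothetical. Discover builds its own request internally, so treat this as an approximation of what it sends.

```python
import json
from typing import Any, Dict


def build_log_query(text: str, status_code: int, since: str = "now-1h") -> Dict[str, Any]:
    """Build a search body: free-text query plus status and time filters."""
    return {
        "query": {
            "bool": {
                # Step 2: the text typed into the search bar.
                "must": [{"query_string": {"query": text}}],
                "filter": [
                    # Step 3: exact match on the status code.
                    {"term": {"status_code": status_code}},
                    # Step 1: the time range ("Last 1 hour").
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        },
        # Newest log lines first.
        "sort": [{"@timestamp": "desc"}],
    }


query = build_log_query("error AND service:payment-api", 500)
print(json.dumps(query, indent=2))
```

The status and time conditions sit in `filter` rather than `must` because they are yes/no matches that do not need relevance scoring.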
