Elasticsearch Monitoring and the ELK Stack

Elasticsearch does not operate in isolation. The ELK Stack (Elasticsearch, Logstash, and Kibana) forms a complete pipeline for collecting, processing, storing, and visualizing data; with Beats added as lightweight collectors, it is also known as the Elastic Stack. Monitoring that pipeline lets you spot problems in your cluster before they become outages.

The ELK Stack Architecture

Data Sources                  Collection                 Storage + Search          Visualization
(servers, apps, devices)
                          +------------+
Web Server Logs  -------> |            |
Application Logs -------> | Logstash   | ---------> [ Elasticsearch ] ---------> [ Kibana ]
System Metrics   -------> | or Beats   |               (stores + indexes)          (dashboards
Database Logs    -------> |            |                                             charts
                          +------------+                                             alerts)

Each Component's Role

Component	Job
Beats	Lightweight agents that collect data from servers (CPU, logs, network)
Logstash	Heavy-duty pipeline: receive, transform, and forward data
Elasticsearch	Store, index, and search all ingested data
Kibana	Browser-based UI for search, dashboards, alerting, and management
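
To make the "receive, transform, forward" role more concrete, here is a minimal Python sketch of the kind of transform Logstash applies to a single web server log line. The log format (Common Log Format), the regular expression, the output field names, and the function name parse_access_log are all assumptions made for illustration. A real deployment would use Logstash filters or an Elasticsearch ingest pipeline rather than hand-written code.

```python
import re
from datetime import datetime

# Assumed input format (Common Log Format), for example:
# 203.0.113.7 - - [15/Jun/2024:10:12:01 +0000] "GET /cart HTTP/1.1" 500 1234
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status_code>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log(line):
    """Turn one raw log line into a structured document, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    # Convert strings into typed values so they can be indexed as numbers and dates.
    doc["status_code"] = int(doc["status_code"])
    doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
    parsed = datetime.strptime(doc.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    doc["@timestamp"] = parsed.isoformat()
    return doc


line = '203.0.113.7 - - [15/Jun/2024:10:12:01 +0000] "GET /cart HTTP/1.1" 500 1234'
print(parse_access_log(line))
```

The structured document is what makes later steps possible: once status_code is a number and @timestamp is a date, Elasticsearch can filter and aggregate on them instead of searching raw text.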

Beats — Lightweight Data Shippers

Beats are small programs you install on servers. They collect specific types of data and forward them to Elasticsearch or Logstash:

Beat	Collects
Filebeat	Log files from disk
Metricbeat	CPU, memory, disk, network stats
Packetbeat	Network traffic and protocols
Auditbeat	Linux audit framework events
Heartbeat	Uptime monitoring for URLs and services

Monitoring Elasticsearch Itself

The Stack Monitoring feature in Kibana shows the near-real-time health of your Elasticsearch cluster. A typical overview looks like this:

Cluster Health:     GREEN / YELLOW / RED
Nodes:              3 active nodes
Shards:             24 total, 0 unassigned
JVM Heap Used:      62%
Index Rate:         1,240 docs/sec
Search Rate:        342 queries/sec
Disk Used:          340 GB / 500 GB (68%)
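
The health colors have precise meanings. The sketch below, in Python, turns a cluster health response into a one-line summary. The field names (status, number_of_nodes, active_shards, unassigned_shards) are those returned by GET /_cluster/health; the function name summarize_health and the sample values are assumptions for illustration, and the dictionary is hard-coded so the example runs without a cluster.

```python
# What each cluster health status means.
STATUS_MEANING = {
    "green": "all primary and replica shards are assigned",
    "yellow": "all primaries are assigned, but at least one replica is not",
    "red": "at least one primary shard is unassigned",
}


def summarize_health(health):
    """Format a _cluster/health style response as a short, readable line."""
    status = health["status"]
    return (
        f"{status.upper()}: {STATUS_MEANING[status]} "
        f"({health['number_of_nodes']} nodes, "
        f"{health['active_shards']} active shards, "
        f"{health['unassigned_shards']} unassigned)"
    )


# Hard-coded sample mirroring the overview above.
sample_health = {
    "status": "green",
    "number_of_nodes": 3,
    "active_shards": 24,
    "unassigned_shards": 0,
}
print(summarize_health(sample_health))
```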

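Enable monitoring by collecting metrics from your main cluster and shipping them to a monitoring cluster: a separate, small Elasticsearch cluster dedicated to storing them. In production, avoid storing monitoring data in the cluster you are monitoring. If that cluster goes down or runs out of resources, you lose access to the very metrics you need to diagnose it.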

Key Metrics to Watch

Metric	Warning Threshold	What it Signals
JVM Heap Used	Over 75%	Risk of garbage collection pauses
Disk Usage	Over 85%	With default watermarks, a node above 85% receives no new shards; at 95% (flood stage) its indices become read-only and reject writes
Unassigned Shards	Any	Data may be unavailable
Search Latency (p99)	Over 1 second	Queries are slow — check field mappings and shard count
Indexing Latency	Spikes	Possible merge pressure or I/O bottleneck
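
The thresholds in the table can be expressed as a simple check. This is a minimal Python sketch, not a replacement for Stack Monitoring or alerting: the function name check_cluster_metrics and the dictionary keys are hypothetical, and the values would have to be gathered from your own monitoring data. Indexing latency is left out because "spikes" needs a baseline rather than a fixed number.

```python
def check_cluster_metrics(metrics):
    """Return warning strings for every metric that crosses its threshold."""
    warnings = []
    if metrics["jvm_heap_used_percent"] > 75:
        warnings.append("JVM heap above 75%: risk of garbage collection pauses")
    if metrics["disk_used_percent"] > 85:
        warnings.append("Disk usage above 85%: free space before writes are blocked")
    if metrics["unassigned_shards"] > 0:
        warnings.append("Unassigned shards: data may be unavailable")
    if metrics["search_latency_p99_ms"] > 1000:
        warnings.append("p99 search latency above 1 second: check mappings and shard count")
    return warnings


# Heap, disk, and shard values come from the Stack Monitoring example in this lesson;
# the latency figure is made up for illustration.
sample = {
    "jvm_heap_used_percent": 62,
    "disk_used_percent": 68,
    "unassigned_shards": 0,
    "search_latency_p99_ms": 240,
}
print(check_cluster_metrics(sample))
```

With the sample values, no threshold is crossed and the list of warnings is empty.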

Kibana Dashboards

Kibana's Dashboard app lets you build visual reports from your Elasticsearch data without writing code. You add charts as panels and arrange them by dragging and dropping:

Example: Web Traffic Dashboard

[Line Chart: Requests per minute over last 24 hours]

[Bar Chart: Top 10 pages by visits]

[Pie Chart: Traffic by country]

[Data Table: Top error codes with count and % share]

[Metric: Total unique visitors today]

Save dashboards and share them with your team via URL. Kibana respects Elasticsearch's role-based access control — a user only sees data their role permits.

Kibana Alerting

Alerting rules check conditions on a schedule and fire actions — email, Slack, webhook — when conditions are met:

Example Alert Rule:
  Name:      "High Error Rate Alert"
  Check:     Every 5 minutes
  Condition: HTTP 500 errors > 50 in last 10 minutes
  Action:    Send email to ops-team@company.com

Example Alert Rule:
  Name:      "Disk Space Warning"
  Check:     Every 1 hour
  Condition: Any node disk usage > 80%
  Action:    Post to Slack #alerts channel
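
The logic behind the "High Error Rate Alert" rule can be sketched in a few lines of Python. This only illustrates the condition check; Kibana evaluates rules against Elasticsearch queries, not in-memory lists. The function name high_error_rate and the event fields status_code and timestamp are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone


def high_error_rate(events, now, window_minutes=10, threshold=50):
    """Return True when more than `threshold` HTTP 500 events fall inside the window."""
    window_start = now - timedelta(minutes=window_minutes)
    errors = sum(
        1
        for event in events
        if event["status_code"] == 500 and window_start <= event["timestamp"] <= now
    )
    return errors > threshold


# Simulate one evaluation of the rule: 51 errors, all one minute old.
now = datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc)
events = [{"status_code": 500, "timestamp": now - timedelta(minutes=1)} for _ in range(51)]
if high_error_rate(events, now):
    print("Condition met: the rule would send its email action")
```

A scheduler would call this check every 5 minutes. Note that the 10-minute window is longer than the 5-minute interval, so consecutive checks overlap and a burst of errors cannot slip between two evaluations.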

Snapshot and Restore — Backup Strategy

Replicas protect against node failures but not against accidental deletion, because a delete is applied to every copy of a shard, primary and replica alike. Snapshots back up indices to external storage:

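# Register a repository (S3 example).
# The bucket must already exist and the cluster needs S3 credentials;
# versions before 8.0 also need the repository-s3 plugin installed.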
PUT /_snapshot/my_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "ap-south-1"
  }
}

# Take a snapshot
PUT /_snapshot/my_s3_backup/snapshot_2024_06_15
{
  "indices": ["products", "orders"],
  "ignore_unavailable": true
}

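# Restore a snapshot.
# If an open index named "products" already exists, close or delete it first,
# or the restore request fails.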
POST /_snapshot/my_s3_backup/snapshot_2024_06_15/_restore
{
  "indices": ["products"]
}

Automate daily snapshots with Snapshot Lifecycle Management (SLM). It applies the same policy-driven idea as Index Lifecycle Management (ILM), but to backups: SLM creates snapshots on a schedule and deletes old ones according to a retention policy, with no manual work.
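
As a sketch of what such a policy contains, the Python below builds a request body for the SLM policy API (PUT /_slm/policy/<policy-id>). The top-level fields (schedule, name, repository, config, retention) are part of that API; the function name build_slm_policy, the 01:30 schedule, and the retention numbers are assumptions chosen for illustration. The code only prints the JSON and does not contact a cluster.

```python
import json


def build_slm_policy(repository, indices, keep_days=30):
    """Build a request body for PUT /_slm/policy/<policy-id>."""
    return {
        # Cron expression with a seconds field: every day at 01:30 UTC.
        "schedule": "0 30 1 * * ?",
        # Date math in the name gives each snapshot a unique, dated name.
        "name": "<daily-snap-{now/d}>",
        "repository": repository,
        "config": {
            "indices": indices,
            "ignore_unavailable": True,
        },
        # Delete snapshots older than keep_days, but always keep between 5 and 50.
        "retention": {
            "expire_after": f"{keep_days}d",
            "min_count": 5,
            "max_count": 50,
        },
    }


policy = build_slm_policy("my_s3_backup", ["products", "orders"])
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Kibana console as the body of the PUT request, using a policy ID of your choice.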

Logs in Kibana Discover

Kibana Discover lets engineers search log data interactively:

Workflow:
  1. Select time range: "Last 1 hour"
  2. Search: "error AND service:payment-api"
  3. Filter: status_code is 500
  4. View matching log lines with timestamps
  5. Click a log line to expand all fields
  6. Spot the root cause — a null pointer in checkout flow
  7. Share the search URL with the developer

Illustrative comparison:
  Investigating in Discover: about 2 minutes
  Searching raw log files with grep: 30+ minutes
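
Behind the scenes, a Discover search is translated into an Elasticsearch query. The Python sketch below builds a roughly equivalent _search request body for steps 1 to 3 of the workflow. The field names service and status_code come from the workflow; message and @timestamp are assumed field names, the term filter on service assumes a keyword mapping, and the function name build_log_search is hypothetical. Discover itself searches across all fields for a bare term like "error", so this is an approximation, not the exact query Kibana generates.

```python
import json


def build_log_search(text, service, status_code, time_range="now-1h"):
    """Build a _search request body approximating the Discover workflow."""
    return {
        "query": {
            "bool": {
                # Free-text part of the search, scored like any full-text query.
                "must": [{"match": {"message": text}}],
                # Exact-match filters and the time range do not affect scoring.
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"status_code": status_code}},
                    {"range": {"@timestamp": {"gte": time_range, "lte": "now"}}},
                ],
            }
        },
        # Newest log lines first, as in Discover's default view.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 50,
    }


body = build_log_search("error", "payment-api", 500)
print(json.dumps(body, indent=2))
```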
