Elasticsearch Monitoring and the ELK Stack

Elasticsearch does not operate in isolation. The ELK Stack (Elasticsearch, Logstash, and Kibana) forms a complete pipeline for collecting, processing, storing, and visualizing data. Monitoring shows you how healthy your cluster is, and alerting notifies you before problems become outages.

The ELK Stack Architecture

Data Sources                  Collection                 Storage + Search          Visualization
(servers, apps, devices)
                          +------------+
Web Server Logs  -------> |            |
Application Logs -------> | Logstash   | ---------> [ Elasticsearch ] ---------> [ Kibana ]
System Metrics   -------> | or Beats   |               (stores + indexes)          (dashboards
Database Logs    -------> |            |                                             charts
                          +------------+                                             alerts)

Each Component's Role

Component      Job
Beats          Lightweight agents that collect data from servers (CPU, logs, network)
Logstash       Heavy-duty pipeline: receive, transform, and forward data
Elasticsearch  Store, index, and search all ingested data
Kibana         Browser-based UI for search, dashboards, alerting, and management

Beats — Lightweight Data Shippers

Beats are small programs you install on servers. They collect specific types of data and forward them to Elasticsearch or Logstash:

Beat        Collects
Filebeat    Log files from disk
Metricbeat  CPU, memory, disk, network stats
Packetbeat  Network traffic and protocols
Auditbeat   Linux audit framework events
Heartbeat   Uptime monitoring for URLs and services

Monitoring Elasticsearch Itself

The Stack Monitoring feature in Kibana shows real-time health of your Elasticsearch cluster:

Kibana Stack Monitoring shows:

Cluster Health:     GREEN / YELLOW / RED
Nodes:              3 active nodes
Shards:             24 total, 0 unassigned
JVM Heap Used:      62%
Index Rate:         1,240 docs/sec
Search Rate:        342 queries/sec
Disk Used:          340 GB / 500 GB (68%)

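Enable monitoring by configuring a monitoring cluster: a separate, small Elasticsearch cluster that stores the metrics from your main cluster. In production, avoid storing monitoring data in the cluster you are monitoring, because an outage there would take the monitoring data offline exactly when you need it.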
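The same health figures can also be read directly from the cluster health API (GET /_cluster/health). The sketch below is a minimal illustration, assuming an unsecured cluster at a hypothetical http://localhost:9200. The helper names fetch_cluster_health and summarize_health are invented for this example, and a secured cluster would also need TLS and credentials.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """Call GET /_cluster/health and return the parsed JSON body.

    The base URL is an assumption; secured clusters also need credentials.
    """
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as response:
        return json.load(response)


def summarize_health(health: dict) -> str:
    """Turn a cluster health response into a one-line summary."""
    return (
        f"status={health['status'].upper()} "
        f"nodes={health['number_of_nodes']} "
        f"active_shards={health['active_shards']} "
        f"unassigned_shards={health['unassigned_shards']}"
    )


# Sample response trimmed to the fields used here (values are illustrative).
sample = {
    "status": "green",
    "number_of_nodes": 3,
    "active_shards": 24,
    "unassigned_shards": 0,
}
print(summarize_health(sample))
```

Against a live cluster, summarize_health(fetch_cluster_health()) would produce the same kind of line from real data.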

Key Metrics to Watch

Metric                Warning Threshold  What It Signals
JVM Heap Used         Over 75%           Risk of long garbage collection pauses
Disk Usage            Over 85%           By default, no new shards are allocated to the node at 85%; at 95% the indices on that node are blocked for writes
Unassigned Shards     Any                Reduced redundancy (unassigned replicas) or unavailable data (unassigned primaries)
Search Latency (p99)  Over 1 second      Queries are slow; check field mappings and shard count
Indexing Latency      Spikes             Possible merge pressure or I/O bottleneck
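The thresholds in the table above translate naturally into a small check. The following is a sketch only: NodeMetrics and find_warnings are hypothetical names, the input values would have to come from your own monitoring data, and indexing latency is left out because detecting a spike requires a history of measurements rather than a single value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeMetrics:
    """One snapshot of the metrics from the table (hypothetical structure)."""
    heap_used_pct: float
    disk_used_pct: float
    unassigned_shards: int
    search_p99_ms: float


def find_warnings(m: NodeMetrics) -> List[str]:
    """Return a warning message for every threshold that is exceeded."""
    warnings = []
    if m.heap_used_pct > 75:
        warnings.append(f"JVM heap at {m.heap_used_pct}% (over 75%)")
    if m.disk_used_pct > 85:
        warnings.append(f"Disk at {m.disk_used_pct}% (over 85%)")
    if m.unassigned_shards > 0:
        warnings.append(f"{m.unassigned_shards} unassigned shard(s)")
    if m.search_p99_ms > 1000:
        warnings.append(f"Search p99 at {m.search_p99_ms} ms (over 1 second)")
    return warnings


# Heap and disk values from the Stack Monitoring example; the latency is illustrative.
healthy = NodeMetrics(heap_used_pct=62, disk_used_pct=68,
                      unassigned_shards=0, search_p99_ms=300)
print(find_warnings(healthy))
```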

Kibana Dashboards

Kibana dashboards let you build visual reports from your Elasticsearch data without writing code. You add charts as panels and arrange them by dragging and dropping:

Example: Web Traffic Dashboard

[Line Chart: Requests per minute over last 24 hours]

[Bar Chart: Top 10 pages by visits]

[Pie Chart: Traffic by country]

[Data Table: Top error codes with count and % share]

[Metric: Total unique visitors today]

Save dashboards and share them with your team via URL. Kibana respects Elasticsearch's role-based access control — a user only sees data their role permits.

Kibana Alerting

Alerting rules check conditions on a schedule and fire actions — email, Slack, webhook — when conditions are met:

Example Alert Rule:
  Name:      "High Error Rate Alert"
  Check:     Every 5 minutes
  Condition: HTTP 500 errors > 50 in last 10 minutes
  Action:    Send email to ops-team@company.com

Example Alert Rule:
  Name:      "Disk Space Warning"
  Check:     Every 1 hour
  Condition: Any node disk usage > 80%
  Action:    Post to Slack #alerts channel
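Kibana rules are configured in the UI, but the condition behind the first rule can be sketched in code to show what is being evaluated. This is an illustration of the logic, not how Kibana implements it. The query body is shaped for the Elasticsearch _count API, and the field names status_code and @timestamp are assumptions about how the logs are mapped.

```python
def build_error_count_query(window_minutes: int = 10) -> dict:
    """Build a _count request body matching HTTP 500 responses in the window.

    Field names (status_code, @timestamp) are assumptions about the mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status_code": 500}},
                    {"range": {"@timestamp": {"gte": f"now-{window_minutes}m"}}},
                ]
            }
        }
    }


def should_fire(error_count: int, threshold: int = 50) -> bool:
    """The rule fires only when the count is strictly above the threshold."""
    return error_count > threshold


query = build_error_count_query()
print(query["query"]["bool"]["filter"][1])
print(should_fire(51), should_fire(50))
```

A scheduler would send the query every 5 minutes, pass the returned count to should_fire, and trigger the email action when it returns True.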

Snapshot and Restore — Backup Strategy

Replicas protect against node failures but not against accidental deletion. Snapshots back up indexes to external storage:

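# Register a repository (S3 example)
# Note: whether "region" is accepted as a repository setting depends on the
# Elasticsearch version; newer releases configure the region on the S3 client
# instead, so check the documentation for your version.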
PUT /_snapshot/my_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "ap-south-1"
  }
}

# Take a snapshot
PUT /_snapshot/my_s3_backup/snapshot_2024_06_15
{
  "indices": ["products", "orders"],
  "ignore_unavailable": true
}

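# Restore a snapshot
# An open index named "products" must be closed or deleted first, or the
# restore must target a new name via rename_pattern / rename_replacement.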
POST /_snapshot/my_s3_backup/snapshot_2024_06_15/_restore
{
  "indices": ["products"]
}

Automate daily snapshots using Snapshot Lifecycle Management (SLM). It applies the same idea as Index Lifecycle Management (ILM), but to backups: SLM creates, retains, and deletes snapshots on a schedule without manual work.
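An SLM policy is a JSON document sent to PUT /_slm/policy/<policy-id>. The sketch below builds such a body for the my_s3_backup repository registered earlier. The function name build_slm_policy, the 01:30 schedule, and the retention numbers are illustrative assumptions, not recommendations.

```python
import json
from typing import List


def build_slm_policy(repository: str, indices: List[str],
                     expire_after: str = "30d") -> dict:
    """Build the request body for PUT /_slm/policy/<policy-id>.

    Schedule and retention values are illustrative assumptions.
    """
    return {
        # Cron fields: second minute hour day-of-month month day-of-week.
        "schedule": "0 30 1 * * ?",          # every day at 01:30
        "name": "<daily-snap-{now/d}>",      # date math adds the date to each name
        "repository": repository,
        "config": {
            "indices": indices,
            "ignore_unavailable": True,
        },
        "retention": {
            "expire_after": expire_after,    # delete snapshots older than this
            "min_count": 5,                  # but always keep at least 5
            "max_count": 50,                 # and never keep more than 50
        },
    }


policy = build_slm_policy("my_s3_backup", ["products", "orders"])
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Kibana console under a PUT /_slm/policy/daily-snapshots line, where daily-snapshots is a hypothetical policy ID.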

Logs in Kibana Discover

Kibana Discover lets engineers search log data interactively:

Workflow:
  1. Select time range: "Last 1 hour"
  2. Search: "error AND service:payment-api"
  3. Filter: status_code is 500
  4. View matching log lines with timestamps
  5. Click a log line to expand all fields
  6. Spot the root cause — a null pointer in checkout flow
  7. Share the search URL with the developer

Illustrative time to investigate: about 2 minutes
vs
Searching raw log files with grep: 30+ minutes
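Steps 1 to 3 of the workflow correspond roughly to a single search request. The sketch below builds an equivalent body for the _search API. Note the swap: Discover uses KQL by default, while this example passes the search text through the Lucene-style query_string query. The field names status_code and @timestamp, the result size, and the function name build_discover_query are assumptions.

```python
def build_discover_query(search: str, status_code: int,
                         since: str = "now-1h") -> dict:
    """Build a _search body mirroring the Discover steps.

    Uses query_string (Lucene syntax) rather than KQL; field names are assumed.
    """
    return {
        "query": {
            "bool": {
                # Step 2: free-text search
                "must": [{"query_string": {"query": search}}],
                "filter": [
                    # Step 3: exact-match filter
                    {"term": {"status_code": status_code}},
                    # Step 1: time range
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        },
        "sort": [{"@timestamp": "desc"}],  # newest log lines first
        "size": 50,
    }


body = build_discover_query("error AND service:payment-api", 500)
print(body["query"]["bool"]["must"][0])
```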
