DevOps Monitoring and Observability
Monitoring means watching a running system to detect problems before users notice them. Observability goes deeper — it means understanding why a system behaves the way it does, not just knowing that something is wrong.
In DevOps, deploying code is only half the job. The other half is making sure the deployed application runs correctly, performs well, and recovers quickly from failures. A team without monitoring is flying blind.
Monitoring vs Observability
| Concept | Focus | Question It Answers |
|---|---|---|
| Monitoring | Known failure modes | Is the system healthy right now? |
| Observability | Unknown and complex failures | Why is the system behaving this way? |
Good observability requires collecting and correlating three types of data, often called the Three Pillars of Observability:
- Metrics – Numerical measurements over time (CPU usage, request count, error rate).
- Logs – Timestamped text records of events (errors, user actions, system events).
- Traces – Records that follow a single request as it travels through multiple services.
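To make the three pillars concrete, here is a minimal Python sketch showing how a single failed request might surface in each one (all names and values are illustrative, not from any particular library):

```python
import json

# One failed HTTP request, represented in each of the three pillars.

# Metric: numeric counters, cheap to store and aggregate over time.
metrics = {"http_requests_total": 0, "http_errors_total": 0}
metrics["http_requests_total"] += 1
metrics["http_errors_total"] += 1  # the request returned a 5xx

# Log: a timestamped, searchable record of the individual event.
log_line = json.dumps({
    "ts": "2024-01-15T14:03:07Z",
    "level": "ERROR",
    "msg": "payment service returned 503",
    "path": "/checkout",
})

# Trace: spans sharing a trace_id, tying the request together across services.
trace = [
    {"trace_id": "abc123", "span": "frontend", "duration_ms": 120, "error": True},
    {"trace_id": "abc123", "span": "payment-service", "duration_ms": 95, "error": True},
]
```

The metric tells you *that* errors went up; the log tells you *what* happened in one instance; the trace tells you *where* in the request path it happened.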
Metrics
Metrics are numbers that describe system behavior over time. They are cheap to store and fast to query. Common categories:
Infrastructure Metrics
- CPU utilization (%)
- Memory usage (MB/GB)
- Disk I/O (reads/writes per second)
- Network throughput (bytes in/out)
Application Metrics
- Request rate (requests per second)
- Error rate (% of requests returning 5xx errors)
- Response latency (p50, p95, p99 in milliseconds)
- Active connections
Business Metrics
- Orders placed per minute
- Active users
- Payment success rate
- Feature adoption rate
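The latency percentiles mentioned above (p50, p95, p99) are simply quantiles computed over a window of observed response times. A rough, stdlib-only Python sketch using the nearest-rank method (monitoring systems use more sophisticated estimators, but the idea is the same):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    # Index of the p-th percentile value under the nearest-rank method.
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# Illustrative one-minute window: mostly fast, a few slow outliers.
latencies_ms = [12, 15, 14, 13, 200, 16, 14, 15, 13, 450]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Note how p50 hides the outliers entirely while p95/p99 expose them, which is why dashboards track the tail percentiles and not just averages.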
The RED and USE Methods
RED Method (for services)
- Rate: How many requests per second?
- Errors: How many requests are failing?
- Duration: How long do requests take?
USE Method (for infrastructure)
- Utilization: How busy is the resource?
- Saturation: Is there a queue building up?
- Errors: Are there error events?
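The RED numbers fall directly out of a window of request records. A minimal sketch with illustrative data:

```python
# RED metrics over a 60-second window of (status_code, duration_ms) records.
window_seconds = 60
requests = [(200, 35), (200, 42), (500, 310), (200, 28), (503, 290), (200, 31)]

# Rate: requests per second over the window.
rate_rps = len(requests) / window_seconds

# Errors: count and percentage of 5xx responses.
errors = sum(1 for status, _ in requests if status >= 500)
error_pct = errors / len(requests) * 100

# Duration: mean response time (real dashboards also track percentiles).
avg_duration_ms = sum(d for _, d in requests) / len(requests)
```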
Prometheus – Metrics Collection
Prometheus is an open-source monitoring system that scrapes (pulls) metrics from targets at regular intervals and stores them in a time-series database. It uses a query language called PromQL to analyze metrics.
How Prometheus Works
- Applications expose metrics on an HTTP endpoint (e.g., /metrics).
- Prometheus scrapes this endpoint every 15–30 seconds.
- Metrics are stored with timestamps in its time-series database.
- PromQL queries analyze and aggregate the data.
- Alert rules trigger notifications when thresholds are crossed.
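The /metrics endpoint serves plain text in the Prometheus exposition format. A minimal sketch of what an application might render (real services normally use an official client library such as prometheus_client; the metric names here are illustrative):

```python
def render_metrics(counters):
    """Render a dict of counters in Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")  # metadata line for each metric
        lines.append(f"{name} {value}")          # sample line: name value
    return "\n".join(lines) + "\n"

counters = {"http_requests_total": 1027, "http_errors_total": 3}
body = render_metrics(counters)
# Prometheus would scrape this body from GET /metrics every 15-30 seconds.
```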
prometheus.yml – Basic Configuration

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'webapp'
    static_configs:
      - targets: ['webapp:8080']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['web01:9100', 'web02:9100']
```

Common PromQL Queries

```promql
# HTTP request rate (per second, averaged over the last 5 minutes)
rate(http_requests_total[5m])

# Error rate (percentage)
rate(http_requests_total{status=~"5.."}[5m])
  / rate(http_requests_total[5m]) * 100

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage across all instances
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```

Grafana – Visualization
Grafana is a dashboarding tool that connects to Prometheus (and many other data sources) to display metrics as beautiful, interactive graphs. Engineers use Grafana dashboards to understand system health at a glance.
A typical Grafana dashboard for a web application shows:
- Request rate graph (requests/sec over last 1 hour)
- Error rate graph (with a red threshold line at 1%)
- P99 latency graph (response time for the slowest 1% of requests)
- CPU and memory usage per server
- Active database connections
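The P99 latency panel above is typically built with PromQL's histogram_quantile, which estimates a quantile by linear interpolation across cumulative histogram buckets. A simplified Python sketch of that computation, with illustrative bucket data:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count)
    buckets, interpolating linearly within a bucket as PromQL does."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Interpolate where `target` falls inside this bucket.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative request-duration buckets in seconds: (le, cumulative count).
buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000)]
p95 = histogram_quantile(0.95, buckets)  # falls inside the 0.25-0.5s bucket
```

This is why histogram bucket boundaries matter: the estimate can only be as precise as the buckets the application exports.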
Alerting with Prometheus Alertmanager
Alertmanager handles alerts fired by Prometheus: it deduplicates and groups them, then routes them to the right team via email, Slack, PagerDuty, or other channels.
Sample Alert Rule
```yaml
groups:
  - name: webapp_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 2 minutes"

      - alert: HighCPUUsage
        expr: |
          100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85%"
```

Logging with the ELK Stack
The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular centralized logging solution.
- Elasticsearch: Stores and indexes log data for fast searching.
- Logstash: Collects, parses, and transforms logs from multiple sources before sending to Elasticsearch.
- Kibana: A web interface to search, visualize, and analyze logs in Elasticsearch.
- Beats (Filebeat/Metricbeat): Lightweight agents that ship logs and metrics from servers to Logstash or Elasticsearch.
Modern setups often replace Logstash with Fluentd or Fluent Bit for lighter resource usage, creating the EFK stack (Elasticsearch, Fluentd, Kibana).
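Whichever shipper is used, the core job is the same: turning raw log lines into structured documents that Elasticsearch can index. A rough sketch of that parsing step, assuming a common access-log-style format (the pattern and sample line are illustrative):

```python
import re

# Grok-style parsing: raw access-log line -> structured document.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

doc = parse_log_line(
    '10.0.0.5 - - [15/Jan/2024:14:03:07 +0000] "POST /checkout HTTP/1.1" 503 512'
)
```

Once fields like `status` and `path` are extracted, Kibana can filter and aggregate on them instead of full-text searching raw strings.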
Distributed Tracing
In microservices architectures, a single user request may travel through 10 different services. A trace follows this request end-to-end, recording timing and errors at every step.
Popular tracing tools:
- Jaeger: Open-source distributed tracing from CNCF.
- Zipkin: Lightweight distributed tracing system.
- OpenTelemetry: A vendor-neutral standard for collecting traces, metrics, and logs. Increasingly the default choice for new projects.
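The mechanism all three tools share is context propagation: every service records a span carrying the same trace_id, so the backend can reassemble the request's full path. A minimal in-memory sketch (illustrative; real systems propagate the IDs between services via HTTP headers such as the W3C traceparent header):

```python
import uuid

spans = []  # stand-in for a tracing backend like Jaeger

def record_span(trace_id, service, parent=None):
    """Record one span; in real systems trace_id arrives in request headers."""
    span = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "service": service,
        "parent": parent,
    }
    spans.append(span)
    return span

# One request flowing frontend -> orders -> payments.
trace_id = uuid.uuid4().hex
root = record_span(trace_id, "frontend")
orders = record_span(trace_id, "orders", parent=root["span_id"])
record_span(trace_id, "payments", parent=orders["span_id"])

# Reassemble the request path from the shared trace_id.
path = [s["service"] for s in spans if s["trace_id"] == trace_id]
```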
SLAs, SLOs, and SLIs
Monitoring connects directly to business reliability commitments:
| Term | Meaning | Example |
|---|---|---|
| SLA (Service Level Agreement) | Contract with customers about availability | 99.9% uptime guaranteed monthly |
| SLO (Service Level Objective) | Internal reliability target | 99.95% of requests succeed in under 200ms |
| SLI (Service Level Indicator) | The actual measured metric | Current success rate: 99.97% |
| Error Budget | Allowed failure time before SLO is breached | 43.8 minutes per month at 99.9% SLO |
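The error-budget figure in the table falls straight out of the SLO arithmetic (43.8 minutes assumes an average month of about 30.44 days):

```python
def error_budget_minutes(slo_pct, days=30.44):
    """Allowed downtime per period before the SLO is breached."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)

budget_999 = error_budget_minutes(99.9)    # ~43.8 minutes per month
budget_9995 = error_budget_minutes(99.95)  # ~21.9 minutes per month
```

Each extra "nine" halves or worse the budget, which is why tightening an SLO is a business decision, not just an engineering one.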
Real-World Example
An e-commerce site deploys a new checkout feature at 2 PM. By 2:15 PM, Prometheus detects a 3% error rate on the /checkout endpoint — up from the baseline of 0.1%. Alertmanager sends a Slack message to the on-call engineer. The engineer opens Grafana, sees the error spike correlates exactly with the new deployment. Jaeger traces show errors originating in the payment service. The team rolls back the deployment within 10 minutes of the first alert.
Without monitoring, users would have complained for hours before anyone noticed.
Summary
- Monitoring detects problems. Observability explains why they happen.
- The three pillars of observability are metrics, logs, and traces.
- Prometheus collects metrics through scraping. PromQL queries analyze them.
- Grafana visualizes Prometheus data in interactive dashboards.
- Alertmanager routes alerts to teams via Slack, email, and PagerDuty.
- The ELK/EFK stack provides centralized log collection, search, and visualization.
- SLOs and error budgets connect monitoring to business reliability goals.
