DevOps Monitoring and Observability

Monitoring means watching a running system to detect problems before users notice them. Observability goes deeper — it means understanding why a system behaves the way it does, not just knowing that something is wrong.

In DevOps, deploying code is only half the job. The other half is making sure the deployed application runs correctly, performs well, and recovers quickly from failures. A team without monitoring is flying blind.

Monitoring vs Observability

| Concept       | Focus                        | Question It Answers                  |
|---------------|------------------------------|--------------------------------------|
| Monitoring    | Known failure modes          | Is the system healthy right now?     |
| Observability | Unknown and complex failures | Why is the system behaving this way? |

Good observability requires collecting and correlating three types of data, often called the Three Pillars of Observability:

  1. Metrics – Numerical measurements over time (CPU usage, request count, error rate).
  2. Logs – Timestamped text records of events (errors, user actions, system events).
  3. Traces – Records that follow a single request as it travels through multiple services.

Metrics

Metrics are numbers that describe system behavior over time. They are cheap to store and fast to query. Common categories:

Infrastructure Metrics

  • CPU utilization (%)
  • Memory usage (MB/GB)
  • Disk I/O (reads/writes per second)
  • Network throughput (bytes in/out)

Application Metrics

  • Request rate (requests per second)
  • Error rate (% of requests returning 5xx errors)
  • Response latency (p50, p95, p99 in milliseconds)
  • Active connections
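
The latency percentiles above (p50, p95, p99) answer "how slow is the slowest X% of requests?". As a rough sketch of what those numbers mean, here is a nearest-rank percentile over a hypothetical list of request latencies (real systems derive these from histograms rather than raw samples):

```python
# Sketch: nearest-rank percentile over raw latency samples (hypothetical data).
# Production systems aggregate from histograms instead of storing every sample.

def percentile(samples, p):
    """Smallest value such that at least p% of samples are <= it (nearest rank)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [12, 15, 11, 240, 13, 14, 16, 12, 500, 13]  # hypothetical requests
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how a single slow outlier barely moves p50 but dominates p99, which is why dashboards plot several percentiles side by side.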

Business Metrics

  • Orders placed per minute
  • Active users
  • Payment success rate
  • Feature adoption rate

The RED and USE Methods

RED Method (for services)

  • Rate: How many requests per second?
  • Errors: How many requests are failing?
  • Duration: How long do requests take?

USE Method (for infrastructure)

  • Utilization: How busy is the resource?
  • Saturation: Is there a queue building up?
  • Errors: Are there error events?
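
The RED method is just arithmetic over a window of requests. A minimal sketch, with hypothetical request records and window size:

```python
# Sketch: deriving RED metrics (Rate, Errors, Duration) from one observation
# window. The record fields and the data are illustrative, not a real API.

requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 52},
    {"status": 500, "duration_ms": 310},
    {"status": 200, "duration_ms": 48},
]
window_seconds = 2  # hypothetical observation window

rate = len(requests) / window_seconds                    # Rate: requests per second
errors = sum(1 for r in requests if r["status"] >= 500)  # Errors: failing requests
avg_duration = sum(r["duration_ms"] for r in requests) / len(requests)  # Duration

print(f"rate={rate} req/s, errors={errors}, avg duration={avg_duration} ms")
```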

Prometheus – Metrics Collection

Prometheus is an open-source monitoring system that scrapes (pulls) metrics from targets at regular intervals and stores them in a time-series database. It uses a query language called PromQL to analyze metrics.

How Prometheus Works

  1. Applications expose metrics on an HTTP endpoint (e.g., /metrics).
  2. Prometheus scrapes this endpoint every 15–30 seconds.
  3. Metrics are stored with timestamps in its time-series database.
  4. PromQL queries analyze and aggregate the data.
  5. Alert rules trigger notifications when thresholds are crossed.
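
Step 1 above is simpler than it sounds: an application just serves plain text in the Prometheus exposition format. A minimal standard-library sketch (real applications would use the official prometheus_client library; the metric name and label here are illustrative):

```python
# Sketch: a /metrics endpoint in the Prometheus text exposition format,
# using only the standard library. Metric names are illustrative.

from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # would be incremented by application code

def render_metrics():
    # Exposition format: HELP/TYPE comment lines, then "name{labels} value".
    return (
        "# HELP http_requests_total Total HTTP requests handled.\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{path="/"}} {REQUEST_COUNT}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
print(render_metrics())
```

Prometheus then scrapes this endpoint on the interval set in prometheus.yml.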

prometheus.yml – Basic Configuration

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'webapp'
    static_configs:
      - targets: ['webapp:8080']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['web01:9100', 'web02:9100']

Common PromQL Queries

# HTTP request rate (per second, last 5 minutes)
rate(http_requests_total[5m])

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# CPU usage across all instances
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

Grafana – Visualization

Grafana is a dashboarding tool that connects to Prometheus (and many other data sources) to display metrics as beautiful, interactive graphs. Engineers use Grafana dashboards to understand system health at a glance.

A typical Grafana dashboard for a web application shows:

  • Request rate graph (requests/sec over last 1 hour)
  • Error rate graph (with a red threshold line at 1%)
  • P99 latency graph (response time for the slowest 1% of requests)
  • CPU and memory usage per server
  • Active database connections

Alerting with Prometheus Alertmanager

Alertmanager handles alerts from Prometheus: routing them to the right team via email, Slack, PagerDuty, or other channels.

Sample Alert Rule

groups:
  - name: webapp_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 2 minutes"

      - alert: HighCPUUsage
        expr: |
          100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85%"

Logging with the ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular centralized logging solution.

  • Elasticsearch: Stores and indexes log data for fast searching.
  • Logstash: Collects, parses, and transforms logs from multiple sources before sending to Elasticsearch.
  • Kibana: A web interface to search, visualize, and analyze logs in Elasticsearch.
  • Beats (Filebeat/Metricbeat): Lightweight agents that ship logs and metrics from servers to Logstash or Elasticsearch.

Modern setups often replace Logstash with Fluentd or Fluent Bit for lighter resource usage, creating the EFK stack (Elasticsearch, Fluentd, Kibana).
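The heart of the pipeline is the parse/transform step: turning an unstructured log line into a structured document that Elasticsearch can index and Kibana can filter on. A sketch of what a Logstash grok filter or Fluent Bit parser does, using a hypothetical access-log format:

```python
# Sketch: parsing a raw access-log line into a structured JSON document,
# as Logstash or Fluent Bit would before shipping to Elasticsearch.
# The log format and field names are illustrative.

import json
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Parse one access-log line into a dict, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])  # numeric fields enable range queries
    doc["bytes"] = int(doc["bytes"])
    return doc

line = '10.0.0.5 - - [12/Mar/2025:14:02:11 +0000] "POST /checkout HTTP/1.1" 500 213'
print(json.dumps(parse_line(line), indent=2))
```

Once logs are structured like this, a Kibana query such as "status:500 AND path:/checkout in the last 15 minutes" becomes a fast indexed lookup instead of a grep across servers.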

Distributed Tracing

In microservices architectures, a single user request may travel through 10 different services. A trace follows this request end-to-end, recording timing and errors at every step.

Popular tracing tools:

  • Jaeger: Open-source distributed tracing from CNCF.
  • Zipkin: Lightweight distributed tracing system.
  • OpenTelemetry: A vendor-neutral standard for collecting traces, metrics, and logs. Increasingly the default choice for new projects.
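
What makes tracing work is that every span in a request shares one trace ID, and each child span records its parent's span ID. A miniature sketch of that data model (real code would use the OpenTelemetry SDK; this class is hypothetical):

```python
# Sketch: the span/trace data model in miniature. Every span shares the
# request's trace ID; parent IDs link the call tree together. Real systems
# use the OpenTelemetry SDK rather than hand-rolled classes like this.

import secrets
import time

class Span:
    def __init__(self, trace_id, name, parent_id=None):
        self.trace_id = trace_id    # shared by every span in the request
        self.span_id = secrets.token_hex(8)  # unique to this unit of work
        self.parent_id = parent_id  # links a child span to its caller
        self.name = name
        self.start = time.time()

    def child(self, name):
        return Span(self.trace_id, name, parent_id=self.span_id)

# One request entering the system, fanning out across services:
root = Span(secrets.token_hex(16), "GET /checkout")
payment = root.child("payment-service.charge")
db = payment.child("db.insert_order")

print(f"trace {root.trace_id}: {root.name} -> {payment.name} -> {db.name}")
```

A tracing backend like Jaeger collects these spans from every service and reassembles them, by trace ID and parent links, into the end-to-end timeline described above.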

SLAs, SLOs, and SLIs

Monitoring connects directly to business reliability commitments:

| Term                            | Meaning                                     | Example                                  |
|---------------------------------|---------------------------------------------|------------------------------------------|
| SLA (Service Level Agreement)   | Contract with customers about availability  | 99.9% uptime guaranteed monthly          |
| SLO (Service Level Objective)   | Internal reliability target                 | 99.95% of requests succeed in under 200ms |
| SLI (Service Level Indicator)   | The actual measured metric                  | Current success rate: 99.97%             |
| Error Budget                    | Allowed failure time before SLO is breached | 43.8 minutes per month at 99.9% SLO      |
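
The error-budget figure is simple arithmetic: a 99.9% SLO leaves 0.1% of the period as allowed downtime, which over an average month (365.25/12 ≈ 30.44 days) comes to about 43.8 minutes:

```python
# Sketch: the error-budget arithmetic behind the 43.8-minute figure.

AVG_DAYS_PER_MONTH = 365.25 / 12  # ~30.44 days

def error_budget_minutes(slo_percent, days=AVG_DAYS_PER_MONTH):
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100

print(round(error_budget_minutes(99.9), 1))   # minutes of allowed downtime per month
print(round(error_budget_minutes(99.95), 1))  # a tighter SLO halves the budget
```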

Real-World Example

An e-commerce site deploys a new checkout feature at 2 PM. By 2:15 PM, Prometheus detects a 3% error rate on the /checkout endpoint — up from the baseline of 0.1%. Alertmanager sends a Slack message to the on-call engineer. The engineer opens Grafana and sees that the error spike correlates exactly with the new deployment. Jaeger traces show errors originating in the payment service. The team rolls back the deployment within 10 minutes of the first alert.

Without monitoring, users would have complained for hours before anyone noticed.

Summary

  • Monitoring detects problems. Observability explains why they happen.
  • The three pillars of observability are metrics, logs, and traces.
  • Prometheus collects metrics through scraping. PromQL queries analyze them.
  • Grafana visualizes Prometheus data in interactive dashboards.
  • Alertmanager routes alerts to teams via Slack, email, and PagerDuty.
  • The ELK/EFK stack provides centralized log collection, search, and visualization.
  • SLOs and error budgets connect monitoring to business reliability goals.
