DevOps Logging and Log Management

Logs are timestamped records of events that occur inside an application or system. When something breaks in production — an error, a crash, unexpected behavior — logs are the primary source of evidence for understanding what happened and why.

In a DevOps environment, managing logs well means collecting them centrally, searching them quickly, and retaining them long enough for security and compliance purposes.

Why Centralized Logging?

A single application running on multiple servers generates logs on each individual machine. Checking each server manually is impractical. Centralized logging aggregates all logs from all services, servers, and containers into a single searchable system.

  • Search across all services simultaneously during an incident.
  • Correlate events across different systems using timestamps.
  • Set up alerts based on log patterns (e.g., "alert me when 'FATAL' appears in any service").
  • Retain logs for compliance (PCI DSS, HIPAA often require 1–7 years of logs).
  • Build dashboards from log data — error trends, user activity, API usage.

Structured Logging

Unstructured logs are plain text lines. Structured logs are machine-parseable records — typically JSON. Structured logs are far easier to query, filter, and analyze programmatically.

Unstructured Log (Hard to Query)

2025-03-15 10:32:45 ERROR Failed to process payment for order ORD-9821: timeout after 5000ms

Structured Log (Easy to Query and Filter)

{
  "timestamp": "2025-03-15T10:32:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "version": "2.1.4",
  "trace_id": "abc123def456",
  "user_id": "USR-441",
  "order_id": "ORD-9821",
  "message": "Payment processing failed",
  "error": "Timeout after 5000ms",
  "duration_ms": 5001,
  "host": "app-server-03"
}

With structured logs, finding all payment failures for a specific user, in a specific time window, across all servers takes a single query.
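Emitting logs in this shape is straightforward from application code. Below is a minimal sketch in Python using only the standard library (the service name and field list mirror the example above; in production a library such as structlog or python-json-logger is a common choice):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",  # would normally come from config
            "message": record.getMessage(),
        }
        # Merge structured context passed via the `extra` argument
        for key in ("trace_id", "user_id", "order_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "order_id": "ORD-9821", "duration_ms": 5001},
)
```

Because every line is a self-contained JSON object, downstream tools can index each field without any parsing rules.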

Log Levels

Log levels indicate the severity of a log entry. Every application should use consistent log levels:

Level | Meaning                                     | Example
------|---------------------------------------------|---------------------------------------------
TRACE | Extremely detailed — step-by-step execution | Entering method processPayment()
DEBUG | Developer-level detail for troubleshooting  | Cart total calculated: $89.99
INFO  | Normal application events worth recording   | User USR-441 logged in successfully
WARN  | Something unexpected but not yet failing    | API response time: 1200ms (threshold: 500ms)
ERROR | A failure occurred, needs attention         | Database query failed: connection refused
FATAL | Application cannot continue                 | Out of memory — shutting down

Production systems typically log at INFO level and above. DEBUG and TRACE are enabled only during active troubleshooting to avoid excessive noise and storage costs.
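In practice this threshold is a one-line configuration. A small sketch in Python (note that Python's standard logging module treats FATAL as an alias for CRITICAL and has no built-in TRACE level):

```python
import logging
import sys

# INFO threshold: DEBUG messages below are silently dropped
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")

log.debug("Cart total calculated: $89.99")        # suppressed at INFO level
log.info("User USR-441 logged in successfully")   # emitted
log.warning("API response time: 1200ms (threshold: 500ms)")
log.error("Database query failed: connection refused")
```

Flipping `level=logging.DEBUG` during an incident re-enables the detailed output without any code changes.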

The ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) is the most common centralized logging solution for on-premise and cloud environments.

Elasticsearch

A distributed search and analytics engine. Stores log data as JSON documents and provides near real-time full-text search across billions of log lines in seconds.
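For illustration, a hedged sketch of the kind of query Elasticsearch answers, sent as the body of a search request against the daily indices (field names follow the structured-log example above and are assumed to be keyword-mapped):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "ERROR" } },
        { "term":  { "user_id": "USR-441" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [ { "@timestamp": "desc" } ]
}
```

This is the query behind the "all payment failures for one user in one time window" example from the structured-logging section.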

Logstash

A data processing pipeline. Ingests logs from multiple sources, parses and transforms them, then ships to Elasticsearch.

# logstash.conf - Collect, parse, and forward application logs
input {
  beats {
    port => 5044    # Receive logs from Filebeat agents
  }
}

filter {
  if [log_type] == "application" {
    json {
      source => "message"   # Parse JSON log lines
    }

    date {
      match => ["timestamp", "ISO8601"]
      target => "@timestamp"
    }
  }

  if [level] == "ERROR" or [level] == "FATAL" {
    mutate {
      add_tag => ["alert_candidate"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}

Kibana

A web-based UI for searching, visualizing, and analyzing data in Elasticsearch. Engineers use Kibana to:

  • Search logs using the Kibana Query Language (KQL).
  • Build dashboards showing error rates, API latency, and user activity.
  • Create alerts that trigger when specific log patterns appear.
  • Correlate logs with application traces.
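As a taste of KQL, a couple of hedged example queries (field names assume the structured-log schema shown earlier in this section):

```
level: "ERROR" and service: "payment-service"
level: ("ERROR" or "FATAL") and user_id: "USR-441"
```

Typed into Kibana's Discover search bar, these filter the indexed log documents in near real time.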

Filebeat – Lightweight Log Shipper

Filebeat is a lightweight agent installed on each server that watches log files and ships new entries to Logstash or Elasticsearch. It uses minimal CPU and memory — designed to run alongside any application.

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      service: webapp
      environment: production

output.logstash:
  hosts: ["logstash:5044"]

The EFK Stack – Kubernetes-Native Logging

In Kubernetes environments, Fluent Bit (lightweight) or Fluentd typically replaces Logstash as the log collector, forming the EFK stack. Fluent Bit runs as a DaemonSet on every Kubernetes node, automatically collecting all container logs.

# Fluent Bit DaemonSet configuration (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit   # must match the selector above
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers

Cloud-Native Logging

AWS CloudWatch Logs

CloudWatch Logs collects log data from Lambda functions, ECS containers, EC2 instances, and AWS services automatically. Log Insights provides a powerful query language for searching.

# CloudWatch Logs Insights query - Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

AWS CloudWatch Log Groups

Logs are organized into Log Groups (one per service/application) and Log Streams (one per instance or container). Retention policies automatically delete old logs:

resource "aws_cloudwatch_log_group" "webapp" {
  name              = "/production/webapp"
  retention_in_days = 90   # Auto-delete logs older than 90 days
}

Log Alerting

Alerting on log patterns catches issues that metric-based alerts miss. Common log-based alerts:

  • Any log with level FATAL appears → immediate alert to on-call engineer.
  • Error rate exceeds 10 errors per minute → Slack notification.
  • Log pattern "OutOfMemoryError" detected → page the on-call team.
  • "401 Unauthorized" rate spikes → potential authentication attack.
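In AWS, the first of these can be wired up with a metric filter plus an alarm. A hedged Terraform sketch (resource names and the SNS topic are hypothetical; the log group matches the earlier example):

```hcl
# Sketch: turn a log pattern into a CloudWatch alarm
resource "aws_cloudwatch_log_metric_filter" "fatal" {
  name           = "fatal-log-entries"
  log_group_name = "/production/webapp"
  pattern        = "FATAL"

  metric_transformation {
    name      = "FatalLogCount"
    namespace = "WebApp/Logs"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "fatal" {
  alarm_name          = "webapp-fatal-logs"
  namespace           = "WebApp/Logs"
  metric_name         = "FatalLogCount"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]  # hypothetical SNS topic wired to paging
}
```

The same pattern (filter → metric → alarm) covers the error-rate and OutOfMemoryError cases with different patterns and thresholds.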

Log Retention and Compliance

Many regulations define minimum log retention periods:

Regulation | Log Retention Requirement
-----------|--------------------------------------------------
PCI DSS    | 12 months (3 months immediately accessible)
HIPAA      | 6 years
SOC 2      | 12 months minimum
GDPR       | Logs with personal data: minimize and pseudonymize

Use tiered storage to manage costs: recent logs in hot storage (Elasticsearch), older logs in warm/cold storage (S3 or Glacier). AWS S3 Intelligent-Tiering automatically moves data to the cheapest tier based on access patterns.
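For logs exported to S3, tiering can be expressed as a lifecycle rule. A hedged Terraform sketch (bucket name and day thresholds are illustrative assumptions):

```hcl
# Sketch: tiered archive bucket for exported logs
resource "aws_s3_bucket_lifecycle_configuration" "log_archive" {
  bucket = aws_s3_bucket.log_archive.id   # hypothetical bucket resource

  rule {
    id     = "tier-old-logs"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"   # warm storage after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER"       # cold storage after 90 days
    }

    expiration {
      days = 2555                     # ~7 years, then delete
    }
  }
}
```

Thresholds should be tuned to the retention table above: for example, PCI DSS's "3 months immediately accessible" maps naturally onto the hot/warm boundary.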

Best Practices for Application Logging

  • Always use structured JSON logging — never free-form text in production.
  • Include a trace_id in every log entry to correlate a single request across multiple services.
  • Never log sensitive data: passwords, credit card numbers, tokens, or personally identifiable information.
  • Log at appropriate levels — INFO for normal events, ERROR only for actual failures.
  • Include enough context in each log entry to understand what happened without reading surrounding lines.
  • Test log output in CI — ensure log format is valid JSON and required fields are present.
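The last point is easy to automate. A minimal sketch of a CI check in Python (the required-field set and accepted levels are assumptions based on the schema used throughout this article):

```python
import json

REQUIRED_FIELDS = {"timestamp", "level", "service", "message"}
VALID_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

def validate_log_line(line: str) -> list[str]:
    """Return a list of problems with one emitted log line (empty = valid)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if entry.get("level") not in VALID_LEVELS:
        problems.append(f"unknown level: {entry.get('level')!r}")
    return problems

# A well-formed structured line passes; a free-form line does not
good = ('{"timestamp": "2025-03-15T10:32:45Z", "level": "ERROR", '
        '"service": "payment-service", "message": "Payment processing failed"}')
bad = "2025-03-15 10:32:45 ERROR something broke"
assert validate_log_line(good) == []
assert validate_log_line(bad) == ["not valid JSON"]
```

Running a check like this against captured test output catches schema drift before it breaks dashboards and alerts in production.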

Summary

  • Centralized logging aggregates all application and infrastructure logs into a single searchable system.
  • Structured JSON logging makes logs queryable, filterable, and parseable at scale.
  • The ELK stack (Elasticsearch + Logstash + Kibana) is the standard on-premise centralized logging solution.
  • In Kubernetes, Fluent Bit runs as a DaemonSet to collect all container logs automatically.
  • AWS CloudWatch Logs provides cloud-native log collection with powerful Insights queries.
  • Log-based alerting catches application failures that metric thresholds cannot detect.
  • Log retention policies balance compliance requirements with storage costs.
