DevOps Logging and Log Management
Logs are timestamped records of events that occur inside an application or system. When something breaks in production — an error, a crash, unexpected behavior — logs are the primary source of evidence for understanding what happened and why.
In a DevOps environment, managing logs well means collecting them centrally, searching them quickly, and retaining them long enough for security and compliance purposes.
Why Centralized Logging?
A single application running on multiple servers generates logs on each individual machine. Checking each server manually is impractical. Centralized logging aggregates all logs from all services, servers, and containers into a single searchable system, which allows teams to:
- Search across all services simultaneously during an incident.
- Correlate events across different systems using timestamps.
- Set up alerts based on log patterns (e.g., "alert me when 'FATAL' appears in any service").
- Retain logs for compliance (PCI DSS, HIPAA often require 1–7 years of logs).
- Build dashboards from log data — error trends, user activity, API usage.
Structured Logging
Unstructured logs are plain text lines. Structured logs are machine-parseable records — typically JSON. Structured logs are far easier to query, filter, and analyze programmatically.
Unstructured Log (Hard to Query)
```
2025-03-15 10:32:45 ERROR Failed to process payment for order ORD-9821: timeout after 5000ms
```
Structured Log (Easy to Query and Filter)
```json
{
  "timestamp": "2025-03-15T10:32:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "version": "2.1.4",
  "trace_id": "abc123def456",
  "user_id": "USR-441",
  "order_id": "ORD-9821",
  "message": "Payment processing failed",
  "error": "Timeout after 5000ms",
  "duration_ms": 5001,
  "host": "app-server-03"
}
```
With structured logs, finding all payment failures for a specific user, in a specific time window, across all servers takes a single query.
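Producing such records is straightforward with a custom formatter. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class and the hard-coded `payment-service` name are illustrative, not a fixed convention — most teams use a library such as `structlog` or `python-json-logger` for the same effect.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (illustrative sketch)."""
    # Context fields we copy from the record if they were supplied via `extra`.
    CONTEXT_FIELDS = ("trace_id", "user_id", "order_id", "duration_ms")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured context rides along in `extra` and lands as top-level JSON keys.
logger.error(
    "Payment processing failed",
    extra={"order_id": "ORD-9821", "user_id": "USR-441", "duration_ms": 5001},
)
```

Everything passed through `extra` becomes a queryable field downstream, without string parsing.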
Log Levels
Log levels indicate the severity of a log entry. Every application should use consistent log levels:
| Level | Meaning | Example |
|---|---|---|
| TRACE | Extremely detailed — step-by-step execution | Entering method processPayment() |
| DEBUG | Developer-level detail for troubleshooting | Cart total calculated: $89.99 |
| INFO | Normal application events worth recording | User USR-441 logged in successfully |
| WARN | Something unexpected but not yet failing | API response time: 1200ms (threshold: 500ms) |
| ERROR | A failure occurred, needs attention | Database query failed: connection refused |
| FATAL | Application cannot continue | Out of memory — shutting down |
Production systems typically log at INFO level and above. DEBUG and TRACE are enabled only during active troubleshooting to avoid excessive noise and storage costs.
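The threshold behavior is built into most logging frameworks. A minimal sketch with Python's standard `logging` module (the `webapp` logger name is illustrative):

```python
import logging
from io import StringIO

# Capture output in memory so we can see exactly what the threshold admits.
buffer = StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("webapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # production threshold: INFO and above

logger.debug("Cart total calculated: $89.99")               # suppressed
logger.info("User USR-441 logged in successfully")          # recorded
logger.warning("API response time: 1200ms (threshold: 500ms)")  # recorded

print(buffer.getvalue())
```

Flipping the threshold to `logging.DEBUG` during an incident makes the suppressed detail appear without touching any call sites.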
The ELK Stack
The ELK Stack (Elasticsearch, Logstash, Kibana) is the most common centralized logging solution for on-premise and cloud environments.
Elasticsearch
A distributed search and analytics engine. It stores log data as JSON documents and provides near real-time full-text search across billions of log lines.
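Queries are expressed in the Elasticsearch Query DSL, a JSON structure sent to the `_search` endpoint. A sketch of the earlier "all payment failures for one user in a time window" query, built as a plain dict (field names assume the structured-log example above; the index pattern matches the Logstash config later in this section):

```python
import json

# Bool query: all clauses in `filter` must match, with no relevance scoring.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"term": {"service": "payment-service"}},
                {"term": {"user_id": "USR-441"}},
                {"range": {"@timestamp": {
                    "gte": "2025-03-15T10:00:00Z",
                    "lte": "2025-03-15T11:00:00Z",
                }}},
            ]
        }
    }
}

# POST this body to /app-logs-*/_search on the Elasticsearch host.
print(json.dumps(query, indent=2))
```

Because every clause is a `filter`, Elasticsearch can cache the result and skip scoring, which keeps this kind of incident query fast.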
Logstash
A data processing pipeline. It ingests logs from multiple sources, parses and transforms them, then ships them to Elasticsearch.
```
# logstash.conf - Collect, parse, and forward application logs
input {
  beats {
    port => 5044  # Receive logs from Filebeat agents
  }
}

filter {
  if [log_type] == "application" {
    json {
      source => "message"  # Parse JSON log lines
    }
    date {
      match  => ["timestamp", "ISO8601"]
      target => "@timestamp"
    }
  }
  if [level] == "ERROR" or [level] == "FATAL" {
    mutate {
      add_tag => ["alert_candidate"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```
Kibana
A web-based UI for searching, visualizing, and analyzing data in Elasticsearch. Engineers use Kibana to:
- Search logs using the Kibana Query Language (KQL).
- Build dashboards showing error rates, API latency, and user activity.
- Create alerts that trigger when specific log patterns appear.
- Correlate logs with application traces.
Filebeat – Lightweight Log Shipper
Filebeat is a lightweight agent installed on each server that watches log files and ships new entries to Logstash or Elasticsearch. It uses minimal CPU and memory — designed to run alongside any application.
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      service: webapp
      environment: production

output.logstash:
  hosts: ["logstash:5044"]
```
The EFK Stack – Kubernetes-Native Logging
In Kubernetes environments, Fluent Bit (lightweight) or Fluentd typically replace Logstash as the log collector — forming the EFK stack. Fluent Bit runs as a DaemonSet on every Kubernetes node, automatically collecting all container logs.
```yaml
# Fluent Bit DaemonSet configuration (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```
Cloud-Native Logging
AWS CloudWatch Logs
CloudWatch Logs automatically collects log data from Lambda functions, ECS containers, EC2 instances, and other AWS services. CloudWatch Logs Insights provides a powerful query language for searching them.
```
# CloudWatch Logs Insights query - Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
```
AWS CloudWatch Log Groups
Logs are organized into Log Groups (one per service/application) and Log Streams (one per instance or container). Retention policies automatically delete old logs:
```hcl
resource "aws_cloudwatch_log_group" "webapp" {
  name              = "/production/webapp"
  retention_in_days = 90  # Auto-delete logs older than 90 days
}
```
Log Alerting
Alerting on log patterns catches issues that metric-based alerts miss. Common log-based alerts:
- Any log with level FATAL appears → immediate alert to on-call engineer.
- Error rate exceeds 10 errors per minute → Slack notification.
- Log pattern "OutOfMemoryError" detected → page the on-call team.
- "401 Unauthorized" rate spikes → potential authentication attack.
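A rate-based rule like "more than 10 errors per minute" is typically a sliding-window count over log timestamps. A minimal sketch of that logic in Python (the class name and thresholds are illustrative — in practice the alerting backend, e.g. Kibana or CloudWatch, evaluates this for you):

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fire when more than `threshold` errors arrive within `window_seconds`."""

    def __init__(self, threshold=10, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()  # arrival times of recent errors

    def record_error(self, now=None):
        """Register one error; return True if the alert should fire."""
        now = time.time() if now is None else now
        self.timestamps.append(now)
        # Drop errors that have aged out of the sliding window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold

alert = ErrorRateAlert(threshold=10, window_seconds=60)
for i in range(11):
    fired = alert.record_error(now=1000.0 + i)  # 11 errors in 10 seconds
print(fired)  # the 11th error crosses the 10-per-minute threshold -> True
```

The same window-and-threshold shape underlies most log-based alert rules; only the match condition (FATAL level, a string pattern, a status code) changes.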
Log Retention and Compliance
Many regulations define minimum log retention periods:
| Regulation | Log Retention Requirement |
|---|---|
| PCI DSS | 12 months (3 months immediately accessible) |
| HIPAA | 6 years |
| SOC 2 | 12 months minimum |
| GDPR | Logs with personal data: minimize and pseudonymize |
Use tiered storage to manage costs: recent logs in hot storage (Elasticsearch), older logs in warm/cold storage (S3 or Glacier). AWS S3 Intelligent-Tiering automatically moves data to the cheapest tier based on access patterns.
Best Practices for Application Logging
- Always use structured JSON logging — never free-form text in production.
- Include a `trace_id` in every log entry to correlate a single request across multiple services.
- Never log sensitive data: passwords, credit card numbers, tokens, or personally identifiable information.
- Log at appropriate levels — INFO for normal events, ERROR only for actual failures.
- Include enough context in each log entry to understand what happened without reading surrounding lines.
- Test log output in CI — ensure log format is valid JSON and required fields are present.
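The "never log sensitive data" rule is easiest to enforce mechanically rather than by review. One common approach is a logging filter that redacts known-sensitive keys before a record is emitted — a sketch using Python's standard `logging.Filter`, where the key list and `key=value` pattern are assumptions about your log format:

```python
import logging
import re

SENSITIVE_KEYS = ("password", "token", "card_number")  # illustrative list

class RedactingFilter(logging.Filter):
    """Mask values of sensitive `key=value` pairs before a record is emitted."""
    PATTERN = re.compile(r"(%s)=\S+" % "|".join(SENSITIVE_KEYS), re.IGNORECASE)

    def filter(self, record):
        # Rewrite the message in place; returning True keeps the record.
        record.msg = self.PATTERN.sub(r"\1=[REDACTED]", str(record.msg))
        return True

logger = logging.getLogger("webapp")
logger.addFilter(RedactingFilter())
logger.warning("login failed for user bob password=hunter2")
# emitted as: "login failed for user bob password=[REDACTED]"
```

A filter like this is a safety net, not a substitute for keeping secrets out of log calls in the first place — anything it doesn't pattern-match still leaks.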
Summary
- Centralized logging aggregates all application and infrastructure logs into a single searchable system.
- Structured JSON logging makes logs queryable, filterable, and parseable at scale.
- The ELK stack (Elasticsearch + Logstash + Kibana) is the standard on-premise centralized logging solution.
- In Kubernetes, Fluent Bit runs as a DaemonSet to collect all container logs automatically.
- AWS CloudWatch Logs provides cloud-native log collection with powerful Insights queries.
- Log-based alerting catches application failures that metric thresholds cannot detect.
- Log retention policies balance compliance requirements with storage costs.
