Microservices Logging and Monitoring
In a monolith, you open one log file to debug a problem. In a microservices system with 50 services, a single user request might touch 8 of them. When something goes wrong, you need to see exactly what happened across all 8. Logging and monitoring give you that visibility.
The Three Pillars of Observability
Observability is the ability to understand what is happening inside your system from the outside. Three types of data together give you complete visibility.
PILLAR 1: LOGS
"What happened and when?"
Timestamped text records of events inside a service.
2025-03-14 10:23:01 [INFO] Order ORD-7701 received from user USR-212
2025-03-14 10:23:02 [INFO] Payment initiated for ORD-7701
2025-03-14 10:23:04 [ERROR] Payment failed: card declined for ORD-7701
PILLAR 2: METRICS
"How is the system performing right now?"
Numerical measurements collected over time.
- Requests per second: 1,240
- Error rate: 0.3%
- Average response time: 142ms
- CPU usage: 61%
PILLAR 3: TRACES
"Where did a specific request go?"
A record of every service a request passed through and how long each took.
Order Service: 5ms
--> Payment Service: 120ms
    --> Bank API: 98ms (included in the 120ms above; slowest part)
--> Email Service: 12ms
Total: 137ms
Centralized Logging
Each service writes logs. With 50 services on 100 servers, logs are scattered everywhere. Centralized logging collects all logs into one searchable system.
CENTRALIZED LOGGING FLOW
=========================
[Order Service]     --> logs --> +
[Payment Service]   --> logs --> |  Log Collector --> [Central Log Store]
[Inventory Service] --> logs --> |                    (Elasticsearch / Loki)
[Email Service]     --> logs --> +
Developer searches ONE place: "Show me all logs related to order ORD-7701"
Gets logs from all 4 services that touched that order, in one view.
The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular combination. Logstash collects and parses the logs, Elasticsearch stores and indexes them, and Kibana provides a dashboard to search and visualize them. Grafana Loki is a lighter alternative that works well with Kubernetes.
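The "search ONE place" idea can be sketched in a few lines. This is only an in-memory stand-in, not how Elasticsearch or Loki work internally: the function name search_logs, the streams dictionary, and the event names other than PaymentFailed are assumptions made for illustration.

```python
def search_logs(streams, order_id):
    """Merge per-service log entries and return those for one order, oldest first."""
    matches = [
        entry
        for entries in streams.values()
        for entry in entries
        if entry.get("order_id") == order_id
    ]
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings.
    return sorted(matches, key=lambda entry: entry["timestamp"])


# Hypothetical logs as they might arrive from two services.
streams = {
    "order-service": [
        {"timestamp": "2025-03-14T10:23:01Z", "service": "order-service",
         "order_id": "ORD-7701", "event": "OrderReceived"},
        {"timestamp": "2025-03-14T10:23:03Z", "service": "order-service",
         "order_id": "ORD-7702", "event": "OrderReceived"},
    ],
    "payment-service": [
        {"timestamp": "2025-03-14T10:23:04Z", "service": "payment-service",
         "order_id": "ORD-7701", "event": "PaymentFailed"},
        {"timestamp": "2025-03-14T10:23:02Z", "service": "payment-service",
         "order_id": "ORD-7701", "event": "PaymentInitiated"},
    ],
}

results = search_logs(streams, "ORD-7701")
for entry in results:
    print(entry["timestamp"], entry["service"], entry["event"])
```

A real log store does the same job with an index instead of a full scan, which is what keeps the search fast at production volume.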
Structured Logging
Plain text logs are hard for machines to search. Structured logs use a consistent format — usually JSON — with named fields. This makes querying fast and precise.
UNSTRUCTURED LOG (hard to query)
=================================
2025-03-14 10:23:04 ERROR Payment failed for order ORD-7701 user USR-212 card declined
STRUCTURED LOG (easy to query)
================================
{
"timestamp": "2025-03-14T10:23:04Z",
"level": "ERROR",
"service": "payment-service",
"order_id": "ORD-7701",
"user_id": "USR-212",
"event": "PaymentFailed",
"reason": "card_declined"
}
Query: Find all ERROR events for order ORD-7701
Result: Instant match. No parsing required.
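A minimal sketch of how a service could emit logs in this shape, using only Python's standard logging and json modules. The JsonFormatter class, the build_logger helper, and the convention of passing business fields under a "fields" key are assumptions for illustration, not a standard API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} lands on the record.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter(service))
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


# Usage: capture the output in a buffer so it can be inspected.
buffer = io.StringIO()
logger = build_logger("payment-service", stream=buffer)
logger.error(
    "PaymentFailed",
    extra={"fields": {"order_id": "ORD-7701", "user_id": "USR-212",
                      "reason": "card_declined"}},
)
parsed = json.loads(buffer.getvalue())
print(parsed["level"], parsed["order_id"], parsed["reason"])
```

Because every line is valid JSON with the same field names, the log store can filter on level and order_id directly instead of pattern-matching free text.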
Distributed Tracing
A trace tracks one request across all services. Every service adds a record (called a span) to the trace showing what it did and how long it took. All spans for one request share a Trace ID.
TRACE EXAMPLE: "Place Order" request
======================================
Trace ID: TRC-88421
[API Gateway] Span: 3ms (received request, routed to Order Service)
|
[Order Service] Span: 8ms (validated order, called Payment Service)
|
[Payment Service] Span: 134ms (most time here)
|
[Bank API] Span: 118ms (external call, slowest step)
|
[Payment Service] Span: 16ms (post-processing, updated DB)
|
[Order Service] Span: 5ms (saved order, published event)
|
[Email Service] Span: 22ms (sent confirmation email)
Total: 172ms
Bottleneck identified: Bank API takes 118ms (69% of total time)
OpenTelemetry is the standard for adding tracing to services. Jaeger and Zipkin are popular tools for storing and visualizing traces. Most cloud providers also offer managed tracing services.
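To make the span idea concrete, here is a small single-process sketch built on the standard library only. It is not the OpenTelemetry API: the Tracer class and its span method are hypothetical, and a real system would pass the trace ID between services in request headers rather than share one object.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans that share one trace ID (in-memory, illustration only)."""

    def __init__(self, trace_id=None):
        self.trace_id = trace_id or uuid.uuid4().hex
        self.spans = []   # finished spans, in completion order
        self._open = []   # stack of spans still in progress

    @contextmanager
    def span(self, service):
        record = {
            "trace_id": self.trace_id,
            "service": service,
            "parent": self._open[-1]["service"] if self._open else None,
        }
        self._open.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._open.pop()
            self.spans.append(record)


tracer = Tracer(trace_id="TRC-88421")
with tracer.span("order-service"):
    with tracer.span("payment-service"):
        with tracer.span("bank-api"):
            time.sleep(0.02)   # stand-in for the slow external call
    with tracer.span("email-service"):
        time.sleep(0.005)      # stand-in for sending the email

by_service = {span["service"]: span for span in tracer.spans}
for span in tracer.spans:
    print(span["trace_id"], span["service"], span["parent"],
          round(span["duration_ms"], 1))
```

Each span records its parent, so a viewer can rebuild the call tree and see that the time spent in payment-service is mostly time spent waiting on bank-api.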
Metrics and Alerting
Metrics capture the health numbers of every service. Prometheus is the most widely used metrics collection tool in Kubernetes environments. Grafana visualizes Prometheus metrics in dashboards.
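KEY METRICS TO TRACK PER SERVICE
==================================
Request Rate : How many requests per second is the service handling?
Error Rate   : What percentage of requests are failing?
Latency      : How long do requests take? (p50, p95, p99)
Saturation   : How close is the service to its resource limits?
These four correspond to the "four golden signals" described in Google's SRE practice. Two related checklists overlap with them: the RED method (Rate, Errors, Duration) for request-driven services, and the USE method (Utilization, Saturation, Errors) for resources such as CPU and memory.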
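The first three metrics can be derived from raw request records, as the sketch below shows. It uses the nearest-rank definition of a percentile, which is one of several common definitions, and treats HTTP status codes of 500 and above as failures; both choices, along with the function names, are assumptions. Saturation is left out because it comes from resource data rather than request data.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, statuses, window_seconds):
    """Summarize one time window of requests for a single service."""
    errors = sum(1 for status in statuses if status >= 500)
    return {
        "request_rate": len(statuses) / window_seconds,
        "error_rate_pct": 100 * errors / len(statuses),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Hypothetical window: 100 requests in 10 seconds, 3 of them failed.
latencies = list(range(1, 101))        # 1ms, 2ms, ..., 100ms
statuses = [200] * 97 + [500] * 3
summary = summarize(latencies, statuses, window_seconds=10)
print(summary)
```

Percentiles matter because an average hides the slow tail: p99 tells you what the slowest 1% of requests experience.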
Alerting Rules
Alerts fire automatically when a metric stays past a threshold for a set period. The team is notified right away instead of waiting for a user to report a problem.
ALERT EXAMPLES
===============
ALERT: PaymentService error rate > 1% for 5 minutes
  --> Page on-call engineer immediately
ALERT: OrderService p99 latency > 2000ms for 3 minutes
  --> Send Slack notification to team channel
ALERT: InventoryService pod count dropped below 2
  --> Kubernetes alert: service is running with too few replicas
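The "for 5 minutes" part of a rule exists to ignore short spikes. A rough sketch of that check, assuming one metric sample per minute stored oldest first; the function name should_fire and the sample values are made up for illustration, and real alerting tools evaluate this logic for you.

```python
def should_fire(samples, threshold, for_minutes):
    """True when each of the last `for_minutes` samples exceeds `threshold`.

    `samples` holds one value per minute, oldest first.
    """
    if len(samples) < for_minutes:
        return False  # not enough history to judge yet
    return all(value > threshold for value in samples[-for_minutes:])


# PaymentService error rate in percent, one sample per minute (hypothetical).
sustained = [0.2, 1.4, 1.6, 1.2, 1.9, 2.3]
brief_spike = [0.2, 1.4, 1.6, 0.8, 1.9, 2.3]

print(should_fire(sustained, threshold=1.0, for_minutes=5))
print(should_fire(brief_spike, threshold=1.0, for_minutes=5))
```

The second series dips back under 1% partway through, so the five-minute condition is not met and nobody gets paged.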
Health Dashboards
A dashboard shows the current state of the entire system at a glance. Different views serve different audiences.
DASHBOARD TYPES
================
Executive Dashboard:
- Overall system uptime
- Total orders processed today
- Revenue per hour
Engineering Dashboard:
- Error rate per service
- Latency heatmap
- Pod counts and CPU per service
On-Call Dashboard:
- Active alerts
- Recent deployments (changes that may have caused issues)
- Top error messages in the last 30 minutes
Correlation IDs
A correlation ID is a unique identifier assigned to a user request when it enters the system. Every service passes this ID along and includes it in every log line. This lets you search all logs across all services for one specific user interaction.
CORRELATION ID FLOW
====================
User clicks "Place Order"
API Gateway assigns:   X-Correlation-ID: CID-55901
Order Service logs:    [CID-55901] Order ORD-7701 created
Payment Service logs:  [CID-55901] Charging card for ORD-7701
Email Service logs:    [CID-55901] Sending confirmation for ORD-7701 to user USR-212
Search logs for CID-55901 --> see the full story of this one user's order
across all services, in chronological order.
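One way a Python service might carry the correlation ID through its own code is a context variable plus a logging filter, sketched below with the standard library. The helper ensure_correlation_id and the CorrelationFilter class are hypothetical names, and the header lookup is kept case-sensitive for brevity even though HTTP header names are not.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID for whichever request is currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def ensure_correlation_id(headers):
    """Reuse the incoming ID, or create one if this service is the entry point.

    Returns the headers to send on calls to downstream services.
    """
    cid = headers.get("X-Correlation-ID") or f"CID-{uuid.uuid4().hex[:8]}"
    correlation_id.set(cid)
    return {**headers, "X-Correlation-ID": cid}


# Usage: log to a buffer so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("[%(correlation_id)s] %(message)s"))
logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

outgoing = ensure_correlation_id({"X-Correlation-ID": "CID-55901"})
logger.info("Order ORD-7701 created")
print(buffer.getvalue(), end="")
```

The filter means individual log calls never have to mention the ID, so it cannot be forgotten on a single line.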
Logging Best Practices
- Always include the service name, timestamp, and severity level (INFO, WARN, ERROR) in every log line.
- Include the correlation ID and relevant business identifiers (order_id, user_id) in every log line.
- Log the start and end of important operations so you can measure duration.
- Never log sensitive data — no passwords, card numbers, or personal identification data.
- Set log retention policies — keep production logs for 30–90 days and archive or delete older logs to manage storage costs.
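The rule about sensitive data is easier to follow when it is enforced in code rather than left to each log call. A minimal sketch, assuming log fields are a flat dictionary; the key list and the redact function are illustrative and would need to match your own field names and handle nested data.

```python
# Hypothetical set of field names that must never reach the logs.
SENSITIVE_KEYS = {"password", "card_number", "cvv", "national_id"}


def redact(fields):
    """Return a copy of `fields` with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


safe = redact({
    "order_id": "ORD-7701",
    "user_id": "USR-212",
    "card_number": "4111111111111111",  # well-known test card number
})
print(safe)
```

Calling this once inside the logging helper keeps business identifiers such as order_id searchable while masking the values that should never be stored.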
