Microservices Logging and Monitoring

In a monolith, you open one log file to debug a problem. In a microservices system with 50 services, a single user request might touch 8 different services. When something goes wrong, you need to see exactly what happened across all 8. Logging and monitoring give you that visibility.

The Three Pillars of Observability

Observability is the ability to understand what is happening inside your system from the outside. Three types of data together give you complete visibility.

PILLAR 1: LOGS
"What happened and when?"
Timestamped text records of events inside a service.
  2025-03-14 10:23:01 [INFO]  Order ORD-7701 received from user USR-212
  2025-03-14 10:23:02 [INFO]  Payment initiated for ORD-7701
  2025-03-14 10:23:04 [ERROR] Payment failed: card declined for ORD-7701

PILLAR 2: METRICS
"How is the system performing right now?"
Numerical measurements collected over time.
  - Requests per second: 1,240
  - Error rate: 0.3%
  - Average response time: 142ms
  - CPU usage: 61%

PILLAR 3: TRACES
"Where did a specific request go?"
A record of every service a request passed through and how long each took.
  Order Service: 5ms
  --> Payment Service: 120ms
      --> Bank API: 98ms (slowest part)
  --> Email Service: 12ms
  Total: 137ms

Centralized Logging

Each service writes logs. With 50 services on 100 servers, logs are scattered everywhere. Centralized logging collects all logs into one searchable system.

CENTRALIZED LOGGING FLOW
=========================
[Order Service]      --> logs --> +
[Payment Service]    --> logs --> | Log Collector --> [Central Log Store]
[Inventory Service]  --> logs --> +                  (Elasticsearch / Loki)
[Email Service]      --> logs --> +

Developer searches ONE place:
"Show me all logs related to order ORD-7701"
Gets logs from all 4 services that touched that order, in one view.

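The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular combination. Logstash collects and parses logs and ships them to Elasticsearch, which stores and indexes them. Kibana provides a dashboard to search and visualize them. Grafana Loki is a lighter alternative that indexes only labels rather than the full log text, and it works well with Kubernetes.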
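The "search ONE place" idea can be illustrated with a small in-memory sketch. This is not how Elasticsearch or Loki work internally; the `collect` and `search` helpers and the sample entries are hypothetical stand-ins for a log collector and a central store query.

```python
from typing import Iterable

LogEntry = dict[str, str]

# Each list stands in for the log stream shipped by one service (sample data).
order_logs: list[LogEntry] = [
    {"timestamp": "2025-03-14T10:23:01Z", "service": "order-service",
     "order_id": "ORD-7701", "event": "OrderReceived"},
    {"timestamp": "2025-03-14T10:23:03Z", "service": "order-service",
     "order_id": "ORD-7702", "event": "OrderReceived"},
]
payment_logs: list[LogEntry] = [
    {"timestamp": "2025-03-14T10:23:04Z", "service": "payment-service",
     "order_id": "ORD-7701", "event": "PaymentFailed"},
    {"timestamp": "2025-03-14T10:23:02Z", "service": "payment-service",
     "order_id": "ORD-7701", "event": "PaymentInitiated"},
]


def collect(*streams: Iterable[LogEntry]) -> list[LogEntry]:
    """Hypothetical collector: merge every service's stream into one store."""
    return [entry for stream in streams for entry in stream]


def search(store: Iterable[LogEntry], **filters: str) -> list[LogEntry]:
    """Return entries matching every filter, oldest first.

    Sorting by the timestamp string works because all entries use the
    same ISO 8601 UTC format.
    """
    matches = [e for e in store if all(e.get(k) == v for k, v in filters.items())]
    return sorted(matches, key=lambda e: e["timestamp"])


central_store = collect(order_logs, payment_logs)
for entry in search(central_store, order_id="ORD-7701"):
    print(entry["timestamp"], entry["service"], entry["event"])
```

The developer queries by a business identifier and never needs to know which service or server produced each line.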

Structured Logging

Plain text logs are hard for machines to search. Structured logs use a consistent format — usually JSON — with named fields. This makes querying fast and precise.

UNSTRUCTURED LOG (hard to query)
=================================
2025-03-14 10:23:04 ERROR Payment failed for order ORD-7701 user USR-212 card declined

STRUCTURED LOG (easy to query)
================================
{
  "timestamp": "2025-03-14T10:23:04Z",
  "level": "ERROR",
  "service": "payment-service",
  "order_id": "ORD-7701",
  "user_id": "USR-212",
  "event": "PaymentFailed",
  "reason": "card_declined"
}

Query: Find all ERROR events for order ORD-7701
Result: Instant match. No parsing required.
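One way to produce logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is a minimal example, not a production setup; the `JsonFormatter` class, the `build_logger` helper, and the convention of passing business fields under a `fields` key are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes an attribute
        # on the record; merge those named fields into the JSON object.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-service")
logger.error(
    "PaymentFailed",
    extra={"fields": {"order_id": "ORD-7701", "user_id": "USR-212",
                      "reason": "card_declined"}},
)
```

Because every field has a name, the central log store can filter on `level` and `order_id` directly instead of pattern-matching free text.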

Distributed Tracing

A trace tracks one request across all services. Every service adds a record (called a span) to the trace showing what it did and how long it took. All spans for one request share a Trace ID.

TRACE EXAMPLE: "Place Order" request
======================================
Trace ID: TRC-88421

[API Gateway]       Span: 3ms   (received request, routed to Order Service)
  |
  [Order Service]   Span: 8ms   (validated order, called Payment Service)
    |
    [Payment Service] Span: 134ms  (most time here; includes the Bank API call below)
      |
      [Bank API]      Span: 118ms  (external call, slowest step)
    |
    [Payment Service] Span: 16ms  (post-processing, updated DB; part of the 134ms above)
  |
  [Order Service]   Span: 5ms   (saved order, published event)
    |
    [Email Service]   Span: 22ms  (sent confirmation email)

Total: 172ms (3 + 8 + 134 + 5 + 22)
Bottleneck identified: Bank API takes 118ms (69% of total time)

OpenTelemetry is the standard for adding tracing to services. Jaeger and Zipkin are popular tools for storing and visualizing traces. Most cloud providers also offer managed tracing services.
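The bottleneck analysis in the trace example can be sketched in plain Python, without any tracing library. The `Span` class and `self_times` function are hypothetical. The parent/child layout and the inclusive durations are one reconstruction of the example: each span's duration includes its children, and children are assumed to run one after another. Real tracing tools record start and end timestamps instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: int          # includes time spent in child spans


def self_times(spans: list[Span]) -> dict[str, int]:
    """Time each span spent on its own work, excluding direct children."""
    child_total: dict[str, int] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.service: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


TRACE = "TRC-88421"
spans = [
    Span(TRACE, "a", None, "API Gateway", 172),
    Span(TRACE, "b", "a", "Order Service", 169),
    Span(TRACE, "c", "b", "Payment Service", 134),
    Span(TRACE, "d", "c", "Bank API", 118),
    Span(TRACE, "e", "b", "Email Service", 22),
]

own = self_times(spans)
root = next(s for s in spans if s.parent_id is None)
slowest = max(own, key=own.get)
share = round(100 * own[slowest] / root.duration_ms)
print(f"{slowest}: {own[slowest]}ms ({share}% of {root.duration_ms}ms)")
# prints: Bank API: 118ms (69% of 172ms)
```

Subtracting child time from each span is what separates "this service is slow" from "this service is waiting on something slow".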

Metrics and Alerting

Metrics capture the health numbers of every service. Prometheus is the most widely used metrics collection tool in Kubernetes environments. Grafana visualizes Prometheus metrics in dashboards.

KEY METRICS TO TRACK PER SERVICE
==================================
Request Rate     : How many requests per second is the service handling?
Error Rate       : What percentage of requests are failing?
Latency          : How long do requests take? (p50, p95, p99)
Saturation       : How close is the service to its resource limits?

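These four closely match the "four golden signals" described in Google's SRE book (latency, traffic, errors, saturation). The first three also make up the "RED" method (Rate, Errors, Duration), while saturation comes from the "USE" method (Utilization, Saturation, Errors).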
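The first three metrics in the list above can be derived from raw request records. The sketch below is illustrative only: the `Request` class, the `summarize` function, and the choice to count HTTP 5xx responses as errors are assumptions, and the percentile uses the simple nearest-rank definition. Saturation is left out because it comes from resource metrics such as CPU and memory, not from request records.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of
    the samples at or below it. Expects a non-empty, sorted list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Request rate, error rate, and latency percentiles for one time window."""
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "request_rate": len(requests) / window_seconds,
        "error_rate_pct": 100 * errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Sample window: 100 requests in 10 seconds, latencies 1..100 ms, 3 failures.
sample = [Request(latency_ms=float(i), status=500 if i > 97 else 200)
          for i in range(1, 101)]
print(summarize(sample, window_seconds=10))
```

Percentiles matter because an average hides the slow tail: here the p99 is roughly twice the p50 even though only a few requests are that slow.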

Alerting Rules

Alerts fire automatically when a metric crosses a threshold. On-call engineers receive a notification immediately — they do not wait for a user to report a problem.

ALERT EXAMPLES
===============
ALERT: PaymentService error rate > 1% for 5 minutes
  --> Page on-call engineer immediately

ALERT: OrderService p99 latency > 2000ms for 3 minutes
  --> Send Slack notification to team channel

ALERT: InventoryService pod count dropped below 2
  --> Kubernetes alert: service is running with too few replicas
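The "for 5 minutes" part of a rule matters: it keeps a single bad sample from paging anyone. A minimal sketch of that evaluation logic follows, assuming one metric sample per minute. `AlertRule` and `should_fire` are hypothetical names, not the API of any alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float
    for_minutes: int  # how long the breach must last; must be at least 1


def should_fire(rule: AlertRule, samples_per_minute: list[float]) -> bool:
    """True only when each of the most recent `for_minutes` samples
    exceeds the threshold. One healthy sample resets the condition."""
    if len(samples_per_minute) < rule.for_minutes:
        return False  # not enough history yet
    recent = samples_per_minute[-rule.for_minutes:]
    return all(value > rule.threshold for value in recent)


rule = AlertRule(name="PaymentServiceHighErrorRate", threshold=1.0, for_minutes=5)

# Error rate (%) per minute, oldest first.
sustained = [0.2, 1.4, 1.6, 2.0, 1.2, 1.1]
brief_spike = [1.4, 1.6, 0.8, 1.2, 1.1]

print(should_fire(rule, sustained))    # prints: True
print(should_fire(rule, brief_spike))  # prints: False
```

Choosing the duration is a trade-off: a longer window means fewer false alarms but a slower response to real incidents.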

Health Dashboards

A dashboard shows the current state of the entire system at a glance. Different views serve different audiences.

DASHBOARD TYPES
================
Executive Dashboard:
  - Overall system uptime
  - Total orders processed today
  - Revenue per hour

Engineering Dashboard:
  - Error rate per service
  - Latency heatmap
  - Pod counts and CPU per service

On-Call Dashboard:
  - Active alerts
  - Recent deployments (changes that may have caused issues)
  - Top error messages in the last 30 minutes

Correlation IDs

A correlation ID is a unique identifier assigned to a user request when it enters the system. Every service passes this ID along and includes it in every log line. This lets you search all logs across all services for one specific user interaction.

CORRELATION ID FLOW
====================
User clicks "Place Order"
API Gateway assigns: X-Correlation-ID: CID-55901

Order Service logs:   [CID-55901] Order ORD-7701 created
Payment Service logs: [CID-55901] Charging card for ORD-7701
Email Service logs:   [CID-55901] Sending confirmation for ORD-7701 to user USR-212

Search logs for CID-55901 --> see the full story of this one user's order
across all services, in chronological order.
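The propagation rule is simple: reuse the incoming ID if there is one, create one only at the edge. Here is a sketch under simplifying assumptions: headers are modeled as a plain case-sensitive dict (real HTTP headers are case-insensitive), and `ensure_correlation_id` and `log_line` are hypothetical helpers.

```python
import uuid

HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the headers that always carries a correlation ID.

    An existing ID is kept unchanged; a new one is minted only when the
    request has none, which should happen only at the API Gateway.
    """
    outgoing = dict(headers)
    outgoing.setdefault(HEADER, f"CID-{uuid.uuid4().hex[:12]}")
    return outgoing


def log_line(headers: dict[str, str], message: str) -> str:
    """Prefix a log message with the request's correlation ID."""
    return f"[{headers[HEADER]}] {message}"


# The gateway receives a request without an ID and assigns one.
gateway_headers = ensure_correlation_id({})
# Downstream services forward the headers they received.
order_headers = ensure_correlation_id(gateway_headers)
payment_headers = ensure_correlation_id(order_headers)

print(log_line(order_headers, "Order ORD-7701 created"))
print(log_line(payment_headers, "Charging card for ORD-7701"))
```

Both lines print with the same ID prefix, which is what makes the cross-service search possible.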

Logging Best Practices

- Always include the service name, timestamp, and severity level (INFO, WARN, ERROR) in every log line.
- Include the correlation ID and relevant business identifiers (order_id, user_id) in every log line.
- Log the start and end of important operations so you can measure duration.
- Never log sensitive data: no passwords, card numbers, or personal identification data.
- Set log retention policies: keep production logs for 30–90 days and archive or delete older logs to manage storage costs.
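The rule about sensitive data is easier to follow when it is enforced in code before a log entry is written. The sketch below shows one possible approach. The `redact` function, the `SENSITIVE_KEYS` set, and the card-number pattern are assumptions for illustration. The pattern is a heuristic that can also match other long digit sequences, so it does not replace keeping sensitive values out of log calls in the first place.

```python
import re

# Heuristic: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# Field names whose values are never written to logs (hypothetical list).
SENSITIVE_KEYS = {"password", "card_number", "cvv"}

MASK = "[REDACTED]"


def redact(fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of the log fields that is safe to write.

    Values under sensitive keys are masked entirely; in all other values,
    anything that looks like a card number is masked.
    """
    clean: dict[str, str] = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            clean[key] = MASK
        else:
            clean[key] = CARD_PATTERN.sub(MASK, value)
    return clean


entry = {
    "order_id": "ORD-7701",
    "password": "hunter2",
    "note": "card 4111 1111 1111 1111 declined",
}
print(redact(entry))
```

Short identifiers such as `ORD-7701` pass through unchanged, so the entry stays searchable after redaction.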
