SRE Monitoring: Metrics, Logs, and Traces

A pilot flies a plane using dozens of instruments — speed, altitude, fuel level, engine temperature. Each instrument answers a different question. Taken together, they give a complete picture of what the aircraft is doing right now. SRE teams instrument their software the same way, using three primary tools: metrics, logs, and traces.

The Three Pillars of Observability

Observability is the ability to understand what is happening inside a system by looking at its outputs. The three pillars — metrics, logs, and traces — each provide a different type of output.

QUESTION                              PILLAR TO USE
---------------------------------------------------
How is the system performing?         Metrics
What happened at 3:47 PM?             Logs
Why did this request take so long?    Traces

Metrics

A metric is a number measured over time. It answers: how much, how fast, how often?

Types of Metrics

  • Counter: A number that only goes up while the process runs; it returns to zero only on restart. Example: total number of HTTP requests received since startup.
  • Gauge: A number that goes up and down. Example: current memory usage in megabytes.
  • Histogram: Distributes measurements into buckets. Example: how many requests completed in 0-100ms, 100-200ms, 200-500ms, and over 500ms.
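The three types can be sketched in a few lines of plain Python. This is a minimal illustration, not a real metrics library: the class names, the metric names, and the bucket bounds (100, 200, and 500 ms, taken from the histogram example above) are all assumptions made for the sketch.

```python
import bisect


class Counter:
    """A value that only increases while the process is running."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """A value that can move up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; bounds are upper limits in ms."""

    def __init__(self, bounds=(100, 200, 500)):
        self.bounds = list(bounds)
        # One count per bound, plus a final overflow bucket ("over 500ms").
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value):
        # bisect_left places a value equal to a bound in that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


# Hypothetical metrics for a small web service.
requests_total = Counter()
memory_mb = Gauge()
latency_ms = Histogram()

for duration in (43, 120, 480, 950):
    requests_total.inc()
    latency_ms.observe(duration)
memory_mb.set(512.0)

print(requests_total.value, memory_mb.value, latency_ms.counts)
```

Each of the four sample durations lands in a different bucket, so the histogram shows the shape of the latency distribution, not only its average.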

The Four Golden Signals

Google's Site Reliability Engineering book recommends monitoring four key signals for any user-facing system. These are called the Four Golden Signals:

1. LATENCY     — How long do requests take?
                 Track successful requests AND failed ones separately.

2. TRAFFIC     — How much demand is the system receiving?
                 Example: requests per second, active users.

3. ERRORS      — What fraction of requests are failing?
                 Example: HTTP 500 rate, failed payment rate.

4. SATURATION  — How full is the system?
                 Example: CPU at 90%, disk at 95% full.

Watch these four signals first. They cover the majority of problems users experience.
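As a rough sketch, the four signals can be computed from a window of request records. The Request type, the field names, the use of HTTP 5xx as the definition of "failed", and the choice of CPU as the saturation measure are all assumptions made for this example; a real service would pull these values from its metrics system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def mean(values):
    """Average of a list, or None when the list is empty."""
    return sum(values) / len(values) if values else None


def golden_signals(requests, window_seconds, cpu_percent):
    # Latency is tracked separately for successful and failed requests.
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        "latency_ok_ms": mean(ok),
        "latency_failed_ms": mean(failed),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": cpu_percent / 100,
    }


# Four hypothetical requests observed during a 2-second window.
window = [
    Request(40, 200),
    Request(60, 200),
    Request(110, 200),
    Request(900, 500),
]
signals = golden_signals(window, window_seconds=2, cpu_percent=90)
print(signals)
```

Keeping the failed request out of the success latency matters here: the single 900ms failure would otherwise drag the average from 70ms up to over 270ms.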

Logs

A log is a time-stamped record of something that happened. When a user logs in, the application writes a log entry. When a database query runs, the database writes a log. When an error occurs, the error message goes into a log.

What a Log Entry Looks Like

2024-03-15T14:32:07Z INFO  user_id=8821 action=login status=success latency_ms=43
2024-03-15T14:32:09Z ERROR user_id=8821 action=checkout error="payment_timeout" order_id=99021

Each line contains a timestamp, a severity level, and structured fields describing what happened. Structured logs (with key=value format) are much easier to search and analyze than plain text sentences.
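To show why the key=value format is easy to work with, here is a minimal parsing sketch. It assumes every line has the layout shown above: a timestamp, a level, then key=value pairs. The function name parse_log_line is made up for this example.

```python
import shlex


def parse_log_line(line):
    """Split a structured log line into a dictionary of fields."""
    # shlex.split handles repeated spaces and strips quotes around values.
    timestamp, level, *pairs = shlex.split(line)
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {"timestamp": timestamp, "level": level, **fields}


line = (
    '2024-03-15T14:32:09Z ERROR user_id=8821 action=checkout '
    'error="payment_timeout" order_id=99021'
)
entry = parse_log_line(line)
print(entry["level"], entry["error"], entry["order_id"])
```

Once each line is a dictionary, a question such as "show every ERROR for user 8821" becomes a simple filter on two fields.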

Log Severity Levels

LEVEL            MEANING                                    EXAMPLE
----------------------------------------------------------------------------------------------------------
DEBUG            Detailed internal state, for developers    Variable values during calculation
INFO             Normal operations                          User logged in, file uploaded
WARN             Something unusual, not broken yet          Cache miss rate above threshold
ERROR            Something failed                           Database query returned no result unexpectedly
FATAL/CRITICAL   System cannot continue                     Cannot connect to database at all
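Severity levels are mostly used as a filter. The sketch below uses Python's standard logging module, which names these levels DEBUG, INFO, WARNING, ERROR, and CRITICAL. The logger name "checkout" and the messages are made up for the example.

```python
import io
import logging

# Capture output in memory so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.propagate = False  # keep the records out of the root logger

# Only WARNING and above are recorded; DEBUG and INFO are dropped.
logger.setLevel(logging.WARNING)

logger.debug("cart_total=41.97")
logger.info("user logged in")
logger.warning("cache miss rate above threshold")
logger.error("payment_timeout")
logger.critical("cannot connect to database")

emitted = stream.getvalue().splitlines()
print(emitted)
```

Production systems often run at INFO or WARNING to keep log volume manageable and switch to DEBUG only while investigating a problem.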

Traces

A trace follows a single request as it travels through multiple services. Modern applications often consist of many small services (microservices). A single user action — like clicking "Buy Now" — might touch ten different services: authentication, inventory, payment, shipping, notification, and more.

How a Trace Works

User clicks "Buy Now"
      |
      v
[API Gateway]          Span A: 8ms
      |
      v
[Auth Service]         Span B: 12ms
      |
      v
[Inventory Service]    Span C: 180ms  ← SLOW: most of the latency is here
      |
      v
[Payment Service]      Span D: 55ms
      |
      v
[Notification Service] Span E: 10ms
      |
      v
Total request time: 265ms

Each step in this chain is called a span. All spans belonging to the same user request share a single trace ID. By examining the spans, the SRE team sees exactly which service added the most latency and where to focus the fix.
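The diagram above can be expressed as data. This sketch assumes the spans run one after another and do not overlap, as drawn; real tracing systems also record start times and parent spans. The Span type and the trace ID value are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: int


TRACE_ID = "trace-42"  # hypothetical ID shared by every span in the request

spans = [
    Span(TRACE_ID, "API Gateway", 8),
    Span(TRACE_ID, "Auth Service", 12),
    Span(TRACE_ID, "Inventory Service", 180),
    Span(TRACE_ID, "Payment Service", 55),
    Span(TRACE_ID, "Notification Service", 10),
]

# With sequential spans, the total request time is the sum of the durations.
total_ms = sum(s.duration_ms for s in spans)

# The span with the longest duration shows where to look first.
slowest = max(spans, key=lambda s: s.duration_ms)
share = slowest.duration_ms / total_ms

print(f"{slowest.service}: {slowest.duration_ms}ms of {total_ms}ms ({share:.0%})")
```

The inventory span accounts for roughly two thirds of the total, which is why it stands out as soon as the spans are laid side by side.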

Without Tracing

Without traces, an SRE sees only the total request time of 265ms. Finding which service caused it requires guessing and checking logs from five different services manually. With traces, the slow span is visible immediately.

How the Three Pillars Work Together

STEP 1: Metrics alert fires
        "95th percentile latency exceeded 500ms for the last 5 minutes."

STEP 2: Logs reveal the context
        Searching logs shows errors from the inventory service starting at 2:14 PM.

STEP 3: Traces identify the exact cause
        Traces show inventory lookups taking 800ms on requests that call
        the product catalog database.

CONCLUSION: The product catalog database is slow. Run a query analysis.
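The alert in Step 1 can be sketched as a percentile check. This example uses the nearest-rank method, which is one of several ways to define a percentile, and it leaves out the "for the last 5 minutes" condition. The function names and the sample latencies are assumptions made for the illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    if not values:
        raise ValueError("no measurements in the window")
    ordered = sorted(values)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(latencies_ms, threshold_ms=500, pct=95):
    """True when the chosen percentile is above the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Twenty hypothetical requests: one slow outlier versus two.
one_slow = [100] * 19 + [900]
two_slow = [100] * 18 + [800, 900]

print(should_alert(one_slow))  # the single outlier sits above the 95th percentile
print(should_alert(two_slow))  # now the 95th percentile itself is slow
```

Alerting on a percentile instead of the maximum keeps one unlucky request from paging anyone, while still firing when a meaningful share of users is affected.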

Common Monitoring Tools

  • Metrics: Prometheus, Datadog, Amazon CloudWatch, InfluxDB
  • Logs: ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Grafana Loki, Google Cloud Logging
  • Traces: Jaeger, Zipkin, Grafana Tempo, AWS X-Ray, Datadog APM

Key Points

  • Metrics answer "how much" — track trends and trigger alerts.
  • Logs answer "what happened" — record events for investigation.
  • Traces answer "why did this request take so long" — follow a request across multiple services.
  • The Four Golden Signals (latency, traffic, errors, saturation) cover most production problems.
  • All three pillars together give a complete picture; each one alone leaves blind spots.
