SRE Monitoring: Metrics, Logs, and Traces
A pilot flies a plane using dozens of instruments — speed, altitude, fuel level, engine temperature. Each instrument answers a different question. Taken together, they give a complete picture of what the aircraft is doing right now. SRE teams instrument their software the same way, using three primary tools: metrics, logs, and traces.
The Three Pillars of Observability
Observability is the ability to understand what is happening inside a system by looking at its outputs. The three pillars — metrics, logs, and traces — each provide a different type of output.
| Question | Pillar to use |
|---|---|
| How is the system performing? | Metrics |
| What happened at 3:47 PM? | Logs |
| Why did this request take so long? | Traces |
Metrics
A metric is a number measured over time. It answers: how much, how fast, how often?
Types of Metrics
- Counter: A number that only goes up while the process runs and starts again from zero after a restart. Example: total number of HTTP requests received since startup.
- Gauge: A number that goes up and down. Example: current memory usage in megabytes.
- Histogram: Distributes measurements into buckets. Example: how many requests completed in 0-100ms, 100-200ms, 200-500ms, and over 500ms.
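The three metric types can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any monitoring library: the class names, the `requests_total`, `memory_mb`, and `latency_ms` variables, and the sample durations are all assumptions made for the example. The bucket bounds mirror the histogram example above.

```python
from bisect import bisect_left
from typing import Sequence


class Counter:
    """A value that only increases while the process is running."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """A value that can move up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the last bucket is open-ended."""

    def __init__(self, upper_bounds_ms: Sequence[float]) -> None:
        self.upper_bounds_ms = sorted(upper_bounds_ms)
        # One count per upper bound, plus one for values above the last bound.
        self.counts = [0] * (len(self.upper_bounds_ms) + 1)

    def observe(self, value_ms: float) -> None:
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.upper_bounds_ms, value_ms)] += 1


requests_total = Counter()
memory_mb = Gauge()
latency_ms = Histogram([100, 200, 500])  # buckets: <=100, <=200, <=500, >500

for duration in [42, 150, 180, 320, 900]:  # hypothetical request durations
    requests_total.inc()
    latency_ms.observe(duration)
memory_mb.set(512.0)

print(requests_total.value, memory_mb.value, latency_ms.counts)
```

In practice a client library for a metrics system would provide these types; the sketch only shows how their behavior differs.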
The Four Golden Signals
Google's Site Reliability Engineering book recommends monitoring four key metrics for any user-facing service. These are called the Four Golden Signals:
1. LATENCY — How long do requests take?
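Track the latency of successful and failed requests separately: errors that fail fast would otherwise make the overall latency look healthier than it is.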
2. TRAFFIC — How much demand is the system receiving?
Example: requests per second, active users.
3. ERRORS — What fraction of requests are failing?
Example: HTTP 500 rate, failed payment rate.
4. SATURATION — How full is the system?
Example: CPU at 90%, disk at 95% full.
Watch these four signals first. They cover the majority of problems users experience.
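As a rough sketch, the four signals can be derived from a window of request records. Everything named here is an assumption for illustration: the `Request` record, the choice to count HTTP status 500 and above as failures, mean latency as the latency measure, and CPU usage as the saturation measure.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one time window of requests as the four golden signals."""
    ok = [r.latency_ms for r in requests if r.status < 500]
    failed = [r.latency_ms for r in requests if r.status >= 500]
    return {
        # Latency is reported separately for successes and failures.
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_failed_ms": mean(failed) if failed else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical two-second window: three successes and one server error.
sample = [Request(40, 200), Request(60, 200), Request(50, 200), Request(900, 500)]
signals = golden_signals(sample, window_seconds=2, cpu_used=3.6, cpu_capacity=4.0)
print(signals)
```

Real systems usually track latency as percentiles rather than a mean, since a mean hides the slow tail.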
Logs
A log is a time-stamped record of something that happened. When a user logs in, the application writes a log entry. When a database query runs, the database writes a log. When an error occurs, the error message goes into a log.
What a Log Entry Looks Like
2024-03-15T14:32:07Z INFO user_id=8821 action=login status=success latency_ms=43
2024-03-15T14:32:09Z ERROR user_id=8821 action=checkout error="payment_timeout" order_id=99021
Each line contains a timestamp, a severity level, and structured fields describing what happened. Structured logs (with key=value format) are much easier to search and analyze than plain text sentences.
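To show why the key=value format is easy to search, here is a minimal parsing sketch. It assumes every entry has the shape shown above (timestamp, level, then key=value pairs); the `parse_log_line` helper is hypothetical, and `shlex.split` from the standard library is used only to handle quoted values.

```python
import shlex


def parse_log_line(line):
    """Split one 'timestamp LEVEL key=value ...' entry into a dictionary."""
    timestamp, level, *pairs = shlex.split(line)  # shlex strips the quotes
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {"timestamp": timestamp, "level": level, **fields}


lines = [
    '2024-03-15T14:32:07Z INFO user_id=8821 action=login status=success latency_ms=43',
    '2024-03-15T14:32:09Z ERROR user_id=8821 action=checkout error="payment_timeout" order_id=99021',
]

entries = [parse_log_line(line) for line in lines]

# Searching becomes a field comparison instead of a text match.
errors = [e for e in entries if e["level"] == "ERROR"]
print(errors[0]["action"], errors[0]["error"])
```

All parsed values are strings; a real pipeline would convert fields such as `latency_ms` to numbers before aggregating them.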
Log Severity Levels
| Level | Meaning | Example |
|---|---|---|
| DEBUG | Detailed internal state — for developers | Variable values during calculation |
| INFO | Normal operations | User logged in, file uploaded |
| WARN | Something unusual — not broken yet | Cache miss rate above threshold |
| ERROR | Something failed | Database query returned no result unexpectedly |
| FATAL/CRITICAL | System cannot continue | Cannot connect to database at all |
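Severity levels also act as a filter: a logger configured with a minimum level drops everything below it. The sketch below uses Python's standard `logging` module, which names these levels DEBUG, INFO, WARNING, ERROR, and CRITICAL. The logger name and messages are made up for the example, and output goes to an in-memory buffer so the result can be inspected.

```python
import io
import logging

stream = io.StringIO()  # in-memory destination instead of a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("checkout_example")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep output out of the root logger
logger.setLevel(logging.WARNING)  # drop DEBUG and INFO entries

logger.debug("cart_total=59.90")
logger.info("user logged in")
logger.warning("cache miss rate above threshold")
logger.error("database query returned no result")
logger.critical("cannot connect to database")

emitted = stream.getvalue().splitlines()
print(emitted)
```

Only the last three messages reach the buffer. Lowering the threshold to `logging.DEBUG` would keep all five, which is useful during development but noisy in production.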
Traces
A trace follows a single request as it travels through multiple services. Modern applications often consist of many small services (microservices). A single user action — like clicking "Buy Now" — might touch ten different services: authentication, inventory, payment, shipping, notification, and more.
How a Trace Works
User clicks "Buy Now"
|
v
[API Gateway] Span A: 8ms
|
v
[Auth Service] Span B: 12ms
|
v
[Inventory Service] Span C: 180ms ← SLOW — root cause found
|
v
[Payment Service] Span D: 55ms
|
v
[Notification Service] Span E: 10ms
|
v
Total request time: 265ms
Each step in this chain is called a span. All spans belonging to the same user request share a single trace ID. By examining the spans, the SRE team sees exactly which service added the most latency and where to focus the fix.
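The diagram above can be represented as data. This is a simplified sketch: the `Span` record and the trace ID value are hypothetical, and it assumes the spans run one after another with no nesting or overlap, as in the diagram. Real tracing systems also record start times and parent-child relationships.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: int


TRACE_ID = "trace-0001"  # hypothetical ID shared by every span of the request

spans = [
    Span(TRACE_ID, "API Gateway", 8),
    Span(TRACE_ID, "Auth Service", 12),
    Span(TRACE_ID, "Inventory Service", 180),
    Span(TRACE_ID, "Payment Service", 55),
    Span(TRACE_ID, "Notification Service", 10),
]

# With sequential spans, the total request time is the sum of the durations.
total_ms = sum(s.duration_ms for s in spans)
slowest = max(spans, key=lambda s: s.duration_ms)
share = slowest.duration_ms / total_ms

print(f"{slowest.service}: {slowest.duration_ms}ms of {total_ms}ms ({share:.0%})")
```

Here the Inventory Service accounts for roughly two thirds of the 265ms total, which is what makes it the first place to look.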
Without Tracing
Without traces, an SRE sees only the total request time of 265ms. Finding which service caused it requires guessing and checking logs from five different services manually. With traces, the slow span is visible immediately.
How the Three Pillars Work Together
STEP 1: Metrics alert fires
"95th percentile latency exceeded 500ms for the last 5 minutes."
STEP 2: Logs reveal the context
Searching logs shows errors from the inventory service starting at 2:14 PM.
STEP 3: Traces identify the exact cause
Traces show inventory lookups taking 800ms on requests that call
the product catalog database.
CONCLUSION: The product catalog database is slow. Run a query analysis.
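The alert in Step 1 depends on a percentile calculation. The sketch below shows one way it might work, using the nearest-rank method for the 95th percentile. The helper names, the sample windows, and the nearest-rank method itself are assumptions; monitoring systems typically estimate percentiles from histogram buckets instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alert(latencies_ms, threshold_ms=500, pct=95):
    """Return True when the chosen percentile exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical 5-minute windows of request latencies in milliseconds.
healthy_window = [120] * 20
degraded_window = [120] * 18 + [800, 820]

print(latency_alert(healthy_window), latency_alert(degraded_window))
```

In the degraded window only two of twenty requests are slow, yet the alert fires. An average over the same window would be 189ms and would stay well under the 500ms threshold.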
Common Monitoring Tools
- Metrics: Prometheus, Datadog, Amazon CloudWatch, InfluxDB
- Logs: Elasticsearch + Logstash + Kibana (ELK stack), Splunk, Grafana Loki, Google Cloud Logging
- Traces: Jaeger, Zipkin, Grafana Tempo, AWS X-Ray, Datadog APM
Key Points
- Metrics answer "how much" — track trends and trigger alerts.
- Logs answer "what happened" — record events for investigation.
- Traces answer "why did this request take so long" — follow a request across multiple services.
- The Four Golden Signals (latency, traffic, errors, saturation) cover most production problems.
- All three pillars together give a complete picture; each one alone leaves blind spots.
