SRE Dashboards and Observability

An air traffic controller does not walk to each plane and ask how it is doing. They sit in front of a screen that shows every plane's position, altitude, speed, and fuel — all at once. SRE dashboards do the same for software systems: they put the right information in front of the right person at the right time.

What Is an Observability Platform

An observability platform is a set of tools that collects, stores, and displays data about a running system. It connects your metrics, logs, and traces into a single interface so engineers can investigate problems without jumping between unrelated tools.

[Your Services] → emit → [Metrics, Logs, Traces]
                              |
                              v
                    [Observability Platform]
                              |
                              v
                    [Dashboards + Alerts + Search]
                              |
                              v
                    [SRE Engineer investigates here]
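
One common way to let a platform join the three signal types is to stamp each of them with the same trace ID. The sketch below is illustrative only: `handle_checkout`, the in-memory `sink` list, and every field name are hypothetical stand-ins, not any vendor's API.

```python
import time
import uuid


def handle_checkout(sink):
    """Handle one hypothetical request and emit a trace span, a log line, and a metric."""
    trace_id = uuid.uuid4().hex          # shared key that links the three signals
    start = time.monotonic()
    # ... real request handling would happen here ...
    duration_ms = (time.monotonic() - start) * 1000

    sink.append({"signal": "trace", "trace_id": trace_id,
                 "span": "checkout", "duration_ms": duration_ms})
    sink.append({"signal": "log", "trace_id": trace_id,
                 "level": "INFO", "message": "checkout completed"})
    sink.append({"signal": "metric", "name": "checkout_requests_total",
                 "value": 1, "exemplar_trace_id": trace_id})
    return trace_id


events = []                              # stands in for the platform's ingest endpoint
handle_checkout(events)
for event in events:
    print(event["signal"], event.get("trace_id") or event.get("exemplar_trace_id"))
```

Because all three records carry the same ID, an engineer who spots a spike in the metric can jump to the matching trace and log lines instead of searching each tool separately.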

Types of Dashboards

Not all dashboards serve the same purpose. A good observability setup has multiple layers of dashboards aimed at different audiences and different questions.

1. Service Health Dashboard (Top-Level)

This is the homepage for a service's reliability. It shows the Four Golden Signals — latency, traffic, errors, and saturation — for the entire service over the last hour, day, and week. Any SRE can open this dashboard and immediately know whether the service is healthy.

Service Health Dashboard — Checkout Service
--------------------------------------------
[ Error Rate:    0.03% ✅ ]   [ Latency p95: 280ms ✅ ]
[ Request Rate:  4,200/s ✅ ]  [ CPU Saturation: 61% ✅ ]

SLO Status: 99.97% over 30 days (Target: 99.9%) | Budget remaining: 30 of 43 min
Last incident: 9 days ago
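
The four tiles above can be derived from raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record, that only HTTP 5xx responses count as errors, and that CPU utilization arrives from a separate host agent as a fraction between 0 and 1. The p95 uses the nearest-rank method; real platforms often estimate percentiles from histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "error_rate_pct": 100 * errors / total if total else 0.0,
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95) if total else None,
        "request_rate_per_s": total / window_seconds,
        "cpu_saturation_pct": 100 * cpu_utilization,
    }


# Synthetic sample: 100 requests in a 10-second window, three of them failing.
sample = [Request(float(i), 500 if i <= 3 else 200) for i in range(1, 101)]
print(golden_signals(sample, window_seconds=10, cpu_utilization=0.61))
```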

2. Component Deep-Dive Dashboard

When a problem appears on the top-level dashboard, engineers drill into component-specific dashboards. A database dashboard shows query times, connection counts, cache hit rates, and replication lag. A payment service dashboard shows per-provider success rates, retry counts, and timeout rates.
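
As an example of a component-level panel, per-provider success rates can be computed from individual payment attempts. This is a sketch under stated assumptions: the `(provider, succeeded)` tuple shape and the provider names are hypothetical.

```python
from collections import defaultdict


def success_rate_by_provider(attempts):
    """attempts: iterable of (provider, succeeded) pairs. Returns percent success per provider."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for provider, succeeded in attempts:
        totals[provider] += 1
        if succeeded:
            successes[provider] += 1
    return {p: 100 * successes[p] / totals[p] for p in totals}


attempts = [
    ("provider_a", True), ("provider_a", True), ("provider_a", False), ("provider_a", True),
    ("provider_b", True), ("provider_b", True),
]
print(success_rate_by_provider(attempts))
```

Splitting the rate by provider matters because a failure confined to one provider can be hidden inside a healthy-looking overall average.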

3. Capacity Dashboard

Capacity dashboards track resource usage over time — weeks and months. They help SREs predict when a service will run out of resources before it actually happens. The goal is to never be surprised by capacity exhaustion.

Capacity Trend: Database Storage
---------------------------------
60 days ago: 1.2 TB used (48%)
30 days ago: 1.6 TB used (64%)
Today:       2.0 TB used (80%)
Projected:   Full in ~37 days at current growth rate (0.4 TB per 30 days)
Action:      Provision additional storage this week
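
The projection is a straight-line extrapolation. The figures in the panel imply a total capacity of 2.5 TB (2.0 TB is 80%), growth of 0.4 TB per 30 days, and 0.5 TB of headroom. A minimal sketch, assuming linear growth and a hypothetical `Sample` record:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    days_ago: int    # how long ago the reading was taken
    used_tb: float   # storage in use at that time


def days_until_full(samples, capacity_tb):
    """Straight-line projection from the oldest to the newest sample."""
    oldest = max(samples, key=lambda s: s.days_ago)
    newest = min(samples, key=lambda s: s.days_ago)
    elapsed_days = oldest.days_ago - newest.days_ago
    if elapsed_days == 0:
        return None  # a single point in time gives no growth rate
    growth_per_day = (newest.used_tb - oldest.used_tb) / elapsed_days
    if growth_per_day <= 0:
        return None  # usage is flat or shrinking, so there is no exhaustion date
    return (capacity_tb - newest.used_tb) / growth_per_day


samples = [Sample(60, 1.2), Sample(30, 1.6), Sample(0, 2.0)]
print(round(days_until_full(samples, capacity_tb=2.5), 1))  # 37.5
```

A linear fit is the simplest assumption. If growth is accelerating, the real exhaustion date arrives sooner, so it is worth treating the result as an upper bound.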

4. Business Metrics Dashboard

SRE dashboards do not stop at technical metrics. Business metrics such as orders per minute, successful checkouts, and active subscriptions tend to move with technical health. A sudden drop in orders per minute often reveals a system problem before any technical alert fires.
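
A drop check of this kind can be sketched as a comparison between recent order volume and a baseline. The function name and the 50% threshold are assumptions chosen for illustration; a production check would usually also account for time of day and day of week.

```python
from statistics import mean


def orders_dropped(recent_per_min, baseline_per_min, max_drop_fraction=0.5):
    """True when recent order volume falls more than max_drop_fraction below the baseline."""
    baseline = mean(baseline_per_min)
    if baseline == 0:
        return False  # nothing to compare against
    recent = mean(recent_per_min)
    return (baseline - recent) / baseline > max_drop_fraction


baseline = [100, 110, 90]   # orders per minute during a comparable earlier period
print(orders_dropped([40, 35, 30], baseline))   # True: volume is down 65%
print(orders_dropped([95, 105], baseline))      # False: volume is at baseline
```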

Dashboard Design Principles

Show Context, Not Just Numbers

A number with no context is hard to interpret. A good dashboard shows the current value, the target, the trend over time, and comparison to a similar period (like last week at the same time). This lets an engineer immediately see whether 500ms is fast or slow for this service right now.
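
To illustrate, the same 500 ms reading can be rendered with its target and a week-over-week comparison. This is a small sketch; the `describe` helper and its label format are hypothetical.

```python
def describe(current_ms, target_ms, week_ago_ms):
    """Render a latency reading together with its target and week-over-week change."""
    change_pct = 100 * (current_ms - week_ago_ms) / week_ago_ms
    status = "within target" if current_ms <= target_ms else "over target"
    return (f"{current_ms:.0f} ms ({status}, target {target_ms:.0f} ms, "
            f"{change_pct:+.0f}% vs last week)")


print(describe(500, 400, 250))  # 500 ms (over target, target 400 ms, +100% vs last week)
print(describe(500, 800, 520))  # 500 ms (within target, target 800 ms, -4% vs last week)
```

The raw value is identical in both lines, but the first signals a regression and the second a slight improvement.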

Design for the On-Call Engineer at 3 AM

The on-call engineer is tired, stressed, and needs answers fast. Dashboards should put the most important information at the top. Labels should be clear. Colors should follow a standard — green for healthy, yellow for warning, red for critical. Ambiguous or cluttered dashboards slow down incident response.
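
A consistent color standard is easiest to keep when thresholds live in one place. The sketch below assumes metrics where a higher value is worse; the threshold values and metric names are hypothetical and would be tuned per service.

```python
from enum import Enum


class Status(Enum):
    GREEN = "healthy"
    YELLOW = "warning"
    RED = "critical"


# Hypothetical thresholds: metric name -> (warning at, critical at).
THRESHOLDS = {
    "cpu_saturation_pct": (75, 90),
    "error_rate_pct": (0.1, 1.0),
}


def classify(metric, value):
    """Map a higher-is-worse metric value to a dashboard status color."""
    warn_at, crit_at = THRESHOLDS[metric]
    if value >= crit_at:
        return Status.RED
    if value >= warn_at:
        return Status.YELLOW
    return Status.GREEN


print(classify("cpu_saturation_pct", 61).value)  # healthy
```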

One Dashboard Per Job

Resist the temptation to put every possible metric on one massive dashboard. A dashboard that tries to show everything effectively shows nothing. Keep each dashboard focused on one job: top-level health, database performance, API gateway status, and so on.

Common Observability Platforms

Platform                 Strengths                               Common Use Case
-----------------------  --------------------------------------  ------------------------------------------
Grafana + Prometheus     Open source, highly flexible            Self-hosted metrics and dashboards
Datadog                  All-in-one: metrics, logs, traces, APM  Teams wanting a single commercial platform
New Relic                Application performance monitoring      Application-centric observability
AWS CloudWatch           Native AWS integration                  Teams running on AWS infrastructure
Google Cloud Monitoring  Native GCP integration, managed         Teams running on GCP infrastructure
Honeycomb                High-cardinality event analysis         Complex microservice debugging

Building an SLO Dashboard

An SLO dashboard shows exactly how close the service is to its reliability target and how much error budget remains. It is the natural reference point during incidents and reliability reviews.

SLO Dashboard: User Login Service
----------------------------------
SLO Target:          99.95% success rate over 30 days
Current 30-day SLI:  99.986%
Status:              HEALTHY ✅

Error Budget Allowed: 21.6 minutes / 30 days
Budget Consumed:       6.2 minutes (28.7%)
Budget Remaining:     15.4 minutes (71.3%)

Burn Rate (1-hour):   0.8x  ← normal pace
Burn Rate (6-hour):   1.1x  ← slightly elevated, watch closely
Burn Rate (30-day):   0.3x  ← healthy (matches the 28.7% of budget consumed)
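
The budget figures follow from simple arithmetic: a 99.95% target over 30 days allows 0.05% of 43,200 minutes, which is 21.6 minutes. Burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0x means the budget would be used up exactly at the end of the window. The function names below are illustrative, and the 0.04% error rate is an assumed input chosen to reproduce the 0.8x reading.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed_pct(consumed_minutes, slo_target, window_days):
    """Share of the error budget already spent, as a percentage."""
    return 100 * consumed_minutes / error_budget_minutes(slo_target, window_days)


def burn_rate(observed_error_rate, slo_target):
    """Observed error rate relative to the error rate the SLO allows."""
    return observed_error_rate / (1 - slo_target)


print(f"budget: {error_budget_minutes(0.9995, 30):.1f} min")        # budget: 21.6 min
print(f"consumed: {budget_consumed_pct(6.2, 0.9995, 30):.1f}%")     # consumed: 28.7%
print(f"1-hour burn rate: {burn_rate(0.0004, 0.9995):.1f}x")        # 1-hour burn rate: 0.8x
```

Showing burn rate over several windows helps separate a short spike from a sustained problem: a high 1-hour rate with a normal 6-hour rate is usually a brief blip, while both elevated suggests ongoing budget loss.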

Key Points

  • Observability platforms unify metrics, logs, and traces in one place.
  • Layer dashboards from top-level health to component deep dives.
  • Design dashboards for the tired, stressed on-call engineer — clarity above all else.
  • Business metrics dashboards often catch problems before technical alerts do.
  • Every service should have a visible SLO dashboard showing budget health at a glance.
