SRE: SLOs, SLAs, and SLIs
A pizza restaurant promises delivery in 30 minutes or the pizza is free. That promise has a measurement (delivery time), a target (30 minutes), and a consequence (free pizza). SRE teams make the same kind of promise for software services — using SLIs, SLOs, and SLAs.
The Three Terms and How They Relate
SLI = The measurement (What are we measuring?)
SLO = The target (How good should it be?)
SLA = The agreement (What happens if we miss it?)
These three work as a stack. You measure something (SLI), set a goal for it (SLO), and agree on consequences if you miss that goal consistently (SLA).
Service Level Indicators — SLIs
An SLI is a specific number that describes how a service is behaving right now. It answers the question: how well is the system performing on this particular dimension?
Common SLIs
- Availability: What percentage of requests succeed? Example: 99.95% of requests returned a successful response in the last 30 days.
- Latency: How fast do requests finish? Example: 95% of API calls completed in under 200 milliseconds.
- Error Rate: What fraction of requests return an error? Example: 0.1% of requests returned a 500 error.
- Throughput: How many requests does the system handle per second? Example: 5,000 requests per second during peak hours.
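As a minimal sketch of how the latency example above could be computed, the following Python takes a list of request durations and reports the share that finished under a threshold. The function name `latency_sli` and the sample durations are hypothetical, chosen only for illustration.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Return the percentage of requests that finished under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Hypothetical sample: 20 request durations in milliseconds.
sample = [120, 95, 180, 150, 210, 130, 90, 170, 160, 110,
          140, 105, 199, 125, 135, 115, 145, 155, 165, 175]

# Only the 210 ms request misses the 200 ms threshold, so 19 of 20 are fast.
print(f"{latency_sli(sample, 200):.1f}% of requests finished in under 200 ms")
```

In a real system the durations would come from request logs or a metrics pipeline rather than a hard-coded list.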
Choosing the Right SLI
Not every number worth measuring makes a good SLI. A good SLI directly reflects the user's experience. CPU utilization on a single server is useful internally, but it does not directly tell you whether users are getting a fast, correct response. Request success rate does.
Bad SLI: Server CPU stays below 70%
(Internal metric — users may still have a bad experience)
Good SLI: 99% of search queries return results in under 500ms
(Directly measures what users experience)
Service Level Objectives — SLOs
An SLO is a target value for an SLI over a set time window. It answers: how good does this measurement need to be?
SLOs turn vague goals like "the service should be fast" into concrete, measurable commitments like "99% of requests must complete in under 300 milliseconds over any rolling 30-day window."
Writing a Good SLO
Format: [What percentage] of [SLI] must be [at or above/below] [threshold]
over [time window].
Example: 99.5% of homepage requests must return HTTP 200
in under 1 second over any rolling 28-day period.
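One possible way to capture this format in code is a small record holding the three parts of an SLO: what is measured, the target percentage, and the time window. The class `SLO` and its field names below are assumptions made for this sketch, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One SLO: a target percentage for an SLI over a time window."""
    sli_name: str          # what is measured
    target_percent: float  # e.g. 99.5
    window_days: int       # e.g. 28 for a rolling 28-day period

    def is_met(self, measured_percent: float) -> bool:
        """True when the measured SLI is at or above the target."""
        return measured_percent >= self.target_percent

    def describe(self) -> str:
        """Render the SLO in the sentence format shown above."""
        return (f"{self.target_percent}% of {self.sli_name} "
                f"over any rolling {self.window_days}-day period")


# The homepage example from the text, expressed as data.
homepage = SLO(
    sli_name="homepage requests returning HTTP 200 in under 1 second",
    target_percent=99.5,
    window_days=28,
)
print(homepage.describe())
print(homepage.is_met(99.7))  # True
print(homepage.is_met(99.2))  # False
```

Keeping the SLO as data rather than prose makes it easier to check the same definition in dashboards and alerts.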
Why Not Set SLOs at 100 Percent
100 percent reliability is not achievable in practice. Networks fail. Hardware dies. Software has bugs. With a 100 percent SLO, a single failed request counts as a violation. It also leaves no room for planned deployments, experiments, or any change at all.
A realistic SLO — say 99.9 percent — tells the team: the system should be excellent, but a small amount of imperfection is acceptable and expected.
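To make the size of that "small amount of imperfection" concrete, here is a rough sketch that converts an availability target into allowed downtime. It assumes availability is counted in minutes of uptime over a 30-day window, which is a simplification; the function name is hypothetical.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime an availability SLO tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# Compare a few targets; a 100% target leaves zero minutes for any failure.
for target in (99.0, 99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

Under these assumptions a 99.9 percent target allows roughly 43 minutes of downtime per 30 days, while a 100 percent target allows none.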
Service Level Agreements — SLAs
An SLA is a formal contract between a service provider and a customer. It defines what happens, usually financial penalties or service credits, when the provider fails to deliver the promised service level over the agreed period.
SLA vs SLO: An Important Difference
| | SLO | SLA |
|---|---|---|
| Who it is for | Internal engineering team | Customers and business stakeholders |
| What it does | Sets the engineering goal | Sets the business commitment |
| Consequence of missing it | Engineering review; error budget burn | Financial penalty; customer credits |
| Typical strictness | Stricter (internal buffer) | Looser (allows engineering headroom) |
A company usually sets its internal SLO tighter than its external SLA. If the SLA promises 99.9 percent availability, the internal SLO might target 99.95 percent, which gives the team a buffer to detect and fix problems before the SLA is breached.
Internal SLO target: 99.95% availability
External SLA promise: 99.90% availability
Buffer: 0.05% (room to recover before penalty)
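The buffer can be read as three zones. The sketch below, with a hypothetical `classify` function and the 99.95% and 99.90% thresholds from the example as defaults, shows where a measured availability falls.

```python
def classify(measured_percent, slo_percent=99.95, sla_percent=99.90):
    """Place a measured availability relative to the SLO and the SLA."""
    if measured_percent >= slo_percent:
        return "meeting SLO"
    if measured_percent >= sla_percent:
        # Inside the buffer: an internal miss, but no customer penalty yet.
        return "SLO missed, still within SLA"
    return "SLA breached"


# Hypothetical readings, one per zone.
for reading in (99.97, 99.92, 99.85):
    print(reading, "->", classify(reading))
```

The middle zone is the one the buffer exists for: the team is alerted to an SLO miss while there is still time to recover before penalties apply.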
A Complete Example: Video Streaming Service
Imagine a video streaming platform. Here is how SLI, SLO, and SLA work together for the video playback feature:
SLI: Percentage of video play requests that start within 3 seconds
Current reading: 98.7%
SLO: 99% of video play requests must start within 3 seconds
over any rolling 30-day period.
Status: Currently missing the target (98.7% is below 99%).
SLA: The company promises customers 99% video start reliability.
If the monthly average falls below 99%, customers receive
a 10% service credit.
Status: At risk if SLO is not restored quickly.
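The comparison table earlier lists "error budget burn" as a consequence of missing an SLO. As a rough illustration, the sketch below treats the error budget as the share of bad events the SLO allows (100% minus the target) and computes how much of it the current reading has used. The function name and this definition are assumptions for the sketch.

```python
def error_budget_consumed(measured_percent, slo_percent):
    """Fraction of the error budget used; above 1.0 means the SLO is missed."""
    budget = 100.0 - slo_percent       # bad events the SLO allows, in percent
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    actual = 100.0 - measured_percent  # bad events observed, in percent
    return actual / budget


# Video playback example: reading of 98.7% against a 99% SLO.
used = error_budget_consumed(measured_percent=98.7, slo_percent=99.0)
print(f"Error budget consumed: {used:.0%}")
```

With these numbers the result is about 130%: the service has produced roughly 1.3 times as many slow starts as the SLO permits.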
Measuring SLIs With Good and Bad Events
Many SLIs are calculated by counting "good" events divided by total events. A good event is one that meets your quality bar. A bad event fails to meet it.
Availability SLI = (Good Requests / Total Requests) x 100
Example:
Total requests in last 30 days: 1,000,000
Requests that succeeded: 999,200
Requests that failed: 800
SLI = (999,200 / 1,000,000) x 100 = 99.92%
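The same calculation can be written as a short Python function. This is a minimal sketch; the name `availability_sli` is hypothetical, and the counts are the ones from the example above.

```python
def availability_sli(good_requests, total_requests):
    """Return good events as a percentage of total events."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return 100.0 * good_requests / total_requests


total = 1_000_000
failed = 800
good = total - failed  # 999,200 successful requests

sli = availability_sli(good, total)
print(f"Availability SLI: {sli:.2f}%")  # 99.92%
```

The hard part in practice is deciding what counts as a "good" event; once that is fixed, the arithmetic is as simple as shown here.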
Key Points
- SLI measures the current state of a service from the user's point of view.
- SLO sets the engineering target for that measurement.
- SLA is the business contract with consequences for sustained SLO violations.
- Internal SLOs should be stricter than external SLAs to create a safety buffer.
- 100 percent targets are counterproductive — realistic targets enable healthy engineering decisions.
