SRE SLOs, SLAs, and SLIs

A pizza restaurant promises delivery in 30 minutes or the pizza is free. That promise has a measurement (delivery time), a target (30 minutes), and a consequence (free pizza). SRE teams make the same kind of promise for software services — using SLIs, SLOs, and SLAs.

The Three Terms and How They Relate

SLI  =  The measurement   (What are we measuring?)
SLO  =  The target        (How good should it be?)
SLA  =  The agreement     (What happens if we miss it?)

These three work as a stack. You measure something (SLI), set a goal for it (SLO), and agree on consequences if you miss that goal consistently (SLA).

Service Level Indicators — SLIs

An SLI is a specific number that describes how well a service is behaving, measured over a defined period. It answers the question: how well is the system performing on this particular dimension?

Common SLIs

  • Availability: What percentage of requests succeed? Example: 99.95% of requests returned a successful response in the last 30 days.
  • Latency: How fast do requests finish? Example: 95% of API calls completed in under 200 milliseconds.
  • Error Rate: What fraction of requests return an error? Example: 0.1% of requests returned an HTTP 500 error.
  • Throughput: How many requests does the system handle per second? Example: 5,000 requests per second during peak hours.
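
The four readings above can all be derived from the same raw request log. The sketch below is a minimal illustration, not a production implementation: the `Request` record, the `common_slis` function, the rule that any status below 500 counts as a success, and the 200 ms default threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code
    latency_ms: float  # time to complete, in milliseconds


def common_slis(requests, window_seconds, latency_threshold_ms=200.0):
    """Return the four SLI readings listed above for one time window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window; SLIs are undefined")
    # Assumption: any status below 500 counts as a successful response.
    succeeded = sum(1 for r in requests if r.status < 500)
    # Count requests that finished faster than the latency threshold.
    fast = sum(1 for r in requests if r.latency_ms < latency_threshold_ms)
    return {
        "availability_pct": 100.0 * succeeded / total,
        "latency_under_threshold_pct": 100.0 * fast / total,
        "error_rate_pct": 100.0 * (total - succeeded) / total,
        "throughput_rps": total / window_seconds,
    }


# Four illustrative requests observed over a 2-second window (made-up data).
sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(500, 950.0),
    Request(200, 240.0),
]
readings = common_slis(sample, window_seconds=2)
print(readings)
```

With this sample, three of four requests succeed and two of four finish under 200 ms, so availability is 75%, the latency reading is 50%, the error rate is 25%, and throughput is 2 requests per second.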

Choosing the Right SLI

Not every number worth measuring makes a good SLI. A good SLI directly reflects the user's experience. CPU utilization on a single server is useful internally, but it does not directly tell you whether users are getting a fast, correct response. Request success rate does.

Bad SLI:  Server CPU utilization
          (Internal metric: users may still have a bad experience)

Good SLI: Percentage of search queries that return results in under 500 ms
          (Directly measures what users experience)

Service Level Objectives — SLOs

An SLO is a target value for an SLI over a set time window. It answers: how good does this measurement need to be?

SLOs turn vague goals like "the service should be fast" into concrete, measurable commitments like "99% of requests must complete in under 300 milliseconds over any rolling 30-day window."

Writing a Good SLO

Format: [What percentage] of [SLI] must be [at or above/below] [threshold]
        over [time window].

Example: 99.5% of homepage requests must return HTTP 200
         in under 1 second over any rolling 28-day period.
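
One way to make this format concrete is to store an SLO as data and compare a measured SLI against it. This is a sketch under stated assumptions: the `SLO` class, its field names, and the `is_met` method are hypothetical, and the readings passed in are illustrative rather than real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI over a time window (hypothetical structure)."""
    description: str    # what counts as a good event
    target_pct: float   # e.g. 99.5 means 99.5% of events must be good
    window_days: int    # length of the rolling window

    def is_met(self, measured_pct: float) -> bool:
        """True when the measured SLI is at or above the target."""
        return measured_pct >= self.target_pct


# The homepage example above, expressed as data.
homepage_slo = SLO(
    description="homepage requests return HTTP 200 in under 1 second",
    target_pct=99.5,
    window_days=28,
)

# Illustrative readings, not real data.
print(homepage_slo.is_met(99.7))  # True
print(homepage_slo.is_met(99.2))  # False
```

Keeping the threshold, percentage, and window together in one record makes it harder to quote a target without its time window.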

Why Not Set SLOs at 100 Percent

100 percent reliability is not achievable in practice. Networks fail. Hardware dies. Software has bugs. With a 100 percent SLO, a single failed request puts the service in violation. It also leaves no room for planned deployments, experiments, or any change at all.

A realistic SLO — say 99.9 percent — tells the team: the system should be excellent, but a small amount of imperfection is acceptable and expected.
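
The amount of acceptable imperfection can be expressed as a count of requests that are allowed to fail, often called the error budget. The sketch below assumes a request-based SLO and a hypothetical `error_budget` helper; the figure of 1,000,000 requests is illustrative.

```python
from fractions import Fraction


def error_budget(slo_target_pct: str, total_requests: int) -> int:
    """Number of requests allowed to fail without missing the SLO.

    The target is passed as a string (e.g. "99.9") so it converts to an
    exact fraction and avoids floating-point rounding surprises.
    """
    allowed_failure_fraction = 1 - Fraction(slo_target_pct) / 100
    return int(allowed_failure_fraction * total_requests)  # rounds down


print(error_budget("99.9", 1_000_000))  # 1000
print(error_budget("100", 1_000_000))   # 0
```

At 99.9 percent the team can absorb 1,000 failed requests per million; at 100 percent the budget is zero, which is why any change becomes a risk.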

Service Level Agreements — SLAs

An SLA is a formal contract between a service provider and a customer. It defines what happens (usually financial penalties or service credits) when the provider fails to deliver the agreed service level.

SLA vs SLO: An Important Difference

                            SLO                                     SLA
Who it is for               Internal engineering team               Customers and business stakeholders
What it does                Sets the engineering goal               Sets the business commitment
Consequence of missing it   Engineering review; error budget burn   Financial penalty; customer credits
Typical strictness          Stricter (internal buffer)              Looser (allows engineering headroom)

A company usually sets its internal SLO tighter than its external SLA. If the SLA promises 99.9 percent availability, the internal SLO might target 99.95 percent — giving the team a buffer before the SLA breach triggers.

Internal SLO target:  99.95% availability
External SLA promise: 99.90% availability
Buffer:                0.05% — room to recover before penalty
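
To get a feel for the size of that buffer, the percentages can be converted into minutes of allowed downtime. This sketch assumes availability is measured as uptime over a 30-day window, which the text above does not specify; the `allowed_downtime_minutes` helper is hypothetical.

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: str, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target.

    Assumption: availability is time-based and the window is 30 days
    unless stated otherwise.
    """
    unavailable_fraction = 1 - Fraction(availability_pct) / 100
    return float(unavailable_fraction * window_days * MINUTES_PER_DAY)


slo_minutes = allowed_downtime_minutes("99.95")  # 21.6
sla_minutes = allowed_downtime_minutes("99.90")  # 43.2
buffer_minutes = sla_minutes - slo_minutes

print(f"SLO allows {slo_minutes:.1f} min, SLA allows {sla_minutes:.1f} min")
print(f"Buffer before SLA breach: {buffer_minutes:.1f} min")
```

Under these assumptions, the 0.05-point gap gives the team roughly 21.6 extra minutes of downtime per 30 days between missing the internal target and owing a penalty.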

A Complete Example: Video Streaming Service

Imagine a video streaming platform. Here is how SLI, SLO, and SLA work together for the video playback feature:

SLI:  Percentage of video play requests that start within 3 seconds
      Current reading: 98.7%

SLO:  99% of video play requests must start within 3 seconds
      over any rolling 30-day period.
      Status: Currently missing the target.

SLA:  The company promises customers 99% video start reliability.
      If the monthly average falls below 99%, customers receive
      a 10% service credit.
      Status: At risk. The SLA threshold here equals the SLO, so there is
      no buffer; credits are owed if the monthly average stays below 99%.

Measuring SLIs With Good and Bad Events

Many SLIs are calculated by counting "good" events divided by total events. A good event is one that meets your quality bar. A bad event fails to meet it.

Availability SLI = (Good Requests / Total Requests) x 100

Example:
Total requests in last 30 days:  1,000,000
Requests that succeeded:           999,200
Requests that failed:                  800

SLI = (999,200 / 1,000,000) x 100 = 99.92%
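
The same calculation can be written as a small function. This is a minimal sketch; the name `availability_sli` and the input checks are assumptions added for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI as a percentage: good events divided by total events."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests * 100


# Numbers from the example above.
total = 1_000_000
failed = 800
good = total - failed  # 999,200

print(f"{availability_sli(good, total):.2f}%")  # 99.92%
```

Rejecting a zero total matters in practice: a window with no traffic has no defined SLI, and silently reporting 0% or 100% would be misleading.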

Key Points

  • SLI measures the current state of a service from the user's point of view.
  • SLO sets the engineering target for that measurement.
  • SLA is the business contract with consequences for sustained SLO violations.
  • Internal SLOs should be stricter than external SLAs to create a safety buffer.
  • 100 percent targets are counterproductive — realistic targets enable healthy engineering decisions.
