SRE SLOs, SLAs, and SLIs

A pizza restaurant promises delivery in 30 minutes or the pizza is free. That promise has a measurement (delivery time), a target (30 minutes), and a consequence (free pizza). SRE teams make the same kind of promise for software services — using SLIs, SLOs, and SLAs.

The Three Terms and How They Relate

SLI  =  The measurement   (What are we measuring?)
SLO  =  The target        (How good should it be?)
SLA  =  The agreement     (What happens if we miss it?)

These three work as a stack. You measure something (SLI), set a goal for it (SLO), and agree on consequences if you miss that goal consistently (SLA).

Service Level Indicators — SLIs

An SLI is a specific, quantitative measure of how a service is behaving, usually computed over a recent time window. It answers the question: how well is the system performing on this particular dimension?

Common SLIs

Availability: What percentage of requests succeed? Example: 99.95% of requests returned a successful response in the last 30 days.
Latency: How fast do requests finish? Example: 95% of API calls completed in under 200 milliseconds.
Error Rate: What fraction of requests return an error? Example: 0.1% of requests returned an HTTP 500 error.
Throughput: How many requests does the system handle per second? Example: 5,000 requests per second during peak hours.
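The four SLIs above can be sketched from a list of request records. This is a minimal illustration, not a production implementation: the Request type and the compute_slis function are hypothetical names, and treating any status below 500 as a success is an assumption that a real service would need to decide for itself.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time to complete the request, in milliseconds


def compute_slis(requests, window_seconds, latency_threshold_ms=200.0):
    """Compute the four common SLIs for one measurement window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    # Assumption: any status below 500 counts as a successful response.
    succeeded = sum(1 for r in requests if r.status < 500)
    fast = sum(1 for r in requests if r.latency_ms < latency_threshold_ms)
    return {
        "availability_pct": 100.0 * succeeded / total,
        "latency_pct_under_threshold": 100.0 * fast / total,
        "error_rate_pct": 100.0 * (total - succeeded) / total,
        "throughput_rps": total / window_seconds,
    }


# Four made-up requests observed over a 2-second window.
sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(500, 950.0),
    Request(200, 240.0),
]
slis = compute_slis(sample, window_seconds=2)
print(slis)
```

In this sample, three of four requests succeed and two of four finish in under 200 ms, so availability is 75%, the latency SLI is 50%, the error rate is 25%, and throughput is 2 requests per second.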

Choosing the Right SLI

Not every number worth measuring makes a good SLI. A good SLI directly reflects the user's experience. CPU utilization on a single server is useful internally, but it does not directly tell you whether users are getting a fast, correct response. Request success rate does.

Bad SLI:  Server CPU stays below 70%
          (Internal metric — users may still have a bad experience)

Good SLI: 99% of search queries return results in under 500ms
          (Directly measures what users experience)

Service Level Objectives — SLOs

An SLO is a target value for an SLI over a set time window. It answers: how good does this measurement need to be?

SLOs turn vague goals like "the service should be fast" into concrete, measurable commitments like "99% of requests must complete in under 300 milliseconds over any rolling 30-day window."

Writing a Good SLO

Format: [What percentage] of [SLI] must be [at or above/below] [threshold]
        over [time window].

Example: 99.5% of homepage requests must return HTTP 200
         in under 1 second over any rolling 28-day period.
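The format above maps naturally onto a small data structure. The sketch below is one possible encoding, assuming good and total event counts for the window are already available; the SLO class and its is_met method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    description: str   # what counts as a good event
    target_pct: float  # required percentage of good events
    window_days: int   # length of the rolling window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True if the measured SLI is at or above the target."""
        if total_events == 0:
            # Assumption: a window with no traffic counts as meeting the target.
            return True
        return 100.0 * good_events / total_events >= self.target_pct


# The homepage example from the text, expressed as data.
homepage_slo = SLO(
    description="homepage requests return HTTP 200 in under 1 second",
    target_pct=99.5,
    window_days=28,
)

print(homepage_slo.is_met(good_events=9_960, total_events=10_000))  # 99.6% measured
print(homepage_slo.is_met(good_events=9_940, total_events=10_000))  # 99.4% measured
```

The first call reports the SLO as met and the second as missed, because 99.6% is above the 99.5% target and 99.4% is below it.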

Why Not Set SLOs at 100 Percent

100 percent reliability is not achievable in practice. Networks fail. Hardware dies. Software has bugs. With a 100 percent SLO, a single failed request counts as a violation. It also leaves no room for planned deployments, experiments, or any change at all.

A realistic SLO — say 99.9 percent — tells the team: the system should be excellent, but a small amount of imperfection is acceptable and expected.
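To make "a small amount of imperfection" concrete, the sketch below converts an SLO percentage into the downtime it tolerates. It assumes the SLO is read as a time-based availability target over a 30-day window; error_budget_minutes is a hypothetical helper name.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_pct) / 100.0


for target in (99.0, 99.9, 99.99, 100.0):
    print(f"{target}% -> {error_budget_minutes(target):.2f} minutes per 30 days")
```

Under these assumptions a 99.9 percent target allows about 43.2 minutes of downtime in 30 days, while a 100 percent target allows none at all.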

Service Level Agreements — SLAs

An SLA is a formal contract between a service provider and a customer. It defines what happens, usually financial penalties or service credits, when the provider consistently fails to deliver the promised service level.

SLA vs SLO: An Important Difference

Aspect                     SLO                                    SLA
Who it is for              Internal engineering team              Customers and business stakeholders
What it does               Sets the engineering goal              Sets the business commitment
Consequence of missing it  Engineering review; error budget burn  Financial penalty; customer credits
Typical strictness         Stricter (internal buffer)             Looser (allows engineering headroom)

A company usually sets its internal SLO tighter than its external SLA. If the SLA promises 99.9 percent availability, the internal SLO might target 99.95 percent, giving the team a buffer before an SLA breach occurs.

Internal SLO target:  99.95% availability
External SLA promise: 99.90% availability
Buffer:                0.05 percentage points (room to recover before a penalty)
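The buffer is easier to reason about as a count of requests. The sketch below assumes a volume of 1,000,000 requests in the window, which is an illustrative figure rather than part of the SLO or SLA; allowed_failures is a hypothetical helper, and exact fractions are used to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def allowed_failures(target_pct: str, total_requests: int) -> int:
    """Number of failed requests a target tolerates, computed exactly."""
    failure_fraction = 1 - Fraction(target_pct) / 100
    return math.floor(failure_fraction * total_requests)


total = 1_000_000  # assumed request volume for the window
slo_failures = allowed_failures("99.95", total)  # internal SLO
sla_failures = allowed_failures("99.90", total)  # external SLA
buffer_failures = sla_failures - slo_failures

print(slo_failures, sla_failures, buffer_failures)  # 500 1000 500
```

With this volume, the team misses its internal SLO after 500 failed requests but has another 500 before the SLA is breached.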

A Complete Example: Video Streaming Service

Imagine a video streaming platform. Here is how SLI, SLO, and SLA work together for the video playback feature:

SLI:  Percentage of video play requests that start within 3 seconds
      Current reading: 98.7%

SLO:  99% of video play requests must start within 3 seconds
      over any rolling 30-day period.
      Status: Currently missing the target.

SLA:  The company promises customers 99% video start reliability.
      If the monthly average falls below 99%, customers receive
      a 10% service credit.
      Status: At risk. The SLA threshold here equals the SLO, so there
      is no buffer; the credit applies if the month ends below 99%.

Measuring SLIs With Good and Bad Events

Many SLIs are calculated by dividing the number of "good" events by the total number of events. A good event is one that meets your quality bar. A bad event fails to meet it.

Availability SLI = (Good Requests / Total Requests) x 100

Example:
Total requests in last 30 days:  1,000,000
Requests that succeeded:           999,200
Requests that failed:                  800

SLI = (999,200 / 1,000,000) x 100 = 99.92%
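The same calculation as a small function, using the numbers from the example above. The name availability_sli is hypothetical, and raising an error on an empty window is one possible choice rather than a standard.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI as a percentage of good requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return 100 * good_requests / total_requests


sli = availability_sli(good_requests=999_200, total_requests=1_000_000)
print(f"{sli:.2f}%")  # 99.92%
```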

Key Points

SLI measures the current state of a service from the user's point of view.
SLO sets the engineering target for that measurement.
SLA is the business contract with consequences for sustained SLO violations.
Internal SLOs should be stricter than external SLAs to create a safety buffer.
100 percent targets are counterproductive — realistic targets enable healthy engineering decisions.
