SRE Error Budgets

A bank gives you a credit limit. You can spend up to that limit. When you hit it, no more spending until next month. Error budgets work the same way — they give engineering teams a defined amount of unreliability they can "spend" before they must stop and fix things.

What Is an Error Budget

An error budget is the amount of downtime, errors, or slowness your system can accumulate within a specific time period while still meeting its service level objective (SLO).

If your SLO says the service must be available 99.9 percent of the time over a 30-day month, then it may be unavailable for the remaining 0.1 percent of that month. That 0.1 percent is your error budget.

Calculating an Error Budget

SLO Target:         99.9% availability
Allowed failure:    100% - 99.9% = 0.1%

Minutes in 30 days: 30 x 24 x 60 = 43,200 minutes
Error budget:       43,200 x 0.001 = 43.2 minutes of downtime allowed per month

As long as cumulative downtime stays under 43.2 minutes, the team has not violated the SLO. Those 43.2 minutes are the budget to spend.
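
The same arithmetic can be sketched in a few lines of Python. The function name and parameters below are illustrative assumptions, not part of any standard library or tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo_percent is the availability target, e.g. 99.9.
    window_days is the length of the SLO window (30 days assumed by default).
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60          # 43,200 for 30 days
    allowed_failure = (100 - slo_percent) / 100     # 0.001 for 99.9%
    return window_minutes * allowed_failure


print(round(error_budget_minutes(99.9), 1))   # 43.2
print(round(error_budget_minutes(99.99), 2))  # 4.32
```

Note how adding one more nine to the target cuts the budget by a factor of ten.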

How Teams Spend the Error Budget

An error budget is meant to be spent, not hoarded. Teams spend it deliberately on activities that improve the product but carry some risk:

  • Deploying new features that might introduce short instabilities
  • Running experiments or A/B tests on the production system
  • Migrating to new infrastructure with brief cutover windows
  • Conducting chaos engineering experiments (intentional fault injection)

Error Budget as a Resource:

[43.2 min] FULL BUDGET at start of month
     |
     v
[-5 min]   Deployment caused a 5-minute slowdown
[-3 min]   Config change caused a brief outage
[-8 min]   Dependency failure caused errors
     |
     v
[27.2 min] Remaining budget — still safe to experiment
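
One way to keep this kind of running tally is a small ledger that records each budget-consuming event. This is a minimal sketch; the `BudgetLedger` class and its method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Tracks how much of an error budget has been spent (in minutes)."""
    total_minutes: float
    events: list = field(default_factory=list)  # (minutes, reason) tuples

    def spend(self, minutes: float, reason: str) -> None:
        """Record an event that consumed part of the budget."""
        self.events.append((minutes, reason))

    @property
    def remaining_minutes(self) -> float:
        """Budget left after subtracting every recorded event."""
        return self.total_minutes - sum(m for m, _ in self.events)


ledger = BudgetLedger(total_minutes=43.2)
ledger.spend(5, "Deployment caused a 5-minute slowdown")
ledger.spend(3, "Config change caused a brief outage")
ledger.spend(8, "Dependency failure caused errors")

print(round(ledger.remaining_minutes, 1))  # 27.2
```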

What Happens When the Budget Runs Out

When the error budget is fully consumed before the month ends, the team switches into reliability mode. Feature releases stop. The team focuses entirely on fixing the stability problems that consumed the budget.

Budget Status         Team Action
----------------------------------------------
Budget healthy        Normal release pace — ship features freely
Budget below 50%      Extra care on deployments; review risky changes
Budget exhausted      Freeze new features; focus only on reliability work
Budget consistently   Escalate to leadership; fundamental architecture
running out           changes needed
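
The table above can be expressed as a simple policy function. This is only a sketch: the function name, the `past_exhausted_windows` input, and the threshold of three consecutive exhausted windows for "consistently running out" are assumptions, since the table does not define that term precisely.

```python
def budget_action(remaining_minutes: float,
                  total_minutes: float,
                  past_exhausted_windows: int = 0,
                  chronic_threshold: int = 3) -> str:
    """Map the current budget status to the team action from the table.

    past_exhausted_windows: how many immediately preceding windows also
    ended with the budget exhausted (hypothetical input).
    chronic_threshold: assumed number of consecutive exhausted windows
    that counts as "consistently running out".
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    fraction_left = remaining_minutes / total_minutes

    if fraction_left <= 0:
        # Count the current window together with the previous ones.
        if past_exhausted_windows + 1 >= chronic_threshold:
            return "Escalate to leadership; fundamental architecture changes needed"
        return "Freeze new features; focus only on reliability work"
    if fraction_left < 0.5:
        return "Extra care on deployments; review risky changes"
    return "Normal release pace"


print(budget_action(27.2, 43.2))  # Normal release pace
print(budget_action(10.0, 43.2))  # Extra care on deployments; review risky changes
```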

Why Error Budgets Solve a Classic Conflict

Developers want to ship new features quickly. Operations teams want systems to stay stable. These goals pull in opposite directions and cause conflict.

Error budgets resolve this conflict by making the trade-off explicit and shared. Both teams agree upfront: here is how much instability is acceptable. When the budget is healthy, developers can move fast. When it runs low, stability work takes priority. Neither side can unilaterally override the data.

Without Error Budget:
Developer: "We need to ship this today."
Ops Team:  "It's too risky — we just had an incident."
Result:    Argument. Political decision. No shared data.

With Error Budget:
Developer: "We need to ship this today."
SRE:       "We have 30 minutes of budget left this month.
            This deployment has a 20-minute risk window.
            Decision: proceed with rollback plan ready."
Result:    Data-driven decision. Both teams aligned.

Multi-Window Error Budgets

Many teams track error budgets across multiple time windows simultaneously. This gives a more complete picture of system health.

Window    Budget (at 99.9%)    Purpose
------------------------------------------------------------
1-hour    ~3.6 seconds         Detect fast-burning incidents
6-hour    ~21.6 seconds        Spot deployment-related problems
1-day     ~86.4 seconds        Track daily operational quality
30-day    ~43.2 minutes        Monthly SLO compliance view

If the 1-hour budget burns quickly but the 30-day budget looks fine, a recent deployment probably caused a short spike. If the 30-day budget is shrinking steadily, a chronic underlying problem needs investigation.
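
The per-window budgets follow directly from the SLO target: each is the window length multiplied by the allowed failure fraction. The sketch below reproduces the figures for the four windows listed above; the names `WINDOWS_SECONDS` and `window_budgets_seconds` are illustrative assumptions.

```python
# Window lengths in seconds for the four windows discussed above.
WINDOWS_SECONDS = {
    "1-hour": 60 * 60,
    "6-hour": 6 * 60 * 60,
    "1-day": 24 * 60 * 60,
    "30-day": 30 * 24 * 60 * 60,
}


def window_budgets_seconds(slo_percent: float) -> dict:
    """Return the error budget in seconds for each tracked window."""
    allowed_failure = (100 - slo_percent) / 100
    return {name: secs * allowed_failure
            for name, secs in WINDOWS_SECONDS.items()}


budgets = window_budgets_seconds(99.9)
for name, seconds in budgets.items():
    print(f"{name:>6}: {seconds:8.1f} s")

# The 30-day figure, converted to minutes, matches the monthly budget.
print(round(budgets["30-day"] / 60, 1))  # 43.2
```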

Error Budget Burn Rate

Burn rate tells you how fast you are consuming the budget relative to the pace that would use it up exactly at the end of the SLO window. A burn rate of 1 means you are consuming budget at exactly the pace your SLO allows. A burn rate of 2 means you are using it twice as fast, so you will run out in half the time.

SLO: 99.9% over 30 days
Normal burn rate = 1 (budget lasts exactly 30 days)

If burn rate = 5:
Budget will exhaust in: 30 / 5 = 6 days
Action needed: Investigate immediately
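
A burn rate can be computed as the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a constant burn rate and a full budget at the start of the window; the function names and the `observed_error_ratio` input are hypothetical.

```python
def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """Observed failure fraction divided by the fraction the SLO allows."""
    allowed_failure = (100 - slo_percent) / 100
    if allowed_failure <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return observed_error_ratio / allowed_failure


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget runs out if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate


# 0.5% of requests failing against a 99.9% SLO is a burn rate of about 5.
rate = burn_rate(0.005, 99.9)
print(round(rate, 2))                      # 5.0
print(round(days_to_exhaustion(rate), 2))  # 6.0
```

In practice the budget is rarely full when a fast burn starts, so the real time to exhaustion is usually shorter than this estimate.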

Key Points

  • Error budget = the amount of imperfection your system is allowed to have and still meet the SLO.
  • Healthy budget means teams can move fast; exhausted budget means freeze and fix.
  • Error budgets replace political arguments with shared data.
  • Multi-window tracking and burn rate give early warning before the monthly budget runs out.
