DevOps Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to IT operations. Google invented SRE to solve the challenge of keeping large-scale systems reliable while still deploying new features rapidly. SRE and DevOps share the same goals — faster delivery, higher reliability — but approach them from complementary angles.

DevOps focuses on culture and process. SRE provides concrete practices, metrics, and organizational models to make reliability measurable and achievable.

DevOps vs SRE

Aspect        | DevOps                           | SRE
--------------|----------------------------------|--------------------------------------
Origin        | Cultural movement                | Engineering practice (from Google)
Focus         | Collaboration and delivery speed | Reliability and scalability at scale
Approach      | Prescriptive culture             | Software engineering for ops
Key metric    | Deployment frequency             | Error budget consumption
Compatibility | Complementary                    | Complementary

Core SRE Concepts

Service Level Indicators (SLIs)

An SLI is a quantitative measure of some aspect of service behavior. It is the actual number measured in the real world. Common SLIs:

  • Availability: Percentage of successful requests over total requests.
  • Latency: Percentage of requests served within a defined time threshold.
  • Error Rate: Percentage of requests that return an error.
  • Throughput: Number of requests processed per second.
  • Durability: Probability that stored data is retained (for storage services).

# Availability SLI (Prometheus PromQL)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency SLI - percentage of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m])) * 100
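
Outside Prometheus, the same two ratios can be computed directly from raw request data. A minimal sketch in Python (the counter values and durations are hypothetical):

```python
def availability_sli(total_requests: int, error_requests: int) -> float:
    """Percentage of successful requests over total requests."""
    if total_requests == 0:
        return 100.0  # no traffic: treat as fully available
    return (total_requests - error_requests) / total_requests * 100


def latency_sli(durations_s: list[float], threshold_s: float = 0.2) -> float:
    """Percentage of requests served within the latency threshold."""
    if not durations_s:
        return 100.0
    fast = sum(1 for d in durations_s if d <= threshold_s)
    return fast / len(durations_s) * 100


print(availability_sli(10_000, 7))            # 99.93
print(latency_sli([0.05, 0.12, 0.31, 0.18]))  # 75.0
```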

Service Level Objectives (SLOs)

An SLO is the target value for an SLI. It defines "good enough" reliability. SLOs are internal team goals.

Examples:

  • Availability SLO: 99.9% of requests succeed each month.
  • Latency SLO: 95% of requests complete within 200ms.
  • Error rate SLO: Error rate stays below 0.1% over a rolling 30-day window.
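
Every availability SLO implies a maximum amount of downtime per measurement window. A quick conversion sketch:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted per window at the given availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} min per 30 days")
# 99%    -> 432.00 min (7.2 hours)
# 99.9%  ->  43.20 min
# 99.99% ->   4.32 min
```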

Service Level Agreements (SLAs)

An SLA is a formal contract with customers that includes consequences (refunds, penalties) if the service falls below a defined level. SLOs are typically set tighter than SLAs to provide a safety margin.

Error Budgets

The error budget is the acceptable amount of unreliability permitted within the SLO period. If the SLO is 99.9% availability, the error budget is 0.1%, equivalent to 43.2 minutes of downtime per 30-day month.

Error budgets change how teams think about risk:

  • Lots of budget remaining? Ship new features quickly. Risk is acceptable.
  • Budget nearly exhausted? Slow down deployments. Focus on reliability work.
  • Budget exceeded? Freeze new releases until reliability improves.

This creates a data-driven negotiation between developers (who want to ship) and SREs (who maintain reliability). Both teams agree to the same objective number.

Error Budget Calculation

SLO Target: 99.9% availability
Measurement Window: 30 days (43,200 minutes)

Error Budget = (1 - 0.999) × 43,200 = 43.2 minutes

If 20 minutes of downtime occurred this month:
Remaining Budget = 43.2 - 20 = 23.2 minutes
Budget Consumed = 46%
Status: Healthy — deployments can continue normally

If 50 minutes of downtime occurred this month:
Budget Consumed = 116% — SLO BREACHED
Status: Feature freeze until reliability is restored
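
The worked example above can be sketched as a small helper. The 75% warning threshold is illustrative, not a standard:

```python
def budget_status(slo_percent: float, downtime_min: float,
                  window_days: int = 30) -> tuple[float, str]:
    """Return (percent of error budget consumed, deployment guidance)."""
    budget_min = (1 - slo_percent / 100) * window_days * 24 * 60
    consumed = downtime_min / budget_min * 100
    if consumed > 100:
        status = "SLO breached: feature freeze"
    elif consumed > 75:
        status = "Budget nearly exhausted: slow down"
    else:
        status = "Healthy: deploy normally"
    return consumed, status


print(budget_status(99.9, 20))  # ~46% consumed, healthy
print(budget_status(99.9, 50))  # ~116% consumed, feature freeze
```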

Toil Reduction

Toil is manual, repetitive, operational work that doesn't provide lasting value. Restarting a service manually every Tuesday is toil. Re-creating cloud resources from scratch because there's no IaC is toil.

The SRE principle: if toil consumes more than 50% of an SRE team's time, it must be reduced through automation. Common toil-elimination work:

  • Automating repetitive deployment steps with scripts or pipelines.
  • Replacing manual server restarts with auto-healing Kubernetes configurations.
  • Automating alert response (runbooks as code with PagerDuty or Opsgenie).
  • Replacing manual capacity planning with auto-scaling policies.
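
The first two bullets, scripted restarts and auto-healing, boil down to a probe-and-restart loop. A toy sketch with injectable check and restart callables (real systems delegate this to a supervisor such as systemd or Kubernetes):

```python
from typing import Callable


def auto_heal(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3) -> int:
    """Replace a manual restart runbook: probe, restart on failure,
    give up after a cap so a crash loop pages a human instead.
    Returns the number of restarts performed."""
    restarts = 0
    while not is_healthy() and restarts < max_restarts:
        restart()
        restarts += 1
    return restarts
```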

Incident Management

Incidents happen. SRE defines a structured process to handle them efficiently and learn from them.

Incident Severity Levels

Severity         | Definition                                  | Example
-----------------|---------------------------------------------|----------------------------------
SEV-1 (Critical) | Complete service outage, major data loss    | Production database down
SEV-2 (High)     | Key feature broken, significant user impact | Payment processing failing
SEV-3 (Medium)   | Degraded performance, partial impact        | Image uploads slow but working
SEV-4 (Low)      | Minor issue, workaround exists              | Non-critical report page is slow

Incident Response Steps

  1. Detect: Alert fires (Prometheus/PagerDuty) or user reports an issue.
  2. Triage: Assess severity. Assign an incident commander.
  3. Mitigate: Restore service as quickly as possible — rollback, feature flag off, failover.
  4. Investigate: Find the root cause once the service is restored.
  5. Communicate: Keep stakeholders updated throughout. Post status page updates.
  6. Resolve: Confirm full recovery. Close the incident.
  7. Post-mortem: Write a blameless post-mortem to prevent recurrence.
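
The steps above can be modeled as explicit lifecycle states so tooling can enforce the order (communication happens continuously, so it is not a state here; all names are illustrative):

```python
from dataclasses import dataclass

# Ordered lifecycle matching the response steps above
STAGES = ["detected", "triaged", "mitigated", "investigated",
          "resolved", "postmortem_written"]


@dataclass
class Incident:
    title: str
    severity: str          # e.g. "SEV-1"
    stage: str = "detected"

    def advance(self) -> str:
        """Move to the next stage; refuse to skip or reopen steps."""
        i = STAGES.index(self.stage)
        if i == len(STAGES) - 1:
            raise ValueError("incident already closed out")
        self.stage = STAGES[i + 1]
        return self.stage


inc = Incident("DB primary down", "SEV-1")
inc.advance()  # "triaged"
inc.advance()  # "mitigated"
```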

Blameless Post-Mortems

A post-mortem is a document written after an incident that explains what happened, why it happened, and what will change to prevent it from happening again. The critical word is blameless — people must feel safe to report mistakes honestly without fear of punishment.

Post-Mortem Structure

  • Incident Summary: What happened, when, and for how long.
  • Timeline: Sequence of events from detection to resolution.
  • Root Cause Analysis: The underlying reason, not just the surface symptom.
  • Impact: How many users affected, business impact, SLO status.
  • Action Items: Specific tasks to prevent recurrence, with owners and due dates.
  • Lessons Learned: What the team learned about the system and the process.

On-Call Best Practices

  • Every alert must be actionable — non-actionable alerts cause alert fatigue and get ignored.
  • On-call schedules must be fair and sustainable — no one person is always on call.
  • Runbooks (step-by-step response guides) must exist for every recurring alert type.
  • On-call engineers track toil — recurring manual work gets a ticket to be automated.
  • Reduce alert noise relentlessly — if an alert fires constantly without action, fix the root cause or remove the alert.

Capacity Planning and Performance Engineering

SREs plan for growth ahead of time. Capacity planning involves:

  • Measuring current resource utilization trends.
  • Projecting future demand based on business forecasts.
  • Load testing systems to find breaking points before users do.
  • Defining auto-scaling policies so systems grow automatically with demand.
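
Projecting demand can start as simply as extrapolating a utilization trend against a planning ceiling. A rough sketch assuming linear growth (real forecasts account for seasonality and launches):

```python
def months_until_capacity(current_pct: float, growth_pct_per_month: float,
                          ceiling_pct: float = 80.0) -> float:
    """Months until utilization hits the planning ceiling, assuming linear growth."""
    if growth_pct_per_month <= 0:
        return float("inf")  # flat or shrinking usage never hits the ceiling
    return max(0.0, (ceiling_pct - current_pct) / growth_pct_per_month)


# CPU at 50%, growing 5 points/month: 6 months until the 80% ceiling
print(months_until_capacity(50.0, 5.0))  # 6.0
```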

Summary

  • SRE applies software engineering to operations — automation, measurement, and scalable processes.
  • SLIs measure what matters. SLOs define targets. SLAs are customer-facing contracts.
  • Error budgets balance the speed of feature delivery with the need for reliability.
  • Toil is manual work that must be reduced through automation — not accepted as normal.
  • Blameless post-mortems turn incidents into learning opportunities without blame.
  • On-call practices must be sustainable, actionable, and constantly improved.
