DevOps: Site Reliability Engineering
Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to IT operations. Google invented SRE to solve the challenge of keeping large-scale systems reliable while still deploying new features rapidly. SRE and DevOps share the same goals — faster delivery, higher reliability — but approach them from complementary angles.
DevOps focuses on culture and process. SRE provides concrete practices, metrics, and organizational models to make reliability measurable and achievable.
DevOps vs SRE
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Cultural movement | Engineering practice (from Google) |
| Focus | Collaboration and delivery speed | Reliability and scalability at scale |
| Approach | Cultural philosophy and practices | Prescriptive software engineering for ops |
| Key metric | Deployment frequency | Error budget consumption |
| Compatibility | Complementary | Complementary |
Core SRE Concepts
Service Level Indicators (SLIs)
An SLI is a quantitative measure of some aspect of service behavior. It is the actual number measured in the real world. Common SLIs:
- Availability: Percentage of successful requests over total requests.
- Latency: Percentage of requests served within a defined time threshold.
- Error Rate: Percentage of requests that return an error.
- Throughput: Number of requests processed per second.
- Durability: Probability that stored data is retained (for storage services).
```promql
# Availability SLI (Prometheus PromQL)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency SLI - percentage of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m])) * 100
```
Service Level Objectives (SLOs)
An SLO is the target value for an SLI. It defines "good enough" reliability. SLOs are internal team goals.
Examples:
- Availability SLO: 99.9% of requests succeed each month.
- Latency SLO: 95% of requests complete within 200ms.
- Error rate SLO: Error rate stays below 0.1% over a rolling 30-day window.
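An SLO check reduces to comparing a measured SLI against its target. A minimal Python sketch (the request counts are hypothetical):

```python
def slo_met(good_events: int, total_events: int, target: float) -> bool:
    """Return True if the measured SLI meets or exceeds the SLO target."""
    if total_events == 0:
        return True  # no traffic in the window: nothing violated the objective
    sli = good_events / total_events
    return sli >= target

# Availability SLO of 99.9% over a window of 1,000,000 requests
print(slo_met(999_500, 1_000_000, 0.999))  # → True  (SLI = 99.95%)
print(slo_met(998_000, 1_000_000, 0.999))  # → False (SLI = 99.80%)
```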
Service Level Agreements (SLAs)
An SLA is a formal contract with customers that includes consequences (refunds, penalties) if the service falls below a defined level. SLOs are typically set tighter than SLAs to provide a safety margin.
Error Budgets
The error budget is the acceptable amount of unreliability permitted within the SLO period. If the SLO is 99.9% availability, the error budget is 0.1%, equivalent to 43.2 minutes of downtime per 30-day month.
Error budgets change how teams think about risk:
- Lots of budget remaining? Ship new features quickly. Risk is acceptable.
- Budget nearly exhausted? Slow down deployments. Focus on reliability work.
- Budget exceeded? Freeze new releases until reliability improves.
This creates a data-driven negotiation between developers (who want to ship) and SREs (who maintain reliability). Both teams agree to the same objective number.
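The budget arithmetic can be sketched in a few lines of Python (mirroring the 99.9% / 30-day worked example that follows):

```python
def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total allowed downtime for the measurement window, in minutes."""
    return (1 - slo) * window_minutes

def budget_consumed(downtime_minutes: float, slo: float, window_minutes: int) -> float:
    """Fraction of the error budget already spent (1.0 = fully spent)."""
    return downtime_minutes / error_budget_minutes(slo, window_minutes)

WINDOW = 30 * 24 * 60  # 30 days = 43,200 minutes
print(error_budget_minutes(0.999, WINDOW))           # → 43.2 (up to float rounding)
print(round(budget_consumed(20, 0.999, WINDOW), 2))  # → 0.46
print(round(budget_consumed(50, 0.999, WINDOW), 2))  # → 1.16 (budget exceeded)
```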
Error Budget Calculation
SLO Target: 99.9% availability
Measurement Window: 30 days (43,200 minutes)
Error Budget = (1 - 0.999) × 43,200 = 43.2 minutes
If 20 minutes of downtime occurred this month:
Remaining Budget = 43.2 - 20 = 23.2 minutes
Budget Consumed = 46%
Status: Healthy — deployments can continue normally
If 50 minutes of downtime occurred this month:
Budget Consumed = 116% — SLO BREACHED
Status: Feature freeze until reliability is restored

Toil Reduction
Toil is manual, repetitive, operational work that doesn't provide lasting value. Restarting a service manually every Tuesday is toil. Re-creating cloud resources from scratch because there's no IaC is toil.
The SRE principle: if toil consumes more than 50% of an SRE team's time, it must be reduced through automation. Common toil-elimination work:
- Automating repetitive deployment steps with scripts or pipelines.
- Replacing manual server restarts with auto-healing Kubernetes configurations.
- Automating alert response (runbooks as code with PagerDuty or Opsgenie).
- Replacing manual capacity planning with auto-scaling policies.
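The 50% rule can be tracked with a small helper. A Python sketch (the task names and hours below are hypothetical):

```python
def toil_fraction(hours: dict) -> float:
    """Fraction of total logged hours classified as toil (tasks tagged 'toil:')."""
    toil = sum(h for task, h in hours.items() if task.startswith("toil:"))
    total = sum(hours.values())
    return toil / total if total else 0.0

week = {
    "toil:manual restarts": 6,
    "toil:ticket triage": 8,
    "engineering:automation work": 10,
    "engineering:capacity review": 4,
}
frac = toil_fraction(week)
print(f"toil share: {frac:.0%}")  # → toil share: 50%
if frac >= 0.5:
    print("At or over the 50% threshold: schedule automation work")
```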
Incident Management
Incidents happen. SRE defines a structured process to handle them efficiently and learn from them.
Incident Severity Levels
| Severity | Definition | Example |
|---|---|---|
| SEV-1 (Critical) | Complete service outage, major data loss | Production database down |
| SEV-2 (High) | Key feature broken, significant user impact | Payment processing failing |
| SEV-3 (Medium) | Degraded performance, partial impact | Image uploads slow but working |
| SEV-4 (Low) | Minor issue, workaround exists | Non-critical report page is slow |
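To keep triage consistent, the severity table can be encoded as a simple decision function. A hypothetical Python sketch (real triage also weighs business context):

```python
def classify(outage: bool, feature_broken: bool, degraded: bool) -> str:
    """Map coarse impact signals to the severity levels in the table above."""
    if outage:
        return "SEV-1"  # complete outage or major data loss
    if feature_broken:
        return "SEV-2"  # key feature broken, significant user impact
    if degraded:
        return "SEV-3"  # degraded performance, partial impact
    return "SEV-4"      # minor issue, workaround exists

print(classify(outage=False, feature_broken=True, degraded=False))  # → SEV-2
```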
Incident Response Steps
- Detect: Alert fires (Prometheus/PagerDuty) or user reports an issue.
- Triage: Assess severity. Assign an incident commander.
- Mitigate: Restore service as quickly as possible — rollback, feature flag off, failover.
- Investigate: Find the root cause once the service is restored.
- Communicate: Keep stakeholders updated throughout. Post status page updates.
- Resolve: Confirm full recovery. Close the incident.
- Post-mortem: Write a blameless post-mortem to prevent recurrence.
Blameless Post-Mortems
A post-mortem is a document written after an incident that explains what happened, why it happened, and what will change to prevent it from happening again. The critical word is blameless — people must feel safe to report mistakes honestly without fear of punishment.
Post-Mortem Structure
- Incident Summary: What happened, when, and for how long.
- Timeline: Sequence of events from detection to resolution.
- Root Cause Analysis: The underlying reason, not just the surface symptom.
- Impact: How many users affected, business impact, SLO status.
- Action Items: Specific tasks to prevent recurrence, with owners and due dates.
- Lessons Learned: What the team learned about the system and the process.
On-Call Best Practices
- Every alert must be actionable — non-actionable alerts cause alert fatigue and get ignored.
- On-call schedules must be fair and sustainable — no one person is always on call.
- Runbooks (step-by-step response guides) must exist for every recurring alert type.
- On-call engineers track toil — recurring manual work gets a ticket to be automated.
- Reduce alert noise relentlessly — if an alert fires constantly without action, fix the root cause or remove the alert.
Capacity Planning and Performance Engineering
SREs plan for growth ahead of time. Capacity planning involves:
- Measuring current resource utilization trends.
- Projecting future demand based on business forecasts.
- Load testing systems to find breaking points before users do.
- Defining auto-scaling policies so systems grow automatically with demand.
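A first-pass demand projection can be as simple as a least-squares line through recent utilization. A Python sketch (the monthly CPU figures are made up; real planning also folds in business forecasts):

```python
def linear_forecast(history: list[float], months_ahead: int) -> float:
    """Extrapolate utilization by fitting a least-squares line to the history."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + months_ahead)

# Average CPU utilization (%) over the last six months (hypothetical)
cpu = [38, 41, 45, 48, 52, 55]
print(round(linear_forecast(cpu, 6), 1))  # → 75.9 (projected 6 months out)
```

A projection like this is what justifies scaling decisions before the load arrives, rather than reacting after saturation.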
Summary
- SRE applies software engineering to operations — automation, measurement, and scalable processes.
- SLIs measure what matters. SLOs define targets. SLAs are customer-facing contracts.
- Error budgets balance the speed of feature delivery with the need for reliability.
- Toil is manual work that must be reduced through automation — not accepted as normal.
- Blameless post-mortems turn incidents into learning opportunities without blame.
- On-call practices must be sustainable, actionable, and constantly improved.
