SRE Incident Response
A hospital's emergency room does not wait until a patient is critical to start planning. It has a triage process, a clear chain of responsibility, and practiced procedures for every situation. Incident response in SRE works the same way: it is a structured process that turns a chaotic outage into a managed, documented recovery.
What Is an Incident
An incident is any event that degrades user experience or puts service reliability at risk. It does not have to be a total outage. A significant slowdown, elevated error rates, or data inconsistencies all qualify as incidents.
Incident Severity Levels
| Severity | Definition | Example |
|---|---|---|
| SEV-1 | Complete outage; all users affected | Login service is down globally |
| SEV-2 | Significant degradation; many users affected | Checkout failing for 30% of users |
| SEV-3 | Minor degradation; some users affected | Image thumbnails not loading in one region |
| SEV-4 | Minimal impact; cosmetic or edge case | Incorrect currency symbol in one locale |
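The table above can be turned into a small triage helper. The following is a minimal sketch in Python, not a standard: the function name `classify` and the cut-off values (95% and 10% of users) are assumptions chosen for illustration, and each team should set its own thresholds.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


def classify(affected_fraction: float, cosmetic_only: bool = False) -> Severity:
    """Map an estimated share of affected users to a severity level.

    The cut-offs below (0.95 and 0.10) are illustrative assumptions.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if cosmetic_only:
        # Cosmetic or edge-case problems stay at the lowest severity.
        return Severity.SEV4
    if affected_fraction >= 0.95:
        return Severity.SEV1  # effectively all users affected
    if affected_fraction >= 0.10:
        return Severity.SEV2  # many users affected
    if affected_fraction > 0.0:
        return Severity.SEV3  # some users affected
    return Severity.SEV4


# Example: checkout failing for 30% of users
print(classify(0.30).value)
```

A fixed rule like this is only a starting point; responders should be free to raise or lower the severity as the real impact becomes clear.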
The Incident Lifecycle
DETECT → DECLARE → RESPOND → MITIGATE → RESOLVE → REVIEW
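One way to picture the lifecycle is as a simple ordered state machine in which an incident only ever moves forward one stage at a time. The sketch below assumes a strictly linear flow; the names `Stage` and `next_stage` are hypothetical, and real incidents sometimes loop back (for example, from mitigation to further investigation).

```python
from enum import Enum


class Stage(Enum):
    DETECT = 1
    DECLARE = 2
    RESPOND = 3
    MITIGATE = 4
    RESOLVE = 5
    REVIEW = 6


def next_stage(current: Stage) -> Stage:
    """Return the stage that follows `current`. REVIEW is the last stage."""
    if current is Stage.REVIEW:
        raise ValueError("REVIEW is the final stage")
    return Stage(current.value + 1)


# Walk the whole lifecycle from the first stage to the last.
stage = Stage.DETECT
path = [stage.name]
while stage is not Stage.REVIEW:
    stage = next_stage(stage)
    path.append(stage.name)
print(" -> ".join(path))
```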
Step 1: Detect
Detection happens when an alert fires, a user reports a problem, or an engineer notices an anomaly on a dashboard. Fast detection requires well-tuned alerts and active monitoring. Every minute of undetected impact is another minute of degraded experience for users.
Step 2: Declare
Someone must officially declare the incident. Declaring it opens an incident ticket or a dedicated communication channel (like a Slack incident channel), assigns an Incident Commander, and starts the response clock. Without a clear declaration, multiple people may investigate separately without coordinating.
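The declaration step can be sketched as a function that creates a single incident record. This is an illustration only: the `Incident` fields, the `declare_incident` function, and the `#inc-<id>` channel naming scheme are assumptions, and no real ticketing or chat API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    incident_id: str
    severity: str
    summary: str
    commander: str         # the assigned Incident Commander
    channel: str           # dedicated communication channel
    declared_at: datetime  # the moment the response clock starts


def declare_incident(
    incident_id: str,
    severity: str,
    summary: str,
    commander: str,
    now: Optional[datetime] = None,
) -> Incident:
    """Create the incident record: channel name, commander, and start time."""
    declared_at = now or datetime.now(timezone.utc)
    channel = f"#inc-{incident_id.lower()}"  # hypothetical naming scheme
    return Incident(incident_id, severity, summary, commander, channel, declared_at)


incident = declare_incident("2024-001", "SEV-2", "Checkout errors", "sarah_sre")
print(incident.channel, incident.commander)
```

Keeping the channel, commander, and start time in one record gives every responder the same answer to "where do I go and who is in charge?"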
Step 3: Respond
The team assembles in the incident channel. The Incident Commander coordinates — they do not fix the problem personally; they direct the people who do. Everyone knows their role. Communication stays in the incident channel so nothing is lost.
Step 4: Mitigate
Mitigation means stopping the bleeding. The immediate goal is not a permanent fix — it is reducing user impact as fast as possible. Common mitigation actions include:
- Rolling back the most recent deployment
- Rerouting traffic away from the affected region
- Disabling a failing feature flag
- Restarting a crashed service
- Scaling up infrastructure to absorb load
Step 5: Resolve
Resolution means the system returns to full normal operation and the root cause is understood well enough to prevent recurrence. The incident is closed, the channel is archived, and a postmortem is scheduled.
Step 6: Review
The postmortem process (covered in the next topic) ensures the team learns from every incident. Without a review, the same problem is likely to recur.
Roles During an Incident
Incident Commander (IC)
The IC owns the incident from declaration to resolution. They coordinate all responders, make decisions when there is disagreement, communicate status to stakeholders, and keep the response moving. The IC does not fix code — they manage the process.
Technical Lead
The Technical Lead drives the diagnosis and fix. They direct engineers to investigate specific areas and decide which mitigation action to try first.
Communications Lead
For major incidents, a Communications Lead sends status updates to customers, posts to the status page, and handles internal stakeholder communication. This frees the IC and Technical Lead to focus on the technical work.
Scribe
The Scribe records everything that happens in real time: what was tried, when, and what the result was. This timeline becomes the foundation for the postmortem.
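The Scribe's record can be thought of as an append-only timeline of timestamped entries. The sketch below is a minimal illustration; the `Timeline` and `TimelineEntry` names and the rendered line format are assumptions, and in practice many teams simply use the incident channel or a shared document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    action: str  # what was tried
    result: str  # what happened


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, action: str, result: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamped with the current UTC time unless `at` is given."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M} UTC | {e.action} | {e.result}" for e in ordered)


timeline = Timeline()
timeline.record("Rolled back latest deployment", "Error rate unchanged")
timeline.record("Restarted API gateway", "Error rate dropping")
print(timeline.render())
```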
INCIDENT COMMAND STRUCTURE
----------------------------
      [Incident Commander]
               |
[Technical Lead] + [Comms Lead]
               |
[Engineers A, B, C]   [Scribe]
Incident Communication
Clear, regular communication during an incident keeps stakeholders informed and prevents chaos. The standard pattern is:
- Post an initial update within 10 minutes of declaration, even if nothing is known yet.
- Post updates every 15 to 30 minutes with current status.
- Never go silent — silence makes everyone assume the worst.
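The cadence above can be expressed as a small deadline check. This is a sketch under stated assumptions: the helper names are hypothetical, and the 30-minute value is the upper bound of the 15 to 30 minute range, so a stricter team would lower it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=10)  # first update after declaration
MAX_UPDATE_INTERVAL = timedelta(minutes=30)  # longest allowed gap between updates


def next_update_deadline(declared_at: datetime, last_update_at: Optional[datetime] = None) -> datetime:
    """Return the latest time by which the next status update should be posted."""
    if last_update_at is None:
        return declared_at + FIRST_UPDATE_WITHIN
    return last_update_at + MAX_UPDATE_INTERVAL


def is_overdue(now: datetime, declared_at: datetime, last_update_at: Optional[datetime] = None) -> bool:
    """True if the team has gone silent past the deadline."""
    return now > next_update_deadline(declared_at, last_update_at)


declared = datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc)
print(next_update_deadline(declared).strftime("%H:%M"))  # 14:15
```

A check like this could drive a reminder to the Incident Commander, so that updates continue even when the team is deep in debugging.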
Example Status Update
[14:15 UTC] INCIDENT UPDATE — SEV-2 Checkout Errors
Status: INVESTIGATING
Impact: Approximately 25% of checkout attempts failing with error code 503.
Current: Team investigating API gateway logs for root cause.
No deployment in last 24 hours.
Next update: 14:30 UTC or when significant new information is available.
IC: @sarah_sre
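A template keeps every update in the same shape, which makes updates faster to write and easier to scan. The sketch below mirrors the fields of the example above; the function name `format_status_update` and its parameters are assumptions, not part of any real tool.

```python
from datetime import datetime, timezone


def format_status_update(
    posted_at: datetime,
    severity: str,
    title: str,
    status: str,
    impact: str,
    current: str,
    next_update_at: datetime,
    commander: str,
) -> str:
    """Build a status update with the same fields as the example above."""
    return "\n".join([
        f"[{posted_at:%H:%M} UTC] INCIDENT UPDATE - {severity} {title}",
        f"Status: {status.upper()}",
        f"Impact: {impact}",
        f"Current: {current}",
        f"Next update: {next_update_at:%H:%M} UTC or when significant new information is available.",
        f"IC: @{commander}",
    ])


update = format_status_update(
    posted_at=datetime(2024, 1, 1, 14, 15, tzinfo=timezone.utc),
    severity="SEV-2",
    title="Checkout Errors",
    status="investigating",
    impact="Approximately 25% of checkout attempts failing with error code 503.",
    current="Team investigating API gateway logs for root cause.",
    next_update_at=datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc),
    commander="sarah_sre",
)
print(update)
```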
Mean Time Metrics
Teams track incident performance using a few standard measurements:
| Metric | Full Name | What It Measures |
|---|---|---|
| MTTD | Mean Time to Detect | How quickly is the problem found? |
| MTTA | Mean Time to Acknowledge | How quickly does a responder acknowledge the alert? |
| MTTM | Mean Time to Mitigate | How quickly is user impact reduced? |
| MTTR | Mean Time to Resolve | How quickly is the issue fully resolved? |
Tracking these over time shows whether the team's incident response capability is improving.
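The four metrics can be computed from per-incident timestamps. The sketch below assumes one common convention, which teams do vary: MTTD, MTTM, and MTTR are measured from the start of user impact, while MTTA is measured from detection to acknowledgement. The `IncidentTimes` record and `mean_time_metrics` function are hypothetical names, and the two sample incidents are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class IncidentTimes:
    started_at: datetime       # user impact begins
    detected_at: datetime      # alert fires or problem is reported
    acknowledged_at: datetime  # a responder picks it up
    mitigated_at: datetime     # user impact is reduced
    resolved_at: datetime      # full normal operation restored


def _mean_minutes(deltas) -> float:
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mean_time_metrics(incidents: List[IncidentTimes]) -> Dict[str, float]:
    """Return MTTD, MTTA, MTTM, and MTTR in minutes for the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "MTTD": _mean_minutes([i.detected_at - i.started_at for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged_at - i.detected_at for i in incidents]),
        "MTTM": _mean_minutes([i.mitigated_at - i.started_at for i in incidents]),
        "MTTR": _mean_minutes([i.resolved_at - i.started_at for i in incidents]),
    }


def at(hour: int, minute: int) -> datetime:
    """Helper for building sample timestamps on an arbitrary day."""
    return datetime(2024, 1, 1, hour, minute)


sample = [
    IncidentTimes(at(10, 0), at(10, 4), at(10, 6), at(10, 30), at(11, 0)),
    IncidentTimes(at(12, 0), at(12, 6), at(12, 10), at(12, 20), at(13, 30)),
]
print(mean_time_metrics(sample))
```

Whatever convention a team picks, the important part is applying it consistently so that the trend over time is meaningful.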
Key Points
- An incident is any event degrading user experience — not just total outages.
- Clear roles (IC, Technical Lead, Comms Lead, Scribe) prevent coordination chaos.
- Mitigation comes before resolution — stop the bleeding first, fix it properly second.
- Regular status updates every 15 to 30 minutes keep everyone informed.
- Tracking MTTD, MTTA, MTTM, and MTTR over time shows whether response capability is improving.
