SRE Incident Response

A hospital emergency room does not wait until a patient is critical to make a plan. It has a triage process, a clear chain of responsibility, and practiced procedures for every situation. Incident response in SRE works the same way: a structured process turns a chaotic outage into a managed, documented recovery.

What Is an Incident

An incident is any event that degrades user experience or puts service reliability at risk. It does not have to be a total outage. A significant slowdown, an elevated error rate, or a data inconsistency each qualifies as an incident.

Incident Severity Levels

Severity	Definition	Example
SEV-1	Complete outage; all users affected	Login service is down globally
SEV-2	Significant degradation; many users affected	Checkout failing for 30% of users
SEV-3	Minor degradation; some users affected	Image thumbnails not loading in one region
SEV-4	Minimal impact; cosmetic or edge case	Incorrect currency symbol in one locale
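
The table can be turned into a small triage helper. The sketch below is one possible encoding, not a standard: the table says "all", "many", and "some" users but gives no numbers, so the cut-offs `MANY_USERS` and `SOME_USERS` and the function name `classify` are assumptions chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "Complete outage; all users affected"
    SEV2 = "Significant degradation; many users affected"
    SEV3 = "Minor degradation; some users affected"
    SEV4 = "Minimal impact; cosmetic or edge case"


# Assumed cut-offs (not from the table): tune these to your own service.
MANY_USERS = 0.10
SOME_USERS = 0.01


def classify(fraction_affected: float, full_outage: bool = False) -> Severity:
    """Map an estimated fraction of affected users (0.0 to 1.0) to a severity."""
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0.0 and 1.0")
    if full_outage or fraction_affected >= 1.0:
        return Severity.SEV1
    if fraction_affected >= MANY_USERS:
        return Severity.SEV2
    if fraction_affected >= SOME_USERS:
        return Severity.SEV3
    return Severity.SEV4


# The SEV-2 example from the table: checkout failing for 30% of users.
print(classify(0.30).name)
```

In practice the fraction of affected users is only one input; a team would likely also weigh which feature is broken and whether data is at risk.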

The Incident Lifecycle

DETECT → DECLARE → RESPOND → MITIGATE → RESOLVE → REVIEW
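
The lifecycle above can be modeled as an ordered set of phases. This is a minimal sketch that assumes a strictly linear flow; real incidents sometimes loop back (for example, a failed mitigation returns the team to responding), and the names `Phase` and `advance` are illustrative.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    DECLARE = 2
    RESPOND = 3
    MITIGATE = 4
    RESOLVE = 5
    REVIEW = 6


def advance(current: Phase) -> Phase:
    """Return the phase that follows `current`; REVIEW is the final phase."""
    if current is Phase.REVIEW:
        raise ValueError("REVIEW is the last phase of the lifecycle")
    return Phase(current + 1)


# Walk the whole lifecycle from detection to review.
phase = Phase.DETECT
path = [phase.name]
while phase is not Phase.REVIEW:
    phase = advance(phase)
    path.append(phase.name)

print(" -> ".join(path))
```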

Step 1: Detect

Detection happens when an alert fires, a user reports a problem, or an engineer notices an anomaly on a dashboard. Fast detection requires well-tuned alerts and active monitoring. Every minute of undetected impact is a minute in which users have a bad experience and no one is working on it.

Step 2: Declare

Someone must officially declare the incident. Declaring it opens an incident ticket or a dedicated communication channel (like a Slack incident channel), assigns an Incident Commander, and starts the response clock. Without a clear declaration, multiple people may investigate separately without coordinating.
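
As a rough illustration, a declaration can be represented as one function call that produces the three things described above: a record with a dedicated channel, an assigned Incident Commander, and a start time for the response clock. The `INC-0001` ID format, the channel naming, and every name in this sketch are hypothetical; a real setup would call a ticketing or chat tool instead of keeping IDs in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count
from typing import Optional

_incident_numbers = count(1)  # hypothetical in-memory ID source


@dataclass(frozen=True)
class Incident:
    incident_id: str
    title: str
    commander: str         # the assigned Incident Commander
    channel: str           # dedicated communication channel
    declared_at: datetime  # start of the response clock


def declare_incident(title: str, commander: str,
                     now: Optional[datetime] = None) -> Incident:
    """Create the incident record; this is the single, explicit declaration."""
    incident_id = f"INC-{next(_incident_numbers):04d}"
    return Incident(
        incident_id=incident_id,
        title=title,
        commander=commander,
        channel=f"#{incident_id.lower()}",
        declared_at=now or datetime.now(timezone.utc),
    )


incident = declare_incident("Checkout errors", commander="sarah_sre")
print(incident.incident_id, incident.channel, incident.commander)
```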

Step 3: Respond

The team assembles in the incident channel. The Incident Commander coordinates — they do not fix the problem personally; they direct the people who do. Everyone knows their role. Communication stays in the incident channel so nothing is lost.

Step 4: Mitigate

Mitigation means stopping the bleeding. The immediate goal is not a permanent fix — it is reducing user impact as fast as possible. Common mitigation actions include:

Rolling back the most recent deployment
Rerouting traffic away from the affected region
Disabling a failing feature flag
Restarting a crashed service
Scaling up infrastructure to absorb load

Step 5: Resolve

Resolution means the system returns to full normal operation and the root cause is understood well enough to prevent recurrence. The incident is closed, the channel is archived, and a postmortem is scheduled.

Step 6: Review

The postmortem process (covered in the next topic) helps the team learn from every incident. Without a review, the same problem is likely to recur.

Roles During an Incident

Incident Commander (IC)

The IC owns the incident from declaration to resolution. They coordinate all responders, make decisions when there is disagreement, communicate status to stakeholders, and keep the response moving. The IC does not fix code — they manage the process.

Technical Lead

The Technical Lead drives the diagnosis and fix. They direct engineers to investigate specific areas and decide which mitigation action to try first.

Communications Lead

For major incidents, a Communications Lead sends status updates to customers, posts to the status page, and handles internal stakeholder communication. This frees the IC to focus on coordination and the Technical Lead to focus on diagnosis and the fix.

Scribe

The Scribe records everything that happens in real time: what was tried, when, and what the result was. This timeline becomes the foundation for the postmortem.
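
A Scribe's timeline is essentially an append-only log of "what was tried, when, and what the result was". The sketch below shows one possible shape for it; the class names and the sample entries are made up for illustration and do not come from a real incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # when it happened (timezone-aware)
    actor: str     # who did it
    action: str    # what was tried
    result: str    # what the outcome was


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, actor: str, action: str, result: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"[{e.at:%H:%M} UTC] {e.actor}: {e.action} -> {e.result}" for e in ordered
        )


# Illustrative entries, deliberately recorded out of order.
timeline = Timeline()
timeline.record("engineer_b", "Restarted checkout service", "error rate unchanged",
                at=datetime(2024, 1, 1, 14, 25, tzinfo=timezone.utc))
timeline.record("engineer_a", "Checked API gateway logs", "503s traced to one backend",
                at=datetime(2024, 1, 1, 14, 18, tzinfo=timezone.utc))
print(timeline.render())
```

Sorting at render time means a late entry ("this actually happened five minutes ago") still lands in the right place for the postmortem.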

INCIDENT COMMAND STRUCTURE
----------------------------
        [Incident Commander]
               |
    [Technical Lead] + [Comms Lead]
               |
    [Engineers A, B, C]   [Scribe]

Incident Communication

Clear, regular communication during an incident keeps stakeholders informed and prevents chaos. The standard pattern is:

Post an initial update within 10 minutes of declaration, even if nothing is known yet.
Post updates every 15 to 30 minutes with current status.
Never go silent — silence makes everyone assume the worst.
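
The cadence above is simple enough to check mechanically. The following sketch computes when the next update is due and whether the team has gone silent; it assumes the 30-minute upper bound of the range as the deadline, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=10)  # initial update after declaration
UPDATE_INTERVAL = timedelta(minutes=30)      # assumed: upper end of the 15 to 30 minute range


def next_update_due(declared_at: datetime,
                    last_update_at: Optional[datetime] = None) -> datetime:
    """Return the deadline for the next status update."""
    if last_update_at is None:
        return declared_at + FIRST_UPDATE_WITHIN
    return last_update_at + UPDATE_INTERVAL


def is_overdue(now: datetime, declared_at: datetime,
               last_update_at: Optional[datetime] = None) -> bool:
    """True if the next update deadline has already passed."""
    return now > next_update_due(declared_at, last_update_at)


declared = datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc)
print(next_update_due(declared).strftime("%H:%M"))  # first update deadline
```

A reminder bot in the incident channel could call `is_overdue` on a timer and nudge the IC or Communications Lead when it returns `True`.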

Example Status Update

[14:15 UTC] INCIDENT UPDATE — SEV-2 Checkout Errors
Status:   INVESTIGATING
Impact:   Approximately 25% of checkout attempts failing with error code 503.
Current:  Team investigating API gateway logs for root cause.
          No deployment in last 24 hours.
Next update: 14:30 UTC or when significant new information is available.
IC: @sarah_sre
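
Using a template keeps every update in the same shape, so readers always find impact and next-update time in the same place. The sketch below renders the fields of the example above; the `StatusUpdate` class is hypothetical, and it uses a plain hyphen as the separator in the header line.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class StatusUpdate:
    at: datetime
    severity: str
    title: str
    status: str
    impact: str
    current: List[str]   # one or more lines describing current work
    next_update: str
    commander: str

    def render(self) -> str:
        """Format the update as aligned plain text."""
        first, *rest = self.current or [""]
        lines = [
            f"[{self.at:%H:%M} UTC] INCIDENT UPDATE - {self.severity} {self.title}",
            f"Status:   {self.status}",
            f"Impact:   {self.impact}",
            f"Current:  {first}",
            *[f"          {line}" for line in rest],  # continuation lines, aligned
            f"Next update: {self.next_update}",
            f"IC: {self.commander}",
        ]
        return "\n".join(lines)


update = StatusUpdate(
    at=datetime(2024, 1, 1, 14, 15, tzinfo=timezone.utc),  # date is a placeholder
    severity="SEV-2",
    title="Checkout Errors",
    status="INVESTIGATING",
    impact="Approximately 25% of checkout attempts failing with error code 503.",
    current=["Team investigating API gateway logs for root cause.",
             "No deployment in last 24 hours."],
    next_update="14:30 UTC or when significant new information is available.",
    commander="@sarah_sre",
)
print(update.render())
```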

Mean Time Metrics

Teams track incident performance using a few standard measurements:

Metric	Full Name	What It Measures
MTTD	Mean Time to Detect	How quickly is the problem found?
MTTA	Mean Time to Acknowledge	How quickly does someone respond?
MTTM	Mean Time to Mitigate	How quickly is user impact reduced?
MTTR	Mean Time to Resolve	How quickly is the issue fully resolved?

Tracking these over time shows whether the team's incident response capability is improving.
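
These metrics are averages over per-incident durations. The sketch below shows one way to compute them from recorded timestamps. The choice of start point is an assumption, since teams define it differently: here MTTD, MTTM, and MTTR are measured from the start of user impact, and MTTA from detection to acknowledgment. The sample timestamps are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class IncidentTimes:
    started: datetime       # when user impact began
    detected: datetime      # alert fired or problem reported
    acknowledged: datetime  # a responder picked it up
    mitigated: datetime     # user impact reduced
    resolved: datetime      # fully back to normal


def mean_minutes(deltas: Iterable) -> float:
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mean_time_metrics(incidents: List[IncidentTimes]) -> Dict[str, float]:
    """Compute the four metrics in minutes; raises StatisticsError if the list is empty."""
    return {
        "MTTD": mean_minutes(i.detected - i.started for i in incidents),
        "MTTA": mean_minutes(i.acknowledged - i.detected for i in incidents),
        "MTTM": mean_minutes(i.mitigated - i.started for i in incidents),
        "MTTR": mean_minutes(i.resolved - i.started for i in incidents),
    }


# Two invented incidents for illustration.
history = [
    IncidentTimes(datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 5),
                  datetime(2024, 1, 1, 14, 8), datetime(2024, 1, 1, 14, 30),
                  datetime(2024, 1, 1, 15, 0)),
    IncidentTimes(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 15),
                  datetime(2024, 1, 2, 9, 20), datetime(2024, 1, 2, 9, 50),
                  datetime(2024, 1, 2, 11, 0)),
]
for name, minutes in mean_time_metrics(history).items():
    print(f"{name}: {minutes:.1f} min")
```

Means are sensitive to a single long incident, so some teams also look at medians or percentiles alongside them.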

Key Points

An incident is any event degrading user experience — not just total outages.
Clear roles (IC, Technical Lead, Comms Lead, Scribe) prevent coordination chaos.
Mitigation comes before resolution — stop the bleeding first, fix it properly second.
Regular status updates every 15 to 30 minutes keep everyone informed.
MTTD, MTTA, MTTM, and MTTR measure response capability, and tracking them over time shows whether it is improving.
