SRE Incident Response

A hospital's emergency room does not wait until a patient is critical to start planning. It has a triage process, a clear chain of responsibility, and practiced procedures for every situation. Incident response in SRE works the same way: a structured process that turns a chaotic outage into a managed, documented recovery.

What Is an Incident

An incident is any event that degrades user experience or puts service reliability at risk. It does not have to be a total outage. A significant slowdown, elevated error rates, or data inconsistencies all qualify as incidents.

Incident Severity Levels

Severity  Definition                                    Example
--------  --------------------------------------------  ------------------------------------------
SEV-1     Complete outage; all users affected           Login service is down globally
SEV-2     Significant degradation; many users affected  Checkout failing for 30% of users
SEV-3     Minor degradation; some users affected        Image thumbnails not loading in one region
SEV-4     Minimal impact; cosmetic or edge case         Incorrect currency symbol in one locale
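
As a rough illustration of how the severity table could be applied consistently, here is a minimal Python sketch. The function name `classify`, the `cosmetic_only` flag, and especially the 10% cut-off between "many" and "some" users are assumptions made for this example; the table itself defines no numeric thresholds, and each team would choose its own.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


def classify(affected_fraction: float, cosmetic_only: bool = False) -> Severity:
    """Map the share of affected users (0.0 to 1.0) to a severity level.

    The 0.10 boundary between "many" and "some" users is an assumed value,
    not something the severity table specifies.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    if cosmetic_only:
        return Severity.SEV4  # cosmetic or edge-case issues, whatever their reach
    if affected_fraction >= 1.0:
        return Severity.SEV1  # complete outage: all users affected
    if affected_fraction >= 0.10:
        return Severity.SEV2  # significant degradation: many users affected
    if affected_fraction > 0.0:
        return Severity.SEV3  # minor degradation: some users affected
    return Severity.SEV4      # no measurable user impact
```

With these assumed thresholds, a checkout failure hitting 30% of users maps to `Severity.SEV2`, matching the example in the table.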

The Incident Lifecycle

DETECT → DECLARE → RESPOND → MITIGATE → RESOLVE → REVIEW
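
One way to picture the lifecycle is as a small state machine that only moves forward one stage at a time. The sketch below is a simplification under that assumption (real incidents may loop back, for example when a mitigation fails), and the `Incident` class and `advance` method are hypothetical names used only for illustration.

```python
from enum import Enum


class Stage(Enum):
    DETECT = 1
    DECLARE = 2
    RESPOND = 3
    MITIGATE = 4
    RESOLVE = 5
    REVIEW = 6


class Incident:
    """Tracks which lifecycle stage an incident is in."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECT

    def advance(self) -> Stage:
        """Move to the next stage; stages cannot be skipped in this sketch."""
        if self.stage is Stage.REVIEW:
            raise ValueError("incident is already in its final stage")
        self.stage = Stage(self.stage.value + 1)
        return self.stage
```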

Step 1: Detect

Detection happens when an alert fires, a user reports a problem, or an engineer notices an anomaly on a dashboard. Fast detection requires well-tuned alerts and active monitoring. Every minute an incident goes undetected is another minute of users having a bad experience.

Step 2: Declare

Someone must officially declare the incident. Declaring it opens an incident ticket or a dedicated communication channel (like a Slack incident channel), assigns an Incident Commander, and starts the response clock. Without a clear declaration, multiple people may investigate separately without coordinating.
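
The declaration step can be thought of as creating a single record that ties together the channel, the Incident Commander, and the start of the response clock. The following Python sketch shows that idea only; `declare_incident`, the `Declaration` fields, and the `#inc-<date>-<slug>` channel naming scheme are all assumptions for illustration, and no real chat or ticketing API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Declaration:
    title: str
    severity: str
    commander: str
    channel: str
    declared_at: datetime  # the response clock starts here


def declare_incident(
    title: str,
    severity: str,
    commander: str,
    declared_at: Optional[datetime] = None,
) -> Declaration:
    """Build the declaration record; creating the actual channel is out of scope."""
    declared_at = declared_at or datetime.now(timezone.utc)
    slug = "-".join(title.lower().split())
    # Hypothetical naming convention: one dedicated channel per incident.
    channel = f"#inc-{declared_at:%Y%m%d}-{slug}"
    return Declaration(title, severity, commander, channel, declared_at)
```

Keeping the declaration in one immutable record means everyone who joins later sees the same severity, commander, and start time.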

Step 3: Respond

The team assembles in the incident channel. The Incident Commander coordinates — they do not fix the problem personally; they direct the people who do. Everyone knows their role. Communication stays in the incident channel so nothing is lost.

Step 4: Mitigate

Mitigation means stopping the bleeding. The immediate goal is not a permanent fix — it is reducing user impact as fast as possible. Common mitigation actions include:

  • Rolling back the most recent deployment
  • Rerouting traffic away from the affected region
  • Disabling a failing feature flag
  • Restarting a crashed service
  • Scaling up infrastructure to absorb load

Step 5: Resolve

Resolution means the system returns to full normal operation and the root cause is understood well enough to prevent recurrence. The incident is closed, the channel is archived, and a postmortem is scheduled.

Step 6: Review

The postmortem process (covered in the next topic) ensures the team learns from every incident. Without a review, the same problem is likely to recur.

Roles During an Incident

Incident Commander (IC)

The IC owns the incident from declaration to resolution. They coordinate all responders, make decisions when there is disagreement, communicate status to stakeholders, and keep the response moving. The IC does not fix code — they manage the process.

Technical Lead

The Technical Lead drives the diagnosis and fix. They direct engineers to investigate specific areas and decide which mitigation action to try first.

Communications Lead

For major incidents, a Communications Lead sends status updates to customers, posts to the status page, and handles internal stakeholder communication. This frees the IC to focus on coordination and the Technical Lead to focus on the diagnosis and fix.

Scribe

The Scribe records everything that happens in real time: what was tried, when, and what the result was. This timeline becomes the foundation for the postmortem.
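
To make the Scribe's output concrete, here is a minimal sketch of an append-only timeline in Python. The `Timeline` and `TimelineEntry` names and the rendered line format are assumptions for illustration; in practice the Scribe may simply type into the incident channel or a shared document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    action: str  # what was tried
    result: str  # what happened as a result


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, action: str, result: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"[{e.at:%H:%M} UTC] {e.action} -> {e.result}" for e in ordered
        )
```

Sorting at render time means late-entered notes still appear in the right place when the timeline is handed to the postmortem.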

INCIDENT COMMAND STRUCTURE
----------------------------
        [Incident Commander]
               |
    [Technical Lead] + [Comms Lead]
               |
    [Engineers A, B, C]   [Scribe]

Incident Communication

Clear, regular communication during an incident keeps stakeholders informed and prevents chaos. The standard pattern is:

  • Post an initial update within 10 minutes of declaration, even if nothing is known yet.
  • Post updates every 15 to 30 minutes with current status.
  • Never go silent — silence makes everyone assume the worst.
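
The cadence above can be expressed as a simple deadline check. This sketch assumes the first update is due 10 minutes after declaration and treats 30 minutes (the upper end of the 15 to 30 minute range) as the hard limit between later updates; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=10)
MAX_UPDATE_INTERVAL = timedelta(minutes=30)  # assumed hard limit between updates


def next_update_due(declared_at: datetime,
                    last_update_at: Optional[datetime] = None) -> datetime:
    """Return the latest acceptable time for the next status update."""
    if last_update_at is None:
        return declared_at + FIRST_UPDATE_WITHIN
    return last_update_at + MAX_UPDATE_INTERVAL


def is_overdue(now: datetime, declared_at: datetime,
               last_update_at: Optional[datetime] = None) -> bool:
    """True when the team has gone silent for longer than the cadence allows."""
    return now > next_update_due(declared_at, last_update_at)
```

A bot or the Communications Lead could run a check like `is_overdue` periodically and nudge the channel before anyone has to ask for news.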

Example Status Update

[14:15 UTC] INCIDENT UPDATE — SEV-2 Checkout Errors
Status:   INVESTIGATING
Impact:   Approximately 25% of checkout attempts failing with error code 503.
Current:  Team investigating API gateway logs for root cause.
          No deployment in last 24 hours.
Next update: 14:30 UTC or when significant new information available.
IC: @sarah_sre

Mean Time Metrics

Teams track incident performance using a few standard measurements:

Metric  Full Name                 What It Measures
------  ------------------------  ---------------------------------------
MTTD    Mean Time to Detect       How quickly is the problem found?
MTTA    Mean Time to Acknowledge  How quickly does someone respond?
MTTM    Mean Time to Mitigate     How quickly is user impact reduced?
MTTR    Mean Time to Resolve      How quickly is the issue fully resolved?

Tracking these over time shows whether the team's incident response capability is improving.
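
As a worked example, the four metrics can be computed from per-incident timestamps. The sketch below is written under stated assumptions: the `IncidentTimes` fields are hypothetical, MTTD, MTTM, and MTTR are measured from the start of user impact, and MTTA is measured from detection. Teams define these baselines differently, so treat the choices here as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class IncidentTimes:
    started: datetime       # user impact began
    detected: datetime      # alert fired or problem reported
    acknowledged: datetime  # a responder picked it up
    mitigated: datetime     # user impact reduced
    resolved: datetime      # full normal operation restored


def _mean_minutes(deltas: List[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def mean_time_metrics(incidents: List[IncidentTimes]) -> Dict[str, float]:
    """Return each metric in minutes, averaged over the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "MTTD": _mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTM": _mean_minutes([i.mitigated - i.started for i in incidents]),
        "MTTR": _mean_minutes([i.resolved - i.started for i in incidents]),
    }
```

For two incidents detected after 5 and 15 minutes, this yields an MTTD of 10 minutes. Comparing the same dictionary quarter over quarter gives the trend the text describes.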

Key Points

  • An incident is any event degrading user experience — not just total outages.
  • Clear roles (IC, Technical Lead, Comms Lead, Scribe) prevent coordination chaos.
  • Mitigation comes before resolution — stop the bleeding first, fix it properly second.
  • Regular status updates every 15 to 30 minutes keep everyone informed.
  • MTTD, MTTA, MTTM, and MTTR measure response capability, so the team can see whether it is improving over time.
