SRE Postmortems and Blameless Culture

A commercial airliner crashes. The investigation does not end with "the pilot made a mistake." Investigators trace every contributing factor (the training program, the maintenance schedule, the air traffic control communication protocols) until they understand the full system failure. SRE borrows this same discipline. When something goes wrong, the question is not "who failed" but "how did the system allow this to happen."

What Is a Postmortem

A postmortem is a written analysis of an incident, completed after the incident is resolved. It documents what happened, how it was detected, what the team did to fix it, and — most importantly — what changes will prevent recurrence.

The postmortem is not a punishment document. It is a learning document. Its value is entirely in the quality of the actions it produces.

The Blameless Principle

Blameless postmortems work from one foundational assumption: people do not come to work intending to cause outages. When someone's action contributes to an incident, the working assumption is that they made a reasonable decision with the information and tools available to them at the time. The failure lies in the system: the missing safeguard, the inadequate alert, the confusing interface, the absent review process.

Why Blame Fails

Blame-based culture:
- Engineer makes mistake → gets blamed → learns to hide mistakes
- Future mistakes go unreported until they become major incidents
- No one shares what they did wrong → problems repeat

Blameless culture:
- Engineer makes mistake → team learns from it together
- Root causes are documented and fixed
- Engineers feel safe escalating early → smaller incidents

Blame optimizes for punishment. Blameless culture optimizes for learning and improvement.

Anatomy of a Good Postmortem

Section 1: Summary

A two-to-three sentence overview. What happened, when it happened, how long it lasted, and what the user impact was. Anyone in the company should be able to read this section and understand the incident in under a minute.

Section 2: Timeline

A chronological record of events from first symptom detection to full resolution, built on the notes the Scribe takes in real time during the incident. A good timeline records each event to the minute and in a single time zone.

Timeline Example — Database Connection Exhaustion Incident

14:00 UTC  Alert fires: "Checkout API error rate above 2%"
14:03 UTC  On-call engineer acknowledges. Begins investigation.
14:07 UTC  API gateway logs show "connection pool exhausted" errors.
14:12 UTC  Incident declared SEV-2. Incident Commander (IC) assigned.
14:18 UTC  Trigger identified: the release deployed at 13:45 leaks connections.
14:23 UTC  Decision made to roll back release.
14:31 UTC  Rollback complete. Error rate returns to baseline.
14:45 UTC  Incident resolved. Postmortem scheduled for next morning.
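
Because the timeline records times to the minute, response intervals can be read straight off it. The following is a minimal sketch of that arithmetic rather than a standard tool: the event labels and the `minutes_between` helper are hypothetical names chosen for illustration, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Entries from the example timeline above, as (HH:MM in UTC, event label).
# The labels are hypothetical shorthand, not a standard vocabulary.
TIMELINE = [
    ("13:45", "release_deployed"),
    ("14:00", "alert_fired"),
    ("14:03", "acknowledged"),
    ("14:12", "incident_declared"),
    ("14:18", "cause_identified"),
    ("14:23", "rollback_decided"),
    ("14:31", "mitigated"),
    ("14:45", "resolved"),
]


def minutes_between(timeline, start_event, end_event):
    """Return whole minutes elapsed between two labelled events (same day assumed)."""
    times = {label: datetime.strptime(clock, "%H:%M") for clock, label in timeline}
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE, "release_deployed", "alert_fired")
time_to_acknowledge = minutes_between(TIMELINE, "alert_fired", "acknowledged")
time_to_mitigate = minutes_between(TIMELINE, "alert_fired", "mitigated")
```

For this incident the sketch gives 15 minutes from release to alert, 3 minutes to acknowledge, and 31 minutes from alert to mitigation. Tracking these intervals across incidents shows whether detection or response is the slower step.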

Section 3: Root Cause Analysis

Root cause analysis asks "why" repeatedly until the true underlying cause is found. A surface-level root cause ("the database ran out of connections") is not enough. A deep root cause explains the full chain of conditions that allowed the problem to occur.

The 5 Whys Technique

Problem: Users could not complete checkouts.

Why 1: The checkout API returned 503 errors.
Why 2: The checkout API ran out of database connections.
Why 3: A new code release leaked database connections.
Why 4: The code change did not close connections in error paths.
Why 5: The code review process did not include connection lifecycle checks.

Root Cause: No review step covers connection lifecycle management.
Fix: Add connection lifecycle checks to the code review checklist and to the integration tests.
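
Why 4 describes a common bug: a connection is returned to the pool on the success path but not when an error is raised. The sketch below reproduces that pattern with a toy pool and shows the kind of check an integration test could make. `ToyPool`, `charge_leaky`, `charge_safe`, and `leaked_after_errors` are hypothetical names for illustration only; a real service would exercise its actual database driver and pool.

```python
class ToyPool:
    """Hypothetical stand-in for a database connection pool."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # placeholder for a real connection

    def release(self, conn):
        self.in_use -= 1


def charge_leaky(pool, amount):
    conn = pool.acquire()
    if amount <= 0:
        # Error path: raises before release() runs, so the connection leaks.
        raise ValueError("amount must be positive")
    pool.release(conn)
    return "charged"


def charge_safe(pool, amount):
    conn = pool.acquire()
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return "charged"
    finally:
        # Runs on success and on error, so the connection is always returned.
        pool.release(conn)


def leaked_after_errors(charge, attempts):
    """Call `charge` with invalid input `attempts` times; return connections still held."""
    pool = ToyPool(size=attempts)
    for _ in range(attempts):
        try:
            charge(pool, 0)
        except ValueError:
            pass
    return pool.in_use
```

An integration test in this style asserts that `leaked_after_errors(charge_safe, 5)` is zero. The leaky version holds one connection per failed call, which is how a busy error path can drain a pool within minutes of a release.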

Section 4: Impact Assessment

Quantify the impact. How many users were affected? For how long? What percentage of transactions failed? What was the approximate business impact? Concrete numbers help prioritize the follow-up work.
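
These questions reduce to a few ratios. The sketch below shows one way to compute them. The request counts and the 99.9% SLO target are invented placeholders, not figures from the incident described here, and `impact_summary` is a hypothetical helper.

```python
def impact_summary(total_requests, failed_requests, slo_target, window_requests):
    """Summarise incident impact as a failure rate and a share of the error budget.

    slo_target      -- availability objective as a fraction, e.g. 0.999
    window_requests -- expected requests over the whole SLO window
    """
    failure_rate = failed_requests / total_requests
    # The error budget is the number of requests allowed to fail in the window.
    error_budget = (1 - slo_target) * window_requests
    budget_consumed = failed_requests / error_budget
    return {
        "failure_rate_pct": round(failure_rate * 100, 2),
        "budget_consumed_pct": round(budget_consumed * 100, 2),
    }


# Placeholder numbers for illustration only.
summary = impact_summary(
    total_requests=40_000,      # requests received during the incident
    failed_requests=3_000,      # requests that returned an error
    slo_target=0.999,
    window_requests=30_000_000,
)
```

With these placeholder inputs, 7.5% of requests failed during the incident and the incident consumed 10% of the window's error budget. Reporting both figures separates how bad the incident felt to users from how much reliability headroom it cost.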

Section 5: Action Items

The action items section is the most important part of the postmortem. Each action item must have an owner and a due date; items missing either field tend to pile up without ever being completed.

Action Items:
---------------------------------------------------------------------------
Action                                         Owner       Due Date  Status
---------------------------------------------------------------------------
Add connection lifecycle to code review guide  @alex_dev   Oct 15    Open
Add integration test for connection cleanup    @priya_qa   Oct 20    Open
Add connection pool saturation alert           @tom_sre    Oct 12    Open
Review other services for similar leak pattern @sam_sre    Oct 25    Open
---------------------------------------------------------------------------
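
The owner and due date requirement can also be enforced mechanically if action items are kept in a tracker or script. Here is a minimal sketch, assuming a hypothetical `ActionItem` record and `overdue` helper. The table gives no year, so 2024 is used as a placeholder.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    status: str = "Open"

    def __post_init__(self):
        # Reject items that lack either of the two required fields.
        if not self.owner.strip():
            raise ValueError(f"action item has no owner: {self.action!r}")
        if not isinstance(self.due, date):
            raise ValueError(f"action item has no due date: {self.action!r}")


def overdue(items, today):
    """Return open items whose due date has already passed."""
    return [item for item in items if item.status == "Open" and item.due < today]


YEAR = 2024  # assumption: the table above does not state a year
ITEMS = [
    ActionItem("Add connection lifecycle to code review guide", "@alex_dev", date(YEAR, 10, 15)),
    ActionItem("Add integration test for connection cleanup", "@priya_qa", date(YEAR, 10, 20)),
    ActionItem("Add connection pool saturation alert", "@tom_sre", date(YEAR, 10, 12)),
    ActionItem("Review other services for similar leak pattern", "@sam_sre", date(YEAR, 10, 25)),
]
```

Calling `overdue(ITEMS, date(YEAR, 10, 16))` returns the items owned by @alex_dev and @tom_sre, which is the list a weekly follow-up would chase.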

When to Write a Postmortem

Not every tiny hiccup needs a full postmortem. Most teams write postmortems for:

  • Any SEV-1 or SEV-2 incident
  • Any incident that consumed a significant share of the error budget
  • Any incident that required a data fix or user communication
  • Any novel failure the team had not seen before
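
The criteria above can be written down as an explicit check so the decision does not depend on who is on call. This is a sketch under stated assumptions: the `Incident` fields are hypothetical, and the 10% threshold for "significant" error budget is an invented default that each team would set for itself.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                  # 1 = SEV-1, 2 = SEV-2, and so on
    budget_consumed_pct: float     # share of the error budget used, in percent
    needed_data_fix: bool
    needed_user_communication: bool
    novel_failure: bool


def postmortem_reasons(incident, budget_threshold_pct=10.0):
    """Return the criteria this incident meets; an empty list means no postmortem is required."""
    reasons = []
    if incident.severity <= 2:
        reasons.append("SEV-1 or SEV-2")
    if incident.budget_consumed_pct >= budget_threshold_pct:
        reasons.append("significant error budget consumed")
    if incident.needed_data_fix or incident.needed_user_communication:
        reasons.append("data fix or user communication required")
    if incident.novel_failure:
        reasons.append("novel failure")
    return reasons
```

Returning the reasons rather than a bare yes or no gives the postmortem author a starting point for the summary section.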

Postmortem Review Meeting

Writing the document is not the end. Teams hold a postmortem review meeting — usually within 48 hours of the incident — where the relevant engineers walk through the document together. This discussion often reveals additional contributing factors and improves the action items.

The meeting focuses on the system, not on the people. "The alert did not fire early enough" is acceptable. "Raj should have known better" is not.

Building a Postmortem Archive

Every completed postmortem goes into a shared, searchable archive. When a new incident occurs, engineers search the archive for similar past incidents. Patterns across multiple postmortems often reveal systemic problems that no single incident makes obvious. If three postmortems in six months all mention the same database component, that component needs architectural attention.
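
The "three postmortems in six months" pattern is easy to check once each archived postmortem is tagged with the components involved. A minimal sketch follows. The archive entries, component names, and the `recurring_components` helper are all hypothetical, and six months is approximated as 182 days.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical archive entries: the date of each postmortem and the components it names.
ARCHIVE = [
    {"date": date(2024, 1, 15), "components": ["api-gateway"]},
    {"date": date(2024, 7, 10), "components": ["orders-db", "checkout-api"]},
    {"date": date(2024, 9, 2), "components": ["orders-db"]},
    {"date": date(2024, 11, 20), "components": ["orders-db", "api-gateway"]},
]


def recurring_components(postmortems, today, window_days=182, min_count=3):
    """Return components named in at least `min_count` postmortems within the window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter()
    for postmortem in postmortems:
        if postmortem["date"] >= cutoff:
            # Use a set so a component counts once per postmortem.
            counts.update(set(postmortem["components"]))
    return sorted(name for name, n in counts.items() if n >= min_count)


flagged = recurring_components(ARCHIVE, today=date(2024, 12, 1))
```

With this sample archive only `orders-db` is flagged. `api-gateway` appears twice overall, but one of those postmortems falls outside the window.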

Key Points

  • A postmortem is a learning document, not a blame document.
  • Blameless culture produces more honest analysis and fewer repeated incidents.
  • The 5 Whys technique uncovers root causes instead of stopping at surface symptoms.
  • Action items with owners and due dates are the output that actually improves reliability.
  • A searchable postmortem archive reveals systemic patterns across multiple incidents.
