SRE Postmortems and Blameless Culture
A commercial airliner crashes. The investigation does not end with "the pilot made a mistake." Investigators trace every contributing factor (the training program, the maintenance schedule, the air traffic control communication protocols) until they understand the full system failure. SRE borrows this same discipline. When something goes wrong, the question is not "who failed" but "how did the system allow this to happen."
What Is a Postmortem
A postmortem is a written analysis of an incident, completed after the incident is resolved. It documents what happened, how it was detected, what the team did to fix it, and — most importantly — what changes will prevent recurrence.
The postmortem is not a punishment document. It is a learning document. Its value is entirely in the quality of the actions it produces.
The Blameless Principle
Blameless postmortems work from one foundational assumption: people do not come to work intending to cause outages. When someone makes a mistake that leads to an incident, the working assumption is that they made a reasonable decision with the information and tools available to them at the time. The failure lies in the system: the missing safeguard, the inadequate alert, the confusing interface, the absent review process.
Why Blame Fails
Blame-based culture:
- Engineer makes mistake → gets blamed → learns to hide mistakes
- Future mistakes go unreported until they become major incidents
- No one shares what they did wrong → problems repeat

Blameless culture:
- Engineer makes mistake → team learns from it together
- Root causes are documented and fixed
- Engineers feel safe escalating early → smaller incidents
Blame optimizes for punishment. Blameless culture optimizes for learning and improvement.
Anatomy of a Good Postmortem
Section 1: Summary
A two-to-three sentence overview. What happened, when it happened, how long it lasted, and what the user impact was. Anyone in the company should be able to read this section and understand the incident in under a minute.
Section 2: Timeline
A chronological record of events from first symptom detection to full resolution. The Scribe's real-time notes during the incident form the foundation. A good timeline includes times to the minute.
Timeline Example: Database Connection Exhaustion Incident

14:00 UTC  Alert fires: "Checkout API error rate above 2%"
14:03 UTC  On-call engineer acknowledges. Begins investigation.
14:07 UTC  API gateway logs show "connection pool exhausted" errors.
14:12 UTC  Incident declared SEV-2. IC assigned.
14:18 UTC  Root cause identified: new code release from 13:45 leaked connections.
14:23 UTC  Decision made to roll back release.
14:31 UTC  Rollback complete. Error rate returns to baseline.
14:45 UTC  Incident resolved. Postmortem scheduled for next morning.
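A minute-level timeline also makes it easy to derive response metrics. The sketch below takes the timestamps from the example timeline above and computes a few intervals. The metric names (time to detect, acknowledge, mitigate, resolve) and the choice to treat "rollback complete" as the mitigation point are assumptions for illustration, not definitions given in this section.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time. The calendar date is a placeholder; all events are same-day UTC."""
    return datetime.strptime(hhmm, "%H:%M")


# Timestamps copied from the example timeline.
events = {
    "release_deployed": at("13:45"),
    "alert_fired": at("14:00"),
    "acknowledged": at("14:03"),
    "rollback_complete": at("14:31"),
    "resolved": at("14:45"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("release_deployed", "alert_fired")
time_to_acknowledge = minutes_between("alert_fired", "acknowledged")
time_to_mitigate = minutes_between("alert_fired", "rollback_complete")
time_to_resolve = minutes_between("alert_fired", "resolved")

print(time_to_detect, time_to_acknowledge, time_to_mitigate, time_to_resolve)
```

Here the faulty release ran for 15 minutes before the alert fired, which is the kind of gap a reviewer might flag when discussing alerting.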
Section 3: Root Cause Analysis
Root cause analysis asks "why" repeatedly until the true underlying cause is found. A surface-level root cause ("the database ran out of connections") is not enough. A deep root cause explains the full chain of conditions that allowed the problem to occur.
The 5 Whys Technique
Problem: Users could not complete checkouts.

Why 1: The checkout API returned 503 errors.
Why 2: The checkout API ran out of database connections.
Why 3: A new code release leaked database connections.
Why 4: The code change did not close connections in error paths.
Why 5: The code review process did not include connection lifecycle checks.

Root Cause: No review step covers connection lifecycle management.
Fix: Add connection lifecycle to code review checklist AND to integration tests.
Section 4: Impact Assessment
Quantify the impact. How many users were affected? For how long? What percentage of transactions failed? What was the approximate business impact? Concrete numbers help prioritize the follow-up work.
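The arithmetic behind an impact estimate can be sketched as follows. Every number in the example call is hypothetical, since this section gives no real figures, and the function name `impact_summary` is invented for illustration. The revenue estimate assumes every failed checkout is a lost order, which makes it an upper bound; some users retry successfully.

```python
def impact_summary(total_requests: int, failed_requests: int,
                   duration_minutes: int, avg_order_value: float) -> dict:
    """Summarize incident impact from raw counts (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    failure_rate = failed_requests / total_requests
    return {
        "failure_rate_pct": round(failure_rate * 100, 1),
        "duration_minutes": duration_minutes,
        # Upper bound: assumes each failed checkout was a lost order.
        "estimated_lost_revenue": round(failed_requests * avg_order_value, 2),
    }


# Hypothetical inputs, chosen only to show the calculation.
summary = impact_summary(
    total_requests=12_000,
    failed_requests=1_500,
    duration_minutes=31,
    avg_order_value=40.0,
)
print(summary)
```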
Section 5: Action Items
The action items section is the most important part. Each action item must have an owner and a due date. Items that lack either field tend to pile up unfinished.
Action Items:
---------------------------------------------------------------------------
Action                                          Owner       Due Date  Status
---------------------------------------------------------------------------
Add connection lifecycle to code review guide   @alex_dev   Oct 15    Open
Add integration test for connection cleanup     @priya_qa   Oct 20    Open
Add connection pool saturation alert            @tom_sre    Oct 12    Open
Review other services for similar leak pattern  @sam_sre    Oct 25    Open
---------------------------------------------------------------------------
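The owner-and-due-date rule is simple enough to check automatically before a postmortem is published. The following is a minimal sketch of such a check; the `ActionItem` class and `incomplete_items` function are hypothetical names, and due dates are kept as plain text exactly as they appear in the table above. The last item in the list is an invented example added only to show what the check catches.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None  # kept as text, as written in the table
    status: str = "Open"


def incomplete_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return the items that are missing an owner or a due date."""
    return [item for item in items if not item.owner or not item.due]


items = [
    ActionItem("Add connection lifecycle to code review guide", "@alex_dev", "Oct 15"),
    ActionItem("Add integration test for connection cleanup", "@priya_qa", "Oct 20"),
    ActionItem("Add connection pool saturation alert", "@tom_sre", "Oct 12"),
    ActionItem("Review other services for similar leak pattern", "@sam_sre", "Oct 25"),
    # Hypothetical item with no owner or due date, to demonstrate the check.
    ActionItem("Document rollback procedure"),
]

missing = incomplete_items(items)
for item in missing:
    print(f"Needs owner and due date: {item.action}")
```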
When to Write a Postmortem
Not every tiny hiccup needs a full postmortem. Most teams write postmortems for:
- Any SEV-1 or SEV-2 incident
- Any incident that consumed significant error budget
- Any incident that required a data fix or user communication
- Any novel failure the team had not seen before
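These criteria can be written down as a small decision function, which helps a team apply them consistently. This is a sketch under stated assumptions: the `Incident` fields are hypothetical, and the 10% threshold for "significant" error budget consumption is an arbitrary placeholder that each team would set for itself.

```python
from dataclasses import dataclass

# Hypothetical threshold: fraction of the error budget that counts as "significant".
SIGNIFICANT_BUDGET_FRACTION = 0.10


@dataclass
class Incident:
    severity: int                 # 1 = SEV-1, 2 = SEV-2, and so on
    error_budget_consumed: float  # fraction of the budget, 0.0 to 1.0
    needed_data_fix: bool = False
    needed_user_communication: bool = False
    novel_failure: bool = False


def needs_postmortem(incident: Incident) -> bool:
    """Apply the four criteria listed above; any one of them is sufficient."""
    return (
        incident.severity <= 2
        or incident.error_budget_consumed >= SIGNIFICANT_BUDGET_FRACTION
        or incident.needed_data_fix
        or incident.needed_user_communication
        or incident.novel_failure
    )


print(needs_postmortem(Incident(severity=2, error_budget_consumed=0.0)))
print(needs_postmortem(Incident(severity=4, error_budget_consumed=0.01)))
```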
Postmortem Review Meeting
Writing the document is not the end. Teams hold a postmortem review meeting — usually within 48 hours of the incident — where the relevant engineers walk through the document together. This discussion often reveals additional contributing factors and improves the action items.
The meeting focuses on the system, not on the people. "The alert did not fire early enough" is acceptable. "Raj should have known better" is not.
Building a Postmortem Archive
Every completed postmortem goes into a shared, searchable archive. When a new incident occurs, engineers search the archive for similar past incidents. Patterns across multiple postmortems often reveal systemic problems that no single incident makes obvious. If three postmortems in six months all mention the same database component, that component needs architectural attention.
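If each archived postmortem records the components involved, the "three postmortems in six months" check can be automated. The sketch below assumes a hypothetical archive format (a list of dictionaries with a date and a set of component names) and invented sample data; a six-month window is approximated as 183 days.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical archive entries; component names and dates are made up.
archive = [
    {"date": date(2024, 1, 10), "components": {"orders-db", "checkout-api"}},
    {"date": date(2024, 3, 2), "components": {"orders-db"}},
    {"date": date(2024, 4, 11), "components": {"checkout-api"}},
    {"date": date(2024, 5, 20), "components": {"orders-db", "api-gateway"}},
    {"date": date(2023, 6, 1), "components": {"checkout-api"}},  # outside the window
]


def recurring_components(postmortems, as_of, window_days=183, threshold=3):
    """Return components named in at least `threshold` postmortems within the window."""
    cutoff = as_of - timedelta(days=window_days)
    counts = Counter()
    for pm in postmortems:
        if cutoff <= pm["date"] <= as_of:
            # Each postmortem counts a component at most once, since components is a set.
            counts.update(pm["components"])
    return sorted(name for name, n in counts.items() if n >= threshold)


print(recurring_components(archive, as_of=date(2024, 6, 1)))
```

With this sample data only `orders-db` crosses the threshold, so it would be the component flagged for architectural attention.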
Key Points
- A postmortem is a learning document, not a blame document.
- Blameless culture produces more honest analysis and fewer repeated incidents.
- The 5 Whys technique uncovers root causes instead of stopping at surface symptoms.
- Action items with owners and due dates are the output that actually improves reliability.
- A searchable postmortem archive reveals systemic patterns across multiple incidents.
