SRE Alerting and On-Call Management
At 3 AM, an engineer's phone buzzes. The alert says the payment service error rate is above 5 percent. They open their laptop, investigate, and fix the issue in 20 minutes before most users even notice. On-call is the backbone of production reliability — and bad alerting is the fastest way to break it.
What Is an Alert
An alert is an automatic notification sent when a monitored metric crosses a defined threshold. Alerts translate raw numbers into urgent action items.
Alert Anatomy
METRIC: Payment service error rate
CONDITION: exceeds 1% for more than 5 consecutive minutes
SEVERITY: Critical
NOTIFY: On-call engineer via PagerDuty
MESSAGE: "Payment errors at 3.2%. SLO at risk. Check Runbook #P-17."
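The condition in an alert like this can be sketched in a few lines of Python. This is a minimal illustration, not how PagerDuty or any particular monitoring system evaluates rules. The `AlertRule` class, the `should_fire` function, and the one-sample-per-minute input are all hypothetical, and "sustained for 5 minutes" is simplified here to "each of the 5 most recent one-minute samples exceeds the threshold".

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    # All field names are illustrative, mirroring the anatomy above.
    metric: str
    threshold: float   # as a fraction: 0.01 means 1%
    for_minutes: int   # how many recent one-minute samples must breach
    severity: str
    notify: str


def should_fire(rule: AlertRule, samples: list[float]) -> bool:
    """Return True when the most recent `for_minutes` samples all breach.

    `samples` holds one reading per minute, oldest first (an assumption).
    """
    if len(samples) < rule.for_minutes:
        return False  # not enough history to call the breach sustained
    recent = samples[-rule.for_minutes:]
    return all(value > rule.threshold for value in recent)


rule = AlertRule(
    metric="payment_error_rate",
    threshold=0.01,
    for_minutes=5,
    severity="critical",
    notify="on-call engineer",
)
```

Requiring the breach to be sustained is what keeps a single noisy sample from paging anyone: one reading below the threshold inside the window resets the condition.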
The Biggest Problem With Alerting: Alert Fatigue
Alert fatigue happens when engineers receive so many alerts that they start ignoring them — including the ones that matter. It is one of the most dangerous problems in production operations.
How Alert Fatigue Develops
Week 1: 10 alerts per day. Engineer investigates each one carefully.
Week 4: 50 alerts per day. Half are false alarms. Engineer starts skimming.
Week 8: 100 alerts per day. Engineer acknowledges without investigating.
Week 12: A real critical incident gets ignored. Major outage results.
The solution is strict alert hygiene: every alert must require a response, and the engineer who receives it must be able to act on it.
Principles of Good Alerts
Principle 1: Every Alert Must Be Actionable
If an engineer receives an alert and cannot do anything specific in response, delete the alert. A notification that says "CPU is at 60%" with no guidance wastes attention. An alert that says "CPU sustained above 85% for 10 minutes — scale the worker pool" gives a clear next step.
Principle 2: Alert on Symptoms, Not Causes
Alert on what users experience, not on internal technical events.
BAD ALERT: "MySQL replica lag exceeded 30 seconds."
(Internal technical event — user may not be affected yet)
GOOD ALERT: "User-facing search requests returning results older than 2 minutes."
(Symptom the user experiences — definitely requires action)
Alert on the symptom first. If needed, add supporting alerts to help diagnose the cause once the symptom fires.
Principle 3: Use Multiple Severity Levels
| Severity | Meaning | Response Required |
|---|---|---|
| Critical / P1 | Service down or SLO at immediate risk | Wake someone up at 3 AM |
| High / P2 | Degraded performance, SLO at risk soon | Notify during business hours, same day |
| Medium / P3 | Minor degradation, trend worth watching | Review in next team standup |
| Low / P4 | Informational, no immediate action | Weekly review or ticket for backlog |
Only Critical alerts should wake people up at night. Everything else waits for business hours or a scheduled review.
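The routing rule in the table can be expressed as a small lookup. The sketch below is illustrative only: the action names, the `route` function, and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time) are assumptions, not part of any real paging tool.

```python
from datetime import datetime

# Hypothetical action names, one per severity row in the table above.
ROUTING = {
    "P1": "page_now",
    "P2": "notify_business_hours",
    "P3": "standup_review",
    "P4": "weekly_review",
}


def is_business_hours(now: datetime) -> bool:
    # Assumed definition: Monday-Friday, 09:00-17:00.
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(severity: str, now: datetime) -> str:
    """Pick a response for an alert; only P1 interrupts someone at night."""
    action = ROUTING[severity]
    if action == "notify_business_hours" and not is_business_hours(now):
        return "queue_until_business_hours"
    return action
```

With this mapping, a P2 alert raised at 3 AM is queued rather than paged, while a P1 alert pages immediately regardless of the time.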
On-Call Rotation Basics
An on-call rotation is a schedule where team members take turns being the primary responder for production incidents. One person or a small group is designated as on-call for a defined window — often one week at a time.
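A weekly rotation like this reduces to simple date arithmetic. The following is a minimal sketch under stated assumptions: the `on_call_for` function, the engineer names, and the rotation start date are hypothetical, and real schedules also need to handle swaps, holidays, and overrides.

```python
from datetime import date


def on_call_for(
    day: date,
    engineers: list[str],
    rotation_start: date,
    shift_days: int = 7,
) -> str:
    """Return who is on call on `day` in a fixed round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have elapsed, then wrap around the team.
    shift_index = (day - rotation_start).days // shift_days
    return engineers[shift_index % len(engineers)]


team = ["ana", "bo", "cy"]      # placeholder names
start = date(2024, 1, 1)        # placeholder start (a Monday)
```

For example, `on_call_for(date(2024, 1, 8), team, start)` falls in the second shift and returns the second engineer in the list.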
A Simple On-Call Structure
PRIMARY ON-CALL: First person paged. Responds within 5 minutes.
SECONDARY ON-CALL: Paged if primary does not respond in 10 minutes.
ESCALATION: Paged if secondary does not respond. Usually a manager or senior engineer.
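The escalation chain can be modeled as a function of how long an alert has gone unacknowledged. This is a rough sketch: the `paged_roles` function is hypothetical, the 10-minute delay before paging the secondary comes from the structure above, and the 20-minute mark for the final escalation is an assumed value, since the text does not give one.

```python
def paged_roles(
    minutes_unacknowledged: float,
    secondary_after: float = 10,    # from the structure above
    escalation_after: float = 20,   # assumed; not specified in the text
) -> list[str]:
    """List every role that has been paged so far for an unacknowledged alert."""
    roles = ["primary"]  # the primary is paged as soon as the alert fires
    if minutes_unacknowledged >= secondary_after:
        roles.append("secondary")
    if minutes_unacknowledged >= escalation_after:
        roles.append("escalation")
    return roles
```

Once someone acknowledges the alert, the caller would stop invoking this check, so nobody further up the chain is paged.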
On-Call Expectations and Handoff
A healthy on-call rotation sets clear expectations and limits. Engineers should know exactly what they are responsible for, how long shifts last, and what support is available.
On-Call Shift Checklist
- Read the handoff notes from the previous on-call engineer.
- Check the current error budget status.
- Review any open incidents or known degradations.
- Confirm alert routing and escalation paths are correct.
- Write a handoff summary for the next engineer at shift end.
Runbooks
A runbook is a step-by-step guide for responding to a specific alert. It tells the on-call engineer exactly what to check, what commands to run, and what actions to take.
Example Runbook Structure
ALERT: High database connection pool exhaustion
IMPACT: Users cannot complete transactions. Revenue impact.
IMMEDIATE STEPS:
1. Check current connection count: SELECT count(*) FROM pg_stat_activity;
2. Kill long-running idle connections: SELECT pg_terminate_backend(pid) ...
3. Check for connection leaks in application logs (search for "connection not released")
IF RESOLVED: Update this runbook with root cause.
IF NOT RESOLVED: Escalate to database team lead.
RELATED DASHBOARDS: db-connections-overview, app-pool-metrics
Runbooks get better over time. After each incident, the responding engineer updates the runbook with what they learned.
Tracking On-Call Burden
SRE teams track how many pages engineers receive per on-call shift. Google's Site Reliability Engineering book suggests a maximum of two incidents per 12-hour on-call shift. A sustained load above that signals a systemic problem that needs engineering attention, not just faster human response.
On-Call Health Check:
- Pages per shift: Target < 3 actionable pages / 12 hours
- False positive rate: Target < 10% of total alerts
- Incidents acknowledged: 100% within SLA response time
- Unreviewed alerts (week over week): Should trend toward zero
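The first three targets are straightforward to compute from per-shift counts. The sketch below assumes a hypothetical `ShiftStats` record and treats every alert as either an actionable page or a false positive; how a team actually classifies and stores these numbers will differ.

```python
from dataclasses import dataclass


@dataclass
class ShiftStats:
    # Hypothetical per-shift counters.
    shift_hours: float
    actionable_pages: int
    false_positives: int
    acked_within_sla: int


def health_report(stats: ShiftStats) -> dict:
    """Compare one shift against the targets listed above."""
    total_alerts = stats.actionable_pages + stats.false_positives
    # Normalize to a 12-hour window so shifts of any length are comparable.
    pages_per_12h = stats.actionable_pages * 12 / stats.shift_hours
    fp_rate = stats.false_positives / total_alerts if total_alerts else 0.0
    ack_rate = stats.acked_within_sla / total_alerts if total_alerts else 1.0
    return {
        "pages_per_12h": pages_per_12h,
        "false_positive_rate": fp_rate,
        "ack_rate": ack_rate,
        "healthy": pages_per_12h < 3 and fp_rate < 0.10 and ack_rate == 1.0,
    }
```

Tracking the same report shift after shift is what makes the trend visible; a single bad shift matters less than a rate that keeps climbing.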
Key Points
- Alert fatigue is the enemy of reliability — every alert must demand action.
- Alert on symptoms users experience, not internal technical events.
- Use severity levels to separate wake-at-3AM situations from next-day reviews.
- Runbooks turn unclear alerts into clear, actionable responses.
- Track on-call burden; excessive pages are a signal that automation work is overdue.
