SRE Alerting and On-Call Management

At 3 AM, an engineer's phone buzzes. The alert says the payment service error rate is above 5 percent. They open their laptop, investigate, and fix the issue in 20 minutes before most users even notice. On-call is the backbone of production reliability — and bad alerting is the fastest way to break it.

What Is an Alert

An alert is an automatic notification sent when a monitored metric crosses a defined threshold. Alerts translate raw numbers into urgent action items.

Alert Anatomy

METRIC:     Payment service error rate
CONDITION:  exceeds 1% for more than 5 consecutive minutes
SEVERITY:   Critical
NOTIFY:     On-call engineer via PagerDuty
MESSAGE:    "Payment errors at 3.2% — SLO at risk. Check Runbook #P-17."
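
The "for more than 5 consecutive minutes" condition is what keeps a brief spike from paging anyone. A minimal sketch of that check follows, assuming one error-rate sample per minute; the AlertRule class, the should_fire function, and the metric name are hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float   # error rate as a fraction: 0.01 means 1%
    for_minutes: int   # how long the breach must last before firing
    severity: str
    runbook: str


def should_fire(rule: AlertRule, samples: list) -> bool:
    """samples holds one reading per minute, oldest first (an assumption).

    Fires only when more than rule.for_minutes consecutive samples,
    ending with the newest one, are above the threshold.
    """
    needed = rule.for_minutes + 1
    if len(samples) < needed:
        return False
    return all(s > rule.threshold for s in samples[-needed:])


rule = AlertRule(
    metric="payment_service_error_rate",  # hypothetical metric name
    threshold=0.01,
    for_minutes=5,
    severity="critical",
    runbook="P-17",
)

brief_spike = [0.002, 0.002, 0.04, 0.05, 0.003, 0.002, 0.002]
sustained = [0.002, 0.032, 0.032, 0.032, 0.032, 0.032, 0.032]

print(should_fire(rule, brief_spike))  # False
print(should_fire(rule, sustained))    # True
```

The two-minute spike never fires, while the sustained 3.2% error rate does.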

The Biggest Problem With Alerting: Alert Fatigue

Alert fatigue happens when engineers receive so many alerts that they start ignoring them — including the ones that matter. It is one of the most dangerous problems in production operations.

How Alert Fatigue Develops

Week 1:   10 alerts per day. Engineer investigates each one carefully.
Week 4:   50 alerts per day. Half are false alarms. Engineer starts skimming.
Week 8:   100 alerts per day. Engineer acknowledges without investigating.
Week 12:  A real critical incident gets ignored. Major outage results.

The solution is strict alert hygiene: every alert must signal a problem that needs a human response, and the person who receives it must be able to act on it.

Principles of Good Alerts

Principle 1: Every Alert Must Be Actionable

If an engineer receives an alert and cannot do anything specific in response, delete the alert. A notification that says "CPU is at 60%" with no guidance wastes attention. An alert that says "CPU sustained above 85% for 10 minutes — scale the worker pool" gives a clear next step.

Principle 2: Alert on Symptoms, Not Causes

Alert on what users experience, not on internal technical events.

BAD ALERT:  "MySQL replica lag exceeded 30 seconds."
            (Internal technical event — user may not be affected yet)

GOOD ALERT: "User-facing search requests returning results older than 2 minutes."
            (Symptom the user experiences — definitely requires action)

Alert on the symptom first. If needed, add supporting alerts to help diagnose the cause once the symptom fires.
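
One way to picture "symptom first, causes as support" is a paging decision that only the symptom can trigger, with cause signals attached as diagnostic context. This is a rough sketch under that assumption; build_page and the signal names are hypothetical.

```python
from typing import Optional


def build_page(symptom_firing: bool, cause_signals: dict) -> Optional[dict]:
    """Page only on the user-facing symptom; causes ride along as context."""
    if not symptom_firing:
        # A cause signal on its own (e.g. replica lag) never pages anyone.
        return None
    active = sorted(name for name, firing in cause_signals.items() if firing)
    return {"page": True, "likely_causes": active}


# Hypothetical cause signals, keyed by name.
signals = {"mysql_replica_lag_over_30s": True, "search_index_build_failed": False}

print(build_page(False, signals))  # None
print(build_page(True, signals))
# {'page': True, 'likely_causes': ['mysql_replica_lag_over_30s']}
```

Replica lag alone produces no page. Once stale search results are detected, the same lag signal shows up in the page as a likely cause.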

Principle 3: Use Multiple Severity Levels

Severity        Meaning                                    Response Required
Critical / P1   Service down or SLO at immediate risk      Wake someone up at 3 AM
High / P2       Degraded performance, SLO at risk soon     Notify during business hours, same day
Medium / P3     Minor degradation, trend worth watching    Review in next team standup
Low / P4        Informational, no immediate action         Weekly review or ticket for backlog

Only Critical alerts should wake people up at night. Everything else waits for business hours or a scheduled review.
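
The routing rule above can be sketched as a small function. The business-hours window (Monday to Friday, 09:00 to 17:00) and the returned route names are assumptions for illustration; a real paging tool would have its own schedule and routing configuration.

```python
from datetime import datetime
from enum import Enum


class Severity(Enum):
    CRITICAL = "P1"
    HIGH = "P2"
    MEDIUM = "P3"
    LOW = "P4"


def is_business_hours(now: datetime) -> bool:
    # Assumption: Monday to Friday, 09:00 to 17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(severity: Severity, now: datetime) -> str:
    if severity is Severity.CRITICAL:
        return "page"  # the only level allowed to wake someone up
    if severity is Severity.HIGH:
        # Notify right away during business hours, otherwise hold.
        return "notify" if is_business_hours(now) else "hold_until_business_hours"
    if severity is Severity.MEDIUM:
        return "standup_queue"
    return "weekly_review"


three_am = datetime(2024, 1, 10, 3, 0)  # a Wednesday
print(route(Severity.CRITICAL, three_am))  # page
print(route(Severity.HIGH, three_am))      # hold_until_business_hours
```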

On-Call Rotation Basics

An on-call rotation is a schedule where team members take turns being the primary responder for production incidents. One person or a small group is designated as on-call for a defined window — often one week at a time.

A Simple On-Call Structure

PRIMARY ON-CALL:   First person paged. Expected to acknowledge within 5 minutes.
SECONDARY ON-CALL: Paged if primary has not acknowledged after 10 minutes.
ESCALATION:        Paged if secondary does not respond. Usually a manager or senior engineer.
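
The escalation chain can be sketched as a function of how long a page has gone unacknowledged. The 10-minute step to the secondary comes from the structure above; the 20-minute step to the escalation contact is an assumption, since no timing is given for it, and who_to_page is a hypothetical helper.

```python
def who_to_page(minutes_since_page: float, acknowledged: bool,
                secondary_after: float = 10,
                escalation_after: float = 20) -> list:
    """Return everyone who should have been paged by now.

    escalation_after=20 is an assumed value, not taken from the text.
    """
    if acknowledged:
        return []  # someone owns the incident; stop escalating
    targets = ["primary"]
    if minutes_since_page >= secondary_after:
        targets.append("secondary")
    if minutes_since_page >= escalation_after:
        targets.append("escalation")
    return targets


print(who_to_page(3, acknowledged=False))   # ['primary']
print(who_to_page(12, acknowledged=False))  # ['primary', 'secondary']
print(who_to_page(25, acknowledged=False))  # ['primary', 'secondary', 'escalation']
print(who_to_page(12, acknowledged=True))   # []
```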

On-Call Expectations and Handoff

A healthy on-call rotation sets clear expectations and limits. Engineers should know exactly what they are responsible for, how long shifts last, and what support is available.

On-Call Shift Checklist

  • Read the handoff notes from the previous on-call engineer.
  • Check the current error budget status.
  • Review any open incidents or known degradations.
  • Confirm alert routing and escalation paths are correct.
  • Write a handoff summary for the next engineer at shift end.

Runbooks

A runbook is a step-by-step guide for responding to a specific alert. It tells the on-call engineer exactly what to check, what commands to run, and what actions to take.

Example Runbook Structure

ALERT: Database connection pool exhaustion

IMPACT: Users cannot complete transactions. Revenue impact.

IMMEDIATE STEPS:
1. Check current connection count: SELECT count(*) FROM pg_stat_activity;
2. Kill long-running idle connections: SELECT pg_terminate_backend(pid) ...
3. Check for connection leaks in application logs (search for "connection not released")

IF RESOLVED: Update this runbook with root cause.
IF NOT RESOLVED: Escalate to database team lead.

RELATED DASHBOARDS: db-connections-overview, app-pool-metrics

Runbooks improve only when they are maintained. After each incident, the responding engineer should update the runbook with what they learned.

Tracking On-Call Burden

SRE teams track how many pages engineers receive per on-call shift. Google's SRE book suggests a maximum of two incidents per 12-hour on-call shift. More than that signals a systemic problem that needs engineering attention, not just faster human response.

On-Call Health Check:
- Pages per shift:       Target < 3 actionable pages / 12 hours
- False positive rate:   Target < 10% of total alerts
- Incidents acknowledged: 100% within SLA response time
- Unreviewed alerts (week over week): Should trend toward zero
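
The first three targets in the health check can be computed from per-shift counts. A minimal sketch, assuming the counts are already collected somewhere; ShiftStats and health_report are hypothetical names, and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShiftStats:
    shift_hours: float
    actionable_pages: int
    false_positives: int
    acked_within_sla: int  # alerts acknowledged inside the response SLA


def health_report(s: ShiftStats) -> dict:
    total = s.actionable_pages + s.false_positives
    # Normalize to a 12-hour window so shifts of any length compare.
    pages_per_12h = s.actionable_pages * 12 / s.shift_hours
    fp_rate = s.false_positives / total if total else 0.0
    return {
        "pages_ok": pages_per_12h < 3,        # target: < 3 per 12 hours
        "false_positives_ok": fp_rate < 0.10,  # target: < 10% of alerts
        "acks_ok": s.acked_within_sla == total,  # target: 100% within SLA
    }


quiet = ShiftStats(shift_hours=12, actionable_pages=2,
                   false_positives=0, acked_within_sla=2)
noisy = ShiftStats(shift_hours=12, actionable_pages=7,
                   false_positives=5, acked_within_sla=10)

print(health_report(quiet))
# {'pages_ok': True, 'false_positives_ok': True, 'acks_ok': True}
print(health_report(noisy))
# {'pages_ok': False, 'false_positives_ok': False, 'acks_ok': False}
```

A shift that fails any of these checks is a candidate for alert cleanup or automation work rather than a reason to ask the engineer to respond faster.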

Key Points

  • Alert fatigue is the enemy of reliability — every alert must demand action.
  • Alert on symptoms users experience, not internal technical events.
  • Use severity levels to separate wake-at-3AM situations from next-day reviews.
  • Runbooks turn unclear alerts into clear, actionable responses.
  • Track on-call burden; excessive pages are a signal that automation work is overdue.
