Grafana Alert Notifications

An alert rule detects the problem. A contact point delivers the news. This topic covers how to configure contact points, build notification policies, and silence alerts during maintenance — the three skills that make your alerting system reliable rather than noisy.

The Emergency Dispatch Analogy

When a fire alarm sounds in a large building, the alarm panel does not call everyone — it calls the fire department, security desk, and building manager through separate channels based on severity. Grafana notification policies work the same way. Severity-critical alerts go to PagerDuty and wake up an on-call engineer. Severity-warning alerts send a Slack message to the team channel. Same alert system, routed intelligently.

Contact Points

A contact point tells Grafana where and how to deliver the notification. One contact point can include multiple notification integrations — for example, send to both Email and Slack simultaneously.

Creating a Contact Point

Go to Alerting → Contact points → Add contact point. Give it a name like Ops Team Slack. Choose the integration type and fill in the credentials.

Email Contact Point

Integration: Email
Addresses:   ops-team@yourcompany.com; manager@yourcompany.com
Subject:     [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}

Grafana sends one email per alert group. The subject line dynamically includes the alert status (FIRING or RESOLVED) and the summary annotation.
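To make the subject template concrete, its output can be approximated outside Grafana. The sketch below is a plain Python imitation, not Grafana's Go template engine; render_subject is a hypothetical helper and the sample summary text is an assumption for illustration.

```python
def render_subject(status: str, common_annotations: dict) -> str:
    """Imitate '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'."""
    # A missing summary annotation is rendered as an empty string here;
    # that fallback is a choice made for this sketch.
    summary = common_annotations.get("summary", "")
    return f"[{status.upper()}] {summary}"


print(render_subject("firing", {"summary": "CPU above 90%"}))
print(render_subject("resolved", {"summary": "CPU above 90%"}))
```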

Slack Contact Point

Integration:  Slack
Webhook URL:  https://hooks.slack.com/services/T.../B.../...
Channel:      #alerts-ops
Username:     Grafana

Create a Slack Incoming Webhook from the Slack API website and paste the URL. The alert message appears in the specified channel with colour coding — red for firing, green for resolved.

PagerDuty Contact Point

Integration:      PagerDuty
Integration Key:  (from PagerDuty service configuration)
Severity:         critical

PagerDuty handles on-call rotation and escalation policies. Grafana sends the alert to PagerDuty, which then pages the on-call engineer via SMS, phone call, or app push notification.

Microsoft Teams Contact Point

Integration:   Microsoft Teams
Webhook URL:   https://yourorg.webhook.office.com/...

Webhook Contact Point

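The webhook integration sends a JSON POST request to any URL you specify. Use it to integrate Grafana with custom applications, ticketing systems, or automation tools. The example below is abbreviated to a single alert; the payload Grafana actually sends wraps each alert in an alerts array alongside group-level fields such as commonLabels and commonAnnotations.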

POST https://your-system.com/grafana-alert
Content-Type: application/json

{
  "status": "firing",
  "labels": { "alertname": "High CPU", "instance": "server-01" },
  "annotations": { "summary": "CPU above 90%", "description": "..." },
  "startsAt": "2024-03-01T12:05:00Z"
}
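On the receiving side, a small HTTP listener is enough to consume these requests. The following is a minimal sketch using only the Python standard library, not a production receiver. The names summarize_alert, AlertHandler, and run, and the port 8080, are assumptions made for illustration. Real Grafana payloads nest alerts under an alerts array, so the handler accepts either that shape or the single-alert shape shown above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alert(alert: dict) -> str:
    """Build a one-line summary from the fields shown in the example payload."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return "{status}: {name} on {instance} ({summary})".format(
        status=alert.get("status", "unknown").upper(),
        name=labels.get("alertname", "unnamed alert"),
        instance=labels.get("instance", "unknown instance"),
        summary=annotations.get("summary", "no summary"),
    )


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:
            # Body was empty or not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        # Use the nested "alerts" list if present, else treat the body as one alert.
        for alert in payload.get("alerts", [payload]):
            print(summarize_alert(alert))
        self.send_response(200)
        self.end_headers()


def run(port: int = 8080) -> None:
    """Start the listener; the port is an arbitrary choice for local testing."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling run() starts the listener. Pointing a webhook contact point at that host and port, then using the Test button, is one way to check the round trip.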

Testing a Contact Point

After saving a contact point, click the Test button. Grafana sends a test notification immediately. Check that the message arrives at the destination before relying on the contact point in production.

Notification Policies

A notification policy is a routing tree. It matches incoming alerts to contact points based on label values. The default policy at the root matches every alert, so an alert that matches no child policy is still delivered somewhere.

Default Policy

The default policy is the catch-all. Any alert that does not match a specific policy lands here. Configure the default policy to send to your general-purpose contact point — usually a team email or Slack channel.

Adding a Specific Policy

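Go to Alerting → Notification policies → + New child policy. Set the matching labels and the contact point to use when those labels match. Sibling policies are evaluated from top to bottom, and unless you enable the option to continue matching, an alert is routed by the first sibling it matches. In the tree below, an alert labelled both severity=critical and team=database therefore goes to PagerDuty only.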

Notification Policy Tree:

Default Policy → Contact: General Slack (#alerts-general)
│
├── Match: severity=critical
│   → Contact: PagerDuty (wakes on-call engineer)
│
├── Match: severity=warning
│   → Contact: Team Slack (#alerts-warning)
│
└── Match: team=database
    → Contact: DBA Email (database team)
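The routing decision in this tree can be sketched in a few lines of Python. This is an illustration of the matching logic only, not Grafana's implementation. It assumes exact-match label matchers, a single level of child policies, and that the option to continue matching sibling policies is left off, so the first matching child wins. The names route, CHILD_POLICIES, and DEFAULT_CONTACT are hypothetical.

```python
DEFAULT_CONTACT = "General Slack"

# Child policies in tree order: (label, value, contact point).
CHILD_POLICIES = [
    ("severity", "critical", "PagerDuty"),
    ("severity", "warning", "Team Slack"),
    ("team", "database", "DBA Email"),
]


def route(labels: dict) -> str:
    """Return the contact point for an alert with the given labels."""
    for label, value, contact in CHILD_POLICIES:
        if labels.get(label) == value:
            return contact  # first matching sibling wins
    return DEFAULT_CONTACT  # nothing matched: fall back to the default policy


print(route({"alertname": "High CPU", "severity": "warning"}))
print(route({"alertname": "Slow queries", "team": "database"}))
print(route({"alertname": "Disk full", "severity": "critical", "team": "database"}))
print(route({"alertname": "Cert expiry", "severity": "info"}))
```

The third call shows why policy order matters: the alert carries both labels, but it reaches the severity=critical policy first and never gets to the team=database policy.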

Group By

The Group by setting in a policy groups multiple firing alerts into a single notification instead of sending one message per alert. Grouping by alertname and instance means all alerts with the same name and instance are bundled together. This prevents alert storms from generating hundreds of messages.
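As a rough illustration of grouping, the sketch below buckets alerts by the values of the Group by labels, so each bucket would become one notification. The function name group_alerts and the core label in the sample data are assumptions for the example, not part of Grafana.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alerts by the values of the group-by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts missing a group-by label share an empty value for it.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "High CPU", "instance": "server-01", "core": "0"},
    {"alertname": "High CPU", "instance": "server-01", "core": "1"},
    {"alertname": "High CPU", "instance": "server-02", "core": "0"},
]

for key, members in group_alerts(alerts).items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Here three firing alerts collapse into two notifications, because the two server-01 alerts share the same alertname and instance.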

Group Wait, Group Interval, and Repeat Interval

Group wait:      30s   ← wait 30 seconds after first alert before sending,
                         to collect other alerts in the same group
Group interval:  5m    ← after the first notification, wait 5 minutes before sending
                         an update when alerts in the group are added or resolved
Repeat interval: 4h    ← resend the notification every 4 hours if the alert
                         is still firing and no changes occurred
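How the three timers interact can be expressed as a small decision function. This is a simplified model for intuition, assuming the defaults listed above, and it is not how Grafana schedules notifications internally. Timings and should_notify are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Timings:
    group_wait: timedelta = timedelta(seconds=30)
    group_interval: timedelta = timedelta(minutes=5)
    repeat_interval: timedelta = timedelta(hours=4)


def should_notify(
    t: Timings,
    since_first_alert: timedelta,
    since_last_sent: Optional[timedelta],
    group_changed: bool,
) -> bool:
    """Decide whether a notification for one alert group is due."""
    if since_last_sent is None:
        # Nothing sent yet: hold back until the group wait has passed.
        return since_first_alert >= t.group_wait
    if group_changed:
        # Alerts were added or resolved: send an update after the group interval.
        return since_last_sent >= t.group_interval
    # No change and still firing: remind only after the repeat interval.
    return since_last_sent >= t.repeat_interval


t = Timings()
print(should_notify(t, timedelta(seconds=10), None, False))
print(should_notify(t, timedelta(minutes=10), timedelta(minutes=5), True))
print(should_notify(t, timedelta(minutes=10), timedelta(minutes=5), False))
```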

Silences

A silence mutes alert notifications for a specific time window. Use silences during planned maintenance to prevent expected alerts from disturbing the on-call team.

Creating a Silence

Go to Alerting → Silences → Add silence.

Start:    2024-03-15 02:00
End:      2024-03-15 04:00
Comment:  Planned database maintenance window
Matchers: alertname=~".*", instance="db-server-01"

The matcher alertname=~".*" accepts any alert name, so it is the matcher instance="db-server-01" that narrows the silence: every alert carrying that label is muted during the window. Other servers continue to send alerts normally.
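The way a silence is evaluated can be sketched as follows: an alert is muted only when the current time falls inside the window and every matcher matches its labels. This is a minimal sketch, not Grafana's code. It supports only the = and =~ operators used above, treats regular expressions as matching the whole label value, and uses naive datetimes because the example does not state a time zone. SILENCE, matches, and is_silenced are hypothetical names.

```python
import re
from datetime import datetime

SILENCE = {
    "start": datetime(2024, 3, 15, 2, 0),
    "end": datetime(2024, 3, 15, 4, 0),
    # (label, operator, value) triples mirroring the matchers above.
    "matchers": [("alertname", "=~", ".*"), ("instance", "=", "db-server-01")],
}


def matches(matcher, labels: dict) -> bool:
    """Check one matcher against an alert's labels."""
    label, op, value = matcher
    actual = labels.get(label, "")
    if op == "=":
        return actual == value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    raise ValueError(f"unsupported operator: {op}")


def is_silenced(labels: dict, now: datetime, silence: dict = SILENCE) -> bool:
    """True when 'now' is inside the window and all matchers match."""
    in_window = silence["start"] <= now < silence["end"]
    return in_window and all(matches(m, labels) for m in silence["matchers"])


during = datetime(2024, 3, 15, 3, 0)
print(is_silenced({"alertname": "High CPU", "instance": "db-server-01"}, during))
print(is_silenced({"alertname": "High CPU", "instance": "web-server-01"}, during))
```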

Expiry

Silences expire automatically at the end time you set. There is no need to manually remove them after the maintenance is complete, though you can end a silence early from the Silences page if the work finishes ahead of schedule.

Inhibition Rules

An inhibition rule suppresses less-important alerts when a more-critical alert is already firing. For example, when a "Server Down" alert fires, inhibit all other alerts for that server — there is no point receiving "High CPU" alerts if the server is completely offline.

Inhibition rule:
  Source alert:  alertname="ServerDown"  (the main alert)
  Target alerts: instance matches source  (suppress these)

Effect:
  ServerDown fires for server-01
    → CPU alert for server-01 is suppressed
    → Memory alert for server-01 is suppressed
    → Disk alert for server-01 is suppressed
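The effect above can be modelled as a filter over the currently firing alerts. The sketch below is an illustration under simple assumptions: a single source alert name, a single shared label (instance), and exact matching. apply_inhibition and the sample alert names are hypothetical, and this is not the Alertmanager implementation.

```python
def apply_inhibition(firing, source_name="ServerDown", equal_label="instance"):
    """Split firing alerts into (delivered, suppressed) lists."""
    # Label values for which the source alert is currently firing.
    down = {
        alert[equal_label]
        for alert in firing
        if alert.get("alertname") == source_name and equal_label in alert
    }
    delivered, suppressed = [], []
    for alert in firing:
        is_source = alert.get("alertname") == source_name
        if not is_source and alert.get(equal_label) in down:
            suppressed.append(alert)  # same instance as a firing source alert
        else:
            delivered.append(alert)
    return delivered, suppressed


firing = [
    {"alertname": "ServerDown", "instance": "server-01"},
    {"alertname": "High CPU", "instance": "server-01"},
    {"alertname": "High Memory", "instance": "server-01"},
    {"alertname": "High CPU", "instance": "server-02"},
]

delivered, suppressed = apply_inhibition(firing)
print("delivered:", [(a["alertname"], a["instance"]) for a in delivered])
print("suppressed:", [(a["alertname"], a["instance"]) for a in suppressed])
```

The source alert itself is always delivered, and alerts for server-02 are untouched because no ServerDown alert is firing for that instance.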

Alert History

Go to Alerting → Alert history to see a timeline of all state changes for all alert rules. Each state transition — Normal → Pending → Firing → Normal — is recorded with a timestamp. Use this log to understand how long an incident lasted and when it was first detected.
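If the state transitions are copied out of the history view, incident timing can be computed from them. The following sketch assumes the transitions are available as a time-ordered list of (timestamp, state) pairs; that input format, the function name incident_summary, and the sample timestamps are assumptions for illustration, not a Grafana export format.

```python
from datetime import datetime


def incident_summary(transitions):
    """Summarize the first incident in a time-ordered list of (ISO time, state)."""
    first_detected = fired_at = resolved_at = None
    for stamp, state in transitions:
        when = datetime.fromisoformat(stamp)
        if state == "Pending" and first_detected is None:
            first_detected = when
        elif state == "Firing" and fired_at is None:
            fired_at = when
        elif state == "Normal" and fired_at is not None and resolved_at is None:
            resolved_at = when
    duration = None
    if fired_at is not None and resolved_at is not None:
        duration = resolved_at - fired_at
    return {
        # Fall back to the firing time if no Pending state was recorded.
        "first_detected": first_detected or fired_at,
        "fired_at": fired_at,
        "resolved_at": resolved_at,
        "firing_duration": duration,
    }


history = [
    ("2024-03-01T12:00:00", "Normal"),
    ("2024-03-01T12:04:00", "Pending"),
    ("2024-03-01T12:05:00", "Firing"),
    ("2024-03-01T12:35:00", "Normal"),
]

summary = incident_summary(history)
print("first detected:", summary["first_detected"])
print("firing for:", summary["firing_duration"])
```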
