Grafana Alerting Basics
Grafana Alerting watches your data continuously and sends notifications when something crosses a threshold. Instead of staring at dashboards all day, you define rules once and let Grafana do the watching. When CPU spikes to 95%, you get a message. When it drops back to normal, you get an all-clear.
The Smoke Alarm Analogy
A smoke alarm does not require you to watch the room for fire. It monitors conditions constantly and only speaks when something is wrong. Grafana alerts work the same way — they monitor metric values in the background and only contact you when action is needed.
Normal state:
  CPU: 45% → below threshold → silence
Alert fires:
  CPU: 96% → above threshold → Slack message: "CPU CRITICAL on server-01"
Recovery:
  CPU: 48% → below threshold → Slack message: "CPU RESOLVED on server-01"
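The three cases above boil down to "speak only when the value crosses the threshold". A minimal Python sketch of that idea follows; the 90% threshold, the function name, and the message wording are assumptions for illustration, not Grafana internals.

```python
from typing import Optional

THRESHOLD = 90.0  # hypothetical threshold, in percent


def notification_for(previous: float, current: float, host: str) -> Optional[str]:
    """Return a message only when the value crosses the threshold."""
    was_above = previous > THRESHOLD
    is_above = current > THRESHOLD
    if is_above and not was_above:
        return f"CPU CRITICAL on {host}"   # crossed upward: alert
    if was_above and not is_above:
        return f"CPU RESOLVED on {host}"   # crossed downward: all-clear
    return None                            # no crossing: stay silent
```

Calling notification_for(45, 96, "server-01") yields the critical message, while two consecutive readings on the same side of the threshold yield nothing.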
Grafana Alerting Architecture
Grafana Alerting (introduced as Unified Alerting in Grafana 8 and enabled by default since Grafana 9) has four main components that work together.
┌─────────────────┐     ┌──────────────────┐     ┌───────────────────┐
│   Alert Rule    │────▶│  Alert Manager   │────▶│   Contact Point   │
│ (what to watch) │     │ (routes alerts)  │     │ (where to notify) │
└─────────────────┘     └──────────────────┘     └───────────────────┘
         │                       │
         ▼                       ▼
  ┌─────────────┐       ┌──────────────────┐
  │   Silence   │       │   Notification   │
  │   (mutes    │       │      Policy      │
  │    noise)   │       │ (routing rules)  │
  └─────────────┘       └──────────────────┘
Alert Rule
An alert rule defines what metric to evaluate, how often to check it, and what condition triggers the alert. For example: check CPU usage every 1 minute; fire if the value exceeds 90% for more than 5 minutes.
Contact Point
A contact point is where Grafana sends the notification — Email, Slack, PagerDuty, Microsoft Teams, OpsGenie, or a custom webhook. You configure contact points once and reference them from notification policies.
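For the custom webhook option, the receiving service has to parse whatever Grafana posts to it. The sketch below assumes an Alertmanager-style JSON body with an "alerts" list whose entries carry "status", "labels", and "annotations"; check the payload your Grafana version actually sends before relying on these field names. The function name is hypothetical.

```python
import json
from typing import List


def summarize_webhook(body: str) -> List[str]:
    """Turn a webhook request body into one readable line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        # Field names below are assumptions about the payload shape.
        status = alert.get("status", "unknown").upper()
        name = alert.get("labels", {}).get("alertname", "unnamed alert")
        summary = alert.get("annotations", {}).get("summary", "")
        line = f"[{status}] {name}"
        if summary:
            line += f": {summary}"
        lines.append(line)
    return lines
```

Using .get() with defaults throughout keeps the receiver from crashing if a field is missing, which is a reasonable stance for code that handles alerts.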
Notification Policy
The notification policy is a routing tree that decides which contact point receives which alert. You match alerts by label (such as severity=critical) and route them to the appropriate team.
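Label-based routing can be pictured as a list of matchers checked in order, with a default at the end. The Python sketch below is a deliberately flattened model: a real notification policy tree is nested and has more options, and the contact point names here are made up.

```python
from typing import Dict, List, Tuple

# (label matchers, contact point). All names are hypothetical examples.
POLICIES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical", "team": "ops"}, "ops-pagerduty"),
    ({"severity": "critical"}, "oncall-slack"),
]
DEFAULT_CONTACT_POINT = "email-fallback"


def route(labels: Dict[str, str]) -> str:
    """Return the contact point of the first policy whose matchers all match."""
    for matchers, contact_point in POLICIES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return DEFAULT_CONTACT_POINT  # nothing matched: fall back to the default
```

Ordering matters in this model: the more specific matcher is listed first so that it wins over the broader severity=critical rule.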
Silence
A silence temporarily suppresses alerts matching specific labels. Use it during planned maintenance windows so on-call engineers are not flooded with expected alerts.
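A silence combines two checks: do the alert's labels match, and is the current time inside the window? The following Python sketch models only that logic; the class and field names are illustrative assumptions, and only exact label equality is supported.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]   # labels an alert must carry to be muted
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, labels: Dict[str, str], now: datetime) -> bool:
        """True if the alert matches and 'now' falls inside the window."""
        in_window = self.starts_at <= now < self.ends_at
        matches = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and matches
```

For a maintenance window from 02:00 to 04:00 on server-01, an alert with instance=server-01 at 03:00 is suppressed, while the same alert at 05:00, or an alert from another instance, still notifies.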
Alert States
Every alert rule moves through a series of states based on the current metric value.
Normal ──────▶ Pending ──────▶ Firing ──────▶ Normal
       (condition      (condition     (condition
       met but not     met for the    no longer
       long enough)    full "For"     met)
                       duration)
Normal
The metric value is within the acceptable range. No notification sent.
Pending
The condition is met, but the "For" duration has not elapsed yet. This prevents false alarms from momentary spikes. If the value recovers before the duration expires, the alert returns to Normal without firing.
Firing
The condition stayed true for the full "For" duration. Grafana sends a notification to the contact point.
No Data
The query returned no data. This usually means the data source is down or the metric stopped being collected. You can configure how the rule handles this case: keep it in the No Data state, treat it as Alerting (firing), or treat it as Normal.
Error
The query failed to execute — typically a connection problem with the data source.
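The Normal, Pending, and Firing transitions can be sketched as a small state machine. This Python model is a simplification under stated assumptions: it tracks one alert, takes timestamps in seconds, and leaves out No Data and Error. The class name and fields are hypothetical, not Grafana's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    for_seconds: float                  # the "For" duration
    state: str = "Normal"
    pending_since: Optional[float] = None

    def evaluate(self, now: float, condition_met: bool) -> str:
        """Advance the state for one evaluation at time 'now' (seconds)."""
        if not condition_met:
            # Recovery, or a spike that ended before "For" elapsed.
            self.state, self.pending_since = "Normal", None
        elif self.state == "Normal":
            self.pending_since = now
            self.state = "Firing" if self.for_seconds == 0 else "Pending"
        elif self.state == "Pending" and now - self.pending_since >= self.for_seconds:
            self.state = "Firing"
        return self.state
```

With for_seconds=300 and an evaluation every 60 seconds, a condition that first becomes true at t=0 stays Pending through t=240 and turns Firing at t=300. A single false evaluation at any point resets the rule to Normal.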
Creating an Alert Rule
Step 1 – Open Alerting
Click the bell icon in the left sidebar to open the Alerting section. Click Alert rules → New alert rule.
Step 2 – Name the Rule
Give the alert a descriptive name such as High CPU Usage — server-01. Good names tell the on-call engineer exactly what fired without reading the full alert details.
Step 3 – Define the Query
Select the data source and write the query whose value you want to watch. For CPU monitoring with Prometheus:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
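The arithmetic behind this query is easier to see with plain numbers. node_cpu_seconds_total{mode="idle"} counts seconds spent idle, so its per-second rate is the idle fraction, and 100 minus that fraction times 100 is the busy percentage. The Python sketch below reproduces only this arithmetic for a single, already-averaged counter; it ignores counter resets and the extrapolation that rate() performs, and the function name is made up.

```python
def cpu_busy_percent(idle_start: float, idle_end: float, window_seconds: float) -> float:
    """Busy CPU percentage from an idle-seconds counter sampled at both ends of a window."""
    idle_rate = (idle_end - idle_start) / window_seconds  # idle seconds per second (0..1)
    return 100 - idle_rate * 100


# Example: over a 5-minute (300 s) window the idle counter grew by 150 s,
# so the CPU was idle half the time and busy the other half.
busy = cpu_busy_percent(idle_start=1000.0, idle_end=1150.0, window_seconds=300.0)
```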
Step 4 – Set the Condition
In the Expressions section, add a Threshold expression referencing Query A. Set the condition to IS ABOVE 90. This means the alert fires when CPU exceeds 90%.
Query A → value
Threshold: value IS ABOVE 90 → fires alert
Step 5 – Set the Evaluation Interval and For Duration
The Evaluate every field controls how often Grafana checks the condition — for example, every 1 minute. The For field sets how long the condition must stay true before firing — for example, 5 minutes. This prevents alerts from firing on brief spikes.
Evaluate every: 1m
For: 5m
→ CPU must stay above 90% for 5 consecutive minutes before an alert fires
Step 6 – Add Labels
Labels on alert rules work like labels on metrics — they identify the alert and control routing. Add labels such as severity=critical and team=ops. Notification policies use these labels to route the alert to the right contact point.
Step 7 – Add Annotations
Annotations add human-readable context to the alert notification. The Summary field holds a short description. The Description field holds detailed information and can reference metric values using template variables.
Summary: CPU usage is critically high
Description: Instance {{ $labels.instance }} CPU is at {{ $values.A }}%
Check top processes with: ssh {{ $labels.instance }} 'top -bn1'
Step 8 – Set the Folder and Evaluation Group
Organise alert rules into folders (like dashboard folders). An evaluation group is a set of rules within a folder that Grafana evaluates together; all rules in the same group share the same evaluation interval.
Step 9 – Save
Click Save rule and exit. The rule becomes active immediately and appears in the Alert Rules list with a status indicator.
Testing an Alert Rule
After saving, click View on the alert rule to open its detail page. You see a live preview of the query result and the current state. Click Test rule to force-evaluate the condition immediately — useful for verifying the rule logic without waiting for a real incident.
