Grafana Alerting Basics

Grafana Alerting watches your data continuously and sends notifications when something crosses a threshold. Instead of staring at dashboards all day, you define rules once and let Grafana do the watching. When CPU spikes to 95%, you get a message. When it drops back to normal, you get an all-clear.

The Smoke Alarm Analogy

A smoke alarm does not require you to watch the room for fire. It monitors conditions constantly and only speaks when something is wrong. Grafana alerts work the same way — they monitor metric values in the background and only contact you when action is needed.

Normal state:
  CPU: 45% → below threshold → silence

Alert fires:
  CPU: 96% → above threshold → Slack message: "CPU CRITICAL on server-01"

Recovery:
  CPU: 48% → below threshold → Slack message: "CPU RESOLVED on server-01"
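
The three situations above can be sketched in a few lines of Python. This is only an illustration of the "speak on state change" idea, not Grafana's implementation: the function name, the 90% threshold, and the message texts are assumptions chosen to mirror the example, and the sketch ignores the "For" duration covered later.

```python
from typing import Optional

THRESHOLD = 90.0  # percent; illustrative value, not a Grafana default


def notification_for(previous_firing: bool, cpu_percent: float, host: str) -> Optional[str]:
    """Return a message only when the alert state changes, otherwise None."""
    firing = cpu_percent > THRESHOLD
    if firing and not previous_firing:
        return f"CPU CRITICAL on {host}"
    if previous_firing and not firing:
        return f"CPU RESOLVED on {host}"
    return None  # no state change, so stay silent


firing = False
messages = []
for reading in [45.0, 96.0, 97.0, 48.0]:
    msg = notification_for(firing, reading, "server-01")
    firing = reading > THRESHOLD
    if msg is not None:
        messages.append(msg)

print(messages)
```

Four readings produce two messages: one when the value first crosses the threshold and one when it recovers. The second high reading (97%) is silent because the state did not change.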

Grafana Alerting Architecture

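Grafana Alerting (introduced as unified alerting in Grafana 8 and made the default in Grafana 9) is built from several components that work together: alert rules, the Alert Manager, notification policies, contact points, and silences.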

┌─────────────────┐     ┌──────────────────┐     ┌───────────────────┐
│   Alert Rule    │────▶│  Alert Manager   │────▶│  Contact Point    │
│ (what to watch) │     │ (routes alerts)  │     │ (where to notify) │
└─────────────────┘     └──────────────────┘     └───────────────────┘
         │                       │
         ▼                       ▼
  ┌─────────────┐       ┌──────────────────┐
  │  Silence    │       │ Notification     │
  │  (mutes     │       │ Policy           │
  │   noise)    │       │ (routing rules)  │
  └─────────────┘       └──────────────────┘

Alert Rule

An alert rule defines what metric to evaluate, how often to check it, and what condition triggers the alert. For example: check CPU usage every 1 minute; fire if the value exceeds 90% for more than 5 minutes.

Contact Point

A contact point is where Grafana sends the notification — Email, Slack, PagerDuty, Microsoft Teams, OpsGenie, or a custom webhook. You configure contact points once and reference them from notification policies.

Notification Policy

The notification policy is a routing tree that decides which contact point receives which alert. You match alerts by label (such as severity=critical) and route them to the appropriate team.
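
A routing tree of this kind can be modelled roughly as follows. This is a simplified sketch, not Grafana's code: the `Policy` class, the `route` function, and the contact point names are hypothetical, matchers are limited to exact equality, and the first matching child wins. Real notification policies support more matcher types and options.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Policy:
    contact_point: str
    matchers: Dict[str, str] = field(default_factory=dict)
    children: List["Policy"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # Every matcher must equal the alert's label value.
        return all(labels.get(k) == v for k, v in self.matchers.items())


def route(policy: Policy, labels: Dict[str, str]) -> str:
    """Walk down the tree; fall back to the current node if no child matches."""
    for child in policy.children:
        if child.matches(labels):
            return route(child, labels)
    return policy.contact_point


# The root acts as the default policy and catches everything else.
root = Policy(
    contact_point="email-default",
    children=[
        Policy(
            contact_point="slack-critical",
            matchers={"severity": "critical"},
            children=[Policy(contact_point="pagerduty-ops", matchers={"team": "ops"})],
        ),
        Policy(contact_point="slack-warnings", matchers={"severity": "warning"}),
    ],
)

print(route(root, {"severity": "critical", "team": "ops"}))
print(route(root, {"severity": "critical", "team": "db"}))
print(route(root, {"severity": "info"}))
```

The most specific matching branch receives the alert, and anything that matches no branch falls through to the default contact point at the root.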

Silence

A silence temporarily suppresses alerts matching specific labels. Use it during planned maintenance windows so on-call engineers are not flooded with expected alerts.
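
Conceptually, a silence is a set of label matchers plus a time window. The sketch below illustrates that idea only; the `Silence` class, its field names, and the example labels are assumptions, and matching is limited to exact equality.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, labels: Dict[str, str], now: datetime) -> bool:
        """True if the alert matches every matcher and `now` is inside the window."""
        in_window = self.starts_at <= now < self.ends_at
        label_match = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and label_match


# Hypothetical two-hour maintenance window for one server.
maintenance = Silence(
    matchers={"instance": "server-01"},
    starts_at=datetime(2024, 1, 10, 22, 0),
    ends_at=datetime(2024, 1, 11, 0, 0),
)

alert_labels = {"instance": "server-01", "severity": "critical"}
print(maintenance.suppresses(alert_labels, datetime(2024, 1, 10, 23, 0)))  # inside the window
print(maintenance.suppresses(alert_labels, datetime(2024, 1, 11, 1, 0)))   # after the window
```

The rule keeps evaluating during the window; only the notification is suppressed, which is why the silence expires on its own without anyone re-enabling the alert.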

Alert States

Every alert rule moves through a series of states based on the current metric value.

Normal ──────▶ Pending ──────▶ Firing ──────▶ Normal
               (condition     (condition      (condition
               met but not    met for the     no longer
               long enough)   full "For"      met)
                              duration)

Normal

The metric value is within the acceptable range. No notification sent.

Pending

The condition is met, but the "For" duration has not elapsed yet. This prevents false alarms from momentary spikes. If the value recovers before the duration expires, the alert returns to Normal without firing.

Firing

The condition stayed true for the full "For" duration. Grafana sends a notification to the contact point.

No Data

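The query returned no data. This usually means the data source is down or the metric stopped being collected. You can configure how the rule handles this case: keep it as No Data, treat it as Alerting (firing), or treat it as Normal. Recent Grafana versions also offer a Keep Last State option.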

Error

The query failed to execute — typically a connection problem with the data source.
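
The Normal, Pending, and Firing transitions can be sketched as a small state function. This is a simplified model under stated assumptions, not Grafana's implementation: the name `next_state` is hypothetical, time is represented as an offset from the start of the simulation, and the No Data and Error states are left out.

```python
from datetime import timedelta
from typing import Optional, Tuple


def next_state(
    pending_since: Optional[timedelta],
    condition_met: bool,
    now: timedelta,
    for_duration: timedelta,
) -> Tuple[str, Optional[timedelta]]:
    """Return (state, pending_since) after one evaluation."""
    if not condition_met:
        return "Normal", None          # recovery clears the pending timer
    if pending_since is None:
        pending_since = now            # first evaluation that sees the breach
    if now - pending_since >= for_duration:
        return "Firing", pending_since
    return "Pending", pending_since


# One reading per minute, threshold 90%, "For" of 5 minutes.
readings = [45, 96, 97, 95, 93, 92, 91, 48]
pending_since = None
states = []
for minute, value in enumerate(readings):
    state, pending_since = next_state(
        pending_since, value > 90, timedelta(minutes=minute), timedelta(minutes=5)
    )
    states.append(state)

print(states)
```

In this run the breach is first seen at minute 1, the rule stays Pending through minute 5, fires at minute 6 once five minutes have elapsed, and returns to Normal at minute 7.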

Creating an Alert Rule

Step 1 – Open Alerting

Click the bell icon in the left sidebar to open the Alerting section. Click Alert rules → New alert rule.

Step 2 – Name the Rule

Give the alert a descriptive name such as High CPU Usage — server-01. Good names tell the on-call engineer exactly what fired without reading the full alert details.

Step 3 – Define the Query

Select the data source and write the query whose value you want to watch. For CPU monitoring with Prometheus:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
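
To see what this query computes, the arithmetic can be reproduced by hand. The sketch below is an approximation for illustration only: the function name and the sample numbers are made up, and Prometheus's real rate() also handles counter resets and extrapolation, which are ignored here.

```python
from typing import List


def cpu_busy_percent(idle_start: List[float], idle_end: List[float], window_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle)) * 100 from per-core idle-seconds counters."""
    # rate(): per-second increase of each core's idle counter over the window
    idle_fractions = [
        (end - start) / window_seconds for start, end in zip(idle_start, idle_end)
    ]
    # avg by(instance): average across the cores of one machine
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    # busy percentage is whatever is left after idle time
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window:
# core 0 was idle for 75 s, core 1 for 150 s.
busy = cpu_busy_percent([1000.0, 2000.0], [1075.0, 2150.0], 300.0)
print(busy)  # 62.5
```

The cores were idle 25% and 50% of the time, so the average idle share is 37.5% and the instance is reported as 62.5% busy.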

Step 4 – Set the Condition

In the Expressions section, add a Threshold expression referencing Query A. Set the condition to IS ABOVE 90. This means the alert fires when CPU exceeds 90%.

Query A → value
Threshold: value IS ABOVE 90  → fires alert

Step 5 – Set the Evaluation Interval and For Duration

The Evaluate every field controls how often Grafana checks the condition — for example, every 1 minute. The For field sets how long the condition must stay true before firing — for example, 5 minutes. This prevents alerts from firing on brief spikes.

Evaluate every: 1m
For:            5m

→ CPU must stay above 90% for 5 consecutive minutes before an alert fires
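
These two settings also determine how quickly a real problem is noticed. The helper below is a rough estimate under assumptions: the function name is hypothetical, it assumes the rule fires at the first evaluation where the pending time has reached the For duration, and it does not account for query latency or notification delivery.

```python
import math
from typing import Tuple


def firing_delay_bounds(evaluate_every_s: int, for_s: int) -> Tuple[int, int]:
    """Return (best, worst) seconds between the start of a breach and Firing."""
    # Number of evaluation intervals that must pass while Pending.
    pending_ticks = math.ceil(for_s / evaluate_every_s)
    # Best case: the breach starts exactly at an evaluation.
    best = pending_ticks * evaluate_every_s
    # Worst case: the breach starts just after an evaluation,
    # so almost one full interval passes before it is first seen.
    worst = best + evaluate_every_s
    return best, worst


print(firing_delay_bounds(60, 300))  # (300, 360)
print(firing_delay_bounds(60, 90))   # (120, 180)
```

With Evaluate every 1m and For 5m, a sustained breach reaches Firing between 5 and 6 minutes after it begins. A shorter interval narrows that range at the cost of more frequent queries against the data source.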

Step 6 – Add Labels

Labels on alert rules work like labels on metrics — they identify the alert and control routing. Add labels such as severity=critical and team=ops. Notification policies use these labels to route the alert to the right contact point.

Step 7 – Add Annotations

Annotations add human-readable context to the alert notification. The Summary field holds a short description. The Description field holds detailed information and can reference metric values using template variables.

Summary:     CPU usage is critically high
Description: Instance {{ $labels.instance }} CPU is at {{ $values.A }}%
             Check top processes with: ssh {{ $labels.instance }} 'top -bn1'

Step 8 – Set the Folder and Evaluation Group

Organise alert rules into folders (like dashboard folders). An evaluation group is a set of rules that Grafana evaluates together; all rules in the same group share the same evaluation interval.

Step 9 – Save

Click Save rule and exit. The rule becomes active immediately and appears in the Alert Rules list with a status indicator.
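
The same rule can also be described as data, which is useful if you later want to manage rules outside the UI. The sketch below only builds and prints a JSON document; it does not contact Grafana. The field names follow the alert rule provisioning HTTP API as documented for recent Grafana versions, but treat them as assumptions and check them against your version. The folder UID, data source UID, and group name are hypothetical placeholders.

```python
import json

# Hypothetical identifiers: replace with values from your own Grafana instance.
FOLDER_UID = "my-folder-uid"
PROMETHEUS_UID = "my-prometheus-uid"

QUERY = '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

rule = {
    "title": "High CPU Usage - server-01",        # Step 2
    "folderUID": FOLDER_UID,                      # Step 8
    "ruleGroup": "cpu-checks",                    # Step 8 (hypothetical group name)
    "condition": "B",                             # refId of the expression that decides firing
    "for": "5m",                                  # Step 5
    "noDataState": "NoData",
    "execErrState": "Error",
    "labels": {"severity": "critical", "team": "ops"},              # Step 6
    "annotations": {"summary": "CPU usage is critically high"},     # Step 7
    "data": [
        {   # Step 3: the Prometheus query, run as an instant query (assumption)
            "refId": "A",
            "datasourceUid": PROMETHEUS_UID,
            "relativeTimeRange": {"from": 600, "to": 0},
            "model": {"refId": "A", "expr": QUERY, "instant": True},
        },
        {   # Step 4: threshold expression, IS ABOVE 90
            "refId": "B",
            "datasourceUid": "__expr__",
            "model": {
                "refId": "B",
                "type": "threshold",
                "expression": "A",
                "conditions": [{"evaluator": {"type": "gt", "params": [90]}}],
            },
        },
    ],
}

payload = json.dumps(rule, indent=2)
print(payload)
```

A document like this could be sent to the alert rule provisioning endpoint (/api/v1/provisioning/alert-rules in recent versions) with an authenticated POST request. Exporting an existing rule from the UI is a reliable way to confirm the exact shape your Grafana version expects.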

Testing an Alert Rule

After saving, click View on the alert rule to open its detail page. You see a live preview of the query result and the current state. Click Test rule to force-evaluate the condition immediately — useful for verifying the rule logic without waiting for a real incident.
