Grafana Prometheus Queries

Prometheus is one of the data sources most commonly paired with Grafana. It has its own query language, PromQL (Prometheus Query Language), which lets you filter, aggregate, and calculate metrics. This topic teaches PromQL from the basics up to the queries you will use every day in production dashboards.

What Prometheus Stores

Prometheus stores time-series data. Every metric is a stream of (timestamp, value) pairs, identified by a metric name and a set of labels. Labels are key-value pairs that describe the source of the metric.

Metric name:  node_cpu_seconds_total
Labels:       {instance="server-01", mode="idle", cpu="0"}
Sample:       (2024-03-01 12:00:00, 99452.3)

Metric name:  http_requests_total
Labels:       {job="api", status="200", method="GET"}
Sample:       (2024-03-01 12:00:00, 15823)
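As a rough mental model (not how Prometheus stores data internally), a series can be pictured as a metric name plus a label set that points at a list of samples. The Python sketch below uses the hypothetical names Series and series_id, and assumes the example timestamp is in UTC.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: a metric name, a label set, and its samples."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)  # (unix_timestamp, value) pairs


def series_id(s: Series) -> tuple:
    """Identity of a series: metric name plus all label pairs, order-independent."""
    return (s.name, tuple(sorted(s.labels.items())))


idle = Series("node_cpu_seconds_total", {"instance": "server-01", "mode": "idle", "cpu": "0"})
user = Series("node_cpu_seconds_total", {"instance": "server-01", "mode": "user", "cpu": "0"})
idle.samples.append((1709294400.0, 99452.3))  # 2024-03-01 12:00:00, assuming UTC

# Same metric name, different label values: two distinct series.
print(series_id(idle) != series_id(user))  # True
```

Changing any single label value produces a new series, which is why queries usually filter or aggregate by label.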

Metric Types

Counter

A counter only goes up. It resets to zero when the process restarts. Examples: total HTTP requests, total errors, total bytes sent. You almost always use the rate() function with counters to convert a growing total into a per-second rate.

Gauge

A gauge goes up and down freely. Examples: current CPU usage, current memory used, number of active connections. Query gauges directly — no rate() needed.

Histogram

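A histogram counts observations in cumulative buckets. Each bucket is exposed as a series with a _bucket suffix and an le ("less than or equal") label that holds the bucket's upper bound. Examples: request duration, response size. Use histogram_quantile() to calculate percentiles like p50, p95, and p99.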

PromQL Basics

Instant Vector – Current Value

Type the metric name alone to get the current value of every time series matching that name.

node_memory_MemAvailable_bytes

Result: one number per server that Prometheus monitors.

Range Vector – A Window of Values

Add a time window in square brackets to get a range of values over that period.

node_cpu_seconds_total[5m]

Result: every sample from the last 5 minutes for each matching series. A range vector cannot be graphed directly; it feeds into functions like rate().

Label Filtering

Use curly braces to filter by label values. Only time series matching all label conditions are returned.

node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode!="idle"}
node_cpu_seconds_total{instance=~"server-0.*"}   # regex match
node_cpu_seconds_total{instance!~"server-99.*"}  # regex exclude
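The four matcher operators can be sketched in Python as follows. This is only an illustration: matches is a hypothetical helper, and Python's re module stands in for the RE2 syntax that Prometheus uses. The point it shows is that regex matchers are anchored, so the pattern has to match the whole label value.

```python
import re


def matches(labels: dict, matchers: list) -> bool:
    """Return True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # anchored at both ends
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


series = {"instance": "server-01", "mode": "idle", "cpu": "0"}
print(matches(series, [("mode", "=", "idle"), ("instance", "=~", "server-0.*")]))  # True
print(matches(series, [("instance", "=~", "server")]))  # False: must match the whole value
```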

Essential PromQL Functions

rate() – Per-Second Rate from a Counter

The rate() function calculates how fast a counter is increasing per second, averaged over the given time window.

rate(http_requests_total[5m])

This returns the average per-second request rate over the last 5 minutes. Make the time window at least four times your scrape interval so that it still contains enough samples when a scrape is missed: if Prometheus scrapes every 15 seconds, use at least [1m].
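To see what the calculation involves, here is a simplified Python sketch. The name simple_rate and the sample values are made up for illustration. Real rate() also extrapolates to the edges of the window, which this sketch leaves out, so treat it as an approximation of the idea and not as Prometheus's exact algorithm.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the samples in a window.

    samples: list of (unix_timestamp, counter_value), oldest first.
    A drop in value is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        return None  # not enough points in the window
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as growth since the previous sample.
        total += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed


# 15-second scrapes over one minute, with a restart between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset handling is the reason to query counters through rate() instead of subtracting raw values yourself.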

irate() – Instant Rate

The irate() function calculates the per-second rate from only the last two data points in the window. It reacts faster to spikes but is noisier than rate(), so it suits graphs of fast-moving counters better than alerts.

irate(http_requests_total[5m])
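The difference from an average over the whole window shows up clearly with a burst at the end. The Python sketch below is illustrative only: simple_irate is a hypothetical name and the samples are invented.

```python
def simple_irate(samples):
    """Per-second rate from only the last two samples in the window."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A drop means the counter reset; count the new value as the increase.
    increase = v_last - v_prev if v_last >= v_prev else v_last
    return increase / (t_last - t_prev)


# Steady 2 requests per second, then a burst in the final 15 seconds.
window = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 390)]

print(simple_irate(window))  # burst only: 300 / 15
average = (window[-1][1] - window[0][1]) / (window[-1][0] - window[0][0])
print(average)               # whole window: 390 / 60
```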

increase() – Total Increase Over a Window

The increase() function returns the total amount a counter grew over the time window.

increase(http_requests_total[1h])

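This returns approximately how many HTTP requests occurred in the last hour, which is useful for hourly totals on summary panels. Prometheus extrapolates the result to cover the whole window, so it can be fractional even though the counter only moves in whole numbers.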

sum() – Aggregate All Series

The sum() function adds all time series matching a query into one total.

sum(rate(http_requests_total[5m]))

Without sum(), you get one line per time series, for example one per server. With sum(), you get a single line showing the total across all of them.

sum by() – Aggregate and Group

The by clause groups the result by the listed labels instead of collapsing everything into one line.

sum by(status) (rate(http_requests_total[5m]))

Result: one line per HTTP status code (200, 404, 500) showing the request rate for each status.

Without sum():
  server-01 → line
  server-02 → line
  server-03 → line

With sum():
  Total → single line

With sum by(status):
  200 → line
  404 → line
  500 → line
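The grouping behaviour can be sketched in a few lines of Python. The helper sum_by and the rates are hypothetical. The idea is that every label not named in by is discarded, and series that end up with the same remaining labels are added together.

```python
from collections import defaultdict


def sum_by(series, by=()):
    """Sum values across series, keeping only the labels named in `by`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical per-second request rates, one entry per series.
rates = [
    ({"instance": "server-01", "status": "200"}, 40.0),
    ({"instance": "server-02", "status": "200"}, 35.0),
    ({"instance": "server-01", "status": "404"}, 2.0),
    ({"instance": "server-02", "status": "500"}, 0.5),
]

print(sum_by(rates))                  # one total: {(): 77.5}
print(sum_by(rates, by=("status",)))  # one value per status code
```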

avg() – Average Across Series

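avg by(instance) (sum by(instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))

The inner sum adds up the non-idle modes for each core, and avg then averages across the cores of each server. The result is the average CPU usage per server as a fraction between 0 and 1. Averaging the rate directly, without the inner sum, would also average across modes and understate the usage.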

max() and min()

max by(instance) (node_memory_MemAvailable_bytes)
min by(instance) (node_memory_MemAvailable_bytes)

histogram_quantile() – Percentiles

histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

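This returns an estimate of the 95th percentile (p95) of HTTP request durations: about 95% of requests completed faster than this value. The estimate is interpolated within a bucket, so its accuracy depends on where the bucket boundaries sit. p95 is one of the most widely used latency metrics for web services.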
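The estimate comes from linear interpolation inside the bucket that contains the requested rank. A simplified Python sketch follows. The name simple_histogram_quantile and the bucket counts are invented, and it assumes 0 < q < 1 and at least one finite bucket. The real function handles more edge cases than this.

```python
import math


def simple_histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the (math.inf, total) bucket, like `le` buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies past the last finite bucket: report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request durations in seconds: 100 observations in total.
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
print(simple_histogram_quantile(0.95, buckets))
```

Here the 95th observation falls in the bucket between 0.5 and 1.0 seconds, so the answer can only be as precise as that bucket is narrow.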

Common Production Queries

CPU Usage Percentage

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
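The arithmetic behind this query, worked through in Python for a single server. The function name and the per-core values are hypothetical. Each idle rate is the fraction of a second that one core spent idle.

```python
def cpu_usage_percent(idle_rates):
    """Mirror 100 - avg(rate(idle)) * 100 for one instance.

    idle_rates: per-core idle rate, i.e. the fraction of each second
    that core spent idle (0.0 to 1.0).
    """
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical 4-core server: two cores mostly idle, two busy.
print(cpu_usage_percent([0.90, 0.80, 0.30, 0.20]))  # about 45 percent busy
```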

Memory Usage Percentage

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk Usage Percentage

(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

HTTP Error Rate

sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

Network Traffic (bytes per second, inbound)

rate(node_network_receive_bytes_total{device!="lo"}[5m])

Arithmetic Between Metrics

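You can do math between two metrics using standard operators: +, -, *, /. Prometheus pairs up series from both sides that have identical label sets (the metric name is ignored), and series without a partner are dropped from the result. Use on() or ignoring() to control which labels take part in the matching.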

node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

This calculates used memory by subtracting available memory from total memory.
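The label matching can be pictured with a small Python sketch. The function subtract and the byte values are made up. It pairs series by their full label set and silently drops any series that has no partner on the other side, which is a common reason a panel shows fewer lines than expected.

```python
def subtract(left, right):
    """Subtract right from left for series whose label sets are identical."""
    right_by_labels = {frozenset(labels.items()): value for labels, value in right}
    result = []
    for labels, value in left:
        key = frozenset(labels.items())
        if key in right_by_labels:  # series without a partner are dropped
            result.append((labels, value - right_by_labels[key]))
    return result


total = [({"instance": "server-01"}, 16e9), ({"instance": "server-02"}, 32e9)]
available = [({"instance": "server-01"}, 4e9)]  # server-02 has no matching sample

print(subtract(total, available))  # only server-01 appears in the result
```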

Recording Rules – Pre-computing Expensive Queries

Some PromQL queries are expensive to calculate on every dashboard load. Recording rules let Prometheus evaluate the query at a regular interval and store the result as a new metric. Grafana then queries the pre-computed metric, which loads much faster than re-running the original expression.

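Rule definition (in a rules file listed under rule_files in prometheus.yml; the group name is an example):
  groups:
    - name: http_rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum by(job) (rate(http_requests_total[5m]))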

Grafana query using the recording rule:
  job:http_requests:rate5m
