Grafana Prometheus Queries
Prometheus is one of the most common data sources paired with Grafana. It has its own query language, PromQL (Prometheus Query Language), which lets you filter, aggregate, and do arithmetic on metrics. This topic teaches PromQL from the basics up to the queries you will use every day in production dashboards.
What Prometheus Stores
Prometheus stores time-series data. Each time series is a stream of (timestamp, value) samples, identified by a metric name and a set of labels. Labels are key-value pairs that describe the source of the series, so a single metric name usually covers many series, one per label combination.
Metric name: node_cpu_seconds_total
Labels: {instance="server-01", mode="idle", cpu="0"}
Sample: (2024-03-01 12:00:00, 99452.3)
Metric name: http_requests_total
Labels: {job="api", status="200", method="GET"}
Sample: (2024-03-01 12:00:00, 15823)
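As a rough illustration of this data model, the Python sketch below keys each series by its metric name plus its full label set. The `series_key` helper, the in-memory `store`, the second and third sample values, and the assumption that the timestamp is in UTC are all made up for the example; this is not how Prometheus stores data internally.

```python
from collections import defaultdict


def series_key(name, labels):
    """Build a hashable identity from a metric name and its labels."""
    return (name, frozenset(labels.items()))


# Hypothetical in-memory store: series identity -> list of (timestamp, value).
store = defaultdict(list)

idle_cpu0 = series_key("node_cpu_seconds_total",
                       {"instance": "server-01", "mode": "idle", "cpu": "0"})
idle_cpu1 = series_key("node_cpu_seconds_total",
                       {"instance": "server-01", "mode": "idle", "cpu": "1"})

store[idle_cpu0].append((1709294400, 99452.3))  # 2024-03-01 12:00:00, assuming UTC
store[idle_cpu0].append((1709294415, 99467.1))  # invented next scrape, 15 s later
store[idle_cpu1].append((1709294400, 98120.0))  # invented value for another core

print(len(store))             # number of distinct series
print(len(store[idle_cpu0]))  # number of samples in the cpu="0" series
```

Changing any single label value (here `cpu`) produces a different key, which is why one metric name fans out into many series.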
Metric Types
Counter
A counter only goes up, except that it resets to zero when the process restarts. Examples: total HTTP requests, total errors, total bytes sent. You almost always use the rate() function with counters to convert a growing total into a per-second rate; rate() also compensates for resets.
Gauge
A gauge goes up and down freely. Examples: current CPU usage, current memory used, number of active connections. Query gauges directly — no rate() needed.
Histogram
A histogram records observations in buckets. Examples: request duration, response size. Use histogram_quantile() to calculate percentiles like p50, p95, and p99.
PromQL Basics
Instant Vector – Current Value
Type the metric name alone to get the current value of every time series matching that name.
node_memory_MemAvailable_bytes
Result: one number per server that Prometheus monitors.
Range Vector – A Window of Values
Add a time window in square brackets to get a range of values over that period.
node_cpu_seconds_total[5m]
Result: the last 5 minutes of CPU data. Range vectors feed into functions like rate().
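The difference between the two selector types can be sketched in Python. The function names and the sample data are hypothetical, samples are assumed to be sorted by time, and the exact boundary handling is simplified; the 5-minute lookback mirrors the Prometheus default for how old a sample may be and still count as the current value.

```python
def instant_value(samples, eval_time, lookback=300):
    """Latest (timestamp, value) at or before eval_time, or None if too old."""
    recent = [s for s in samples if eval_time - lookback < s[0] <= eval_time]
    return recent[-1] if recent else None


def range_values(samples, eval_time, window):
    """Every sample inside the window that ends at eval_time."""
    return [s for s in samples if eval_time - window < s[0] <= eval_time]


# Hypothetical series scraped every 15 seconds for 10 minutes.
samples = [(t, float(t)) for t in range(0, 601, 15)]

print(instant_value(samples, eval_time=600))                  # a single sample
print(len(range_values(samples, eval_time=600, window=300)))  # many samples
```

An instant vector holds one sample per series, while a range vector holds a list of samples per series, which is why functions such as rate() require the bracketed form.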
Label Filtering
Use curly braces to filter by label values. Only time series matching all label conditions are returned.
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode!="idle"}
node_cpu_seconds_total{instance=~"server-0.*"}  # regex match
node_cpu_seconds_total{instance!~"server-99.*"}  # regex exclude
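The four matcher operators can be sketched in Python as follows. The `matches` function is a hypothetical helper, and Python's `re` module stands in for the RE2 syntax that Prometheus uses, so unusual patterns may behave differently. Regex matchers in PromQL are fully anchored, which `re.fullmatch` reproduces.

```python
import re


def matches(labels, matchers):
    """Return True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


labels = {"instance": "server-01", "mode": "idle", "cpu": "0"}

print(matches(labels, [("mode", "=", "idle")]))              # True
print(matches(labels, [("instance", "=~", "server-0.*")]))   # True
print(matches(labels, [("instance", "=~", "server")]))       # False: must match the whole value
print(matches(labels, [("mode", "=", "idle"), ("cpu", "!=", "0")]))  # False: all matchers must hold
```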
Essential PromQL Functions
rate() – Per-Second Rate from a Counter
The rate() function calculates how fast a counter is increasing per second, averaged over the given time window.
rate(http_requests_total[5m])
This returns the average per-second request rate over the last 5 minutes. rate() needs at least two samples inside the window, so make the window at least four times your scrape interval to leave room for a missed scrape: if Prometheus scrapes every 15 seconds, use at least [1m].
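To make the calculation concrete, here is a simplified Python sketch of a per-second rate over counter samples, including the handling of a counter reset. `simple_rate` is a hypothetical name and the samples are invented; the real rate() additionally extrapolates to the window boundaries, which this sketch leaves out.

```python
def simple_rate(samples):
    """Per-second increase across (timestamp, value) samples of a counter.

    Simplification: divides by the time between the first and last sample,
    without the window-boundary extrapolation that the real rate() applies.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted and counted up from zero.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical counter scraped every 15 s; it resets between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(samples))  # 2.0
```

A naive (last - first) / elapsed calculation would go negative across the reset, which is one reason to use rate() rather than subtracting counter values by hand.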
irate() – Instant Rate
The irate() function calculates the rate from only the last two data points in the window; the window size only limits how far back it looks for them. It reacts faster to spikes but is noisier than rate().
irate(http_requests_total[5m])
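The contrast with an averaged rate can be sketched in Python. Both function names and the sample data are hypothetical, and the averaged version ignores counter resets and extrapolation for brevity.

```python
def simple_irate(samples):
    """Per-second rate from only the last two (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 if v1 < v0 else v1 - v0  # a drop is treated as a counter reset
    return delta / (t1 - t0)


def window_average_rate(samples):
    """Average per-second rate across the whole window (assumes no resets)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter: steady 1 request/s, then a burst in the last 15 s.
samples = [(0, 0), (15, 15), (30, 30), (45, 45), (60, 360)]

print(simple_irate(samples))         # 21.0
print(window_average_rate(samples))  # 6.0
```

The burst dominates the two-sample result but is smoothed out by the window average, which matches the "faster but noisier" trade-off described above.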
increase() – Total Increase Over a Window
The increase() function returns the total amount a counter grew over the time window.
increase(http_requests_total[1h])
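This returns approximately how many HTTP requests occurred in the last hour, which is useful for hourly totals on summary panels. Because Prometheus extrapolates to the window boundaries, the result can be a non-integer even though the counter itself only grows in whole numbers.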
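The Python sketch below shows, under simplifying assumptions, why increase() can return a fractional value: the growth observed between the first and last sample is scaled up to cover the full window. `simple_increase` is a hypothetical name, the samples are invented, counter resets are ignored, and the real function applies extra conditions before extrapolating.

```python
def simple_increase(samples, window):
    """Scale the observed counter growth up to the full query window.

    Simplifications: assumes no counter resets and always extrapolates to
    both window boundaries, which the real increase() does only when the
    samples sit close enough to them.
    """
    if len(samples) < 2:
        return None
    observed = samples[-1][1] - samples[0][1]
    covered = samples[-1][0] - samples[0][0]
    return observed * window / covered


# Hypothetical 60-second window; the samples only span 45 of those seconds.
samples = [(5, 10), (20, 13), (35, 16), (50, 20)]
print(simple_increase(samples, window=60))  # about 13.33, not a whole number
```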
sum() – Aggregate All Series
The sum() function adds all time series matching a query into one total.
sum(rate(http_requests_total[5m]))
Without sum(), you get one line per server. With sum(), you get a single line showing the total across all servers.
sum by() – Aggregate and Group
The by() modifier groups the result by specific labels instead of collapsing everything into one line.
sum by(status) (rate(http_requests_total[5m]))
Result: one line per HTTP status code (200, 404, 500) showing the request rate for each status.
Without sum():
server-01 → line
server-02 → line
server-03 → line
With sum():
Total → single line
With sum by(status):
200 → line
404 → line
500 → line
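The grouping behaviour of sum() and sum by() can be sketched in Python. `sum_by` is a hypothetical helper and the per-series rates are invented; each input item pairs a label set with a current value.

```python
from collections import defaultdict


def sum_by(series, group_labels=()):
    """Sum series values, keeping only the labels named in group_labels."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in group_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical per-second request rates, one entry per series.
series = [
    ({"instance": "server-01", "status": "200"}, 40.0),
    ({"instance": "server-01", "status": "500"}, 1.0),
    ({"instance": "server-02", "status": "200"}, 55.0),
    ({"instance": "server-02", "status": "404"}, 2.5),
    ({"instance": "server-02", "status": "500"}, 0.5),
]

print(sum_by(series))               # one total under the empty key
print(sum_by(series, ("status",)))  # one total per status code
```

With no grouping labels every series collapses into one key; naming `status` keeps that label and drops `instance`, so servers are merged per status code.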
avg() – Average Across Series
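avg by(instance) (sum by(instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))
Returns the average CPU usage per server, as a fraction between 0 and 1, across all CPU cores on that server. The inner sum by(instance, cpu) first adds up the non-idle modes of each core; averaging the raw per-mode series directly would give the mean rate per mode rather than CPU usage.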
max() and min()
max by(instance) (node_memory_MemAvailable_bytes)
min by(instance) (node_memory_MemAvailable_bytes)
histogram_quantile() – Percentiles
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
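This returns an estimate of the 95th percentile (p95) of HTTP request durations: 95% of requests completed at or below this value. The estimate is interpolated from the bucket boundaries, so its accuracy depends on how the buckets are laid out. p95 is one of the most widely used latency metrics for web services.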
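How a percentile is derived from cumulative buckets can be sketched in Python. This is a simplified illustration that assumes observations are spread evenly inside each bucket and that the first bucket starts at 0; `simple_histogram_quantile` and the bucket counts are hypothetical, and the real function handles more edge cases.

```python
import math


def simple_histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with math.inf, mirroring
    the le label of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Past the last finite bucket: report that bucket's bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds (cumulative counts).
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
print(simple_histogram_quantile(0.95, buckets))  # roughly 0.6 seconds
```

Here the 95th of 100 observations falls one fifth of the way into the 0.5 to 1.0 bucket, so the estimate lands at about 0.6 even though no request was necessarily measured at exactly that duration.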
Common Production Queries
CPU Usage Percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
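The arithmetic behind this query can be worked through in Python with made-up numbers. The rate of the idle counter is idle seconds accumulated per second, so each core yields a fraction between 0 and 1; the per-core values below are hypothetical.

```python
# Hypothetical per-core results of rate(node_cpu_seconds_total{mode="idle"}[5m])
# for one server: idle seconds accumulated per second, so each value is 0..1.
idle_rate_per_core = {"0": 0.90, "1": 0.80, "2": 0.70, "3": 0.60}

# avg by(instance): average the idle fraction across the server's cores.
avg_idle = sum(idle_rate_per_core.values()) / len(idle_rate_per_core)

# 100 - (idle fraction * 100) leaves the busy share as a percentage.
cpu_usage_percent = 100 - avg_idle * 100

print(round(cpu_usage_percent, 1))  # 25.0
```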
Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
HTTP Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
Network Traffic (bytes per second, inbound)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
Arithmetic Between Metrics
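You can do math between two metrics using the standard operators +, -, *, and /. By default, a result is produced only for pairs of series whose label sets (ignoring the metric name) are identical; series without a partner on the other side are dropped. When the label sets differ, use the on() or ignoring() modifiers to control which labels are compared.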
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
This calculates used memory by subtracting available memory from total memory.
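The label-matching requirement can be sketched in Python. `subtract` is a hypothetical helper, the byte values are invented, and the label dictionaries deliberately leave out the metric name because it does not take part in matching.

```python
def subtract(left, right):
    """One-to-one binary operation: pair series whose label sets are identical."""
    right_by_labels = {frozenset(labels.items()): value for labels, value in right}
    result = []
    for labels, value in left:
        key = frozenset(labels.items())
        if key in right_by_labels:  # series without a partner are dropped
            result.append((labels, value - right_by_labels[key]))
    return result


# Hypothetical values in bytes.
mem_total = [
    ({"instance": "server-01"}, 16_000_000_000),
    ({"instance": "server-02"}, 32_000_000_000),
]
mem_available = [
    ({"instance": "server-01"}, 4_000_000_000),
    ({"instance": "server-03"}, 8_000_000_000),
]

print(subtract(mem_total, mem_available))
```

Only server-01 appears in the output: server-02 and server-03 each exist on one side only, so no result is produced for them.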
Recording Rules – Pre-computing Expensive Queries
Some PromQL queries are expensive to calculate on every dashboard load. Recording rules let Prometheus evaluate the expression at a regular interval and store the result as a new metric. Grafana then queries the pre-computed metric, which is much cheaper than re-running the full expression on every load.
Rule definition (an entry under rules: in a rule group, in a rules file that the Prometheus config lists under rule_files):
- record: job:http_requests:rate5m
  expr: sum by(job) (rate(http_requests_total[5m]))
Grafana query using the recording rule:
job:http_requests:rate5m
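The rule name above appears to follow the level:metric:operations naming convention that the Prometheus documentation recommends for recording rules. As a small illustration, the hypothetical `parse_rule_name` helper below splits such a name into its parts; it is not part of Prometheus or Grafana.

```python
def parse_rule_name(name):
    """Split a level:metric:operations rule name into its three parts."""
    parts = name.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}


print(parse_rule_name("job:http_requests:rate5m"))
```

Read this way, the name says the series is aggregated to the `job` level, derives from the `http_requests` metric, and applies a 5-minute rate.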
