Redis Monitoring and Debugging
Running Redis in production without monitoring is like driving a car without a dashboard. You cannot see speed, fuel, or engine warnings until something breaks. Redis provides built-in commands and a set of metrics that tell you exactly what is happening inside the server at any moment.
The Control Room Analogy
A power plant has a control room full of gauges, lights, and alarms. Each gauge measures something different: temperature, load, voltage. Operators watch these gauges and act before a number goes out of range. Redis INFO, MONITOR, and SLOWLOG are your gauges, lights, and alarms.
Redis Health Monitoring Overview
┌──────────────────────────────────────────────────────────────────────┐
│                             Redis Server                             │
│                                                                      │
│  Memory      ──▶ used_memory, maxmemory                              │
│  Clients     ──▶ connected_clients, blocked_clients                  │
│  Performance ──▶ instantaneous_ops_per_sec, keyspace_hits/misses     │
│  Persistence ──▶ rdb_last_save_time, aof_enabled                     │
│  Replication ──▶ role, connected_slaves, lag                         │
│  Errors      ──▶ rejected_connections, evicted_keys                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
          redis-cli INFO ← single command to see all of these
The INFO Command
INFO is the most important monitoring command. It returns statistics grouped into sections. You can request the full report or a specific section.
127.0.0.1:6379> INFO server        ← Redis version, OS, config file path
127.0.0.1:6379> INFO memory        ← memory usage details
127.0.0.1:6379> INFO clients       ← connected client counts
127.0.0.1:6379> INFO stats         ← commands processed, hits, misses
127.0.0.1:6379> INFO replication   ← primary/replica status
127.0.0.1:6379> INFO persistence   ← RDB/AOF state
127.0.0.1:6379> INFO all           ← everything at once
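Monitoring scripts usually need these fields as data rather than as text. The following is a minimal Python sketch, assuming the raw INFO reply is already available as a string. The function name parse_info and the sample text are illustrative and not part of Redis or any client library.

```python
def parse_info(text: str) -> dict:
    """Parse raw INFO output into a flat dict of field name -> value (both strings)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


# Illustrative sample; real INFO replies use the same "field:value" layout.
sample = "# Memory\r\nused_memory:104857600\r\nmaxmemory:268435456\r\n"
info = parse_info(sample)
print(info["used_memory"])  # 104857600
```

Values are left as strings on purpose, because INFO mixes integers, floats, and text. Convert individual fields where you use them.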
Key Metrics to Watch in INFO memory
used_memory: 104857600          ← bytes Redis is using (100 MB)
used_memory_human: 100.00M
maxmemory: 268435456            ← limit set in config (256 MB)
maxmemory_human: 256.00M
mem_fragmentation_ratio: 1.2    ← ideal range: 1.0–1.5

If mem_fragmentation_ratio stays above 2.0, the Redis process holds about twice the memory its data needs. On jemalloc builds, enable active defragmentation (activedefrag yes) or try MEMORY PURGE. Restarting the instance also resets fragmentation.
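The two checks above come down to simple arithmetic. Here is a small Python sketch using the sample values from this section. The function names and the "ok" / "watch" / "high" labels are illustrative choices, and the thresholds are the ones quoted in the text rather than anything defined by Redis.

```python
from typing import Optional


def memory_used_percent(used_memory: int, maxmemory: int) -> Optional[float]:
    """Return used_memory as a percentage of maxmemory, or None if no limit is set."""
    if maxmemory == 0:  # maxmemory 0 means no memory limit is configured
        return None
    return used_memory / maxmemory * 100


def fragmentation_status(ratio: float) -> str:
    """Classify mem_fragmentation_ratio using the thresholds from the text."""
    if ratio > 2.0:
        return "high"
    if 1.0 <= ratio <= 1.5:
        return "ok"
    return "watch"  # outside the ideal range but not above 2.0


print(memory_used_percent(104857600, 268435456))  # 39.0625
print(fragmentation_status(1.2))                   # ok
```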
Key Metrics to Watch in INFO stats
keyspace_hits: 9820000 ← requests served from cache
keyspace_misses: 180000 ← requests not found in cache
Cache hit rate = hits / (hits + misses) × 100
= 9820000 / (9820000 + 180000) × 100
= 98.2%
A healthy cache hit rate for most apps: above 90%.
If hit rate drops, your cache keys may be expiring too fast,
or new keys are displacing useful ones via eviction.
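The formula above is easy to script. One caveat: keyspace_hits and keyspace_misses are cumulative counters, so a rate computed from a single sample averages over the whole time since the counters were last reset. The Python sketch below also shows a rate over the interval between two samples. Both function names are illustrative, and the second sample is made up for the example.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Return the hit rate as a percentage, or 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total * 100 if total else 0.0


def hit_rate_between(earlier: dict, later: dict) -> float:
    """Hit rate for the interval between two samples of the INFO stats counters."""
    hits = later["keyspace_hits"] - earlier["keyspace_hits"]
    misses = later["keyspace_misses"] - earlier["keyspace_misses"]
    return cache_hit_rate(hits, misses)


print(round(cache_hit_rate(9_820_000, 180_000), 1))  # 98.2

# Hypothetical second sample taken a minute later: 900 hits and 100 misses in between.
first = {"keyspace_hits": 9_820_000, "keyspace_misses": 180_000}
second = {"keyspace_hits": 9_820_900, "keyspace_misses": 180_100}
print(round(hit_rate_between(first, second), 1))  # 90.0
```

The interval rate reacts much faster to a sudden drop than the lifetime rate, which makes it the better input for an alert.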
MONITOR – Watch Live Commands
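MONITOR streams every command the server processes to your terminal in real time. Use it briefly during debugging. Never leave it running in production: Redis has to format and send every command to each MONITOR client, and the Redis documentation reports a throughput drop of more than 50% with a single MONITOR client in one benchmark.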
127.0.0.1:6379> MONITOR
1710000001.234 [0 127.0.0.1:52100] "SET" "user:1001" "Alice"
1710000001.235 [0 127.0.0.1:52100] "GET" "session:abc"
1710000001.240 [0 127.0.0.1:52101] "INCR" "pageviews"

Each line shows: timestamp [db source_ip:port] command arguments
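A short MONITOR capture is easier to read once it is summarized. The Python sketch below parses lines in the format shown above and counts commands. It assumes the capture was saved as text; parse_monitor_line is an illustrative name, and arguments with unusual escaping may need more careful handling than shlex provides.

```python
import re
import shlex
from collections import Counter

# Layout: timestamp [db client-address] "COMMAND" "arg" ...
MONITOR_LINE = re.compile(r"^(\d+\.\d+) \[(\d+) ([^\]]+)\] (.+)$")


def parse_monitor_line(line: str):
    """Split one MONITOR line into its parts, or return None if it does not match."""
    match = MONITOR_LINE.match(line.strip())
    if match is None:
        return None
    timestamp, db, client, rest = match.groups()
    parts = shlex.split(rest)  # splits the double-quoted command and arguments
    return {
        "timestamp": float(timestamp),
        "db": int(db),
        "client": client,
        "command": parts[0].upper(),
        "args": parts[1:],
    }


lines = [
    '1710000001.234 [0 127.0.0.1:52100] "SET" "user:1001" "Alice"',
    '1710000001.235 [0 127.0.0.1:52100] "GET" "session:abc"',
    '1710000001.240 [0 127.0.0.1:52101] "INCR" "pageviews"',
]
parsed = [p for p in map(parse_monitor_line, lines) if p]
command_counts = Counter(p["command"] for p in parsed)
print(command_counts.most_common())
```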
SLOWLOG – Find Slow Commands
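The slow log records every command whose execution time exceeds a configurable threshold. The measured time covers only the execution of the command, not network I/O or time spent waiting behind other clients. Commands that appear here are candidates for optimization.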
Configure the threshold in redis.conf:
slowlog-log-slower-than 10000 ← log commands taking > 10ms (10000 microseconds)
slowlog-max-len 128 ← keep the last 128 slow commands
View the slow log:
SLOWLOG GET 10 ← show the 10 most recent slow commands
Output:
1) 1) (integer) 42 ← log entry ID
2) (integer) 1710000050 ← Unix timestamp when it ran
3) (integer) 18234 ← duration in microseconds (18ms!)
4) 1) "KEYS" ← the command that was slow
2) "*"
5) "127.0.0.1:52201"
6) ""
→ KEYS * ran in 18ms on a large keyspace. Replace it with SCAN.
Clear the slow log:
SLOWLOG RESET
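Client libraries typically hand back each slow log entry as a nested list like the one in the output above. A minimal Python sketch for turning one entry into labelled fields follows. The function name is illustrative, and the last two fields are treated as optional here on the assumption that some servers return shorter entries.

```python
def summarize_slowlog_entry(entry: list) -> dict:
    """Turn one raw SLOWLOG GET entry (a nested list) into a labelled dict."""
    entry_id, started_at, duration_us, args = entry[:4]
    return {
        "id": entry_id,
        "started_at": started_at,           # Unix timestamp in seconds
        "duration_ms": duration_us / 1000,  # SLOWLOG reports microseconds
        "command": " ".join(args),
        # Client address and name are read only if present (assumption, see above).
        "client": entry[4] if len(entry) > 4 else "",
        "client_name": entry[5] if len(entry) > 5 else "",
    }


# The entry from the example output above.
raw = [42, 1710000050, 18234, ["KEYS", "*"], "127.0.0.1:52201", ""]
summary = summarize_slowlog_entry(raw)
print(f'{summary["command"]} took {summary["duration_ms"]:.1f} ms')  # KEYS * took 18.2 ms
```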
SCAN – Safe Key Iteration (Replace KEYS)
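KEYS * blocks Redis until it has walked the entire keyspace. On a dataset with millions of keys, this can freeze the server for seconds. SCAN returns a cursor and a small batch of keys per call, so each call blocks only briefly and other clients are served between calls. COUNT is a hint for how much work each call does, not an exact batch size, and a key may be returned more than once during a full iteration.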
Iterating all keys with SCAN (safe):
cursor = "0"
loop:
    result = SCAN cursor COUNT 100
    cursor = result[0]        ← new cursor for next call
    keys   = result[1]        ← batch of key names (may be empty)
    process keys...
    if cursor == "0": stop    ← full scan completed
Example in redis-cli:
SCAN 0 COUNT 100
1) "128" ← next cursor
2) 1) "user:1001"
2) "session:abc"
3) "product:55"
...
SCAN 128 COUNT 100
1) "0" ← cursor 0 = scan complete
2) 1) "leaderboard"
...
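The cursor loop can be sketched in Python as a generator. To keep the example runnable without a server, the SCAN call is passed in as a function that takes a cursor and a count and returns (next_cursor, keys). With the redis-py client this would presumably be a thin wrapper around its scan method, which is an assumption about your setup. fake_scan below simply replays the two replies from the redis-cli example.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical signature: (cursor, count) -> (next_cursor, batch_of_keys)
ScanFn = Callable[[int, int], Tuple[int, List[str]]]


def scan_all(scan: ScanFn, count: int = 100) -> Iterator[str]:
    """Yield every key by following SCAN cursors until the server returns 0."""
    cursor = 0
    while True:
        cursor, keys = scan(cursor, count)
        yield from keys  # a batch may be empty even when the scan is not finished
        if cursor == 0:  # cursor 0 means the iteration is complete
            break


def fake_scan(cursor: int, count: int) -> Tuple[int, List[str]]:
    """Stand-in for a real SCAN call, replaying the redis-cli example."""
    replies = {
        0: (128, ["user:1001", "session:abc", "product:55"]),
        128: (0, ["leaderboard"]),
    }
    return replies[cursor]


all_keys = list(scan_all(fake_scan))
print(all_keys)  # ['user:1001', 'session:abc', 'product:55', 'leaderboard']
```

Because the generator yields keys as each batch arrives, the caller can stop early or process keys without holding the whole keyspace in memory.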
CLIENT LIST – See All Connected Clients
127.0.0.1:6379> CLIENT LIST
id=5 addr=127.0.0.1:52100 fd=9 name= age=10 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 obl=0 oll=0 omem=0 cmd=client

Fields to watch:
idle  → seconds since the client's last command (high idle = possibly stale connection)
sub   → number of channel subscriptions (0 for normal clients)
multi → number of commands queued in a MULTI block (-1 = not in a transaction)
cmd   → last command this client ran
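Since every client is one line of space-separated field=value pairs, stale connections are easy to find with a script. The Python sketch below assumes the CLIENT LIST reply is available as text. The function names are illustrative, and the second sample line is a made-up stale connection.

```python
def parse_client_list(text: str) -> list:
    """Parse CLIENT LIST output into one dict of field -> value per client."""
    clients = []
    for line in text.splitlines():
        fields = {}
        for pair in line.split():
            key, sep, value = pair.partition("=")
            if sep:
                fields[key] = value
        if fields:
            clients.append(fields)
    return clients


def idle_at_least(clients: list, seconds: int) -> list:
    """Return the clients whose idle time is at least `seconds`."""
    return [c for c in clients if int(c.get("idle", "0")) >= seconds]


sample = (
    "id=5 addr=127.0.0.1:52100 fd=9 name= age=10 idle=0 flags=N db=0 cmd=client\n"
    # Hypothetical stale connection, added for illustration.
    "id=6 addr=127.0.0.1:52101 fd=10 name=worker age=4000 idle=3600 flags=N db=0 cmd=get\n"
)
stale = idle_at_least(parse_client_list(sample), 300)
print([c["id"] for c in stale])  # ['6']
```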
MEMORY USAGE and redis-cli --memkeys
Check memory used by a single key:
MEMORY USAGE user:1001
→ (integer) 128 ← 128 bytes including Redis overhead

Find all large keys with SCAN + MEMORY USAGE in a loop, or use the redis-cli built-in tool:
redis-cli --memkeys ← scans and reports top memory-consuming keys
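The SCAN + MEMORY USAGE loop amounts to ranking keys by size. Here is a minimal Python sketch, assuming you can supply an iterable of key names and a function that returns the MEMORY USAGE result for a key (or None when the key has disappeared in the meantime). Both inputs are placeholders for calls made through whichever client library you use, and fake_sizes stands in for real replies.

```python
import heapq
from typing import Callable, Iterable, List, Optional, Tuple


def largest_keys(
    keys: Iterable[str],
    memory_usage: Callable[[str], Optional[int]],
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Return the top_n (key, bytes) pairs, largest first."""
    sizes = []
    for key in keys:
        size = memory_usage(key)
        if size is not None:  # skip keys that expired or were deleted mid-scan
            sizes.append((key, size))
    return heapq.nlargest(top_n, sizes, key=lambda pair: pair[1])


# Hypothetical sizes standing in for real MEMORY USAGE replies.
fake_sizes = {"user:1001": 128, "session:abc": 96, "leaderboard": 52_000}
print(largest_keys(fake_sizes, fake_sizes.get, top_n=2))
# [('leaderboard', 52000), ('user:1001', 128)]
```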
Common Production Alerts to Set Up
┌──────────────────────────────────────┬─────────────────────────┐
│ Condition                            │ Meaning / Action        │
├──────────────────────────────────────┼─────────────────────────┤
│ used_memory > 80% of maxmemory       │ Warning at 80%          │
│ Cache hit rate < 90%                 │ Investigate             │
│ connected_clients > normal peak      │ Connection leak?        │
│ rejected_connections > 0             │ maxclients reached      │
│ replication lag > 10 seconds         │ Replica falling behind  │
│ aof_last_bgrewrite_status = err      │ AOF rewrite failed      │
└──────────────────────────────────────┴─────────────────────────┘
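Most of these conditions can be evaluated directly from INFO fields. The Python sketch below takes a dict of INFO fields with string values, as a parser of the INFO reply might produce. The function name, the message wording, and the normal_peak_clients parameter are illustrative. Replication lag is left out because how it is measured depends on the deployment.

```python
def check_alerts(info: dict, normal_peak_clients: int) -> list:
    """Return alert messages for the INFO-based conditions in the table."""
    alerts = []

    used, limit = int(info["used_memory"]), int(info["maxmemory"])
    if limit > 0 and used > 0.8 * limit:  # maxmemory 0 means no limit is set
        alerts.append("used_memory above 80% of maxmemory")

    hits, misses = int(info["keyspace_hits"]), int(info["keyspace_misses"])
    lookups = hits + misses
    if lookups and hits / lookups < 0.90:
        alerts.append("cache hit rate below 90%")

    if int(info["connected_clients"]) > normal_peak_clients:
        alerts.append("connected_clients above normal peak")

    if int(info["rejected_connections"]) > 0:
        alerts.append("connections rejected (maxclients reached)")

    if info.get("aof_last_bgrewrite_status", "ok") != "ok":
        alerts.append("last AOF rewrite failed")

    return alerts


# Illustrative values; only the memory condition is triggered.
sample = {
    "used_memory": "230000000", "maxmemory": "268435456",
    "keyspace_hits": "9820000", "keyspace_misses": "180000",
    "connected_clients": "40", "rejected_connections": "0",
    "aof_last_bgrewrite_status": "ok",
}
print(check_alerts(sample, normal_peak_clients=200))
# ['used_memory above 80% of maxmemory']
```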
Key Points
- INFO gives a full health report. Use INFO memory, INFO stats, and INFO replication most often in production.
- Track your cache hit rate from INFO stats. A rate above 90% is healthy.
- MONITOR streams live commands, which is useful for short debugging sessions. Stop it as soon as you are done, because it slows the server while it runs.
- SLOWLOG reveals commands that exceed your latency threshold. Replacing KEYS * with SCAN removes one of the most common causes of slowdowns.
- Use MEMORY USAGE to find keys consuming disproportionate memory.
- Set alerts on used_memory, hit rate, replication lag, and rejected connections before problems become outages.
