Kafka Monitoring and Observability

Running Apache Kafka without monitoring is like driving a car at night with no dashboard lights. You can move forward, but you won't know when the engine overheats, the fuel runs out, or the brakes start failing. Monitoring gives you full visibility into what Kafka is doing at every second — from how fast messages flow to how much disk space brokers consume.

Observability goes one level deeper. It answers not just "what is happening?" but "why is it happening?" You collect metrics, logs, and traces together so you can find the root cause of any problem without guessing.

Why Kafka Needs Its Own Monitoring Strategy

Kafka runs as a distributed system. Multiple brokers share the workload, producers push data from many sources simultaneously, and consumers pull data at their own pace. When something goes wrong, the problem could sit in the broker, the network, the producer application, or the consumer application. A solid monitoring strategy covers all four layers.

Think of Kafka like a busy postal sorting facility:

┌────────────────────────────────────────────────────────────────────┐
│              KAFKA POSTAL FACILITY                                 │
│                                                                    │
│  [Senders]──▶[Receiving Dock]──▶[Sorting Room]──▶[Delivery Vans]   │   
│  Producers    Broker Ingestion   Partition Storage  Consumers      │
│                                                                    │
│  Monitor:     Monitor:           Monitor:          Monitor:        │
│  Send speed   Network in/out     Disk usage        Lag behind      │
│  Errors       Queue depth        Replication       Processing      │
│  Retries      CPU/memory         health            rate            │
└────────────────────────────────────────────────────────────────────┘

Each station in that facility needs its own set of gauges and alarms. Ignoring any one station lets problems grow silently until they become outages.

The Three Pillars of Kafka Observability

Pillar 1 — Metrics

Metrics are numbers measured over time. They answer questions like "how many messages per second?" or "what percentage of disk is used?" Kafka exposes metrics through JMX (Java Management Extensions). An exporter converts those JMX values into a format that Prometheus scrapes every few seconds and stores. Grafana then draws charts from the stored numbers.

Kafka Broker
    │
    ├── JMX Endpoint (port 9999)
    │       │
    │       ▼
    │   Prometheus Exporter
    │       │
    │       ▼
    │   Prometheus Server ──▶ Alertmanager ──▶ PagerDuty / Slack
    │       │
    │       ▼
    │   Grafana Dashboard
    │
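
To make the pipeline above concrete, here is a minimal Python sketch of what a scrape looks like from the Prometheus side. The metric names in SAMPLE_SCRAPE are placeholders (real names depend on the rules in the exporter's configuration file), and the parser handles only simple sample lines, not timestamps or escaped label values.

```python
import re

# Hypothetical scrape output. Real metric names depend on the rules in the
# exporter's YAML config, so treat these names as placeholders.
SAMPLE_SCRAPE = """\
# HELP kafka_server_messages_in_total Messages received by the broker
# TYPE kafka_server_messages_in_total counter
kafka_server_messages_in_total{topic="orders"} 1245000
kafka_server_under_replicated_partitions 3
"""

# Metric name, optional {labels}, then the sample value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def parse_scrape(text):
    """Map (metric_name, raw_label_text) to a float for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def rate_per_second(previous, current, interval_seconds):
    """Turn two readings of a counter into a per-second rate."""
    return (current - previous) / interval_seconds


if __name__ == "__main__":
    for key, value in parse_scrape(SAMPLE_SCRAPE).items():
        print(key, value)
```

Counters such as a messages-in total only ever increase, which is why dashboards chart the rate between two scrapes rather than the raw value.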

Pillar 2 — Logs

Logs are text records that Kafka writes when notable events happen. A broker writes a log entry when leadership of a partition moves to another broker, when a consumer group rebalances, or when a configuration error occurs. Centralizing logs from all brokers in one place (such as Elasticsearch or Splunk) lets you search across them together instead of logging in to individual machines over SSH.

Pillar 3 — Distributed Traces

A single business event — say, a customer placing an order — might pass through three producers, two Kafka topics, and four consumer services. A distributed trace follows that single event end-to-end and shows you exactly where time was spent or where an error occurred. Tools like Jaeger and Zipkin visualize these traces.

Critical Kafka Metrics to Watch

Broker-Level Metrics

The broker is the heart of Kafka. These metrics tell you whether the heart is beating correctly.

┌────────────────────────────────────────────────────────────┐
│                   BROKER HEALTH PANEL                      │
│                                                            │
│  Messages In/Sec  ████████████░░░░  12,450 msg/s           │
│  Bytes In/Sec     ███████████░░░░░  98 MB/s                │
│  Bytes Out/Sec    ████████░░░░░░░░  74 MB/s                │
│  Active Partitions████████████████  2,048 partitions       │
│  Under-Replicated ██░░░░░░░░░░░░░░  3 partitions ⚠️        │
│  Offline Partitions░░░░░░░░░░░░░░░  0 partitions ✅        │
│  Request Latency  ████░░░░░░░░░░░░  8ms avg                │
│  Network Threads  ████████████░░░░  12 active              │
└────────────────────────────────────────────────────────────┘

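Under-Replicated Partitions deserves the most immediate attention. This metric counts the partitions whose in-sync replica set is smaller than the configured replication factor. A value above zero means those partitions are running with reduced redundancy. Nothing is lost yet, but if the brokers holding the remaining in-sync replicas fail before the missing replicas catch up, those partitions can lose data or go offline.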

Offline Partitions is the fire alarm metric. Any value above zero means consumers cannot read from those partitions and producers cannot write to them. Treat this as a production incident immediately.
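
The two partition metrics can be sketched in a few lines of Python. The PartitionState record and the CLUSTER data below are hypothetical stand-ins for what a cluster metadata query would return; the sketch only shows how the two counts are derived.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: List[int]      # broker IDs assigned to hold a copy
    isr: List[int]           # replicas currently in sync with the leader
    leader: Optional[int]    # None when no leader is available


def under_replicated(partitions):
    """Partitions whose in-sync set is smaller than the assigned replica set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def offline(partitions):
    """Partitions with no leader; clients cannot read or write them."""
    return [p for p in partitions if p.leader is None]


# Made-up cluster state for illustration.
CLUSTER = [
    PartitionState("orders", 0, [1, 2, 3], [1, 2, 3], leader=1),
    PartitionState("orders", 1, [2, 3, 1], [2], leader=2),
    PartitionState("orders", 2, [3, 1, 2], [], leader=None),
]

if __name__ == "__main__":
    print("under-replicated:", [p.partition for p in under_replicated(CLUSTER)])
    print("offline:", [p.partition for p in offline(CLUSTER)])
```

In this made-up state, partition 1 is still serving traffic from a single in-sync replica, while partition 2 has no leader at all and is the one that should page someone.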

Producer Metrics

Producers push data into Kafka. Monitoring producer health helps you catch problems before they back up into the brokers.

Producer Application
│
├── record-send-rate       → How many records sent per second
├── record-error-rate      → How many sends failed per second
├── request-latency-avg    → Average time to get acknowledgment from broker
├── record-retry-rate      → How often the producer retried a failed send
├── batch-size-avg         → Average size of batches sent (bigger = more efficient)
└── buffer-available-bytes → Free space in the producer's send buffer

A rising record-error-rate paired with rising record-retry-rate tells you the broker is under stress or unreachable. The producer is working hard to compensate, which burns CPU and delays downstream consumers.
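
As a rough sketch of that rule, assuming you already sample record-error-rate and record-retry-rate at regular intervals, the check below flags the case where both climb together. The function names and the strict "every sample higher than the last" definition of rising are illustrative choices, not part of any Kafka API.

```python
def is_rising(values):
    """True when every sample is higher than the one before it."""
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))


def broker_stress_suspected(error_rates, retry_rates):
    """Both producer rates climbing together suggests a broker-side problem."""
    return is_rising(error_rates) and is_rising(retry_rates)


if __name__ == "__main__":
    # Made-up samples taken one minute apart.
    print(broker_stress_suspected([0.1, 0.4, 1.2], [2.0, 5.5, 9.0]))
```

A real alert would usually smooth the samples first, since a single noisy reading breaks a strict comparison like this one.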

Consumer Metrics — Consumer Lag Is King

Consumer lag is usually the most business-relevant metric in a Kafka deployment. It measures how far a consumer group's committed position trails the newest message in each partition it reads.

Topic Partition Timeline:
─────────────────────────────────────────────────────▶  time

  Offset:   0   100  200  300  400  500  600  700  800
  Messages: ●───●────●────●────●────●────●────●────●

  Producer writes at offset 800 (latest)
  Consumer reads at offset 550 (current position)
  
  Consumer Lag = 800 - 550 = 250 messages behind

  ┌──────────────────────────────────────────────────┐
  │  LAG = 250   Status: Acceptable                  │
  │  LAG = 5000  Status: Warning — falling behind    │
  │  LAG = 50000 Status: Alert — serious problem     │
  └──────────────────────────────────────────────────┘

Lag by itself is not always bad. A batch-processing consumer might intentionally run every hour and accumulate lag between runs. What matters is whether the lag grows continuously. Growing lag means the consumer cannot keep up with the producer's speed, and the gap will only widen unless you add more consumer instances or optimize the consumer code. Adding instances helps only up to the partition count, because each partition is read by at most one consumer in a group.
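
The difference between lag and growing lag can be sketched as follows. This is a minimal example, assuming you collect (seconds, lag) samples yourself; the least-squares slope is one reasonable way to detect a trend, not the only one.

```python
def partition_lag(log_end_offset, committed_offset):
    """Messages the consumer still has to read in one partition."""
    return max(log_end_offset - committed_offset, 0)


def lag_slope(samples):
    """Least-squares slope of lag over time, in messages per second.

    samples is a list of (seconds, lag) pairs.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    numerator = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator if denominator else 0.0


def is_growing(samples, min_slope=0.0):
    """True when the fitted trend is above min_slope messages per second."""
    return lag_slope(samples) > min_slope


if __name__ == "__main__":
    print(partition_lag(800, 550))                      # the example above
    print(lag_slope([(0, 250), (60, 400), (120, 550)])) # steadily falling behind
```

A batch consumer that builds up lag and then drains it produces a sawtooth whose fitted slope stays near or below zero, so it does not trip the check.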

Topic and Partition Metrics

Per-Topic View:
┌─────────────────┬──────────┬──────────┬──────────┬────────────┐
│ Topic           │Partitions│ In/Sec   │ Out/Sec  │ Retention  │
├─────────────────┼──────────┼──────────┼──────────┼────────────┤
│ orders          │ 24       │ 8,200    │ 16,400   │ 72h        │
│ payments        │ 12       │ 2,100    │ 6,300    │ 168h       │
│ user-events     │ 48       │ 45,000   │ 90,000   │ 24h        │
│ notifications   │ 6        │ 500      │ 1,000    │ 48h        │
└─────────────────┴──────────┴──────────┴──────────┴────────────┘

Track per-partition leader distribution as well. Ideally, leadership spreads evenly across all brokers. Producers write to partition leaders and consumers read from them by default, so a broker that holds 80% of the leaders handles roughly 80% of client traffic and becomes a bottleneck.
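
One way to put a number on leader distribution is sketched below. The imbalance definition used here (largest deviation from a perfectly even share) is an assumption made for illustration; Kafka's own rebalancing settings define imbalance differently.

```python
from collections import Counter


def leader_share(leaders_by_partition, broker_ids):
    """Fraction of partitions led by each broker."""
    counts = Counter(leaders_by_partition.values())
    total = len(leaders_by_partition)
    return {b: (counts.get(b, 0) / total if total else 0.0) for b in broker_ids}


def imbalance(shares):
    """Largest deviation from a perfectly even share, as a fraction."""
    ideal = 1 / len(shares)
    return max(abs(share - ideal) for share in shares.values())


# Made-up example: broker 1 leads 8 of 10 partitions.
LEADERS = {p: 1 for p in range(8)}
LEADERS.update({8: 2, 9: 3})

if __name__ == "__main__":
    shares = leader_share(LEADERS, [1, 2, 3])
    print(shares, round(imbalance(shares), 3))
```

With three brokers the even share is about 33%, so a broker at 80% is far outside any reasonable tolerance.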

JVM and OS Metrics for Broker Health

Kafka runs on the Java Virtual Machine. The JVM needs its own monitoring because garbage collection pauses can freeze a broker for hundreds of milliseconds.

JVM Garbage Collection Impact on Kafka:

Normal Operation:
  Request: ──────────────────── 5ms ──── Response

GC Pause Happens:
  Request: ──────────────────────────── (waiting) ──────────────── 2,000ms ──── Response
                                         ↑
                              JVM stopped all threads to
                              clean up memory (GC pause)
                              Consumers see timeout errors

Watch these JVM metrics closely:

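GC pause duration: Keep this below 200ms. Pauses above 500ms show up as client latency spikes, and multi-second pauses can cause request and session timeouts.
GC frequency: Frequent short GCs often signal that the heap is too small.
Heap used vs heap max: When heap usage stays above 75% even after garbage collection, raise the broker's heap size or reduce its load. Adding RAM alone does not help unless the heap setting grows with it.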

At the operating system level, track disk I/O wait, network bandwidth utilization, and open file descriptor count. Kafka opens one file per log segment per partition. A broker managing 1,000 partitions with 5 segments each keeps 5,000 files open simultaneously — a number that can exceed default OS limits on poorly configured servers.
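
The file-count arithmetic above is easy to turn into a capacity check. This sketch assumes one open file per log segment, as in the estimate above, plus one descriptor per client connection; the 80% headroom factor and the function names are illustrative.

```python
def estimated_open_files(partitions, segments_per_partition, client_connections=0):
    """Rough descriptor count: one per log segment plus one per connection."""
    return partitions * segments_per_partition + client_connections


def within_limit(needed, soft_limit, headroom=0.8):
    """True when the estimate stays under a safety fraction of the limit."""
    return needed <= soft_limit * headroom


if __name__ == "__main__":
    needed = estimated_open_files(1000, 5)
    print(needed)                        # the 5,000 files from the example
    print(within_limit(needed, 1024))    # a small soft limit is not enough
    print(within_limit(needed, 100000))
```

On Unix systems, `resource.getrlimit(resource.RLIMIT_NOFILE)` from the standard library returns the current (soft, hard) limits for the running process, which can replace the hard-coded limit values.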

ZooKeeper and KRaft Mode Monitoring

Older Kafka clusters use ZooKeeper to elect the controller and store cluster metadata. Newer clusters use KRaft mode (Kafka Raft), which eliminates ZooKeeper entirely. KRaft has been production-ready since Kafka 3.3, and Kafka 4.0 removed ZooKeeper support.

ZooKeeper-based cluster (Kafka 3.x and earlier):
┌────────────┐     ┌────────────┐     ┌────────────┐
│  Broker 1  │     │  Broker 2  │     │  Broker 3  │
└─────┬──────┘     └─────┬──────┘     └─────┬──────┘
      │                  │                  │
      └──────────────────┼──────────────────┘
                         │
                  ┌──────▼──────┐
                  │  ZooKeeper  │  ← Monitor: latency,
                  │  Ensemble   │    outstanding requests,
                  └─────────────┘    watch count

KRaft mode cluster (Kafka 3.3+):
┌────────────┐     ┌────────────┐     ┌────────────┐
│  Broker 1  │     │  Broker 2  │     │  Broker 3  │
│ (also acts │     │            │     │            │
│ as quorum  │     └────────────┘     └────────────┘
│  voter)    │
└────────────┘
No ZooKeeper needed — built-in Raft consensus
Monitor: quorum leader, election count, metadata log offset

For ZooKeeper clusters, watch the outstanding requests metric. This number shows how many requests are queued waiting for ZooKeeper to process them. A count that keeps rising without draining signals that ZooKeeper is overwhelmed, which slows metadata operations across the entire Kafka cluster.
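
ZooKeeper reports this value as zk_outstanding_requests in the output of its `mntr` four-letter command (the command has to be allowed in the ZooKeeper configuration). The sketch below parses that tab-separated output and checks for a building backlog; the sample values are made up, and fetching the text from the server is left out.

```python
# Illustrative `mntr` output (tab-separated key/value pairs); values are made up.
SAMPLE_MNTR = "zk_avg_latency\t2\nzk_outstanding_requests\t14\nzk_watch_count\t320\n"


def parse_mntr(text):
    """Turn `mntr` output into a dict of string keys and string values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def backlog_is_building(outstanding_history, min_samples=3):
    """True when outstanding requests rose across each of the recent samples."""
    recent = outstanding_history[-min_samples:]
    return len(recent) >= min_samples and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )


if __name__ == "__main__":
    stats = parse_mntr(SAMPLE_MNTR)
    print(int(stats["zk_outstanding_requests"]))
    print(backlog_is_building([3, 14, 40]))
```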

Setting Up a Prometheus and Grafana Stack

This stack is a widely used open-source choice for Kafka monitoring. Here is how the components connect:

Step 1: Add JMX Exporter to Kafka
──────────────────────────────────
Each Kafka broker runs a JMX Exporter agent that converts
JMX metrics into Prometheus format.

KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"

Step 2: Prometheus Scrapes Every 15 Seconds
────────────────────────────────────────────
scrape_configs:
  - job_name: 'kafka'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'broker1:7071'
          - 'broker2:7071'
          - 'broker3:7071'

Step 3: Grafana Connects to Prometheus
────────────────────────────────────────
Data Source → Prometheus URL → Import Dashboard ID 7589 (Kafka Overview)

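Result: Live dashboards refresh with each 15-second scrape and show
broker metrics in one place. Producer and consumer client metrics live
in the client applications, so they need their own exporters before
they appear here.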

Alerting Rules — When to Wake Someone Up

Alerts should fire when human action is required. Too many alerts train people to ignore them. Too few alerts let problems grow unnoticed. Use this tiered approach:

SEVERITY TIERS FOR KAFKA ALERTS:

🔴 CRITICAL — Page on-call engineer immediately
   • Offline partitions > 0 (for more than 2 minutes)
   • Active controller count ≠ 1
   • Broker down (missing from cluster)
   • Consumer lag growing continuously for 30+ minutes

🟡 WARNING — Notify team, investigate within 2 hours  
   • Under-replicated partitions > 0 (sustained 5+ minutes)
   • Disk usage > 70% on any broker
   • Consumer lag > 100,000 messages
   • GC pause duration > 500ms
   • Producer error rate > 1%

🔵 INFO — Log for review next business day
   • Leader imbalance > 20%
   • Disk usage > 50%
   • Request latency p99 > 100ms
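
The tiers above can be expressed as a small classification function. This is a sketch only: the snapshot keys are hypothetical names, and the "sustained for N minutes" conditions are left to the alerting system, which is where duration windows normally live.

```python
def classify(snapshot):
    """Return (severity, reasons) using the thresholds from the tier list."""
    s = snapshot
    critical = []
    if s["offline_partitions"] > 0:
        critical.append("offline partitions")
    if s["active_controllers"] != 1:
        critical.append("active controller count is not 1")
    if s["brokers_up"] < s["brokers_expected"]:
        critical.append("broker down")
    if s["lag_growing_minutes"] >= 30:
        critical.append("consumer lag growing for 30+ minutes")
    if critical:
        return "critical", critical

    warning = []
    if s["under_replicated"] > 0:
        warning.append("under-replicated partitions")
    if s["max_disk_pct"] > 70:
        warning.append("disk usage above 70%")
    if s["max_consumer_lag"] > 100_000:
        warning.append("consumer lag above 100,000")
    if s["max_gc_pause_ms"] > 500:
        warning.append("GC pause above 500ms")
    if s["producer_error_pct"] > 1:
        warning.append("producer error rate above 1%")
    if warning:
        return "warning", warning

    info = []
    if s["leader_imbalance_pct"] > 20:
        info.append("leader imbalance above 20%")
    if s["max_disk_pct"] > 50:
        info.append("disk usage above 50%")
    if s["p99_latency_ms"] > 100:
        info.append("p99 request latency above 100ms")
    if info:
        return "info", info
    return "ok", []


# Made-up healthy snapshot; every key name here is an assumption.
HEALTHY = {
    "offline_partitions": 0, "active_controllers": 1,
    "brokers_up": 3, "brokers_expected": 3, "lag_growing_minutes": 0,
    "under_replicated": 0, "max_disk_pct": 40, "max_consumer_lag": 250,
    "max_gc_pause_ms": 80, "producer_error_pct": 0.0,
    "leader_imbalance_pct": 5, "p99_latency_ms": 20,
}

if __name__ == "__main__":
    print(classify(dict(HEALTHY, max_disk_pct=75)))
```

Returning only the highest tier keeps a page from being buried under lower-priority noise from the same snapshot.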

Consumer Group Monitoring in Practice

Kafka provides a built-in tool to inspect consumer group health without any external software:

Command:
kafka-consumer-groups.sh \
  --bootstrap-server broker1:9092 \
  --describe \
  --group order-processor

Output (trailing CONSUMER-ID, HOST, and CLIENT-ID columns omitted):
GROUP           TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor orders    0          8450230         8450245         15
order-processor orders    1          9102100         9102895         795
order-processor orders    2          7800050         7800050         0
order-processor orders    3          8950100         8953200         3100

Analysis:
• Partition 0: Lag=15  → Healthy, nearly caught up
• Partition 1: Lag=795 → Slightly behind but manageable
• Partition 2: Lag=0   → Fully caught up, consumer has read every message
• Partition 3: Lag=3100→ Problem: one consumer may be slow or stuck

Partition 3 carrying 3,100 messages of lag while the others have little points to an uneven workload (for example, a hot message key) or a slow consumer instance assigned to that partition. Restarting that instance or triggering a rebalance often resolves a stuck consumer, while skewed keys call for a better partitioning scheme.
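
The same analysis can be automated by parsing the command's text output. The sketch below assumes the first six columns shown above; the outlier rule (lag above three times the group median, with a floor of 1,000 messages) is a hypothetical heuristic, not something the tool provides.

```python
import statistics

SAMPLE_OUTPUT = """\
GROUP           TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor orders    0          8450230         8450245         15
order-processor orders    1          9102100         9102895         795
order-processor orders    2          7800050         7800050         0
order-processor orders    3          8950100         8953200         3100
"""


def parse_describe(output):
    """Extract topic, partition, offsets, and lag from each data row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "GROUP":
            continue  # skip the header and blank lines
        try:
            # Tolerate thousands separators in hand-edited output.
            partition, current, end, lag = (
                int(f.replace(",", "")) for f in fields[2:6]
            )
        except ValueError:
            continue  # e.g. "-" when no offset has been committed
        rows.append({"topic": fields[1], "partition": partition,
                     "current": current, "end": end, "lag": lag})
    return rows


def lagging_outliers(rows, factor=3.0, min_lag=1000):
    """Rows whose lag is well above the group's median lag."""
    if not rows:
        return []
    median = statistics.median([r["lag"] for r in rows])
    threshold = max(factor * median, min_lag)
    return [r for r in rows if r["lag"] > threshold]


if __name__ == "__main__":
    for row in lagging_outliers(parse_describe(SAMPLE_OUTPUT)):
        print(row["topic"], row["partition"], row["lag"])
```

Comparing each partition to the group median catches the single slow partition without firing when every partition is equally behind, which is a different problem with a different fix.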

Log Aggregation for Kafka

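Kafka brokers write their application logs to the logs/ directory under the Kafka installation by default; packaged installs often redirect them to /var/log/kafka/. In a cluster of ten brokers, you have ten separate log directories. Ship all logs to a central system using Filebeat or Fluentd.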

Broker 1 logs ──┐
Broker 2 logs ──┼──▶ Filebeat/Fluentd ──▶ Elasticsearch ──▶ Kibana
Broker 3 logs ──┘                                            (search UI)

Key log patterns to search for:
─────────────────────────────────────────────
"[KafkaServer id=N] started"     → Broker started successfully (N is the broker ID)
"ERROR"                          → Investigate immediately
"WARN"                           → Review; investigate if it repeats
"Partition.*leader"              → Leader election happened
"ReplicaFetcherThread"           → Replication activity
"GroupCoordinator"               → Consumer group changes (joins, rebalances)
"Throttling"                     → Client hitting rate limits

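Before a full search stack is in place, the same patterns can be applied with a short script. This is a minimal sketch: the tag names are made up, and the sample line used in the usage example is synthetic rather than copied from a real broker.

```python
import re
from collections import Counter

# Patterns taken from the list above; the tag names are illustrative.
PATTERNS = [
    ("leader_election", re.compile(r"Partition.*leader")),
    ("replication", re.compile(r"ReplicaFetcherThread")),
    ("throttling", re.compile(r"Throttling")),
    ("error", re.compile(r"\bERROR\b")),
    ("warning", re.compile(r"\bWARN\b")),
]


def tag_line(line):
    """Return the tags of every pattern that matches this log line."""
    return [name for name, pattern in PATTERNS if pattern.search(line)]


def count_tags(lines):
    """Count how often each tag appears across many log lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag_line(line))
    return counts


if __name__ == "__main__":
    # Synthetic line for demonstration only.
    line = "[2024-06-10 09:40:00,123] ERROR [ReplicaFetcherThread-0-2] synthetic example line"
    print(tag_line(line))
```

Counting tags per broker per minute gives a cheap signal: a sudden jump in leader_election or error counts on one broker is worth a closer look.
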
End-to-End Latency Measurement

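Sometimes you need to know how long a message takes to travel from producer to consumer. Kafka does not report this end-to-end latency as a broker metric. You measure it by comparing a timestamp set by the producer, either embedded in the message payload or taken from the record's built-in timestamp, with the time the consumer processes the message. The result is only as accurate as the clock synchronization between the producer and consumer hosts.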

Producer stamps the message:
  Message = { "event": "order_placed", "timestamp": 1718012400000, "data": {...} }
              ──────────────────────────────────────────────────────────────────
              timestamp = Unix milliseconds when producer created the message

Consumer measures the gap:
  Message timestamp  = 1718012400000
  Current time       = 1718012400145
  
  End-to-end latency = 1718012400145 - 1718012400000 = 145 milliseconds

Track this as a histogram:
  p50 latency:  85ms  (half of messages arrive in 85ms or less)
  p95 latency: 210ms  (95% arrive in 210ms or less)
  p99 latency: 890ms  (99% arrive in 890ms or less)
  max latency: 4200ms (rare spike, often a GC pause)
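
A minimal sketch of the measurement and the percentile summary follows. It uses the nearest-rank percentile method, which is one of several common definitions, and assumes the consumer obtains its current time in Unix milliseconds (for example with `int(time.time() * 1000)`).

```python
import math


def latency_ms(produced_at_ms, consumed_at_ms):
    """End-to-end latency from the producer's timestamp to consumption."""
    return consumed_at_ms - produced_at_ms


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct percent of the samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


if __name__ == "__main__":
    print(latency_ms(1718012400000, 1718012400145))  # the example above
    samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up latencies
    print(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99))
```

Percentiles matter more than the average here because a handful of slow messages, such as those delayed by a GC pause, disappear in a mean but dominate p99.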

Key Points — Kafka Monitoring Checklist

Monitor offline and under-replicated partitions first — these directly signal data availability problems.
Consumer lag is the most business-relevant metric — growing lag means downstream systems fall behind.
Use Prometheus + JMX Exporter + Grafana as the standard monitoring stack.
Aggregate logs from all brokers into one searchable system.
Alert only on actionable conditions, split into critical, warning, and informational tiers.
Measure end-to-end latency using producer-embedded timestamps for true pipeline visibility.
Watch JVM garbage collection — long GC pauses appear as Kafka timeouts to clients.
