Kafka Consumer Groups and Load Balancing

Consumer groups are the mechanism that makes Kafka scale on the consumption side. Without consumer groups, a single consumer reading a high-volume topic becomes the bottleneck — it cannot keep up with the rate of incoming messages. Consumer groups solve this by letting many consumers share the work, each reading a subset of the topic's partitions in parallel. They also allow multiple completely independent applications to read the same topic, each getting their own copy of every message.

What Is a Consumer Group

A consumer group is a set of consumer instances that share a group.id configuration value and collectively read a topic. Kafka guarantees that each partition in a topic is read by exactly one consumer within the group at any given time. No two consumers in the same group read the same partition simultaneously — this prevents duplicate processing.

Think of a conveyor belt in a factory. The belt (the topic) carries products (messages). One worker (consumer) cannot inspect all products fast enough. You assign multiple workers (consumer group) to the belt. Each worker takes responsibility for a section of the belt (partition). Every product gets inspected exactly once, and the total inspection rate scales with the number of workers.

CONSUMER GROUP: "order-processors" reading topic "orders" (6 partitions)

Topic "orders":
  P0  P1  P2  P3  P4  P5

Consumer A → reads P0, P1
Consumer B → reads P2, P3
Consumer C → reads P4, P5

Rules:
  ✓ Each partition read by exactly ONE consumer
  ✓ Each consumer can read multiple partitions
  ✓ All consumers in group share the workload
  ✗ Two consumers CANNOT share the same partition

Multiple Consumer Groups on the Same Topic

The most powerful aspect of consumer groups is that multiple groups can all read the same topic completely independently. Each group maintains its own committed offsets. Reading by group A does not affect group B's position. Both groups receive every message the topic contains.

TOPIC: orders (4 partitions)
Message flow: →→→→→→→→→→→→→→→→→→→→

GROUP 1: "warehouse-team"
  Consumer W1 → P0, P1 → processing for shipping
  Consumer W2 → P2, P3 → processing for shipping
  Committed offset: P0=142, P1=89, P2=201, P3=55

GROUP 2: "email-service"
  Consumer E1 → P0, P1, P2, P3 (single consumer handles all)
  Committed offset: P0=141, P1=87, P2=199, P3=55

GROUP 3: "fraud-detection"
  Consumer F1 → P0, P1
  Consumer F2 → P2, P3
  Committed offset: P0=138, P1=89, P2=198, P3=54

All three groups read every message. Each operates independently.
warehouse-team is slightly ahead of fraud-detection.
Neither affects the other's progress.

Partition-to-Consumer Ratio Rules

The number of partitions in a topic caps the maximum parallelism within a single consumer group. You cannot have more active consumers than partitions in a group — extra consumers sit idle.

PARTITION COUNT vs CONSUMER COUNT SCENARIOS:

Topic: events (4 partitions: P0, P1, P2, P3)

Scenario A: 2 consumers in group (under-provisioned)
  Consumer 1 → P0, P1, P2  (handles 3 partitions)
  Consumer 2 → P3           (handles 1 partition)
  Consumer 1 is doing 3x more work. Imbalanced.

Scenario B: 4 consumers in group (perfectly balanced)
  Consumer 1 → P0
  Consumer 2 → P1
  Consumer 3 → P2
  Consumer 4 → P3
  Each consumer handles exactly 1 partition. Perfect balance.

Scenario C: 6 consumers in group (over-provisioned)
  Consumer 1 → P0
  Consumer 2 → P1
  Consumer 3 → P2
  Consumer 4 → P3
  Consumer 5 → IDLE (no partition available)
  Consumer 6 → IDLE (no partition available)
  Extra consumers sit idle. Wasted resources, but ready as standby.

The ideal setup is: number of consumers = number of partitions. This gives perfect load distribution and uses every consumer's capacity. Having spare (idle) consumers is useful as standby — if an active consumer fails, an idle consumer immediately takes over its partitions without needing to launch a new instance.

Consumer Group Rebalancing

A rebalance happens when the membership of a consumer group changes. Membership changes occur when a consumer joins the group (new instance started), a consumer leaves the group (instance stopped or crashed), or a consumer is considered dead (no poll() call within max.poll.interval.ms).

The Rebalance Process

REBALANCE TRIGGERED: Consumer C crashes

Before failure:
  Consumer A → P0, P1
  Consumer B → P2, P3
  Consumer C → P4, P5  ← CRASHES

Group Coordinator detects C's heartbeat is missing.
Coordinator declares C dead. Triggers rebalance.

STOP-THE-WORLD REBALANCE (eager rebalance):
  Step 1: All consumers told to STOP and return all partitions
  Step 2: No consumer reads any message (brief pause)
  Step 3: Assignment algorithm runs with remaining consumers (A, B)
  Step 4: New assignment distributed:
    Consumer A → P0, P1, P4  (took 2 of C's partitions)
    Consumer B → P2, P3, P5  (took 1 of C's partitions)
  Step 5: Consumers resume reading from committed offsets

Processing paused during rebalance. Duration: typically 1-10 seconds.

Eager vs Cooperative Rebalancing

The traditional "stop the world" rebalance described above is called eager rebalancing. Every consumer stops reading during the rebalance, even if its partition assignment isn't changing. For large consumer groups on high-volume topics, this pause is costly.

Kafka 2.4+ introduced cooperative rebalancing (also called incremental cooperative rebalancing). In this mode, only the partitions that need to move between consumers are revoked. Consumers that keep their partitions never stop reading.

COOPERATIVE REBALANCE (Kafka 2.4+): Consumer C crashes

Before failure:
  Consumer A → P0, P1
  Consumer B → P2, P3
  Consumer C → P4, P5  ← CRASHES

Cooperative rebalance:
  Step 1: Coordinator computes new assignment
  Step 2: Only revoke P4 and P5 from dead C (A and B keep their partitions)
  Step 3: A and B continue reading P0-P3 WITHOUT PAUSE
  Step 4: Assign P4 to A, P5 to B
  Step 5: A and B now read additional partitions

Consumer A NEVER stopped reading P0, P1.
Consumer B NEVER stopped reading P2, P3.
Only the transfer of P4, P5 caused a brief micro-pause.

Enable cooperative rebalancing with: partition.assignment.strategy=CooperativeStickyAssignor

Partition Assignment Strategies

When a rebalance occurs and the group leader assigns partitions to consumers, it uses an assignment strategy. Kafka provides several built-in strategies.

RangeAssignor (default)

Assigns partitions as contiguous ranges per topic. For each topic, sorts partitions and consumers, then divides ranges. Works well with a single topic. Becomes unbalanced across multiple topics.

RANGE ASSIGNMENT: 2 topics (3 partitions each), 2 consumers

Topic A: P0, P1, P2
Topic B: P0, P1, P2

Consumer 1: Topic-A P0, P1 | Topic-B P0, P1  (4 partitions)
Consumer 2: Topic-A P2     | Topic-B P2       (2 partitions)

Imbalanced! Consumer 1 has twice the work.

RoundRobinAssignor

Treats all partitions across all subscribed topics as one pool and assigns them round-robin to consumers. More balanced than Range when multiple topics are involved.

ROUND-ROBIN ASSIGNMENT: 2 topics (3 partitions each), 2 consumers

All partitions: Topic-A-P0, Topic-A-P1, Topic-A-P2, Topic-B-P0, Topic-B-P1, Topic-B-P2

Consumer 1: Topic-A-P0, Topic-A-P2, Topic-B-P1  (3 partitions)
Consumer 2: Topic-A-P1, Topic-B-P0, Topic-B-P2  (3 partitions)

Balanced! Both consumers handle 3 partitions.

StickyAssignor

Like RoundRobin but tries to preserve existing assignments during rebalances. Minimizes partition movement — only moves partitions that need to move. Reduces the disruption of rebalances.

CooperativeStickyAssignor

The same as StickyAssignor but uses the cooperative (incremental) rebalance protocol. This is the recommended strategy for Kafka 2.4+ deployments. It combines balanced assignment, minimal partition movement, and non-stop-the-world rebalancing.

The Group Coordinator

One broker in the Kafka cluster acts as the group coordinator for each consumer group. The coordinator is responsible for managing group membership, triggering rebalances, and tracking committed offsets.

Which broker is the coordinator for a group is determined by hashing the group.id:

Coordinator partition = hash(group.id) % __consumer_offsets.partitions

The broker that is the leader of that __consumer_offsets partition
becomes the group coordinator.

Example:
  group.id = "analytics"
  hash("analytics") % 50 = 12
  Broker 3 is the leader of __consumer_offsets partition 12
  → Broker 3 is the group coordinator for "analytics" group

Heartbeats: Keeping the Group Alive

Each consumer sends periodic heartbeats to the group coordinator to prove it is alive. A dedicated heartbeat thread sends these independently of the poll loop. If the coordinator doesn't receive a heartbeat within session.timeout.ms (default 45 seconds), it declares the consumer dead and triggers a rebalance.

Key heartbeat settings:

heartbeat.interval.ms: How often to send heartbeats. Default 3 seconds. Should be at most 1/3 of session.timeout.ms.

session.timeout.ms: Maximum time without a heartbeat before the coordinator declares the consumer dead. Default 45 seconds.

max.poll.interval.ms: Maximum time between poll() calls. Default 5 minutes. Even if heartbeats are fine, failing to call poll() within this window marks the consumer as stuck and triggers a rebalance.

HEARTBEAT vs POLL TIMEOUT:

heartbeat.interval.ms = 3s:
  Consumer sends heartbeat every 3 seconds.
  Proves consumer process is alive.

session.timeout.ms = 45s:
  If no heartbeat for 45s → coordinator declares consumer DEAD → rebalance.
  Handles network issues, JVM pauses, garbage collection.

max.poll.interval.ms = 300s:
  If no poll() for 300s → coordinator marks consumer STUCK → rebalance.
  Handles processing taking too long (slow business logic, downstream delays).
  Even if heartbeats are healthy, a stuck consumer triggers rebalance.

Monitoring Consumer Groups

The kafka-consumer-groups.sh command is your primary tool for inspecting consumer group health.

# List all consumer groups:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe a consumer group (shows partitions, offsets, lag, consumer IDs):
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group order-processors

Output:
GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
order-processors orders  0          1042            1042            0    consumer-1-abc
order-processors orders  1           889             901           12    consumer-1-abc
order-processors orders  2          1100            1100            0    consumer-2-def
order-processors orders  3           750             750            0    consumer-2-def

Partition 1 has lag of 12 → consumer-1 is slightly behind on P1.
All other partitions: lag = 0 → fully caught up.

Key Points

  • A consumer group shares a group.id. Kafka assigns each partition to exactly one consumer in the group. Multiple consumers share the workload of reading a topic in parallel.
  • Maximum parallelism within a consumer group equals the number of partitions. Extra consumers beyond partition count sit idle as standby.
  • Multiple independent consumer groups can all read the same topic, each maintaining their own offset position without affecting each other.
  • Rebalances occur when consumers join, leave, or are declared dead. Eager rebalances pause all consumers. Cooperative rebalances (Kafka 2.4+) only pause the partitions being moved.
  • CooperativeStickyAssignor is the recommended partition assignment strategy for Kafka 2.4+ — it provides balanced assignment with minimal rebalance disruption.
  • Heartbeats and the poll loop together keep consumers alive in the group. session.timeout.ms governs heartbeat timeout. max.poll.interval.ms governs poll timeout.

Leave a Comment