Leader and Follower Replicas in Apache Kafka

Every partition in Kafka has one leader replica and zero or more follower replicas. This leader-follower model is the engine of Kafka's fault tolerance and performance. Understanding exactly what each role does — and what happens when roles change — gives you the insight needed to diagnose replication problems, tune performance, and design resilient architectures.

The Leader Replica's Responsibilities

The leader replica is the active, authoritative copy of a partition. It handles all producer writes and, by default, all consumer reads. Every message enters Kafka through the leader, and unless follower fetching is enabled, every consumer reads from it as well. The leader is the single point of coordination for its partition.

Handling Producer Writes

When a producer sends a message to partition P, it sends it directly to the broker that is currently the leader for P. The leader appends the message to its local log, assigns it the next offset, and (depending on the acks setting) either immediately responds to the producer or waits for follower confirmation first.
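The acknowledgement decision described above can be sketched in a few lines. This is a simplified model, not Kafka's broker code: the function name `ack_ready` and its parameters are hypothetical, and `isr_confirmed` is assumed to count the leader itself.

```python
def ack_ready(acks: str, leader_appended: bool,
              isr_confirmed: int, isr_size: int) -> bool:
    """Return True when the leader may answer the producer.

    Simplified model (an assumption, not Kafka's real code path):
      acks="0"   -> the producer does not wait for any response
      acks="1"   -> respond once the leader has appended locally
      acks="all" -> respond once every current ISR member has the record
    """
    if acks == "0":
        return True
    if acks == "1":
        return leader_appended
    if acks in ("all", "-1"):
        # isr_confirmed counts the leader plus followers that hold the record.
        return leader_appended and isr_confirmed >= isr_size
    raise ValueError(f"unknown acks value: {acks!r}")
```

With an ISR of three, `ack_ready("all", True, 2, 3)` is false until the last follower confirms, while `ack_ready("1", True, 1, 3)` is already true.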

Tracking Follower Progress

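The leader maintains a high watermark: the highest offset that every in-sync replica (ISR) has confirmed receiving. The high watermark is critical because consumers can read messages only up to it, never beyond it. This ensures consumers never see a message that could still be lost because it has not yet been replicated to all in-sync replicas. (Internally, Kafka records the high watermark as the offset just past the last fully replicated message; this section uses the simpler "last confirmed offset" view.)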

HIGH WATERMARK ILLUSTRATION:

Leader log:
  [msg@0][msg@1][msg@2][msg@3][msg@4][msg@5]
                               ↑
                         HIGH WATERMARK = 3
                  (all ISR replicas have confirmed up to offset 3)

Consumer read boundary:
  Consumer can read: msg@0, msg@1, msg@2, msg@3
  Consumer CANNOT read: msg@4, msg@5 (not yet confirmed by all ISR)

Why? If leader crashes NOW:
  msg@4 and msg@5 might only be on the leader.
  New leader (a follower) might not have them.
  Consumer reading msg@5 then seeing it disappear = confusing and broken.

High watermark protects consumers from reading uncommitted data.
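The same boundary can be computed in code. The sketch below follows the illustration's convention (high watermark = last offset held by every ISR member) rather than Kafka's internal "next offset" representation; the replica names and helper functions are hypothetical.

```python
from typing import Dict, List, Set

def high_watermark(last_offset: Dict[str, int], isr: Set[str]) -> int:
    """Highest offset that every ISR member already holds."""
    if not isr:
        raise ValueError("ISR must contain at least the leader")
    return min(last_offset[replica] for replica in isr)

def readable_offsets(hw: int) -> List[int]:
    """Offsets a consumer may read: everything up to and including hw."""
    return list(range(hw + 1))

# Hypothetical replicas; the leader holds msg@0 through msg@5.
last_offset = {"leader": 5, "follower-a": 3, "follower-b": 4}
isr = {"leader", "follower-a", "follower-b"}
hw = high_watermark(last_offset, isr)
```

The slowest ISR member sets the boundary, which is why a single lagging follower holds back what consumers can see until it either catches up or is removed from the ISR.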

The Follower Replica's Responsibilities

A follower replica's main job is to stay in sync with the leader. It continuously fetches new messages from the leader and appends them to its local log; the offset it requests in each fetch tells the leader how far it has replicated. Followers never accept producer writes and, by default, do not serve consumer reads. They exist primarily as warm standbys.

How Followers Fetch From the Leader

Follower replication uses the same fetch API that consumers use. Each follower sends a FetchRequest to the leader: "Give me messages starting at my current offset." The leader responds with available messages. The follower appends them to its log, updates its position, and immediately sends another FetchRequest. This creates a continuous pull loop.

FOLLOWER REPLICATION LOOP:

Follower (Broker 2) replication thread:

Loop:
  Send FetchRequest to Leader (Broker 1):
    "Give me messages from partition P0 starting at my offset 450"
    
  Leader responds:
    [msg@450, msg@451, msg@452, msg@453]
    
  Follower appends to local log
  
  Send FetchRequest:
    "Give me messages starting at offset 454"
    
  Leader responds:
    [msg@454, msg@455]
    
  Follower appends. Its next FetchRequest offset tells the leader how far it has replicated.
  
  Send FetchRequest:
    "Give me messages starting at offset 456"
    
  Leader responds: "Nothing new yet. Wait."
  [Long poll: waits up to replica.fetch.wait.max.ms for new messages]
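The pull loop above can be modelled with two small in-memory classes. This is a toy sketch under stated assumptions: `LeaderLog`, `Follower`, and `max_records` are hypothetical names, records are plain strings, and an empty fetch simply ends the loop where a real follower would long-poll.

```python
from typing import List

class LeaderLog:
    """In-memory stand-in for the leader's partition log."""
    def __init__(self, messages: List[str]) -> None:
        self.messages = list(messages)

    def fetch(self, start_offset: int, max_records: int) -> List[str]:
        # Return up to max_records messages starting at start_offset.
        return self.messages[start_offset:start_offset + max_records]

class Follower:
    """In-memory stand-in for a follower replica."""
    def __init__(self) -> None:
        self.log: List[str] = []

    @property
    def next_offset(self) -> int:
        # The next offset to request equals the number of records held.
        return len(self.log)

    def replicate_once(self, leader: LeaderLog, max_records: int = 4) -> int:
        batch = leader.fetch(self.next_offset, max_records)
        self.log.extend(batch)
        return len(batch)

def replicate_until_caught_up(follower: Follower, leader: LeaderLog,
                              max_records: int = 4) -> int:
    """Fetch until the leader has nothing new; return non-empty fetch count."""
    rounds = 0
    while follower.replicate_once(leader, max_records) > 0:
        rounds += 1
    return rounds
```

Because the follower derives its request offset from its own log length, a restart needs no extra bookkeeping: it resumes from wherever its local log ends.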

Follower Configuration Settings

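num.replica.fetchers: Number of fetcher threads a broker uses to replicate from each source (leader) broker. Default 1. Increase it on clusters with many partitions or high throughput where replication is a bottleneck.

replica.fetch.max.bytes: Number of bytes of messages the follower attempts to fetch per partition in each fetch request. Default 1048576 (1 MiB). This is not a hard cap: if the first record batch is larger, it is still returned so that replication can make progress. Consider raising it when partitions regularly carry large messages.

replica.fetch.wait.max.ms: Maximum time the leader holds a follower's fetch request while waiting for new messages before returning an empty response. Default 500 ms. Keep it below replica.lag.time.max.ms so that low-throughput partitions do not trigger needless ISR shrinks.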

Replica States: Online, Offline, and Lagging

Replicas cycle through states based on their connectivity and sync status with the leader.

Online and In-Sync

The normal state. The follower is connected to the leader, actively fetching, and within the lag threshold. It is a member of the ISR. The acks=all guarantee includes this replica.

Online but Lagging (Out of Sync)

The follower is connected but has not caught up to the end of the leader's log within replica.lag.time.max.ms, either because it stopped sending fetch requests or because it is fetching too slowly. This typically happens when the follower's broker is overloaded, its disk I/O is slow, or its JVM is stuck in a long garbage collection pause. The leader removes it from the ISR, but it remains a replica and rejoins once it catches up.

FOLLOWER FALLS BEHIND THE ISR:

t=0:   Leader offset = 1000. Follower at 998. ISR = {Leader, Follower}.
t=5s:  Leader offset = 1050. Follower still at 998 (fetch stalled).
t=10s: Leader offset = 1100. Follower at 998. Time since last fetch = 10s.
t=30s: Threshold: replica.lag.time.max.ms = 30000 reached.
       Kafka removes Follower from ISR. ISR = {Leader only}.
       
t=35s: Follower recovers. Fetches messages 998-1100 rapidly.
t=36s: Follower caught up to offset 1100. Reports sync to leader.
       Leader adds Follower back to ISR. ISR = {Leader, Follower}.
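The shrink decision in this timeline comes down to a time comparison. The sketch below assumes the leader records, per follower, the last time that follower was fully caught up. The function name and the `last_caught_up_ms` mapping are hypothetical, and the strict `>` comparison is a modelling choice.

```python
from typing import Dict, Set

def shrink_isr(isr: Set[str], leader: str,
               last_caught_up_ms: Dict[str, int],
               now_ms: int, max_lag_ms: int = 30_000) -> Set[str]:
    """Return the ISR with followers that exceeded the lag threshold removed.

    A follower is dropped when the time since it was last caught up
    is greater than max_lag_ms. The leader is always retained.
    """
    return {
        replica for replica in isr
        if replica == leader
        or now_ms - last_caught_up_ms[replica] <= max_lag_ms
    }
```

In this model a follower that was last caught up at t=0 is still in the ISR at t=29s and is dropped on the first check after t=30s, which matches the timeline to within one check interval.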

Offline

The follower's broker has crashed or is unreachable. It stays offline until the broker restarts, reconnects to the cluster, and its replication thread begins fetching again. Returning to ISR requires catching up with the leader first.

Follower Catchup After Recovery

When a follower comes back online after being down, it has missed messages produced while it was offline. The replication thread restarts and begins fetching from where the follower's log left off — catching up as fast as the network and disk allow. During this catchup phase, the follower is not in the ISR and does not count toward acks=all confirmation. Only after fully catching up does it rejoin the ISR.

FOLLOWER RECOVERY AND CATCHUP:

Leader is at offset 5000.
Follower was at offset 3500 when it crashed.
Follower restarts.

Catchup phase:
  t=0s:   Follower starts fetching from offset 3500
  t=1s:   Fetched 3500-3700 (200 messages in 1 second, fast catch-up)
  t=2s:   Fetched 3700-3900
  ...
  t=7.5s: Follower reaches offset 5000 (caught up to leader)
  
  Leader checks: has the follower caught up to the end of my log? Yes.
  Leader adds follower back to ISR.
  
During catchup (t=0 to t=7.5s):
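  acks=all writes need only the leader's confirmation (ISR = {Leader}),
  provided min.insync.replicas is 1 (the default). With min.insync.replicas=2,
  these writes are rejected until the follower rejoins.
  Once follower rejoins ISR, acks=all requires both.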
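The 7.5-second figure in this example is simply the gap divided by the fetch rate. The sketch below generalises that arithmetic to a leader that keeps receiving writes during catchup; `catchup_seconds` and the `produce_rate` parameter are illustrative additions, not part of the original example.

```python
def catchup_seconds(leader_offset: int, follower_offset: int,
                    fetch_rate: float, produce_rate: float = 0.0) -> float:
    """Estimate seconds until the follower reaches the leader's log end.

    fetch_rate and produce_rate are in messages per second. If the leader
    grows at least as fast as the follower fetches, it never catches up.
    """
    gap = leader_offset - follower_offset
    if gap <= 0:
        return 0.0
    net_rate = fetch_rate - produce_rate
    if net_rate <= 0:
        return float("inf")
    return gap / net_rate

# The example above: 1500 missing messages at 200 messages/second.
example = catchup_seconds(5000, 3500, fetch_rate=200)
```

If producers keep writing 50 messages per second during recovery, the same gap takes 10 seconds instead, so catchup headroom is worth checking on busy partitions.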

Fetch from Follower: Kafka 2.4+ Feature

Historically, consumers always read from the partition leader regardless of which broker was closest geographically. This worked fine in single-datacenter setups but caused unnecessary cross-datacenter network costs in multi-region deployments.

Kafka 2.4 introduced follower fetching: consumers can read from the closest replica (leader or follower) rather than always going to the leader. When a consumer sets client.rack, the leader can direct it to a replica whose broker.rack matches, typically one in the same rack or availability zone.

FOLLOWER FETCHING (Kafka 2.4+):

Multi-AZ deployment:
  AZ-1: Broker 1 (Leader for P0)
  AZ-2: Broker 2 (Follower for P0)
  AZ-3: Broker 3 (Follower for P0)

Consumer in AZ-2:
  WITHOUT follower fetching: reads from Broker 1 (AZ-1) → cross-AZ traffic cost
  WITH follower fetching:    reads from Broker 2 (AZ-2) → same-AZ, free/cheap

Producer always writes to Leader (Broker 1, AZ-1) - no change.
Consumer reads from closest replica in same AZ - reduced cost.

Configuration:
  Broker: replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
  Broker: broker.rack=az1 (or az2, az3)
  Consumer: client.rack=az2 (must match broker rack label for routing)
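The routing decision can be sketched as a rack match with a fallback to the leader. This is an illustrative model only: `select_read_replica` and the broker names are hypothetical, ties are broken by broker name for determinism, and Kafka's own selector may apply additional criteria.

```python
from typing import Dict, Optional, Set

def select_read_replica(leader: str, replica_racks: Dict[str, str],
                        in_sync: Set[str],
                        client_rack: Optional[str]) -> str:
    """Pick the replica a consumer should read from.

    Prefers an in-sync replica whose rack equals client_rack and falls
    back to the leader when no rack is given or nothing matches.
    """
    if client_rack is not None:
        candidates = sorted(
            broker for broker, rack in replica_racks.items()
            if rack == client_rack and broker in in_sync
        )
        if candidates:
            return candidates[0]
    return leader

# Hypothetical layout mirroring the multi-AZ example above.
racks = {"broker-1": "az1", "broker-2": "az2", "broker-3": "az3"}
in_sync = {"broker-1", "broker-2", "broker-3"}
```

A consumer with `client_rack="az2"` is routed to `broker-2`, while a consumer with no rack, or with a rack label that matches no broker, keeps reading from the leader. This is why a mistyped `client.rack` quietly disables the cost saving.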

Replica Assignment and Rack Awareness

When Kafka assigns replicas to brokers, it defaults to spreading replicas across brokers in round-robin order. For multi-datacenter or multi-AZ deployments, this default can put two replicas in the same rack or availability zone — meaning a single AZ failure takes down two out of three replicas.

Rack-aware replication spreads replicas across different racks (or AZs) so that, as long as there are at least as many racks as replicas, a single AZ failure takes down at most one replica per partition.

RACK-AWARE REPLICA ASSIGNMENT:

Cluster: 6 brokers across 3 AZs (2 brokers per AZ)
  AZ-1: Broker 1, Broker 2
  AZ-2: Broker 3, Broker 4
  AZ-3: Broker 5, Broker 6

Partition P0, RF=3:

WITHOUT rack awareness:
  Leader: Broker 1 (AZ-1)
  Replica: Broker 2 (AZ-1) ← same AZ!
  Replica: Broker 3 (AZ-2)
  AZ-1 failure: loses leader + one replica. Only 1 of 3 copies remains!

WITH rack awareness (broker.rack configured):
  Leader: Broker 1 (AZ-1)
  Replica: Broker 3 (AZ-2)
  Replica: Broker 5 (AZ-3)
  AZ-1 failure: loses leader only. 2 of 3 copies survive. Election proceeds.
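One way to get this spread is to interleave brokers by rack and then take consecutive brokers from that ordering. The sketch below is a simplified illustration of the idea, not Kafka's actual assignment algorithm. The function names are hypothetical, and the distinct-rack property is only guaranteed here when racks are equally sized and the replication factor does not exceed the rack count.

```python
from itertools import zip_longest
from typing import Dict, List

def rack_alternating_order(brokers_by_rack: Dict[str, List[int]]) -> List[int]:
    """Interleave brokers so that neighbours sit in different racks."""
    racks = sorted(brokers_by_rack)
    order: List[int] = []
    for tier in zip_longest(*(brokers_by_rack[rack] for rack in racks)):
        order.extend(broker for broker in tier if broker is not None)
    return order

def assign_replicas(brokers_by_rack: Dict[str, List[int]],
                    partition: int, replication_factor: int) -> List[int]:
    """Return replication_factor broker ids; the first is the leader."""
    order = rack_alternating_order(brokers_by_rack)
    if replication_factor > len(order):
        raise ValueError("replication factor exceeds broker count")
    start = partition % len(order)
    return [order[(start + i) % len(order)] for i in range(replication_factor)]

# The six-broker, three-AZ cluster from the example above.
cluster = {"az1": [1, 2], "az2": [3, 4], "az3": [5, 6]}
```

For this cluster the interleaved order is 1, 3, 5, 2, 4, 6, so partition 0 lands on brokers 1, 3, and 5, matching the rack-aware layout shown above.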

Monitoring Replica Health

KEY METRICS TO MONITOR:

1. Under-replicated partitions:
   Count of partitions where ISR size < replication factor.
   Should always be 0 in a healthy cluster.
   Alert immediately when > 0.

   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

2. ISR shrink rate:
   Rate at which replicas are being removed from ISR.
   High rate = brokers struggling to keep up, network issues, or GC pauses.

   kafka.server:type=ReplicaManager,name=IsrShrinksPerSec

3. ISR expand rate:
   Rate at which replicas rejoin ISR.
   Should balance with shrink rate. High expand + shrink = flapping replicas.

   kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

4. Leader count per broker:
   Should be roughly equal across brokers.
   Large imbalance = one broker doing much more work than others.

   kafka.server:type=ReplicaManager,name=LeaderCount

5. Replica lag (messages behind leader):
   How many messages behind the leader each follower is.
   Non-zero is normal briefly. Persistent lag = follower performance problem.

   kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
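These checks can be combined into a single health evaluation. The sketch below assumes the metric values have already been collected by some external means. The function name, alert strings, and both thresholds (`flap_rate`, `imbalance_ratio`) are hypothetical and should be tuned per cluster.

```python
from typing import Dict, List

def replica_health_alerts(under_replicated: int,
                          isr_shrinks_per_sec: float,
                          isr_expands_per_sec: float,
                          leader_counts: Dict[int, int],
                          flap_rate: float = 0.1,
                          imbalance_ratio: float = 1.5) -> List[str]:
    """Turn a snapshot of replica metrics into a list of alert strings."""
    alerts: List[str] = []
    if under_replicated > 0:
        alerts.append(f"{under_replicated} under-replicated partition(s)")
    # Shrinks and expands both elevated suggests replicas are flapping.
    if isr_shrinks_per_sec > flap_rate and isr_expands_per_sec > flap_rate:
        alerts.append("ISR flapping")
    if leader_counts:
        mean = sum(leader_counts.values()) / len(leader_counts)
        worst = max(leader_counts.values())
        if mean > 0 and worst > imbalance_ratio * mean:
            alerts.append("leader imbalance")
    return alerts
```

A healthy snapshot returns an empty list, which makes the function easy to wire into a periodic check that pages only when the list is non-empty.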

Key Points

  • The leader replica handles all producer writes and, by default, all consumer reads. It maintains the high watermark, the highest offset confirmed by all ISR members.
  • Consumers can only read up to the high watermark. This prevents reading uncommitted data that might not survive a leader failure.
  • Follower replicas continuously fetch from the leader using the same fetch API as consumers. They exist as warm standbys for leader election.
  • ISR members are those within the replica.lag.time.max.ms threshold. Only ISR members count toward acks=all, and only they are eligible for leader election unless unclean.leader.election.enable is set to true.
  • After recovery, a follower catches up by fetching missed messages before rejoining the ISR.
  • Follower fetching (Kafka 2.4+) lets consumers read from the nearest replica, reducing cross-AZ network costs in multi-AZ deployments.
  • Rack-aware replication ensures replicas are spread across AZs, preventing a single AZ failure from taking down multiple replicas of the same partition.
