Replication in Apache Kafka

Data stored on a single machine is always at risk — disks fail, servers crash, data centers lose power. Replication is Kafka's answer to this risk. It stores multiple identical copies of each partition's data on different brokers so that even when a machine fails, your data survives and your system keeps running. Replication is the foundation of Kafka's fault tolerance and is what makes Kafka suitable for production workloads where data loss is unacceptable.

What Replication Means in Kafka

When you create a topic with a replication factor of 3, Kafka stores three copies of every message in that topic — one primary copy and two backup copies. Each copy lives on a different broker. The copies stay in sync as new messages arrive. If one broker dies, the two surviving brokers still have complete copies of all data, and the system elects a new leader automatically.

The Notebook Analogy

Imagine you have one very important notebook that you cannot afford to lose. If you keep only one copy on your desk, a fire or flood destroys it permanently. If you make three copies — one on your desk, one in a drawer, one in a fireproof safe — you need all three locations to fail simultaneously to lose the data. Kafka's replication works exactly this way across multiple broker servers.

TOPIC: payments (3 partitions, replication factor = 3)

Cluster: Broker 1, Broker 2, Broker 3

Partition 0:
  Broker 1: [pay1][pay4][pay7]...  ← LEADER (primary copy)
  Broker 2: [pay1][pay4][pay7]...  ← FOLLOWER (replica 1)
  Broker 3: [pay1][pay4][pay7]...  ← FOLLOWER (replica 2)

Partition 1:
  Broker 2: [pay2][pay5][pay8]...  ← LEADER
  Broker 1: [pay2][pay5][pay8]...  ← FOLLOWER
  Broker 3: [pay2][pay5][pay8]...  ← FOLLOWER

Partition 2:
  Broker 3: [pay3][pay6][pay9]...  ← LEADER
  Broker 1: [pay3][pay6][pay9]...  ← FOLLOWER
  Broker 2: [pay3][pay6][pay9]...  ← FOLLOWER

Every message exists on all 3 brokers. Any 1 broker can fail safely.
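
The layout above can be reproduced with a small model. This is a simplified sketch, not Kafka's actual placement algorithm: real brokers pick a randomized starting broker and can take racks into account. The function name assign_replicas is hypothetical.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Simplified round-robin placement (an assumption, not Kafka's real algorithm).

    Partition p starts at broker index p and wraps around the broker list.
    The first broker in each list is that partition's initial leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return assignment


layout = assign_replicas([1, 2, 3], num_partitions=3, replication_factor=3)
for partition, replicas in layout.items():
    print(f"Partition {partition}: leader={replicas[0]} followers={replicas[1:]}")
```

Each broker ends up leading exactly one partition, which is the balanced starting point the diagram shows.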

Leaders and Followers

For each partition, one broker is designated the leader. All producer writes go to the leader. All consumer reads come from the leader (by default). The leader is the single source of truth for that partition's current state.

The other brokers holding copies of the partition are followers. (The leader and the followers together are the partition's replicas.) Followers do not receive direct producer writes. Instead, they continuously fetch new messages from the leader, using the same pull model that consumers use, and append them to their local copy of the partition. This fetch-and-append loop is how replication happens.

REPLICATION FLOW:

PRODUCER → [Leader: Broker 1] → sends ACK to producer (based on acks setting)
                    ↕ (followers pull from leader)
           [Follower: Broker 2] fetches new messages from Broker 1
           [Follower: Broker 3] fetches new messages from Broker 1

Leader: "Here are messages at offsets 100, 101, 102."
Follower 2: writes to local log → "Fetched 100, 101, 102."
Follower 3: writes to local log → "Fetched 100, 101, 102."
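
As a rough illustration of the pull model, the sketch below keeps each replica's log as an in-memory list. The Replica class and fetch_from_leader function are hypothetical names for this example; a real follower sends fetch requests over the network and the leader tracks each follower's position.

```python
class Replica:
    """One broker's copy of a partition, modelled as an in-memory list."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

    @property
    def log_end_offset(self):
        # Offset that the next appended message will receive.
        return len(self.log)


def fetch_from_leader(leader, follower, max_records=100):
    """Follower pulls everything after its own log end and appends it locally."""
    start = follower.log_end_offset
    batch = leader.log[start:start + max_records]
    follower.log.extend(batch)
    return len(batch)


leader = Replica(1)
followers = [Replica(2), Replica(3)]

# Producer writes go to the leader only.
leader.log.extend(["pay1", "pay4", "pay7"])

# Each follower pulls on its own schedule.
for follower in followers:
    copied = fetch_from_leader(leader, follower)
    print(f"Broker {follower.broker_id} fetched {copied} messages")
```

Because the follower asks for data starting at its own log end offset, a follower that was offline for a while simply resumes from where it stopped.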

The In-Sync Replica Set (ISR)

Not all replicas are equally caught up at all times. A replica that is keeping up with the leader, meaning it has caught up to the end of the leader's log within a configured time window, is part of the In-Sync Replica set (ISR). A replica that stays behind for longer than that window (or becomes unreachable) is removed from the ISR. The threshold is measured in time, not in number of messages.

The ISR is critical because it defines which replicas can safely be elected as the new leader if the current leader fails. Only ISR members are eligible for leader election by default. This prevents electing a stale replica that might be missing recent messages.

IN-SYNC REPLICA (ISR) TRACKING:

Leader: Broker 1 (currently at offset 500)
Broker 2: fetched up to offset 499 → in ISR (keeps catching up within the time window)
Broker 3: fetched up to offset 420 → in ISR? Depends on how long it has been behind

Threshold: replica.lag.time.max.ms = 10000 (10 seconds in this example)
  If Broker 3 hasn't fetched, or hasn't caught up to the leader's log end, for 10+ seconds → removed from ISR
  If Broker 3 is 80 messages behind but was fully caught up within the last 10 seconds → stays in ISR

ISR = {Broker1, Broker2} if Broker3 lags too long.

Producer with acks=all: must wait for all ISR members to confirm.
If ISR = {B1, B2}: only need B1 and B2 to confirm (B3 excluded).
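
The effect of ISR membership on acks=all can be sketched with a few lines of arithmetic. The term "high watermark" is not used elsewhere in this article; it is Kafka's name for the lowest log end offset across the ISR, and messages below it count as committed. The function names and offsets here are illustrative assumptions.

```python
def high_watermark(log_end_offsets, isr):
    """Lowest log end offset among ISR members; offsets below it are committed."""
    return min(log_end_offsets[broker] for broker in isr)


def is_acknowledged(offset, log_end_offsets, isr):
    """With acks=all, a message is acknowledged once every ISR member holds it."""
    return all(log_end_offsets[broker] > offset for broker in isr)


# Log end offset = offset of the next message the broker will write,
# so a broker with log end offset 499 holds offsets 0..498.
log_end_offsets = {1: 500, 2: 499, 3: 420}

print(high_watermark(log_end_offsets, {1, 2, 3}))
print(high_watermark(log_end_offsets, {1, 2}))
print(is_acknowledged(498, log_end_offsets, {1, 2}))
```

While the lagging Broker 3 is still in the ISR, it holds back acknowledgements for everyone. Once it is removed, acks=all only waits for Broker 1 and Broker 2.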

Replica Lag Configuration

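replica.lag.time.max.ms: If a follower hasn't sent a fetch request to the leader, or hasn't caught up to the leader's log end offset, within this time, it's removed from the ISR. The default is 30 seconds (30000 ms) in Kafka 2.5 and later; earlier versions defaulted to 10 seconds. The follower rejoins the ISR once it catches back up to the leader.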
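
A minimal sketch of the time-based check, assuming the leader records the last moment each follower was fully caught up. The function name and the timestamps are made up for illustration; only the parameter replica_lag_time_max_ms mirrors a real broker setting.

```python
def in_sync_replicas(leader_id, last_caught_up_ms, now_ms,
                     replica_lag_time_max_ms=30_000):
    """Return the ISR: the leader plus every follower caught up recently enough.

    last_caught_up_ms maps follower id -> last time (ms) the follower had
    fetched up to the leader's log end offset.
    """
    isr = {leader_id}
    for broker_id, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= replica_lag_time_max_ms:
            isr.add(broker_id)
    return isr


now = 100_000
last_caught_up = {2: 99_500, 3: 60_000}  # Broker 3 last caught up 40 s ago

print(sorted(in_sync_replicas(1, last_caught_up, now)))
print(sorted(in_sync_replicas(1, last_caught_up, now,
                              replica_lag_time_max_ms=60_000)))
```

With the 30 second default, Broker 3 drops out of the ISR. Raising the threshold to 60 seconds keeps it in, at the price of slower detection of a stuck follower.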

Leader Election: What Happens When a Leader Fails

When a partition's leader broker fails, Kafka's controller detects the failure (through ZooKeeper session expiry in older clusters, or missed broker heartbeats in KRaft mode) and initiates leader election. The new leader comes from the partition's ISR: the controller picks the first replica in the partition's replica list that is alive and in the ISR. It does not compare offsets between candidates; ISR membership is what guarantees the candidate holds every acknowledged acks=all message.

LEADER ELECTION SCENARIO:

Before failure:
  Partition 0 Leader: Broker 1 (latest offset 500)
  Replicas: Broker 1, Broker 2, Broker 3
  ISR: {Broker 1, Broker 2, Broker 3}
  Broker 2: synced to offset 499
  Broker 3: synced to offset 498

Broker 1 CRASHES:

Controller detects failure.
ISR without Broker 1: {Broker 2, Broker 3}
Controller elects Broker 2 as new leader (first live ISR member in the replica list).

Broker 2 becomes leader:
  Its log ends at offset 499, so the next producer write gets offset 500.
  Broker 3 now fetches from Broker 2.

What happens to the messages at offsets 499 and 500?
  offset 499: SAFE (Broker 2 had it)
  offset 500 (written to Broker 1 only):
    acks=1: LOST. The producer was already told the write succeeded.
    acks=all: never acknowledged, because the ISR had not replicated it yet.
              The producer sees an error or timeout and retries against Broker 2.
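
The controller's choice can be modelled in a few lines. This sketch assumes the rule that the first replica in the assignment list that is both alive and in the ISR wins; elect_leader is a hypothetical name, not a Kafka API.

```python
def elect_leader(replicas, isr, live_brokers):
    """Pick the first replica (in assignment order) that is alive and in the ISR.

    Returns None when no eligible replica exists, which leaves the
    partition without a leader.
    """
    for broker_id in replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id
    return None


replicas = [1, 2, 3]      # assignment order for Partition 0
isr = {1, 2, 3}           # ISR just before the crash
live_brokers = {2, 3}     # Broker 1 has crashed

new_leader = elect_leader(replicas, isr, live_brokers)
print(f"New leader for Partition 0: Broker {new_leader}")
```

Offsets never enter the decision. The election is safe only because ISR membership already implies the candidate has every committed message.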

Unclean Leader Election: The Availability vs Consistency Trade-off

By default, Kafka only elects leaders from the ISR (unclean.leader.election.enable=false). If all ISR members are down and only out-of-sync replicas remain, Kafka refuses to elect a new leader — the partition becomes unavailable. This protects data integrity: it is better to be temporarily unavailable than to serve stale or missing data.

Setting unclean.leader.election.enable=true allows Kafka to elect a leader from out-of-sync replicas when no ISR members are available. This restores availability at the cost of data loss: any messages the old leader held that the new leader never replicated are gone, and the other replicas truncate their logs to match the new leader.

UNCLEAN ELECTION TRADE-OFF:

Scenario: All ISR members crashed. Only stale Broker 4 (RF=4) available.
Broker 4 is at offset 450. Leader was at offset 500 before crash.
Offsets 451-500 exist only in crashed brokers.

unclean.leader.election.enable=false (default):
  Kafka refuses to elect Broker 4.
  Partition UNAVAILABLE until at least one ISR member recovers.
  ✓ No data loss   ✗ Downtime until recovery

unclean.leader.election.enable=true:
  Kafka elects Broker 4 as leader despite lag.
  Partition AVAILABLE immediately.
  Offsets 451-500 are PERMANENTLY LOST.
  ✗ Data loss      ✓ Immediate availability

For financial and audit data: keep false.
For metrics and analytics (some loss acceptable): can enable true.
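
The trade-off can be expressed as a small decision function. This is a sketch under the assumptions of the scenario above (RF=4, only the stale Broker 4 alive); the function names and the "clean"/"unclean"/"offline" labels are invented for the example.

```python
def elect_with_unclean_option(replicas, isr, live_brokers, unclean_enabled):
    """Prefer a live ISR member; fall back to any live replica only if allowed."""
    for broker_id in replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id, "clean"
    if unclean_enabled:
        for broker_id in replicas:
            if broker_id in live_brokers:
                return broker_id, "unclean"
    return None, "offline"


def lost_offsets(old_leader_last_offset, new_leader_last_offset):
    """Offsets the old leader held that the new leader never received."""
    return range(new_leader_last_offset + 1, old_leader_last_offset + 1)


replicas = [1, 2, 3, 4]
isr = {1, 2, 3}          # all ISR members have crashed
live_brokers = {4}       # only the stale replica is up

print(elect_with_unclean_option(replicas, isr, live_brokers, unclean_enabled=False))
print(elect_with_unclean_option(replicas, isr, live_brokers, unclean_enabled=True))

lost = lost_offsets(old_leader_last_offset=500, new_leader_last_offset=450)
print(f"Lost offsets: {lost.start}-{lost.stop - 1} ({len(lost)} messages)")
```

With the flag off, the partition stays offline and nothing is lost. With it on, Broker 4 takes over and the 50 messages at offsets 451-500 cannot be recovered through Kafka.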

Preferred Leader Election

When Kafka initially assigns partitions across brokers, it creates a balanced distribution — each broker leads roughly the same number of partitions. Over time, as brokers crash and recover, the leadership can drift — some brokers end up leading many more partitions than others (leader imbalance).

Kafka tracks which broker was the "preferred" (original) leader for each partition. The preferred leader is always the first replica in the replica list. Kafka can automatically rebalance leadership back to preferred leaders using auto.leader.rebalance.enable=true (default true). This runs in the background and restores the original balanced distribution after broker recoveries.

PREFERRED LEADER REBALANCING:

Original assignment:
  P0 preferred leader: Broker 1   P1: Broker 2   P2: Broker 3

After Broker 1 crashes and recovers:
  Broker 2 might still be leading P0 (from emergency election)
  P0 actual leader: Broker 2 (not preferred)

auto.leader.rebalance.enable=true:
  Background task detects imbalance.
  Triggers preferred leader election.
  Broker 1 takes back leadership of P0 (it is recovered and in ISR).

Result: Leadership balanced back to original distribution.
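
A rough model of the background check follows. It assumes the controller compares, per broker, the share of partitions where that broker is preferred but not currently leading against a threshold (the broker setting leader.imbalance.per.broker.percentage, 10 percent by default). The function names and data structures are hypothetical.

```python
def imbalance_ratio(broker_id, assignments, current_leaders):
    """Share of this broker's preferred partitions that it does not lead."""
    preferred = [p for p, replicas in assignments.items() if replicas[0] == broker_id]
    if not preferred:
        return 0.0
    not_leading = [p for p in preferred if current_leaders[p] != broker_id]
    return len(not_leading) / len(preferred)


def rebalance(assignments, current_leaders, isr, threshold=0.10):
    """Move leadership back to preferred leaders that are in the ISR."""
    new_leaders = dict(current_leaders)
    preferred_brokers = {replicas[0] for replicas in assignments.values()}
    for broker_id in preferred_brokers:
        if imbalance_ratio(broker_id, assignments, current_leaders) > threshold:
            for p, replicas in assignments.items():
                if replicas[0] == broker_id and broker_id in isr[p]:
                    new_leaders[p] = broker_id
    return new_leaders


assignments = {0: [1, 2, 3], 1: [2, 3, 1], 2: [3, 1, 2]}
current_leaders = {0: 2, 1: 2, 2: 3}   # Broker 2 took over P0 during the outage
isr = {0: {1, 2, 3}, 1: {1, 2, 3}, 2: {1, 2, 3}}   # Broker 1 has caught up again

print(rebalance(assignments, current_leaders, isr))
```

Leadership only moves back when the preferred leader is in the ISR, so a broker that has restarted but not yet caught up does not regain its partitions.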

Replication Factor Recommendations

REPLICATION FACTOR GUIDE:

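RF=1: One copy, no redundancy. A broker failure takes the partition offline, and a lost disk means lost data. Never for production.
RF=2: Two copies (leader plus one follower). Survives 1 broker failure, but no backup remains afterwards (second failure = loss).
RF=3: Three copies (leader plus two followers). Survives 1 broker failure and still has a backup copy.
      Standard recommendation for most production workloads.
RF=5: Five copies. Survives 2 simultaneous failures with backups to spare. For critical data, large clusters.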

Required brokers: Must have at least as many brokers as RF.
RF=3 requires minimum 3 brokers.
RF=5 requires minimum 5 brokers.

Cost of higher RF: More disk usage (RF=3 uses 3x the raw data size).
Benefit: Higher durability and fault tolerance.
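
The cost and benefit are simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the failure count is an upper bound that assumes every replica was fully in sync when the failures happened.

```python
def replication_summary(replication_factor, raw_bytes, broker_count):
    """Disk cost and best-case failure tolerance for a given replication factor."""
    if replication_factor > broker_count:
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        "copies": replication_factor,
        "disk_bytes": raw_bytes * replication_factor,
        # Best case: one in-sync copy is still left after this many broker losses.
        "max_broker_losses_with_a_copy_left": replication_factor - 1,
    }


one_tb = 10**12
for rf in (1, 2, 3, 5):
    summary = replication_summary(rf, raw_bytes=one_tb, broker_count=5)
    print(f"RF={rf}: {summary['disk_bytes'] / one_tb:.0f} TB on disk, "
          f"at most {summary['max_broker_losses_with_a_copy_left']} broker losses")
```

In practice you plan for fewer failures than this upper bound, because running on the last remaining copy leaves no margin for another fault.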

Observing Replication in Practice

# Describe topic to see replica and ISR status:
bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --topic payments

Output:
Topic: payments  Partitions: 3  ReplicationFactor: 3
  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2

Healthy state: ISR matches Replicas for all partitions.

After Broker 2 crashes:
  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,3   ← Broker 2 removed from ISR
  Partition: 1  Leader: 3  Replicas: 2,3,1  Isr: 3,1   ← Broker 3 now leads (Broker 2 was leader)
  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1   ← Broker 2 removed from ISR

Monitor ISR size. If any partition has ISR smaller than RF: investigate immediately.
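
For a quick automated check, the describe output can be parsed to list partitions whose ISR is smaller than the replica list. This is a sketch that assumes the line format shown above; real output is tab-separated and newer Kafka versions print extra fields, which the pattern tolerates as long as Replicas and Isr appear in this order. The kafka-topics.sh tool also has a built-in --under-replicated-partitions option for --describe that does this filtering for you.

```python
import re

PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def under_replicated(describe_output):
    """Return (partition, missing_broker_ids) for every partition with ISR < replicas."""
    result = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # skips the topic summary line and blanks
        partition = int(match.group(1))
        replicas = set(match.group(3).split(","))
        isr = set(filter(None, match.group(4).split(",")))
        missing = replicas - isr
        if missing:
            result.append((partition, sorted(int(b) for b in missing)))
    return result


sample = """\
Topic: payments  Partitions: 3  ReplicationFactor: 3
  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,3
  Partition: 1  Leader: 3  Replicas: 2,3,1  Isr: 3,1
  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1
"""

for partition, missing in under_replicated(sample):
    print(f"Partition {partition} is under-replicated; missing brokers: {missing}")
```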

Key Points

  • Replication stores multiple copies of each partition across different brokers. Replication factor determines how many copies exist.
  • Each partition has one leader (handles all writes and, by default, all reads). The remaining replicas are followers that replicate from the leader.
  • The In-Sync Replica (ISR) set contains replicas that are sufficiently caught up with the leader. Only ISR members are eligible for leader election.
  • When a leader fails, Kafka's controller elects a new leader from the ISR. This happens automatically with no operator intervention.
  • unclean.leader.election.enable=false (default) prevents data loss by refusing to elect stale replicas. Enable it only when availability matters more than consistency.
  • Replication factor 3 is the standard production recommendation. It tolerates 1 broker failure with no data loss and still keeps a backup copy.
  • Monitor ISR size. ISR smaller than replication factor means reduced fault tolerance — a second failure could cause data loss.
