Kafka Exactly Once Semantics Delivering Messages

Exactly-once semantics (EOS) is the hardest delivery guarantee to achieve in any distributed system. Networks drop packets, brokers crash mid-write, and retries create duplicates. Kafka's exactly-once system combines idempotent producers, transactional writes, and consumer isolation levels to guarantee that each message is processed and its effects applied precisely one time — not zero, not two, but exactly one.

Why Exactly-Once Is Difficult

Consider a payment system: Consumer reads "charge user $100" from Kafka, calls the payment API, then crashes before committing its Kafka offset. On restart, the consumer re-reads the message and calls the payment API again. The user is charged $200. This is a duplicate that at-least-once delivery permits and that business logic cannot easily absorb.

The challenge is that "exactly once" must span three systems: reading from Kafka (consumer), processing (application), and writing results (to Kafka or another system). Each transition is a potential failure point. Kafka's EOS closes all three gaps when both the source and sink are Kafka topics.

The Two Components of Kafka EOS

Kafka's exactly-once guarantee requires two pieces working together: the idempotent producer (which prevents broker-side duplicates) and transactions (which make read-process-write atomic across partitions).

Idempotent Producer Review

The idempotent producer assigns each producer instance a Producer ID (PID) and tags each message with a monotonically increasing sequence number per partition. The broker uses these to deduplicate retries. If the producer retries a message that the broker already stored, the broker silently discards the duplicate and resends the ACK for the original.

The idempotent producer alone guarantees exactly-once to a single partition — but only for writes within a single producer session. It doesn't coordinate across multiple partitions or link the write to the prior read's offset commit.

Kafka Transactions

Transactions extend exactly-once to span multiple partitions and link the output writes to the input offset commit. A transaction groups a set of writes (to any number of topics/partitions) and an offset commit into a single atomic unit. Either everything commits or everything aborts — partial states are never visible to consumers.

TRANSACTION COMPONENTS:

1. Transactional Producer: sends writes + commits inside a transaction
2. Transaction Coordinator: a broker that manages the 2-phase commit protocol
3. Transaction Log: internal Kafka topic (__transaction_state) that records transaction state
4. Read-Committed Consumer: reads only committed transaction data

TRANSACTION FLOW:

[Consumer] reads from input-topic at offset 50
[Producer] beginTransaction()
[Producer] sends result to output-topic partition 0
[Producer] sends result to output-topic partition 1
[Producer] sendOffsetsToTransaction(committed_offset=51, group="my-group")
[Producer] commitTransaction()
              ↕ (2-phase commit via Transaction Coordinator)
All writes and offset become visible simultaneously.

If crash before commitTransaction():
  Transaction Coordinator marks transaction as ABORTED.
  Writes to output-topic are invisible to read_committed consumers.
  offset for my-group stays at 50 (not 51).
  Consumer re-reads from offset 50 on restart → idempotent re-processing.

Setting Up Exactly-Once in Kafka Streams

The easiest way to use Kafka EOS is through Kafka Streams, which handles the transaction lifecycle automatically. You simply set the processing guarantee to exactly-once.

KAFKA STREAMS EOS CONFIGURATION:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
          StreamsConfig.EXACTLY_ONCE_V2);   ← enables EOS automatically

// Kafka Streams handles:
// - transactional producer setup
// - offset commit inside transactions
// - consumer read_committed isolation
// - transaction abort on failure
// - restart and replay from correct position

EXACTLY_ONCE_V2 vs EXACTLY_ONCE:
  EXACTLY_ONCE (EOS v1): one transaction per task per commit interval. More overhead.
  EXACTLY_ONCE_V2 (EOS v2, Kafka 2.5+): shares transactions across tasks per thread.
                                          Lower overhead. Recommended.

Setting Up Exactly-Once with Raw Producer/Consumer API

For non-Streams applications (custom read-process-write loops), you must manage transactions manually.

EXACTLY-ONCE WITH RAW API (pseudocode):

# Producer setup:
props.put("transactional.id", "payment-processor-1")  ← unique per app instance
props.put("enable.idempotence", "true")                ← required for transactions

producer = KafkaProducer(props)
producer.initTransactions()    ← registers with transaction coordinator

# Consumer setup:
props.put("isolation.level", "read_committed")  ← skip uncommitted/aborted data
props.put("enable.auto.commit", "false")         ← manual control

consumer = KafkaConsumer(props)
consumer.subscribe(["input-topic"])

# Processing loop:
while True:
  records = consumer.poll(1000)
  if records.isEmpty(): continue
  
  try:
    producer.beginTransaction()
    
    for record in records:
      result = process(record)
      producer.send(ProducerRecord("output-topic", result))
    
    # Commit the consumer offset INSIDE the transaction:
    offsets = buildOffsetMap(records)
    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata())
    
    producer.commitTransaction()   ← atomic commit of all writes + offsets
    
  except Exception as e:
    producer.abortTransaction()    ← all writes rolled back, offsets unchanged
    # Consumer will re-read the same input records on next poll

The Transaction Coordinator and Two-Phase Commit

Behind the scenes, Kafka implements transactions using a two-phase commit protocol managed by a transaction coordinator broker. Every transactional write goes through this coordinator.

TWO-PHASE COMMIT INTERNALS:

Phase 1: Prepare
  Producer calls commitTransaction()
  Transaction Coordinator writes "PREPARE_COMMIT" to __transaction_state
  Coordinator sends commit markers to all involved partitions
  Each partition broker writes a transaction commit marker to its log

Phase 2: Commit
  If all partition markers written successfully:
    Coordinator writes "COMPLETE_COMMIT" to __transaction_state
    Transaction is committed
  If any partition marker fails:
    Coordinator writes "PREPARE_ABORT"
    Coordinator sends abort markers to all partitions
    Transaction is aborted

Crash Recovery:
  On coordinator restart: reads __transaction_state and resumes incomplete transactions
  On producer restart: initTransactions() contacts coordinator, coordinator rolls back incomplete transactions from previous session

Transactional ID and Fencing

The transactional.id is a string you configure on the producer that uniquely identifies this producer instance's role in the application. When a new producer with the same transactional.id starts, it receives a new epoch (generation number). The old producer instance (if still running — a zombie) gets fenced — its writes are rejected by the broker with an InvalidProducerEpoch error.

This fencing mechanism prevents zombie producers — stale instances that came back from a network partition and are trying to write outdated data — from corrupting the transaction log.

ZOMBIE FENCING:

Scenario: Producer instance A crashes. Instance B starts with same transactional.id.

Broker:  Epoch for "payment-processor-1" was 5 (Instance A's epoch).
         Instance B connects → broker assigns epoch 6.

Instance A recovers from network partition (didn't actually crash):
  Instance A tries to write with epoch 5.
  Broker rejects: "InvalidProducerEpoch. Current epoch is 6."
  Instance A is fenced → cannot write.
  
Only Instance B (epoch 6) can write. Data integrity maintained.

RULE: One transactional.id = one active producer at a time.

Consumer Isolation Levels

Exactly-once on the consumer side requires using isolation.level=read_committed. Without this, consumers can read messages from in-progress or aborted transactions, breaking the exactly-once guarantee.

ISOLATION LEVELS:

read_uncommitted (default):
  Consumer reads ALL messages, including:
  - Messages from committed transactions ✓
  - Messages from aborted transactions (should be invisible!) ✗
  - Messages from in-progress transactions (incomplete!) ✗
  Risk: Consumer processes data that gets rolled back. Inconsistency.

read_committed:
  Consumer reads ONLY:
  - Messages from committed transactions ✓
  - Non-transactional messages (produced without transactions) ✓
  Consumer SKIPS:
  - Messages from aborted transactions ✓
  - Messages from in-progress transactions (waits for commit/abort) ✓
  
  Trade-off: Consumer must wait for transactions to complete before
             reading those messages. Small added latency per transaction.

Exactly-Once in Practice: Performance Considerations

Exactly-once semantics come with overhead. Transactions add coordination steps (two-phase commit), and read_committed consumers have slightly higher read latency. The actual overhead varies, but benchmarks typically show a 20-40% throughput reduction compared to at-least-once. Plan your capacity accordingly if you require EOS.

EOS PERFORMANCE TRADE-OFFS:

Higher safety → lower raw throughput (for same hardware):

at-most-once (acks=0):            ~maximum throughput, highest risk
at-least-once (acks=1, retries):  ~90% of maximum, low risk
exactly-once (transactions):      ~60-80% of maximum, no duplicates/loss

For financial systems: throughput reduction is acceptable cost of correctness.
For high-volume analytics: at-least-once + idempotent consumers often sufficient.

Key Points

Exactly-once semantics guarantees each message is processed and its effect applied precisely one time — no loss, no duplicates.
EOS requires two components: idempotent producers (prevent broker-side duplicates) and transactions (atomic read-process-write across partitions).
The transactional.id uniquely identifies a producer's role. New instances fence stale zombie producers using epoch numbers.
sendOffsetsToTransaction() commits the consumer group offset inside the transaction, making the read and the write atomic.
Consumers must use isolation.level=read_committed to skip aborted and in-progress transaction messages.
Kafka Streams simplifies EOS with a single configuration: StreamsConfig.EXACTLY_ONCE_V2. It handles all transaction lifecycle management automatically.
EOS adds 20-40% throughput overhead. Use it where data correctness is non-negotiable. Use at-least-once with idempotent consumers for high-volume scenarios where duplicates can be tolerated or deduplicated.

Previous lesson

Back to course

Next lesson