How Apache Kafka Works

Understanding Kafka's big picture before diving into details saves you from confusion later. Many learners jump straight into configuration and commands, then struggle to explain why Kafka behaves the way it does. This topic gives you the mental model that makes every other Kafka concept click into place.

Kafka works like a post office combined with a newspaper archive. Messages go in, get stored in an organized way, and get delivered to whoever asks for them — without the sender and receiver ever needing to meet.

The Three Roles in Every Kafka System

Every Kafka deployment involves three main roles. Understanding who plays each role and what they do is the foundation of everything else.

Role 1: The Producer

A producer is any application that creates events and sends them into Kafka. The producer decides what message to send, which topic to send it to, and when to send it. Once the message leaves the producer and enters Kafka, the producer's job is done.

In the real world, a producer might be a mobile app sending user activity events, a payment system sending transaction records, a temperature sensor sending readings every second, or a web server sending HTTP request logs.

Role 2: The Kafka Broker (The Server)

A broker is a Kafka server. It receives messages from producers, stores them on disk, and serves them to consumers when requested. A single broker can handle enormous amounts of data, but real Kafka deployments use multiple brokers working together as a cluster for reliability and scale.

Think of a broker as a filing cabinet. Messages arrive, get labeled with a topic and a position number, and get filed away in exact order. Nothing gets deleted immediately — the broker keeps everything for as long as the retention policy says.

Role 3: The Consumer

A consumer is any application that reads messages from Kafka topics. Consumers decide which topics to subscribe to and pull messages from brokers at their own pace. If a consumer falls behind — maybe it got restarted — it can catch up by replaying messages from where it left off.

The Big Picture Diagram

DATA SOURCES           KAFKA CLUSTER            DATA DESTINATIONS
─────────────────────────────────────────────────────────────────
[Mobile App]    ──┐                        ┌──→ [Analytics DB]
[Payment Sys]   ──┤                        ├──→ [Email Service]
[IoT Sensor]    ──┼──→ [BROKER 1] ─────────┼──→ [Dashboard]
[Web Server]    ──┤    [BROKER 2]          ├──→ [Data Warehouse]
[Database]      ──┘    [BROKER 3]          └──→ [Alert System]
   (Producers)        (Kafka Cluster)           (Consumers)

Messages flow in one direction: left to right.
Each consumer reads at its own speed.
Kafka stores everything in between.

How a Message Travels Through Kafka

Follow a single message from birth to destination to understand the complete journey.

Step 1: The Producer Creates a Message

An online store's checkout system detects a successful payment. It creates a message: "Order #54321 paid. Amount: $89. User: john@example.com. Time: 2:34 PM." This message is an event — a record of something that happened.

Step 2: The Producer Chooses a Topic

The producer picks a topic called "orders" and sends the message to the Kafka broker. Choosing the right topic is like choosing the right department at the post office — "orders," "payments," "user-activity," "inventory-updates." Each topic holds a specific type of event.

Step 3: Kafka Stores the Message

The broker receives the message and writes it to disk inside the "orders" topic. Every message gets a sequence number called an offset, which is unique within its partition (partitions are covered below). The first message gets offset 0, the second gets offset 1, and so on. Once assigned, an offset never changes, so the order is permanent.

TOPIC: orders
──────────────────────────────────────────────────
Offset 0: "Order #12300 paid. $45. At 9:01 AM"
Offset 1: "Order #12301 paid. $120. At 9:03 AM"
Offset 2: "Order #54321 paid. $89. At 2:34 PM"
Offset 3: ← Next message will go here
──────────────────────────────────────────────────
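
The offset numbering above can be sketched with a small in-memory model. This is a toy stand-in for illustration only, not Kafka client code: the class name TopicLog and its methods are hypothetical, and a real broker writes records to disk rather than to a Python list.

```python
class TopicLog:
    """Toy in-memory stand-in for a topic's log (hypothetical, not a Kafka API)."""

    def __init__(self, name):
        self.name = name
        self._records = []  # the list index doubles as the offset

    def append(self, value):
        """Store a record at the end of the log and return the offset it received."""
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


orders = TopicLog("orders")
first = orders.append("Order #12300 paid. $45. At 9:01 AM")
second = orders.append("Order #12301 paid. $120. At 9:03 AM")
third = orders.append("Order #54321 paid. $89. At 2:34 PM")
print(first, second, third)  # 0 1 2
```

Because records are only ever added at the end, the offset handed back by append stays valid for as long as the record exists.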

Step 4: The Consumer Reads the Message

Three different consumers subscribe to the "orders" topic. The warehouse system reads the message to prepare shipping. The email service reads it to send the confirmation email. The analytics dashboard reads it to update the sales total. All three read the same message independently. Reading a message in Kafka does not remove it.

Step 5: The Consumer Tracks Its Position

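Each consumer records how far it has read by committing an offset back to Kafka. This is called the consumer offset, and by convention it is the offset of the next message to read. If the warehouse system crashes after processing offset 1 and committing position 2, it resumes at offset 2 when it restarts. Anything it had read but not yet committed is simply read again, so nothing is lost, although a message may occasionally be processed twice.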
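
A rough sketch of position tracking, assuming the committed offset is stored as the next offset to read. The names log, committed, and consume are made up for this example, and the dictionary stands in for the storage that survives a consumer restart.

```python
log = ["order-0", "order-1", "order-2", "order-3"]  # offsets 0..3
committed = {}  # consumer name -> next offset to read (hypothetical store)


def consume(name, max_records):
    """Read up to max_records from the committed position, then commit the new position."""
    start = committed.get(name, 0)
    batch = log[start:start + max_records]
    # Commit only after the batch has been handled, so a crash before
    # this line would cause a re-read instead of a skipped message.
    committed[name] = start + len(batch)
    return batch


before_crash = consume("warehouse", 2)   # reads offsets 0 and 1
# ... the warehouse process crashes and restarts here; `committed` survives
after_restart = consume("warehouse", 2)  # resumes at offset 2
print(before_crash, after_restart)
```

The consumer itself holds no state here. Everything it needs to resume lives in the committed position.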

Topics and Partitions: Kafka's Storage Structure

A topic is like a category folder. But Kafka doesn't store all messages for a topic in a single file. It splits each topic into partitions for parallelism and scalability.

Why Partitions Exist

Imagine a single cashier at a grocery store. One long line forms. Now add four cashiers and four parallel lines form, so everyone gets served faster. Partitions work the same way. Instead of one queue of messages, a topic can have many parallel partitions spread across the brokers in the cluster, allowing many producers to write and many consumers to read simultaneously.

TOPIC: user-activity (with 3 partitions)

Partition 0: [msg@0] [msg@1] [msg@2] [msg@3] ← stored on Broker 1
Partition 1: [msg@0] [msg@1] [msg@2]          ← stored on Broker 2
Partition 2: [msg@0] [msg@1] [msg@2] [msg@3] ← stored on Broker 3

Consumer A reads from Partition 0.
Consumer B reads from Partition 1.
Consumer C reads from Partition 2.
All three work in parallel, for up to three times the throughput of a single consumer.
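
One way to picture how partitions get shared out among cooperating consumers is a simple round-robin deal. The function below is a hypothetical illustration. Kafka's real assignment strategies are configurable and more involved.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers one at a time, like dealing cards."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


three = assign_round_robin([0, 1, 2], ["A", "B", "C"])
two = assign_round_robin([0, 1, 2], ["A", "B"])
print(three)  # {'A': [0], 'B': [1], 'C': [2]}
print(two)    # {'A': [0, 2], 'B': [1]}
```

With fewer consumers than partitions, some consumers take on more than one partition. With more consumers than partitions, the extras would sit idle in this model, which is why the partition count caps the useful parallelism.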

What Goes in Each Partition

Messages inside a single partition are always in order — offset 0 comes before offset 1, which comes before offset 2. But across partitions, there is no guaranteed order. If you need strict ordering, you route all related messages to the same partition using a message key (covered in a later topic).
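
Key-based routing usually comes down to hashing the key and taking the result modulo the number of partitions. The sketch below uses CRC32 as a stand-in hash so that it runs with only the standard library. Kafka's own default partitioner uses a different hash function, so the partition numbers here will not match a real cluster. The function name choose_partition is hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every event for the same user lands in the same partition,
# so that user's events keep their relative order.
first = choose_partition("user-42", 3)
second = choose_partition("user-42", 3)
print(first == second)  # True
```

The important property is stability: the same key always maps to the same partition as long as the partition count stays the same.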

Kafka as a Log: The Fundamental Data Structure

At its heart, Kafka is a distributed commit log. Understanding the log data structure explains most of Kafka's behavior.

A log is an append-only sequence of records. New records go to the end. Records already in the log never change. This is different from a database table where you can update or delete rows.

A Kafka Partition as a Log:

TIME ──────────────────────────────────────────────────────→

[Event A] [Event B] [Event C] [Event D] [Event E] [new →]
offset:0   offset:1   offset:2   offset:3   offset:4

Rules:
✓ New messages always append to the right end
✓ Existing messages never change
✓ Consumers read from left (old) to right (new)
✓ Messages stay for the configured retention period

This log structure gives Kafka its key powers: replay (go back to the earliest retained offset and re-read everything), ordering (messages in a partition are always in the order they arrived), and durability (messages sit on disk until the retention period expires).
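
The interaction between retention and offsets can be sketched with a toy log that drops old records without renumbering the rest. The class RetainedLog and its methods are invented for this example, and timestamps are passed in by hand to keep it deterministic. A real broker removes data in larger chunks, not record by record.

```python
from collections import deque


class RetainedLog:
    """Toy append-only log with time-based retention (hypothetical model)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = deque()  # entries are (offset, timestamp, value)
        self._next_offset = 0

    def append(self, value, now):
        """Add a record at the end and return its offset."""
        offset = self._next_offset
        self._records.append((offset, now, value))
        self._next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention period, oldest first."""
        while self._records and now - self._records[0][1] > self.retention_seconds:
            self._records.popleft()

    def read_from(self, offset):
        """Replay every retained record at or after the given offset."""
        return [(o, v) for o, _, v in self._records if o >= offset]


log = RetainedLog(retention_seconds=60)
log.append("Event A", now=0)
log.append("Event B", now=50)
log.append("Event C", now=90)
log.expire(now=100)        # Event A is 100 seconds old and gets dropped
replayed = log.read_from(0)
print(replayed)            # [(1, 'Event B'), (2, 'Event C')]
```

Event B keeps offset 1 after Event A expires. Offsets identify records for their whole lifetime. They are not positions in whatever happens to be left.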

Kafka Clusters: Multiple Brokers Working Together

A single Kafka broker can handle a lot, but real production systems need more. Multiple brokers form a cluster. A cluster provides three things a single broker cannot: fault tolerance (if one broker dies, others keep the data), higher throughput (more brokers means more parallel processing), and horizontal scaling (add brokers when you need more capacity).

KAFKA CLUSTER (3 Brokers)

┌──────────────────────────────────────────────┐
│  BROKER 1         BROKER 2         BROKER 3  │
│  ──────────       ──────────       ────────  │
│  Topic A P0       Topic A P1       Topic A P2│
│  Topic B P0       Topic B P0       Topic B P0│
│  (replica)        (leader)         (replica) │
└──────────────────────────────────────────────┘

Data is spread AND copied across brokers.
If Broker 2 crashes, the replica on Broker 1 or 3 is elected leader and takes over.
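
Failover for a single partition can be pictured as picking the first surviving replica from an ordered list. This is a deliberately simplified rule and elect_leader is a hypothetical helper. In a real cluster the choice is made by the cluster's coordination layer and only considers replicas that are fully caught up.

```python
def elect_leader(replicas, alive):
    """Return the first replica that is still alive, or None if all are down (toy rule)."""
    for broker in replicas:
        if broker in alive:
            return broker
    return None


# A partition replicated on Brokers 2, 1, and 3, with Broker 2 listed first.
replicas = [2, 1, 3]
leader_before = elect_leader(replicas, alive={1, 2, 3})
leader_after = elect_leader(replicas, alive={1, 3})  # Broker 2 has crashed
print(leader_before, leader_after)  # 2 1
```

If every replica is down, the function returns None, which mirrors the fact that a partition with no surviving copy cannot serve reads or writes.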

ZooKeeper and KRaft: How the Cluster Coordinates

For a cluster to work, the brokers need to coordinate with each other — who is the leader for each partition, which brokers are alive, what topics exist. Historically, Kafka relied on a separate tool called ZooKeeper to manage this coordination. Think of ZooKeeper as the cluster manager sitting above the brokers, keeping track of everything.

KRaft mode first appeared as an early-access feature in Kafka 2.8 and was marked production-ready in 3.3. It removes the need for ZooKeeper entirely: Kafka manages its own coordination internally using a built-in consensus protocol. The result is a simpler, faster, and easier-to-manage system. Kafka 4.0 dropped ZooKeeper support altogether, but many existing clusters still run older ZooKeeper-based versions, so both modes are still in use today, and this course covers both.

Push vs Pull: How Data Moves in Kafka

One important design choice in Kafka is that consumers pull data from brokers rather than brokers pushing data to consumers.

In a push model, the broker sends messages to consumers when they arrive. The consumer has no control over the rate — it might get overwhelmed if messages arrive faster than it can process them.

In Kafka's pull model, consumers ask the broker for messages whenever they are ready. If the consumer is busy, it simply waits to ask. If messages pile up while the consumer is busy, they stay safely stored in the broker. When the consumer is ready, it pulls the waiting messages in batches until it has caught up.

PUSH MODEL (Not Kafka):                PULL MODEL (Kafka):

BROKER → pushes → CONSUMER             CONSUMER → asks → BROKER
                                        BROKER → responds → CONSUMER

Risk: Consumer gets overwhelmed.        Benefit: Consumer controls its pace.
No backpressure mechanism.             Backlog waits safely in Kafka.
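
The pull model can be sketched as a loop in which the consumer decides how much to ask for each time. The names backlog, fetch, and max_records are chosen for this example. The list stands in for messages waiting on the broker.

```python
backlog = [f"msg-{i}" for i in range(7)]  # messages waiting on the broker


def fetch(position, max_records):
    """Return up to max_records starting at position. Nothing is removed from the backlog."""
    return backlog[position:position + max_records]


position = 0
batch_sizes = []
while True:
    batch = fetch(position, max_records=3)  # the consumer picks the batch size
    if not batch:
        break                               # caught up, nothing left to read
    batch_sizes.append(len(batch))
    position += len(batch)

print(batch_sizes)   # [3, 3, 1]
print(len(backlog))  # 7
```

A slower consumer could pass a smaller max_records or wait longer between calls. The broker side does not change either way.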

The Role of Kafka in Modern Architectures

Modern software systems increasingly use event-driven architecture, in which applications communicate by producing and consuming events rather than calling each other directly. Kafka is a common backbone for event-driven architectures at scale.

Microservices use Kafka so that each service is independent. A payment service emits an event; an inventory service, email service, and analytics service each consume that event on their own schedule without the payment service needing to know they exist.

Data pipelines use Kafka to stream data from production databases into data warehouses, data lakes, and analytics tools in near real time. Instead of batch exports that run once a day, data flows continuously through Kafka.

EVENT-DRIVEN MICROSERVICES VIA KAFKA

[Payment Service] → event: "payment_completed"
                    ↓ written to Kafka topic "payments"
                    ├──→ [Inventory Service] reads it → updates stock
                    ├──→ [Email Service] reads it → sends receipt
                    ├──→ [Analytics Service] reads it → records revenue
                    └──→ [Fraud Service] reads it → checks patterns

Payment Service has zero knowledge of what happens next.
Each downstream service is independent and replaceable.
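
The independence of downstream services can be sketched by giving each one its own read position over the same list of events. Everything here (payments, positions, publish, read_new) is a hypothetical in-memory model, not a Kafka API.

```python
payments = []  # toy stand-in for the "payments" topic
positions = {"inventory": 0, "email": 0}  # each service tracks its own position


def publish(event):
    """The payment service appends an event and knows nothing about readers."""
    payments.append(event)


def read_new(service):
    """Return the events this service has not seen yet and advance its position."""
    start = positions[service]
    events = payments[start:]
    positions[service] = len(payments)
    return events


publish({"type": "payment_completed", "order": 1})
inventory_first = read_new("inventory")    # inventory keeps up
publish({"type": "payment_completed", "order": 2})
inventory_second = read_new("inventory")
email_catch_up = read_new("email")         # email was offline, reads both now
print([e["order"] for e in email_catch_up])  # [1, 2]
```

Adding another downstream service in this model means adding one entry to positions. The publish function stays untouched.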

What Kafka Does Not Do

Kafka is not a database — you cannot run SQL queries on it. Kafka is not a message queue designed for routing complex workflows — RabbitMQ handles that better. Kafka is not designed for very small, simple systems where a basic queue is sufficient. Knowing what Kafka is not helps you choose the right tool for the right job.

Key Points

  • Kafka has three roles: producers create events, brokers store them, consumers read them.
  • Messages travel from producer → Kafka broker (stored in a topic) → consumer.
  • Topics are divided into partitions. Each partition is an ordered, append-only log with offsets.
  • Consumers pull data from Kafka at their own pace. Data stays in Kafka even after being read.
  • A Kafka cluster is multiple brokers working together for fault tolerance and throughput.
  • Kafka is the backbone of event-driven architectures and real-time data pipelines.
