How Messages Are Stored in Apache Kafka

Understanding how Kafka stores messages on disk reveals why Kafka is so fast, why data is durable, and how features like replay and retention work in practice. Most Kafka users treat the broker as a black box — messages go in, messages come out. But knowing the internal storage mechanism helps you make better decisions about configuration, disk sizing, and performance tuning.

The Anatomy of a Kafka Message

Every message in Kafka has the same structure. Understanding each component tells you what information Kafka tracks and what you can control as a producer.

Message Components

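Key (optional): A byte array that identifies the message. The producer's partitioner hashes the key to choose a partition, so two messages with the same key go to the same partition (as long as the topic's partition count does not change), preserving order for that key. Keys can be null. Without a key, the default partitioner spreads messages across partitions: round-robin in older clients, and since Kafka 2.4 a "sticky" strategy that fills a batch for one partition before moving to the next.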

Value: The actual message content. A byte array. Kafka does not care what format your value is in — it stores raw bytes. You can use plain text, JSON, Avro, Protobuf, or any binary format. Serialization and deserialization are handled by the producer and consumer, not Kafka.

Timestamp: When the message was created. Either set by the producer (message creation time) or set by the broker when it receives the message (log append time). The broker configuration log.message.timestamp.type (topic-level override: message.timestamp.type) selects between CreateTime, the default, and LogAppendTime.

Headers (optional): Key-value metadata pairs attached to the message without being part of the main value. Headers carry routing information, tracing IDs, content types, or any metadata you want to attach without modifying the message body.

Offset: Assigned by the broker. The unique sequential position of the message within its partition. The producer does not set the offset — Kafka assigns it when the message is stored.

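Partition: Chosen on the producer side before the message is sent, not by the broker. The producer's partitioner picks it from the key hash (or from its no-key strategy when the key is null), unless the producer names a partition explicitly. Tells you which partition the message was stored in.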

KAFKA MESSAGE STRUCTURE:

┌────────────────────────────────────────────────────┐
│  KEY         │ "user_7823"                         │
│  VALUE       │ {"action":"login","ip":"10.0.0.5"}  │
│  TIMESTAMP   │ 1709654400000 (epoch ms)            │
│  HEADERS     │ {"trace-id": "abc123",              │
│              │  "content-type": "application/json"}│
│  OFFSET      │ 2847 (assigned by broker)           │
│  PARTITION   │ 2 (determined by key hash)          │
└────────────────────────────────────────────────────┘
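The key-to-partition mapping can be sketched as a hash of the key bytes taken modulo the partition count. This is a minimal illustration, not Kafka's implementation: Kafka's default partitioner hashes keys with murmur2, while the sketch below substitutes zlib.crc32 from the Python standard library. The names choose_partition and num_partitions are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index as hash(key) mod partition count.

    Assumption: zlib.crc32 is a stand-in hash. Kafka's default partitioner
    uses murmur2, so real partition numbers will differ from these.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


# The same key maps to the same partition while the partition count is fixed.
first = choose_partition(b"user_7823", 6)
second = choose_partition(b"user_7823", 6)
print(first == second)  # True
```

Because the result depends on num_partitions, adding partitions to a topic can move a key to a different partition, which is why per-key ordering only holds while the partition count stays the same.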

How Kafka Writes Messages to Disk

Kafka's storage engine is built for one primary goal: write messages as fast as physically possible. It achieves this through several storage techniques.

Sequential Disk Writes

Kafka writes messages sequentially to partition log files. Sequential writes are much faster than random writes because the disk never has to seek between locations. The gap is largest on spinning hard drives (HDDs), where the read/write head must physically move and sequential writes can be on the order of 100 times faster than random writes. SSDs have no seek time, so the gap is smaller there, but the append-only pattern still keeps the write path simple.

This is why Kafka uses a simple append-only log format. No searching for the right position to insert a record — just keep writing to the end of the file. This simplicity is a major reason Kafka achieves such high throughput.
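The append-only idea can be sketched in a few lines. The class below is a toy, not Kafka's on-disk format: it assumes a fresh file, stores each record as a 4-byte length followed by the payload, and the name AppendOnlyLog is hypothetical.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only log: each record is a 4-byte length plus its payload.

    Assumption: the file at `path` is new, so offsets start at 0.
    """

    def __init__(self, path: str) -> None:
        self.next_offset = 0   # offset the next record will receive
        self.end_position = 0  # byte position where the next record starts
        self._file = open(path, "ab")  # append mode: writes go to the end

    def append(self, payload: bytes) -> tuple[int, int]:
        """Append one record and return (offset, starting byte position)."""
        offset, position = self.next_offset, self.end_position
        record = struct.pack(">I", len(payload)) + payload
        self._file.write(record)
        self._file.flush()
        self.next_offset += 1
        self.end_position += len(record)
        return offset, position

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    log = AppendOnlyLog(os.path.join(tmp, "00000000000000000000.log"))
    print(log.append(b"login"))   # (0, 0)
    print(log.append(b"signup"))  # (1, 9)
    log.close()
```

There is no insert-in-the-middle operation at all: the only write path is "add to the end", which is what keeps the disk access pattern sequential.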

Log Segments

Kafka doesn't write all messages for a partition into a single enormous file. It splits the partition's data into log segments — multiple files that each cover a range of offsets.

A new segment starts when the current segment reaches a maximum size (default 1 GB, configurable via log.segment.bytes) or a maximum age (default 7 days, configurable via log.roll.hours or log.roll.ms). The topic-level equivalents are segment.bytes and segment.ms. The active segment is the newest one, the one currently being written to. All other segments are closed and read-only.

PARTITION DIRECTORY LAYOUT (orders - Partition 0):

/kafka-data/orders-0/
│
├── 00000000000000000000.log    ← segment starting at offset 0
├── 00000000000000000000.index  ← index for segment starting at offset 0
├── 00000000000000000000.timeindex ← time-based index
│
├── 00000000000000524288.log    ← segment starting at offset 524,288
├── 00000000000000524288.index
├── 00000000000000524288.timeindex
│
├── 00000000000001048576.log    ← ACTIVE segment (currently being written)
├── 00000000000001048576.index
└── 00000000000001048576.timeindex

File name = first offset in that segment
.log = the actual messages
.index = offset-to-byte-position mapping (for fast message lookup)
.timeindex = timestamp-to-offset mapping (for time-based queries)
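The naming rule makes it cheap to find the segment that holds a given offset: sort the base offsets and pick the largest one that is not greater than the target. A minimal sketch, with hypothetical helper names segment_file_name and find_segment:

```python
from bisect import bisect_right


def segment_file_name(base_offset: int, suffix: str = ".log") -> str:
    """Build a segment file name: the base offset zero-padded to 20 digits."""
    return f"{base_offset:020d}{suffix}"


def find_segment(base_offsets: list[int], target_offset: int) -> int:
    """Return the base offset of the segment that would hold target_offset.

    base_offsets must be sorted in ascending order.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the first segment")
    return base_offsets[i]


bases = [0, 524_288, 1_048_576]
print(segment_file_name(find_segment(bases, 600_000)))
# 00000000000000524288.log
```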

The Log File Format

Each .log file stores messages in a binary format called a record batch. Instead of writing messages one at a time, Kafka batches multiple messages together and writes the batch as a single unit. This batching reduces disk I/O and network calls significantly.

Each record batch contains a header with batch-level metadata (base offset, compression type, producer ID for transactions, etc.) followed by the individual records. This batch format is also used when messages travel over the network — producers send batches, brokers store batches, consumers receive batches.

LOG FILE BINARY STRUCTURE (simplified):

[RECORD BATCH 1]
  base_offset: 0
  batch_size: 4
  compression: NONE
  records:
    [Record: offset=0, key="u1", value="login"]
    [Record: offset=1, key="u2", value="signup"]
    [Record: offset=2, key="u1", value="purchase"]
    [Record: offset=3, key="u3", value="login"]

[RECORD BATCH 2]
  base_offset: 4
  batch_size: 3
  compression: GZIP
  records:
    [Record: offset=4, key=null, value="heartbeat"]
    [Record: offset=5, key="u2", value="logout"]
    [Record: offset=6, key="u1", value="logout"]
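The relationship between a batch's base offset and the offsets of its records can be modelled with two small data classes. This is an in-memory sketch of the simplified structure shown above, not the real binary layout. Record, RecordBatch and build_batch are hypothetical names, and the compression field is only a label here (nothing is actually compressed).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    offset: int
    key: Optional[bytes]
    value: bytes


@dataclass
class RecordBatch:
    base_offset: int   # offset of the first record in the batch
    compression: str   # label only in this sketch
    records: list[Record]


def build_batch(next_offset: int,
                messages: list[tuple[Optional[bytes], bytes]],
                compression: str = "NONE") -> RecordBatch:
    """Assign consecutive offsets to messages, starting at next_offset."""
    records = [Record(next_offset + i, key, value)
               for i, (key, value) in enumerate(messages)]
    return RecordBatch(next_offset, compression, records)


batch1 = build_batch(0, [(b"u1", b"login"), (b"u2", b"signup"),
                         (b"u1", b"purchase"), (b"u3", b"login")])
# The next batch starts where the previous one ended.
batch2 = build_batch(batch1.base_offset + len(batch1.records),
                     [(None, b"heartbeat"), (b"u2", b"logout"),
                      (b"u1", b"logout")], compression="GZIP")
print(batch2.base_offset, batch2.records[-1].offset)  # 4 6
```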

Index Files: Fast Message Lookup

Finding a message at a specific offset would be slow if Kafka had to scan through an entire log file from the beginning. Index files solve this problem.

The Offset Index

The .index file contains a sparse mapping from message offset to the byte position in the .log file. "Sparse" means it doesn't record every offset: Kafka adds an index entry roughly once per log.index.interval.bytes of log data written (4096 bytes by default). When you ask for offset 5000, Kafka finds the closest index entry at or below that offset, takes its byte position, then scans forward from that position in the log file. This reduces the scan to a small range rather than the entire file.

OFFSET INDEX EXAMPLE:

.index file:
  offset=0     → byte position=0
  offset=500   → byte position=47,832
  offset=1000  → byte position=95,614
  offset=1500  → byte position=143,299
  ...

Request: "Give me message at offset 750"
Step 1: Look up closest index entry at or below 750 → offset=500 at byte 47,832
Step 2: Seek to byte 47,832 in the .log file
Step 3: Read forward until offset 750 found (scan ~250 records, not millions)
Step 4: Return the message
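Step 1 of this lookup is a binary search over the index entries. A minimal sketch using the example entries above, where lookup is a hypothetical helper and the index is held as an in-memory list rather than a memory-mapped file:

```python
from bisect import bisect_right

# (offset, byte position) pairs from the example index above.
INDEX = [(0, 0), (500, 47_832), (1000, 95_614), (1500, 143_299)]


def lookup(index: list[tuple[int, int]], target_offset: int) -> tuple[int, int]:
    """Return the index entry with the largest offset <= target_offset."""
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the first entry")
    return index[i]


print(lookup(INDEX, 750))  # (500, 47832)
```

From the returned byte position, the reader scans forward in the .log file until it reaches the requested offset.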

The Time Index

The .timeindex file maps timestamps to offsets. This is used when consumers want to start reading from a specific point in time rather than a specific offset. For example: "Give me all messages from the last 2 hours." Kafka uses the time index to find the first offset at or after that timestamp, then reads forward from there.
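A time-based lookup follows the same two-step pattern as an offset lookup: use the sparse index to pick a starting point, then scan forward. The sketch below assumes timestamps never decrease within the partition. The index entries, record list and function names are all made up for illustration.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical sparse time index: (timestamp in ms, offset).
TIME_INDEX = [(1_000, 0), (2_000, 500), (3_000, 1000)]

# Hypothetical log contents: (offset, timestamp in ms), one record every 2 ms.
RECORDS = [(offset, 1_000 + 2 * offset) for offset in range(1500)]


def scan_start_offset(time_index: list[tuple[int, int]], target_ms: int) -> int:
    """Offset to start scanning from: the last entry at or before target_ms."""
    timestamps = [ts for ts, _ in time_index]
    i = bisect_right(timestamps, target_ms) - 1
    return time_index[max(i, 0)][1]


def first_offset_at_or_after(records: list[tuple[int, int]],
                             time_index: list[tuple[int, int]],
                             target_ms: int) -> Optional[int]:
    """Find the first offset whose timestamp is >= target_ms, or None."""
    start = scan_start_offset(time_index, target_ms)
    for offset, ts in records:
        if offset >= start and ts >= target_ms:
            return offset
    return None


print(first_offset_at_or_after(RECORDS, TIME_INDEX, 2_500))  # 750
```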

Page Cache: Why Kafka Is Faster Than You Expect

Kafka relies heavily on the operating system's page cache — a region of RAM that the OS uses to buffer recently read and written file data. When Kafka writes a message to disk, the OS doesn't immediately flush it to physical storage — it goes into the page cache first and gets flushed to disk asynchronously in batches.

When a consumer reads a message that was recently written, the message is usually still in the page cache, so the read is served from RAM without touching the physical disk. This is one of Kafka's biggest performance tricks: it uses OS-level memory caching to serve reads from RAM whenever possible, without implementing its own application-level caching.

KAFKA + OS PAGE CACHE:

Producer writes message:
  [Message] → [OS Page Cache] → (async flush to disk)
                    ↑ stays in RAM

Consumer reads message shortly after:
  [Consumer request] → [OS Page Cache] ← finds it here!
  Result: RAM speed read, not disk I/O

Consumer reads old message:
  [Consumer request] → [OS Page Cache] ← cache miss
                     → [Physical Disk] → loaded into cache → returned

Implication: Give Kafka brokers lots of RAM to maximize cache hits.
For recent data, Kafka reads from RAM at memory speeds.

Compression: Storing More Data With Less Space

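Kafka supports message compression at the producer level. The producer compresses a batch of messages before sending, and with the default broker setting (compression.type=producer) the broker stores the compressed batch as-is. The consumer decompresses when reading. Compression reduces network bandwidth and disk usage at the cost of CPU processing. The producer's default is no compression, so you have to opt in by setting compression.type.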

Compression Codecs in Kafka

GZIP: Best compression ratio. Highest CPU cost. Good for archival data where storage savings outweigh CPU cost.

Snappy: Moderate compression ratio. Low CPU cost. Good balance for general use. Developed by Google for speed.

LZ4: Compression ratio similar to Snappy, typically with faster compression and decompression. A common choice for high-throughput, latency-sensitive workloads.

ZSTD (Zstandard): Excellent compression ratio with lower CPU cost than GZIP. Available in Kafka 2.1+. A strong choice for modern deployments where the compression ratio matters more than the lowest possible CPU use.

COMPRESSION COMPARISON:

Codec     Ratio    CPU Cost    Best For
────────  ──────   ────────    ─────────────────────────────
NONE      1.0x     Zero        Low-volume, already-compressed data
LZ4       ~2.5x    Very Low    High-throughput, latency-sensitive
Snappy    ~2.5x    Low         General purpose
ZSTD      ~3.5x    Medium      Modern workloads, best ratio+speed
GZIP      ~4.0x    High        Archival, disk-constrained systems

Example savings with ZSTD on JSON data:
  Uncompressed batch: 100 KB
  Compressed batch:   ~28 KB
  Network + disk savings: 72%
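Why compression is applied per batch rather than per message can be checked with a small experiment. The sketch below uses gzip only because it ships with Python's standard library (LZ4, Snappy and ZSTD generally need third-party packages). The message shape and helper names are hypothetical, and the exact sizes depend on the data.

```python
import gzip
import json


def make_batch(count: int) -> list[bytes]:
    """Build `count` similar JSON messages, as a producer batch might hold."""
    return [
        json.dumps({"user_id": 7000 + i, "action": "login", "ip": "10.0.0.5"})
        .encode("utf-8")
        for i in range(count)
    ]


def compressed_sizes(messages: list[bytes]) -> tuple[int, int, int]:
    """Return (raw bytes, gzip per message, gzip over the whole batch)."""
    raw = sum(len(m) for m in messages)
    per_message = sum(len(gzip.compress(m)) for m in messages)
    batched = len(gzip.compress(b"\n".join(messages)))
    return raw, per_message, batched


raw, per_message, batched = compressed_sizes(make_batch(200))
print(f"raw={raw} per_message={per_message} batched={batched}")
```

Compressing the whole batch lets the codec exploit repetition across messages (field names, common values), which single small messages cannot offer. This is one reason larger producer batches tend to compress better.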

Message Serialization: What Goes Into the Byte Arrays

Kafka stores everything as bytes. The format of those bytes is entirely up to you — Kafka has no opinion on how you encode your data. The producer serializes the key and value into bytes before sending. The consumer deserializes them back after receiving.

Common serialization formats used with Kafka:

Plain Strings: Simplest option. Human-readable. No schema enforcement. Good for simple use cases and debugging.

JSON: Human-readable. Schema-flexible (you can add fields freely). Larger than binary formats. No built-in versioning. Good for early-stage development.

Avro: Binary. Schema-defined. Compact. Supports schema evolution (adding/removing fields with backward compatibility). Commonly used with Confluent Schema Registry. Widely adopted for Kafka data.

Protobuf (Protocol Buffers): Binary. Schema-defined. Very compact. Strong type safety. Popular in gRPC-heavy environments and when tight schema contracts are needed.

SERIALIZATION COMPARISON (same data, different formats):

Data: { "user_id": 7823, "event": "login", "timestamp": 1709654400 }

JSON:    {"user_id":7823,"event":"login","timestamp":1709654400}
         Size: 55 bytes  ← readable, flexible, larger

Avro:     [binary: ~18 bytes, schema stored separately]  ← compact, supports schema evolution
Protobuf: [binary: ~16 bytes, schema stored separately]  ← compact, strict typing

Choice impacts: storage cost, network cost, consumer compatibility
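The size gap comes from what each format has to spell out. JSON repeats field names and writes numbers as text, while schema-based binary formats write only the values. The sketch below hand-rolls the zigzag variable-length integers and length-prefixed strings that Avro's binary encoding uses. It is an illustration under that assumption, not a replacement for an Avro library, and the function names are hypothetical.

```python
import json


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small numbers
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode a string as its byte length (zigzag varint) plus UTF-8 bytes."""
    data = text.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_event(user_id: int, event: str, timestamp: int) -> bytes:
    """Concatenate the fields in schema order; no field names are written."""
    return zigzag_varint(user_id) + encode_string(event) + zigzag_varint(timestamp)


event = {"user_id": 7823, "event": "login", "timestamp": 1709654400}
json_bytes = json.dumps(event, separators=(",", ":")).encode("utf-8")
binary_bytes = encode_event(**event)
print(len(json_bytes), len(binary_bytes))  # 55 13
```

The 13 bytes cover the field values only. Real payloads usually carry a few extra bytes that identify the schema, so the figures in the comparison above are somewhat higher.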

What Happens When Disk Gets Full

If a Kafka broker's disk fills up, the broker can no longer append to the partitions stored on that disk. Writes fail, producers receive errors, and the broker may take the affected log directory offline or shut down. This is a serious operational failure. Preventing it requires proper retention configuration, disk monitoring, and capacity planning.

Kafka provides two mechanisms to control disk usage:

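Time-based retention: Delete data older than a specified age (log.retention.hours, default 168 hours, or the finer-grained log.retention.minutes and log.retention.ms). This is the most common approach.

Size-based retention: Delete the oldest data when a partition's log exceeds a specified size (log.retention.bytes). The limit applies per partition, not per topic, so a topic with 10 partitions can hold up to 10 times that amount. Useful when disk is the primary constraint.

Both can be configured simultaneously. Kafka applies whichever limit is hit first. In both cases Kafka deletes whole segments, not individual messages, so a message can remain on disk past its limit until its segment becomes eligible for deletion.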
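How the two limits interact can be sketched as a walk over a partition's segments from oldest to newest. This is a simplified model, not Kafka's retention code: it assumes the last segment is the active one and never deletes it, it uses -1 to mean "no limit", and Segment and segments_to_delete are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    last_timestamp_ms: int  # timestamp of the newest message in the segment


def segments_to_delete(segments: list[Segment], now_ms: int,
                       retention_ms: int, retention_bytes: int) -> list[int]:
    """Return base offsets of segments to delete, oldest first.

    `segments` is sorted oldest first; the last entry is treated as active.
    A limit of -1 disables that check.
    """
    doomed = []
    total = sum(s.size_bytes for s in segments)
    for seg in segments[:-1]:
        too_old = retention_ms >= 0 and now_ms - seg.last_timestamp_ms > retention_ms
        # Only delete for size if the log still meets the limit afterwards.
        too_big = retention_bytes >= 0 and total - seg.size_bytes >= retention_bytes
        if not (too_old or too_big):
            break  # deletion always proceeds from the oldest segment
        doomed.append(seg.base_offset)
        total -= seg.size_bytes
    return doomed


HOUR, GIB = 3_600_000, 1024 ** 3
now = 1000 * HOUR
segments = [
    Segment(0, GIB, now - 200 * HOUR),
    Segment(524_288, GIB, now - 100 * HOUR),
    Segment(1_048_576, GIB // 5, now),  # active segment
]
print(segments_to_delete(segments, now, 168 * HOUR, -1))  # [0]
print(segments_to_delete(segments, now, -1, GIB))         # [0]
```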

Key Points

Every Kafka message has an optional key, a value, a timestamp, optional headers, and an offset assigned by the broker.
Kafka writes messages sequentially to partition log files using an append-only approach, which is much faster than random-access storage.
Partitions are stored as multiple log segment files on disk. Index files enable fast lookup by offset or timestamp without scanning entire files.
Kafka relies on the operating system's page cache to serve recently written messages from RAM, dramatically reducing disk I/O.
Compression (LZ4, ZSTD, GZIP, Snappy) reduces disk and network usage at the cost of CPU. ZSTD offers a strong balance of compression ratio and CPU cost for many modern use cases.
Kafka stores raw bytes. Serialization formats (JSON, Avro, Protobuf) are handled by producers and consumers, not by Kafka itself.
