Apache Kafka Performance Tuning Key Configuration

Kafka's default configuration is conservative — optimized for correctness and safety rather than maximum performance. In many real-world deployments, the default settings leave significant throughput and latency improvements on the table. Performance tuning is the process of systematically adjusting configuration to match your workload's specific needs — whether that means squeezing out more throughput, reducing latency, or optimizing disk usage.

Performance Dimensions: The Trade-off Triangle

Kafka performance involves three dimensions that constantly trade off against each other: throughput (messages per second), latency (time from produce to consume), and durability (protection against data loss). Improving one typically costs another. Every tuning decision explicitly accepts a position on this triangle.

THE PERFORMANCE TRADE-OFF TRIANGLE:

             THROUGHPUT
            /           \
           /             \
          /               \
     LATENCY ─────────── DURABILITY

acks=all + small batches:   Low throughput, high latency, high durability
acks=0 + large batches:     High throughput, low latency, low durability
acks=1 + medium batches:    Balanced — most common production choice

Know your workload: Is throughput or latency your primary concern?
Is durability negotiable?

Producer Performance Tuning

batch.size: Controlling Batch Efficiency

The producer accumulates messages into batches before sending. Larger batches reduce the number of network requests and increase throughput. Smaller batches reduce latency but create more, smaller network calls.

BATCH SIZE IMPACT:

batch.size=16384 (16KB, default):
  → Good for moderate throughput. May send many small batches.

batch.size=65536 (64KB):
  → Better throughput. Slightly higher latency (waits longer to fill).

batch.size=1048576 (1MB):
  → Maximum throughput for bulk data. Higher memory usage per partition.
  → Useful for log aggregation, batch ETL pipelines.

Rule: Increase batch.size when throughput matters more than latency.

linger.ms: Adding Artificial Delay for Better Batching

With linger.ms=0 (default), the producer sends a batch the moment it is ready, even if it's only one message. Setting linger.ms to a small value (5-20ms) tells the producer to wait briefly to accumulate more messages before sending. This fills batches more efficiently.

LINGER.MS EFFECT:

linger.ms=0:  Batch sent immediately. Latency: 1-2ms. Small batches.
linger.ms=5:  Wait up to 5ms. Latency: 5-7ms. Larger batches.
linger.ms=20: Wait up to 20ms. Latency: 20-22ms. Even larger batches.

Throughput comparison (high-traffic topic):
  linger.ms=0:   100,000 msg/s (many small batches)
  linger.ms=20:  350,000 msg/s (few large batches, ~3.5x improvement!)

Cost: 20ms added latency on average.
Use when: Throughput matters. Latency budget is 20-100ms.

compression.type: Reducing Network and Disk Load

Compression trades CPU time for network bandwidth and disk space. For text-based messages (JSON, log lines), compression ratios of 3-10x are common. For already-binary data (Avro, Protobuf), ratios are lower but still worthwhile.

COMPRESSION CODEC COMPARISON:

compression.type=none    CPU: 0      Network savings: 0     Best for: binary data
compression.type=lz4     CPU: low    Network savings: ~60%  Best for: general use
compression.type=snappy  CPU: low    Network savings: ~60%  Best for: general use
compression.type=zstd    CPU: medium Network savings: ~70%  Best for: most modern workloads
compression.type=gzip    CPU: high   Network savings: ~75%  Best for: archival/cold data

Recommended default for most workloads: zstd
Latency-sensitive workloads: lz4

buffer.memory: Producer Memory Buffer

The record accumulator uses a fixed memory buffer controlled by buffer.memory (default 32 MB). If the buffer fills up (producer is sending faster than the broker accepts), the send() call blocks for up to max.block.ms before throwing an exception. Increase buffer.memory for producers handling burst traffic.

Broker Performance Tuning

num.network.threads and num.io.threads

The broker uses separate thread pools for network I/O (receiving/sending requests) and disk I/O (reading/writing logs).

THREAD POOL SIZING:

num.network.threads (default 3):
  Threads handling network requests (accept, receive, send).
  Increase when: broker handles many simultaneous connections.
  Rule: 2-4 per CPU core available.

num.io.threads (default 8):
  Threads performing disk I/O (log reads and writes).
  Increase when: broker is IO-bound on high-throughput workloads.
  Rule: 8-16 for NVMe SSDs; 4-8 for spinning disks.

socket.send.buffer.bytes (default 100KB):
  TCP socket send buffer. Increase for high-throughput, high-latency networks.

socket.receive.buffer.bytes (default 100KB):
  TCP socket receive buffer. Increase similarly.

log.dirs: Spreading Data Across Multiple Disks

Configuring multiple disk paths in log.dirs allows Kafka to spread partition data across multiple disks, multiplying available I/O bandwidth.

MULTIPLE DISK CONFIGURATION:

Single disk (bottleneck):
  log.dirs=/data/kafka
  All partitions compete for one disk's IOPS. Bottleneck at high throughput.

Four disks (distributed I/O):
  log.dirs=/data/disk1,/data/disk2,/data/disk3,/data/disk4
  Kafka distributes partitions across all 4 disks.
  4x the IOPS available. Throughput scales with disk count.

Use JBOD (Just a Bunch of Disks): don't RAID them. Let Kafka spread the data.
RAID adds overhead and removes Kafka's ability to balance across disks.

Page Cache Optimization

Kafka relies on the OS page cache for fast reads. The broker should have as much RAM as possible dedicated to page cache. Avoid running other memory-heavy processes on Kafka broker machines. The JVM heap for Kafka itself should be kept relatively small (4-8 GB) — the rest of the RAM goes to page cache.

MEMORY ALLOCATION FOR KAFKA BROKERS:

Server with 64 GB RAM:

JVM Heap:           6 GB  (KAFKA_HEAP_OPTS="-Xmx6g -Xms6g")
OS overhead:        2 GB
Page Cache:        56 GB  ← dedicated to Kafka's read/write buffering

Large page cache means:
  Recent messages served from RAM (not disk) → microsecond reads
  Write batches cached before disk flush → fast sequential writes
  More data within cache = fewer disk reads for consumers

Anti-pattern: Allocating 32GB JVM heap. 
  Large heap → frequent GC pauses → replication lag → ISR shrinks → trouble.

log.flush Settings: Durability vs Throughput

By default, Kafka relies entirely on OS page cache flushing — it doesn't call fsync() after every write. The OS flushes cached writes to disk asynchronously in the background. This is fast but means there's a brief window where a power failure could lose unflushed data.

FLUSH SETTINGS:

log.flush.interval.messages=Long.MAX_VALUE (default, effectively disabled):
  OS controls when to flush pages to disk. Fast. Small loss risk on power failure.
  Kafka relies on replication (not fsync) for durability. Recommended.

log.flush.interval.messages=10000:
  Kafka calls fsync() every 10,000 messages. Slower but each write is on disk.
  Not recommended: fsync is expensive and Kafka's replication is a better safety net.

Best practice: Leave flush disabled. Use replication factor 3 + acks=all.
  Replication across 3 brokers provides much stronger durability than fsync,
  because it protects against broker failure, not just power failure.

Consumer Performance Tuning

fetch.min.bytes and fetch.max.wait.ms

Control the minimum data the broker sends per fetch and how long it waits to accumulate that minimum before responding. These two settings together determine consumer fetch efficiency.

CONSUMER FETCH TUNING:

Real-time scenario (dashboard, alerts):
  fetch.min.bytes=1         ← respond immediately, even 1 byte
  fetch.max.wait.ms=100     ← wait max 100ms if nothing available
  Result: Low latency. More frequent, smaller fetches.

Throughput scenario (batch processing, analytics):
  fetch.min.bytes=1048576   ← wait for 1MB of data
  fetch.max.wait.ms=1000    ← but no longer than 1 second
  Result: Higher throughput. Fewer, larger fetches. Slightly higher latency.

max.partition.fetch.bytes (default 1MB):
  Max bytes fetched per partition per request.
  Increase for high-volume partitions.

fetch.max.bytes (default 50MB):
  Max total bytes per fetch request across all partitions.
  Increase for consumers reading many partitions simultaneously.

max.poll.records

Controls how many records are returned per poll() call. Increase for faster processing logic. Decrease if processing is slow and you risk hitting max.poll.interval.ms timeout.

Network and OS Tuning

TCP Settings

Kafka is a network-intensive application. OS-level TCP settings significantly affect performance on high-throughput clusters.

LINUX NETWORK TUNING FOR KAFKA:

# Increase network buffer sizes (important for high-throughput):
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Enable TCP window scaling:
sysctl -w net.ipv4.tcp_window_scaling=1

# Increase connection backlog:
sysctl -w net.core.somaxconn=65536
sysctl -w net.ipv4.tcp_max_syn_backlog=65536

These settings matter when: throughput exceeds 1 Gbps per broker,
or when latency spikes occur at high load.

File System and Mount Options

FILESYSTEM RECOMMENDATIONS:

Filesystem: ext4 or XFS (XFS preferred for large partition counts)
  Mount options: noatime,nodiratime,data=writeback (ext4)
                 noatime,nodiratime,logbufs=8 (XFS)
  noatime: Don't update access time on read → eliminates extra disk writes
  data=writeback: Faster writes for journaling

SSD vs HDD:
  SSDs dramatically reduce latency for random reads (old messages not in cache)
  Sequential writes (the main Kafka workload) are fast on both
  SSDs recommended for brokers with many partitions and random consumer access
  HDDs acceptable for sequential workloads with warm caches

Performance Testing: Kafka's Built-In Tools

BUILT-IN PERFORMANCE TEST TOOLS:

# Producer performance test (1 million messages, 1KB each):
bin/kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 acks=1

Output:
  1000000 records sent, 452261.306531 records/sec (441.67 MB/sec),
  2.20 ms avg latency, 242.00 ms max latency

# Consumer performance test:
bin/kafka-consumer-perf-test.sh \
  --bootstrap-server localhost:9092 \
  --topic perf-test \
  --messages 1000000 \
  --threads 1

Output:
  start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg,
  nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec

Key Points

  • The performance triangle trades throughput, latency, and durability. Every tuning decision is a position on this triangle.
  • Producer tuning: Increase batch.size and linger.ms for throughput. Use zstd compression for most workloads. Keep buffer.memory sized for burst traffic.
  • Broker tuning: Use multiple disks in log.dirs. Keep JVM heap small (4-8 GB) to maximize page cache. Size network and I/O thread pools appropriately.
  • Consumer tuning: Increase fetch.min.bytes and fetch.max.wait.ms for batch workloads. Keep them low for real-time scenarios.
  • OS tuning: Increase TCP buffer sizes for high-throughput networks. Mount Kafka data directories with noatime to reduce unnecessary disk writes.
  • Measure first. Use kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh to establish baselines before and after tuning.

Leave a Comment