Kafka Production Deployment Best Practices

A Kafka cluster that runs perfectly in testing can fail spectacularly in production if the underlying hardware, configuration, and operational practices are not aligned with real workloads. Production deployments face problems that never appear in development — sudden traffic spikes, disk failures, network partitions, misconfigured clients, and software bugs discovered only under sustained load.

This guide walks through every layer of a production Kafka deployment: hardware selection, broker configuration, topic design, security hardening, capacity planning, and operational readiness.

Hardware Selection for Production Brokers

Kafka is a disk-heavy, network-heavy system. CPU matters much less than storage speed and network bandwidth. Choose hardware accordingly.

KAFKA BROKER RESOURCE PRIORITY:

┌─────────────────────────────────────────────────────────┐
│  Priority 1 → Storage (NVMe SSDs or fast HDDs)          │
│  Priority 2 → Network (10 Gbps minimum, 25 Gbps ideal)  │
│  Priority 3 → RAM (64 GB–256 GB for OS page cache)      │
│  Priority 4 → CPU (16–32 cores, usually not bottleneck) │
└─────────────────────────────────────────────────────────┘

Recommended Production Broker Spec (mid-scale):
────────────────────────────────────────────────
  CPU:     32 cores (Intel Xeon or AMD EPYC)
  RAM:     128 GB
  Storage: 12 × 4 TB NVMe SSDs in JBOD (just a bunch of disks)
  Network: 25 Gbps NIC
  OS:      Linux (Ubuntu 22.04 LTS or RHEL 9)

Use JBOD (Just a Bunch of Disks) instead of RAID. Kafka manages its own replication across brokers. RAID adds write overhead without protecting you better than Kafka's built-in replication does. If one disk fails, Kafka re-replicates only the affected partitions from surviving brokers — far faster than rebuilding a RAID array.

Why Page Cache Matters More Than Java Heap

Kafka reads and writes data using the operating system's page cache, not Java memory. The page cache stores recently accessed disk data in RAM. When a consumer reads a message that was just produced, the OS often serves it directly from page cache without touching disk at all.

Message Flow with Page Cache:

Producer writes message
        │
        ▼
   OS Page Cache (RAM)
        │
        ├──▶ Written to disk asynchronously (durable)
        │
        └──▶ Consumer reads from Page Cache immediately
             (no disk read needed — ultra fast!)

Result:
  Without page cache: Producer → Disk → Consumer  (slow)
  With page cache:    Producer → RAM  → Consumer  (fast)

Recommendation:
  Allocate only 6 GB heap to Kafka JVM (-Xmx6g)
  Leave remaining ~120 GB of RAM free for OS page cache

Broker Configuration Essentials

The default Kafka configuration is designed for quick starts and demos. Production requires deliberate tuning of dozens of parameters.

Critical server.properties Settings

# Replication and Durability
default.replication.factor=3          # Every topic gets 3 copies
min.insync.replicas=2                 # At least 2 replicas must confirm write
unclean.leader.election.enable=false  # Never elect a stale replica as leader

# Log Retention
log.retention.hours=168               # Keep data for 7 days
log.retention.bytes=1073741824        # OR cap at 1 GB per partition (whichever first)
log.segment.bytes=1073741824          # Each log segment = 1 GB

# Performance
num.network.threads=8                 # Threads handling network I/O
num.io.threads=16                     # Threads handling disk I/O
socket.send.buffer.bytes=102400       # Network send buffer
socket.receive.buffer.bytes=102400    # Network receive buffer
socket.request.max.bytes=104857600    # Max request size = 100 MB

# Compression
compression.type=lz4                  # LZ4 gives best speed/ratio balance

The Replication Safety Triangle

                    Data Safety
                        ▲
                        │
         ┌──────────────┼──────────────┐
         │              │              │
         │   replication.factor=3      │
         │   min.insync.replicas=2     │
         │   acks=all (producer side)  │
         │              │              │
         └──────────────┼──────────────┘
                        │
         Without all three, you have gaps:
         
         factor=3, acks=1  → Fast but data loss possible
         factor=1, acks=all → Safe-looking but one disk = total loss
         factor=3, min.isr=1 → Allows writes to 1 replica (defeats purpose)
         
         All three together → True durability guarantee

Topic Design for Production

Topics and their partition counts are the most consequential design decisions you make. Changing partition counts after creation causes partition reassignment, which stresses the cluster. Plan ahead.

How to Choose Partition Count

Partition Count Formula:
─────────────────────────────────────────────
Target throughput per topic: T MB/s
Single partition throughput:  P MB/s (typically 10–100 MB/s per partition)

Required partitions = T / P

Example:
  Target: 500 MB/s topic throughput
  Each partition handles ~50 MB/s on this hardware
  Partitions needed = 500 / 50 = 10 partitions

Then add headroom for future growth:
  Final partition count = 10 × 2 = 20 partitions

Also consider consumer parallelism:
  Max parallel consumers = Number of partitions
  If you want 30 consumer threads → need at least 30 partitions

Partition Naming and Organization

Good Topic Naming Convention:
─────────────────────────────────────────────
Format: {team}.{domain}.{entity}.{event-type}

Examples:
  payments.billing.invoice.created
  logistics.shipping.package.status-updated
  auth.users.account.password-changed
  analytics.clickstream.pageview.raw

Bad Naming (avoid):
  topic1
  test
  ORDERS
  my-topic-v2-final-FINAL

Compacted Topics vs. Regular Topics

Regular Topic (cleanup.policy=delete):
──────────────────────────────────────
  Offset: 0   1   2   3   4   5   6   7   8
  Key:    A   B   A   C   B   A   D   C   A
  Value:  10  20  15  30  25  18  40  35  22

  After 7 days → all messages deleted regardless

Compacted Topic (cleanup.policy=compact):
──────────────────────────────────────────
  Same messages come in. Kafka keeps only the LATEST
  value for each key.

  After compaction:
  Offset: 5   6   7   8
  Key:    A   D   C   A  ← only latest A (offset 8) survives
  
  Wait — compaction keeps latest per key:
  Final state: Key B=25, Key C=35, Key D=40, Key A=22

Use compacted topics for:
  • Current user preferences
  • Latest product prices
  • Device configuration state
  • Any "what is the current value" use case

Network Architecture and Security

Network Segmentation

Production Network Layout:

┌─────────────────────────────────────────────────┐
│                  DMZ / Public Zone              │
│   [Load Balancer] [API Gateway]                 │
└─────────────────────┬───────────────────────────┘
                      │ 
┌─────────────────────▼───────────────────────────┐
│               Application Zone                  │
│   [Producer Apps]      [Consumer Apps]          │
└──────────┬───────────────────────┬──────────────┘
           │                       │
           │  Internal Network     │
┌──────────▼───────────────────────▼──────────────┐
│                  Kafka Zone                     │
│  [Broker 1] [Broker 2] [Broker 3]               │
│  [ZooKeeper/KRaft quorum]                       │
└─────────────────────────────────────────────────┘
           │
┌──────────▼──────────────────────────────────────┐
│              Storage / Monitoring Zone          │
│  [Prometheus] [Grafana] [Log Aggregation]       │
└─────────────────────────────────────────────────┘

Firewall rules:
  App Zone → Kafka Zone: Allow port 9092 (PLAINTEXT) or 9093 (TLS)
  Kafka Zone → Kafka Zone: Allow all (broker-to-broker replication)
  Outside → Kafka Zone: Block everything

Authentication and Authorization

Never run a production Kafka cluster without authentication. An unauthenticated Kafka cluster allows any machine on the network to read every message, write garbage data, or delete topics.

Security Stack for Production Kafka:

Layer 1 — Encryption in Transit:
  Use TLS on listener port 9093
  All data between clients and brokers encrypted
  All data between brokers encrypted during replication

Layer 2 — Authentication:
  SASL/SCRAM-SHA-512 (username/password, easy to manage)
  OR SASL/GSSAPI (Kerberos, for enterprises with existing AD/LDAP)
  OR mTLS (mutual TLS with client certificates)

Layer 3 — Authorization (ACLs):
  kafka-acls.sh --add \
    --allow-principal User:order-service \
    --operation Write \
    --topic orders \
    --bootstrap-server broker1:9093

ACL Matrix Example:
┌────────────────┬──────────────┬──────────┬──────────┐
│ Service        │ Topic        │ Produce  │ Consume  │
├────────────────┼──────────────┼──────────┼──────────┤
│ order-service  │ orders       │ ✅ YES   │ ❌ NO    │
│ fraud-detector │ orders       │ ❌ NO    │ ✅ YES   │
│ analytics-svc  │ orders       │ ❌ NO    │ ✅ YES   │
│ analytics-svc  │ user-events  │ ❌ NO    │ ✅ YES   │
│ admin-tool     │ *            │ ✅ YES   │ ✅ YES   │
└────────────────┴──────────────┴──────────┴──────────┘

Producer Configuration for Production

Production Producer Settings:
──────────────────────────────────────────────────────
acks=all                    # Wait for all ISR replicas to confirm
retries=2147483647          # Retry forever (let delivery.timeout control)
delivery.timeout.ms=120000  # Give up after 2 minutes total
enable.idempotence=true     # Prevent duplicate messages on retry
max.in.flight.requests.per.connection=5  # With idempotence, safe to use 5
linger.ms=5                 # Wait 5ms to batch messages together
batch.size=65536            # 64 KB batch size
compression.type=lz4        # Compress batches before sending
buffer.memory=67108864      # 64 MB local send buffer

The Idempotence Safety Net:
────────────────────────────
Without idempotence:
  Producer sends message ──▶ Broker receives, writes, but ACK lost
  Producer retries       ──▶ Broker writes AGAIN = DUPLICATE message

With enable.idempotence=true:
  Producer sends message with sequence number (seq=1)
  Broker receives, writes, ACK lost
  Producer retries with SAME sequence number (seq=1)
  Broker sees: "I already processed seq=1 from this producer"
  Broker discards duplicate → exactly-once delivery

Consumer Configuration for Production

Production Consumer Settings:
──────────────────────────────────────────────────────
group.id=my-consumer-group          # Unique per consumer group
auto.offset.reset=earliest          # Start from beginning if no committed offset
enable.auto.commit=false            # Commit offsets manually after processing
max.poll.records=500                # Process 500 records per poll
max.poll.interval.ms=300000         # 5 minutes max between polls before rebalance
session.timeout.ms=45000            # 45 seconds before broker considers consumer dead
heartbeat.interval.ms=15000         # Send heartbeat every 15 seconds
fetch.min.bytes=1                   # Return data as soon as any is available
fetch.max.wait.ms=500               # Wait max 500ms for fetch.min.bytes

Manual Commit Pattern (the right way):
──────────────────────────────────────
while (true) {
  records = consumer.poll(Duration.ofMillis(100))
  
  for (record : records) {
    processRecord(record)          // Do your business logic
  }
  
  consumer.commitSync()            // Commit AFTER processing, not before
}

Why NOT auto-commit:
  Auto-commit fires every 5 seconds regardless of processing
  If your app crashes between commit and processing → messages lost
  Manual commit → only commit what you actually finished processing

Capacity Planning

Capacity Planning Worksheet:
──────────────────────────────────────────────────────────

Step 1: Measure your write throughput
  Messages per second:        ___________
  Average message size:       ___________ bytes
  Write MB/s =                messages/sec × avg_size / 1,000,000

Step 2: Account for replication overhead
  Replication factor = 3
  Total disk write rate = Write MB/s × 3

Step 3: Estimate disk space
  Daily data = Total write rate × 86,400 seconds
  Retention period = 7 days
  Total storage needed = Daily data × 7 × 1.2 (20% headroom)

Step 4: Estimate network bandwidth
  Outbound = Write MB/s × (number of consumer groups)
  Total bandwidth = Write MB/s (replication) + Outbound

Example for 10,000 messages/sec at 1 KB each:
  Write throughput:    10 MB/s
  With replication:    30 MB/s disk write
  Daily storage:       30 × 86,400 = 2,592,000 MB ≈ 2.5 TB/day
  7-day retention:     17.5 TB + 20% = ~21 TB disk needed
  With 3 consumer groups: network out = 30 MB/s

Cluster Sizing Guidance

Cluster Size Tiers:

SMALL (dev/staging, low production):
  3 brokers, 3 ZooKeeper/KRaft nodes
  Handles up to 100 MB/s throughput
  
MEDIUM (standard production):
  6–9 brokers, 3 ZooKeeper/KRaft nodes
  Handles 100 MB/s – 1 GB/s throughput

LARGE (high-traffic production):
  12–30+ brokers, 3–5 ZooKeeper/KRaft nodes
  Handles 1 GB/s+ throughput

Rule: Never run fewer than 3 brokers in production.
  With 2 brokers and replication factor 3 → impossible (not enough brokers)
  With 2 brokers and replication factor 2 → one broker failure = full outage
  With 3 brokers and replication factor 3 → one broker failure → cluster continues

Rolling Upgrades and Maintenance

Safe Broker Upgrade Procedure:
────────────────────────────────
1. Check cluster health: kafka-topics.sh --describe (no under-replicated partitions)
2. Take one broker offline (stop Kafka service)
3. Upgrade Kafka software on that broker
4. Start Kafka on that broker
5. Wait for cluster to fully recover:
   • All partitions fully replicated (under-replicated = 0)
   • Broker rejoins ISR for all its partitions
6. Repeat for next broker

Never upgrade two brokers simultaneously — doing so risks temporary data unavailability.

Timeline Visualization:
──────────────────────────────────────────────────────────
Broker 1:  [Running] [STOP] [Upgrade] [Start] [Recovery] [Running]
Broker 2:  [Running] [Running] [Running] [STOP] [Upgrade] ...
Broker 3:  [Running] [Running] [Running] [Running] [Running] ...
                     ↑                    ↑
              Wait for full          Wait for full
              recovery before        recovery before
              touching broker 2      touching broker 3

Disaster Recovery and Backup Strategy

DR Strategy Options:
──────────────────────────────────────────────────────────

Option 1: MirrorMaker 2 (Kafka's built-in replication)
  Primary cluster (us-east-1) ──▶ MirrorMaker 2 ──▶ DR cluster (us-west-2)
  
  Replication lag: seconds to minutes
  Failover: manual (redirect clients to DR cluster)
  Data loss window: 1–60 seconds typically
  Cost: doubles hardware cost

Option 2: Multi-Region Active-Active
  Region A cluster ◀──────▶ Region B cluster
  Both serve traffic simultaneously
  MirrorMaker 2 syncs between them
  
  No failover needed — traffic reroutes automatically
  Highest cost and complexity

Option 3: Confluent/MSK cross-region replication
  Managed service handles replication complexity
  Recommended when operational simplicity matters more than cost

For most teams: Option 1 (MirrorMaker 2) is the right starting point.

Key Points — Production Deployment Checklist

Use minimum 3 brokers with replication factor 3 and min.insync.replicas=2 for production durability.
Set unclean.leader.election.enable=false to prevent stale replicas from becoming leaders and causing data loss.
Give Kafka only 6 GB JVM heap; leave the rest of RAM for the OS page cache.
Use JBOD storage instead of RAID — Kafka manages its own redundancy better.
Enable authentication (SASL) and authorization (ACLs) on every production cluster.
Set enable.idempotence=true and acks=all on producers to prevent duplicates and data loss.
Always commit consumer offsets manually after processing, never with auto-commit.
Perform rolling upgrades one broker at a time, waiting for full recovery between each step.
Plan disk capacity based on throughput × replication factor × retention period × 1.2 headroom.

Previous lesson

Back to course

Next lesson