Kafka Production Deployment Best Practices
A Kafka cluster that runs perfectly in testing can fail spectacularly in production if the underlying hardware, configuration, and operational practices are not aligned with real workloads. Production deployments face problems that never appear in development — sudden traffic spikes, disk failures, network partitions, misconfigured clients, and software bugs discovered only under sustained load.
This guide walks through every layer of a production Kafka deployment: hardware selection, broker configuration, topic design, security hardening, capacity planning, and operational readiness.
Hardware Selection for Production Brokers
Kafka is a disk-heavy, network-heavy system. CPU matters much less than storage speed and network bandwidth. Choose hardware accordingly.
KAFKA BROKER RESOURCE PRIORITY: ┌─────────────────────────────────────────────────────────┐ │ Priority 1 → Storage (NVMe SSDs or fast HDDs) │ │ Priority 2 → Network (10 Gbps minimum, 25 Gbps ideal) │ │ Priority 3 → RAM (64 GB–256 GB for OS page cache) │ │ Priority 4 → CPU (16–32 cores, usually not bottleneck) │ └─────────────────────────────────────────────────────────┘ Recommended Production Broker Spec (mid-scale): ──────────────────────────────────────────────── CPU: 32 cores (Intel Xeon or AMD EPYC) RAM: 128 GB Storage: 12 × 4 TB NVMe SSDs in JBOD (just a bunch of disks) Network: 25 Gbps NIC OS: Linux (Ubuntu 22.04 LTS or RHEL 9)
Use JBOD (Just a Bunch of Disks) instead of RAID. Kafka manages its own replication across brokers. RAID adds write overhead without protecting you better than Kafka's built-in replication does. If one disk fails, Kafka re-replicates only the affected partitions from surviving brokers — far faster than rebuilding a RAID array.
Why Page Cache Matters More Than Java Heap
Kafka reads and writes data using the operating system's page cache, not Java memory. The page cache stores recently accessed disk data in RAM. When a consumer reads a message that was just produced, the OS often serves it directly from page cache without touching disk at all.
Message Flow with Page Cache:
Producer writes message
│
▼
OS Page Cache (RAM)
│
├──▶ Written to disk asynchronously (durable)
│
└──▶ Consumer reads from Page Cache immediately
(no disk read needed — ultra fast!)
Result:
Without page cache: Producer → Disk → Consumer (slow)
With page cache: Producer → RAM → Consumer (fast)
Recommendation:
Allocate only 6 GB heap to Kafka JVM (-Xmx6g)
Leave remaining ~120 GB of RAM free for OS page cache
Broker Configuration Essentials
The default Kafka configuration is designed for quick starts and demos. Production requires deliberate tuning of dozens of parameters.
Critical server.properties Settings
# Replication and Durability default.replication.factor=3 # Every topic gets 3 copies min.insync.replicas=2 # At least 2 replicas must confirm write unclean.leader.election.enable=false # Never elect a stale replica as leader # Log Retention log.retention.hours=168 # Keep data for 7 days log.retention.bytes=1073741824 # OR cap at 1 GB per partition (whichever first) log.segment.bytes=1073741824 # Each log segment = 1 GB # Performance num.network.threads=8 # Threads handling network I/O num.io.threads=16 # Threads handling disk I/O socket.send.buffer.bytes=102400 # Network send buffer socket.receive.buffer.bytes=102400 # Network receive buffer socket.request.max.bytes=104857600 # Max request size = 100 MB # Compression compression.type=lz4 # LZ4 gives best speed/ratio balance
The Replication Safety Triangle
Data Safety
▲
│
┌──────────────┼──────────────┐
│ │ │
│ replication.factor=3 │
│ min.insync.replicas=2 │
│ acks=all (producer side) │
│ │ │
└──────────────┼──────────────┘
│
Without all three, you have gaps:
factor=3, acks=1 → Fast but data loss possible
factor=1, acks=all → Safe-looking but one disk = total loss
factor=3, min.isr=1 → Allows writes to 1 replica (defeats purpose)
All three together → True durability guarantee
Topic Design for Production
Topics and their partition counts are the most consequential design decisions you make. Changing partition counts after creation causes partition reassignment, which stresses the cluster. Plan ahead.
How to Choose Partition Count
Partition Count Formula: ───────────────────────────────────────────── Target throughput per topic: T MB/s Single partition throughput: P MB/s (typically 10–100 MB/s per partition) Required partitions = T / P Example: Target: 500 MB/s topic throughput Each partition handles ~50 MB/s on this hardware Partitions needed = 500 / 50 = 10 partitions Then add headroom for future growth: Final partition count = 10 × 2 = 20 partitions Also consider consumer parallelism: Max parallel consumers = Number of partitions If you want 30 consumer threads → need at least 30 partitions
Partition Naming and Organization
Good Topic Naming Convention:
─────────────────────────────────────────────
Format: {team}.{domain}.{entity}.{event-type}
Examples:
payments.billing.invoice.created
logistics.shipping.package.status-updated
auth.users.account.password-changed
analytics.clickstream.pageview.raw
Bad Naming (avoid):
topic1
test
ORDERS
my-topic-v2-final-FINAL
Compacted Topics vs. Regular Topics
Regular Topic (cleanup.policy=delete): ────────────────────────────────────── Offset: 0 1 2 3 4 5 6 7 8 Key: A B A C B A D C A Value: 10 20 15 30 25 18 40 35 22 After 7 days → all messages deleted regardless Compacted Topic (cleanup.policy=compact): ────────────────────────────────────────── Same messages come in. Kafka keeps only the LATEST value for each key. After compaction: Offset: 5 6 7 8 Key: A D C A ← only latest A (offset 8) survives Wait — compaction keeps latest per key: Final state: Key B=25, Key C=35, Key D=40, Key A=22 Use compacted topics for: • Current user preferences • Latest product prices • Device configuration state • Any "what is the current value" use case
Network Architecture and Security
Network Segmentation
Production Network Layout:
┌─────────────────────────────────────────────────┐
│ DMZ / Public Zone │
│ [Load Balancer] [API Gateway] │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ Application Zone │
│ [Producer Apps] [Consumer Apps] │
└──────────┬───────────────────────┬──────────────┘
│ │
│ Internal Network │
┌──────────▼───────────────────────▼──────────────┐
│ Kafka Zone │
│ [Broker 1] [Broker 2] [Broker 3] │
│ [ZooKeeper/KRaft quorum] │
└─────────────────────────────────────────────────┘
│
┌──────────▼──────────────────────────────────────┐
│ Storage / Monitoring Zone │
│ [Prometheus] [Grafana] [Log Aggregation] │
└─────────────────────────────────────────────────┘
Firewall rules:
App Zone → Kafka Zone: Allow port 9092 (PLAINTEXT) or 9093 (TLS)
Kafka Zone → Kafka Zone: Allow all (broker-to-broker replication)
Outside → Kafka Zone: Block everything
Authentication and Authorization
Never run a production Kafka cluster without authentication. An unauthenticated Kafka cluster allows any machine on the network to read every message, write garbage data, or delete topics.
Security Stack for Production Kafka:
Layer 1 — Encryption in Transit:
Use TLS on listener port 9093
All data between clients and brokers encrypted
All data between brokers encrypted during replication
Layer 2 — Authentication:
SASL/SCRAM-SHA-512 (username/password, easy to manage)
OR SASL/GSSAPI (Kerberos, for enterprises with existing AD/LDAP)
OR mTLS (mutual TLS with client certificates)
Layer 3 — Authorization (ACLs):
kafka-acls.sh --add \
--allow-principal User:order-service \
--operation Write \
--topic orders \
--bootstrap-server broker1:9093
ACL Matrix Example:
┌────────────────┬──────────────┬──────────┬──────────┐
│ Service │ Topic │ Produce │ Consume │
├────────────────┼──────────────┼──────────┼──────────┤
│ order-service │ orders │ ✅ YES │ ❌ NO │
│ fraud-detector │ orders │ ❌ NO │ ✅ YES │
│ analytics-svc │ orders │ ❌ NO │ ✅ YES │
│ analytics-svc │ user-events │ ❌ NO │ ✅ YES │
│ admin-tool │ * │ ✅ YES │ ✅ YES │
└────────────────┴──────────────┴──────────┴──────────┘
Producer Configuration for Production
Production Producer Settings: ────────────────────────────────────────────────────── acks=all # Wait for all ISR replicas to confirm retries=2147483647 # Retry forever (let delivery.timeout control) delivery.timeout.ms=120000 # Give up after 2 minutes total enable.idempotence=true # Prevent duplicate messages on retry max.in.flight.requests.per.connection=5 # With idempotence, safe to use 5 linger.ms=5 # Wait 5ms to batch messages together batch.size=65536 # 64 KB batch size compression.type=lz4 # Compress batches before sending buffer.memory=67108864 # 64 MB local send buffer The Idempotence Safety Net: ──────────────────────────── Without idempotence: Producer sends message ──▶ Broker receives, writes, but ACK lost Producer retries ──▶ Broker writes AGAIN = DUPLICATE message With enable.idempotence=true: Producer sends message with sequence number (seq=1) Broker receives, writes, ACK lost Producer retries with SAME sequence number (seq=1) Broker sees: "I already processed seq=1 from this producer" Broker discards duplicate → exactly-once delivery
Consumer Configuration for Production
Production Consumer Settings:
──────────────────────────────────────────────────────
group.id=my-consumer-group # Unique per consumer group
auto.offset.reset=earliest # Start from beginning if no committed offset
enable.auto.commit=false # Commit offsets manually after processing
max.poll.records=500 # Process 500 records per poll
max.poll.interval.ms=300000 # 5 minutes max between polls before rebalance
session.timeout.ms=45000 # 45 seconds before broker considers consumer dead
heartbeat.interval.ms=15000 # Send heartbeat every 15 seconds
fetch.min.bytes=1 # Return data as soon as any is available
fetch.max.wait.ms=500 # Wait max 500ms for fetch.min.bytes
Manual Commit Pattern (the right way):
──────────────────────────────────────
while (true) {
records = consumer.poll(Duration.ofMillis(100))
for (record : records) {
processRecord(record) // Do your business logic
}
consumer.commitSync() // Commit AFTER processing, not before
}
Why NOT auto-commit:
Auto-commit fires every 5 seconds regardless of processing
If your app crashes between commit and processing → messages lost
Manual commit → only commit what you actually finished processing
Capacity Planning
Capacity Planning Worksheet: ────────────────────────────────────────────────────────── Step 1: Measure your write throughput Messages per second: ___________ Average message size: ___________ bytes Write MB/s = messages/sec × avg_size / 1,000,000 Step 2: Account for replication overhead Replication factor = 3 Total disk write rate = Write MB/s × 3 Step 3: Estimate disk space Daily data = Total write rate × 86,400 seconds Retention period = 7 days Total storage needed = Daily data × 7 × 1.2 (20% headroom) Step 4: Estimate network bandwidth Outbound = Write MB/s × (number of consumer groups) Total bandwidth = Write MB/s (replication) + Outbound Example for 10,000 messages/sec at 1 KB each: Write throughput: 10 MB/s With replication: 30 MB/s disk write Daily storage: 30 × 86,400 = 2,592,000 MB ≈ 2.5 TB/day 7-day retention: 17.5 TB + 20% = ~21 TB disk needed With 3 consumer groups: network out = 30 MB/s
Cluster Sizing Guidance
Cluster Size Tiers: SMALL (dev/staging, low production): 3 brokers, 3 ZooKeeper/KRaft nodes Handles up to 100 MB/s throughput MEDIUM (standard production): 6–9 brokers, 3 ZooKeeper/KRaft nodes Handles 100 MB/s – 1 GB/s throughput LARGE (high-traffic production): 12–30+ brokers, 3–5 ZooKeeper/KRaft nodes Handles 1 GB/s+ throughput Rule: Never run fewer than 3 brokers in production. With 2 brokers and replication factor 3 → impossible (not enough brokers) With 2 brokers and replication factor 2 → one broker failure = full outage With 3 brokers and replication factor 3 → one broker failure → cluster continues
Rolling Upgrades and Maintenance
Safe Broker Upgrade Procedure:
────────────────────────────────
1. Check cluster health: kafka-topics.sh --describe (no under-replicated partitions)
2. Take one broker offline (stop Kafka service)
3. Upgrade Kafka software on that broker
4. Start Kafka on that broker
5. Wait for cluster to fully recover:
• All partitions fully replicated (under-replicated = 0)
• Broker rejoins ISR for all its partitions
6. Repeat for next broker
Never upgrade two brokers simultaneously — doing so risks temporary data unavailability.
Timeline Visualization:
──────────────────────────────────────────────────────────
Broker 1: [Running] [STOP] [Upgrade] [Start] [Recovery] [Running]
Broker 2: [Running] [Running] [Running] [STOP] [Upgrade] ...
Broker 3: [Running] [Running] [Running] [Running] [Running] ...
↑ ↑
Wait for full Wait for full
recovery before recovery before
touching broker 2 touching broker 3
Disaster Recovery and Backup Strategy
DR Strategy Options: ────────────────────────────────────────────────────────── Option 1: MirrorMaker 2 (Kafka's built-in replication) Primary cluster (us-east-1) ──▶ MirrorMaker 2 ──▶ DR cluster (us-west-2) Replication lag: seconds to minutes Failover: manual (redirect clients to DR cluster) Data loss window: 1–60 seconds typically Cost: doubles hardware cost Option 2: Multi-Region Active-Active Region A cluster ◀──────▶ Region B cluster Both serve traffic simultaneously MirrorMaker 2 syncs between them No failover needed — traffic reroutes automatically Highest cost and complexity Option 3: Confluent/MSK cross-region replication Managed service handles replication complexity Recommended when operational simplicity matters more than cost For most teams: Option 1 (MirrorMaker 2) is the right starting point.
Key Points — Production Deployment Checklist
- Use minimum 3 brokers with replication factor 3 and min.insync.replicas=2 for production durability.
- Set unclean.leader.election.enable=false to prevent stale replicas from becoming leaders and causing data loss.
- Give Kafka only 6 GB JVM heap; leave the rest of RAM for the OS page cache.
- Use JBOD storage instead of RAID — Kafka manages its own redundancy better.
- Enable authentication (SASL) and authorization (ACLs) on every production cluster.
- Set enable.idempotence=true and acks=all on producers to prevent duplicates and data loss.
- Always commit consumer offsets manually after processing, never with auto-commit.
- Perform rolling upgrades one broker at a time, waiting for full recovery between each step.
- Plan disk capacity based on throughput × replication factor × retention period × 1.2 headroom.
