Kafka Streams Building Real Time Applications

Kafka Streams is a client library for building stream processing applications that read from Kafka topics, transform the data in real time, and write results back to Kafka topics. It runs inside your own application — no separate processing cluster to manage, no Spark jobs to submit, no Flink cluster to operate. Every Kafka Streams application is a regular Java or Scala application that scales by running multiple instances.

What Kafka Streams Is and Is Not

Kafka Streams is not a separate server or cluster. It is a library you import into your Java or Kotlin application. Your application becomes a stream processor. You run as many instances of your application as you need — they coordinate automatically through Kafka's consumer group mechanism, each instance handling a subset of the partitions.

KAFKA STREAMS APPLICATION TOPOLOGY:

[Input Topic: orders] → [Kafka Streams App] → [Output Topic: orders-enriched]

Your application contains the processing logic:
  - Filter only paid orders
  - Look up customer details from a local state store
  - Enrich order with customer name and tier
  - Write enriched record to output topic

3 instances of your app running:
  Instance 1: processes partitions 0, 1
  Instance 2: processes partitions 2, 3
  Instance 3: processes partitions 4, 5

Add a 4th instance → automatic rebalancing → better parallelism.
Remove one instance → automatic rebalancing → others absorb its partitions.

The Topology: DAG of Stream Processing

A Kafka Streams application is defined as a processing topology — a directed acyclic graph (DAG) of source nodes (topics), processor nodes (transformations), and sink nodes (output topics). Messages flow through the graph from sources to sinks.

SIMPLE TOPOLOGY EXAMPLE:

SOURCE NODE:  orders topic (reads all messages)
       ↓
FILTER NODE:  keep only messages where status = "PAID"
       ↓
MAP NODE:     transform each record: add "processed_at" timestamp
       ↓
SINK NODE:    write to paid-orders topic

Code (Kafka Streams DSL):

StreamsBuilder builder = new StreamsBuilder();

KStream orders = builder.stream("orders");

orders
  .filter((key, order) → order.getStatus().equals("PAID"))
  .mapValues(order → {
    order.setProcessedAt(Instant.now());
    return order;
  })
  .to("paid-orders");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

KStream vs KTable: Two Views of a Topic

Kafka Streams provides two fundamental abstractions for representing topic data.

KStream: The Event Stream

A KStream represents an unbounded, append-only sequence of events. Every record in a KStream is a new, independent event. There is no concept of "updating" a previous record — every new record is additive. Think of a KStream like a river — water always flows forward and each wave is distinct.

KStream is used for: clickstream data, sensor readings, transaction events, log lines — any data where each record represents a new occurrence, not an update to an existing entity.

KTable: The Change Log (Latest State)

A KTable represents the latest value for each key in a topic. It treats the topic as a changelog — each record with the same key overwrites the previous value for that key. Think of a KTable like a spreadsheet — it shows you the current state, and when you update a cell, the old value is replaced.

KTable is used for: user profiles, product catalog, account balances, configuration data — any data where you care about the current state, not the history.

KSTREAM vs KTABLE COMPARISON:

Topic: user-updates
  offset 0: key=user1, value={name:"Alice", plan:"free"}
  offset 1: key=user2, value={name:"Bob",   plan:"premium"}
  offset 2: key=user1, value={name:"Alice", plan:"premium"}  ← Alice upgraded

As KStream (every record is new):
  user1 → free      (event: user1 signed up)
  user2 → premium   (event: user2 signed up)
  user1 → premium   (event: user1 upgraded)
  3 records total. All are stored.

As KTable (latest value per key):
  user1 → premium   (current state: Alice is premium)
  user2 → premium   (current state: Bob is premium)
  Only 2 records. Old user1→free is replaced by user1→premium.

KStream.to("output") emits all records.
KTable.toStream().to("output") emits records when state changes.

Stateless vs Stateful Operations

Stateless Operations

Stateless operations transform each record independently without needing to remember anything about previous records. They are simple and highly parallelizable.

STATELESS OPERATIONS:

filter:    Keep records matching a predicate
           orders.filter((k,v) → v.getAmount() > 100)

map:       Transform key and/or value
           orders.map((k,v) → new KeyValue<>(v.getUserId(), v))

mapValues: Transform value only (key unchanged)
           orders.mapValues(v → v.toEnriched())

flatMap:   One input record → multiple output records
           sentences.flatMap((k,v) → Arrays.stream(v.split(" "))
                                          .map(word → new KeyValue<>(k, word)))

peek:      Inspect each record without changing it (for logging/debugging)
           orders.peek((k,v) → log.info("Processing: " + k))

branch:    Split stream into multiple sub-streams based on predicates
           Map branches = orders.split()
             .branch((k,v) → v.getStatus().equals("PAID"), Branched.as("paid"))
             .branch((k,v) → v.getStatus().equals("FAILED"), Branched.as("failed"))

Stateful Operations

Stateful operations need to remember information across multiple records. They use state stores — local key-value databases (backed by RocksDB) that the stream processor maintains. State stores are fault-tolerant: their changes are logged to a Kafka changelog topic and restored on restart.

STATEFUL OPERATIONS:

count(): Count occurrences per key
  KTable orderCountPerUser =
    orders.groupByKey().count();

aggregate(): General-purpose accumulation per key
  KTable totalRevenuePerUser =
    orders.groupByKey().aggregate(
      () → 0.0,                                     // initializer
      (key, order, total) → total + order.getAmount() // aggregator
    );

windowedBy(): Apply aggregation within time windows
  KTable, Long> ordersPerHour =
    orders
      .groupByKey()
      .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
      .count();

join(): Combine two streams or stream+table based on matching keys
  (covered below)

Joins in Kafka Streams

Kafka Streams supports three types of joins that combine data from two sources based on matching keys.

JOIN TYPES:

STREAM-STREAM JOIN (KStream + KStream):
  Joins two event streams within a time window.
  Both streams must have the same key.
  Use case: Match web clicks with ad impressions within 5 minutes.

  clicks.join(
    impressions,
    (click, impression) → new ClickImpression(click, impression),
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5))
  )

STREAM-TABLE JOIN (KStream + KTable):
  Enriches a stream with current lookup data from a table.
  Non-windowed: uses the current state of the table.
  Use case: Enrich order events with current customer tier from a user table.

  orders.join(
    userTierTable,
    (order, userTier) → order.withTier(userTier)
  )

TABLE-TABLE JOIN (KTable + KTable):
  Combines two current-state tables into a unified view.
  Use case: Join product table with inventory table to show products with stock levels.

  products.join(
    inventory,
    (product, stock) → product.withStock(stock)
  )

Windowing: Aggregating Over Time

Stream processing often needs to compute aggregations over time windows — "how many orders in the last hour?", "what is the average temperature over the last 5 minutes?", "which users clicked more than 10 times in any 10-minute window?"

WINDOW TYPES:

TUMBLING WINDOW (fixed, non-overlapping):
  Duration = 1 hour
  Window 1: 09:00 - 10:00 (closed, results emitted)
  Window 2: 10:00 - 11:00 (closed, results emitted)
  Window 3: 11:00 - 12:00 (open, accumulating)
  Use: Hourly reports, daily summaries.

  TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1))

HOPPING WINDOW (fixed size, overlapping):
  Size = 1 hour, Advance = 15 minutes
  Window 1: 09:00 - 10:00
  Window 2: 09:15 - 10:15
  Window 3: 09:30 - 10:30
  Each event appears in 4 windows (1hr / 15min).
  Use: Rolling averages with overlap.

  TimeWindows.of(Duration.ofHours(1)).advanceBy(Duration.ofMinutes(15))

SESSION WINDOW (activity-based, variable size):
  Groups events with the same key that occur within a gap duration of each other.
  If no events for 30 minutes → session closes.
  Use: User session analytics.

  SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30))

State Stores and Fault Tolerance

Every stateful operation in Kafka Streams maintains a local RocksDB state store. When your application computes an aggregation (count, sum, join), the intermediate results live in this local store — fast local disk access, no network round trip.

State stores are fault-tolerant through changelog topics — every state change is written as a message to a Kafka topic (the changelog). If the application restarts or a partition moves to another instance, the new instance rebuilds the state store by replaying the changelog. Standby replicas (optional) keep pre-warmed copies of state on other instances for faster recovery.

STATE STORE FAULT TOLERANCE:

Instance 1 maintains:
  Local RocksDB: {user1: count=42, user2: count=17}
  Changelog topic: orders-count-changelog-partition-0
    → [user1: 42] [user2: 17] (latest state backed to Kafka)

Instance 1 CRASHES.

Instance 2 takes over partition 0:
  Reads changelog topic from beginning:
    user1: 42, user2: 17
  Rebuilds local RocksDB: {user1: 42, user2: 17}
  Resumes processing from last committed offset.

Recovery time: proportional to changelog size. Use standby replicas
(num.standby.replicas=1) to keep a hot copy on another instance for faster recovery.

Kafka Streams vs Apache Flink and Spark Streaming

COMPARISON:

Feature            Kafka Streams      Apache Flink       Spark Streaming
─────────────────────────────────────────────────────────────────────
Deployment         Library in your    Separate cluster   Separate cluster
                   application        required           required

Language           Java, Kotlin,      Java, Python,      Scala, Python,
                   Scala              Scala              Java, R

State management   RocksDB, built-in  Built-in,          Managed by
                   in Kafka           checkpoints        Spark

Operational        Low (no cluster)   High (cluster ops) High (cluster ops)
complexity

Best for           Kafka-native       Complex analytics, Large batch + stream
                   microservices      complex joins      hybrid workloads
                   event-driven apps  multi-source joins

Choose Kafka Streams when: Your source and sink are Kafka. You want
  simple deployment. Operational simplicity is important.
  
Choose Flink when: Complex multi-source joins, very large state,
  advanced windowing semantics, machine learning in-stream.

Key Points

Kafka Streams is a Java library — not a separate cluster. Your application IS the stream processor. Run multiple instances for parallelism.
KStream represents an append-only event stream. KTable represents the latest value per key (current state view).
Stateless operations (filter, map, flatMap) transform records without memory. Stateful operations (count, aggregate, join) use local RocksDB state stores.
Three join types: stream-stream (windowed, both streams), stream-table (enrich stream with current lookup state), table-table (combine two current-state views).
Windows aggregate events over time: tumbling (fixed, non-overlapping), hopping (fixed, overlapping), session (activity-gap based).
State stores are fault-tolerant via changelog topics. Failed instances rebuild state by replaying the changelog. Standby replicas reduce recovery time.

Previous lesson

Back to course

Next lesson