ADE Real-Time Data Processing

Most data engineering work processes data in batches — collecting data over a period, then processing it all at once. But some business scenarios demand immediate action on data as it arrives. A fraud detection system cannot wait until midnight to analyze transactions. A factory sensor alarm cannot wait for a scheduled pipeline. Real-time data processing handles these scenarios.

Batch Processing vs Stream Processing

Understanding the difference between batch and stream processing is fundamental before building real-time pipelines.

Batch processing: Data accumulates over a time window (hourly, daily) and is processed all at once. Like collecting letters in a mailbox all week and sorting them on Friday.

Stream processing: Data is processed as each event arrives, within milliseconds or seconds. Like a postal worker at the door who sorts and delivers each letter the moment it arrives.

AspectBatch ProcessingStream Processing
Data freshnessHours or days oldMilliseconds to seconds old
ComplexitySimpler to build and debugMore complex
CostLower — resources used only during processingHigher — infrastructure runs continuously
Use casesDaily reports, monthly analytics, ETL loadsFraud detection, IoT monitoring, live dashboards

Azure Event Hubs — The Data Ingestion Layer

Azure Event Hubs is a fully managed, real-time data ingestion service. It acts as a large, durable buffer — receiving millions of events per second from producers and holding them for consumers to process.

Think of Event Hubs as a conveyor belt in a factory. Products (events) land on the belt continuously. Workers at different stations (consumers) pick products off the belt and process them. The belt runs nonstop. If a worker takes a break, the belt keeps moving — and when the worker returns, the products are still there waiting.

Key Concepts in Event Hubs

Event: A single unit of data — a JSON message, a log entry, a sensor reading. Each event is typically a few kilobytes in size.

Producer: An application or device that sends events to Event Hubs. A mobile app sending click events, a temperature sensor sending readings every second, a payment gateway sending transaction records.

Consumer: An application or service that reads events from Event Hubs. Azure Stream Analytics, Databricks Structured Streaming, Azure Functions, and custom applications are common consumers.

Partition: Event Hubs divides the event stream into partitions for parallel processing. Each partition is an ordered sequence of events. More partitions enable higher throughput and more consumers reading in parallel. You set the partition count when creating the Event Hub — plan this carefully because it cannot be reduced after creation.

Consumer Group: A view of the entire event stream for a specific application. Each consumer group reads the full stream independently. This lets multiple downstream applications process the same events simultaneously without interfering with each other.

Retention Period: How long events stay in Event Hubs before being automatically deleted. Default is 1 day. Maximum is 7 days (Standard tier) or 90 days (Premium tier). If a consumer goes offline for longer than the retention period, it misses events.

Event Hubs Capture

Event Hubs Capture automatically saves incoming events to ADLS Gen2 in Avro format. This creates a permanent archive of all events without writing any additional code. A single toggle in the Event Hub configuration activates it. This is useful for replay — if your stream processing pipeline has a bug, you reprocess events from the stored archive after fixing the bug.

Azure Stream Analytics — Processing Events in Real Time

Azure Stream Analytics (ASA) is a fully managed real-time query engine. It reads events from Event Hubs (or IoT Hub), applies SQL-like transformations in real time, and writes results to outputs like Azure SQL Database, ADLS Gen2, Power BI, or another Event Hub.

You write SQL queries — a language most data engineers already know — to define how incoming events are processed. Stream Analytics handles all the distributed execution behind the scenes.

The Three Parts of a Stream Analytics Job

  • Input: The event source — usually Event Hubs or IoT Hub. You configure the connection and tell ASA the data format (JSON, CSV, Avro).
  • Query: The SQL logic that transforms incoming events.
  • Output: Where the results go — Azure SQL, ADLS Gen2, Power BI, another Event Hub, Cosmos DB.

Stream Analytics Windowing Functions

In stream processing, you often need to aggregate events over a time window. Stream Analytics provides several windowing functions for this.

Tumbling Window: Fixed-size, non-overlapping time windows. Every event falls into exactly one window. Calculate the total number of page views every 5 minutes.

-- Count transactions every 1 minute
SELECT
    System.Timestamp() AS window_end,
    region,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_amount
FROM transactions TIMESTAMP BY event_time
GROUP BY
    region,
    TumblingWindow(minute, 1)

Hopping Window: Fixed-size windows that overlap. A 10-minute window that hops every 2 minutes means each event can appear in multiple windows. Useful for rolling aggregations.

Sliding Window: A window that moves continuously. It outputs a result whenever an event occurs, covering all events within the last N minutes. Useful for detecting anomalies — "alert me if any single transaction exceeds 3 standard deviations from the mean over the last 10 minutes."

Session Window: Groups events that arrive close together with no gap larger than a specified timeout. Useful for user session analysis on websites — group clicks by user and end the session if the user is inactive for 30 minutes.

Reference Data in Stream Analytics

You can join streaming data against static reference data stored as a CSV or JSON file in Blob Storage. This enriches real-time events with lookup information — like joining a transaction stream with a product catalog to add product names and categories to each transaction record.

Azure IoT Hub — For Device Data

Azure IoT Hub is similar to Event Hubs but designed specifically for Internet of Things (IoT) devices. It supports bi-directional communication — the cloud can send commands back to devices. For pure data ingestion from devices, IoT Hub and Event Hubs work interchangeably as a Stream Analytics input.

A Complete Real-Time Pipeline Example

A ride-sharing company wants to detect surge demand in real time and notify drivers to move toward high-demand zones.

  1. The mobile app sends a ride request event to Azure Event Hubs every time a customer opens the app and requests a ride
  2. Azure Stream Analytics reads the events and counts ride requests by zone every 2 minutes using a Tumbling Window
  3. Stream Analytics checks if any zone exceeds 50 ride requests in that 2-minute window
  4. When the threshold is breached, Stream Analytics writes an alert record to Azure SQL Database
  5. A notification service reads from the SQL Database and sends push notifications to nearby drivers
  6. Event Hubs Capture saves all raw events to ADLS Gen2 for later batch analysis

Key Points

  • Use Event Hubs to ingest millions of events per second from any source
  • Consumer Groups let multiple applications read the same event stream independently
  • Enable Event Hubs Capture to archive all events to ADLS Gen2 automatically
  • Stream Analytics uses SQL syntax — minimal learning curve for data engineers who know SQL
  • Use Tumbling Windows for fixed-interval aggregations and Sliding Windows for continuous anomaly detection
  • Stream processing requires always-on infrastructure — evaluate whether real-time is truly needed before adding this complexity

Leave a Comment