Event Hub Capture and Stream Processing

Azure Event Hub Capture is a built-in feature that automatically saves all incoming events to Azure Blob Storage or Azure Data Lake Storage Gen2 as they arrive. Stream Processing with services like Azure Stream Analytics, Azure Functions, and Azure Databricks enables real-time analysis of the events flowing through Event Hub.

What Is Event Hub Capture?

Event Hub Capture continuously archives every event that flows through an Event Hub into a storage destination, without any additional code. The captured data is stored in Apache Avro format — a compact, binary, schema-embedded format suitable for big data processing.

Think of Event Hub Capture as a built-in DVR for your event stream. Events flow through Event Hub in real time (consumed immediately by analytics systems), and Capture simultaneously records everything to permanent storage for later replay or batch analysis.

Event Hub Capture Flow

                     +────────────────────────────────────+
Producers            |         EVENT HUB                  |
(IoT, Apps)  ──────> |         "telemetry"                |
                     |                                    |
                     |  Events arrive in partitions       |
                     +──────────┬─────────────────────────+
                                |
               +────────────────+────────────────+
               |                                 |
               v (real-time stream)               v (capture)
    +──────────────────────+          +───────────────────────────────+
    | Consumer Applications|          | Azure Blob Storage             |
    | - Stream Analytics   |          | OR Azure Data Lake Gen2        |
    | - Azure Functions    |          |                                |
    | - Azure Databricks   |          |  Files saved in Avro format:  |
    +──────────────────────+          |  {Namespace}/{EventHub}/       |
                                      |  {PartitionId}/               |
                                      |  {Year}/{Month}/{Day}/        |
                                      |  {Hour}/{Minute}/{Second}.avro|
                                      +───────────────────────────────+

Event Hub Capture File Organization

Captured files are organized in a hierarchical folder structure based on the namespace, event hub name, partition, and capture time window.

Storage Account: "eventhubarchive"
Container: "capture-data"

Folder structure:
capture-data/
  prod-events-ns/
    telemetry/
      0/             (Partition 0)
        2024/
          06/
            15/
              10/
                00/
                  00.avro   (capture window starting 10:00:00)
                  30.avro   (if time window = 30 seconds)
              11/
                00/
                  00.avro
      1/             (Partition 1)
        2024/
          ...
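Under the default name format, the blob path for a capture window can be derived mechanically from the window's start time. A minimal Python sketch (the namespace and event hub names come from the example above; the two-digit zero-padding is assumed from the sample layout):

```python
from datetime import datetime, timezone

def capture_blob_path(namespace, eventhub, partition_id, window_start):
    """Build the default Capture path {Namespace}/{EventHub}/{PartitionId}/
    {Year}/{Month}/{Day}/{Hour}/{Minute}/{Second} for a window start time."""
    t = window_start
    return (f"{namespace}/{eventhub}/{partition_id}/"
            f"{t.year}/{t.month:02d}/{t.day:02d}/"
            f"{t.hour:02d}/{t.minute:02d}/{t.second:02d}")

window_start = datetime(2024, 6, 15, 10, 0, 0, tzinfo=timezone.utc)
print(capture_blob_path("prod-events-ns", "telemetry", 0, window_start) + ".avro")
# prod-events-ns/telemetry/0/2024/06/15/10/00/00.avro
```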

The Avro File Format

Apache Avro is a binary serialization format. Each Avro file has an embedded schema that describes the data structure. Tools like Azure Data Factory, Azure Databricks, Azure Synapse Analytics, and Apache Spark can read Avro files natively.

Each record in the captured Avro file contains the event body plus Event Hub metadata:

Avro Field        | Description
SequenceNumber    | Event's sequence number within the partition
Offset            | Event's offset within the partition
EnqueuedTimeUtc   | Time the event arrived in Event Hub (UTC)
SystemProperties  | Publisher, partition key, and other system metadata
Properties        | User-defined properties attached by the producer
Body              | The raw event payload as a byte array
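For illustration, here is what one decoded record might look like once an Avro reader (such as fastavro, which yields records as dictionaries) has parsed a capture file. The values are made up; the field names match the table above:

```python
import json

# One decoded Capture record (illustrative values; field names match
# the table above). Avro readers yield each record as a dict like this.
record = {
    "SequenceNumber": 42,
    "Offset": "12345",
    "EnqueuedTimeUtc": "6/15/2024 10:00:03 AM",
    "SystemProperties": {"x-opt-partition-key": "device-7"},
    "Properties": {"source": "sensor"},
    "Body": b'{"deviceId": "device-7", "temperature": 21.5}',
}

# Body is a raw byte array; producers commonly send JSON, so decode it.
payload = json.loads(record["Body"].decode("utf-8"))
print(payload["temperature"])  # 21.5
```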

Configuring Event Hub Capture

Prerequisites

  • Standard, Premium, or Dedicated Event Hub tier (Capture is not available in Basic)
  • Azure Blob Storage account or Azure Data Lake Storage Gen2 account in the same or linked region

Capture Configuration Settings

Setting                 | Description                                                         | Range / Default
Capture enabled         | Turn Capture on or off                                              | On / Off
Time window             | How often to create a new Avro file (in seconds)                    | 60 to 900 seconds (default: 300)
Size window             | Create a new file when the current file reaches this size           | 10 MB to 500 MB (default: 300 MB)
Destination             | Storage account and container for captured files                    | Required
File name format        | Custom naming pattern for captured files                            | Default: {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
Do not emit empty files | Skip creating Avro files when no events arrived in the time window  | Enabled / Disabled

A new Avro file is created when either the time window or the size window is reached — whichever comes first.
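The "whichever comes first" rule can be sanity-checked with quick arithmetic. A sketch using illustrative throughput numbers (the event rate and size are assumptions, not Azure figures):

```python
# Which rollover fires first? Time window of 300 s vs size window of
# 300 MB, at an assumed ingest rate (illustrative numbers, not Azure's).
time_window_s = 300
size_window_bytes = 300 * 1024 * 1024

events_per_second = 2_000
avg_event_bytes = 1_024  # ~1 KB per event

seconds_to_fill_size = size_window_bytes / (events_per_second * avg_event_bytes)
print(round(seconds_to_fill_size, 1))  # 153.6 -> the size window wins here

# A new file is cut after min(time window, time to hit the size window)
rollover_after_s = min(time_window_s, seconds_to_fill_size)
```

At half that ingest rate the size window would take about 307 seconds to fill, so the 300-second time window would trigger first instead.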

Enabling Capture in Azure Portal

  1. Open the Event Hub (inside the namespace)
  2. Click Capture in the left menu
  3. Toggle Capture to On
  4. Select the destination: Azure Blob Storage or Azure Data Lake Storage Gen2
  5. Choose the Storage Account and Container
  6. Set Time Window and Size Window
  7. Optionally configure a custom filename format
  8. Click Save

Capture Use Case – Data Warehouse Loading

SCENARIO: E-commerce clickstream analytics

1. Website publishes clickstream events to Event Hub "clickstream"
   (10,000 events/minute at peak)

2. Event Hub Capture saves Avro files to Blob Storage every 5 minutes

3. Azure Data Factory pipeline runs hourly:
   - Reads new Avro files from Blob Storage
   - Transforms and loads data into Azure Synapse Analytics (data warehouse)

4. Power BI reports query Synapse for daily and weekly metrics

RESULT: Real-time streaming + long-term analytics storage
        using a single Event Hub with Capture enabled

Stream Processing with Azure Stream Analytics

Azure Stream Analytics is a fully managed real-time analytics service that connects directly to Event Hub as an input. It processes the continuous event stream using SQL-like queries and outputs results to various destinations.

Stream Analytics + Event Hub Architecture

IoT Devices --> Event Hub --> Stream Analytics Job --> Azure SQL Database
                                                  --> Power BI Streaming Dataset
                                                  --> Azure Blob Storage
                                                  --> Another Event Hub (for downstream)

Stream Analytics Query Example – Anomaly Detection

Stream Analytics SQL query:

SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    MAX(temperature) AS maxTemperature,
    System.Timestamp() AS windowEndTime
FROM
    [EventHubInput] TIMESTAMP BY timestamp
GROUP BY
    deviceId,
    TumblingWindow(minute, 5)
HAVING
    AVG(temperature) > 85

This query calculates a 5-minute average temperature per device. When the average exceeds 85°C, Stream Analytics writes a record to the output destination — which could trigger an alert in a Power BI dashboard or insert a row into an Azure SQL alert table.
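The windowed aggregation above can be mimicked in plain Python to make the semantics concrete. This is a rough analogue with made-up event data, not the Stream Analytics engine:

```python
from collections import defaultdict

# A rough Python analogue of the query above: 5-minute tumbling averages
# per device, keeping only windows whose average exceeds 85.
# (deviceId, seconds since stream start, temperature) -- illustrative data
events = [
    ("dev-1",  30, 80.0), ("dev-1", 150, 95.0), ("dev-1", 280, 90.0),
    ("dev-2",  60, 70.0), ("dev-2", 200, 72.0),
]

window = 300  # 5 minutes, matching TumblingWindow(minute, 5)
acc = defaultdict(lambda: [0.0, 0])
for device, ts, temp in events:
    key = (device, (ts // window) * window)  # window start = GROUP BY bucket
    acc[key][0] += temp
    acc[key][1] += 1

# HAVING AVG(temperature) > 85
alerts = {k: s / n for k, (s, n) in acc.items() if s / n > 85}
print(alerts)  # only dev-1's 0-300 s window (average ~88.3) remains
```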

Stream Analytics Window Types

Window Type     | Description                                                              | Example Use
Tumbling Window | Fixed-size, non-overlapping time segments                                | Aggregate every 5 minutes
Hopping Window  | Fixed-size windows that overlap                                          | Every-minute report covering the last 5 minutes
Sliding Window  | Emits output whenever an event occurs; window covers the preceding period | Alert within 30 seconds of a threshold breach
Session Window  | Groups events with no inactivity gap longer than the specified timeout  | Grouping user session clicks with no 30-minute gap
Snapshot Window | Groups events that share the same timestamp                              | Grouping simultaneous events from multiple sensors
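The difference between tumbling and hopping windows comes down to which windows an event belongs to. A toy Python sketch of the assignment semantics (not the ASA engine; timestamps are seconds since stream start):

```python
# Tumbling vs hopping window assignment for a single event timestamp.
def tumbling(ts, size):
    """Each event falls into exactly one fixed, non-overlapping window."""
    start = (ts // size) * size
    return [(start, start + size)]

def hopping(ts, size, hop):
    """Fixed-size windows advancing by `hop`; an event can fall in several."""
    k_min = max(0, (ts - size) // hop + 1)
    k_max = ts // hop
    return [(k * hop, k * hop + size) for k in range(k_min, k_max + 1)]

print(tumbling(130, 300))     # [(0, 300)]
print(hopping(130, 300, 60))  # [(0, 300), (60, 360), (120, 420)]
```

An event at t=130 s lands in one 5-minute tumbling window but in three overlapping 5-minute hopping windows that advance every 60 seconds, which is why a hopping window can report every minute on the last 5 minutes of data.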

Stream Processing with Azure Databricks

Azure Databricks supports Structured Streaming, which processes Event Hub events as a continuous stream using Apache Spark. Databricks is preferred for complex transformations, machine learning pipelines, and scenarios requiring full Spark capabilities.

Databricks + Event Hub – Key Integration Points

Databricks reads Event Hub using the azure-eventhubs-spark connector.

Spark Structured Streaming query:

# ehConf holds the Event Hub connection settings, e.g.:
# ehConf = {"eventhubs.connectionString":
#     sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connStr)}
df = spark.readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

# Parse the JSON body (the schema fields shown here are illustrative)
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])
parsed = df.withColumn("body", from_json(col("body").cast("string"), schema))

# Filter anomalies
anomalies = parsed.filter(col("body.temperature") > 85)

# Write to Delta Lake for further analysis
anomalies.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/checkpoints/anomalies") \
  .start("/delta/anomalies")

Capture vs Streaming – Complementary Roles

Aspect          | Event Hub Capture                          | Stream Processing (ASA / Databricks)
Purpose         | Archive all events for long-term storage   | Analyze events in real time or near-real time
Latency         | Minutes (file-based, not real-time)        | Seconds to milliseconds
Output format   | Avro files in Blob/ADLS                    | SQL tables, dashboards, alerts, downstream queues
Coding required | No code — configuration only               | SQL (ASA) or Python/Scala (Databricks)
Best for        | Historical analysis, compliance, batch ETL | Alerts, dashboards, real-time decisions

Complete Architecture: IoT Analytics Pipeline

10,000 IoT Sensors
       |
       v
+──────────────────+
| Azure Event Hub  |
| "iot-telemetry"  |
| 8 Partitions     |
+──────┬───────────+
       |
       +─────────────────────────────────────────+
       |                                         |
       v (Capture)                               v (Stream)
+────────────────────+                 +──────────────────────────+
| Azure Blob Storage |                 | Azure Stream Analytics   |
| Avro files every   |                 | Real-time anomaly SQL    |
| 5 minutes          |                 | query every 1 minute     |
+────────┬───────────+                 +──────────────────────────+
         |                                         |
         v (batch ETL, hourly)                     v
+────────────────────+                 +──────────────────────────+
| Azure Synapse      |                 | Power BI Streaming       |
| Analytics          |                 | Dashboard (live alerts)  |
| Historical reports |                 +──────────────────────────+
+────────────────────+

Summary

Event Hub Capture provides zero-code, automatic archiving of all events to Blob Storage or ADLS in Avro format. Stream processing with Azure Stream Analytics or Azure Databricks enables real-time analysis with SQL-like windowed queries. These two capabilities work simultaneously — Capture archives everything while stream processors analyze events in real time — delivering both immediate insights and long-term data retention in a single pipeline.
