Event Hub Capture and Stream Processing
Azure Event Hub Capture is a built-in feature that automatically saves all incoming events to Azure Blob Storage or Azure Data Lake Storage Gen2 as they arrive. Stream Processing with services like Azure Stream Analytics, Azure Functions, and Azure Databricks enables real-time analysis of the events flowing through Event Hub.
What Is Event Hub Capture?
Event Hub Capture continuously archives every event that flows through an Event Hub into a storage destination, without any additional code. The captured data is stored in Apache Avro format — a compact, binary, schema-embedded format suitable for big data processing.
Think of Event Hub Capture as a built-in DVR for your event stream. Events flow through Event Hub in real time (consumed immediately by analytics systems), and Capture simultaneously records everything to permanent storage for later replay or batch analysis.
Event Hub Capture Flow
Producers            +────────────────────────────────────+
(IoT, Apps) ──────>  |             EVENT HUB              |
                     |            "telemetry"             |
                     |                                    |
                     |    Events arrive in partitions     |
                     +──────────┬─────────────────────────+
                                |
               +────────────────+────────────────+
               |                                 |
               v (real-time stream)              v (capture)
+──────────────────────+         +───────────────────────────────+
| Consumer Applications|         | Azure Blob Storage            |
| - Stream Analytics   |         | OR Azure Data Lake Gen2       |
| - Azure Functions    |         |                               |
| - Azure Databricks   |         | Files saved in Avro format:   |
+──────────────────────+         | {Namespace}/{EventHub}/       |
                                 | {PartitionId}/                |
                                 | {Year}/{Month}/{Day}/         |
                                 | {Hour}/{Minute}/{Second}.avro |
                                 +───────────────────────────────+
Event Hub Capture File Organization
Captured files are organized in a hierarchical folder structure based on the namespace, event hub name, partition, and capture time window.
Storage Account: "eventhubarchive"
Container: "capture-data"
Folder structure:
capture-data/
  prod-events-ns/
    telemetry/
      0/                      (Partition 0)
        2024/
          06/
            15/
              10/
                00/
                  00.avro     (events captured 10:00:00–10:00:59, 60-second time window)
                  30.avro     (an extra mid-minute file if the size window is reached first)
              11/
                00/
                  00.avro
      1/                      (Partition 1)
        2024/
          ...
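The default name format can be sketched as a simple path builder (illustrative Python, not an Azure SDK call; the names below mirror the example above):

```python
from datetime import datetime, timezone

def capture_blob_path(namespace, event_hub, partition_id, window_start):
    # Default Capture name format:
    # {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
    return (f"{namespace}/{event_hub}/{partition_id}/"
            f"{window_start:%Y/%m/%d/%H/%M/%S}.avro")

path = capture_blob_path("prod-events-ns", "telemetry", 0,
                         datetime(2024, 6, 15, 10, 0, 0, tzinfo=timezone.utc))
# → "prod-events-ns/telemetry/0/2024/06/15/10/00/00.avro"
```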
The Avro File Format
Apache Avro is a binary serialization format. Each Avro file has an embedded schema that describes the data structure. Tools like Azure Data Factory, Azure Databricks, Azure Synapse Analytics, and Apache Spark can read Avro files natively.
Each record in the captured Avro file contains the event body plus Event Hub metadata:
| Avro Field | Description |
|---|---|
| SequenceNumber | Event's sequence number within the partition |
| Offset | Event's offset within the partition |
| EnqueuedTimeUtc | Time the event arrived in Event Hub (UTC) |
| SystemProperties | Publisher, partition key, and other system metadata |
| Properties | User-defined properties attached by the producer |
| Body | The raw event payload as a byte array |
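To make the record layout concrete, here is one deserialized Capture record as a Python dict (all values illustrative; real files are read with an Avro library such as fastavro or via Spark's Avro reader):

```python
import json

# One deserialized Capture record (illustrative values)
record = {
    "SequenceNumber": 42,
    "Offset": "8589934592",
    "EnqueuedTimeUtc": "6/15/2024 10:00:03 AM",
    "SystemProperties": {"x-opt-partition-key": "device-7"},
    "Properties": {"source": "sensor"},
    "Body": b'{"deviceId": "device-7", "temperature": 88.5}',
}

# Body is a raw byte array; producers that send JSON can be decoded like this:
event = json.loads(record["Body"].decode("utf-8"))
# → {"deviceId": "device-7", "temperature": 88.5}
```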
Configuring Event Hub Capture
Prerequisites
- Standard, Premium, or Dedicated Event Hub tier (Capture is not available in Basic)
- Azure Blob Storage account or Azure Data Lake Storage Gen2 account in the same or linked region
Capture Configuration Settings
| Setting | Description | Range / Default |
|---|---|---|
| Capture enabled | Turn Capture on or off | On / Off |
| Time window | How often to create a new Avro file (in seconds) | 60 to 900 seconds (default: 300) |
| Size window | Create a new file when current file reaches this size | 10 MB to 500 MB (default: 300 MB) |
| Destination | Storage account and container for captured files | Required |
| File name format | Custom naming pattern for captured files | Default: {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second} |
| Do not emit empty files | Skip creating Avro files when no events arrived in the time window | Enabled / Disabled |
A new Avro file is created when either the time window or the size window is reached — whichever comes first.
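The rotation rule can be sketched as a simple predicate (a hypothetical helper for illustration, not part of any SDK; units are seconds and bytes):

```python
def should_roll_file(elapsed_seconds, file_size_bytes,
                     time_window=300, size_window=300 * 1024 * 1024):
    # Close the current Avro file when either limit is reached, whichever first.
    return elapsed_seconds >= time_window or file_size_bytes >= size_window

should_roll_file(120, 50 * 1024 * 1024)   # → False: neither limit reached yet
should_roll_file(301, 1024)               # → True: time window elapsed
should_roll_file(10, 300 * 1024 * 1024)   # → True: size window reached first
```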
Enabling Capture in Azure Portal
- Open the Event Hub (inside the namespace)
- Click Capture in the left menu
- Toggle Capture to On
- Select the destination: Azure Blob Storage or Azure Data Lake Storage Gen2
- Choose the Storage Account and Container
- Set Time Window and Size Window
- Optionally configure a custom filename format
- Click Save
Capture Use Case – Data Warehouse Loading
SCENARIO: E-commerce clickstream analytics
1. Website publishes clickstream events to Event Hub "clickstream"
(10,000 events/minute at peak)
2. Event Hub Capture saves Avro files to Blob Storage every 5 minutes
3. Azure Data Factory pipeline runs hourly:
- Reads new Avro files from Blob Storage
- Transforms and loads data into Azure Synapse Analytics (data warehouse)
4. Power BI reports query Synapse for daily and weekly metrics
RESULT: Real-time streaming + long-term analytics storage
using a single Event Hub with Capture enabled
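The hourly pickup in step 3 can be sketched as prefix filtering over captured blob names (hypothetical names following the default format; a real pipeline would list blobs through the storage SDK or a Data Factory dataset):

```python
from datetime import datetime

# Hypothetical captured blob names, default Capture name format
blobs = [
    "prod-events-ns/clickstream/0/2024/06/15/09/55/00.avro",
    "prod-events-ns/clickstream/0/2024/06/15/10/00/00.avro",
    "prod-events-ns/clickstream/1/2024/06/15/10/05/00.avro",
]

def blobs_for_hour(blobs, namespace, hub, hour):
    # Any partition, one hour: {Namespace}/{EventHub}/{PartitionId}/{Y}/{m}/{d}/{H}/...
    hour_segment = f"/{hour:%Y/%m/%d/%H}/"
    return [b for b in blobs
            if b.startswith(f"{namespace}/{hub}/") and hour_segment in b]

new_files = blobs_for_hour(blobs, "prod-events-ns", "clickstream",
                           datetime(2024, 6, 15, 10))
# Selects only the two files captured during hour 10
```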
Stream Processing with Azure Stream Analytics
Azure Stream Analytics is a fully managed real-time analytics service that connects directly to Event Hub as an input. It processes the continuous event stream using SQL-like queries and outputs results to various destinations.
Stream Analytics + Event Hub Architecture
IoT Devices --> Event Hub --> Stream Analytics Job --> Azure SQL Database
                                                   --> Power BI Streaming Dataset
                                                   --> Azure Blob Storage
                                                   --> Another Event Hub (for downstream)
Stream Analytics Query Example – Anomaly Detection
Stream Analytics SQL query:
SELECT
deviceId,
AVG(temperature) AS avgTemperature,
MAX(temperature) AS maxTemperature,
System.Timestamp() AS windowEndTime
FROM
[EventHubInput] TIMESTAMP BY timestamp
GROUP BY
deviceId,
TumblingWindow(minute, 5)
HAVING
AVG(temperature) > 85
This query calculates a 5-minute average temperature per device. When the average exceeds 85°C, Stream Analytics writes a record to the output destination — which could trigger an alert in a Power BI dashboard or insert a row into an Azure SQL alert table.
Stream Analytics Window Types
| Window Type | Description | Example Use |
|---|---|---|
| Tumbling Window | Fixed-size, non-overlapping time segments | Aggregate every 5 minutes |
| Hopping Window | Fixed-size windows that overlap | Every-minute report covering last 5 minutes |
| Sliding Window | Emit output whenever an event occurs; window covers preceding period | Alert within 30 seconds of a threshold breach |
| Session Window | Groups events with no inactivity gap longer than the specified timeout | Grouping user session clicks with no 30-minute gap |
| Snapshot Window | Groups events that share the same timestamp | Grouping simultaneous events from multiple sensors |
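The tumbling window used in the anomaly query above can be sketched in plain Python (illustrative event tuples; Stream Analytics evaluates this incrementally over the live stream rather than on a list):

```python
from collections import defaultdict
from datetime import datetime, timedelta

events = [  # (deviceId, temperature, event time) — illustrative values
    ("dev-1", 80.0, datetime(2024, 6, 15, 10, 0, 10)),
    ("dev-1", 95.0, datetime(2024, 6, 15, 10, 2, 30)),
    ("dev-1", 90.0, datetime(2024, 6, 15, 10, 6, 0)),
]

def tumbling_avg(events, window=timedelta(minutes=5)):
    # Fixed, non-overlapping windows: each event falls in exactly one bucket.
    buckets = defaultdict(list)
    for device, temp, ts in events:
        start = datetime.min + ((ts - datetime.min) // window) * window
        buckets[(device, start)].append(temp)
    return {key: sum(v) / len(v) for key, v in buckets.items()}

averages = tumbling_avg(events)
# Window 10:00–10:05 averages 87.5; window 10:05–10:10 averages 90.0
```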
Stream Processing with Azure Databricks
Azure Databricks supports Structured Streaming, which processes Event Hub events as a continuous stream using Apache Spark. Databricks is preferred for complex transformations, machine learning pipelines, and scenarios requiring full Spark capabilities.
Databricks + Event Hub – Key Integration Points
Databricks reads from Event Hub using the azure-eventhubs-spark connector.
Spark Structured Streaming query:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Connector configuration; the connection string must be encrypted with the
# connector's EventHubsUtils helper (placeholder value shown)
connection_string = "Endpoint=sb://...;EntityPath=telemetry"
ehConf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

# Parse the JSON body (illustrative schema: deviceId and temperature fields)
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])
parsed = df.withColumn("body", from_json(col("body").cast("string"), schema))

# Filter anomalies
anomalies = parsed.filter(col("body.temperature") > 85)

# Write to Delta Lake for further analysis
anomalies.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/anomalies") \
    .start("/delta/anomalies")
Capture vs Streaming – Complementary Roles
| Aspect | Event Hub Capture | Stream Processing (ASA / Databricks) |
|---|---|---|
| Purpose | Archive all events for long-term storage | Analyze events in real time or near real time |
| Latency | Minutes (file-based, not real-time) | Seconds to milliseconds |
| Output format | Avro files in Blob/ADLS | SQL tables, dashboards, alerts, downstream queues |
| Coding required | No code — configuration only | SQL (ASA) or Python/Scala (Databricks) |
| Best for | Historical analysis, compliance, batch ETL | Alerts, dashboards, real-time decisions |
Complete Architecture: IoT Analytics Pipeline
              10,000 IoT Sensors
                      |
                      v
            +──────────────────+
            |  Azure Event Hub |
            |  "iot-telemetry" |
            |   8 Partitions   |
            +──────┬───────────+
                   |
     +─────────────────────────────────+
     |                                 |
     v (Capture)                       v (Stream)
+────────────────────+    +──────────────────────────+
| Azure Blob Storage |    | Azure Stream Analytics   |
| Avro files every   |    | Real-time anomaly SQL    |
| 5 minutes          |    | query every 1 minute     |
+────────┬───────────+    +────────────┬─────────────+
         |                             |
         v (batch ETL, hourly)         v
+────────────────────+    +──────────────────────────+
| Azure Synapse      |    | Power BI Streaming       |
| Analytics          |    | Dashboard (live alerts)  |
| Historical reports |    +──────────────────────────+
+────────────────────+
Summary
Event Hub Capture provides zero-code, automatic archiving of all events to Blob Storage or ADLS in Avro format. Stream processing with Azure Stream Analytics or Azure Databricks enables real-time analysis with SQL-like windowed queries. These two capabilities work simultaneously — Capture archives everything while stream processors analyze events in real time — delivering both immediate insights and long-term data retention in a single pipeline.
