ADE Semi-Structured and Unstructured Data

Real-world data does not always arrive as clean, rectangular tables. JSON files from APIs, XML exports from legacy systems, log files with variable formats, and nested documents from NoSQL databases are all common in data engineering work. This topic covers how to handle these formats effectively across Azure services.

Understanding Data Structure Types

Data is broadly classified into three structural types based on how organized it is when it arrives.

Structured data has a fixed, predefined schema — rows and columns with consistent types. Examples: SQL database tables, CSV files with consistent headers. Every record has the same fields.

Semi-structured data has some organization but no rigid schema. Fields can vary between records and values can be nested. Examples: JSON, XML, Avro, Parquet (when columns contain complex types). Most modern APIs return JSON.

Unstructured data has no schema at all. Examples: plain text files, images, audio, video, PDFs, raw log files. The data engineer typically extracts meaning from these using specific parsers or AI services.

Processing JSON Data

JSON (JavaScript Object Notation) is the most common format for API responses and event data. It supports nested objects and arrays, which SQL tables cannot represent directly.

Flattening JSON in PySpark

from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Schema for a nested JSON order record
order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer", StructType([
        StructField("name", StringType()),
        StructField("city", StringType())
    ])),
    StructField("items", ArrayType(StructType([
        StructField("product_id", StringType()),
        StructField("quantity", IntegerType()),
        StructField("price", IntegerType())
    ])))
])

# Read raw JSON
df_raw = spark.read.format("json") \
    .schema(order_schema) \
    .load("abfss://bronze@mystorage.dfs.core.windows.net/orders/")

# Flatten nested struct — access nested fields with dot notation
df_flat = df_raw.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.city").alias("customer_city")
)

# Explode array — one row per item per order
df_items = df_raw.select(
    col("order_id"),
    explode(col("items")).alias("item")
).select(
    col("order_id"),
    col("item.product_id"),
    col("item.quantity"),
    col("item.price")
)

Reading JSON from Event Hubs

Events from Event Hubs arrive as binary. Decode and parse them:

from pyspark.sql.functions import from_json, col

df_parsed = df_stream \
    .select(col("body").cast("string").alias("json_str")) \
    .select(from_json(col("json_str"), event_schema).alias("data")) \
    .select("data.*")

Processing JSON in ADF Data Flows

ADF Data Flows handle nested JSON through the Flatten transformation. It expands arrays into separate rows and brings nested object fields to the top level. No code required — configure it visually in the Data Flow canvas.

For simple JSON flattening without nested arrays, the Derived Column transformation with dot notation accesses nested fields directly: customer.city accesses the city field inside the customer object.

Processing XML Data

XML is commonly exported from enterprise systems like SAP, legacy ERP, and healthcare systems. PySpark does not natively support XML. Use the spark-xml library from Databricks or add it as a Maven dependency.

# Read XML with spark-xml library
df_xml = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "Order") \
    .load("abfss://bronze@mystorage.dfs.core.windows.net/orders.xml")

In ADF, the XML format is supported natively in the Copy Activity source and Data Flow source. Configure it by selecting XML as the file format and specifying the root and row element tags.

Handling Log Files

Application and web server logs are text files with variable, unstructured content. A typical web server access log line looks like:

192.168.1.1 - - [15/Jan/2024:10:23:45 +0000] "GET /products?id=123 HTTP/1.1" 200 4823

Parse these with regular expressions in PySpark:

from pyspark.sql.functions import regexp_extract

log_pattern = r'(\d+\.\d+\.\d+\.\d+) .+ \[(.+)\] "(\w+) (.+) HTTP.+" (\d{3}) (\d+)'

df_logs = spark.read.text("abfss://bronze@mystorage.dfs.core.windows.net/weblogs/")

df_parsed = df_logs.select(
    regexp_extract(col("value"), log_pattern, 1).alias("ip_address"),
    regexp_extract(col("value"), log_pattern, 2).alias("timestamp"),
    regexp_extract(col("value"), log_pattern, 3).alias("method"),
    regexp_extract(col("value"), log_pattern, 4).alias("url"),
    regexp_extract(col("value"), log_pattern, 5).cast("integer").alias("status_code"),
    regexp_extract(col("value"), log_pattern, 6).cast("integer").alias("bytes_sent")
)

Avro Format

Avro is a binary serialization format with a built-in schema. It is the default format for Event Hubs Capture and is commonly used in Kafka pipelines. Avro is compact and schema-enforced — every file includes the schema definition, so readers always know the data structure.

# Read Avro files from Event Hubs Capture
df_avro = spark.read.format("avro") \
    .load("abfss://bronze@mystorage.dfs.core.windows.net/eventhubs-capture/")

Delta Format for Nested Data

Delta tables support complex types including structs, arrays, and maps as column types. You can store semi-structured data in a Delta table and still query it with SQL using dot notation and the LATERAL VIEW EXPLODE syntax.

-- Query a nested struct column in a Delta table using Synapse or Databricks SQL
SELECT
    order_id,
    customer.name AS customer_name,
    customer.city AS customer_city
FROM silver.orders
WHERE customer.city = 'Mumbai';

Azure AI Services for Unstructured Data

Truly unstructured data — PDFs, images, audio — requires AI-powered extraction before it can be used in a data pipeline. Azure offers specialized services for this:

Azure AI Document Intelligence: Extracts structured data from PDFs, forms, invoices, and receipts
Azure AI Vision: Extracts text from images (OCR) and classifies image content
Azure AI Speech: Transcribes audio to text
Azure OpenAI: Processes and summarizes unstructured text using large language models

The typical pipeline pattern: unstructured file arrives in bronze layer → ADF triggers an Azure Function or Logic App → the Function calls an AI service API → the extracted structured data is written to the silver layer as JSON or Parquet.

Key Points

Use from_json in PySpark to parse JSON strings and explode to flatten arrays into separate rows
ADF Data Flows handle JSON nesting visually through the Flatten transformation — no code needed
Use the spark-xml library for XML processing in Databricks notebooks
Parse log files with regular expressions in PySpark to extract structured fields from unstructured text
Avro is the native format for Event Hubs Capture — read it with spark.read.format("avro")
Use Azure AI Document Intelligence and Azure AI Vision to extract structured data from PDFs and images before processing

Previous lesson

Back to course

Next lesson