ADE Monitoring, Logging, and Alerting

A data platform without monitoring is a black box. You do not know when pipelines fail, when performance degrades, or when data quality drops — until someone complains. Monitoring turns the black box into a transparent system where problems surface immediately and the team can act before business impact occurs.

The Three Layers of Monitoring

A complete monitoring strategy covers three layers, each answering a different question.

Infrastructure monitoring: Is the compute running? Are resources healthy? Is there enough capacity?
Pipeline monitoring: Did the pipeline run? Did it succeed? How long did it take?
Data quality monitoring: Is the data itself correct? Are row counts as expected? Are there unexpected nulls or outliers?

Azure Monitor — The Central Monitoring Platform

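Azure Monitor is the unified monitoring service that collects metrics and logs from Azure resources. Services such as ADF, Databricks, Synapse, ADLS Gen2, and Azure SQL all integrate with it. Most services emit platform metrics automatically, while resource logs generally reach Azure Monitor only after you configure a diagnostic setting on the resource.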

Metrics

Metrics are numeric values measured at regular intervals. They answer questions like: How many pipeline runs completed in the last hour? What is the current CPU usage on my Databricks cluster? How much storage is consumed in ADLS Gen2? Metrics are lightweight, always-on, and retained for 93 days by default.

Logs

Logs are detailed records of events: exactly what happened, when, and with what result. Examples include pipeline activity logs, query execution logs, and authentication events. Logs go to a Log Analytics Workspace, where you query them using KQL (Kusto Query Language).

// KQL — Find all failed ADF pipeline runs in the last 24 hours
ADFPipelineRun
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project PipelineName, Start, End, Status, ErrorMessage
| order by Start desc

// KQL — Count pipeline runs by status for the last 7 days
ADFPipelineRun
| where TimeGenerated > ago(7d)
| summarize RunCount = count() by Status
| render piechart

Monitoring Azure Data Factory

ADF has a built-in Monitor tab in ADF Studio. It shows every pipeline run, trigger run, and activity run with status, start time, duration, and error details. This is the first place to check when investigating a pipeline failure.

Diagnostic Settings

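ADF keeps pipeline run history for 45 days. To retain it longer and enable advanced KQL queries, configure Diagnostic Settings on the ADF instance to send logs to a Log Analytics Workspace. Select the resource-specific destination table option so that runs land in dedicated tables such as ADFPipelineRun rather than the shared AzureDiagnostics table. Categories to enable: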

PipelineRuns
ActivityRuns
TriggerRuns
SandboxPipelineRuns (for debug runs started from ADF Studio)

ADF Alert Rules

Create alert rules in Azure Monitor that fire when pipeline failures exceed a threshold. For production pipelines, set an alert for any single failure. For less critical pipelines, alert when failure rate exceeds 20% over an hour.
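
The two thresholds above can be sketched as a small decision function. This is only an illustration of the logic, not how Azure Monitor evaluates alert rules internally; PipelineRun, should_alert, and the critical flag are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PipelineRun:
    # Hypothetical record; in practice these values would come from run logs.
    pipeline_name: str
    status: str  # e.g. "Succeeded" or "Failed"


def should_alert(runs: Iterable[PipelineRun], critical: bool,
                 failure_rate_threshold: float = 0.20) -> bool:
    """Decide whether the runs from one evaluation window warrant an alert."""
    runs = list(runs)
    if not runs:
        return False  # nothing ran in the window, so nothing to evaluate
    failures = sum(1 for run in runs if run.status == "Failed")
    if critical:
        # Production pipelines: any single failure triggers the alert.
        return failures >= 1
    # Less critical pipelines: alert only when the failure rate exceeds the threshold.
    return failures / len(runs) > failure_rate_threshold


last_hour = [PipelineRun("load_sales", "Succeeded")] * 4 + [PipelineRun("load_sales", "Failed")]
print(should_alert(last_hour, critical=True))   # True: one failure is enough
print(should_alert(last_hour, critical=False))  # False: 1 of 5 is exactly 20%, not above it
```

Note the strict comparison: a failure rate of exactly 20% does not fire the alert, which matches "exceeds 20%".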

Alerts route to Action Groups, which send notifications via email, SMS, Azure Logic App, or webhook. A Logic App can post a formatted failure message to a Microsoft Teams channel including the pipeline name, error message, and a link to the failed run.
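
As a rough sketch of what such a formatted message could contain, the helper below assembles the three pieces of information into a JSON body. The function names and the {"text": ...} payload shape are assumptions for illustration; the shape your Logic App or webhook actually expects may differ, and the run link is passed in rather than constructed because its format depends on your environment.

```python
import json


def build_failure_message(pipeline_name: str, error_message: str, run_url: str) -> str:
    """Format the pipeline name, error, and run link as a short text message."""
    lines = [
        f"Pipeline failed: {pipeline_name}",
        f"Error: {error_message}",
        f"Run details: {run_url}",
    ]
    return "\n".join(lines)


def build_payload(pipeline_name: str, error_message: str, run_url: str) -> str:
    """Wrap the message in a JSON body (assumed shape: a single 'text' field)."""
    message = build_failure_message(pipeline_name, error_message, run_url)
    return json.dumps({"text": message})


# Placeholder values only; the URL is not a real monitoring link.
payload = build_payload(
    "load_sales",
    "Source file not found",
    "https://example.invalid/runs/placeholder-run-id",
)
print(payload)
```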

Monitoring Azure Databricks

Cluster Event Logs

Databricks records all cluster events — start, stop, auto-scaling, termination. Access them from the cluster detail page. Use these logs to verify that clusters terminated on schedule and to investigate unexpected terminations.

Spark UI

The Spark UI is the most valuable debugging tool for slow Databricks jobs. It shows every Spark stage, task, and execution plan. Diagnose data skew by looking for tasks that take 10× longer than others. Identify shuffle operations that are causing memory spill. Access the Spark UI from the cluster page while a job is running or from job run history after completion.
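
The 10× rule of thumb can be expressed as a small check. The sketch below assumes you have copied task durations for one stage out of the Spark UI into a dictionary; find_skewed_tasks is a hypothetical helper, not a Spark API, and using the median as the "typical" duration is a choice made here.

```python
from statistics import median
from typing import Dict, List


def find_skewed_tasks(task_durations_sec: Dict[int, float], factor: float = 10.0) -> List[int]:
    """Return IDs of tasks that ran at least `factor` times longer than the median task."""
    if not task_durations_sec:
        return []
    typical = median(task_durations_sec.values())
    if typical <= 0:
        return []  # cannot compare against a zero-length typical task
    return sorted(
        task_id
        for task_id, duration in task_durations_sec.items()
        if duration >= factor * typical
    )


# Illustrative durations for one stage: four similar tasks and one straggler.
stage_tasks = {0: 2.0, 1: 2.5, 2: 1.8, 3: 2.2, 4: 31.0}
print(find_skewed_tasks(stage_tasks))  # [4]
```

A single straggler like task 4 usually points at one oversized partition, which is the typical signature of a skewed join or group-by key.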

Ganglia Metrics

Ganglia provides real-time metrics on cluster CPU, memory, and network usage. Access it from the Metrics tab of a running cluster. Newer Databricks Runtime versions replace Ganglia with a built-in compute metrics view that shows the same kinds of charts. High memory usage with frequent garbage collection indicates the cluster needs more memory per node or the code needs optimization.

Structured Logging in Notebooks

Add log statements inside notebooks to record processing metrics — rows read, rows written, rows rejected, processing duration. Write these metrics to a logging table in Delta format for dashboards and alerts.

import time
from datetime import datetime

start_time = time.time()

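# Your transformation code here (df_raw and output_path are defined earlier in the notebook)
df_processed = df_raw.filter(...).withColumn(...)
# count() runs its own Spark job, so the transformation is computed once here and again for the write
rows_written = df_processed.count()
df_processed.write.format("delta").mode("overwrite").save(output_path)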

duration = time.time() - start_time

# Write metrics to a log table
log_entry = [(datetime.now(), "transform_sales", rows_written, duration, "SUCCESS")]
df_log = spark.createDataFrame(log_entry, ["run_time", "job_name", "rows_written", "duration_sec", "status"])
df_log.write.format("delta").mode("append").save("abfss://logs@mystorageaccount.dfs.core.windows.net/job_runs/")
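
The example above records only successful runs. A failed run is often the one you most want in the log table, so one possible extension is a small wrapper that builds the log row whether the job succeeds or fails. This is a sketch under assumptions: run_with_metrics is a hypothetical helper, and the job is modelled as a function that returns the number of rows written.

```python
import time
from datetime import datetime
from typing import Callable, Optional, Tuple

# Same column order as the log table above:
# run_time, job_name, rows_written, duration_sec, status
LogRow = Tuple[datetime, str, Optional[int], float, str]


def run_with_metrics(job_name: str, job: Callable[[], int]) -> LogRow:
    """Run job() and return one log row, with status SUCCESS or FAILED."""
    run_time = datetime.now()
    started = time.perf_counter()
    try:
        rows_written: Optional[int] = job()
        status = "SUCCESS"
    except Exception:
        # In a real notebook you would write the log row and then re-raise,
        # so the orchestrator still sees the failure.
        rows_written = None
        status = "FAILED"
    duration_sec = time.perf_counter() - started
    return (run_time, job_name, rows_written, duration_sec, status)


def failing_job() -> int:
    raise RuntimeError("simulated failure")


print(run_with_metrics("transform_sales", lambda: 45000)[4])  # SUCCESS
print(run_with_metrics("transform_sales", failing_job)[4])    # FAILED
```

The returned tuple can be passed to spark.createDataFrame in the same way as log_entry above. Because rows_written is None for failed runs, an explicit schema is the safer choice there, since Spark cannot infer a column type from a null value alone.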

Data Quality Monitoring

Infrastructure being healthy does not mean the data is correct. A pipeline can succeed technically while writing entirely wrong data. Data quality monitoring checks the data itself.

Row Count Checks

After every load, compare the row count in the destination against expectations. If yesterday's sales file had 45,000 rows and today's has 3,000, something is wrong, even if the pipeline ran successfully. A drop of more than 90% in row count is a data quality alert, not a pipeline alert.
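
A minimal sketch of such a check, comparing today's count with the previous load. The function names and the 50% default threshold are assumptions chosen for illustration; a suitable threshold depends on how much your volumes normally vary.

```python
def row_count_drop(previous: int, current: int) -> float:
    """Return the fractional drop from previous to current (0.0 if there is no drop)."""
    if previous <= 0:
        return 0.0  # no baseline to compare against
    return max(0.0, (previous - current) / previous)


def check_row_count(previous: int, current: int, max_drop: float = 0.5) -> None:
    """Raise if the row count fell by more than max_drop relative to the previous load."""
    drop = row_count_drop(previous, current)
    if drop > max_drop:
        raise ValueError(
            f"Data quality failure: row count fell from {previous} to {current} "
            f"({drop:.0%} drop, limit {max_drop:.0%})"
        )


# The example from the text: 45,000 rows yesterday, 3,000 today.
print(f"{row_count_drop(45_000, 3_000):.1%}")  # 93.3%
try:
    check_row_count(45_000, 3_000)
except ValueError as err:
    print(err)
```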

Null Checks

Define rules for columns that must never be null. A fact table's amount column should never be null. Run a check after each load and alert or halt processing if violations are found.

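# Data quality check in Databricks
from pyspark.sql.functions import col

# Count rows where the mandatory 'amount' column is null
null_count = df_loaded.filter(col("amount").isNull()).count()
if null_count > 0:
    # Raising stops the notebook so bad data does not flow downstream
    raise ValueError(f"Data quality failure: {null_count} null values found in 'amount' column")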

Great Expectations — A Data Quality Framework

Great Expectations is a popular open-source Python library for data quality validation. You define expectations, which are rules about what the data should look like, and run them against your DataFrames. Results are logged, and you can configure the pipeline to halt on failures before bad data propagates downstream.

Building a Monitoring Dashboard

Collect all monitoring metrics in one dashboard so the team can see the health of the entire data platform at a glance. Use Azure Workbooks inside Azure Monitor for a no-code approach, or connect Power BI to your Log Analytics Workspace and job metrics tables for a more custom dashboard.

A good data platform monitoring dashboard shows:

Pipeline success rate over the last 24 hours
Average and maximum pipeline run duration over the past 7 days
Row counts by pipeline, trended over 30 days
Current cluster status and cost
Open alerts and incidents
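
The first two figures in the list above are simple aggregates over run records. The sketch below shows the arithmetic on in-memory data; RunRecord and summarize_runs are hypothetical names, and in practice the same numbers would more likely come from a KQL query or from the Delta log table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical record of one pipeline run within the reporting window.
    status: str          # "Succeeded" or "Failed"
    duration_sec: float


def summarize_runs(runs: List[RunRecord]) -> Dict[str, float]:
    """Compute success rate plus average and maximum duration for a set of runs."""
    if not runs:
        raise ValueError("no runs in the reporting window")
    succeeded = sum(1 for run in runs if run.status == "Succeeded")
    durations = [run.duration_sec for run in runs]
    return {
        "success_rate": succeeded / len(runs),
        "avg_duration_sec": sum(durations) / len(durations),
        "max_duration_sec": max(durations),
    }


runs = [
    RunRecord("Succeeded", 100.0),
    RunRecord("Succeeded", 200.0),
    RunRecord("Succeeded", 300.0),
    RunRecord("Failed", 400.0),
]
summary = summarize_runs(runs)
print(summary)
```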

Key Points

Monitor at three layers — infrastructure health, pipeline execution, and data quality
Send ADF diagnostic logs to a Log Analytics Workspace for long-term retention and KQL-based analysis
Set alert rules for pipeline failures and route them to Action Groups that notify the team via email or Teams
Use the Spark UI to diagnose data skew and shuffle-heavy jobs in Databricks
Write row counts and processing metrics to a Delta log table after every job run
Always add row count and null checks after data loads — a technically successful pipeline can still write bad data
