ADE Monitoring, Logging, and Alerting
A data platform without monitoring is a black box. You do not know when pipelines fail, when performance degrades, or when data quality drops — until someone complains. Monitoring turns the black box into a transparent system where problems surface immediately and the team can act before business impact occurs.
The Three Layers of Monitoring
A complete monitoring strategy covers three layers, each answering a different question.
- Infrastructure monitoring: Is the compute running? Are resources healthy? Is there enough capacity?
- Pipeline monitoring: Did the pipeline run? Did it succeed? How long did it take?
- Data quality monitoring: Is the data itself correct? Are row counts as expected? Are there unexpected nulls or outliers?
Azure Monitor — The Central Monitoring Platform
Azure Monitor is the unified monitoring service that collects metrics and logs from Azure resources, including ADF, Databricks, Synapse, ADLS Gen2, and Azure SQL. Platform metrics are collected automatically for most resources, while resource logs are only collected after you configure a diagnostic setting on the resource.
Metrics
Metrics are numeric values measured at regular intervals. They answer questions like: How many pipeline runs completed in the last hour? What is the current CPU usage on my Databricks cluster? How much storage is consumed in ADLS Gen2? Metrics are lightweight, always-on, and retained for 93 days by default.
Logs
Logs are detailed records of events: exactly what happened, when, and with what result. Examples include pipeline activity logs, query execution logs, and authentication events. Logs go to a Log Analytics Workspace where you query them using KQL (Kusto Query Language).
// KQL: find all failed ADF pipeline runs in the last 24 hours
ADFPipelineRun
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project PipelineName, Start, End, Status, ErrorMessage
| order by Start desc
// KQL: count pipeline runs by status for the last 7 days
ADFPipelineRun
| where TimeGenerated > ago(7d)
| summarize RunCount = count() by Status
| render piechart
Monitoring Azure Data Factory
ADF has a built-in Monitor tab in ADF Studio. It shows every pipeline run, trigger run, and activity run with status, start time, duration, and error details. This is the first place to check when investigating a pipeline failure.
Diagnostic Settings
To retain pipeline run history beyond 45 days and enable advanced KQL queries, configure Diagnostic Settings on the ADF instance to send logs to a Log Analytics Workspace. Categories to enable:
- PipelineRuns
- ActivityRuns
- TriggerRuns
- SandboxPipelineRuns (for pipeline debug runs)
ADF Alert Rules
Create alert rules in Azure Monitor that fire when pipeline failures exceed a threshold. For production pipelines, set an alert for any single failure. For less critical pipelines, alert when failure rate exceeds 20% over an hour.
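The two thresholds described above can be expressed as a small decision function. This is only a sketch of the policy logic, not an Azure Monitor API: `RunWindow`, `should_alert`, and the `max_failure_rate` parameter are hypothetical names introduced for illustration, and the run counts would come from wherever you aggregate pipeline results.

```python
from dataclasses import dataclass


@dataclass
class RunWindow:
    """Pipeline run counts observed over one evaluation window (for example, one hour)."""
    total_runs: int
    failed_runs: int


def should_alert(window: RunWindow, critical: bool, max_failure_rate: float = 0.20) -> bool:
    """Return True when the window breaches the alerting policy.

    critical=True  -> alert on any single failure.
    critical=False -> alert only when the failure rate exceeds max_failure_rate.
    """
    if window.total_runs == 0:
        return False  # nothing ran, so there is no failure rate to evaluate
    if critical:
        return window.failed_runs > 0
    return window.failed_runs / window.total_runs > max_failure_rate


# One failure out of twelve runs: alerts for a production pipeline only.
print(should_alert(RunWindow(total_runs=12, failed_runs=1), critical=True))
print(should_alert(RunWindow(total_runs=12, failed_runs=1), critical=False))
# Three failures out of ten runs (30%) exceeds the assumed 20% threshold.
print(should_alert(RunWindow(total_runs=10, failed_runs=3), critical=False))
```

Keeping the policy in one function makes it easy to review the thresholds with the team before encoding them in actual alert rules.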
Alerts route to Action Groups, which send notifications via email, SMS, Azure Logic App, or webhook. A Logic App can post a formatted failure message to a Microsoft Teams channel, including the pipeline name, error message, and a link to the failed run.
Monitoring Azure Databricks
Cluster Event Logs
Databricks records all cluster events — start, stop, auto-scaling, termination. Access them from the cluster detail page. Use these logs to verify that clusters terminated on schedule and to investigate unexpected terminations.
Spark UI
The Spark UI is the most valuable debugging tool for slow Databricks jobs. It shows every Spark stage, task, and execution plan. Diagnose data skew by looking for tasks that take 10× longer than others. Identify shuffle operations that are causing memory spill. Access the Spark UI from the cluster page while a job is running or from job run history after completion.
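The "10× longer than others" rule of thumb can be checked programmatically once task durations have been copied out of the Spark UI. The sketch below is plain Python and does not call any Spark API. `find_skewed_tasks` and the sample durations are hypothetical, and using the stage median as the baseline for "others" is an assumption.

```python
from statistics import median


def find_skewed_tasks(durations_sec: dict[str, float], ratio: float = 10.0) -> list[str]:
    """Return task ids whose duration is at least `ratio` times the stage median."""
    if not durations_sec:
        return []
    typical = median(durations_sec.values())
    if typical <= 0:
        return []  # a zero median gives no meaningful baseline
    return sorted(task for task, d in durations_sec.items() if d >= ratio * typical)


# Hypothetical task durations (seconds) for one stage.
stage_tasks = {"task-0": 4.1, "task-1": 3.8, "task-2": 4.4, "task-3": 61.0, "task-4": 4.0}
print(find_skewed_tasks(stage_tasks))  # ['task-3']
```

The median is used rather than the mean so that the slow task itself does not inflate the baseline it is compared against.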
Ganglia Metrics
Ganglia provides real-time metrics on cluster CPU, memory, and network usage. Access it from the Metrics tab of a running cluster. On newer Databricks Runtime versions (13.0 and later), Ganglia is replaced by the built-in compute metrics shown on the same tab. High memory usage with frequent garbage collection indicates the cluster needs more memory per node or the code needs optimization.
Structured Logging in Notebooks
Add log statements inside notebooks to record processing metrics — rows read, rows written, rows rejected, processing duration. Write these metrics to a logging table in Delta format for dashboards and alerts.
import time
from datetime import datetime

start_time = time.time()

# Your transformation code here
df_processed = df_raw.filter(...).withColumn(...)

# count() is a separate Spark action, so the transformation runs once for the
# count and again for the write; cache df_processed first if that is expensive
rows_written = df_processed.count()
df_processed.write.format("delta").mode("overwrite").save(output_path)
duration = time.time() - start_time

# Write metrics to a log table
log_entry = [(datetime.now(), "transform_sales", rows_written, duration, "SUCCESS")]
df_log = spark.createDataFrame(log_entry, ["run_time", "job_name", "rows_written", "duration_sec", "status"])
df_log.write.format("delta").mode("append").save("abfss://logs@mystorageaccount.dfs.core.windows.net/job_runs/")
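The example above only records successful runs. If the transformation raises, no log row is written. One possible way to capture failures as well is to wrap the job in a helper that always produces a record. This is a minimal sketch in plain Python: `run_with_metrics` is a hypothetical helper, and it assumes the job function returns the number of rows it wrote.

```python
import time
from datetime import datetime, timezone


def run_with_metrics(job_name, job_fn):
    """Run job_fn() and return (record, error).

    job_fn is assumed to return the number of rows written.
    A record is produced whether the job succeeds or fails.
    """
    started = time.time()
    run_time = datetime.now(timezone.utc)
    rows_written, status, error = 0, "SUCCESS", None
    try:
        rows_written = job_fn()
    except Exception as exc:  # capture the failure so it can still be logged
        status, error = "FAILED", exc
    record = {
        "run_time": run_time,
        "job_name": job_name,
        "rows_written": rows_written,
        "duration_sec": round(time.time() - started, 3),
        "status": status,
    }
    return record, error


def broken_job():
    """Stand-in for a transformation that fails."""
    raise RuntimeError("source file missing")


ok_record, ok_error = run_with_metrics("transform_sales", lambda: 45000)
bad_record, bad_error = run_with_metrics("transform_sales", broken_job)
print(ok_record["status"], ok_record["rows_written"])
print(bad_record["status"], type(bad_error).__name__)
```

In a notebook, the returned record could be appended to the Delta log table in the same way as the example above, and the captured error re-raised afterwards so the job still fails visibly.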
Data Quality Monitoring
Infrastructure being healthy does not mean the data is correct. A pipeline can succeed technically while writing entirely wrong data. Data quality monitoring checks the data itself.
Row Count Checks
After every load, compare the row count in the destination against expectations. If yesterday's sales file had 45,000 rows and today's has 3,000, something is wrong, even if the pipeline ran successfully. A drop of more than 90% in row count should raise a data quality alert, because no pipeline alert will fire for it.
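A row count check of this kind can be sketched as a comparison against the previous load. The function names and the 50% default threshold below are assumptions chosen for illustration. In practice the baseline might be a rolling average rather than a single previous day.

```python
def row_count_drop(previous_rows: int, current_rows: int) -> float:
    """Return the fractional drop versus the previous load (negative means growth)."""
    if previous_rows <= 0:
        raise ValueError("previous_rows must be positive to compute a drop")
    return (previous_rows - current_rows) / previous_rows


def check_row_count(previous_rows: int, current_rows: int, max_drop: float = 0.5) -> None:
    """Raise if the row count fell by more than max_drop (an assumed 50% default)."""
    drop = row_count_drop(previous_rows, current_rows)
    if drop > max_drop:
        raise ValueError(f"Row count dropped {drop:.0%}: {previous_rows} -> {current_rows}")


# The numbers from the text: 45,000 rows yesterday, 3,000 today.
try:
    check_row_count(45_000, 3_000)
except ValueError as err:
    print(err)
```

Raising an exception mirrors the null check pattern used elsewhere in this section, so a failed check stops the notebook and surfaces as a failed run.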
Null Checks
Define rules for columns that must never be null. A fact table's amount column should never be null. Run a check after each load and alert or halt processing if violations are found.
# Data quality check in Databricks
from pyspark.sql.functions import col

# Count rows where the mandatory 'amount' column is null
null_count = df_loaded.filter(col("amount").isNull()).count()
if null_count > 0:
    raise ValueError(f"Data quality failure: {null_count} null values found in 'amount' column")
Great Expectations — A Data Quality Framework
Great Expectations is a popular open-source Python library for data quality validation. You define expectations (rules about what the data should look like) and run them against your DataFrames. Results are logged, and you can configure the pipeline to halt on failures before bad data propagates downstream.
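To make the idea of an "expectation" concrete, the sketch below implements the pattern in plain Python. It is deliberately not the Great Expectations API: `expect_not_null`, `expect_between`, and `validate` are hypothetical helpers, and the rows are invented sample data. The point is only that each rule is declared once, given a readable name, and evaluated against the data as a suite.

```python
from typing import Callable

Row = dict
Expectation = tuple[str, Callable[[list[Row]], bool]]


def expect_not_null(column: str) -> Expectation:
    """Expectation: the column is present and non-null in every row."""
    return (f"{column} is never null", lambda rows: all(r.get(column) is not None for r in rows))


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Expectation: every non-null value of the column lies in [low, high]."""
    return (
        f"{column} is between {low} and {high}",
        lambda rows: all(low <= r[column] <= high for r in rows if r.get(column) is not None),
    )


def validate(rows: list[Row], expectations: list[Expectation]) -> list[str]:
    """Return the descriptions of every expectation that failed."""
    return [name for name, check in expectations if not check(rows)]


rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 80.5},
]
suite = [expect_not_null("order_id"), expect_not_null("amount"), expect_between("amount", 0, 10_000)]
failures = validate(rows, suite)
print(failures)  # ['amount is never null']
```

A pipeline step could raise when `failures` is non-empty, which corresponds to halting before bad data moves downstream.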
Building a Monitoring Dashboard
Collect all monitoring metrics in one place, a monitoring dashboard, so the team can see the health of the entire data platform at a glance. Use Azure Workbooks inside Azure Monitor for a no-code approach, or connect Power BI to your Log Analytics Workspace and job metrics tables for a more custom dashboard.
A good data platform monitoring dashboard shows:
- Last 24 hours pipeline success rate
- Average and maximum pipeline run duration over the past 7 days
- Row counts by pipeline — trend over 30 days
- Current cluster status and cost
- Open alerts and incidents
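The first two dashboard items can be derived directly from the job run records. The following sketch assumes records shaped like the rows of the Delta log table described earlier (`status` and `duration_sec` fields). `summarize_runs` and the sample data are hypothetical, and a real dashboard would first filter the records to the relevant time window.

```python
from statistics import mean


def summarize_runs(runs: list[dict]) -> dict:
    """Compute success rate and duration statistics from run records."""
    if not runs:
        return {"success_rate": None, "avg_duration_sec": None, "max_duration_sec": None}
    durations = [r["duration_sec"] for r in runs]
    succeeded = sum(1 for r in runs if r["status"] == "SUCCESS")
    return {
        "success_rate": succeeded / len(runs),
        "avg_duration_sec": mean(durations),
        "max_duration_sec": max(durations),
    }


# Hypothetical run records for the window being reported.
runs = [
    {"job_name": "transform_sales", "status": "SUCCESS", "duration_sec": 120.0},
    {"job_name": "transform_sales", "status": "FAILED", "duration_sec": 30.0},
    {"job_name": "load_customers", "status": "SUCCESS", "duration_sec": 90.0},
    {"job_name": "load_customers", "status": "SUCCESS", "duration_sec": 60.0},
]
summary = summarize_runs(runs)
print(summary)
```

For the sample data this yields a 75% success rate, an average duration of 75 seconds, and a maximum of 120 seconds.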
Key Points
- Monitor at three layers — infrastructure health, pipeline execution, and data quality
- Send ADF diagnostic logs to a Log Analytics Workspace for long-term retention and KQL-based analysis
- Set alert rules for pipeline failures and route them to Action Groups that notify the team via email or Teams
- Use the Spark UI to diagnose data skew and shuffle-heavy jobs in Databricks
- Write row counts and processing metrics to a Delta log table after every job run
- Always add row count and null checks after data loads — a technically successful pipeline can still write bad data
