ADE Azure Synapse Analytics

Azure Synapse Analytics is an all-in-one analytics platform. It combines data warehousing, big data processing, data integration, and reporting in a single service. Instead of managing four separate tools, you work inside one unified environment called the Synapse Studio.

What Makes Synapse Different

Before Synapse, a typical data platform required separate Azure services stitched together: ADF for pipelines, ADLS Gen2 for storage, Databricks for processing, and Azure SQL Data Warehouse for reporting. Each service had its own interface, billing, and management overhead.

Synapse brings most of this under one roof. It is not a replacement for every service — Databricks remains the stronger choice for complex machine learning and advanced Spark development — but Synapse covers a wide range of data engineering and analytics scenarios in a single workspace.

Core Components of Azure Synapse

Dedicated SQL Pool — The Data Warehouse

The Dedicated SQL Pool is a massively parallel processing (MPP) data warehouse. It is designed for querying large amounts of structured data quickly. Analysts run complex SQL queries against billions of rows and get results in seconds rather than hours.

A traditional database server works like a single cashier at a checkout counter — all queries line up and wait their turn. A Dedicated SQL Pool works like 60 cashiers working in parallel — every query gets distributed across multiple compute nodes and runs simultaneously.

Key concept — Distribution: When you load data into a Dedicated SQL Pool, rows are distributed across 60 distributions (sub-databases). Queries run in parallel across all 60. Choosing the right distribution key is critical for performance. Distribute on a column that appears frequently in JOIN conditions and has high cardinality (many distinct values), such as customer_id or order_id.

Table Types in Dedicated SQL Pool
  • Round Robin: Rows distributed randomly. Simple to set up. Good for staging tables where data is loaded once and then moved.
  • Hash Distributed: Rows with the same hash value go to the same distribution. Best for large fact tables queried with JOINs.
  • Replicated: Full copy of the table exists on every compute node. Best for small dimension tables (under 2 GB) that join with large tables frequently.

Serverless SQL Pool — Query Without a Warehouse

The Serverless SQL Pool lets you run SQL queries directly against files in ADLS Gen2 without loading data into a warehouse first. You pay per terabyte of data scanned — there is no infrastructure to manage or provision.

This is extremely useful for ad-hoc exploration. An analyst wants to query a CSV file that just landed in the data lake without waiting for an ETL pipeline to load it into a database. Serverless SQL Pool makes this instant.

-- Query a CSV file in ADLS Gen2 directly with SQL
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/bronze/sales/2024/01/*.csv',
    FORMAT = 'CSV',
    HEADER_ROW = TRUE
) AS [result]
-- Query Parquet files
SELECT region, SUM(amount) AS total_sales
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/silver/sales/**',
    FORMAT = 'PARQUET'
) AS [result]
GROUP BY region

Apache Spark Pool

Synapse includes a Spark pool for large-scale data processing using Python, Scala, or SQL. This is similar to Azure Databricks but integrated directly into Synapse Studio. For organizations that want everything in one place, the Synapse Spark Pool is a convenient option. For teams that rely heavily on Databricks-specific features like Unity Catalog and MLflow, Databricks remains the better choice.

Synapse Pipelines

Synapse includes a built-in pipeline tool that is identical in interface and capability to Azure Data Factory. The same Copy Activity, Data Flows, Triggers, and Linked Services work inside Synapse Pipelines. This eliminates the need to switch between ADF and Synapse for data movement tasks.

Synapse Link

Synapse Link creates a live, no-ETL connection between operational databases (Cosmos DB, SQL Server, Dataverse) and Synapse Analytics. Data in those systems becomes instantly queryable in Synapse without building a pipeline. Changes in the operational database replicate automatically to Synapse for analytics — with zero impact on the operational system's performance.

The Synapse Workspace and Linked Storage

When you create a Synapse workspace, you link it to an ADLS Gen2 account. This storage account becomes the primary data lake for the workspace. Spark jobs read from and write to it. Pipelines move data through it. SQL queries run against files in it. Everything works together within this unified environment.

External Tables and Views in Serverless SQL Pool

Instead of writing OPENROWSET queries every time, you can create External Tables and Views that abstract the file paths. Analysts query them like regular database tables without knowing where the files physically live.

-- Create an external table pointing to Parquet files in the data lake
CREATE EXTERNAL TABLE sales_silver (
    order_id INT,
    customer_id INT,
    order_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(50)
)
WITH (
    LOCATION = 'silver/sales/**',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
)

PolyBase and COPY INTO — Loading Data into Dedicated SQL Pool

To load data from ADLS Gen2 into the Dedicated SQL Pool, you use one of two methods:

  • COPY INTO: The modern, recommended approach. Simple SQL statement. Handles CSV, Parquet, ORC. Supports loading multiple files at once using wildcard paths.
  • PolyBase: Older approach using external tables. Still widely used in existing solutions.
-- Load Parquet files into a Dedicated SQL Pool table using COPY INTO
COPY INTO dbo.sales_fact
FROM 'https://mystorageaccount.dfs.core.windows.net/gold/sales/'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)

When to Use Synapse vs Databricks

ScenarioBetter Choice
Structured data warehousing and SQL analyticsAzure Synapse — Dedicated SQL Pool
Ad-hoc SQL queries on data lake filesAzure Synapse — Serverless SQL Pool
Complex Python transformations and MLAzure Databricks
Everything in one workspace with minimal tool switchingAzure Synapse
Advanced Delta Lake, Unity Catalog, MLflowAzure Databricks

Key Points

  • Synapse Analytics is a unified platform — data warehousing, Spark, pipelines, and SQL in one workspace
  • Dedicated SQL Pool is an MPP warehouse built for querying billions of structured rows
  • Serverless SQL Pool queries files in ADLS Gen2 directly with SQL — no loading required
  • Choose Hash Distribution for large fact tables and Replicated for small dimension tables
  • Synapse Pipelines use the same interface as ADF — no new skills needed for pipeline building
  • Use Synapse for SQL-heavy workloads; use Databricks for Python-heavy or ML workloads

Leave a Comment