Azure Synapse Analytics

Modern analytics requires querying terabytes of data, running machine learning models, building real-time dashboards, and orchestrating complex data pipelines — all from the same platform. Azure Synapse Analytics is an enterprise analytics service that brings together SQL data warehousing, big data processing with Apache Spark, data integration pipelines, and Power BI visualization into a single unified workspace.

What is Azure Synapse Analytics?

Azure Synapse Analytics is the evolution of Azure SQL Data Warehouse combined with big data capabilities. It eliminates the traditional silos between data warehousing, data lake processing, and business intelligence tools by combining all of them in one workspace with shared security, governance, and management.

Azure Synapse Workspace

  Azure Synapse Workspace
  │
  ├── Synapse Studio (Unified web IDE)
  │   ├── Data Hub         → Browse data sources and linked services
  │   ├── Develop Hub      → Write SQL scripts, notebooks, data flows
  │   ├── Integrate Hub    → Build data pipelines (ADF-compatible)
  │   ├── Monitor Hub      → View pipeline, job, and query runs
  │   └── Manage Hub       → Configure pools, linked services, access
  │
  ├── Dedicated SQL Pool   → Enterprise data warehouse (formerly SQL DW)
  ├── Serverless SQL Pool  → Query data in Data Lake on-demand (no infra)
  ├── Apache Spark Pool    → Big data processing with Python/Scala/R
  ├── Data Explorer Pool   → Real-time log and telemetry analytics (KQL)
  └── Synapse Link         → Zero-ETL analytics over Cosmos DB / SQL / Dataverse

Dedicated SQL Pool (SQL Data Warehouse)

The Dedicated SQL Pool is a traditional enterprise data warehouse built for massive analytical queries. It uses a Massively Parallel Processing (MPP) architecture: data is divided into 60 distributions spread across the pool's compute nodes, which process each query in parallel, so queries over trillions of rows can complete in seconds.

MPP Architecture

  Client Query → Control Node
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     Dist 1      Dist 2 ... Dist 60
    (Computes   (Computes   (Computes
    its shard)  its shard)  its shard)
          │           │           │
          └───────────┴───────────┘
                      │
              Results combined and
              returned to client
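The distributed plan for a query can be inspected with the EXPLAIN statement in a Dedicated SQL Pool before running it. A sketch, assuming a hypothetical FactSales table:

  -- Returns the distributed execution plan as XML instead of running the query.
  -- The plan shows per-distribution steps such as SHUFFLE_MOVE or BROADCAST_MOVE.
  EXPLAIN
  SELECT Region, SUM(SaleAmount) AS TotalSales
  FROM FactSales          -- hypothetical fact table
  GROUP BY Region;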

Data Warehouse Units (DWU)

The Dedicated SQL Pool is sized in Data Warehouse Units (DWU) — a blended measure of CPU, memory, and I/O. More DWUs mean faster queries and the ability to handle more concurrent users. DWUs can be scaled up before a heavy reporting period and scaled down (or the pool paused completely) to save cost during off-hours.
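Scaling is a single T-SQL statement run against the pool (a sketch; MySynapseDW is a hypothetical pool name). Pausing is done from the Azure portal, CLI, or REST API rather than from T-SQL:

  -- Scale the pool up before a heavy reporting window
  ALTER DATABASE MySynapseDW
  MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

  -- ...and back down afterwards to cut compute cost
  ALTER DATABASE MySynapseDW
  MODIFY (SERVICE_OBJECTIVE = 'DW100c');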

  Feature            Description
  Scale              Scale from DW100c (development) to DW30000c (large enterprise) with no downtime
  Pause              Pause the pool to stop compute billing when not in use; storage costs continue
  Distribution       Tables are spread across 60 distributions using hash, round-robin, or replicated distribution
  Columnar Storage   Data is stored in clustered columnstore format, which massively reduces I/O for analytical queries
  Result Set Cache   Results of frequently run queries are cached, so subsequent identical queries return almost instantly
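The distribution and columnstore behaviors above are chosen at table creation time. A sketch, assuming a hypothetical FactSales table:

  -- Hash-distribute the fact table on a high-cardinality key and
  -- store it as a clustered columnstore index (the default for new tables)
  CREATE TABLE FactSales
  (
      SaleID      BIGINT        NOT NULL,
      CustomerID  INT           NOT NULL,
      SaleAmount  DECIMAL(18,2) NOT NULL
  )
  WITH
  (
      DISTRIBUTION = HASH(CustomerID),
      CLUSTERED COLUMNSTORE INDEX
  );

Small dimension tables are often created with DISTRIBUTION = REPLICATE instead, so a full copy lives on every compute node and joins avoid data movement.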

Serverless SQL Pool

The Serverless SQL Pool allows querying data stored in Azure Data Lake Storage using standard T-SQL — without creating or managing any infrastructure. There are no resources to provision and no databases to populate — simply write a SELECT statement pointing to files in the data lake.

Querying Data Lake Files with Serverless SQL

  -- Query CSV files in Data Lake directly
  SELECT
    CustomerID,
    ProductName,
    Quantity,
    SaleAmount
  FROM
    OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/sales/2024/01/*.csv',
      FORMAT = 'CSV',
      PARSER_VERSION = '2.0',  -- required when using HEADER_ROW
      HEADER_ROW = TRUE
    ) AS SalesData
  WHERE SaleAmount > 10000
  ORDER BY SaleAmount DESC;

  -- Query Parquet files (much faster due to columnar format)
  SELECT Region, SUM(Revenue) AS TotalRevenue
  FROM
    OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/finance/revenue/**',
      FORMAT = 'PARQUET'
    ) AS RevenueData
  GROUP BY Region;

Billing is per terabyte of data scanned. There is no charge when the pool is idle.
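Because serverless queries are plain T-SQL, a common pattern is to wrap OPENROWSET in a view so BI tools can query the lake as if it were a table. A sketch reusing the hypothetical path above (note that views must be created in a user database on the serverless pool, not in master):

  CREATE VIEW dbo.RevenueData AS
  SELECT *
  FROM OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/finance/revenue/**',
      FORMAT = 'PARQUET'
  ) AS RevenueData;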

Apache Spark Pool

The Spark Pool runs Apache Spark workloads for big data processing, machine learning, and complex transformations that SQL cannot handle. Spark pools in Synapse auto-scale and auto-pause — nodes start when a job is submitted and shut down after the job completes.

  • Supports Python (PySpark), Scala, R, .NET, and SQL (Spark SQL)
  • Notebooks provide an interactive development experience
  • Integration with Azure ML for model training on large datasets
  • Shared metadata between Spark and SQL pools — Spark tables backed by Parquet, Delta, or CSV are automatically synchronized and become visible in the Serverless SQL Pool
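The shared-metadata behavior can be sketched in two steps: create a Parquet-backed table from a Spark notebook, then query it from the Serverless SQL Pool (database, table, and source names here are hypothetical):

  -- In a Synapse Spark notebook (%%sql cell): create a Parquet-backed table
  CREATE TABLE salesdb.daily_sales
  USING PARQUET
  AS SELECT * FROM raw_sales;

  -- In the Serverless SQL Pool: the synchronized table appears under the
  -- salesdb database in the dbo schema, with no extra configuration
  SELECT TOP 10 * FROM salesdb.dbo.daily_sales;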

Synapse Link – Zero-ETL Analytics

Synapse Link creates a continuous, near-real-time link between operational databases and Azure Synapse Analytics — with zero ETL pipelines required. Changes in the operational database are automatically replicated to the analytical store, where Synapse queries them without any impact on the operational database's performance.

Supported Synapse Link Sources

  • Azure Cosmos DB: Operational NoSQL data is automatically replicated to a Cosmos DB Analytical Store in columnar format, queryable by Synapse Spark and Serverless SQL.
  • Azure SQL Database / SQL Managed Instance: Transactional tables replicated to Synapse in near-real time.
  • Dataverse (Microsoft Power Platform): Business data from Dynamics 365 and Power Apps becomes queryable in Synapse without export jobs.
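For the Cosmos DB case, the analytical store can be queried from the Serverless SQL Pool with the CosmosDB flavor of OPENROWSET. A sketch — the account, database, and container names are hypothetical, and the account key is elided:

  SELECT TOP 10 *
  FROM OPENROWSET(
      'CosmosDB',
      'Account=myaccount;Database=StoreDB;Key=<account-key>',
      Orders                 -- container with Analytical Store enabled
  ) AS Orders;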

Lake Database

A Lake Database in Synapse stores data in Azure Data Lake Storage but presents it with a relational schema — tables, columns, relationships, and data types — visible to both Spark and Serverless SQL pools. This bridges the gap between the schema-free nature of data lakes and the structured querying experience of a relational database.
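Lake database tables are typically designed in Synapse Studio's database designer, but they can also be created from Spark SQL. A sketch with hypothetical names and storage path:

  -- Spark SQL: create a lake database and a table whose files live in the data lake
  CREATE DATABASE retail_lakedb;

  CREATE TABLE retail_lakedb.customers
  (
      CustomerID INT,
      Name       STRING,
      Region     STRING
  )
  USING PARQUET
  LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/lake/customers';

The table's schema is then visible to both Spark and Serverless SQL pools, while the data itself remains ordinary Parquet files in the lake.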

Azure Synapse Analytics vs Azure Databricks

  • Primary Focus
      Synapse:    unified analytics platform (SQL DW + Spark + pipelines in one workspace)
      Databricks: collaborative data engineering, advanced ML/AI, Delta Lake
  • SQL Warehousing
      Synapse:    native (Dedicated SQL Pool, Serverless SQL Pool)
      Databricks: Databricks SQL (a newer addition)
  • Spark
      Synapse:    managed Spark Pools (auto-scale, auto-pause)
      Databricks: highly optimized and collaborative, with the Photon engine
  • Delta Lake
      Synapse:    supported but limited
      Databricks: native, first-class support; Delta Lake was created by Databricks
  • Best For
      Synapse:    enterprise analytics teams needing SQL DW + Spark in one place
      Databricks: data engineering teams doing complex transformations and ML

Common Synapse Analytics Architecture

  Raw Data Sources
  (SQL Server, Salesforce, Files, IoT)
          │
          ▼
  Azure Data Lake Storage Gen2
  (Raw zone → Cleansed zone → Curated zone)
          │
          ├──────────────────────────────────┐
          ▼                                  ▼
  Synapse Pipelines                Apache Spark Pool
  (Orchestrate data movement)      (Complex transformations,
                                    ML feature engineering)
          │                                  │
          └──────────────┬───────────────────┘
                         ▼
              Dedicated SQL Pool
              (Data Warehouse — star schema)
                         │
                         ▼
              Power BI / Reporting Tools
              (Business Intelligence Dashboards)

Key Takeaways

  • Azure Synapse Analytics unifies data warehousing, Spark big data processing, pipeline orchestration, and BI into one workspace.
  • Dedicated SQL Pool uses an MPP architecture with 60 data distributions, letting queries over trillions of rows complete in seconds.
  • Serverless SQL Pool queries data lake files directly using T-SQL with no infrastructure to manage — billing per TB scanned.
  • Spark Pools auto-scale and auto-pause, supporting Python, Scala, R, and SQL for big data and ML workloads.
  • Synapse Link provides near-real-time, zero-ETL analytics over Cosmos DB, Azure SQL, and Dataverse operational data.
