Azure Synapse Analytics

Modern analytics requires querying terabytes of data, running machine learning models, building real-time dashboards, and orchestrating complex data pipelines — all from the same platform. Azure Synapse Analytics is an enterprise analytics service that brings together SQL data warehousing, big data processing with Apache Spark, data integration pipelines, and Power BI visualization into a single unified workspace.

What is Azure Synapse Analytics?

Azure Synapse Analytics is the evolution of Azure SQL Data Warehouse combined with big data capabilities. It eliminates the traditional silos between data warehousing, data lake processing, and business intelligence tools by combining all of them in one workspace with shared security, governance, and management.

Azure Synapse Workspace

  Azure Synapse Workspace
  │
  ├── Synapse Studio (Unified web IDE)
  │   ├── Data Hub         → Browse data sources and linked services
  │   ├── Develop Hub      → Write SQL scripts, notebooks, data flows
  │   ├── Integrate Hub    → Build data pipelines (ADF-compatible)
  │   ├── Monitor Hub      → View pipeline, job, and query runs
  │   └── Manage Hub       → Configure pools, linked services, access
  │
  ├── Dedicated SQL Pool   → Enterprise data warehouse (formerly SQL DW)
  ├── Serverless SQL Pool  → Query data in Data Lake on-demand (no infra)
  ├── Apache Spark Pool    → Big data processing with Python/Scala/R
  ├── Data Explorer Pool   → Real-time log and telemetry analytics (KQL)
  └── Synapse Link         → Zero-ETL analytics over Cosmos DB / SQL / Dataverse

Dedicated SQL Pool (SQL Data Warehouse)

The Dedicated SQL Pool is a traditional enterprise data warehouse built for massive analytical queries. It uses a Massively Parallel Processing (MPP) architecture: data is divided into 60 distributions spread across the pool's compute nodes, which process each query in parallel, so queries over trillions of rows can complete in seconds.

MPP Architecture

  Client Query → Control Node
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     Dist 1      Dist 2 ... Dist 60
    (Computes   (Computes   (Computes
    its shard)  its shard)  its shard)
          │           │           │
          └───────────┴───────────┘
                      │
              Results combined and
              returned to client
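The distributed plan for a query can be inspected with the EXPLAIN statement in a Dedicated SQL Pool before running it. A sketch, assuming a hypothetical FactSales table:

  -- Returns the distributed execution plan as XML instead of running the query.
  -- The plan shows per-distribution steps such as SHUFFLE_MOVE or BROADCAST_MOVE.
  EXPLAIN
  SELECT Region, SUM(SaleAmount) AS TotalSales
  FROM FactSales          -- hypothetical fact table
  GROUP BY Region;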

Data Warehouse Units (DWU)

The Dedicated SQL Pool is sized in Data Warehouse Units (DWU) — a blended measure of CPU, memory, and I/O. More DWUs mean faster queries and the ability to handle more concurrent users. DWUs can be scaled up before a heavy reporting period and scaled down (or the pool paused completely) to save cost during off-hours.
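Scaling is a single T-SQL statement run against the pool (a sketch; MySynapseDW is a hypothetical pool name). Pausing is done from the Azure portal, CLI, or REST API rather than from T-SQL:

  -- Scale the pool up before a heavy reporting window
  ALTER DATABASE MySynapseDW
  MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

  -- ...and back down afterwards to cut compute cost
  ALTER DATABASE MySynapseDW
  MODIFY (SERVICE_OBJECTIVE = 'DW100c');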

  Feature            Description
  Scale              Scale from DW100c (development) to DW30000c (large enterprise) with no downtime
  Pause              Pause the pool to stop compute billing when not in use; storage costs continue
  Distribution       Tables are spread across 60 distributions using hash, round-robin, or replicated distribution
  Columnar Storage   Data is stored in clustered columnstore format, which massively reduces I/O for analytical queries
  Result Set Cache   Results of frequently run queries are cached, so subsequent identical queries return almost instantly
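The distribution and columnstore behaviors above are chosen at table creation time. A sketch, assuming a hypothetical FactSales table:

  -- Hash-distribute the fact table on a high-cardinality key and
  -- store it as a clustered columnstore index (the default for new tables)
  CREATE TABLE FactSales
  (
      SaleID      BIGINT        NOT NULL,
      CustomerID  INT           NOT NULL,
      SaleAmount  DECIMAL(18,2) NOT NULL
  )
  WITH
  (
      DISTRIBUTION = HASH(CustomerID),
      CLUSTERED COLUMNSTORE INDEX
  );

Small dimension tables are often created with DISTRIBUTION = REPLICATE instead, so a full copy lives on every compute node and joins avoid data movement.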

Serverless SQL Pool

The Serverless SQL Pool allows querying data stored in Azure Data Lake Storage using standard T-SQL — without creating or managing any infrastructure. There are no resources to provision and no databases to populate — simply write a SELECT statement pointing to files in the data lake.

Querying Data Lake Files with Serverless SQL

  -- Query CSV files in Data Lake directly
  SELECT
    CustomerID,
    ProductName,
    Quantity,
    SaleAmount
  FROM
    OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/sales/2024/01/*.csv',
      FORMAT = 'CSV',
      PARSER_VERSION = '2.0',  -- required when using HEADER_ROW
      HEADER_ROW = TRUE
    ) AS SalesData
  WHERE SaleAmount > 10000
  ORDER BY SaleAmount DESC;

  -- Query Parquet files (much faster due to columnar format)
  SELECT Region, SUM(Revenue) AS TotalRevenue
  FROM
    OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/finance/revenue/**',
      FORMAT = 'PARQUET'
    ) AS RevenueData
  GROUP BY Region;

Billing is per terabyte of data scanned. There is no charge when the pool is idle.
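Because serverless queries are plain T-SQL, a common pattern is to wrap OPENROWSET in a view so BI tools can query the lake as if it were a table. A sketch reusing the hypothetical path above (note that views must be created in a user database on the serverless pool, not in master):

  CREATE VIEW dbo.RevenueData AS
  SELECT *
  FROM OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/finance/revenue/**',
      FORMAT = 'PARQUET'
  ) AS RevenueData;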

Apache Spark Pool

The Spark Pool runs Apache Spark workloads for big data processing, machine learning, and complex transformations that SQL cannot handle. Spark pools in Synapse auto-scale and auto-pause — nodes start when a job is submitted and shut down after the job completes.

  • Supports Python (PySpark), Scala, R, .NET, and SQL (Spark SQL)
  • Notebooks provide an interactive development experience
  • Integration with Azure ML for model training on large datasets
  • Shared metadata between Spark and SQL pools — Spark tables backed by Parquet, Delta, or CSV are automatically synchronized and become visible in the Serverless SQL Pool
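The shared-metadata behavior can be sketched in two steps: create a Parquet-backed table from a Spark notebook, then query it from the Serverless SQL Pool (database, table, and source names here are hypothetical):

  -- In a Synapse Spark notebook (%%sql cell): create a Parquet-backed table
  CREATE TABLE salesdb.daily_sales
  USING PARQUET
  AS SELECT * FROM raw_sales;

  -- In the Serverless SQL Pool: the synchronized table appears under the
  -- salesdb database in the dbo schema, with no extra configuration
  SELECT TOP 10 * FROM salesdb.dbo.daily_sales;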

Synapse Link – Zero-ETL Analytics

Synapse Link creates a continuous, near-real-time link between operational databases and Azure Synapse Analytics — with zero ETL pipelines required. Changes in the operational database are automatically replicated to the analytical store, where Synapse queries them without any impact on the operational database's performance.

Supported Synapse Link Sources

  • Azure Cosmos DB: Operational NoSQL data is automatically replicated to a Cosmos DB Analytical Store in columnar format, queryable by Synapse Spark and Serverless SQL.
  • Azure SQL Database / SQL Managed Instance: Transactional tables replicated to Synapse in near-real time.
  • Dataverse (Microsoft Power Platform): Business data from Dynamics 365 and Power Apps becomes queryable in Synapse without export jobs.
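For the Cosmos DB case, the analytical store can be queried from the Serverless SQL Pool with the CosmosDB flavor of OPENROWSET. A sketch — the account, database, and container names are hypothetical, and the account key is elided:

  SELECT TOP 10 *
  FROM OPENROWSET(
      'CosmosDB',
      'Account=myaccount;Database=StoreDB;Key=<account-key>',
      Orders                 -- container with Analytical Store enabled
  ) AS Orders;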

Lake Database

A Lake Database in Synapse stores data in Azure Data Lake Storage but presents it with a relational schema — tables, columns, relationships, and data types — visible to both Spark and Serverless SQL pools. This bridges the gap between the schema-free nature of data lakes and the structured querying experience of a relational database.
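Lake database tables are typically designed in Synapse Studio's database designer, but they can also be created from Spark SQL. A sketch with hypothetical names and storage path:

  -- Spark SQL: create a lake database and a table whose files live in the data lake
  CREATE DATABASE retail_lakedb;

  CREATE TABLE retail_lakedb.customers
  (
      CustomerID INT,
      Name       STRING,
      Region     STRING
  )
  USING PARQUET
  LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/lake/customers';

The table's schema is then visible to both Spark and Serverless SQL pools, while the data itself remains ordinary Parquet files in the lake.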

Azure Synapse Analytics vs Azure Databricks

  • Primary Focus
      Synapse:    unified analytics platform (SQL DW + Spark + pipelines in one workspace)
      Databricks: collaborative data engineering, advanced ML/AI, Delta Lake
  • SQL Warehousing
      Synapse:    native (Dedicated SQL Pool, Serverless SQL Pool)
      Databricks: Databricks SQL (a newer addition)
  • Spark
      Synapse:    managed Spark Pools (auto-scale, auto-pause)
      Databricks: highly optimized and collaborative, with the Photon engine
  • Delta Lake
      Synapse:    supported but limited
      Databricks: native, first-class support; Delta Lake was created by Databricks
  • Best For
      Synapse:    enterprise analytics teams needing SQL DW + Spark in one place
      Databricks: data engineering teams doing complex transformations and ML

Common Synapse Analytics Architecture

  Raw Data Sources
  (SQL Server, Salesforce, Files, IoT)
          │
          ▼
  Azure Data Lake Storage Gen2
  (Raw zone → Cleansed zone → Curated zone)
          │
          ├──────────────────────────────────┐
          ▼                                  ▼
  Synapse Pipelines                Apache Spark Pool
  (Orchestrate data movement)      (Complex transformations,
                                    ML feature engineering)
          │                                  │
          └──────────────┬───────────────────┘
                         ▼
              Dedicated SQL Pool
              (Data Warehouse — star schema)
                         │
                         ▼
              Power BI / Reporting Tools
              (Business Intelligence Dashboards)

Key Takeaways

  • Azure Synapse Analytics unifies data warehousing, Spark big data processing, pipeline orchestration, and BI into one workspace.
  • Dedicated SQL Pool uses an MPP architecture with 60 data distributions, letting queries over trillions of rows complete in seconds.
  • Serverless SQL Pool queries data lake files directly using T-SQL with no infrastructure to manage — billing per TB scanned.
  • Spark Pools auto-scale and auto-pause, supporting Python, Scala, R, and SQL for big data and ML workloads.
  • Synapse Link provides near-real-time, zero-ETL analytics over Cosmos DB, Azure SQL, and Dataverse operational data.
