ADE Azure SQL Database and Data Modeling

Not all data belongs in a data lake. Structured, relational data that needs fast querying, strong consistency, and transactional integrity belongs in a relational database. Azure SQL Database is Microsoft's fully managed relational database service in the cloud — and understanding how to model data inside it is a foundational skill for every Azure data engineer.

What is Azure SQL Database

Azure SQL Database is a PaaS (Platform as a Service) relational database built on Microsoft SQL Server. Microsoft handles patching, backups, high availability, and scaling. You focus on your schema, data, and queries.

You write standard SQL to interact with it — the same T-SQL syntax used in on-premises SQL Server. This makes migration from on-premises environments straightforward.

Azure SQL Deployment Options

Azure offers three related SQL services. Choosing the right one depends on your workload.

Service	Best For	Key Characteristic
Azure SQL Database	Single application databases	Fully managed, scales automatically, serverless option available
Azure SQL Managed Instance	Lifting on-premises SQL Server to cloud	Near 100% SQL Server compatibility including SQL Agent, linked servers
SQL Server on Azure VM	Full control over OS and SQL Server	IaaS — you manage the OS, patches, and high availability yourself

For most new data engineering projects, Azure SQL Database is the right choice. Use SQL Managed Instance when you are migrating an existing on-premises SQL Server that uses features not available in Azure SQL Database.

Pricing Tiers — Choosing the Right Compute

Azure SQL Database offers several pricing tiers:

DTU-based (Basic, Standard, Premium): A bundled measure of CPU, memory, and I/O. Simple to understand. Good for predictable workloads.
vCore-based (General Purpose, Business Critical, Hyperscale): You choose the number of CPU cores and memory independently. More flexibility and better cost optimization for demanding workloads.
Serverless: Compute automatically pauses when the database is idle and resumes when queries arrive. You pay only when the database is active. Ideal for development databases and workloads with unpredictable usage patterns.

Relational Data Modeling Fundamentals

Data modeling defines how data is organized in tables and how those tables relate to each other. Good data modeling makes queries fast, data consistent, and storage efficient. Poor modeling leads to slow queries, duplicate data, and update anomalies.

Primary Keys and Foreign Keys

A Primary Key uniquely identifies each row in a table. No two rows can share the same primary key value. A customer table has a customer_id primary key — every customer gets a unique ID.

A Foreign Key in one table references the primary key of another table. An orders table has a customer_id column that references the customer_id in the customers table. This relationship ensures every order belongs to a real customer.

Normalization — Eliminating Redundancy

Normalization is the process of organizing a database to reduce duplicate data. Instead of storing a customer's name and city in every order row, you store them once in a customers table and reference them with a customer_id.

Think of a hotel. Instead of writing the room type description — "King bed, sea view, 45 sqm, minibar, bath" — on every booking record, you assign a room type code and store the description once in a room types table. Every booking just references the code.

The three most common normal forms:

1NF: Each column contains atomic (single, indivisible) values. No repeating groups. Each row is unique.
2NF: All non-key columns depend on the entire primary key — not just part of it. Applies when the primary key is composite (made of multiple columns).
3NF: No column depends on another non-key column. Remove transitive dependencies. Example: if you store zip_code and city in the same table, city depends on zip_code — not directly on the primary key. Move city to a separate zip code table.

Star Schema — The Data Warehouse Model

Normalized schemas are great for transactional databases. For analytics databases and data warehouses, a different model works better — the Star Schema.

A Star Schema has two types of tables:

Fact Table: Stores measurable events — sales transactions, website clicks, sensor readings. Contains numeric metrics and foreign keys that connect to dimension tables.
Dimension Tables: Describe the "who, what, where, when" of the facts — customers, products, dates, locations. Contain descriptive attributes.

The fact table sits in the center. Dimension tables surround it like points of a star — hence the name.

A Sales Star Schema Example

Fact table: fact_sales

sale_id (PK)
customer_key (FK to dim_customer)
product_key (FK to dim_product)
date_key (FK to dim_date)
store_key (FK to dim_store)
quantity, unit_price, discount, total_amount

Dimension tables:

dim_customer: customer_key, customer_name, city, country, segment
dim_product: product_key, product_name, category, brand, unit_cost
dim_date: date_key, full_date, year, quarter, month, weekday, is_holiday
dim_store: store_key, store_name, region, country, manager

An analyst asking "What was the total revenue by product category in Q3 2024 for European stores?" joins the four tables and gets the answer in one clean SQL query. The Star Schema makes this type of multi-dimensional analysis fast and intuitive.

Slowly Changing Dimensions (SCD)

Dimension data changes over time. A customer moves to a new city. A product gets recategorized. How you handle these changes affects the accuracy of historical analysis. These scenarios are called Slowly Changing Dimensions.

SCD Type 1 — Overwrite

Simply update the old value with the new value. No history is kept. Use when historical accuracy of that attribute does not matter. Example: correcting a misspelled customer name.

SCD Type 2 — Add a New Row

Keep the old row and insert a new row with the updated value. Add start_date, end_date, and is_current columns to track which version is active. Use when historical accuracy matters. Example: tracking a customer's city over time so that sales reports show where the customer lived when each purchase was made.

-- SCD Type 2 structure
customer_key | customer_id | customer_name | city      | start_date | end_date   | is_current
1            | 1001        | Alice Brown   | Mumbai    | 2022-01-01 | 2023-05-31 | N
2            | 1001        | Alice Brown   | Bangalore | 2023-06-01 | 9999-12-31 | Y

SCD Type 3 — Add a New Column

Add a column to store the previous value alongside the current value. Limited to tracking one change back. Example: add a previous_city column next to current_city.

Indexes — Making Queries Faster

An index in a database is like the index at the back of a textbook. Instead of reading every page to find a topic, you look up the topic in the index and jump directly to the right page.

Clustered Index: Determines the physical order of rows in the table. Each table can have only one clustered index. Usually applied to the primary key.
Non-Clustered Index: A separate structure that points to the rows. A table can have multiple non-clustered indexes. Apply to columns frequently used in WHERE clauses and JOINs.
Columnstore Index: Stores data in columns rather than rows. Dramatically faster for analytical queries that aggregate large numbers of rows. The default choice for data warehouse tables in Azure Synapse Dedicated SQL Pool.

Key Points

Azure SQL Database is a fully managed PaaS relational database — great for structured, transactional data
Use Serverless tier for development databases and workloads with unpredictable usage to save costs
Normalize transactional databases to eliminate redundancy and maintain data integrity
Use Star Schema for analytics databases — fact table at center, dimension tables surrounding it
Implement SCD Type 2 when you need to preserve historical accuracy of dimension attributes
Apply Columnstore Indexes on large analytical tables for significantly faster aggregation queries

Previous lesson

Back to course

Next lesson