ADE Azure SQL Database and Data Modeling
Not all data belongs in a data lake. Structured, relational data that needs fast querying, strong consistency, and transactional integrity belongs in a relational database. Azure SQL Database is Microsoft's fully managed relational database service in the cloud — and understanding how to model data inside it is a foundational skill for every Azure data engineer.
What is Azure SQL Database
Azure SQL Database is a PaaS (Platform as a Service) relational database built on Microsoft SQL Server. Microsoft handles patching, backups, high availability, and scaling. You focus on your schema, data, and queries.
You write standard SQL to interact with it — the same T-SQL syntax used in on-premises SQL Server. This makes migration from on-premises environments straightforward.
Azure SQL Deployment Options
Azure offers three related SQL services. Choosing the right one depends on your workload.
| Service | Best For | Key Characteristic |
|---|---|---|
| Azure SQL Database | Single application databases | Fully managed, scales automatically, serverless option available |
| Azure SQL Managed Instance | Lifting on-premises SQL Server to cloud | Near 100% SQL Server compatibility including SQL Agent, linked servers |
| SQL Server on Azure VM | Full control over OS and SQL Server | IaaS — you manage the OS, patches, and high availability yourself |
For most new data engineering projects, Azure SQL Database is the right choice. Use SQL Managed Instance when you are migrating an existing on-premises SQL Server that uses features not available in Azure SQL Database.
Pricing Tiers — Choosing the Right Compute
Azure SQL Database offers several pricing tiers:
- DTU-based (Basic, Standard, Premium): A bundled measure of CPU, memory, and I/O. Simple to understand. Good for predictable workloads.
- vCore-based (General Purpose, Business Critical, Hyperscale): You choose the number of CPU cores and memory independently. More flexibility and better cost optimization for demanding workloads.
- Serverless: Compute automatically pauses when the database is idle and resumes when queries arrive. You pay only when the database is active. Ideal for development databases and workloads with unpredictable usage patterns.
Relational Data Modeling Fundamentals
Data modeling defines how data is organized in tables and how those tables relate to each other. Good data modeling makes queries fast, data consistent, and storage efficient. Poor modeling leads to slow queries, duplicate data, and update anomalies.
Primary Keys and Foreign Keys
A Primary Key uniquely identifies each row in a table. No two rows can share the same primary key value. A customer table has a customer_id primary key — every customer gets a unique ID.
A Foreign Key in one table references the primary key of another table. An orders table has a customer_id column that references the customer_id in the customers table. This relationship ensures every order belongs to a real customer.
Normalization — Eliminating Redundancy
Normalization is the process of organizing a database to reduce duplicate data. Instead of storing a customer's name and city in every order row, you store them once in a customers table and reference them with a customer_id.
Think of a hotel. Instead of writing the room type description — "King bed, sea view, 45 sqm, minibar, bath" — on every booking record, you assign a room type code and store the description once in a room types table. Every booking just references the code.
The three most common normal forms:
- 1NF: Each column contains atomic (single, indivisible) values. No repeating groups. Each row is unique.
- 2NF: All non-key columns depend on the entire primary key — not just part of it. Applies when the primary key is composite (made of multiple columns).
- 3NF: No column depends on another non-key column. Remove transitive dependencies. Example: if you store zip_code and city in the same table, city depends on zip_code — not directly on the primary key. Move city to a separate zip code table.
Star Schema — The Data Warehouse Model
Normalized schemas are great for transactional databases. For analytics databases and data warehouses, a different model works better — the Star Schema.
A Star Schema has two types of tables:
- Fact Table: Stores measurable events — sales transactions, website clicks, sensor readings. Contains numeric metrics and foreign keys that connect to dimension tables.
- Dimension Tables: Describe the "who, what, where, when" of the facts — customers, products, dates, locations. Contain descriptive attributes.
The fact table sits in the center. Dimension tables surround it like points of a star — hence the name.
A Sales Star Schema Example
Fact table: fact_sales
- sale_id (PK)
- customer_key (FK to dim_customer)
- product_key (FK to dim_product)
- date_key (FK to dim_date)
- store_key (FK to dim_store)
- quantity, unit_price, discount, total_amount
Dimension tables:
- dim_customer: customer_key, customer_name, city, country, segment
- dim_product: product_key, product_name, category, brand, unit_cost
- dim_date: date_key, full_date, year, quarter, month, weekday, is_holiday
- dim_store: store_key, store_name, region, country, manager
An analyst asking "What was the total revenue by product category in Q3 2024 for European stores?" joins the four tables and gets the answer in one clean SQL query. The Star Schema makes this type of multi-dimensional analysis fast and intuitive.
Slowly Changing Dimensions (SCD)
Dimension data changes over time. A customer moves to a new city. A product gets recategorized. How you handle these changes affects the accuracy of historical analysis. These scenarios are called Slowly Changing Dimensions.
SCD Type 1 — Overwrite
Simply update the old value with the new value. No history is kept. Use when historical accuracy of that attribute does not matter. Example: correcting a misspelled customer name.
SCD Type 2 — Add a New Row
Keep the old row and insert a new row with the updated value. Add start_date, end_date, and is_current columns to track which version is active. Use when historical accuracy matters. Example: tracking a customer's city over time so that sales reports show where the customer lived when each purchase was made.
-- SCD Type 2 structure customer_key | customer_id | customer_name | city | start_date | end_date | is_current 1 | 1001 | Alice Brown | Mumbai | 2022-01-01 | 2023-05-31 | N 2 | 1001 | Alice Brown | Bangalore | 2023-06-01 | 9999-12-31 | Y
SCD Type 3 — Add a New Column
Add a column to store the previous value alongside the current value. Limited to tracking one change back. Example: add a previous_city column next to current_city.
Indexes — Making Queries Faster
An index in a database is like the index at the back of a textbook. Instead of reading every page to find a topic, you look up the topic in the index and jump directly to the right page.
- Clustered Index: Determines the physical order of rows in the table. Each table can have only one clustered index. Usually applied to the primary key.
- Non-Clustered Index: A separate structure that points to the rows. A table can have multiple non-clustered indexes. Apply to columns frequently used in WHERE clauses and JOINs.
- Columnstore Index: Stores data in columns rather than rows. Dramatically faster for analytical queries that aggregate large numbers of rows. The default choice for data warehouse tables in Azure Synapse Dedicated SQL Pool.
Key Points
- Azure SQL Database is a fully managed PaaS relational database — great for structured, transactional data
- Use Serverless tier for development databases and workloads with unpredictable usage to save costs
- Normalize transactional databases to eliminate redundancy and maintain data integrity
- Use Star Schema for analytics databases — fact table at center, dimension tables surrounding it
- Implement SCD Type 2 when you need to preserve historical accuracy of dimension attributes
- Apply Columnstore Indexes on large analytical tables for significantly faster aggregation queries
