Azure Cosmos DB for Data Engineers
Not every data problem fits a relational table. When data has no fixed structure, when applications need responses in under 10 milliseconds worldwide, or when the data model evolves constantly, a NoSQL database is the better tool. Azure Cosmos DB is Microsoft's globally distributed NoSQL database — and data engineers need to know how it works and when to use it.
What is Azure Cosmos DB
Cosmos DB is a fully managed NoSQL database that stores data as documents (JSON), key-value pairs, wide-column data, or graph data. Its SLA targets single-digit millisecond latency for point reads and writes at the 99th percentile, measured within an Azure region that hosts a replica of the data. You can replicate data to additional Azure regions through configuration alone, and applications can be configured to read from the nearest replicated region.
Think of a global restaurant chain. Customers in Tokyo, London, and New York all want the same menu available locally — not fetched from a single kitchen in one city. Cosmos DB replicates your data to servers in every region you choose, so every user reads from a server physically close to them.
When to Use Cosmos DB vs Azure SQL Database
| Use Cosmos DB When | Use Azure SQL Database When |
|---|---|
| Data structure varies between records | Data has a consistent, fixed structure |
| You need sub-10ms response globally | Complex multi-table SQL queries are the priority |
| Write throughput is massive (millions/sec) | Strong ACID transactions across multiple tables |
| Schema changes frequently | Schema is stable and well-defined |
| IoT device state, user sessions, product catalogs | Financial records, orders, HR data |
Cosmos DB APIs — Choosing Your Data Model
Cosmos DB supports multiple APIs. Each API presents a different data model and query language. You choose the API when creating the database account — it cannot be changed later.
- Core (SQL) API (named API for NoSQL in current Azure documentation): Stores JSON documents. Query with a SQL-like language. The most commonly used API. Good default choice for new projects.
- MongoDB API: Stores JSON documents. Use MongoDB query syntax. Best for teams migrating existing MongoDB applications to Azure.
- Cassandra API: Wide-column store. Use CQL (Cassandra Query Language). Best for migrating Apache Cassandra workloads.
- Gremlin API: Graph database. Store entities (vertices) and relationships (edges). Best for social networks, fraud detection graphs, and recommendation engines.
- Table API: Key-value store. Compatible with Azure Table Storage. Best for simple key-value lookup scenarios.
Key Concepts in the Core (SQL) API
Containers and Items
In Cosmos DB, a Container is roughly equivalent to a table in SQL, and an Item is equivalent to a row. But unlike SQL rows, each item is a JSON document and can have a completely different structure from other items in the same container.
// Item 1: a simple product
{
  "id": "prod-001",
  "name": "Running Shoes",
  "brand": "SwiftStep",
  "price": 89.99,
  "sizes": [7, 8, 9, 10, 11],
  "category": "Footwear"
}
// Item 2: same container, different structure
{
  "id": "prod-002",
  "name": "Yoga Mat",
  "brand": "FlexCore",
  "price": 34.99,
  "material": "Natural rubber",
  "thickness_mm": 6,
  "category": "Equipment"
}
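Because items in one container can differ in shape, code that reads them usually has to treat some properties as optional. The following is a minimal Python sketch of that idea, using the two items above as plain dictionaries. The `describe` helper is hypothetical and no Cosmos DB SDK is involved.

```python
# The two sample items, represented as plain Python dictionaries.
items = [
    {"id": "prod-001", "name": "Running Shoes", "brand": "SwiftStep",
     "price": 89.99, "sizes": [7, 8, 9, 10, 11], "category": "Footwear"},
    {"id": "prod-002", "name": "Yoga Mat", "brand": "FlexCore",
     "price": 34.99, "material": "Natural rubber", "thickness_mm": 6,
     "category": "Equipment"},
]

def describe(item: dict) -> str:
    """Build a one-line summary, tolerating properties that some items lack."""
    parts = [f"{item['name']} ({item['category']})"]
    if "sizes" in item:
        parts.append(f"sizes {min(item['sizes'])}-{max(item['sizes'])}")
    if "material" in item:
        parts.append(f"material: {item['material']}")
    return ", ".join(parts)

summaries = [describe(item) for item in items]
for line in summaries:
    print(line)
```

Checking for a property before using it keeps downstream jobs from failing when a new item shape appears in the container.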
Partition Key — The Most Important Design Decision
Cosmos DB distributes data across multiple physical partitions for scalability. The Partition Key determines which partition an item goes to. This is the most critical design decision in a Cosmos DB project.
A good partition key:
- Has high cardinality — many distinct values (customer_id, order_id — good; country — often bad because data concentrates in a few countries)
- Distributes reads and writes evenly — no single partition receives most of the traffic ("hot partition")
- Appears in most of your queries — Cosmos DB is fastest when a query targets a single partition
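A hot partition problem looks like this: an e-commerce platform uses category as the partition key, and the "Electronics" category receives 80% of all traffic. Because every item with the same key value lives in the same partition, one partition handles 80% of the load while the others are mostly idle. Provisioned throughput is spread evenly across physical partitions, so the hot partition is rate-limited and response times suffer even though total capacity is plentiful.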
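The skew described above can be illustrated with a small simulation. This is only a sketch: it uses an MD5-based hash and four partitions as stand-ins, which is an assumption and not the hashing scheme Cosmos DB uses internally, and the request mix and the `partition_for` and `busiest_partition_share` helpers are made up for the example.

```python
import hashlib
from collections import Counter

def partition_for(key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition index using a stable hash."""
    digest = hashlib.md5(key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def busiest_partition_share(key_values, partition_count: int = 4) -> float:
    """Return the fraction of requests that land on the busiest partition."""
    counts = Counter(partition_for(v, partition_count) for v in key_values)
    return max(counts.values()) / len(key_values)

# 10,000 hypothetical requests: 80% target the "Electronics" category,
# spread over 2,500 distinct customers.
other_categories = ["Footwear", "Equipment", "Books", "Toys"]
requests = [
    {
        "category": "Electronics" if i % 5 else other_categories[(i // 5) % 4],
        "customer_id": f"cust-{i % 2500}",
    }
    for i in range(10_000)
]

by_category = busiest_partition_share([r["category"] for r in requests])
by_customer = busiest_partition_share([r["customer_id"] for r in requests])
print(f"category key:    busiest partition gets {by_category:.0%} of requests")
print(f"customer_id key: busiest partition gets {by_customer:.0%} of requests")
```

With category as the key, the busiest partition receives at least 80% of the requests by construction. With customer_id, the share should sit close to an even split across the four partitions.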
Request Units (RUs)
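Cosmos DB measures the cost of database operations in Request Units (RUs). A point read of a 1 KB item (a lookup by id and partition key) costs 1 RU, and writing an item of that size costs roughly 5 RUs. Queries that scan many items cost more. You provision throughput in RUs per second (for example, 1000 RU/s). If the workload exceeds it, Cosmos DB rate-limits the excess requests with HTTP status 429, which clients are expected to retry after a short delay.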
Monitor RU consumption in the Azure portal and adjust throughput based on actual usage. Serverless mode is available for unpredictable workloads — you pay per RU consumed rather than provisioning a fixed capacity.
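The figures above allow a back-of-the-envelope throughput estimate. The sketch below assumes a workload of point reads and writes only and assumes that cost scales linearly with item size, which is a simplification. The function name and workload numbers are hypothetical, and measured RU charges from the portal should take precedence over this estimate.

```python
import math

READ_RU_PER_KB = 1.0   # point read of a 1 KB item, as quoted above
WRITE_RU_PER_KB = 5.0  # approximate write cost of a 1 KB item, as quoted above

def estimate_ru_per_second(reads_per_sec: float, writes_per_sec: float,
                           item_size_kb: float = 1.0) -> int:
    """Rough RU/s estimate; assumes cost scales linearly with item size."""
    per_read = READ_RU_PER_KB * item_size_kb
    per_write = WRITE_RU_PER_KB * item_size_kb
    return math.ceil(reads_per_sec * per_read + writes_per_sec * per_write)

# 500 point reads/s and 100 writes/s of 1 KB items:
# 500 * 1 RU + 100 * 5 RU = 1000 RU/s
needed = estimate_ru_per_second(reads_per_sec=500, writes_per_sec=100)
print(f"Estimated throughput: {needed} RU/s")
```

Queries are left out deliberately, because their cost depends on how many items they scan and cannot be derived from item size alone.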
Cosmos DB Change Feed
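The Change Feed is one of the most powerful features of Cosmos DB for data engineering. Every time an item is created or updated in a container, the change is recorded in a persistent log, the change feed. Changes are ordered within each partition key value, but not across partition keys. In the default mode, deletes are not captured; a common workaround is a soft-delete flag on the item, optionally combined with a TTL.
Downstream systems subscribe to the change feed and react to changes in near real time. This enables event-driven architectures without building complex polling mechanisms. In the default mode the feed exposes the latest version of each changed item, so a consumer may not see every intermediate version of an item that was updated several times in quick succession.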
Common data engineering uses of the Change Feed:
- Sync data from Cosmos DB into Azure Synapse or ADLS Gen2 for analytics
- Trigger downstream processing when specific records change
- Maintain a search index in Azure AI Search that stays in sync with Cosmos DB documents
- Replicate data to other systems for redundancy or integration
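The consumer side of these patterns usually comes down to reading changes from a saved position, applying them idempotently, and saving the new position. The sketch below simulates that loop in memory. `InMemoryChangeFeed` and `sync_to_sink` are hypothetical names invented for illustration and are not part of any Cosmos DB SDK, and a real feed is partitioned and read with continuation tokens rather than a list index.

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryChangeFeed:
    """Toy stand-in for a change feed: an append-only list of item versions."""
    log: list = field(default_factory=list)

    def upsert(self, item: dict) -> None:
        # Every create or update appends the new version of the item.
        self.log.append(dict(item))

    def read_from(self, checkpoint: int, max_items: int = 100):
        # Return the next batch of changes and the position after it.
        batch = self.log[checkpoint:checkpoint + max_items]
        return batch, checkpoint + len(batch)

def sync_to_sink(feed: InMemoryChangeFeed, sink: dict, checkpoint: int) -> int:
    """Apply all changes after `checkpoint` to `sink`; return the new checkpoint."""
    while True:
        batch, next_checkpoint = feed.read_from(checkpoint)
        if not batch:
            return checkpoint
        for item in batch:
            sink[item["id"]] = item  # upsert by id, so replaying a batch is harmless
        checkpoint = next_checkpoint

feed = InMemoryChangeFeed()
feed.upsert({"id": "prod-001", "price": 89.99})
feed.upsert({"id": "prod-002", "price": 34.99})

sink = {}
checkpoint = sync_to_sink(feed, sink, checkpoint=0)

feed.upsert({"id": "prod-001", "price": 79.99})  # a later price change
checkpoint = sync_to_sink(feed, sink, checkpoint)
print(checkpoint, sink["prod-001"]["price"])
```

Writing to the sink by id makes the consumer safe to restart from an older checkpoint, which matters because change processing is typically at-least-once.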
Synapse Link for Cosmos DB — Analytics Without ETL
Normally, running analytical queries on an operational Cosmos DB database would compete with application reads and writes for resources, slowing down both. Synapse Link creates a separate, fully isolated analytical store that automatically mirrors the Cosmos DB data.
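You enable Synapse Link on the Cosmos DB account, enable the analytical store on each container you want to analyze, and connect the account to the Synapse workspace (for example, through a linked service). Synapse serverless SQL pools or Spark pools can then query the analytical store directly, without an ETL pipeline and without consuming the RUs provisioned for the application.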
-- Query the Cosmos DB analytical store from a Synapse serverless SQL pool.
-- <account-key> is a placeholder; a server credential can be used instead of an inline key.
SELECT
    category,
    AVG(price) AS avg_price,
    COUNT(*) AS item_count
FROM OPENROWSET(
    'CosmosDB',
    N'Account=mycosmosaccount;Database=products;Key=<account-key>',
    products
) AS [result]
GROUP BY category
Indexing in Cosmos DB
Cosmos DB indexes all properties in every document by default. This makes any ad-hoc query fast without needing to pre-define indexes. The trade-off is higher write cost and storage usage.
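For write-heavy workloads, customize the indexing policy to index only the properties you actually query. This reduces write RU cost significantly. Queries that filter on an excluded property still run, but they fall back to scanning items and cost more RUs.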
// Custom indexing policy: index only category and price
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/category/?" },
    { "path": "/price/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}
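When several containers need policies of this shape, generating them from the list of queried properties keeps them consistent. The helper below is a hypothetical sketch that only builds the policy document shown above. It handles top-level properties only and does not call any Cosmos DB API.

```python
import json

def build_indexing_policy(queried_properties):
    """Return a policy dict that indexes only the given top-level properties."""
    return {
        "indexingMode": "consistent",
        # One included path per property that queries filter or sort on.
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_properties],
        # Exclude everything else from the index.
        "excludedPaths": [{"path": "/*"}],
    }

policy = build_indexing_policy(["category", "price"])
print(json.dumps(policy, indent=2))
```

The resulting JSON can be pasted into the container's indexing policy settings or passed to whichever deployment tooling the project already uses.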
Key Points
- Cosmos DB is a globally distributed NoSQL database — best for flexible schema, global low latency, and high write throughput
- Choose the partition key carefully — it must distribute data evenly and appear in most queries
- The Change Feed captures inserts and updates (deletes are not included by default), so you can build event-driven pipelines without polling
- Synapse Link enables analytics directly on Cosmos DB data without ETL pipelines or impacting application performance
- Use Serverless mode for unpredictable or low-volume workloads to avoid over-provisioning RU/s
- Customize the indexing policy in write-heavy containers to reduce RU consumption
