Databricks on Cloud Platforms

Databricks does not run in isolation. It runs on top of cloud infrastructure — the servers, storage, networking, and security services that public cloud providers supply. Understanding how Databricks deploys on each major cloud platform explains why certain configuration choices exist, how billing works, how data connects to other cloud services, and how to make informed decisions about which cloud to use for a Databricks deployment.

Databricks runs on three cloud platforms: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). On each platform, Databricks operates in a fundamentally similar way — a managed control plane operated by Databricks communicates with a data plane running inside the customer's own cloud account. But the specifics differ in meaningful ways: the storage services, the identity systems, the compute instance types, and the networking configurations all reflect each cloud's unique ecosystem.

The Control Plane and Data Plane Architecture

Every Databricks deployment, regardless of cloud, follows a two-plane architecture. This architecture defines where different components run and who manages them.

Control Plane (Managed by Databricks)

The control plane is Databricks' own infrastructure, operated and maintained by Databricks in its cloud accounts. The control plane includes the web application (the Databricks workspace UI), the cluster manager (which creates and terminates compute resources), the job scheduler, the notebook server, and the MLflow tracking server.

Users connect to the control plane when they open the Databricks web interface, run notebooks, or schedule jobs. The control plane handles authentication, authorization, and coordination. However, customer data never enters the control plane. Data stays in the customer's cloud account at all times.

Data Plane (Managed by the Customer, Inside Customer's Cloud Account)

The data plane runs inside the customer's own cloud account — their AWS account, their Azure subscription, or their GCP project. It includes the compute clusters (virtual machines that run Spark), the object storage where data lives, and the network infrastructure connecting them.

When a Databricks cluster starts, the control plane instructs the customer's cloud account to launch virtual machines using the customer's own compute quota and compute credits. The data never leaves the customer's cloud account. The control plane only sends instructions — not data.

This architecture addresses a fundamental enterprise concern: data sovereignty. Legal and regulatory requirements in many industries prohibit sending sensitive data to third-party infrastructure. The data plane model satisfies these requirements because Databricks only manages software and orchestration, while the customer's cloud account holds all actual data and compute.

Databricks on AWS

AWS is where Databricks originated. The AWS integration is mature, deeply integrated with the AWS ecosystem, and the platform of choice for many data-intensive organizations already committed to AWS.

AWS Storage: S3

Amazon S3 (Simple Storage Service) is the primary storage layer for Databricks on AWS. Delta tables, raw data files, notebooks, MLflow artifacts, and checkpoint directories all live in S3 buckets. S3 provides essentially unlimited storage capacity at low cost, with strong durability guarantees.

Databricks accesses S3 through two mechanisms:

Instance profiles — IAM (Identity and Access Management) roles attached to the cluster's EC2 instances grant permission to access specific S3 buckets. No credentials need to be stored in notebooks or cluster configurations. The cluster inherits permissions from the attached role automatically.
Unity Catalog storage credentials — For Unity Catalog-governed access, Databricks uses an IAM role with a cross-account trust relationship to access external S3 locations on behalf of the metastore.
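
As a rough illustration of the instance-profile approach, the role behind the profile carries an IAM policy scoped to the buckets the cluster may touch. The sketch below builds such a policy document in Python. The bucket name and the helper `s3_access_policy` are hypothetical, and a real deployment would usually need further statements (for example, for KMS keys).

```python
import json


def s3_access_policy(bucket: str, read_only: bool = True) -> dict:
    """Build an IAM policy document scoped to a single S3 bucket."""
    object_actions = ["s3:GetObject"]
    if not read_only:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions are granted on keys inside the bucket.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = s3_access_policy("example-lakehouse-bucket", read_only=False)
print(json.dumps(policy, indent=2))
```

Because the policy lives on the role rather than in a notebook, rotating or narrowing access is a change to IAM, not to any code that reads the data.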

AWS Compute: EC2 Instances

Databricks clusters on AWS run on Amazon EC2 (Elastic Compute Cloud) virtual machines. Databricks supports a wide range of EC2 instance families:

General purpose (m-family, e.g., m5.xlarge, m5.4xlarge) — Balanced CPU and memory for mixed workloads
Memory optimized (r-family, e.g., r5.4xlarge, r5.8xlarge) — High RAM-to-CPU ratio for memory-intensive analytics
Compute optimized (c-family, e.g., c5.4xlarge) — High CPU performance for compute-intensive jobs
Storage optimized (i-family, e.g., i3.2xlarge) — Fast local NVMe SSDs for workloads benefiting from Delta Cache
GPU instances (p-family, g-family) — For machine learning workloads using GPU acceleration
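
The family letter at the start of an EC2 instance type name is what places it in one of the categories above. A minimal sketch of that lookup follows. The helper `classify_instance` is hypothetical and covers only the families listed in this lesson, so other families fall through to "unknown".

```python
# Family letter -> category, limited to the families named in this lesson.
EC2_FAMILIES = {
    "m": "general purpose",
    "r": "memory optimized",
    "c": "compute optimized",
    "i": "storage optimized",
    "p": "gpu",
    "g": "gpu",
}


def classify_instance(node_type_id: str) -> str:
    """Return the workload category for an instance type such as 'r5.4xlarge'."""
    family = node_type_id.split(".")[0]  # 'r5' from 'r5.4xlarge'
    if not family:
        return "unknown"
    return EC2_FAMILIES.get(family[0].lower(), "unknown")


for node_type in ["m5.xlarge", "r5.8xlarge", "c5.4xlarge", "i3.2xlarge"]:
    print(node_type, "->", classify_instance(node_type))
```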

AWS Networking: VPC

Databricks on AWS runs inside a VPC (Virtual Private Cloud), a logically isolated section of the AWS network. By default, Databricks creates and manages the VPC. Organizations with specific network requirements use VPC injection (called a customer-managed VPC in the Databricks documentation for AWS), deploying Databricks into their own existing VPC where they control routing, firewall rules, and connectivity to on-premises networks via VPN or AWS Direct Connect.

In a VPC injection deployment, the cluster instances and data in S3 remain on the customer's private network. Traffic between the cluster and S3 travels through a VPC endpoint, staying entirely within the AWS network without traversing the public internet.

AWS Security: IAM and KMS

AWS IAM (Identity and Access Management) governs all access permissions in Databricks on AWS. IAM roles define what actions Databricks can perform — which EC2 instances it can create, which S3 buckets it can read, which KMS keys it can use for encryption.

AWS KMS (Key Management Service) manages encryption keys. All data at rest in S3 can be encrypted using KMS keys. Databricks supports both AWS-managed keys (simpler to set up) and customer-managed keys (CMK), where the customer controls the encryption key lifecycle. Regulated industries requiring proof that they control their encryption keys use CMK.

AWS-Specific Integrations

Databricks on AWS connects naturally with AWS services that many organizations already use:

Amazon Kinesis — Structured Streaming reads from Kinesis data streams for real-time processing
AWS Glue Data Catalog — Databricks can use the Glue catalog as a Hive metastore for organizations already using AWS Glue
Amazon Redshift — The Databricks-Redshift connector reads and writes Redshift tables efficiently
AWS Secrets Manager — Sensitive credentials (database passwords, API keys) are stored in Secrets Manager and referenced from notebooks without hardcoding
AWS CloudWatch — Cluster logs and metrics stream to CloudWatch for centralized monitoring
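
To make the Kinesis integration concrete, the sketch below assembles reader options for the `kinesis` Structured Streaming source on a Databricks runtime. The stream name and region are hypothetical. `read_kinesis` only works where a Databricks `spark` session is available, so it is defined here but not called.

```python
def kinesis_source_options(stream_name: str, region: str,
                           initial_position: str = "latest") -> dict:
    """Collect the options passed to the Kinesis streaming source."""
    return {
        "streamName": stream_name,
        "region": region,
        "initialPosition": initial_position,
    }


def read_kinesis(spark, options: dict):
    """Open a streaming DataFrame; needs a Databricks runtime with `spark`."""
    return spark.readStream.format("kinesis").options(**options).load()


# Hypothetical stream and region, used only for illustration.
options = kinesis_source_options("example-clickstream", "us-east-1")
print(options)
```

No access keys appear in the options: with an instance profile attached to the cluster, the source is expected to pick up permissions from the role.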

Databricks on Azure (Azure Databricks)

Azure Databricks is the most tightly integrated version of Databricks on any cloud. Microsoft and Databricks created it as a joint product, so it integrates more deeply with its host cloud's services than Databricks on AWS or GCP does. Azure Databricks is a first-party service in the Azure Portal: customers purchase it through Microsoft, it appears in their Azure subscription alongside other Azure resources, and Microsoft's support team can assist with it directly.

Azure Storage: ADLS Gen2

Azure Data Lake Storage Gen2 (ADLS Gen2) serves the same role that S3 serves on AWS — it is the primary data storage layer for Databricks on Azure. ADLS Gen2 combines the cost-effectiveness of Azure Blob Storage with hierarchical namespace support (proper directory structures) optimized for analytics workloads.

Databricks accesses ADLS Gen2 through managed identities — Azure's equivalent of AWS IAM instance profiles. A managed identity attached to the cluster grants Databricks permission to access specific ADLS Gen2 containers without storing any credentials. Unity Catalog on Azure uses a managed identity to access external data locations on behalf of the metastore.

Azure Compute: Virtual Machines

Databricks clusters on Azure run on Azure Virtual Machines. The VM families available for Databricks clusters mirror the same categories available on AWS:

General purpose (D-series, e.g., Standard_D8s_v3) — Balanced CPU and memory
Memory optimized (E-series, e.g., Standard_E16s_v3) — High memory for large-scale analytics
Compute optimized (F-series, e.g., Standard_F16s_v2) — High CPU performance
Storage optimized (L-series, e.g., Standard_L8s_v2) — Fast local NVMe for Delta Cache
GPU (NC-series, ND-series) — For GPU-accelerated machine learning

Azure Networking: VNet

Azure uses Virtual Networks (VNets) instead of VPCs. Azure Databricks supports VNet injection — deploying clusters into a customer-managed VNet with custom subnet configurations, Network Security Groups (NSGs), and routing tables. Private Link enables private connectivity from the Databricks workspace to ADLS Gen2, keeping all data traffic on the Microsoft backbone network.

Azure Identity: Azure Active Directory

Azure Active Directory (Azure AD, now called Microsoft Entra ID) is the identity backbone of Azure Databricks. Users authenticate to Databricks using their Azure AD credentials — the same username and password they use for Microsoft 365, Teams, and other Microsoft services. Single Sign-On (SSO) works automatically.

Azure AD groups synchronize directly with Databricks access control. When an organization's IT department adds a user to the "Data Engineers" Azure AD group, that user automatically gains the Databricks permissions assigned to the Data Engineers group. When a user's Azure AD account is disabled (because they left the company), their Databricks access disappears simultaneously without any separate action.
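
The group-based model described above can be sketched in a few lines: permissions attach to groups, a user's effective permissions are the union over their groups, and a disabled account resolves to nothing. This is a toy model, not the Azure AD or Databricks API. The class, the group mapping, and the permission names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryUser:
    name: str
    groups: set = field(default_factory=set)
    enabled: bool = True


# Hypothetical group-to-permission mapping; permission names are illustrative.
GROUP_PERMISSIONS = {
    "Data Engineers": {"attach_to_clusters", "run_jobs"},
    "Analysts": {"query_sql_warehouse"},
}


def effective_permissions(user: DirectoryUser,
                          group_permissions: dict = GROUP_PERMISSIONS) -> set:
    """Union of the permissions of every group the user belongs to."""
    if not user.enabled:
        return set()  # a disabled directory account loses all access
    permissions = set()
    for group in user.groups:
        permissions |= group_permissions.get(group, set())
    return permissions


alice = DirectoryUser("alice", groups={"Data Engineers"})
print(sorted(effective_permissions(alice)))
alice.enabled = False  # the account is disabled in the directory
print(sorted(effective_permissions(alice)))
```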

Azure-Specific Integrations

Azure Event Hubs — Structured Streaming reads from Event Hubs topics using the Kafka-compatible API, enabling real-time data processing from Azure's managed streaming service
Azure Synapse Analytics — The Databricks-Synapse connector uses Azure's internal network to move data efficiently between Databricks and Synapse data warehouse
Azure Data Factory — ADF orchestrates Databricks notebook and job runs within larger ETL pipelines
Azure Key Vault — Sensitive credentials are stored in Key Vault and exposed to Databricks notebooks through Key Vault-backed secret scopes
Azure Monitor — Cluster logs and Spark metrics stream to Azure Monitor for alerts and dashboards
Microsoft Fabric — Databricks integrates with Microsoft Fabric (Microsoft's unified analytics platform), sharing data through OneLake shortcuts
Power BI — Direct integration allows Power BI reports to query Databricks SQL warehouses and Delta tables with an optimized connector

Azure Marketplace and Billing

Azure Databricks is purchased through the Azure Marketplace. Costs appear on the same Azure bill as other Azure services. Organizations with Azure Enterprise Agreements or Azure credits can apply those credits to Databricks usage. This unified billing simplifies financial management for organizations already committed to Microsoft's enterprise agreements.

Databricks on Google Cloud Platform (GCP)

Databricks on GCP is the newest of the three cloud deployments, generally available since 2021. It is growing in adoption as organizations already using GCP's analytics and AI services seek to add Databricks' capabilities to their data stack.

GCP Storage: Google Cloud Storage

Google Cloud Storage (GCS) serves as the primary storage layer on GCP, equivalent to S3 on AWS and ADLS Gen2 on Azure. GCS buckets hold Delta tables, raw data, and all Databricks artifacts.

Access to GCS uses Google's Workload Identity Federation — a mechanism that grants Databricks clusters permission to access GCS buckets using service account impersonation, without storing long-lived credentials on the cluster.
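
Since S3, ADLS Gen2, and GCS play the same role, the main visible difference in notebook code is the URI scheme used to address a path. The helper below is a hypothetical sketch that builds the URI for each cloud. The bucket, container, and storage account names in the usage lines are placeholders.

```python
from typing import Optional


def storage_uri(cloud: str, container: str, path: str,
                account: Optional[str] = None) -> str:
    """Build an object-storage URI for a bucket (AWS/GCP) or container (Azure)."""
    path = path.lstrip("/")
    if cloud == "aws":
        return f"s3://{container}/{path}"
    if cloud == "gcp":
        return f"gs://{container}/{path}"
    if cloud == "azure":
        if not account:
            raise ValueError("ADLS Gen2 URIs need a storage account name")
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    raise ValueError(f"unsupported cloud: {cloud!r}")


# Placeholder names, used only for illustration.
print(storage_uri("aws", "example-bucket", "delta/events"))
print(storage_uri("azure", "raw", "delta/events", account="examplelake"))
print(storage_uri("gcp", "example-bucket", "delta/events"))
```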

GCP Compute: Compute Engine

Databricks clusters on GCP run on Google Compute Engine VMs. The relevant machine families include:

N2 and N2D standard — General purpose workloads with balanced resources
M2 and M3 — Memory-optimized for large-scale analytics requiring massive RAM
C2 and C3 — Compute-optimized for CPU-intensive workloads
A2 — GPU-accelerated instances for deep learning and ML workloads

GCP Networking: VPC

GCP uses VPCs for network isolation. Databricks on GCP supports VPC peering to connect Databricks clusters to customer GCP resources. Private Service Connect enables secure private connectivity between the Databricks workspace and GCP services without traversing the public internet.

GCP Identity: Service Accounts and Cloud Identity

GCP uses service accounts for machine-to-machine authentication. Databricks clusters operate as specific service accounts with precisely defined permissions granted through GCP's IAM policies. Human users authenticate through Cloud Identity or Google Workspace accounts, which integrate with Databricks using OIDC (OpenID Connect) for single sign-on.

GCP-Specific Integrations

Google Cloud Pub/Sub — GCP's managed messaging service for streaming data ingestion into Databricks
BigQuery — The Databricks-BigQuery connector reads and writes BigQuery tables, enabling workflows that use both services
Vertex AI — Models trained in Databricks can deploy to Vertex AI endpoints, and Vertex AI services can integrate with Databricks data pipelines
Cloud Key Management — Customer-managed encryption keys for data at rest in GCS
Cloud Logging and Cloud Monitoring — Cluster logs and metrics integrate with GCP's observability stack

Choosing a Cloud: Key Decision Factors

Organizations evaluating which cloud to run Databricks on consider several factors beyond Databricks itself.

Existing Cloud Commitment

The most practical factor for most organizations is where their existing data lives. If an organization already stores petabytes of data in S3 and runs hundreds of AWS services, moving that data to Azure for Databricks creates unnecessary cost and complexity. Databricks works best when deployed on the same cloud where the organization's data already lives.

Microsoft Ecosystem Alignment

Organizations heavily invested in Microsoft products — Microsoft 365, Azure Active Directory, Power BI, Teams, Azure Synapse — benefit disproportionately from Azure Databricks. The SSO integration means zero additional credential management. Power BI reports on Databricks data work seamlessly. Azure Pipelines and Azure Data Factory orchestrate Databricks jobs naturally.

Google AI and ML Services

Organizations building deeply on Google's AI services — Vertex AI, Google's pre-trained APIs, or Google's AI infrastructure — find natural synergy running Databricks on GCP. The data stays in the same cloud ecosystem, reducing data movement and leveraging GCP's AI-optimized networking.

Enterprise Agreements and Pricing

Organizations with significant Azure Enterprise Agreements (EA) or AWS Enterprise Discount Programs (EDP) can apply existing commitments to Databricks usage, reducing effective costs. Azure Databricks consumed under an EA counts toward commitment spend, which matters for organizations with large Microsoft contracts.

Regulatory Requirements

Some regulated industries have specific cloud provider certifications that influence the decision. Healthcare organizations following HIPAA requirements verify each cloud's compliance certifications. Financial services organizations in specific jurisdictions may be required to use clouds with local data centers in approved countries. All three clouds have extensive compliance portfolios, but specific certifications vary by region and service.

Cross-Cloud Unity Catalog

Large organizations sometimes run Databricks on multiple clouds simultaneously. A company might run primary analytics on Azure but use AWS for certain regional operations, or it might have started on AWS and be migrating workloads to GCP.

Unity Catalog supports cross-cloud governance. A single metastore can govern data assets across multiple Databricks workspaces on different clouds, providing consistent access control policies regardless of which cloud a workspace runs on. Data Sharing via the Delta Sharing protocol allows sharing datasets across clouds without data replication — a user on the GCP workspace can access a dataset stored in S3 on the AWS workspace through Delta Sharing.
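
On the recipient side, a shared table is addressed by a single string in the open-source `delta-sharing` Python connector: the path to a profile file, a `#`, and then `share.schema.table`. The sketch below only builds that string. The profile path and the share, schema, and table names are hypothetical.

```python
def sharing_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the table URL understood by the delta-sharing connector."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Hypothetical profile file and share coordinates.
url = sharing_table_url("/tmp/example.share", "sales_share", "emea", "orders")
print(url)

# With the `delta-sharing` package installed and a valid profile file,
# the table could then be loaded with:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```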

Workspace Configuration: Regional Deployment Decisions

Databricks workspaces deploy in specific cloud regions. Region selection affects latency (deploy close to users and data sources), cost (cloud pricing varies by region), and data residency (some regulations require data to stay within specific geographies).

Best practice: deploy the Databricks workspace in the same region as the cloud storage where data lives. Cross-region data transfer is slow and expensive. A Databricks workspace in us-east-1 reading data from an S3 bucket in eu-west-1 incurs cross-region data transfer charges on every query and suffers from the added network latency.
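
A back-of-the-envelope check makes the co-location rule tangible. The function below is a hypothetical sketch: it charges nothing when workspace and storage share a region and otherwise multiplies the data read by a per-GB transfer rate. The rate in the usage line is an assumed placeholder, not a published price.

```python
def cross_region_transfer_cost(workspace_region: str, storage_region: str,
                               gb_read: float, rate_per_gb: float) -> float:
    """Estimate data transfer charges for reading storage from a workspace."""
    if workspace_region == storage_region:
        return 0.0  # co-located: no cross-region transfer charge
    return gb_read * rate_per_gb


ASSUMED_RATE_PER_GB = 0.02  # placeholder rate, not a real price

same = cross_region_transfer_cost("us-east-1", "us-east-1", 500, ASSUMED_RATE_PER_GB)
cross = cross_region_transfer_cost("us-east-1", "eu-west-1", 500, ASSUMED_RATE_PER_GB)
print(f"co-located: {same:.2f}, cross-region: {cross:.2f}")
```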

Disaster Recovery and Multi-Region Deployments

Production data platforms need plans for cloud region outages. Databricks recommends a warm standby approach for mission-critical workloads:

Deploy a primary Databricks workspace in Region A
Replicate Delta tables to Region B using cross-region storage replication (S3 Cross-Region Replication, Azure Storage geo-redundancy, or GCS multi-region buckets)
Maintain a secondary Databricks workspace in Region B on standby, with no clusters running
If Region A fails, activate the Region B workspace and point it at the replicated storage

For less critical workloads, the recovery time from a region outage (measured in hours) is acceptable. For mission-critical workloads like fraud detection or real-time dashboards, active-active deployments across two regions provide continuous availability at significantly higher cost.
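
The warm standby steps above reduce to one decision at failover time: which workspace and which storage root should jobs target? A minimal sketch of that decision follows. The workspace dictionaries, bucket names, and the health-check callable are all hypothetical. Real failover also involves DNS, job redeployment, and verifying replication lag.

```python
def select_active_workspace(primary: dict, standby: dict, region_is_healthy) -> dict:
    """Return the primary workspace unless its region is reported unhealthy."""
    if region_is_healthy(primary["region"]):
        return primary
    return standby


# Hypothetical workspace descriptors for Region A and Region B.
PRIMARY = {"region": "region-a", "storage_root": "s3://example-primary-bucket/delta"}
STANDBY = {"region": "region-b", "storage_root": "s3://example-replica-bucket/delta"}

regions_in_outage = {"region-a"}  # simulate a Region A failure
active = select_active_workspace(
    PRIMARY, STANDBY, lambda region: region not in regions_in_outage
)
print(active["region"], active["storage_root"])
```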

Cost Management Across Clouds

Databricks charges for DBUs (Databricks Units) — a measure of compute usage that varies by cluster type and workload. On top of DBU charges, the cloud provider charges for the underlying compute instances, cloud storage, and data transfer. Understanding both charge types is essential for cost management.
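
The two charge types can be added up explicitly. In the sketch below, the DBU charge goes to Databricks and the VM charge goes to the cloud provider. A spot discount, if any, applies only to the VM portion. Every number in the usage lines is an assumed placeholder rather than a real price or DBU rate.

```python
def hourly_cluster_cost(nodes: int, dbu_per_node_hour: float, dbu_price: float,
                        vm_price_per_hour: float, spot_discount: float = 0.0) -> float:
    """Combine the Databricks DBU charge and the cloud provider's VM charge."""
    dbu_cost = nodes * dbu_per_node_hour * dbu_price
    # Spot pricing lowers the VM charge only; DBU consumption is unchanged.
    vm_cost = nodes * vm_price_per_hour * (1.0 - spot_discount)
    return dbu_cost + vm_cost


# Placeholder figures for a four-node cluster.
on_demand = hourly_cluster_cost(4, dbu_per_node_hour=2.0, dbu_price=0.25,
                                vm_price_per_hour=1.0)
with_spot = hourly_cluster_cost(4, dbu_per_node_hour=2.0, dbu_price=0.25,
                                vm_price_per_hour=1.0, spot_discount=0.75)
print(on_demand, with_spot)
```

Storage and data transfer charges are left out of this sketch and would be added on top.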

Common cost optimization strategies that apply across all three clouds:

Spot and preemptible instances — All three clouds offer discounted compute instances that the provider can reclaim when demand increases. For batch jobs that can tolerate occasional restarts, using Spot Instances (AWS), Spot VMs (GCP), or Azure Spot VMs can reduce compute costs by 60-80%.
Auto-termination — Configure clusters to terminate after a period of inactivity. A cluster with auto-termination set to 30 minutes shuts down once it has been idle for 30 minutes and stops incurring costs. Without auto-termination, forgotten idle clusters run indefinitely.
Cluster policies — Administrators define policies limiting the maximum cluster size, instance types, and auto-termination settings that users can select. Policies prevent users from accidentally launching expensive large clusters for small jobs.
SQL Warehouses for analytics — Databricks SQL Warehouses (serverless and provisioned) scale compute resources to SQL query demand more efficiently than all-purpose clusters for business intelligence and reporting workloads.
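
Cluster policies are written as JSON rules keyed by cluster attribute. The sketch below shows a small policy definition in that style, together with a local checker that mimics what the policy would reject. The checker `policy_violations` is a hypothetical teaching aid that handles only the `range` and `allowlist` rule types. Actual enforcement happens in Databricks when a cluster is created.

```python
# Policy definition using range and allowlist rules on cluster attributes.
POLICY_DEFINITION = {
    "autotermination_minutes": {"type": "range", "minValue": 10,
                                "maxValue": 60, "defaultValue": 30},
    "node_type_id": {"type": "allowlist",
                     "values": ["m5.xlarge", "m5.4xlarge"],
                     "defaultValue": "m5.xlarge"},
    "num_workers": {"type": "range", "maxValue": 8},
}


def policy_violations(cluster_spec: dict, policy: dict) -> list:
    """List the ways a requested cluster spec breaks the policy (local sketch)."""
    problems = []
    for attribute, rule in policy.items():
        value = cluster_spec.get(attribute)
        if value is None:
            continue  # this sketch ignores attributes the spec leaves unset
        if rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{attribute}={value!r} is not in the allowlist")
        elif rule["type"] == "range":
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{attribute}={value} is below {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{attribute}={value} is above {rule['maxValue']}")
    return problems


request = {"node_type_id": "r5.8xlarge", "num_workers": 20,
           "autotermination_minutes": 30}
for problem in policy_violations(request, POLICY_DEFINITION):
    print(problem)
```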

Real-World Deployment Scenario: Financial Services Firm

A financial services firm operates across North America and Europe. They have an existing Azure enterprise agreement, use Azure Active Directory for all employee accounts, and their trading data is already stored in Azure Data Lake Storage Gen2. Their compliance team requires all European customer data to remain within the European Economic Area.

The architecture decision is clear: Azure Databricks in two regions. A primary workspace in Azure's East US region processes North American operations data. A separate workspace in Azure's West Europe region processes European customer data. Both workspaces connect to the same Unity Catalog metastore, which maintains consistent governance policies across both regions.

Raw European customer data never leaves the West Europe region, satisfying the regulatory data residency requirement. North American analysts access only the East US workspace. European analysts access only the West Europe workspace. For the limited European data that must be visible in the East US workspace for cross-border analytics, Unity Catalog column-level security ensures that sensitive fields appear masked.

Power BI reports for executive dashboards connect to both Databricks SQL Warehouses, using Azure AD credentials for SSO. No separate login is required. All infrastructure costs appear on the single Azure bill under the enterprise agreement, simplifying financial reporting.

Key Points Summary

Databricks uses a control plane (managed by Databricks) and data plane (running inside the customer's cloud account) architecture to ensure customer data never leaves the customer's control.
On AWS, primary storage is S3, compute uses EC2 instances, and access control leverages IAM roles and instance profiles.
Azure Databricks is the most tightly integrated cloud deployment, using ADLS Gen2 for storage, Azure VMs for compute, and Azure Active Directory for identity management with automatic SSO.
On GCP, storage uses Google Cloud Storage, compute uses Compute Engine, and identity uses service accounts with Workload Identity Federation.
The primary factor in cloud selection is where existing data lives — minimizing data movement reduces cost and complexity.
Each cloud has native service integrations: Kinesis on AWS, Event Hubs on Azure, Pub/Sub on GCP for streaming; each cloud's respective key management, monitoring, and identity services connect directly.
Unity Catalog supports governance across multiple workspaces on different clouds from a single metastore.
Cost management strategies — spot instances, auto-termination, cluster policies, and SQL Warehouses — apply across all three clouds.
Region selection should co-locate the Databricks workspace with the cloud storage where data resides to minimize latency and cross-region data transfer costs.
