Databricks Unity Catalog
Imagine a large hospital. Thousands of doctors, nurses, administrators, and researchers work there every day. Each group needs access to different records. Doctors need patient files. Accountants need billing data. Researchers need anonymized statistics. But no one should access data they are not supposed to see. Unity Catalog works much like the hospital's security system: it decides who can see what data, tracks every action, and keeps everything organized in one place.
Unity Catalog is Databricks' centralized governance solution. It lets organizations manage their data assets (tables, files, machine learning models, and functions) from a single control point. Instead of setting permissions separately in every workspace, every team, and every project, Unity Catalog gives one unified place to define rules, enforce security, and track data usage.
Why Organizations Need Data Governance
Before Unity Catalog, organizations using Databricks faced a real problem. Each workspace had its own set of permissions. If a company had ten workspaces, an administrator had to manage permissions in all ten places separately. When someone left the company, removing their access meant visiting ten different locations. If an auditor asked "who accessed this table last month?" — the answer required piecing together logs from multiple places.
This approach breaks down quickly. Data governance failures lead to data breaches, compliance violations, and costly regulatory fines. Industries like banking, healthcare, and insurance operate under strict laws — GDPR in Europe, HIPAA in the United States, and others — that require organizations to prove exactly who accessed what data and when.
Unity Catalog solves this by acting as the single source of truth for all data access decisions across every Databricks workspace in an organization.
The Three-Level Namespace: Catalog → Schema → Table
Unity Catalog organizes data in a three-level hierarchy. Think of it like a filing cabinet system in a law firm.
- The filing room is the Catalog. It holds everything related to one department or project.
- Each drawer inside the filing room is a Schema. It groups related folders together.
- Each folder inside the drawer is a Table, View, or other data object.
In practical terms, a company might create a catalog called finance_data. Inside that catalog, they create schemas like accounts_payable, accounts_receivable, and payroll. Inside each schema, they store the actual tables with numbers and records.
When a user wants to query a table, they write the full path: finance_data.accounts_payable.vendor_invoices. This three-part name — catalog.schema.table — makes it clear exactly where every piece of data lives.
The benefit of this structure is precision. Permissions can be assigned at any level, and a privilege granted on a catalog or schema is inherited by the objects inside it. You can give a user access to an entire catalog, restrict them to one specific schema, or limit them to reading a single table while blocking write access.
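The hierarchy described above can be sketched in Databricks SQL. This is a minimal sketch, assuming a Unity Catalog-enabled workspace and a user who is allowed to create catalogs. The column definitions are hypothetical and exist only so the table can be created.

```sql
-- Level 1: the catalog (the "filing room").
-- If the metastore has no default storage, a MANAGED LOCATION clause is also needed.
CREATE CATALOG IF NOT EXISTS finance_data;

-- Level 2: a schema inside the catalog (a "drawer").
CREATE SCHEMA IF NOT EXISTS finance_data.accounts_payable;

-- Level 3: a table inside the schema (a "folder").
-- Column names and types are illustrative assumptions.
CREATE TABLE IF NOT EXISTS finance_data.accounts_payable.vendor_invoices (
  invoice_id   BIGINT,
  vendor_name  STRING,
  amount       DECIMAL(12, 2),
  invoice_date DATE
);

-- Query the table by its full three-part name.
SELECT invoice_id, vendor_name, amount
FROM finance_data.accounts_payable.vendor_invoices
WHERE invoice_date >= DATE'2024-01-01';

-- Alternatively, set a default catalog and schema, then use the short name.
USE CATALOG finance_data;
USE SCHEMA accounts_payable;
SELECT COUNT(*) AS invoice_count FROM vendor_invoices;
```

The full three-part name is the safer choice in shared notebooks and jobs, because it does not depend on whatever default catalog and schema happen to be active in the session.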
How Unity Catalog Controls Access: The GRANT and REVOKE System
Unity Catalog uses a permission system based on SQL commands — the same language data professionals already know. To give someone access, you run a GRANT statement. To remove access, you run REVOKE.
Here is a concrete example. A data analyst named Priya joins the marketing team. She needs to read customer purchase data but must not modify it. An administrator runs:
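    GRANT SELECT ON TABLE marketing_data.customers.purchases TO `priya@company.com`;
Now Priya can read the purchases table, provided she also holds USE CATALOG on marketing_data and USE SCHEMA on marketing_data.customers. (The backticks around her email address are required because the principal name contains special characters.) If she leaves the marketing team, the administrator runs:
    REVOKE SELECT ON TABLE marketing_data.customers.purchases FROM `priya@company.com`;
Her access disappears across all workspaces attached to the metastore. One command. No need to visit ten different places.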
The available privilege types in Unity Catalog cover different actions:
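- SELECT: Read data from a table or view
- MODIFY: Insert, update, or delete rows
- CREATE TABLE, CREATE SCHEMA, CREATE CATALOG: Create new objects of that type inside the parent object
- USE CATALOG: Required to reach anything inside a catalog; it does not by itself allow reading any data
- USE SCHEMA: Required to reach anything inside a schema; it does not by itself allow reading any data
- ALL PRIVILEGES: Every privilege that applies to the object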
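The following sketch shows how these privileges might combine in practice. It assumes a catalog named finance_data with a schema accounts_payable and a table vendor_invoices already exist. The group names ap_analysts and ap_engineers are hypothetical account-level groups.

```sql
-- Prerequisites: let the analysts reach the catalog and the schema.
GRANT USE CATALOG ON CATALOG finance_data TO `ap_analysts`;
GRANT USE SCHEMA ON SCHEMA finance_data.accounts_payable TO `ap_analysts`;

-- Table level: read-only access to a single table.
GRANT SELECT ON TABLE finance_data.accounts_payable.vendor_invoices TO `ap_analysts`;

-- Schema level: privileges granted on a schema are inherited by the tables
-- inside it, so the engineers can read, write, and create tables there.
GRANT USE CATALOG ON CATALOG finance_data TO `ap_engineers`;
GRANT USE SCHEMA, SELECT, MODIFY, CREATE TABLE
  ON SCHEMA finance_data.accounts_payable TO `ap_engineers`;

-- Inspect what has been granted on an object.
SHOW GRANTS ON TABLE finance_data.accounts_payable.vendor_invoices;

-- Take a single privilege away again without touching the others.
REVOKE MODIFY ON SCHEMA finance_data.accounts_payable FROM `ap_engineers`;
```

Granting SELECT without the two USE privileges is a common source of "permission denied" errors, which is why the prerequisites come first in the sketch.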
Metastore: The Brain of Unity Catalog
At the center of Unity Catalog sits the metastore. The metastore stores all metadata — information about data, not the data itself. It knows the names of every table, every schema, and every catalog. It records who owns each object. It stores the permission rules. It keeps the physical storage locations of all managed and external tables.
Think of the metastore as the index card system in an old library. The cards do not hold the books themselves. They tell you the book's title, author, location number, and who has borrowed it. When you want a book, you consult the index cards first.
An organization typically has one metastore per cloud region in which it operates. All Databricks workspaces attached to that metastore share the same metadata, which is why permissions set from one workspace automatically apply in every other attached workspace. The metastore enforces consistency.
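A few read-only statements make the metastore's role visible. This is a sketch that assumes the finance_data objects used earlier in this article exist; running the first statement in two workspaces attached to the same metastore should return the same ID.

```sql
-- Returns the ID of the metastore the current workspace is attached to.
SELECT current_metastore();

-- Lists the catalogs in that metastore that the current user is allowed to see.
SHOW CATALOGS;

-- Lists the schemas in one catalog. The answer comes from metastore metadata,
-- not from scanning data files.
SHOW SCHEMAS IN finance_data;

-- Shows table metadata held by the metastore, including owner, type,
-- and storage location.
DESCRIBE TABLE EXTENDED finance_data.accounts_payable.vendor_invoices;
```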
Managed Tables vs. External Tables
Unity Catalog handles two kinds of tables differently, and understanding the difference matters for data management decisions.
Managed Tables
Managed tables are fully controlled by Unity Catalog. When you create a managed table, Databricks decides where to store the data — typically in a cloud storage location that Unity Catalog manages. When you drop a managed table, Databricks deletes both the table definition and the underlying data files.
This is like renting a furnished apartment. The landlord (Unity Catalog) supplies and manages the furniture, so when the lease ends, the furniture goes along with the apartment.
External Tables
External tables point to data that already exists somewhere — an S3 bucket in AWS, a container in Azure, or a folder in Google Cloud Storage. Unity Catalog registers the table and enforces access control, but the actual data files live outside Unity Catalog's ownership. If you drop an external table, only the registration disappears. The data files remain untouched.
This is like storing your belongings in a warehouse you own. The security system (Unity Catalog) controls who enters the warehouse, but you own everything inside. If you cancel the security system, your belongings stay in the warehouse.
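The difference between the two table types shows up in the CREATE TABLE statement. The sketch below assumes a schema finance_data.payroll exists. The table names, columns, and bucket path are placeholders, and the path must fall under an external location that is already registered in Unity Catalog.

```sql
-- Managed table: no LOCATION clause, so Unity Catalog chooses the storage path.
CREATE TABLE IF NOT EXISTS finance_data.payroll.salaries_managed (
  employee_id BIGINT,
  salary      DECIMAL(12, 2)
);

-- External table: LOCATION points at cloud storage outside Unity Catalog's ownership.
-- The path below is a placeholder, not a real bucket.
CREATE TABLE IF NOT EXISTS finance_data.payroll.salaries_external (
  employee_id BIGINT,
  salary      DECIMAL(12, 2)
)
LOCATION 's3://example-bucket/payroll/salaries';

-- The Type row in the output reports MANAGED or EXTERNAL.
DESCRIBE TABLE EXTENDED finance_data.payroll.salaries_managed;

-- Dropping the managed table removes the definition and its data files.
DROP TABLE finance_data.payroll.salaries_managed;

-- Dropping the external table removes only the registration.
-- The files under the LOCATION path stay where they are.
DROP TABLE finance_data.payroll.salaries_external;
```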
Row-Level and Column-Level Security
Basic permission systems control access at the table level — either you can see the whole table or you cannot. Unity Catalog goes further with row-level and column-level security.
Column-Level Security
A table might contain both public and sensitive columns. Consider an employee table with columns for employee ID, department, salary, and bank account number. HR managers need salary data. Department managers need to see who works in their department but not salary figures. No one outside HR should see bank account numbers.
Column masking solves this. An administrator creates a masking policy that shows the bank account column as ****XXXX to everyone except the payroll team. The data exists in the table, but most users see only the masked version.
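In Unity Catalog, a masking policy of this kind is written as a SQL function and attached to the column. The sketch below assumes a hypothetical table hr_data.staff.employees with a STRING column bank_account, and a hypothetical account-level group named payroll.

```sql
-- Masking function: returns the real value only to members of the payroll group.
CREATE OR REPLACE FUNCTION hr_data.staff.mask_bank_account(bank_account STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('payroll') THEN bank_account
  ELSE '****XXXX'
END;

-- Attach the mask to the column. From now on every query against the column
-- passes through the function before results are returned.
ALTER TABLE hr_data.staff.employees
  ALTER COLUMN bank_account SET MASK hr_data.staff.mask_bank_account;

-- A user outside the payroll group sees '****XXXX' in the bank_account column.
SELECT employee_id, department, bank_account
FROM hr_data.staff.employees;

-- Remove the mask again if the policy is no longer needed.
ALTER TABLE hr_data.staff.employees
  ALTER COLUMN bank_account DROP MASK;
```

Because the check runs inside the function, changing who may see the real value means changing group membership, not rewriting queries or dashboards.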
Row-Level Security
Row filters restrict which rows a user sees. A sales manager in the North region should see only sales records from the North region, not the South or West. Instead of creating three separate tables, one table holds all records. A row filter checks the user's identity and returns only the rows that match their region.
This approach reduces data duplication and simplifies maintenance. One table serves all regions while each manager sees only their relevant data.
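A row filter is also a SQL function, this time returning TRUE for rows the caller may see. The sketch below assumes a hypothetical table sales_data.regional.sales_records with a STRING column region, and hypothetical account-level groups sales_admins, sales_north, sales_south, and sales_west.

```sql
-- Filter function: decides, row by row, whether the current user may see the row.
CREATE OR REPLACE FUNCTION sales_data.regional.region_filter(region STRING)
RETURNS BOOLEAN
RETURN CASE
  WHEN is_account_group_member('sales_admins') THEN TRUE
  WHEN is_account_group_member('sales_north')  THEN region = 'North'
  WHEN is_account_group_member('sales_south')  THEN region = 'South'
  WHEN is_account_group_member('sales_west')   THEN region = 'West'
  ELSE FALSE
END;

-- Attach the filter to the table, passing the region column as the argument.
ALTER TABLE sales_data.regional.sales_records
  SET ROW FILTER sales_data.regional.region_filter ON (region);

-- A member of sales_north now gets only North rows from the same query
-- that an admin would use to see everything.
SELECT region, COUNT(*) AS record_count
FROM sales_data.regional.sales_records
GROUP BY region;

-- Remove the filter if it is no longer needed.
ALTER TABLE sales_data.regional.sales_records DROP ROW FILTER;
```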
Data Lineage: Tracing the Life of Data
Unity Catalog automatically tracks data lineage — the complete history of how data moves and transforms across an organization. When a dashboard shows a sales number, lineage tracking answers the question: where did that number come from?
Picture a water filtration plant. Water comes in from a river, passes through several filtration stages, and comes out as clean drinking water. Lineage tracking maps every stage of that journey. If the drinking water tests bad, engineers can trace back through every stage to find the source of contamination.
In data terms, if a business report shows a suspicious revenue figure, a data engineer can open the lineage graph and see: this number came from table A, which was processed by notebook B, which pulled raw data from source C. The investigation takes minutes instead of days.
Unity Catalog captures lineage automatically for SQL queries, notebooks, jobs, and pipelines. No manual configuration is required.
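Besides the lineage graph in the user interface, lineage can be queried with SQL where system tables are enabled. The sketch below assumes access to the system.access.table_lineage system table. The target table name is hypothetical, and the column names should be checked against the system table schema in your workspace before relying on them.

```sql
-- Which upstream tables fed the suspicious report table in the last 30 days,
-- and what kind of workload (notebook, job, pipeline, query) wrote it?
SELECT DISTINCT
  source_table_full_name,
  entity_type,
  created_by
FROM system.access.table_lineage
WHERE target_table_full_name = 'finance_data.reporting.monthly_revenue'
  AND source_table_full_name IS NOT NULL
  AND event_date >= date_sub(current_date(), 30);
```

Running the same query again with one of the returned source tables as the new target walks one step further upstream, which mirrors the "table A, notebook B, source C" trace described above.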
Audit Logs: Every Action Leaves a Record
Audit logs record every action taken against data objects in Unity Catalog. Every SELECT query, every GRANT command, every table creation, every schema modification — all of it gets logged with a timestamp and the identity of the person or system that performed the action.
For regulated industries, this is not optional. GDPR requires organizations to prove that personal data is accessed only by authorized parties. HIPAA requires hospitals to show exactly which employees viewed patient records. Unity Catalog's audit logs provide this proof automatically.
Audit logs can also feed cloud monitoring services. Organizations can forward logs to tools such as AWS CloudWatch, Azure Monitor, or Google Cloud Logging for long-term storage and alerting. If an unauthorized access attempt occurs at 2:00 AM, an alert rule on those logs can notify the security team shortly after the event is recorded.
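Where system tables are enabled, audit events can be queried directly with SQL. The sketch below assumes access to the system.access.audit system table. The request_params key full_name_arg is an assumption about how table names appear in Unity Catalog events; the key varies by action, so inspect a few rows before filtering on it.

```sql
-- Who touched one specific table in the last 90 days, and what did they do?
SELECT
  event_time,
  user_identity.email             AS actor,
  action_name,
  request_params['full_name_arg'] AS object_name,
  response.status_code            AS status_code
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 90)
  AND request_params['full_name_arg'] = 'marketing_data.customers.purchases'
ORDER BY event_time DESC;
```

Filtering on event_date as well as the table name keeps the scan limited to the time window an auditor actually asked about.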
Groups: Managing Permissions at Scale
Granting permissions one user at a time does not scale in large organizations. A company with five hundred data engineers cannot afford to update five hundred individual permissions every time a policy changes.
Unity Catalog uses groups to solve this. An administrator creates a group called data_engineers and adds all five hundred engineers to it. Permissions get granted to the group, not to individuals. When a new engineer joins, the administrator adds them to the group, and they instantly inherit all the group's permissions.
Groups connect to existing identity systems. Most organizations already use Microsoft Entra ID (formerly Azure Active Directory), Okta, or Google Workspace to manage employee accounts. Databricks syncs users and groups from these identity providers. When an employee's account is disabled in the identity provider because they left the company, their Unity Catalog access disappears with it once the change syncs. No separate cleanup step is needed.
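Group membership itself is managed in the account console or the identity provider, not in SQL, but the grants are ordinary SQL. This sketch assumes an account-level group named data_engineers already exists and that the finance_data.accounts_payable schema used earlier is in place.

```sql
-- One set of grants covers every current and future member of the group.
GRANT USE CATALOG ON CATALOG finance_data TO `data_engineers`;
GRANT USE SCHEMA, SELECT, MODIFY
  ON SCHEMA finance_data.accounts_payable TO `data_engineers`;

-- A policy change is one statement, not five hundred.
REVOKE MODIFY ON SCHEMA finance_data.accounts_payable FROM `data_engineers`;

-- Review what the group currently holds on the schema.
SHOW GRANTS `data_engineers` ON SCHEMA finance_data.accounts_payable;

-- Check whether the user running the query is a member of the group.
SELECT is_account_group_member('data_engineers') AS in_group;
```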
Delta Sharing: Sharing Data Across Organizations
Unity Catalog includes Delta Sharing, an open protocol for sharing data with external parties. A company might want to share a dataset with a business partner, a regulator, or a research institution without giving them access to the internal Databricks environment.
Delta Sharing works like a secure file transfer, but smarter. Instead of copying data to an email attachment or a shared drive, the data owner creates a share in Unity Catalog. The recipient gets a credential token. They use that token to access the shared data directly from their own tools — Spark, Pandas, Power BI, or even plain SQL — without needing a Databricks account.
The data owner retains full control. They can revoke access at any time. They can see exactly how often the recipient queries the data. The original data never leaves the owner's storage.
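On the provider side, the sharing workflow described above might look like the following sketch. The share name partner_share and recipient name partner_org are hypothetical, and the table is the marketing example used earlier in this article.

```sql
-- Create a share, which is a named container for the objects to expose.
CREATE SHARE IF NOT EXISTS partner_share
  COMMENT 'Datasets shared with an external partner';

-- Add a table to the share. The data is not copied.
ALTER SHARE partner_share ADD TABLE marketing_data.customers.purchases;

-- Create a recipient. Without a sharing identifier this is a token-based
-- recipient, suitable for partners who do not use Databricks.
CREATE RECIPIENT IF NOT EXISTS partner_org
  COMMENT 'External partner organization';

-- Let the recipient read everything in the share.
GRANT SELECT ON SHARE partner_share TO RECIPIENT partner_org;

-- Shows recipient details, including the activation link through which the
-- partner downloads their credential.
DESCRIBE RECIPIENT partner_org;

-- Review who can read the share.
SHOW GRANTS ON SHARE partner_share;

-- End the arrangement at any time.
REVOKE SELECT ON SHARE partner_share FROM RECIPIENT partner_org;
```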
Service Principals: Automating Secure Access
Automated systems — scheduled jobs, pipelines, and applications — also need to access data. Assigning a human user's credentials to an automated system creates security risks. If that employee leaves, the automated system breaks. If the employee's account gets compromised, the attacker gains access to everything the automated system can touch.
Service principals solve this. A service principal is an identity created specifically for a non-human system. A data pipeline gets its own service principal with only the permissions it needs — nothing more. The pipeline authenticates using the service principal, not a human account. This principle — giving each system the minimum permissions required — is called least-privilege access.
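A least-privilege grant for a pipeline might look like the sketch below. The backticked UUID is a placeholder for the service principal's application ID, and the target table invoice_summary is hypothetical. The service principal itself is assumed to have been created already in the account console.

```sql
-- Placeholder application ID. Replace with the real ID of the service principal.
GRANT USE CATALOG ON CATALOG finance_data
  TO `00000000-0000-0000-0000-000000000000`;
GRANT USE SCHEMA ON SCHEMA finance_data.accounts_payable
  TO `00000000-0000-0000-0000-000000000000`;

-- Read access to the one source table the pipeline consumes.
GRANT SELECT ON TABLE finance_data.accounts_payable.vendor_invoices
  TO `00000000-0000-0000-0000-000000000000`;

-- Read and write access to the one target table the pipeline produces.
GRANT SELECT, MODIFY ON TABLE finance_data.accounts_payable.invoice_summary
  TO `00000000-0000-0000-0000-000000000000`;

-- Confirm that nothing broader than intended was granted.
SHOW GRANTS `00000000-0000-0000-0000-000000000000`
  ON SCHEMA finance_data.accounts_payable;
```

Granting at the table level rather than the schema level keeps the pipeline from reading tables that are added to the schema later.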
Setting Up Unity Catalog: The Key Steps
Organizations enabling Unity Catalog for the first time follow a standard sequence:
- Create the Metastore — An account administrator creates one metastore per cloud region from the Databricks account console.
- Assign Workspaces — The administrator connects existing Databricks workspaces to the metastore.
- Create Storage Credentials — Unity Catalog needs permission to read and write to cloud storage. Storage credentials are created in the metastore to establish this trust.
- Create External Locations — Specific cloud storage paths get registered as external locations, giving Unity Catalog visibility into where data lives.
- Create Catalogs and Schemas — Administrators build the three-level namespace that fits the organization's structure.
- Grant Permissions — Users and groups receive the privileges they need to do their work.
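The first two steps are performed in the account console, and storage credentials are typically created through Catalog Explorer or the API. The later steps can be sketched in SQL. This sketch assumes a storage credential named finance_cred already exists. The bucket paths and the group name finance_team are placeholders.

```sql
-- Register a cloud storage path as an external location,
-- using the existing storage credential.
CREATE EXTERNAL LOCATION IF NOT EXISTS finance_landing
  URL 's3://example-bucket/finance'
  WITH (STORAGE CREDENTIAL finance_cred)
  COMMENT 'Landing zone for finance data';

-- Build the namespace. MANAGED LOCATION tells Unity Catalog where to keep
-- managed tables for this catalog; the path must sit under an external location.
CREATE CATALOG IF NOT EXISTS finance_data
  MANAGED LOCATION 's3://example-bucket/finance/managed';
CREATE SCHEMA IF NOT EXISTS finance_data.payroll;

-- Grant the privileges users and groups need to do their work.
GRANT USE CATALOG ON CATALOG finance_data TO `finance_team`;
GRANT USE SCHEMA, SELECT ON SCHEMA finance_data.payroll TO `finance_team`;

-- Optionally let a group read raw files directly from the external location.
GRANT READ FILES ON EXTERNAL LOCATION finance_landing TO `finance_team`;
```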
Common Real-World Scenarios
Scenario 1: The Compliance Audit
A bank's internal audit team asks: "Show us every time customer account data was accessed in the last 90 days." With Unity Catalog, the answer can take minutes. The team queries the audit log table, filters by the relevant tables and time window, and exports the results. Without Unity Catalog, the same question can take weeks of manual log collection across multiple systems.
Scenario 2: The New Data Science Team
A company hires a team of ten data scientists to work on customer churn prediction. The data these scientists need sits across three schemas in two catalogs. An administrator creates a group called churn_team, grants USE CATALOG on the two catalogs and USE SCHEMA and SELECT on the three schemas to the group, and adds all ten scientists. All ten have exactly the access they need within minutes. The sensitive financial columns in those tables remain masked because the masking functions on those columns do not exempt the churn team's group.
Scenario 3: Sharing with a Research Partner
A pharmaceutical company wants to share anonymized clinical trial data with a university research team. Using Delta Sharing, they create a share containing the anonymized tables. The university researchers receive a token and connect their own analysis tools directly to the shared data. The pharmaceutical company monitors query activity and can revoke access when the research project ends.
Key Points Summary
- Unity Catalog provides centralized governance for all Databricks workspaces in one organization.
- The three-level namespace (Catalog → Schema → Table) organizes all data assets with precision.
- The metastore stores all metadata and permission rules, enforcing consistency across workspaces.
- GRANT and REVOKE commands control access using familiar SQL syntax.
- Column masking and row filters provide fine-grained security within tables.
- Data lineage tracking shows exactly how data moves and transforms through an organization.
- Audit logs capture every access event, supporting compliance with GDPR, HIPAA, and other regulations.
- Groups enable scalable permission management for large teams.
- Delta Sharing allows secure external data sharing without granting Databricks access.
- Service principals provide secure, automated system access using the least-privilege principle.
