What is Databricks?

The Data Problem

Why organizations ended up with two systems for one job

Data Warehouses

Strengths

Structured, schema-enforced queries
Optimized for BI and reporting
Strong SQL support and governance

Limitations

Rigid schemas; slow to adapt
Expensive to scale storage
Poor support for unstructured data

Data Lakes

Strengths

Stores any data type at any scale
Low-cost object storage
Flexible for ML and data science

Limitations

No schema enforcement ("data swamp" risk)
No ACID transactions
Difficult to query reliably for BI

The two-system problem: Organizations ended up running both a data warehouse (for BI teams) and a data lake (for data scientists and engineers). Same data, stored twice, governed differently, with constant sync headaches between the two. This is the problem the Lakehouse was designed to solve.

Data Silos

The same data copied across warehouse, lake, and staging areas. Every copy drifts over time.

Governance Gaps

Different access controls in the warehouse vs. the lake. Who has the "right" version?

Cost Explosion

Paying for storage twice, compute twice, and ETL pipelines to keep everything in sync.

**Data Warehouses.** Structured and reliable, but rigid and expensive. **Microsoft equivalent**: Synapse Dedicated SQL Pool, or classic Azure SQL DW.

**Data Lakes.** Flexible and cheap, but no governance or ACID transactions. **Microsoft equivalent**: ADLS Gen2 with raw Parquet/JSON/CSV files sitting there with no query engine on top.

The two-system pattern is **extremely common**. Most Microsoft shops run:
- Synapse + ADLS, or
- Fabric Lakehouse + Warehouse

The three problems below are universal complaints from those shops.

**Data Silos.** Same data copied across warehouse, lake, and staging. Every copy drifts over time. Industry slang for the worst case: **"data swamp"** (an ungoverned lake that became unusable).

**Governance Gaps.** Different worlds, different controls:
- Different access controls in warehouse vs lake
- Different audit trails
- Different lineage

No single source of truth on who can see what.

**Cost Explosion.** You're paying in three places:
- Storage twice (once per system)
- Compute twice
- ETL pipelines to keep both sides in sync

The Lakehouse

Lake flexibility with warehouse reliability

Data Lake: Any File Format, Schema on Read, Cost Effective, Weak Governance

Data Warehouse: Structured Tables, Schema on Write, Fast Queries, Rigid & Costly

Data Lakehouse: Any Format, ACID, Fast Queries, Low Cost

Property	Traditional Lake (Pre-Delta)	Delta Lake (ACID Compliant)
Partial Writes	Possible	Prevented
Read Consistency	Unstable	Guaranteed
Rollback	None	Built-in
Versioning	Manual	Built-in

Atomicity: all or nothing; transactions complete fully or not at all

Consistency: data always moves from one valid state to another

Isolation: concurrent operations don't interfere with each other

Durability: once committed, data persists even through failures

Go deeper: Databricks popularized the Lakehouse pattern, built on top of Delta Lake format (open-source, columnar, ACID-compliant). For a look at how data flows through Bronze, Silver, and Gold layers in a Lakehouse, see our Medallion Architecture guide.

**Data Lakes.** Flexible and cheap, but no governance. One half of the old paradigm.

**Data Warehouses.** Structured and reliable, but rigid and expensive. The other half.

**The Lakehouse combines both.** Any format, ACID transactions, fast queries, low cost.

**Did Databricks invent it?** Yes, with a small asterisk:
- Popularized the term in its lakehouse research paper
- Built **Delta Lake**, later open-sourced, which is the format that makes it possible

Fabric uses Delta now too, so it isn't proprietary, but it started here.

**Traditional lakes had real problems.** No transactions:
- Jobs could die halfway through and leave half-written data
- Two readers could see different versions of the same table
- No rollback, no version history, no schema enforcement

**Delta solves all of that** with a transaction log on top of Parquet files:
- Atomic writes
- Consistent reads
- Rollback to any prior state
- Versioning built in

That **ACID guarantee** is what makes the lake safe for business use. *Skip the partial-writes detail on camera unless someone asks; this summary covers it.*

**Atomicity.** Writes complete fully or not at all. No half-written tables.

**Consistency.** Data always moves from one valid state to another. Schema enforced.

**Isolation.** Concurrent reads and writes don't interfere with each other.

**Durability.** Once committed, data survives failures. **Time travel** lets you query any prior version that is still within the table's retention window.
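To make the transaction-log idea concrete, here is a minimal sketch in plain Python. It is not the Delta Lake protocol or any Databricks API: the `ToyDeltaTable` class, the `_toy_log` directory, and the JSON file layout are all invented for illustration, and it assumes a local filesystem. It shows only the core trick: data files become visible when a commit file is created, so a job that dies before committing leaves nothing readable behind, and earlier versions stay queryable.

```python
import json
import os
import tempfile


class ToyDeltaTable:
    """Toy table: data files plus an ordered commit log (NOT the real Delta protocol)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_toy_log")  # hypothetical layout
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Committed versions are the numbered commit files present in the log.
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def write(self, rows, fail_before_commit=False):
        """Write a data file, then publish it by creating one commit file."""
        versions = self._versions()
        next_version = versions[-1] + 1 if versions else 0
        data_file = f"part-{next_version:05d}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        if fail_before_commit:
            # Simulated crash: the data file exists but was never committed.
            raise RuntimeError("job died before commit")
        commit_path = os.path.join(self.log_dir, f"{next_version:020d}.json")
        # Mode "x" fails if the file exists, so two writers cannot claim one version.
        with open(commit_path, "x") as f:
            json.dump({"add": data_file}, f)
        return next_version

    def read(self, version=None):
        """Return rows visible at `version` (default: latest committed)."""
        rows = []
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                data_file = json.load(f)["add"]
            with open(os.path.join(self.root, data_file)) as f:
                rows.extend(json.load(f))
        return rows


with tempfile.TemporaryDirectory() as root:
    table = ToyDeltaTable(root)
    table.write([{"id": 1}])
    try:
        table.write([{"id": 2}], fail_before_commit=True)  # never becomes visible
    except RuntimeError:
        pass
    table.write([{"id": 3}])
    latest = table.read()          # rows from committed writes only
    first = table.read(version=0)  # "time travel" to the first commit
```

Readers only ever follow the log, which is why the half-finished second write is invisible. The real format adds much more (checkpoints, removes, schema, concurrency control on object storage), none of which is modeled here.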

Links to deep-dive guides on **Delta** and **Medallion Architecture**. **Parquet** is the columnar file format underneath. Every modern platform uses it: Fabric, Databricks, Snowflake, BigQuery.

What Databricks Actually Is

A platform layer, not a cloud replacement

Workspace

Notebooks

SQL Editor

Jobs

Repos

Dashboards

Compute

All-Purpose Clusters

Job Clusters

SQL Warehouses

Storage & Governance

Delta Lake

Unity Catalog

Volumes

Your Cloud Provider

Azure

AWS

GCP

Databricks orchestrates the workspace, compute, and governance layers. Your storage stays in your cloud account, while compute is decoupled so you can scale it independently.

Key insight: Databricks is a platform layer that runs on your cloud provider. It does not replace Azure, AWS, or GCP. The storage layer stays in your cloud account. The compute layer is decoupled, so you choose whether it runs on your own account (Classic) or on Databricks-managed infrastructure (Serverless).

What This Means in Practice

Your Data, Your Storage

Data lives in your own ADLS Gen2 (Azure), S3 (AWS), or GCS (Google) storage. Databricks reads and writes to it, but never takes ownership of it.

Compute Scales Independently

Spin up clusters when you need processing power, shut them down when you don't. Storage and compute are fully decoupled.

Two Scopes, One Product

Storage is yours, sitting in your cloud account. Databricks runs the workspace, compute, and governance on top.

Azure Databricks is the Azure-native deployment of the Databricks platform. Same core product, integrated with Microsoft Entra ID (formerly Azure Active Directory), deployed into your Azure subscription, and billed through Azure. If you already work in the Microsoft ecosystem, this is typically where you start.

**Workspace layer.** Fabric equivalents:
- **Notebooks** -> Fabric Notebooks
- **SQL Editor** -> Fabric SQL Endpoint of a Lakehouse
- **Lakeflow Jobs** (orchestration) -> Fabric Data Pipelines
- **Repos** (Git integration) -> Fabric Git
- **Dashboards** -> native simple visualizations

Dashboards are **NOT** a Power BI competitor. Most shops still use Power BI on top.

**Compute layer.** Three types, and this is where the confusion is:
- **All-Purpose Clusters** (Spark): general compute for notebooks, pay per second
- **Job Clusters** (Spark): spun up for a job, torn down right after, cheaper
- **SQL Warehouses** (SQL-only): what Power BI talks to

SQL Warehouses come in three flavors (all three include Photon):
- **Classic**: VMs in your cloud account
- **Pro**: adds Predictive IO
- **Serverless**: VMs in Databricks' account, adds Intelligent Workload Management, starts in seconds

**Storage and governance layer.**
- **Delta Lake**: open-source storage format Databricks invented (Parquet plus a transaction log)
- **Unity Catalog**: governance layer. Three-level namespace (catalog.schema.table), row/column security, full lineage
- **Volumes**: where non-tabular data lives. Files, images, ML models

Fabric equivalent for Volumes: the "Files" section of a Fabric Lakehouse.
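The three-level namespace is easiest to see with a name pulled apart. The sketch below is plain Python, not a Unity Catalog API: `TableName`, `parse_table_name`, and the example name `sales_prod.finance.invoices` are all hypothetical, and it assumes simple identifiers with no quoted dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


# Hypothetical fully qualified name, for illustration only.
name = parse_table_name("sales_prod.finance.invoices")
```

Each level is a place where access can be granted or withheld, which is why a two-part name such as `finance.invoices` is rejected here rather than guessed at.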

**Cloud provider layer.** Azure, AWS, GCP. Databricks runs on top of all three. Same platform, different infrastructure underneath.

**Key architectural point.** Databricks is a platform layer that runs *on* your cloud, not a replacement for it.
- Storage layer stays in your cloud account
- Compute layer is decoupled. **Classic** runs on VMs in your account; **Serverless** runs on Databricks-managed infrastructure

You choose.

Three practical implications coming up. Each one builds on the storage/compute decoupling we just covered.

**Your Data, Your Storage.** Data lives in *your* storage:
- **Azure**: ADLS Gen2
- **AWS**: S3
- **Google**: GCS

Databricks reads and writes to it but never takes ownership. This is the **storage** half of the decoupling.

**Compute Scales Independently.** Spin up clusters when you need processing power, shut them down when you don't. This is the **compute** half. It's why you can run a heavy training job for an hour and pay nothing for compute the next day; only storage keeps billing.

**Two Scopes, One Product.** Clear ownership boundary, single product experience.
- **Your scope**: storage, sitting in your cloud account
- **Databricks scope**: workspace, compute, governance running on top

Same idea as the storage/compute split, just framed as ownership.

**Azure Databricks** is a native Azure resource.
- Deployed via Marketplace
- Billed through your Azure subscription
- Auth through **Microsoft Entra ID** (was Azure AD)
- Plays natively with **Key Vault**, **Event Hubs**, **ADLS**, **Power BI**, **Purview**

Four Service Areas

What you can actually do on the Databricks platform

Databricks organizes its capabilities into four pillars. Most organizations start with one or two, then expand as their data maturity grows.

Service Area	Languages	Key Tool	Also	Output
Data Engineering	Python, SQL, Scala	Lakeflow Spark Declarative Pipelines	Lakeflow Jobs, Auto Loader	Governed Delta tables
SQL Analytics	SQL	SQL Warehouses	Editor, Dashboards	BI query endpoints
Machine Learning	Python, R, Scala	MLflow	Feature Store, AutoML	Models + endpoints
Streaming	Python, SQL, Scala	Structured Streaming	Lakeflow streaming mode	Live Delta tables

**Four pillars.** Most organizations start with one or two and expand over time.

**Data Engineering.** The pipeline-building side. **Flagship**: Lakeflow Spark Declarative Pipelines (was **Delta Live Tables** before the 2025 rebrand). You describe the target tables; Databricks figures out orchestration, error handling, and data quality. Python or SQL.

**Key terms**:
- **Lakeflow** -> umbrella name for Databricks' ETL stack
- **Lakeflow Jobs** -> scheduling and orchestration
- **Spark** -> the underlying open-source distributed processing engine
- **Auto Loader** -> watches a folder in cloud storage and auto-ingests new files
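The "describe the target tables, let the platform work out the order" idea can be sketched without any Databricks code. The example below is a toy in plain Python: the table names and `plan_run_order` are hypothetical, and it uses the standard library's `graphlib` rather than anything from Lakeflow. Each table declares only what it reads from, and the run order is derived from those declarations.

```python
from graphlib import TopologicalSorter

# Each target table declares only the tables it reads from (hypothetical names).
declared_tables = {
    "bronze_orders": [],
    "bronze_customers": [],
    "silver_orders": ["bronze_orders", "bronze_customers"],
    "gold_daily_revenue": ["silver_orders"],
}


def plan_run_order(tables):
    """Derive a valid execution order from declared dependencies."""
    return list(TopologicalSorter(tables).static_order())


run_order = plan_run_order(declared_tables)
```

Nobody wrote "run bronze first"; that fact falls out of the dependency graph. A real declarative pipeline layers retries, incremental processing, and quality checks on top of this ordering step, which this sketch does not attempt.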

**SQL Analytics.** The BI-friendly side.
- **SQL Warehouses** -> the compute
- **SQL Editor** -> where you write queries
- **Dashboards** -> native simple visualizations

Entry point for anyone who'd rather not write Python. Most Power BI folks land here first. Serverless SQL Warehouses connect to Power BI.

**Machine Learning.** **MLflow** handles the lifecycle:
- Tracks every experiment
- Registers model versions
- Packages them for deployment

Databricks invented MLflow, open-sourced it. *Fabric uses it now too.*

**Adjacent tools**:
- **Feature Store** -> centralized engineered features for reuse across models
- **AutoML** -> tries dozens of model types automatically, picks the best
- **Model Serving** -> deploys a trained model as a REST endpoint

**Streaming.** Same Spark engine that runs batch can run streaming.
- **Structured Streaming** -> Spark's stream processing API. Code looks like batch; Spark handles the streaming complexity
- **Lakeflow streaming mode** -> Lakeflow pipelines can run continuously

You don't maintain two stacks for batch and real-time.
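A rough way to picture "code looks like batch" is one aggregation function fed either the whole dataset or small slices of it. This is a plain-Python toy, not the Structured Streaming API: `count_by_status`, the `events` list, and the micro-batch size of 2 are all assumptions made for illustration.

```python
from collections import Counter


def count_by_status(events, running_totals=None):
    """Count events per status; pass running_totals to continue an earlier count."""
    totals = Counter(running_totals or {})
    totals.update(event["status"] for event in events)
    return totals


# Hypothetical events, for illustration only.
events = [
    {"status": "ok"},
    {"status": "ok"},
    {"status": "failed"},
    {"status": "ok"},
    {"status": "failed"},
]

# Batch: process everything at once.
batch_result = count_by_status(events)

# "Streaming": the same function applied to micro-batches, carrying state forward.
stream_result = Counter()
for start in range(0, len(events), 2):
    micro_batch = events[start:start + 2]
    stream_result = count_by_status(micro_batch, stream_result)
```

Both paths end with the same totals because the logic is shared and only the delivery of data differs. The engine's job in the real system is the part hidden here: tracking state, ordering, and recovery between micro-batches.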

How It Works With Cloud Providers

Databricks runs on your cloud, not instead of it

Databricks

Azure

AWS

GCP

Both/and, not either/or. Your source systems can stay where they are with Databricks querying them live, or you can bring data in and serve it as Delta tables from your lakehouse. Most architectures use a mix. The platform doesn't force one mode over the other.

Area	Databricks Provides	Your Cloud Provides
Development	Workspace + notebooks	Object storage
Compute	Optimized Spark + Photon	Virtual machines
Governance	Unity Catalog	Identity provider
Pipelines	Lakeflow Spark Declarative Pipelines	Encryption + keys
Cost	Platform fee (DBUs)	Infrastructure billing
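The last row of the table means a classic-compute job produces two line items: DBUs from Databricks and VM time from the cloud provider. The sketch below shows the arithmetic only. `estimate_job_cost` is a hypothetical helper and every number passed to it is a made-up placeholder, not a published price.

```python
def estimate_job_cost(hours, dbus_per_hour, dbu_rate, vm_rate_per_hour, node_count):
    """Return (platform_fee, infrastructure_cost, total) for one classic-compute job."""
    # Platform fee: DBUs consumed by the cluster, billed by Databricks.
    platform_fee = hours * dbus_per_hour * dbu_rate
    # Infrastructure: VM time for every node, billed by the cloud provider.
    infrastructure_cost = hours * vm_rate_per_hour * node_count
    return platform_fee, infrastructure_cost, platform_fee + infrastructure_cost


# All figures below are placeholders chosen for easy arithmetic, not real prices.
platform_fee, infrastructure_cost, total = estimate_job_cost(
    hours=2, dbus_per_hour=10, dbu_rate=0.5, vm_rate_per_hour=1.0, node_count=4
)
```

With these placeholder inputs the split is 10.0 for the platform and 8.0 for infrastructure. Real rates vary by cloud, region, workload type, and VM size, so treat the function as a template for the two-bill structure rather than an estimator.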

**Same Databricks on any cloud.** The accurate version of *"your data never leaves your cloud"*:
- **Classic compute**: data lives in your storage, processes on VMs in your account
- **Serverless compute**: data still lives in your storage, but processes briefly on Databricks-managed VMs (in-region)

Storage layer stays in your cloud account either way. Compute is decoupled.

**Both/and, not either/or.** This is the reframe of the whole section. Two valid patterns, and most real architectures use both:
- Source systems stay where they are; Databricks queries them live, OR
- You bring data in and serve it as Delta tables from your lakehouse

The platform doesn't force one mode over the other. This replaces the older "data never leaves your cloud" oversimplification.

**Clear division of labor.**
- **Cloud** -> physical infrastructure (storage, VMs, networking, identity)
- **Databricks** -> platform on top (Spark, SQL engines, Unity Catalog, Lakeflow)

You're not choosing between Databricks and your cloud. You're using them together.

Better Together

Both/And, not Either/Or

Databricks and Microsoft Fabric are complementary platforms. Many organizations use both: Databricks for heavy engineering and ML, Fabric and Power BI for analytics and reporting. The real question is not "which one?" but "where does each fit?"

Where Databricks Excels

Large-scale ETL and data engineering

ML model training and serving

Multi-cloud data governance

Real-time streaming pipelines

Open-source toolchain (Spark, MLflow, Delta)

Where Microsoft Fabric Excels

Power BI semantic models and reporting

OneLake unified storage with shortcuts

T-SQL analytics (familiar for SQL Server teams)

Low-code data prep (Dataflow Gen2)

Microsoft 365 and Teams integration

Databricks

Processing + ML

OneLake Shortcuts

DirectQuery to SQL Warehouse

Unity Catalog Mirroring

Fabric

OneLake + Analytics

Power BI

Reporting + Dashboards

Explore the Microsoft side: Our Direct Lake guide covers how Fabric connects to data (Import, DirectQuery, and Direct Lake modes). The Fabric Engines & Items guide explains OneLake shortcuts and how Fabric's own compute engines compare. In practice, many organizations use Databricks for heavy ETL and ML workloads, then surface results through Fabric and Power BI for business consumption.

**Does Databricks compete with Fabric?** Short answer: **no.** They overlap in places, but each is stronger at different jobs. Most mature organizations use both. *This isn't a fight. It's two platforms that happen to land on the same Parquet file.*

**Databricks is stronger at:**
- Heavy data engineering
- Machine learning at scale
- Real-time streaming
- Multi-cloud
- Open-source orientation

**Fabric is stronger at:**
- Power BI semantic models
- OneLake
- T-SQL analytics
- Low-code data prep (Dataflow Gen2)
- Microsoft 365 integration

**Concrete integration paths:**
- **OneLake Shortcuts** -> pointers in OneLake to data living elsewhere (Databricks tables, S3) without copying
- **DirectQuery to Databricks SQL** -> Power BI's live mode hitting SQL Warehouses
- **Direct Lake** -> Power BI storage mode that reads Parquet/Delta straight from OneLake. No copy, no DirectQuery overhead
- **Mirroring** -> Fabric mirrors Unity Catalog tables into OneLake; the mirror updates as data changes in Databricks

Links to our Fabric guides for the audience that wants to go deeper on the Microsoft side.

Who Uses Databricks

Four personas, one platform

Different roles interact with different parts of the platform. Here are the four primary personas you will encounter in a Databricks environment.

Data Engineer

Builds and orchestrates pipelines using notebooks, Lakeflow Spark Declarative Pipelines, and Lakeflow Jobs. Python and SQL are the primary languages.

SQL Analyst

Queries data through SQL Warehouses and the SQL Editor. Builds dashboards and connects BI tools. No Python required.

Data Scientist

Trains and deploys ML models using MLflow, Feature Store, and notebook experiments. Uses GPU clusters for deep learning.

Platform Admin

Manages Unity Catalog, workspace access, cluster policies, and cost controls. The person who keeps the platform secure and efficient.

You do not need to be a Python developer

This is one of the most common misconceptions about Databricks. SQL-first users can work entirely within SQL Warehouses and the SQL Editor. If you are comfortable writing T-SQL in SQL Server or Fabric, you already have the foundation to query data in Databricks.

**Four primary personas.** Most people fall into one of these.

**Data Engineer.**
- Builds the pipelines
- Designs the medallion architecture
- Heavy notebook user
- Python and SQL

Lives in **Lakeflow**.

**Analytics Engineer / SQL Analyst.**
- Connects to SQL Warehouses
- Writes transformations
- Builds the **Gold-layer** tables that Power BI reports run against

Pure SQL gets you most of the way. You don't need to be a Python developer for this role.

**Data Scientist.**
- Trains models
- Lives in **MLflow** and notebooks
- Python and R
- Deploys through **Model Serving** when something's ready for production

**Platform Admin.**
- Owns **Unity Catalog**, cluster policies, networking, cost
- The SRE of your data platform
- Fewer per org but critical

**Key takeaway** for the Power BI / Fabric audience: you do **NOT** need to know Python to use Databricks. The SQL Analyst path covers most of it.

Example Data Flow

A common Databricks architecture from source systems to business consumption

Source

Data source systems

ERP

CRM

Files

Data Engineering

Ingest, transform, and store with Databricks

Ingest Layer: Lakeflow Connect (managed ingestion), Auto Loader (streaming ingest)

Transform Layer: Lakeflow Declarative Pipelines, Notebooks (custom transforms)

Store Layer: Bronze, Silver, and Gold, each stored in Delta Lake

Serving Layer

Query the data

Unity Catalog

Databricks SQL

Analytics and Reporting

Business and data science consumption

Data Science: MLflow, Notebook

AI: Genie (NL queries), Mosaic AI (agents)

Reporting: Power BI (Import, Live)

Dashboards: AI/BI Dashboards

Unity Catalog: Governance and Lifecycle

Unified data governance across all layers

Discover

Lineage

Access Control

Audit

Quality

Compliance

**Sources.** ERP, CRM, flat files feed the ingestion layer. Same as any data platform.

**Data Engineering with Medallion Architecture.** Industry standard now, not Databricks-specific. *Fabric uses it too.* **Metallurgy metaphor**: increasing purity through each layer.
- **Bronze** -> raw ingested data, schema-on-read, as the source gave it
- **Silver** -> cleaned, deduplicated, conformed, joined with reference data
- **Gold** -> business-ready, aggregated, shaped for a specific consumer (often a single Power BI semantic model)

All stored as Delta tables.
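The three layers can be walked through with a handful of records. This is a plain-Python toy, not Spark or Lakeflow code: the sample rows and the `to_silver` and `to_gold` helpers are hypothetical, and real pipelines would read and write Delta tables instead of lists.

```python
# Bronze: raw records exactly as the source delivered them (hypothetical data).
bronze = [
    {"order_id": "1", "customer": " Acme ", "amount": "100.50"},
    {"order_id": "1", "customer": " Acme ", "amount": "100.50"},  # duplicate delivery
    {"order_id": "2", "customer": "acme", "amount": "49.50"},
    {"order_id": "3", "customer": "Globex", "amount": "not-a-number"},  # bad record
]


def to_silver(rows):
    """Clean, type, and deduplicate raw rows; drop rows whose amount fails to parse."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "customer": row["customer"].strip().lower(), "amount": amount}
        )
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into revenue per customer."""
    revenue = {}
    for row in rows:
        revenue[row["customer"]] = revenue.get(row["customer"], 0.0) + row["amount"]
    return revenue


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze keeps everything, including the duplicate and the unparseable row, so nothing from the source is lost. Silver is where those problems are resolved, and Gold is shaped for one question: revenue per customer.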

**Serving Layer.**
- **Unity Catalog** -> governs access, primary namespace for all consumers
- **Databricks SQL** -> query endpoint for Power BI via Import or Live (DirectQuery)
- **Direct Lake** -> reads Delta files straight from OneLake when data is mirrored

**Analytics and Reporting.**
- **MLflow** -> ML lifecycle
- **Notebooks** -> experimentation
- **Genie** -> natural language queries
- **Mosaic AI** -> agent framework
- **Power BI** -> via Import or Live
- **AI/BI Dashboards** -> native simple reporting

**Unity Catalog spans the entire architecture:**
- Discovery
- **Lineage tracking** (table-to-table AND column-to-column)
- Access control
- Audit logging
- Data quality monitoring
- Compliance

Three-level namespace: **catalog.schema.table**.
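Table-to-table lineage is, at heart, a graph you can walk backwards. The sketch below is plain Python with an invented `upstream` mapping and an `all_upstream` helper. It does not call Unity Catalog, and the table names are hypothetical. It answers the question lineage exists for: which sources feed this report table?

```python
# Table-level lineage as "table -> tables it reads from" (hypothetical names).
upstream = {
    "main.gold.daily_revenue": ["main.silver.orders"],
    "main.silver.orders": ["main.bronze.orders_raw", "main.bronze.customers_raw"],
    "main.bronze.orders_raw": [],
    "main.bronze.customers_raw": [],
}


def all_upstream(table, graph):
    """Return every table that feeds `table`, directly or indirectly."""
    found, stack = set(), list(graph.get(table, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph.get(current, []))
    return found


sources = all_upstream("main.gold.daily_revenue", upstream)
```

Column-to-column lineage follows the same pattern with columns as the nodes. The value of a governed catalog is that this graph is captured automatically instead of being maintained by hand as it is here.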

Up Next

The Databricks Fundamentals series

Part 1

Platform fundamentals, the Lakehouse, service areas, and Microsoft integration.

You Are Here

Part 2

Unity Catalog & Governance

The three-level namespace, access control, data lineage, and the metastore model, on a worked example.

Coming Soon

Part 3

Databricks + Microsoft Fabric

Deep integration: OneLake shortcuts, DirectQuery, catalog mirroring, and reference architectures.

Coming Soon

Part 4

Databricks for the Power BI Pro

Concept mapping from Power BI to Databricks. Connecting, querying, and building dashboards.

Coming Soon

Four-part series overview. This guide is Part 1.
