~12 min read

What is Databricks?

The Unified Data Platform, Explained

A practical introduction for data professionals coming from the Microsoft ecosystem

Part 1 of 5: Databricks Fundamentals

The Data Problem

Why organizations ended up with two systems for one job

Data Warehouses

Strengths

  • Structured, schema-enforced queries
  • Optimized for BI and reporting
  • Strong SQL support and governance

Limitations

  • Rigid schemas; slow to adapt
  • Expensive to scale storage
  • Poor support for unstructured data

Data Lakes

Strengths

  • Stores any data type at any scale
  • Low-cost object storage
  • Flexible for ML and data science

Limitations

  • No schema enforcement ("data swamp" risk)
  • No ACID transactions
  • Difficult to query reliably for BI
The two-system problem: Organizations ended up running both a data warehouse (for BI teams) and a data lake (for data scientists and engineers). Same data, stored twice, governed differently, with constant sync headaches between the two. This is the problem the Lakehouse was designed to solve.

Data Silos

The same data copied across warehouse, lake, and staging areas. Every copy drifts over time.

Governance Gaps

Different access controls in the warehouse vs. the lake. Who has the "right" version?

Cost Explosion

Paying for storage twice, compute twice, and ETL pipelines to keep everything in sync.

**Data Warehouses.** Structured and reliable, but rigid and expensive. **Microsoft equivalent**: Synapse Dedicated SQL Pool, or classic Azure SQL DW.
**Data Lakes.** Flexible and cheap, but no governance or ACID transactions. **Microsoft equivalent**: ADLS Gen2 with raw Parquet/JSON/CSV files sitting there with no query engine on top.
The two-system pattern is **extremely common**. Most Microsoft shops run:
- Synapse + ADLS, or
- Fabric Lakehouse + Warehouse

The three problems below are universal complaints from those shops.
**Data Silos.** Same data copied across warehouse, lake, and staging. Every copy drifts over time. Industry slang for the worst case: **"data swamp"** (an ungoverned lake that became unusable).
**Governance Gaps.** Different worlds, different controls:
- Different access controls in warehouse vs lake
- Different audit trails
- Different lineage

No single source of truth on who can see what.
**Cost Explosion.** You're paying three times:
- Storage twice (once per system)
- Compute twice
- ETL pipelines to keep both sides in sync

The Lakehouse

Lake flexibility with warehouse reliability

Data Lake

  • Any File Format
  • Schema on Read
  • Cost Effective
  • Weak Governance

Data Warehouse

  • Structured Tables
  • Schema on Write
  • Fast Queries
  • Rigid & Costly

Data Lakehouse

  • Any Format
  • ACID
  • Fast Queries
  • Low Cost

| | Traditional Lake (Pre-Delta) | Delta Lake (ACID Compliant) |
| --- | --- | --- |
| Partial Writes | Possible | Prevented |
| Read Consistency | Unstable | Guaranteed |
| Rollback | None | Built-in |
| Versioning | Manual | Built-in |

  • A: Atomicity. All or nothing: transactions complete fully or not at all
  • C: Consistency. Data always moves from one valid state to another
  • I: Isolation. Concurrent operations don't interfere with each other
  • D: Durability. Once committed, data persists even through failures
Go deeper: Databricks popularized the Lakehouse pattern, built on top of the Delta Lake format (open-source, ACID-compliant, stored as columnar Parquet files). For a look at how data flows through Bronze, Silver, and Gold layers in a Lakehouse, see our Medallion Architecture guide.
**Data Lakes.** Flexible and cheap, but no governance. One half of the old paradigm.
**Data Warehouses.** Structured and reliable, but rigid and expensive. The other half.
**The Lakehouse combines both.** Any format, ACID transactions, fast queries, low cost.
**Did Databricks invent it?** Yes, with a small asterisk:
- Popularized the term in a 2020 blog post and the CIDR 2021 paper
- Built **Delta Lake** (open-sourced 2019), the format that makes it possible

Fabric uses Delta now too, so it isn't proprietary, but it started here.
**Traditional lakes had real problems.** No transactions:
- Jobs could die halfway through and leave half-written data
- Two readers could see different versions of the same table
- No rollback, no version history, no schema enforcement

**Delta solves all of that** with a transaction log on top of Parquet files:
- Atomic writes
- Consistent reads
- Rollback to any prior state
- Versioning built in

That **ACID guarantee** is what makes the lake safe for business use. *Skip the partial-writes detail on camera unless someone asks; this summary covers it.*
**Atomicity.** Writes complete fully or not at all. No half-written tables.
**Consistency.** Data always moves from one valid state to another. Schema enforced.
**Isolation.** Concurrent reads and writes don't interfere with each other.
**Durability.** Once committed, data survives failures. **Time travel** lets you query earlier versions of a table, as long as their history is still retained.
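To make the transaction-log idea concrete, here is a minimal sketch in plain Python. It is a toy, not the Delta Lake API or its on-disk protocol: the `ToyTable` class, the `_log` folder, and the JSON file names are all made up for illustration. The point it shows is that data files become visible only once a log entry references them, so a job that dies mid-write leaves readers untouched, and reading "as of" an older log entry gives a simple form of time travel.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class ToyTable:
    """Toy table: data files plus an ordered commit log (illustration only)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.log_dir = root / "_log"  # hypothetical log folder name
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self) -> int:
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, rows: list, fail_before_commit: bool = False) -> int:
        version = self.latest_version() + 1
        data_file = self.root / f"part-{version}.json"
        # Step 1: write the data file. Readers ignore it until the log references it.
        data_file.write_text(json.dumps(rows))
        if fail_before_commit:
            raise RuntimeError("job died before commit")
        # Step 2: publish by adding one log entry. This is the commit point.
        log_entry = self.log_dir / f"{version:08d}.json"
        log_entry.write_text(json.dumps({"add": data_file.name}))
        return version

    def read(self, version: Optional[int] = None) -> list:
        """Return rows as of `version` (default: the latest committed version)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in range(version + 1):
            entry = json.loads((self.log_dir / f"{v:08d}.json").read_text())
            rows.extend(json.loads((self.root / entry["add"]).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(Path(tmp))
    table.commit([{"id": 1}])
    try:
        table.commit([{"id": 2}], fail_before_commit=True)  # simulated crash
    except RuntimeError:
        pass
    table.commit([{"id": 3}])
    current_rows = table.read()             # the crashed write never appears
    first_version_rows = table.read(version=0)  # read the table as of version 0
```

The real format tracks much more (schema, removed files, concurrent writers), but the commit-point idea is the part worth remembering.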
Links to deep-dive guides on **Delta** and **Medallion Architecture**. **Parquet** is the columnar file format underneath. Every major platform can read and write it: Fabric, Databricks, Snowflake, BigQuery.

What Databricks Actually Is

A platform layer, not a cloud replacement

  • Workspace: Notebooks, SQL Editor, Jobs, Repos, Dashboards
  • Compute: All-Purpose Clusters, Job Clusters, SQL Warehouses
  • Storage & Governance: Delta Lake, Unity Catalog, Volumes
  • Your Cloud Provider: Azure, AWS, GCP

Databricks orchestrates the workspace, compute, and governance layers. Your storage stays in your cloud account, while compute is decoupled so you can scale it independently.

Key insight: Databricks is a platform layer that runs on your cloud provider. It does not replace Azure, AWS, or GCP. The storage layer stays in your cloud account. The compute layer is decoupled, so you choose whether it runs on your own account (Classic) or on Databricks-managed infrastructure (Serverless).

What This Means in Practice

Your Data, Your Storage

Data lives in your own ADLS Gen2 (Azure), S3 (AWS), or GCS (Google) storage. Databricks reads and writes to it, but never takes ownership of it.

Compute Scales Independently

Spin up clusters when you need processing power, shut them down when you don't. Storage and compute are fully decoupled.

Two Scopes, One Product

Storage is yours, sitting in your cloud account. Databricks runs the workspace, compute, and governance on top.

Azure Databricks
Azure Databricks is the Azure-native deployment of the Databricks platform. Same core product, integrated with Microsoft Entra ID (formerly Azure Active Directory), deployed into your Azure subscription, and billed through Azure. If you already work in the Microsoft ecosystem, this is typically where you start.
**Workspace layer.** Fabric equivalents:
- **Notebooks** -> Fabric Notebooks
- **SQL Editor** -> SQL analytics endpoint of a Fabric Lakehouse
- **Lakeflow Jobs** (orchestration) -> Fabric Data Pipelines
- **Repos** (Git integration) -> Fabric Git integration
- **Dashboards** -> native simple visualizations

Dashboards are **NOT** a Power BI competitor. Most shops still use Power BI on top.
**Compute layer.** Three types, and this is where the confusion usually starts:
- **All-Purpose Clusters** (Spark): general compute for notebooks, billed per second
- **Job Clusters** (Spark): spun up for a job, torn down right after, cheaper
- **SQL Warehouses** (SQL-only): what Power BI talks to

SQL Warehouses come in three flavors, all running the Photon engine:
- **Classic**: VMs in your cloud account, entry-level feature set
- **Pro**: also in your cloud account, adds the more advanced Databricks SQL features
- **Serverless**: compute in Databricks' account, starts in seconds rather than minutes
**Storage and governance layer.**
- **Delta Lake**: open-source storage format Databricks invented (Parquet plus a transaction log)
- **Unity Catalog**: governance layer. Three-level namespace (catalog.schema.table), row/column security, full lineage
- **Volumes**: where non-tabular data lives. Files, images, ML models

Fabric equivalent for Volumes: the "Files" section of a Fabric Lakehouse.
**Cloud provider layer.** Azure, AWS, GCP. Databricks runs on top of all three. Same platform, different infrastructure underneath.
**Key architectural point.** Databricks is a platform layer that runs *on* your cloud, not a replacement for it.
- Storage layer stays in your cloud account
- Compute layer is decoupled. **Classic** runs on VMs in your account; **Serverless** runs on Databricks-managed infrastructure

You choose.
Three practical implications coming up. Each one builds on the storage/compute decoupling we just covered.
**Your Data, Your Storage.** Data lives in *your* storage:
- **Azure**: ADLS Gen2
- **AWS**: S3
- **Google**: GCS

Databricks reads and writes to it but never takes ownership. This is the **storage** half of the decoupling.
**Compute Scales Independently.** Spin up clusters when you need processing power, shut them down when you don't. This is the **compute** half. It's why you can run a heavy training job for an hour and pay nothing for compute the next day.
**Two Scopes, One Product.** Clear ownership boundary, single product experience.
- **Your scope**: storage, sitting in your cloud account
- **Databricks scope**: workspace, compute, governance running on top

Same idea as the storage/compute split, just framed as ownership.
**Azure Databricks** is a native Azure resource.
- Deployed via Marketplace
- Billed through your Azure subscription
- Auth through **Microsoft Entra ID** (was Azure AD)
- Plays natively with **Key Vault**, **Event Hubs**, **ADLS**, **Power BI**, **Purview**

Four Service Areas

What you can actually do on the Databricks platform

Databricks organizes its capabilities into four pillars. Most organizations start with one or two, then expand as their data maturity grows.

| Pillar | Languages | Key Tool | Also | Output |
| --- | --- | --- | --- | --- |
| Data Engineering | Python, SQL, Scala | Lakeflow Spark Declarative Pipelines | Lakeflow Jobs, Auto Loader | Governed Delta tables |
| SQL Analytics | SQL | SQL Warehouses | Editor, Dashboards | BI query endpoints |
| Machine Learning | Python, R, Scala | MLflow | Feature Store, AutoML | Models + endpoints |
| Streaming | Python, SQL, Scala | Structured Streaming | Lakeflow streaming mode | Live Delta tables |

**Four pillars.** Most organizations start with one or two and expand over time.
**Data Engineering.** The pipeline-building side. **Flagship**: Lakeflow Spark Declarative Pipelines (was **Delta Live Tables** before the 2025 rebrand). You describe the target tables; Databricks handles orchestration and error handling and enforces the data quality rules you declare. Python or SQL.
**Key terms**:
- **Lakeflow** -> umbrella name for Databricks' ETL stack
- **Lakeflow Jobs** -> scheduling and orchestration
- **Spark** -> the underlying open-source distributed processing engine
- **Auto Loader** -> watches a folder in cloud storage and auto-ingests new files
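The "watch a folder and ingest only what is new" behaviour described for Auto Loader can be pictured with a few lines of plain Python. This is a conceptual sketch only, not the Auto Loader API: the function name `ingest_new_files`, the `*.csv` filter, and the in-memory `already_seen` set are assumptions for illustration (a production tool would keep that state somewhere durable).

```python
import tempfile
from pathlib import Path


def ingest_new_files(landing_dir: Path, already_seen: set) -> list:
    """Return the names of files not processed before, and remember them."""
    new_files = sorted(
        p.name for p in landing_dir.glob("*.csv") if p.name not in already_seen
    )
    already_seen.update(new_files)  # so the next run skips these files
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    seen = set()
    (landing / "orders_001.csv").write_text("id,amount\n1,10\n")
    first_run = ingest_new_files(landing, seen)   # picks up orders_001.csv
    (landing / "orders_002.csv").write_text("id,amount\n2,20\n")
    second_run = ingest_new_files(landing, seen)  # picks up only orders_002.csv
```

Each run processes only the files that arrived since the previous one, which is what keeps incremental ingestion cheap as the landing folder grows.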
**SQL Analytics.** The BI-friendly side.
- **SQL Warehouses** -> the compute
- **SQL Editor** -> where you write queries
- **Dashboards** -> native simple visualizations

Entry point for anyone who'd rather not write Python. Most Power BI folks land here first. Serverless SQL Warehouses connect to Power BI.
**Machine Learning.** **MLflow** handles the lifecycle:
- Tracks every experiment
- Registers model versions
- Packages them for deployment

Databricks invented MLflow, open-sourced it. *Fabric uses it now too.*
**Adjacent tools**:
- **Feature Store** -> centralized engineered features for reuse across models
- **AutoML** -> tries dozens of model types automatically, picks the best
- **Model Serving** -> deploys a trained model as a REST endpoint
**Streaming.** The same Spark engine that runs batch can run streaming.
- **Structured Streaming** -> Spark's stream processing API. Code looks like batch; Spark handles the streaming complexity
- **Lakeflow streaming mode** -> Lakeflow pipelines can run continuously

You don't maintain two stacks for batch and real-time.
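As an analogy for "code looks like batch", the plain-Python sketch below writes one transformation and applies it both to a full dataset and to small chunks arriving over time. It does not use Spark or the Structured Streaming API; `to_celsius`, `micro_batches`, and the sensor readings are hypothetical names and data chosen for illustration.

```python
def to_celsius(records):
    """Transformation logic, written once and shared by both modes."""
    return [
        {"sensor": r["sensor"], "celsius": (r["fahrenheit"] - 32) * 5 / 9}
        for r in records
    ]


def micro_batches(records, size):
    """Yield the records in small chunks, as if they arrived over time."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


readings = [  # hypothetical sensor data
    {"sensor": "a", "fahrenheit": 212},
    {"sensor": "b", "fahrenheit": 32},
    {"sensor": "a", "fahrenheit": 50},
]

# Batch mode: process everything at once.
batch_result = to_celsius(readings)

# Streaming mode: process each chunk as it arrives, same function.
stream_result = []
for chunk in micro_batches(readings, size=2):
    stream_result.extend(to_celsius(chunk))
```

Both paths produce the same rows; only the way data is fed in differs, which is the property that lets one stack serve batch and real-time work.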

How It Works With Cloud Providers

Databricks runs on your cloud, not instead of it

Databricks on Azure, AWS, and GCP
Both/and, not either/or. Your source systems can stay where they are with Databricks querying them live, or you can bring data in and serve it as Delta tables from your lakehouse. Most architectures use a mix. The platform doesn't force one mode over the other.

| Area | Databricks Provides | Your Cloud Provides |
| --- | --- | --- |
| Development | Workspace + notebooks | Object storage |
| Compute | Optimized Spark + Photon | Virtual machines |
| Governance | Unity Catalog | Identity provider |
| Pipelines | Lakeflow Spark Declarative Pipelines | Encryption + keys |
| Cost | Platform fee (DBUs) | Infrastructure billing |

**Same Databricks on any cloud.** The accurate version of *"your data never leaves your cloud"*:
- **Classic compute**: data lives in your storage, processes on VMs in your account
- **Serverless compute**: data still lives in your storage, but processes briefly on Databricks-managed VMs (in-region)

Storage layer stays in your cloud account either way. Compute is decoupled.
**Both/and, not either/or.** This is the reframe of the whole section. Two valid patterns, and most real architectures use both:
- Source systems stay where they are; Databricks queries them live, OR
- You bring data in and serve it as Delta tables from your lakehouse

The platform doesn't force one mode over the other. This replaces the older "data never leaves your cloud" oversimplification.
**Clear division of labor.**
- **Cloud** -> physical infrastructure (storage, VMs, networking, identity)
- **Databricks** -> platform on top (Spark, SQL engines, Unity Catalog, Lakeflow)

You're not choosing between Databricks and your cloud. You're using them together.
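The Cost row in the table above (platform fee in DBUs plus infrastructure billing) can be worked through as simple arithmetic for a classic cluster. The sketch below is an assumption-laden illustration: the function name and every rate in the example call are invented round numbers, not real prices, and actual DBU consumption depends on instance type and workload.

```python
def classic_cluster_cost(hours, nodes, dbu_per_node_hour, price_per_dbu, vm_price_per_hour):
    """Split a classic cluster's bill into its two parts.

    Returns (platform_fee, infrastructure): the DBU charge from Databricks
    and the virtual machine charge from the cloud provider.
    """
    platform_fee = hours * nodes * dbu_per_node_hour * price_per_dbu
    infrastructure = hours * nodes * vm_price_per_hour
    return platform_fee, infrastructure


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
platform_fee, infrastructure = classic_cluster_cost(
    hours=3, nodes=4, dbu_per_node_hour=2.0, price_per_dbu=0.5, vm_price_per_hour=0.25
)
total = platform_fee + infrastructure
```

With these made-up inputs the cluster runs for 12 node-hours, so the two components scale together with runtime and node count, and both drop to zero when the cluster is shut down.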

Better Together

Both/And, not Either/Or

Databricks and Microsoft Fabric are complementary platforms. Many organizations use both: Databricks for heavy engineering and ML, Fabric and Power BI for analytics and reporting. The real question is not "which one?" but "where does each fit?"

Where Databricks Excels

Large-scale ETL and data engineering
ML model training and serving
Multi-cloud data governance
Real-time streaming pipelines
Open-source toolchain (Spark, MLflow, Delta)

Where Microsoft Fabric Excels

Power BI semantic models and reporting
OneLake unified storage with shortcuts
T-SQL analytics (familiar for SQL Server teams)
Low-code data prep (Dataflow Gen2)
Microsoft 365 and Teams integration

Databricks: Processing + ML
Integration paths: OneLake Shortcuts, DirectQuery to SQL Warehouse, Unity Catalog Mirroring
Fabric: OneLake + Analytics
Power BI: Reporting + Dashboards

Explore the Microsoft side: Our Direct Lake guide covers how Fabric connects to data (Import, DirectQuery, and Direct Lake modes). The Fabric Engines & Items guide explains OneLake shortcuts and how Fabric's own compute engines compare. In practice, many organizations use Databricks for heavy ETL and ML workloads, then surface results through Fabric and Power BI for business consumption.
**Does Databricks compete with Fabric?** Short answer: **no.** They overlap in places, but each is stronger at different jobs. Most mature organizations use both. *This isn't a fight. It's two platforms that happen to land on the same Parquet file.*
**Databricks is stronger at:**
- Heavy data engineering
- Machine learning at scale
- Real-time streaming
- Multi-cloud
- Open-source orientation

**Fabric is stronger at:**
- Power BI semantic models
- OneLake
- T-SQL analytics
- Low-code data prep (Dataflow Gen2)
- Microsoft 365 integration

**Concrete integration paths:**
- **OneLake Shortcuts** -> pointers in OneLake to data living elsewhere (Databricks tables, S3) without copying
- **DirectQuery to Databricks SQL** -> Power BI's live mode hitting SQL Warehouses
- **Direct Lake** -> Power BI storage mode that reads Parquet/Delta straight from OneLake. No copy, no DirectQuery overhead
- **Mirroring** -> Fabric mirrors Unity Catalog tables into OneLake; the mirror updates as data changes in Databricks
Links to our Fabric guides for the audience that wants to go deeper on the Microsoft side.

Who Uses Databricks

Four personas, one platform

Different roles interact with different parts of the platform. Here are the four primary personas you will encounter in a Databricks environment.

Data Engineer

Builds and orchestrates pipelines using notebooks, Lakeflow Spark Declarative Pipelines, and Lakeflow Jobs. Python and SQL are the primary languages.

SQL Analyst

Queries data through SQL Warehouses and the SQL Editor. Builds dashboards and connects BI tools. No Python required.

Data Scientist

Trains and deploys ML models using MLflow, Feature Store, and notebook experiments. Leverages GPU clusters for deep learning.

Platform Admin

Manages Unity Catalog, workspace access, cluster policies, and cost controls. The person who keeps the platform secure and efficient.

You do not need to be a Python developer

This is one of the most common misconceptions about Databricks. SQL-first users can work entirely within SQL Warehouses and the SQL Editor. If you are comfortable writing T-SQL in SQL Server or Fabric, you already have the foundation to query data in Databricks.

**Four primary personas.** Most people fall into one of these.
**Data Engineer.**
- Builds the pipelines
- Designs the medallion architecture
- Heavy notebook user
- Python and SQL

Lives in **Lakeflow**.
**Analytics Engineer / SQL Analyst.**
- Connects to SQL Warehouses
- Writes transformations
- Builds the **Gold-layer** tables that Power BI reports run against

Pure SQL gets you most of the way. You don't need to be a Python developer for this role.
**Data Scientist.**
- Trains models
- Lives in **MLflow** and notebooks
- Python and R
- Deploys through **Model Serving** when something's ready for production

**Platform Admin.**
- Owns **Unity Catalog**, cluster policies, networking, cost
- The SRE of your data platform
- Fewer per org, but critical
**Key takeaway** for the Power BI / Fabric audience: you do **NOT** need to know Python to use Databricks. The SQL Analyst path covers most of it.

Example Data Flow

A common Databricks architecture from source systems to business consumption

Source

Data source systems

ERP
CRM
Files

Data Engineering

Ingest, transform, and store with Databricks

  • Ingest Layer: Data Factory (Orchestration), Auto Loader (Streaming Ingest)
  • Transform Layer: Lakeflow (Declarative Pipelines), Notebooks (Custom Transforms)
  • Store Layer: Bronze (Delta Lake), Silver (Delta Lake), Gold (Delta Lake)

Serving Layer

Query the data

  • Unity Catalog
  • Databricks SQL

Analytics and Reporting

Business and data science consumption

  • Data Science: Notebook
  • AI: Genie (NL Queries), Mosaic AI (Agents)
  • Reporting: Power BI (Import, Live)
  • Dashboards: AI/BI Dashboards

Unity Catalog: Governance and Lifecycle

Unified data governance across all layers

Discover
Lineage
Access Control
Audit
Quality
Compliance
**Sources.** ERP, CRM, flat files feed the ingestion layer. Same as any data platform.
**Data Engineering with Medallion Architecture.** Industry standard now, not Databricks-specific. *Fabric uses it too.* **Metallurgy metaphor**: increasing purity through each layer.
- **Bronze** -> raw ingested data, schema-on-read, as the source gave it
- **Silver** -> cleaned, deduplicated, conformed, joined with reference data
- **Gold** -> business-ready, aggregated, shaped for a specific consumer (often a single Power BI semantic model)

All stored as Delta tables.
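The Bronze, Silver, and Gold layers can be illustrated with a small plain-Python sketch. In a real lakehouse each layer would be a Delta table produced by Spark or SQL; here lists and dicts stand in for tables, and the order data, `to_silver`, and `to_gold` are hypothetical names invented for the example.

```python
# Bronze: hypothetical rows kept exactly as the source delivered them.
bronze_orders = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},       # duplicate
    {"order_id": "2", "customer": "Bob", "amount": "4.00"},
    {"order_id": "3", "customer": "alice", "amount": "not-a-number"},  # bad record
]


def to_silver(bronze):
    """Clean, type, and deduplicate the raw rows."""
    seen_ids = set()
    silver = []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        if row["order_id"] in seen_ids:
            continue  # drop duplicates on order_id
        seen_ids.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return silver


def to_gold(silver):
    """Aggregate to a business-ready shape: total amount per customer."""
    totals = {}
    for row in silver:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


silver_orders = to_silver(bronze_orders)
gold_totals = to_gold(silver_orders)
```

Bronze keeps the duplicate and the malformed row on purpose, so that cleaning rules in Silver can be changed and replayed later without going back to the source system.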
**Serving Layer.**
- **Unity Catalog** -> governs access, primary namespace for all consumers
- **Databricks SQL** -> query endpoint for Power BI via Import or Live (DirectQuery)
- **Direct Lake** -> reads Delta files straight from OneLake when data is mirrored

**Analytics and Reporting.**
- **MLflow** -> ML lifecycle
- **Notebooks** -> experimentation
- **Genie** -> natural language queries
- **Mosaic AI** -> agent framework
- **Power BI** -> via Import or Live
- **AI/BI Dashboards** -> native simple reporting
**Unity Catalog spans the entire architecture:**
- Discovery
- **Lineage tracking** (table-to-table AND column-to-column)
- Access control
- Audit logging
- Data quality monitoring
- Compliance

Three-level namespace: **catalog.schema.table**.
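For readers used to two-part `schema.table` names, the three-level namespace can be made concrete with a small plain-Python helper. This is an illustration, not part of any Databricks SDK: `TableName`, `parse_table_name`, and the example name `main.sales.orders` are hypothetical, and the sketch ignores quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


orders = parse_table_name("main.sales.orders")  # hypothetical table name
```

The extra leading level is what lets one governance layer hold several environments or business domains side by side, each as its own catalog.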

Ready to explore how Databricks fits into your data platform?

Whether you are evaluating Databricks alongside Microsoft Fabric or planning a hybrid architecture, we can help you design the right approach for your organization.
