Analytic Endeavors Design - Copyright 2024-2025 Analytic Endeavors Inc. Unauthorized use prohibited.
~14 min read
Fabric Engines & Items
Understanding the Building Blocks of Microsoft Fabric
From compute engines to data movers and storage -- the pieces that power your analytics
The Building Blocks of Fabric
Compute Engines
Four engines that power Microsoft Fabric
Fabric provides different compute engines for different workloads. Each engine is optimized for a specific type of processing, but they all read from the same unified storage layer -- OneLake.
Spark handles large-scale batch and streaming.
T-SQL covers relational analytics.
KQL targets time-series and log data.
Analysis Services powers semantic modeling for Power BI.
All engines query the same data in OneLake -- no copying needed.
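To make the shared-storage point concrete, here is a minimal sketch of reading one OneLake Delta table from a Fabric notebook. The workspace, lakehouse, and table names are hypothetical placeholders, and it assumes the notebook's built-in `spark` session; the T-SQL, KQL, and Analysis Services engines read the same underlying files.

```python
# Minimal sketch: one Delta table in OneLake, read from a Fabric notebook.
# Workspace, lakehouse, and table names are hypothetical placeholders.
onelake_path = (
    "abfss://SalesWorkspace@onelake.dfs.fabric.microsoft.com/"
    "SalesLakehouse.Lakehouse/Tables/orders"
)

# Spark reads the Delta files directly from OneLake storage.
orders = spark.read.format("delta").load(onelake_path)
orders.show(5)

# The T-SQL endpoint, KQL, and Analysis Services (via Direct Lake) read the
# same files -- no engine gets its own copy.
```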
Engine Profiles
Each engine is optimized for different workloads -- and every Fabric item runs on one of these under the hood.
Spark
Languages: PySpark, Scala, R
Workflow: Notebook-first
Best For: Big data + ML
Powers: Lakehouse, Notebook
T-SQL
Languages: T-SQL
Workflow: Schema-first
Best For: BI + Analytics
Powers: Warehouse, SQL Endpoint
KQL
Languages: KQL
Workflow: Real-time
Best For: Streaming + Logs
Powers: Eventhouse
Analysis Services
Languages: DAX
Workflow: In-memory
Best For: Semantic layer
Powers: Semantic Model
Data Movers
Three tools for moving and transforming data
Dataflow Gen2
Low-Code
Visual ETL using Power Query (M language). Drag-and-drop transforms -- no code required.
Interface: Drag & Drop
Transforms: Merge, Filter, Pivot
Best For: Simple ETL
Pipeline
Orchestration
Workflow engine for multi-step data movement. Schedules, coordinates, and monitors jobs.
Interface: Visual Canvas
Actions: Copy, Branch, Loop
Best For: Multi-Step Workflows
Notebook
Code-Driven
Code-first data engineering with full programmatic control. Supports Python, PySpark, Scala, and SparkSQL in interactive notebooks.
Interface: Code Cells
Languages: Python, Spark, SQL
Best For: Complex Transforms
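To show what code-first data engineering looks like in practice, here is a minimal PySpark sketch of the kind of multi-step transform the Notebook card describes. Table and column names are hypothetical, and it assumes a Fabric notebook with a default Lakehouse attached.

```python
# Minimal sketch of a code-first transform in a Fabric notebook.
# Table and column names are hypothetical; assumes a default Lakehouse.
from pyspark.sql import functions as F

orders = spark.read.table("orders")        # Delta tables in the Lakehouse
customers = spark.read.table("customers")

# Join, filter, and aggregate -- multi-step logic that is easier to express
# in code than in a visual designer.
revenue_by_region = (
    orders.join(customers, "customer_id")
          .filter(F.col("order_status") == "Completed")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)

# Land the result as a Delta table so any engine can query it.
revenue_by_region.write.format("delta").mode("overwrite").saveAsTable("revenue_by_region")
```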
Each tool handles a different slice of the data movement problem.
Dataflow for visual transforms. Pipeline for orchestration. Notebook for code.
They're designed to work together -- a Pipeline can trigger a Notebook or Dataflow as a step in a larger workflow.
There's also Mirroring for continuous, real-time replication from external databases.
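Pipelines are configured on a visual canvas rather than in code, but the same composition idea is visible from a notebook: one step can trigger another and pass parameters along. Below is a minimal sketch using `mssparkutils`, which Fabric notebooks preload; the child notebook name and parameters are hypothetical.

```python
# Minimal sketch: one step triggering another from code.
# In production a Pipeline would schedule and monitor this; mssparkutils
# (preloaded in Fabric notebooks) shows the same chaining idea.
# The child notebook name and parameters are hypothetical.
result = mssparkutils.notebook.run(
    "Load_Sales_Data",           # child notebook to execute
    600,                         # timeout in seconds
    {"run_date": "2025-01-31"},  # parameters passed to the child notebook
)

# Whatever the child returns via mssparkutils.notebook.exit(...) comes back here.
print(result)
```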
How They Overlap
Shared capabilities across data movers
UI-Driven: Dataflow + Pipeline
Data Shaping: Dataflow + Notebook
Code-Driven: Pipeline + Notebook
Copies Data: All Three
Orchestration: Pipeline Only
| Capability | Dataflow | Pipeline | Notebook |
| --- | --- | --- | --- |
| UI-Driven | ✓ | ✓ | |
| Data Shaping | ✓ | | ✓ |
| Code-Driven | | ✓ | ✓ |
| Copies Data | ✓ | ✓ | ✓ |
| Orchestration | | ✓ | |
The overlaps are intentional -- Microsoft designed these tools to share capabilities so teams can pick the mover that fits their skill set without losing functionality. Pipeline stands alone with orchestration because that's its primary job: coordinating the other two.
Shortcuts
Reference external data without copying it
ADLS Gen2 (Azure Data Lake Storage)
AWS S3 (Amazon S3 Buckets)
Other Workspaces (Fabric Items + More)
Shortcut (Pointer)
Data Movement: None
Freshness: Always Current
Storage Cost: None
Access: Read-Only

vs

Copy (Full Control)
Data Movement: Full Transfer
Freshness: As of Last Run
Storage Cost: Duplicated
Access: Read + Write
Shortcuts are ideal when data already lives in a well-managed source. Copy when you need to transform, enrich, or own the data lifecycle.
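As an illustration, here is a minimal sketch of reading through shortcuts from a Fabric notebook. The shortcut names and paths are hypothetical, and it assumes a default Lakehouse is attached so shortcuts surface under Tables/ and Files/.

```python
# Minimal sketch: reading through shortcuts from a Fabric notebook.
# Shortcut names and paths are hypothetical; assumes a default Lakehouse.

# A table shortcut surfaces like any other Lakehouse table.
external_sales = spark.read.table("sales_from_adls")          # points at ADLS Gen2

# A file shortcut surfaces as a folder under Files/.
s3_events = spark.read.parquet("Files/s3_landing/events/")    # points at an S3 bucket

# Nothing was copied into OneLake -- both reads resolve against the source,
# so the data is always current and there is no duplicate storage cost.
external_sales.join(s3_events, "customer_id").show(5)
```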
ACID
Why traditional data lakes break -- and how Delta fixes it
Traditional Lake (Pre-Delta)
Partial Writes: Possible
Read Consistency: Unstable
Rollback: None
Versioning: Manual

vs

Delta Lake (ACID Compliant)
Partial Writes: Prevented
Read Consistency: Guaranteed
Rollback: Built-in
Versioning: Built-in
Delta Lake means you no longer have to choose between fresh data and trustworthy data.
Atomicity (A): All or nothing -- transactions complete fully or not at all
Consistency (C): Data always moves from one valid state to another
Isolation (I): Concurrent operations don't interfere with each other
Durability (D): Once committed, data persists even through failures
Delta Lake brings warehouse-grade reliability to the Lakehouse.
No more choosing between fresh data and trustworthy data -- ACID compliance means you get both.
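A minimal sketch of what that looks like from a notebook: every Delta write is an atomic, versioned commit recorded in the transaction log. The table name is hypothetical, and it assumes a Fabric notebook with a Lakehouse attached.

```python
# Minimal sketch of Delta's transaction log in action.
# Table name is hypothetical; assumes a Lakehouse-attached Fabric notebook.
from pyspark.sql import Row

# Each write becomes one atomic, versioned commit -- all or nothing.
new_rows = spark.createDataFrame(
    [Row(event_id=1, status="ok"), Row(event_id=2, status="ok")]
)
new_rows.write.format("delta").mode("append").saveAsTable("events")

# The transaction log records every commit: version, timestamp, operation.
spark.sql("DESCRIBE HISTORY events").show(truncate=False)

# Built-in versioning: query the table as it looked at an earlier version.
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
```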
Copy Data Activity
Where your data lands depends on the tool
Dataflow → Transform → Tables (Delta)
Pipeline → Files or Tables (Delta)
Notebook → Transform → Files or Tables (Delta)
Dataflows load transformed data into Delta tables. Pipelines copy raw data as files or tables. Notebooks can transform and write to either destination.
Delta Tables
ACID: Yes
Queryable: SQL + Spark
Best For: Analytics

Files
ACID: No
Queryable: Limited
Best For: Staging
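A minimal sketch of the two landing zones from a notebook: keep the raw extract in Files for staging, then promote the cleaned result to a Delta table for analytics. Paths, table names, and columns are hypothetical, and it assumes a default Lakehouse is attached.

```python
# Minimal sketch: staging in Files, promoting to a Delta table.
# Paths, table names, and columns are hypothetical; assumes a default Lakehouse.
raw = spark.read.option("header", True).csv("Files/landing/orders_2025-01.csv")

# Staging zone: park the raw extract as files (no ACID, limited querying).
raw.write.mode("overwrite").parquet("Files/staging/orders/")

# Analytics zone: promote the cleaned data to a Delta table (ACID, SQL + Spark).
cleaned = raw.dropDuplicates(["order_id"]).na.drop(subset=["order_id"])
cleaned.write.format("delta").mode("overwrite").saveAsTable("orders")
```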
Lakehouse
The best of both worlds
How the Lakehouse Works
1. Delta Lake Storage: Your tables are files -- open-format Parquet in OneLake, readable by any engine.
2. Auto SQL Endpoint: Run T-SQL queries against Delta tables -- auto-generated for every Lakehouse.
3. Dual Engines: Spark + T-SQL both reading the same underlying data. Pick your tool.
4. Zero Data Copies: One storage layer, multiple engines. No ETL between lake and warehouse.
One storage format, multiple access patterns. That's the Lakehouse promise.
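To see that a Lakehouse table really is just Delta files, here is a minimal sketch that reads the same data as a managed table and directly from its folder. The table name is hypothetical, and it assumes the relative Tables/ path resolves against an attached default Lakehouse.

```python
# Minimal sketch: a Lakehouse table is Delta/Parquet files in OneLake.
# Table name is hypothetical; assumes a default Lakehouse so the relative
# Tables/ path resolves.

by_name = spark.read.table("orders")                         # read as a managed table
by_path = spark.read.format("delta").load("Tables/orders")   # read the same files directly

assert by_name.count() == by_path.count()

# The auto-generated SQL analytics endpoint exposes the same files to T-SQL
# clients (e.g. SELECT COUNT(*) FROM orders) -- no copy, no sync job.
```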
Lakehouse vs Warehouse
Same storage, different write patterns
Lakehouse (Code-Driven)
Write Engine: Spark
Read Engine: Spark + Auto SQL
Storage: Files + Tables
Schema: On Read (Flexible)
Languages: Python, PySpark, Scala

vs

Warehouse (SQL-Driven)
Write Engine: T-SQL
Read Engine: T-SQL
Storage: Tables Only
Schema: On Write (Strict)
Languages: Stored Procs, Views, DDL
Choose Lakehouse When
Code-first workflows
Raw files + structured tables
ML, data science, experimentation
Choose Warehouse When
SQL-first teams with T-SQL
Strict governance + auditing
Traditional BI + dashboards
Either works for Power BI. Both support Direct Lake. And every Lakehouse auto-generates a SQL Analytics Endpoint, so SQL queries work in both -- pick whichever matches your team.
From Storage to Report
Completing the data journey
In Fabric, Direct Lake mode connects OneLake to Power BI without copies or compromises. It's the payoff of the entire Delta / OneLake architecture. For a deep dive, see our Direct Lake guide.
Title slide. Welcome the audience, introduce the guide topic.
*"Today we're looking at the building blocks of Microsoft Fabric -- the engines, the data movers, and the storage items that tie it all together."*
**On screen:** Intro text naming the four engines + stacked architecture diagram (Compute Engines frame above OneLake Storage frame).
- Set the frame: no single engine can handle batch, streaming, relational, AND in-memory BI
- Fabric gives you **one engine per workload**, all sharing OneLake storage
- Point out the diagram: **Compute Engines** frame on top, **Storage** frame below
- Emphasize: all four engines read from the **same OneLake data** -- no ETL between them
*Transition: "Let's look at each engine's profile."*
**On screen:** Spark profile card (orange) -- Languages, Workflow, Best For, Powers badges.
- **Apache Spark**: large-scale batch processing and streaming
- Languages: Python, PySpark, Scala, SparkSQL, R
- Powers the **Lakehouse** and **Notebook** items
- Best for: big data transforms, ML pipelines, ad-hoc data exploration
*Transition: "Next, the relational engine."*
**On screen:** T-SQL profile card (teal) -- the relational analytics engine.
- Familiar **SQL Server** syntax: SELECT, JOIN, stored procs, views
- Powers the **Warehouse** and **SQL Analytics Endpoint**
- Best for: structured queries, governed analytics, teams with SQL experience
*Transition: "Now the real-time engine."*
**On screen:** KQL profile card (blue) -- time-series and log analytics.
- **Kusto Query Language**: purpose-built for fast queries over streaming and time-series data
- Powers the **Eventhouse** (formerly KQL Database)
- Best for: IoT telemetry, application logs, real-time dashboards
*Transition: "And the last engine -- the one closest to Power BI."*
**On screen:** Analysis Services profile card (purple) -- the semantic modeling engine.
- The **VertiPaq** in-memory engine that powers every Power BI semantic model
- DAX measures, relationships, calculation groups -- all run here
- Powers the **Semantic Model** and **Power BI Report**
- *Audience check: "Which of these engines does your team use most today?"*
*Key message: every Fabric item runs on one of these four engines under the hood.*
**On screen:** Dataflow Gen2 card -- teal, "Low-Code" badge.
- **Power Query Online** -- same drag-and-drop as Power BI Desktop, now cloud-native
- Interface: visual transforms (merge, filter, pivot). No code required.
- Best for: **straightforward ETL** where a business analyst can self-serve
*Transition: "Next, the orchestrator."*
**On screen:** Pipeline card -- blue, "Orchestration" badge.
- Inherited from **Azure Data Factory** -- visual canvas for multi-step workflows
- Copy Activity moves data from 90+ sources. ForEach, If-Else, Switch for branching.
- Best for: **scheduling and coordinating** other tools (trigger a Notebook, chain a Dataflow)
*Transition: "And for full programmatic control..."*
**On screen:** Notebook card -- orange, "Code-Driven" badge.
- Interactive cells: **PySpark, Scala, SparkSQL, Python**
- Full library access, ML frameworks, visualizations in-cell
- Best for: **complex transforms**, ML feature engineering, ad-hoc exploration
*Transition: "So when do you pick which tool?"*
**On screen:** Summary callout -- tools work together.
- Dataflow = visual transforms. Pipeline = orchestration. Notebook = code.
- They **compose together**: a Pipeline can trigger a Notebook or Dataflow as a step
- Mention **Mirroring** for continuous real-time replication (no scheduling needed)
*Key message: pick the mover that matches your team's skillset, not the other way around.*
**On screen:** Overlap bands (colored bars) showing shared capabilities.
- Walk top to bottom: UI-Driven (DF+PL), Data Shaping (DF+NB), Code-Driven (PL+NB)
- **All three** can copy data from external sources -- that's the center band
- **Pipeline only** has orchestration -- its unique role is coordinating the other two
*Transition: "Let's see this as a feature matrix."*
**On screen:** Feature matrix grid -- checkmarks for each tool across five capabilities.
- Point out the matrix grid: checkmarks make the overlaps concrete
- Copies Data row: all three check -- that's the most common overlap
- Orchestration row: only Pipeline -- its unique differentiator
*Transition: "The overlaps are by design."*
**On screen:** Summary callout explaining the overlap philosophy.
- Microsoft **intentionally** built shared capabilities so teams aren't locked in
- Pipeline stands alone for orchestration because that IS its purpose
*Key message: Fabric gives you choice based on skillset, not artificial constraints.*
**On screen:** Shortcut flow diagram -- 3 sources (ADLS Gen2, AWS S3, Other Workspaces) with animated dashed arrows to OneLake.
- Dashed lines = **pointers, not physical copies**. Data stays at the source.
- Cross-cloud: even **AWS S3** can be shortcutted into OneLake
- "Other Workspaces" covers Fabric items, Dataverse, Google Cloud Storage
*Transition: "But when should you shortcut vs. copy?"*
**On screen:** Shortcut comparison panel -- the "pointer" approach.
- **No data movement**: data stays where it is, OneLake just references it
- **Always current**: no stale copies to worry about
- **No storage cost**: no duplication in OneLake
- Tradeoff: **read-only** access -- you can't write back through a shortcut
**On screen:** Copy comparison panel appears alongside the Shortcut panel.
- **Full transfer**: data physically moves into OneLake
- **Freshness depends on schedule**: only as fresh as the last pipeline run
- **Storage duplicated**: additional OneLake capacity cost
- Upside: **full read + write** -- you own the data lifecycle
- Decision rule: shortcut when the source is well-managed; copy when you need to transform or own the lifecycle
**On screen:** Summary callout.
- Shortcuts are the **alternative to data movers** -- reference data in place
- Ideal when the source is already well-governed and you just need to read
*Key message: not every piece of data needs to be physically copied into OneLake.*
**On screen:** Traditional Lake comparison panel (red) -- four problem rows.
- Start with the **problem**: traditional data lakes have no transaction guarantees
- **Partial writes**: a pipeline failure leaves behind corrupted data
- **Unstable reads**: queries during writes return mixed old/new rows
- **No rollback, manual versioning**: mistakes are expensive to fix
*Transition: "Delta Lake changes all of this."*
**On screen:** Delta Lake comparison panel (green) appears next to Traditional Lake.
- Every row flips from red to green: Prevented, Guaranteed, Built-in, Built-in
- Emphasize the italic line below: *"you no longer have to choose between fresh data and trustworthy data"*
- Delta Lake adds a **transaction log** on top of Parquet files -- that's the magic
*Transition: "Let's unpack what ACID actually means."*
**On screen:** Atomicity card (purple "A").
- **All or nothing** -- a write completes fully or rolls back entirely
- No more partial files from failed pipeline runs
**On screen:** Consistency card (purple "C").
- Data always moves from one **valid state** to another
- Schema enforcement prevents bad data from sneaking in
**On screen:** Isolation card (purple "I").
- Concurrent readers and writers see **consistent snapshots**
- An analyst querying during an ETL run gets stable results -- no interference
**On screen:** Durability card (purple "D").
- Once committed, data **survives crashes**, power loss, hardware failure
- The transaction log is the guarantee
**On screen:** Summary callout -- warehouse-grade reliability for the Lakehouse.
- Delta Lake brings the same ACID guarantees that traditional warehouses have always had
- No more choosing between fresh data and trustworthy data -- you get both
*Key message: Delta Lake is what makes the Lakehouse possible. Without ACID, it's just a data swamp.*
**On screen:** Three-row flow diagram -- Dataflow (teal), Pipeline (blue), Notebook (orange), each flowing to destinations.
- **Dataflow** transforms then lands into **Delta tables** (its natural target)
- **Pipeline** copies raw data to **Files or Tables** -- no built-in transform step
- **Notebook** can transform AND write to **either** destination (most flexible)
- Note the "Transform" gear icon on Dataflow and Notebook rows; Pipeline has none
*Transition: "Let's compare those two landing zones."*
**On screen:** Two-card comparison -- Delta Tables vs Files.
- **Delta Tables**: ACID-compliant, queryable by SQL + Spark, best for **analytics**
- **Files**: no ACID, limited queryability, best for **staging** raw/unstructured content
- Rule of thumb: land raw in Files, promote to Delta Tables once cleaned
*Key message: Delta Tables are the analytics format; Files are the staging zone.*
**On screen:** Data Lake frame (teal) -- Any File Format, Schema on Read, Cost Effective, Weak Governance.
- Start with the **lake**: flexible, cheap, stores anything
- Three strengths (any format, schema on read, low cost) and one weakness: **weak governance**
- Traditional lakes lack the structure needed for reliable analytics
**On screen:** Data Warehouse frame (blue) appears alongside the Data Lake.
- Now the **warehouse**: structured, governed, fast -- but rigid and expensive
- Three strengths (structured tables, schema on write, fast queries) and one weakness: **rigid and costly**
- For decades, orgs chose one or the other -- or maintained both at great expense
*Transition: "The Lakehouse combines the best of both."*
**On screen:** Converging arrows merge into a gold "Data Lakehouse" frame with combined attributes.
- Point out the converging arrows -- this is the **architecture evolution**
- The Lakehouse frame shows: Any Format + ACID + Fast Queries + Low Cost
- All of the strengths, neither of the weaknesses
- This isn't marketing -- it's what Delta Lake + OneLake actually enable
*Transition: "How does this actually work in practice?"*
**On screen:** Four numbered process cards + info callout.
- **1. Delta Lake Storage**: tables ARE files -- Parquet in OneLake
- **2. Auto SQL Endpoint**: every Lakehouse auto-generates a T-SQL endpoint, zero config
- **3. Dual Engines**: Spark + T-SQL on the same data, pick your tool
- **4. Zero Data Copies**: one storage layer, multiple engines, no ETL between them
*Key message: "One storage format, multiple access patterns." That's the Lakehouse promise.*
**On screen:** Lakehouse comparison panel (teal) -- Write Engine, Read Engine, Storage, Schema, Languages.
- **Lakehouse**: Spark writes, Spark + auto SQL reads, Files + Tables, schema on read, code-first
- Languages: Python, PySpark, Scala
- The "Code-Driven" badge is the key signal -- this is for data engineering teams
**On screen:** Warehouse comparison panel (blue) appears alongside -- same rows, different values.
- **Warehouse**: T-SQL writes, T-SQL reads, Tables only, schema on write, SQL-first
- Languages: stored procs, views, DDL
- Same foundation (OneLake + Delta/Parquet), different interfaces
*Transition: "So how do you choose?"*
**On screen:** OneLake Storage bar + "Choose Lakehouse When" / "Choose Warehouse When" cards + summary callout.
- Point out the **OneLake Storage** fieldset: both items share the same underlying storage
- **Lakehouse**: code-first workflows, raw files + tables, ML/data science
- **Warehouse**: SQL-first teams, strict governance, traditional BI
- **Critical**: either works for Power BI! Both support Direct Lake.
- Every Lakehouse auto-generates a SQL Analytics Endpoint, so SQL queries work in both
*Key message: pick based on team comfort, not capability limitations.*
**On screen:** Three-node flow diagram -- OneLake --> Semantic Model --> Power BI Report, plus Direct Lake callout.
- **OneLake**: where data lives (Delta/Parquet in unified storage)
- **Semantic Model**: the analytical brain (VertiPaq, DAX measures, relationships)
- **Power BI Report**: what end users see and interact with
- **Direct Lake** reads Delta tables from OneLake directly into VertiPaq -- no import copies, no DirectQuery latency
- This is the **payoff** of the entire Delta/OneLake architecture
- Point to the link: *"We have a dedicated Direct Lake guide for the deep dive."*
*Closing: "That's the full picture -- engines, movers, storage, and the path to reporting."*