Cloud  /  Azure

Microsoft Azure 26 guides · updated 2026

Practical guides to Azure compute, networking, storage, and data services — built for engineers running production workloads on Microsoft's cloud.

Azure Synapse Analytics: Unified Analytics Platform Bridging SQL, Spark, and Data Integration

Azure Synapse Analytics brings four capabilities under one workspace: a massively parallel processing (MPP) SQL engine for data warehousing (Dedicated SQL Pool), a serverless SQL engine for querying files directly (Serverless SQL Pool), Apache Spark for data engineering and machine learning, and Azure Data Factory-style pipelines for orchestration. Previously you needed separate services with separate credentials and separate monitoring; Synapse consolidates them behind one interface, one set of permissions, and one billing envelope.

This matters for large organisations that have both SQL analysts and Python data engineers working on the same data. Instead of maintaining two separate tools, both groups work in the same Synapse Studio environment, against the same ADLS Gen2 data lake, with shared datasets and lineage tracked in Microsoft Purview.


Real-World Scenario

A retail chain runs 800 stores worldwide and wants to analyse daily sales data. The raw POS data arrives in ADLS Gen2 as JSON files every hour. A Synapse Spark notebook cleans and transforms the data to Delta format. A Synapse Pipeline schedules the notebook nightly and then loads the aggregated results into a Dedicated SQL Pool. Analysts use SQL in Synapse Studio to run ad-hoc reports, and Power BI queries the SQL Pool via DirectQuery for live dashboards. The entire flow lives in one Synapse workspace, with role-based access controlling who can run Spark jobs versus who can only query the SQL pool.


Workspace Architecture

Synapse Workspace
------------------
ADLS Gen2 (primary storage account)
/raw/ <- landing zone for ingest
/cleaned/ <- Spark-processed Delta tables
/curated/ <- SQL Pool external tables
Dedicated SQL Pool (MPP warehouse)
Distribution: HASH / ROUND_ROBIN / REPLICATE
DWU: 100 - 30,000 DWU (scale up/down in minutes)
Serverless SQL Pool (query-in-place)
No data loaded; queries files directly
Pay per TB scanned
Apache Spark Pool(s)
Auto-pause when idle
Multiple pools for different team sizes
Synapse Pipelines
Copy Activity, Notebook Activity, Data Flow
Same feature set as Azure Data Factory
Synapse Link
Zero-ETL connection to Cosmos DB / Dataverse
Analytical store queried without impacting OLTP

Dedicated SQL Pool: Distribution Strategies

The MPP engine distributes table data across 60 distributions (compute nodes). Choosing the right distribution determines whether queries run fast or wait on data movement:

Distribution Type | When to Use
------------------|------------------------------------------
HASH(column) | Large tables joined on the same column
| Eliminates data movement for that join
| Avoid high-cardinality datetime columns
ROUND_ROBIN | Staging tables, tables with no clear join key
| Even distribution but data movement on joins
REPLICATE | Small dimension tables (< 2 GB)
| Full copy on each compute node
| Zero data movement on joins
Example decision:
FactSales: HASH(CustomerId) <- large fact table
DimProduct: REPLICATE <- small dimension
FactSales JOIN DimProduct:
No shuffle needed (replicated dim available on each node)
DimProduct joins happen locally

Serverless SQL Pool: Query Files In Place

Serverless SQL lets analysts query CSV, JSON, and Parquet files in ADLS Gen2 without loading them into a warehouse:

-- Query Parquet files from a data lake path
SELECT
Region,
SUM(SalesAmount) AS TotalSales,
COUNT_BIG(*) AS TransactionCount
FROM
OPENROWSET(
BULK 'https://myaccount.dfs.core.windows.net/cleaned/sales/2024/**',
FORMAT = 'PARQUET'
) AS sales
GROUP BY Region
ORDER BY TotalSales DESC;

Billing is $5 per TB of data scanned. For ad-hoc exploration this is often far cheaper than loading data into a Dedicated SQL Pool. For repeated queries on the same data, the dedicated pool with proper indexing and distribution is faster and more cost-effective.


Synapse Spark: Data Engineering and ML

Synapse Spark pools are managed Apache Spark clusters with Azure integration (ADLS Gen2 access, Synapse Link, managed tables in the Synapse catalog). Pools auto-pause after idle timeout:

# Synapse Spark notebook: clean POS data and write Delta
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
# Read raw JSON from lake
df = spark.read.json(
"abfss://raw@myaccount.dfs.core.windows.net/pos/2024/06/15/"
)
# Clean and transform
clean = (
df
.filter(F.col("transaction_id").isNotNull())
.withColumn("amount", F.col("amount").cast(DoubleType()))
.withColumn("date", F.to_date("timestamp"))
)
# Write Delta format to cleaned zone
clean.write.format("delta").mode("overwrite").save(
"abfss://cleaned@myaccount.dfs.core.windows.net/sales/date=2024-06-15/"
)
print(f"Rows written: {clean.count()}")

Synapse Link creates a continuous, automatically updated analytical store from Cosmos DB or Microsoft Dataverse. The analytical store is column-oriented (separate from the OLTP row store) and does not consume Cosmos DB RU/s:

Synapse Link: Cosmos DB -> Synapse
-----------------------------------
Cosmos DB container (OLTP row store)
Writes: 10,000 RU/s (normal OLTP traffic)
|
Cosmos DB Analytical Store (auto-synced, column-store)
No RU/s consumed from OLTP provisioning
|
Synapse Serverless SQL or Spark Pool
SELECT * FROM openrowset(... analytical store endpoint ...)
-> Analytical queries with no impact on production Cosmos DB

Synapse vs. Databricks

Capability | Azure Synapse Analytics | Azure Databricks
---------------------|--------------------------|---------------------------
SQL warehouse (MPP) | Yes (Dedicated SQL Pool) | Yes (Databricks SQL)
Serverless SQL | Yes (Serverless Pool) | Limited
Apache Spark | Yes (managed pools) | Yes (best-in-class)
Data pipelines | Built-in (ADF-compatible)| Delta Live Tables, Workflows
Unity Catalog | Synapse catalog | Unity Catalog (superior)
ML / MLflow | Azure ML integration | Built-in MLflow
Delta Lake | Yes | Native (originated there)
Streaming | Spark Structured Streaming| Spark + Delta Live Tables
Best for | SQL-heavy teams + lake | Spark-heavy / ML teams

Many organisations use both: Databricks for heavy Spark and ML workloads, Synapse for SQL analytics and Power BI integration.


Key Interview Points


Best Practices