Azure Synapse Analytics: Unified Analytics Platform Bridging SQL, Spark, and Data Integration
Azure Synapse Analytics brings four capabilities under one workspace: a massively parallel processing (MPP) SQL engine for data warehousing (Dedicated SQL Pool), a serverless SQL engine for querying files directly (Serverless SQL Pool), Apache Spark for data engineering and machine learning, and Azure Data Factory-style pipelines for orchestration. Previously you needed separate services with separate credentials and separate monitoring; Synapse consolidates them behind one interface, one set of permissions, and one billing envelope.
This matters for large organisations that have both SQL analysts and Python data engineers working on the same data. Instead of maintaining two separate tools, both groups work in the same Synapse Studio environment, against the same ADLS Gen2 data lake, with shared datasets and lineage tracked in Microsoft Purview.
Real-World Scenario
A retail chain runs 800 stores worldwide and wants to analyse daily sales data. The raw POS data arrives in ADLS Gen2 as JSON files every hour. A Synapse Spark notebook cleans and transforms the data to Delta format. A Synapse Pipeline schedules the notebook nightly and then loads the aggregated results into a Dedicated SQL Pool. Analysts use SQL in Synapse Studio to run ad-hoc reports, and Power BI queries the SQL Pool via DirectQuery for live dashboards. The entire flow lives in one Synapse workspace, with role-based access controlling who can run Spark jobs versus who can only query the SQL pool.
Workspace Architecture
Synapse Workspace------------------ADLS Gen2 (primary storage account) /raw/ <- landing zone for ingest /cleaned/ <- Spark-processed Delta tables /curated/ <- SQL Pool external tables
Dedicated SQL Pool (MPP warehouse) Distribution: HASH / ROUND_ROBIN / REPLICATE DWU: 100 - 30,000 DWU (scale up/down in minutes)
Serverless SQL Pool (query-in-place) No data loaded; queries files directly Pay per TB scanned
Apache Spark Pool(s) Auto-pause when idle Multiple pools for different team sizes
Synapse Pipelines Copy Activity, Notebook Activity, Data Flow Same feature set as Azure Data Factory
Synapse Link Zero-ETL connection to Cosmos DB / Dataverse Analytical store queried without impacting OLTPDedicated SQL Pool: Distribution Strategies
The MPP engine distributes table data across 60 distributions (compute nodes). Choosing the right distribution determines whether queries run fast or wait on data movement:
Distribution Type | When to Use------------------|------------------------------------------HASH(column) | Large tables joined on the same column | Eliminates data movement for that join | Avoid high-cardinality datetime columns
ROUND_ROBIN | Staging tables, tables with no clear join key | Even distribution but data movement on joins
REPLICATE | Small dimension tables (< 2 GB) | Full copy on each compute node | Zero data movement on joins
Example decision: FactSales: HASH(CustomerId) <- large fact table DimProduct: REPLICATE <- small dimension FactSales JOIN DimProduct: No shuffle needed (replicated dim available on each node) DimProduct joins happen locallyServerless SQL Pool: Query Files In Place
Serverless SQL lets analysts query CSV, JSON, and Parquet files in ADLS Gen2 without loading them into a warehouse:
-- Query Parquet files from a data lake pathSELECT Region, SUM(SalesAmount) AS TotalSales, COUNT_BIG(*) AS TransactionCountFROM OPENROWSET( BULK 'https://myaccount.dfs.core.windows.net/cleaned/sales/2024/**', FORMAT = 'PARQUET' ) AS salesGROUP BY RegionORDER BY TotalSales DESC;Billing is $5 per TB of data scanned. For ad-hoc exploration this is often far cheaper than loading data into a Dedicated SQL Pool. For repeated queries on the same data, the dedicated pool with proper indexing and distribution is faster and more cost-effective.
Synapse Spark: Data Engineering and ML
Synapse Spark pools are managed Apache Spark clusters with Azure integration (ADLS Gen2 access, Synapse Link, managed tables in the Synapse catalog). Pools auto-pause after idle timeout:
# Synapse Spark notebook: clean POS data and write Deltafrom pyspark.sql import functions as Ffrom pyspark.sql.types import DoubleType
# Read raw JSON from lakedf = spark.read.json( "abfss://raw@myaccount.dfs.core.windows.net/pos/2024/06/15/")
# Clean and transformclean = ( df .filter(F.col("transaction_id").isNotNull()) .withColumn("amount", F.col("amount").cast(DoubleType())) .withColumn("date", F.to_date("timestamp")))
# Write Delta format to cleaned zoneclean.write.format("delta").mode("overwrite").save( "abfss://cleaned@myaccount.dfs.core.windows.net/sales/date=2024-06-15/")
print(f"Rows written: {clean.count()}")Synapse Link: Zero-ETL Analytics
Synapse Link creates a continuous, automatically updated analytical store from Cosmos DB or Microsoft Dataverse. The analytical store is column-oriented (separate from the OLTP row store) and does not consume Cosmos DB RU/s:
Synapse Link: Cosmos DB -> Synapse-----------------------------------Cosmos DB container (OLTP row store) Writes: 10,000 RU/s (normal OLTP traffic) | Cosmos DB Analytical Store (auto-synced, column-store) No RU/s consumed from OLTP provisioning | Synapse Serverless SQL or Spark Pool SELECT * FROM openrowset(... analytical store endpoint ...) -> Analytical queries with no impact on production Cosmos DBSynapse vs. Databricks
Capability | Azure Synapse Analytics | Azure Databricks---------------------|--------------------------|---------------------------SQL warehouse (MPP) | Yes (Dedicated SQL Pool) | Yes (Databricks SQL)Serverless SQL | Yes (Serverless Pool) | LimitedApache Spark | Yes (managed pools) | Yes (best-in-class)Data pipelines | Built-in (ADF-compatible)| Delta Live Tables, WorkflowsUnity Catalog | Synapse catalog | Unity Catalog (superior)ML / MLflow | Azure ML integration | Built-in MLflowDelta Lake | Yes | Native (originated there)Streaming | Spark Structured Streaming| Spark + Delta Live TablesBest for | SQL-heavy teams + lake | Spark-heavy / ML teamsMany organisations use both: Databricks for heavy Spark and ML workloads, Synapse for SQL analytics and Power BI integration.
Key Interview Points
- DWU scaling: Dedicated SQL Pool DWUs can be scaled up before a heavy load and scaled down afterward to save cost. Scaling pauses queries briefly but does not lose data.
- Serverless vs. Dedicated cost model: Serverless pays per-scan (cheap for exploration, expensive for repeated queries over large data). Dedicated pays per DWU-hour (higher fixed cost, cheaper per query for frequent access to the same data).
- Synapse vs. Azure Data Factory: Synapse Pipelines and ADF share the same underlying engine. If you already have ADF, Synapse Pipelines adds no new pipeline capability but provides a unified workspace experience. Organisations already invested in ADF typically keep it running alongside Synapse.
- Pause and resume: Dedicated SQL Pool can be paused to stop compute billing (storage billing continues). Useful for non-production warehouses that are only used during business hours.
- PolyBase and COPY INTO: Loading data into a Dedicated SQL Pool uses either the COPY INTO statement (simpler, preferred) or PolyBase external tables. Direct load from ADLS Gen2 Parquet files is orders of magnitude faster than row-by-row INSERT statements.
Best Practices
- Use HASH distribution on large fact tables and join them on the hash column to avoid data movement operations.
- Partition large tables in the Dedicated SQL Pool by date to enable partition elimination in time-range queries.
- Implement column-store indexes (default for Synapse) rather than row-store indexes for the MPP engine.
- Use Workload Management (workload groups, classifiers) to prioritise executive dashboards over batch ETL in the same Dedicated SQL Pool.
- Monitor with Synapse Studio’s built-in query activity logs and connect to Azure Monitor for long-term retention and alerting on slow queries or pool saturation.