Azure Synapse Analytics: Unified Analytics Platform Bridging SQL, Spark, and Data Integration

Azure Synapse Analytics brings four capabilities under one workspace: a massively parallel processing (MPP) SQL engine for data warehousing (Dedicated SQL Pool), a serverless SQL engine for querying files directly (Serverless SQL Pool), Apache Spark for data engineering and machine learning, and Azure Data Factory-style pipelines for orchestration. Previously you needed separate services with separate credentials and separate monitoring; Synapse consolidates them behind one interface, one set of permissions, and one billing envelope.

This matters for large organisations that have both SQL analysts and Python data engineers working on the same data. Instead of maintaining two separate tools, both groups work in the same Synapse Studio environment, against the same ADLS Gen2 data lake, with shared datasets and lineage tracked in Microsoft Purview.

Real-World Scenario

A retail chain runs 800 stores worldwide and wants to analyse daily sales data. The raw POS data arrives in ADLS Gen2 as JSON files every hour. A Synapse Spark notebook cleans and transforms the data to Delta format. A Synapse Pipeline schedules the notebook nightly and then loads the aggregated results into a Dedicated SQL Pool. Analysts use SQL in Synapse Studio to run ad-hoc reports, and Power BI queries the SQL Pool via DirectQuery for live dashboards. The entire flow lives in one Synapse workspace, with role-based access controlling who can run Spark jobs versus who can only query the SQL pool.

Workspace Architecture

Synapse Workspace
------------------
ADLS Gen2 (primary storage account)
  /raw/       <- landing zone for ingest
  /cleaned/   <- Spark-processed Delta tables
  /curated/   <- SQL Pool external tables

  Dedicated SQL Pool (MPP warehouse)
    Distribution: HASH / ROUND_ROBIN / REPLICATE
    DWU: 100 - 30,000 DWU (scale up/down in minutes)

  Serverless SQL Pool (query-in-place)
    No data loaded; queries files directly
    Pay per TB scanned

  Apache Spark Pool(s)
    Auto-pause when idle
    Multiple pools for different team sizes

  Synapse Pipelines
    Copy Activity, Notebook Activity, Data Flow
    Same feature set as Azure Data Factory

  Synapse Link
    Zero-ETL connection to Cosmos DB / Dataverse
    Analytical store queried without impacting OLTP

Dedicated SQL Pool: Distribution Strategies

The MPP engine distributes table data across 60 distributions (compute nodes). Choosing the right distribution determines whether queries run fast or wait on data movement:

Distribution Type | When to Use
------------------|------------------------------------------
HASH(column)      | Large tables joined on the same column
                  | Eliminates data movement for that join
                  | Avoid high-cardinality datetime columns

ROUND_ROBIN       | Staging tables, tables with no clear join key
                  | Even distribution but data movement on joins

REPLICATE         | Small dimension tables (< 2 GB)
                  | Full copy on each compute node
                  | Zero data movement on joins

Example decision:
  FactSales: HASH(CustomerId)        <- large fact table
  DimProduct: REPLICATE              <- small dimension
  FactSales JOIN DimProduct:
    No shuffle needed (replicated dim available on each node)
    DimProduct joins happen locally

Serverless SQL Pool: Query Files In Place

Serverless SQL lets analysts query CSV, JSON, and Parquet files in ADLS Gen2 without loading them into a warehouse:

-- Query Parquet files from a data lake path
SELECT
    Region,
    SUM(SalesAmount) AS TotalSales,
    COUNT_BIG(*) AS TransactionCount
FROM
    OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/cleaned/sales/2024/**',
        FORMAT = 'PARQUET'
    ) AS sales
GROUP BY Region
ORDER BY TotalSales DESC;

Billing is $5 per TB of data scanned. For ad-hoc exploration this is often far cheaper than loading data into a Dedicated SQL Pool. For repeated queries on the same data, the dedicated pool with proper indexing and distribution is faster and more cost-effective.

Synapse Spark: Data Engineering and ML

Synapse Spark pools are managed Apache Spark clusters with Azure integration (ADLS Gen2 access, Synapse Link, managed tables in the Synapse catalog). Pools auto-pause after idle timeout:

# Synapse Spark notebook: clean POS data and write Delta
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Read raw JSON from lake
df = spark.read.json(
    "abfss://raw@myaccount.dfs.core.windows.net/pos/2024/06/15/"
)

# Clean and transform
clean = (
    df
    .filter(F.col("transaction_id").isNotNull())
    .withColumn("amount", F.col("amount").cast(DoubleType()))
    .withColumn("date", F.to_date("timestamp"))
)

# Write Delta format to cleaned zone
clean.write.format("delta").mode("overwrite").save(
    "abfss://cleaned@myaccount.dfs.core.windows.net/sales/date=2024-06-15/"
)

print(f"Rows written: {clean.count()}")

Synapse Link: Zero-ETL Analytics

Synapse Link creates a continuous, automatically updated analytical store from Cosmos DB or Microsoft Dataverse. The analytical store is column-oriented (separate from the OLTP row store) and does not consume Cosmos DB RU/s:

Synapse Link: Cosmos DB -> Synapse
-----------------------------------
Cosmos DB container (OLTP row store)
  Writes: 10,000 RU/s (normal OLTP traffic)
      |
  Cosmos DB Analytical Store (auto-synced, column-store)
  No RU/s consumed from OLTP provisioning
      |
  Synapse Serverless SQL or Spark Pool
  SELECT * FROM openrowset(... analytical store endpoint ...)
  -> Analytical queries with no impact on production Cosmos DB

Synapse vs. Databricks

Capability           | Azure Synapse Analytics  | Azure Databricks
---------------------|--------------------------|---------------------------
SQL warehouse (MPP)  | Yes (Dedicated SQL Pool) | Yes (Databricks SQL)
Serverless SQL       | Yes (Serverless Pool)    | Limited
Apache Spark         | Yes (managed pools)      | Yes (best-in-class)
Data pipelines       | Built-in (ADF-compatible)| Delta Live Tables, Workflows
Unity Catalog        | Synapse catalog          | Unity Catalog (superior)
ML / MLflow          | Azure ML integration     | Built-in MLflow
Delta Lake           | Yes                      | Native (originated there)
Streaming            | Spark Structured Streaming| Spark + Delta Live Tables
Best for             | SQL-heavy teams + lake   | Spark-heavy / ML teams

Many organisations use both: Databricks for heavy Spark and ML workloads, Synapse for SQL analytics and Power BI integration.

Key Interview Points

DWU scaling: Dedicated SQL Pool DWUs can be scaled up before a heavy load and scaled down afterward to save cost. Scaling pauses queries briefly but does not lose data.
Serverless vs. Dedicated cost model: Serverless pays per-scan (cheap for exploration, expensive for repeated queries over large data). Dedicated pays per DWU-hour (higher fixed cost, cheaper per query for frequent access to the same data).
Synapse vs. Azure Data Factory: Synapse Pipelines and ADF share the same underlying engine. If you already have ADF, Synapse Pipelines adds no new pipeline capability but provides a unified workspace experience. Organisations already invested in ADF typically keep it running alongside Synapse.
Pause and resume: Dedicated SQL Pool can be paused to stop compute billing (storage billing continues). Useful for non-production warehouses that are only used during business hours.
PolyBase and COPY INTO: Loading data into a Dedicated SQL Pool uses either the COPY INTO statement (simpler, preferred) or PolyBase external tables. Direct load from ADLS Gen2 Parquet files is orders of magnitude faster than row-by-row INSERT statements.

Best Practices

Use HASH distribution on large fact tables and join them on the hash column to avoid data movement operations.
Partition large tables in the Dedicated SQL Pool by date to enable partition elimination in time-range queries.
Implement column-store indexes (default for Synapse) rather than row-store indexes for the MPP engine.
Use Workload Management (workload groups, classifiers) to prioritise executive dashboards over batch ETL in the same Dedicated SQL Pool.
Monitor with Synapse Studio’s built-in query activity logs and connect to Azure Monitor for long-term retention and alerting on slow queries or pool saturation.