Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Shuffling

A shuffle is the process of redistributing data across the cluster so that records with the same key end up in the same partition. It’s the most expensive operation in Spark — involving disk I/O, serialization, and network transfer. Understanding when shuffles happen and how to minimize them is essential for writing fast Spark jobs.


When Shuffles Occur

Any wide transformation triggers a shuffle:

OperationWhy it Shuffles
groupByKeyCollects all values for each key in one place
reduceByKeyCombines locally first, then sends per-key subtotals
joinBrings matching keys together
distinctDeduplicates across the cluster
repartition(n)Redistributes data to n new partitions
groupBy().agg(...)Aggregation after grouping
orderBy / sortSorts across partitions

Narrow transformations (map, filter, flatMap) never shuffle.


The Shuffle Process in Detail

Stage 1 (Map Phase):
Partition 0 → sort/spill records by key → write to shuffle files
Partition 1 → sort/spill records by key → write to shuffle files
Partition 2 → sort/spill records by key → write to shuffle files
--- Shuffle Boundary (disk + network) ---
Stage 2 (Reduce Phase):
New Partition 0 ← fetch all records for keys [0-100] from Stage 1 shuffle files
New Partition 1 ← fetch all records for keys [101-200] from Stage 1 shuffle files

The shuffle write is local disk I/O. The shuffle read crosses the network. Both are expensive at scale.


Measuring Shuffle Cost

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("Shuffle Demo").getOrCreate()
df = spark.read.parquet("transactions.parquet")
# This groupBy triggers a shuffle
result = df.groupBy("customer_id").agg(F.sum("amount").alias("total"))
# View in Spark UI: Jobs → Stage → "Shuffle Read/Write" columns
# Programmatically via the query plan:
result.explain(mode="formatted")
# Look for "Exchange hashpartitioning" — that's a shuffle operator

Reducing Shuffle Cost

1. Use reduceByKey Instead of groupByKey

# BAD: groupByKey collects ALL values per key in memory before aggregating
pairs.groupByKey().mapValues(sum)
# GOOD: reduceByKey combines within each partition first, then shuffles subtotals
pairs.reduceByKey(lambda a, b: a + b)
# Sends far less data across the network

2. Tune spark.sql.shuffle.partitions

# Default: 200 — too high for small data, too low for large joins
spark.conf.set("spark.sql.shuffle.partitions", "50") # Small dataset
spark.conf.set("spark.sql.shuffle.partitions", "400") # Large dataset
# Adaptive Query Execution (Spark 3.x) tunes this automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

3. Broadcast Small Tables

from pyspark.sql.functions import broadcast
# BAD: both sides shuffle for a sort-merge join
large_df.join(small_df, "customer_id")
# GOOD: small_df is broadcast to each executor — no shuffle needed
large_df.join(broadcast(small_df), "customer_id")
# Spark auto-broadcasts tables smaller than this threshold:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800") # 50 MB

4. Pre-partition Data by Key

# If you join the same large table repeatedly, persist it partitioned by join key
df_customers = spark.read.parquet("customers.parquet") \
.repartition(200, F.col("customer_id")) \
.persist()
df_customers.join(df_orders, "customer_id")

5. Avoid Unnecessary distinct

# BAD: distinct on the full dataset
df.distinct().count()
# GOOD: deduplicate only on needed columns
df.dropDuplicates(["customer_id", "order_date"]).count()

Shuffle Configuration Reference

spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.shuffle.compress", "true")
spark.conf.set("spark.io.compression.codec", "lz4")
spark.conf.set("spark.shuffle.service.enabled", "true") # Required for dynamic allocation