Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Broadcast Variables

A broadcast variable distributes a read-only dataset to every executor node exactly once. Without broadcasting, Spark ships a Python object with every task — potentially thousands of times. Broadcasting sends it once per executor, saving enormous network overhead when that object is large.


The Problem: Per-Task Shipping

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Broadcast Demo").getOrCreate()
sc = spark.sparkContext
# 50 MB lookup table in driver memory
country_codes = {
"US": "United States", "GB": "United Kingdom",
"DE": "Germany", "JP": "Japan",
# ... 200 more entries
}
rdd = sc.parallelize(["US", "GB", "DE", "JP"] * 10000, numSlices=200)
# BAD: country_codes is pickled and shipped with every one of 200 tasks
result = rdd.map(lambda code: country_codes.get(code, "Unknown"))

Broadcasting to Each Executor Once

# GOOD: ship once per executor (typically 10-50 executors, not 1000+ tasks)
broadcast_codes = sc.broadcast(country_codes)
result = rdd.map(lambda code: broadcast_codes.value.get(code, "Unknown"))
# Access the value via .value inside tasks
broadcast_codes.value["US"] # "United States"

Broadcast Joins — Avoiding Shuffle

The most impactful use of broadcast variables: eliminating join shuffles.

from pyspark.sql import functions as F
# Large fact table: 500 million rows
df_transactions = spark.read.parquet("s3://bucket/transactions/")
# Small dimension table: 500 rows
df_products = spark.read.parquet("s3://bucket/products/")
# BAD: triggers sort-merge join (shuffle on both sides)
result = df_transactions.join(df_products, "product_id")
# GOOD: broadcasts df_products to every executor — no shuffle
from pyspark.sql.functions import broadcast
result = df_transactions.join(broadcast(df_products), "product_id")
# Auto-broadcast threshold (default 10 MB):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800") # 50 MB

RDD API with Broadcast

product_category_map = {
"p001": "Electronics", "p002": "Clothing", "p003": "Books",
}
bc_map = sc.broadcast(product_category_map)
orders = sc.parallelize([
{"order_id": 1, "product_id": "p001", "amount": 150},
{"order_id": 2, "product_id": "p003", "amount": 30},
])
enriched = orders.map(lambda order: {
**order,
"category": bc_map.value.get(order["product_id"], "Unknown")
})
enriched.collect()

Lifecycle Management

# Unpersist — removes from executor memory; can be re-fetched if needed
broadcast_codes.unpersist()
# Destroy — permanent removal from memory and BlockManager
broadcast_codes.destroy()

Broadcast Performance Guidelines

Data SizeStrategy
< 10 MBAuto-broadcast via threshold
10 MB – 200 MBManual broadcast() hint
> 200 MBSort-merge join — broadcasting too large
# Force broadcast plan even if Spark chose sort-merge
df_transactions.hint("broadcast").join(df_products, "product_id")
# Disable auto-broadcast for reproducibility testing
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")