Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark reduceByKey()

reduceByKey() aggregates values for each key in a key-value RDD by applying a binary commutative and associative function. It’s one of Spark’s most important aggregation transformations — more efficient than groupByKey() because it performs partial aggregation within each partition before shuffling data across the network.


Syntax and Basic Example

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("reduceByKey").getOrCreate()
sc = spark.sparkContext
# Signature: rdd.reduceByKey(func, numPartitions=None)
# func(a, b) → must be commutative and associative
pairs = sc.parallelize([
("apple", 3), ("banana", 5), ("apple", 2),
("cherry", 1), ("banana", 3), ("apple", 1)
])
totals = pairs.reduceByKey(lambda a, b: a + b)
totals.collect()
# [("apple", 6), ("banana", 8), ("cherry", 1)]

Why reduceByKey is Better Than groupByKey

# groupByKey — sends ALL values across the network, then aggregates
# Memory-intensive: collects all values for each key into one partition
pairs.groupByKey().mapValues(sum).collect()
# Network: ["apple", [3, 2, 1]] — transfers entire list
# reduceByKey — aggregates WITHIN each partition first, then shuffles subtotals
# Sends only one value per key per partition — much less network traffic
pairs.reduceByKey(lambda a, b: a + b).collect()
# Network: ["apple", 5] → ["apple", 1] → merge to ["apple", 6]
# For large datasets, reduceByKey can be 10-100× faster than groupByKey

Real-World Examples

Sales Revenue Aggregation

sales = sc.parallelize([
("Electronics", 1200), ("Clothing", 450), ("Electronics", 800),
("Books", 120), ("Clothing", 300), ("Electronics", 600),
])
revenue_by_category = sales.reduceByKey(lambda a, b: a + b)
revenue_by_category.sortBy(lambda x: x[1], ascending=False).collect()
# [("Electronics", 2600), ("Clothing", 750), ("Books", 120)]

Counting Events

events = sc.parallelize([
("page_view", 1), ("click", 1), ("page_view", 1),
("purchase", 1), ("click", 1), ("page_view", 1),
])
event_counts = events.reduceByKey(lambda a, b: a + b)
event_counts.collect()
# [("page_view", 3), ("click", 2), ("purchase", 1)]

Finding Max Value per Key

scores = sc.parallelize([
("Alice", 85), ("Bob", 72), ("Alice", 91), ("Bob", 88), ("Alice", 79)
])
max_score = scores.reduceByKey(lambda a, b: max(a, b))
max_score.collect()
# [("Alice", 91), ("Bob", 88)]

String Concatenation per Key

tags = sc.parallelize([
("post1", "python"), ("post2", "spark"), ("post1", "spark"),
("post2", "bigdata"), ("post1", "databricks"),
])
tags_per_post = tags.reduceByKey(lambda a, b: f"{a},{b}")
tags_per_post.collect()
# [("post1", "python,spark,databricks"), ("post2", "spark,bigdata")]

Controlling Output Partitions

# Default: uses spark.default.parallelism partitions
pairs.reduceByKey(lambda a, b: a + b)
# Specify output partitions explicitly
pairs.reduceByKey(lambda a, b: a + b, numPartitions=10)

Limitations

reduceByKey requires a commutative and associative function — the result must be the same regardless of combination order:

# OK: sum is commutative and associative
pairs.reduceByKey(lambda a, b: a + b)
# OK: max, min
pairs.reduceByKey(lambda a, b: max(a, b))
# PROBLEMATIC: subtraction is NOT commutative
pairs.reduceByKey(lambda a, b: a - b) # Result depends on processing order
# For non-commutative operations, use aggregateByKey instead