Understanding Spark groupByKey with Examples and Use Cases
Apache Spark’s groupByKey is a transformation applied to key-value RDDs (Resilient Distributed Datasets) that groups the values for each key into an iterable collection. Unlike reduceByKey, groupByKey does not aggregate data before shuffling, which makes it less efficient in large-scale distributed computations.
How groupByKey Works
- It groups all values for each key together and returns (K, Iterable<V>).
- It involves a full data shuffle, making it expensive in terms of memory and performance (see the partition sketch below).
- It is useful when aggregation is not required but grouping is needed.
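To make the shuffle visible, here is a hedged sketch that uses glom(), a standard RDD method returning each partition's contents as a list, to inspect how pairs are redistributed. The app name and the two-partition layout are arbitrary choices for illustration; the exact placement of keys will vary with your cluster and partitioner.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupByKeyShuffle").getOrCreate()
sc = spark.sparkContext

# Two partitions, with values for the same key spread across them
pairs = sc.parallelize([("A", 1), ("B", 2), ("A", 3), ("B", 4)], 2)
print(pairs.glom().collect())

# After groupByKey, all values for a key sit in the same partition
grouped = pairs.groupByKey()
print(grouped.mapValues(list).glom().collect())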
Example 1: Basic groupByKey Usage
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupByKeyExample").getOrCreate()
sc = spark.sparkContext

# Key-value RDD with repeated keys
rdd = sc.parallelize([("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 5)])

# Group all values for each key into an iterable
grouped_rdd = rdd.groupByKey()

for key, values in grouped_rdd.collect():
    print(f"{key}: {list(values)}")
Output:
A: [1, 3]
B: [2, 4]
C: [5]
Explanation:
- The data is grouped based on the key.
- The values are combined into an iterable.
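Note that the grouped values arrive as a lazy iterable (pyspark.resultiterable.ResultIterable), not a plain list, which is why the loop above wraps them in list(). A common idiom, sketched here against the rdd defined above, is to materialize every group in one step with mapValues(list):
# Materialize each group as a plain Python list in one step
print(rdd.groupByKey().mapValues(list).collect())
# e.g. [('A', [1, 3]), ('B', [2, 4]), ('C', [5])] (ordering may vary)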
Example 2: Using groupByKey to Group Student Scores
data = [("Alice", 90), ("Bob", 85), ("Alice", 95), ("Bob", 88), ("Charlie", 78)]
rdd = sc.parallelize(data)
grouped_scores = rdd.groupByKey()
for student, scores in grouped_scores.collect():
print(student, list(scores))
Output:
Alice: [90, 95]
Bob: [85, 88]
Charlie: [78]
Use Case:
- When raw scores need to be grouped without aggregation.
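If an aggregate such as an average is eventually needed, it can still be derived from the grouped result. The hedged sketch below reuses grouped_scores from above; for plain sums or counts, though, reduceByKey (Example 5) is the cheaper route.
# Convert each group to a list, then compute the per-student average
avg_scores = grouped_scores.mapValues(list).mapValues(lambda s: sum(s) / len(s))
print(avg_scores.collect())
# e.g. [('Alice', 92.5), ('Bob', 86.5), ('Charlie', 78.0)] (ordering may vary)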
Example 3: Word Grouping using groupByKey
words = [("spark", 1), ("hadoop", 1), ("spark", 1), ("bigdata", 1)]
rdd = sc.parallelize(words)
grouped_words = rdd.groupByKey()
for word, counts in grouped_words.collect():
print(word, list(counts))
Output:
spark: [1, 1]
hadoop: [1]
bigdata: [1]
Use Case:
- Useful in text processing for grouping word occurrences.
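If actual counts are wanted rather than the grouped markers, they can be derived from grouped_words as sketched below, although reduceByKey (Example 5) achieves the same result without shuffling every individual marker.
# Sum the 1s in each group to get a word count
word_counts = grouped_words.mapValues(list).mapValues(sum)
print(word_counts.collect())
# e.g. [('spark', 2), ('hadoop', 1), ('bigdata', 1)] (ordering may vary)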
Example 4: Grouping Employees by Department
employees = [("HR", "Alice"), ("IT", "Bob"), ("HR", "Charlie"), ("IT", "David"), ("Finance", "Eve")]
rdd = sc.parallelize(employees)
grouped_dept = rdd.groupByKey()
for dept, employees in grouped_dept.collect():
print(dept, list(employees))
Output:
HR: ["Alice", "Charlie"]
IT: ["Bob", "David"]
Finance: ["Eve"]
Use Case:
- Organizing records where aggregation is not required.
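One thing groupByKey makes easy here is post-processing each group as a whole. As a small hedged sketch reusing grouped_dept from above, the names within each department can be sorted alphabetically:
# Sort names alphabetically within each department
sorted_dept = grouped_dept.mapValues(lambda names: sorted(names))
print(sorted_dept.collect())
# e.g. [('Finance', ['Eve']), ('HR', ['Alice', 'Charlie']), ('IT', ['Bob', 'David'])] (key ordering may vary)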
Example 5: Alternative using reduceByKey for Optimization
Instead of groupByKey, reduceByKey should be used when aggregation is needed:
rdd = sc.parallelize([("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 5)])
# Using reduceByKey instead of groupByKey
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())
Output:
[("A", 4), ("B", 6), ("C", 5)]
Performance Note:
reduceByKey combines values within each partition before shuffling (a map-side combine), so far less data crosses the network, making it more efficient than groupByKey.
When to Use groupByKey?
- When aggregation is NOT needed, but grouping of all values per key is required.
- When working with non-numeric or non-aggregable data types.
- When you need access to all values under a key.
When NOT to Use groupByKey?
- When aggregation is required → use reduceByKey or aggregateByKey instead (see the aggregateByKey sketch below).
- When working with large datasets, as it causes heavy data shuffling.
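As a hedged illustration of the aggregateByKey alternative just mentioned, the sketch below computes a per-key average without ever materializing the full value lists. The (0, 0) zero value and the two merge functions are choices made for this example, not fixed API requirements.
rdd = sc.parallelize([("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 5)])

# Accumulate (sum, count) per key, starting from the zero value (0, 0)
sum_count = rdd.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold a value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge accumulators from different partitions
)

# Derive the average from each (sum, count) pair
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(averages.collect())
# e.g. [('A', 2.0), ('B', 3.0), ('C', 5.0)] (ordering may vary)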