Understanding Spark Joins with Examples and Use Cases
Apache Spark provides powerful join operations to combine datasets efficiently. Joins in Spark work similarly to SQL joins, allowing us to merge two DataFrames or RDDs based on a common key. Spark supports inner, left, right, outer, semi, and anti joins, enabling a variety of use cases in big data processing, ETL pipelines, and analytics.
Types of Joins in Spark
| Join Type | Description |
|---|---|
| Inner Join | Returns only the rows with matching keys in both datasets. |
| Left Join | Returns all rows from the left dataset and matching rows from the right. |
| Right Join | Returns all rows from the right dataset and matching rows from the left. |
| Full Outer Join | Returns all rows from both datasets, filling missing values with `null`. |
| Left Semi Join | Returns rows from the left dataset that have matches in the right dataset. |
| Left Anti Join | Returns rows from the left dataset that do not have matches in the right dataset. |
| Cross Join | Returns the Cartesian product of two datasets (i.e., all possible combinations). |
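All of these join types go through the same DataFrame `join` API, with the third argument selecting the join type. The following is a minimal sketch using two throwaway DataFrames (`df1` and `df2` are invented here purely for illustration; they are not part of the examples below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinTypesOverview").getOrCreate()

# Two tiny placeholder DataFrames that share a "key" column.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "left_val"])
df2 = spark.createDataFrame([(2, "x"), (3, "y")], ["key", "right_val"])

df1.join(df2, "key", "inner").show()      # matching rows only
df1.join(df2, "key", "left").show()       # all rows from df1
df1.join(df2, "key", "right").show()      # all rows from df2
df1.join(df2, "key", "outer").show()      # all rows from both sides
df1.join(df2, "key", "left_semi").show()  # df1 rows that have a match
df1.join(df2, "key", "left_anti").show()  # df1 rows with no match
df1.crossJoin(df2).show()                 # Cartesian product
```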
Example 1: Inner Join
Scenario:
Suppose we have `employees` and `departments` datasets that share a common column `dept_id`. We want to look up the department name for each employee.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkJoins").getOrCreate()

employees = spark.createDataFrame([
    (1, "Alice", 101),
    (2, "Bob", 102),
    (3, "Charlie", 103),
    (4, "David", 101),
    (5, "Eve", 105)   # dept_id 105 has no matching department (used by the left/outer/anti examples below)
], ["emp_id", "name", "dept_id"])

departments = spark.createDataFrame([
    (101, "HR"),
    (102, "IT"),
    (103, "Finance")
], ["dept_id", "dept_name"])

inner_join_df = employees.join(departments, "dept_id", "inner")
inner_join_df.show()
```
Output:

```
+-------+------+-------+---------+
|dept_id|emp_id|   name|dept_name|
+-------+------+-------+---------+
|    101|     1|  Alice|       HR|
|    101|     4|  David|       HR|
|    102|     2|    Bob|       IT|
|    103|     3|Charlie|  Finance|
+-------+------+-------+---------+
```
When to Use?
- To get only matching records between two datasets.
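Because DataFrame joins mirror SQL joins, the same result can also be produced with Spark SQL. A sketch that registers the DataFrames above as temporary views and runs the equivalent query:

```python
# Register the DataFrames defined above as temporary views for SQL access.
employees.createOrReplaceTempView("employees")
departments.createOrReplaceTempView("departments")

# Equivalent of employees.join(departments, "dept_id", "inner")
spark.sql("""
    SELECT e.dept_id, e.emp_id, e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d
      ON e.dept_id = d.dept_id
""").show()
```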
Example 2: Left Join
Scenario:
We want to keep all employees even if they don’t have a department assigned.
```python
left_join_df = employees.join(departments, "dept_id", "left")
left_join_df.show()
```
Output:

```
+-------+------+-------+---------+
|dept_id|emp_id|   name|dept_name|
+-------+------+-------+---------+
|    101|     1|  Alice|       HR|
|    101|     4|  David|       HR|
|    102|     2|    Bob|       IT|
|    103|     3|Charlie|  Finance|
|    105|     5|    Eve|     null|
+-------+------+-------+---------+
```

Eve is kept even though `dept_id` 105 has no match; her `dept_name` is filled with `null`.
When to Use?
- When you want all records from the left dataset and related data from the right.
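A practical follow-up on a left join is to filter for rows where the right side came back as `null`. Building on `left_join_df` from above, this sketch surfaces employees without a matching department:

```python
from pyspark.sql.functions import col

# Employees whose dept_id found no match: dept_name is null after the left join.
unassigned_df = left_join_df.filter(col("dept_name").isNull())
unassigned_df.show()  # with the sample data this returns Eve (dept_id 105)
```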
Example 3: Right Join
Scenario:
We want to keep all departments even if there are no employees assigned.
```python
right_join_df = employees.join(departments, "dept_id", "right")
right_join_df.show()
```
Output:

```
+-------+------+-------+---------+
|dept_id|emp_id|   name|dept_name|
+-------+------+-------+---------+
|    101|     1|  Alice|       HR|
|    101|     4|  David|       HR|
|    102|     2|    Bob|       IT|
|    103|     3|Charlie|  Finance|
+-------+------+-------+---------+
```

With this sample data every department has at least one employee, so the result happens to match the inner join. Eve (dept_id 105) is dropped because 105 does not exist on the right side.
When to Use?
- When you want all records from the right dataset and related data from the left.
Example 4: Full Outer Join
Scenario:
We want to get all employees and all departments even if there are no matches.
```python
full_outer_join_df = employees.join(departments, "dept_id", "outer")
full_outer_join_df.show()
```
Output:

```
+-------+------+-------+---------+
|dept_id|emp_id|   name|dept_name|
+-------+------+-------+---------+
|    101|     1|  Alice|       HR|
|    101|     4|  David|       HR|
|    102|     2|    Bob|       IT|
|    103|     3|Charlie|  Finance|
|    105|     5|    Eve|     null|
+-------+------+-------+---------+
```

Here the full outer join matches the left join because every department already has an employee; in general it would also include unmatched right-side rows, with the employee columns filled with `null`.
When to Use?
- When you need all records from both datasets, filling missing values with `null`.
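A typical cleanup step after a full outer join is to replace the introduced nulls with explicit placeholders. A sketch using `full_outer_join_df` from above (the placeholder strings `"Unassigned"` and `"Vacant"` are arbitrary choices for illustration):

```python
# Replace nulls produced by the outer join with readable placeholder values.
cleaned_df = full_outer_join_df.fillna({"dept_name": "Unassigned", "name": "Vacant"})
cleaned_df.show()
```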
Example 5: Left Semi & Left Anti Join
Scenario:
- Left Semi Join: Returns employees who have a department.
- Left Anti Join: Returns employees who don’t have a department.
```python
semi_join_df = employees.join(departments, "dept_id", "left_semi")
semi_join_df.show()

anti_join_df = employees.join(departments, "dept_id", "left_anti")
anti_join_df.show()
```
Output (Left Semi Join):

```
+-------+------+-------+
|dept_id|emp_id|   name|
+-------+------+-------+
|    101|     1|  Alice|
|    101|     4|  David|
|    102|     2|    Bob|
|    103|     3|Charlie|
+-------+------+-------+
```
Output (Left Anti Join):

```
+-------+------+----+
|dept_id|emp_id|name|
+-------+------+----+
|    105|     5| Eve|
+-------+------+----+
```

Note that semi and anti joins return only the left dataset's columns; no columns from `departments` appear in the result.
When to Use?
- Semi Join: When you only need left dataset records that have a match.
- Anti Join: When you need records in the left dataset that have no match.
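In SQL terms, left semi and left anti joins correspond to `EXISTS` and `NOT EXISTS` subqueries. A sketch expressing both with Spark SQL, reusing the `employees` and `departments` DataFrames from the examples:

```python
# Register (or re-register) the DataFrames as temporary views.
employees.createOrReplaceTempView("employees")
departments.createOrReplaceTempView("departments")

# Left semi join: employees that have a department (EXISTS).
spark.sql("""
    SELECT * FROM employees e
    WHERE EXISTS (SELECT 1 FROM departments d WHERE d.dept_id = e.dept_id)
""").show()

# Left anti join: employees without a department (NOT EXISTS).
spark.sql("""
    SELECT * FROM employees e
    WHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_id = e.dept_id)
""").show()
```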
Performance Considerations
- Broadcast small datasets: if one side of the join is much smaller than the other, a broadcast join ships the small side to every executor and avoids shuffling the large side (see the sketch below).
- Use partitioning: when joining large datasets, repartitioning both sides on the join key improves parallel processing.
- Watch for skewed keys: if a handful of join keys dominate the data, the tasks handling them become stragglers; techniques such as salting the key can help spread the load.
- Prefer `reduceByKey` or `aggregateByKey` over a join when all you need is an aggregation.
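A rough sketch of the first two points, using the `employees` and `departments` DataFrames from the examples (and assuming `departments` is the smaller side):

```python
from pyspark.sql.functions import broadcast

# Broadcast the small departments table: every executor receives a full
# copy, so the larger employees side is joined without a shuffle.
broadcast_join_df = employees.join(broadcast(departments), "dept_id", "inner")
broadcast_join_df.explain()  # the physical plan should show a broadcast join

# Repartition both sides on the join key before a large shuffle join so
# matching keys land in the same partitions and work is spread evenly.
partitioned_join_df = (
    employees.repartition("dept_id")
             .join(departments.repartition("dept_id"), "dept_id")
)
```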