Understanding Spark Joins with Examples and Use Cases

Apache Spark provides powerful join operations to combine datasets efficiently. Joins in Spark work similarly to SQL joins, allowing us to merge two DataFrames or RDDs based on a common key. Spark supports inner, left, right, outer, semi, and anti joins, enabling a variety of use cases in big data processing, ETL pipelines, and analytics.


Types of Joins in Spark

Join TypeDescription
Inner JoinReturns matching rows from both datasets.
Left JoinReturns all rows from the left dataset and matching rows from the right.
Right JoinReturns all rows from the right dataset and matching rows from the left.
Full Outer JoinReturns all rows from both datasets, filling missing values with null.
Left Semi JoinReturns rows from the left dataset that have matches in the right dataset.
Left Anti JoinReturns rows from the left dataset that do not have matches in the right dataset.
Cross JoinReturns the Cartesian product of two datasets (i.e., all possible combinations).

Example 1: Inner Join

Scenario:

Suppose we have employees and departments datasets that share a common column dept_id. We want to get the department name for each employee.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SparkJoins").getOrCreate()

employees = spark.createDataFrame([
    (1, "Alice", 101),
    (2, "Bob", 102),
    (3, "Charlie", 103),
    (4, "David", 101)
], ["emp_id", "name", "dept_id"])

departments = spark.createDataFrame([
    (101, "HR"),
    (102, "IT"),
    (103, "Finance")
], ["dept_id", "dept_name"])

inner_join_df = employees.join(departments, "dept_id", "inner")
inner_join_df.show()

Output:

+-------+-----+------+
|dept_id|name |dept_name |
+-------+-----+----------+
|101    |Alice |HR       |
|101    |David |HR       |
|102    |Bob   |IT       |
|103    |Charlie |Finance|
+-------+-----+----------+

When to Use?

  • To get only matching records between two datasets.

Example 2: Left Join

Scenario:

We want to keep all employees even if they don’t have a department assigned.

left_join_df = employees.join(departments, "dept_id", "left")
left_join_df.show()

Output:

+-------+-----+------+
|dept_id|name |dept_name |
+-------+-----+----------+
|101    |Alice |HR       |
|101    |David |HR       |
|102    |Bob   |IT       |
|103    |Charlie |Finance|
+-------+-----+----------+

When to Use?

  • When you want all records from the left dataset and related data from the right.

Example 3: Right Join

Scenario:

We want to keep all departments even if there are no employees assigned.

right_join_df = employees.join(departments, "dept_id", "right")
right_join_df.show()

Output:

+-------+-----+------+
|dept_id|name |dept_name |
+-------+-----+----------+
|101    |Alice |HR       |
|101    |David |HR       |
|102    |Bob   |IT       |
|103    |Charlie |Finance|
+-------+-----+----------+

When to Use?

  • When you want all records from the right dataset and related data from the left.

Example 4: Full Outer Join

Scenario:

We want to get all employees and all departments even if there are no matches.

full_outer_join_df = employees.join(departments, "dept_id", "outer")
full_outer_join_df.show()

Output:

+-------+-----+----------+
|dept_id|name |dept_name |
+-------+-----+----------+
|101    |Alice |HR       |
|101    |David |HR       |
|102    |Bob   |IT       |
|103    |Charlie |Finance|
+-------+-----+----------+

When to Use?

  • When you need all records from both datasets, filling missing values with null.

Example 5: Left Semi & Left Anti Join

Scenario:

  • Left Semi Join: Returns employees who have a department.
  • Left Anti Join: Returns employees who don’t have a department.
semi_join_df = employees.join(departments, "dept_id", "left_semi")
semi_join_df.show()

anti_join_df = employees.join(departments, "dept_id", "left_anti")
anti_join_df.show()

Output (Left Semi Join):

+-------+-----+
|dept_id|name |
+-------+-----+
|101    |Alice |
|101    |David |
|102    |Bob   |
|103    |Charlie |
+-------+-----+

Output (Left Anti Join):

+-------+-----+
|dept_id|name |
+-------+-----+
|105    |Eve  |
+-------+-----+

When to Use?

  • Semi Join: When you only need left dataset records that have a match.
  • Anti Join: When you need records in the left dataset that have no match.

Performance Considerations

  1. Avoid Skewed Joins: If one dataset is significantly larger, consider broadcast joins to optimize.
  2. Use Partitioning: When joining large datasets, partitioning helps improve parallel processing.
  3. Prefer reduceByKey or aggregateByKey if only aggregation is needed instead of joins.