**Apache Spark: Union vs UnionAll vs "Union Available"**

Apache Spark is a powerful distributed computing framework for processing large-scale data. One of the most common operations in Spark is combining datasets, and there are several ways to approach it: union(), the legacy unionAll(), and the informal notion of "union availability" (whether two datasets can be unioned at all).

In this article, we will explore the differences between union(), unionAll(), and "union availability", understand how they work, and walk through five real-world examples. By the end, you will know when to use each method and how to implement them efficiently in your Spark applications.


1. Understanding Union in Spark

What is union()?

The union() operation in Spark combines two DataFrames or RDDs with the same schema. It behaves like SQL's UNION ALL, meaning it does not remove duplicates.

Key Features of union()

✔ Combines two DataFrames/RDDs (and can be chained to merge more).
✔ Does not remove duplicates (unlike SQL's UNION).
✔ Requires the schemas of both datasets to match.
✔ Operates in distributed fashion, making it efficient for big data processing.

Syntax of union() in Spark

df1.union(df2)

This merges df1 and df2, keeping duplicate records.


2. What is unionAll()?

The unionAll() function was used in older versions of Spark (before Spark 2.0) to combine datasets without removing duplicates. However, in Spark 2.0 and later, unionAll() has been deprecated and replaced by union().

Key Features of unionAll()

✔ Behaves exactly like union(), but is deprecated in Spark 2.0+.
✔ Does not remove duplicates.
✔ Replaced by union().

Syntax of unionAll() in older Spark versions

df1.unionAll(df2)

If you are using Spark 2.0 or later, replace unionAll() with union().
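
If you are maintaining older code, note that recent PySpark releases still expose unionAll() as a deprecated alias of union(), so migration (sketched here for two schema-aligned DataFrames df1 and df2) is usually just a rename:

# Preferred spelling in Spark 2.0+
combined = df1.union(df2)

# Legacy spelling; still runs in recent PySpark versions as a deprecated alias
combined_legacy = df1.unionAll(df2)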


3. What is "Union Available"?

The term “Union Available” is not an official Spark function but rather a concept that refers to checking whether a union operation is feasible. This means:

  • Checking if the DataFrames/RDDs have the same schema.
  • Ensuring the union() operation does not cause conflicts due to mismatched columns.
  • Handling null values and schema differences properly before performing union().

How to Check if Union is Available?

Before applying union(), check if two DataFrames have the same schema using:

df1.schema == df2.schema

If the schemas differ, align them with selectExpr() or withColumn() before calling union(). Keep in mind that union() matches columns by position, not by name, so column order must agree as well.
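
A minimal pre-flight check might look like the sketch below (can_union is a hypothetical helper name, not a Spark API). It compares column names and types in order, since union() matches by position, and falls back to name-based matching with unionByName() (available since Spark 2.3; the allowMissingColumns flag requires Spark 3.1+):

def can_union(df_a, df_b):
    """Return True if the two DataFrames line up column-for-column."""
    # union() resolves columns by position, so compare names and types in order
    return [(f.name, f.dataType) for f in df_a.schema.fields] == \
           [(f.name, f.dataType) for f in df_b.schema.fields]

if can_union(df1, df2):
    combined = df1.union(df2)
else:
    # Fall back to name-based matching (Spark 2.3+);
    # allowMissingColumns=True additionally requires Spark 3.1+
    combined = df1.unionByName(df2, allowMissingColumns=True)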


4. Five Real-World Examples of Union in Spark

Example 1: Combining Two DataFrames Without Removing Duplicates

Let’s assume we have two DataFrames containing employee data.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark Session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True)
])

# Create DataFrames
df1 = spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")], schema=schema)
df2 = spark.createDataFrame([(3, "Charlie", "Finance"), (2, "Bob", "IT")], schema=schema)

# Apply union
df_union = df1.union(df2)

df_union.show()

Output:

+---+-------+----------+
| ID|   Name|Department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
|  2|    Bob|        IT|  <-- duplicate is not removed
+---+-------+----------+

Analysis:

  • union() combines both DataFrames but retains duplicates.
  • Useful in scenarios where we need raw combined data before applying further transformations.
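
Since union() behaves like SQL's UNION ALL, the same combined result can be produced through Spark SQL by registering the DataFrames as temporary views (view names here are arbitrary):

df1.createOrReplaceTempView("employees_a")
df2.createOrReplaceTempView("employees_b")

spark.sql("SELECT * FROM employees_a UNION ALL SELECT * FROM employees_b").show()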

Example 2: Removing Duplicates After Union

To remove duplicates after union(), use distinct().

df_union_distinct = df1.union(df2).distinct()
df_union_distinct.show()

Output:

+---+-------+----------+
| ID|   Name|Department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+

Analysis:

  • distinct() eliminates duplicate rows.
  • Useful when data integrity requires unique records.
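
Note that distinct() compares entire rows. When uniqueness should be decided by specific key columns instead, dropDuplicates() accepts a column subset, for example:

# Keep one row per ID, regardless of the other columns
df_union_by_id = df1.union(df2).dropDuplicates(["ID"])
df_union_by_id.show()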

Example 3: Union with Schema Mismatch (Handling Issues)

Perhaps surprisingly, union() does not fail when two DataFrames merely have different column names: it resolves columns by position and silently keeps the names from the left DataFrame, which can hide mistakes. Renaming the columns first makes the intent explicit (unionByName(), by contrast, matches by name and raises an error on mismatches). Let's see an example.

df3 = spark.createDataFrame([(4, "David", "Marketing")], ["ID", "Name", "Dept"])  # "Dept" instead of "Department"

# df1.union(df3) would run, but only because union() matches by position;
# df1.unionByName(df3) would raise an AnalysisException for the missing column

# Make the rename explicit before unioning
df3_fixed = df3.withColumnRenamed("Dept", "Department")
df_union_fixed = df1.union(df3_fixed)
df_union_fixed.show()

Output:

+---+-----+----------+
| ID| Name|Department|
+---+-----+----------+
|  1|Alice|        HR|
|  2|  Bob|        IT|
|  4|David| Marketing|
+---+-----+----------+

Analysis:

  • Always align schemas before using union(); positional matching will not warn you about mismatched names.
  • Use withColumnRenamed() or selectExpr() to rename columns, as shown below.
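
As the analysis notes, selectExpr() offers an equivalent fix by renaming during selection:

# Rename "Dept" to "Department" while selecting
df3_aliased = df3.selectExpr("ID", "Name", "Dept AS Department")
df1.union(df3_aliased).show()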

Example 4: Unioning DataFrames with Missing Columns

If a column exists in one DataFrame but not the other, Spark will throw an error.

df4 = spark.createDataFrame([(5, "Eve")], ["ID", "Name"])  # Missing "Department"

# Fix by adding missing column with null values
from pyspark.sql.functions import lit

# Cast the null literal so the new column matches df1's StringType
df4_fixed = df4.withColumn("Department", lit(None).cast("string"))
df_union_fixed = df1.union(df4_fixed)
df_union_fixed.show()

Output:

+---+-----+----------+
| ID| Name|Department|
+---+-----+----------+
|  1|Alice|        HR|
|  2|  Bob|        IT|
|  5|  Eve|      NULL|
+---+-----+----------+

Analysis:

  • Missing columns should be added explicitly before performing union().
  • Use lit(None) with an explicit cast() so the placeholder column gets the correct type; a one-step alternative follows below.
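
On Spark 3.1 or later, the same result can be achieved in one step: unionByName() with allowMissingColumns=True fills absent columns with nulls automatically.

# Requires Spark 3.1+; "Department" is filled with nulls for df4's rows
df_union_auto = df1.unionByName(df4, allowMissingColumns=True)
df_union_auto.show()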

Example 5: Unioning RDDs Instead of DataFrames

rdd1 = spark.sparkContext.parallelize([("A", 1), ("B", 2)])
rdd2 = spark.sparkContext.parallelize([("C", 3), ("B", 2)])

# Perform union
rdd_union = rdd1.union(rdd2)

print(rdd_union.collect())

Output:

[('A', 1), ('B', 2), ('C', 3), ('B', 2)]

Analysis:

  • union() works the same way on RDDs, keeping duplicates; merging more than two datasets is shown in the sketch below.
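
To combine more than two datasets, union() calls can be chained, or a list of DataFrames can be folded with Python's functools.reduce (df3_fixed reuses the renamed DataFrame from Example 3):

from functools import reduce
from pyspark.sql import DataFrame

# Fold the list into a single DataFrame via repeated union()
dfs = [df1, df2, df3_fixed]
df_all = reduce(DataFrame.union, dfs)
df_all.show()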

5. When to Use union() (and When Not To)

| Scenario | Use union()? |
|---|---|
| Merging two DataFrames with the same schema | ✅ Yes |
| Removing duplicates | ❌ Use distinct() after union() |
| DataFrames have different column names | ❌ Rename columns first |
| One DataFrame has missing columns | ❌ Add missing columns using lit(None) |

Understanding union() and its alternatives is essential for efficient data merging in Apache Spark. By handling schema mismatches, using distinct() when needed, and knowing when not to use union(), you can optimize performance and avoid common errors. 🚀