Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark DataFrame unionByName()

unionByName() combines two DataFrames by matching column names rather than column positions. This is safer than union() when the two DataFrames have the same columns but in different orders, or when they have slightly different schemas.


union() vs unionByName()

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("UnionByName").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])
df2 = spark.createDataFrame([(2, 72000, "Bob")], ["id", "salary", "name"])
# union() — matches by POSITION, not name
df1.union(df2).show()
# +---+-----+-----+
# | id| name|salary|
# +---+-----+-----+
# | 1|Alice|95000|
# | 2|72000| Bob| ← WRONG! salary and name are swapped
# +---+-----+-----+
# unionByName() — matches by NAME
df1.unionByName(df2).show()
# +---+-----+------+
# | id| name|salary|
# +---+-----+------+
# | 1|Alice| 95000|
# | 2| Bob| 72000| ← CORRECT!
# +---+-----+------+

Handling Missing Columns with allowMissingColumns

# Spark 3.1+ supports allowMissingColumns=True
# Missing columns are filled with null
df_q1 = spark.createDataFrame(
[(1, "Alice", 95000)],
["id", "name", "q1_sales"]
)
df_q2 = spark.createDataFrame(
[(2, "Bob", 80000, "Enterprise")],
["id", "name", "q2_sales", "segment"]
)
# This would fail without allowMissingColumns:
# AnalysisException: Resolved attribute(s) q2_sales#..., segment#... missing from...
combined = df_q1.unionByName(df_q2, allowMissingColumns=True)
combined.show()
# +---+-----+--------+--------+-----------+
# | id| name|q1_sales|q2_sales| segment|
# +---+-----+--------+--------+-----------+
# | 1|Alice| 95000| null| null|
# | 2| Bob| null| 80000| Enterprise|
# +---+-----+--------+--------+-----------+

Practical Example: Stacking Monthly Reports

import os
months = ["2025-01", "2025-02", "2025-03", "2025-04"]
dfs = [
spark.read.parquet(f"s3://bucket/reports/month={m}/")
for m in months
]
# Combine all months into one DataFrame, matching by column name
from functools import reduce
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
combined.show()

Schema Alignment Before Combining

If you can’t use allowMissingColumns, align schemas manually:

from pyspark.sql.types import StringType
# Get all unique columns across both DataFrames
all_cols = list(set(df1.columns) | set(df2.columns))
def align_schema(df, all_columns):
for col in all_columns:
if col not in df.columns:
df = df.withColumn(col, F.lit(None).cast(StringType()))
return df.select(sorted(all_columns))
df1_aligned = align_schema(df1, all_cols)
df2_aligned = align_schema(df2, all_cols)
combined = df1_aligned.union(df2_aligned)

Performance Notes

unionByName() does NOT deduplicate — it’s a bag union (UNION ALL in SQL):

# Remove duplicates after union
df1.unionByName(df2).distinct()
# Or: use SQL UNION (deduplicates) vs UNION ALL (keeps duplicates)
df1.createOrReplaceTempView("q1")
df2.createOrReplaceTempView("q2")
spark.sql("SELECT * FROM q1 UNION SELECT * FROM q2") # Deduplicates
spark.sql("SELECT * FROM q1 UNION ALL SELECT * FROM q2") # Keeps all rows