Spark DataFrame unionByName()
unionByName() combines two DataFrames by matching column names rather than column positions. This is safer than union() when the two DataFrames have the same columns but in different orders, or when they have slightly different schemas.
union() vs unionByName()
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("UnionByName").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])df2 = spark.createDataFrame([(2, 72000, "Bob")], ["id", "salary", "name"])
# union() — matches by POSITION, not namedf1.union(df2).show()# +---+-----+-----+# | id| name|salary|# +---+-----+-----+# | 1|Alice|95000|# | 2|72000| Bob| ← WRONG! salary and name are swapped# +---+-----+-----+
# unionByName() — matches by NAMEdf1.unionByName(df2).show()# +---+-----+------+# | id| name|salary|# +---+-----+------+# | 1|Alice| 95000|# | 2| Bob| 72000| ← CORRECT!# +---+-----+------+Handling Missing Columns with allowMissingColumns
# Spark 3.1+ supports allowMissingColumns=True# Missing columns are filled with null
df_q1 = spark.createDataFrame( [(1, "Alice", 95000)], ["id", "name", "q1_sales"])df_q2 = spark.createDataFrame( [(2, "Bob", 80000, "Enterprise")], ["id", "name", "q2_sales", "segment"])
# This would fail without allowMissingColumns:# AnalysisException: Resolved attribute(s) q2_sales#..., segment#... missing from...
combined = df_q1.unionByName(df_q2, allowMissingColumns=True)combined.show()# +---+-----+--------+--------+-----------+# | id| name|q1_sales|q2_sales| segment|# +---+-----+--------+--------+-----------+# | 1|Alice| 95000| null| null|# | 2| Bob| null| 80000| Enterprise|# +---+-----+--------+--------+-----------+Practical Example: Stacking Monthly Reports
import os
months = ["2025-01", "2025-02", "2025-03", "2025-04"]dfs = [ spark.read.parquet(f"s3://bucket/reports/month={m}/") for m in months]
# Combine all months into one DataFrame, matching by column namefrom functools import reducecombined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)combined.show()Schema Alignment Before Combining
If you can’t use allowMissingColumns, align schemas manually:
from pyspark.sql.types import StringType
# Get all unique columns across both DataFramesall_cols = list(set(df1.columns) | set(df2.columns))
def align_schema(df, all_columns): for col in all_columns: if col not in df.columns: df = df.withColumn(col, F.lit(None).cast(StringType())) return df.select(sorted(all_columns))
df1_aligned = align_schema(df1, all_cols)df2_aligned = align_schema(df2, all_cols)
combined = df1_aligned.union(df2_aligned)Performance Notes
unionByName() does NOT deduplicate — it’s a bag union (UNION ALL in SQL):
# Remove duplicates after uniondf1.unionByName(df2).distinct()
# Or: use SQL UNION (deduplicates) vs UNION ALL (keeps duplicates)df1.createOrReplaceTempView("q1")df2.createOrReplaceTempView("q2")
spark.sql("SELECT * FROM q1 UNION SELECT * FROM q2") # Deduplicatesspark.sql("SELECT * FROM q1 UNION ALL SELECT * FROM q2") # Keeps all rows