# Spark unionByName method explained

Per the Spark documentation, `unionByName` "returns a new DataFrame containing union of rows in this and another DataFrame":
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html

Unlike `union`, which matches columns by position, `unionByName` resolves columns by name, so the two DataFrames may declare their columns in different orders.
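A minimal sketch of that difference (the DataFrame contents and app name here are illustrative, not from the example below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionByNameDemo").getOrCreate()

# Same two columns, declared in a different order in each DataFrame
df1 = spark.createDataFrame([("alpha", "x1")], ["name", "value"])
df2 = spark.createDataFrame([("x2", "beta")], ["value", "name"])

df1.union(df2).show()        # positional match: "x2" lands under "name"
df1.unionByName(df2).show()  # by-name match: columns line up correctly
```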
## Combine datasets
We have several datasets, each holding a different metric about contracts. Every dataset identifies a contract by a unique key.
**Premiums**
 | Contract Key | Premium     |
 | ------------ | ----------- |
 | A            | $100,000    |
 | B            | $10,000     |
**Losses**
 | Contract Key | Loss        |
 | ------------ | ----------- |
 | A            | $70,000     |
 | B            | $50,000     |
**Costs**
 | Contract Key | Cost   |
 | ------------ | ------ |
 | A            | $5,000 |
 | B            | $1,000 |
We want to combine these into a single flat table, one row per contract key.

**Result**
 | Contract Key | Premium    | Loss    | Cost   |
 | ------------ | ---------- | ------- | ------ |
 | A            | $100,000   | $70,000 | $5,000 |
 | B            | $10,000    | $50,000 | $1,000 |
Code: we use `pyspark.sql.DataFrame.unionByName` with `allowMissingColumns=True`, so columns that are absent from one input are filled with nulls instead of raising an error.
```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("UnionByNameExample") \
    .getOrCreate()

# One DataFrame per metric; values match the tables above
premiums = spark.createDataFrame([("A", 100000), ("B", 10000)], ["Key", "Premium"])
losses = spark.createDataFrame([("A", 70000), ("B", 50000)], ["Key", "Loss"])
costs = spark.createDataFrame([("A", 5000), ("B", 1000)], ["Key", "Cost"])

# Union by column name; allowMissingColumns=True fills absent columns with null
union_df = premiums.unionByName(losses, allowMissingColumns=True) \
                   .unionByName(costs, allowMissingColumns=True)

# Show the result
union_df.show()

# Stop the Spark session
spark.stop()
```
Output:

```
+---+-------+-----+----+
|Key|Premium| Loss|Cost|
+---+-------+-----+----+
|  A| 100000| null|null|
|  B|  10000| null|null|
|  A|   null|70000|null|
|  B|   null|50000|null|
|  A|   null| null|5000|
|  B|   null| null|1000|
+---+-------+-----+----+
```
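Note that `unionByName` stacks the rows; it does not join them. To reach the flat **Result** table shown earlier, one option is to group the unioned rows by `Key` and keep the first non-null value of each metric. A minimal sketch, picking up `union_df` from the example above (run it before the `spark.stop()` call):

```python
from pyspark.sql import functions as F

# Collapse the stacked rows into one row per contract key,
# keeping the first non-null value of each metric column
flat_df = union_df.groupBy("Key").agg(
    F.first("Premium", ignorenulls=True).alias("Premium"),
    F.first("Loss", ignorenulls=True).alias("Loss"),
    F.first("Cost", ignorenulls=True).alias("Cost"),
)
flat_df.orderBy("Key").show()
```

An ordinary multi-way join on `Key` would produce the same flat table; the union-then-aggregate route is handy when the inputs share many columns or when some contracts are missing from some datasets.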