# Spark unionByName method explained
As per the Spark documentation, `unionByName` returns a new DataFrame containing the union of rows in this and another DataFrame. Unlike `union()`, which resolves columns by position, `unionByName` resolves them by name; since Spark 3.1, the `allowMissingColumns` parameter fills columns missing from either side with nulls.

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html
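Why matching by name matters: plain `union()` pairs columns by position, so two DataFrames with the same columns in a different order get silently mis-paired. A minimal sketch of the difference (the DataFrames and session name here are illustrative, not from the article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionVsUnionByName").getOrCreate()

df1 = spark.createDataFrame([("1", "a")], ["id", "value"])
df2 = spark.createDataFrame([("b", "2")], ["value", "id"])  # same columns, swapped order

# union() pairs columns by position: "b" lands in the id column
df1.union(df2).show()

# unionByName() pairs columns by name: "2" lands in the id column
df1.unionByName(df2).show()

spark.stop()
```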
## Combine datasets
We have several datasets, each holding a different metric about contracts. Each dataset contains a unique key identifying the contract.
**Premiums**
| Contract Key | Premium |
| ------------ | ----------- |
| A | $100,000 |
| B | $10,000 |
**Losses**
| Contract Key | Loss |
| ------------ | ----------- |
| A | $70,000 |
| B | $50,000 |
**Costs**
| Contract Key | Cost |
| ------------ | ------ |
| A | $5,000 |
| B | $1,000 |
We need to combine all of these into one flat table.
**Result**
| Contract Key | Premium | Loss | Cost |
| ------------ | ---------- | ------- | ------ |
| A | $100,000 | $70,000 | $5,000 |
| B | $10,000 | $50,000 | $1,000 |
**Code** (`pyspark.sql.DataFrame.unionByName`):

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = (
    SparkSession.builder
    .appName("UnionByNameExample")
    .getOrCreate()
)

# One DataFrame per metric, all sharing the "Key" column
Premiums = spark.createDataFrame([["A", 100000], ["B", 10000]], ["Key", "Premium"])
Losses = spark.createDataFrame([["A", 70000], ["B", 50000]], ["Key", "Loss"])
Costs = spark.createDataFrame([["A", 5000], ["B", 1000]], ["Key", "Cost"])

# Perform union by name; allowMissingColumns fills absent columns with nulls
uniondf = (
    Premiums
    .unionByName(Losses, allowMissingColumns=True)
    .unionByName(Costs, allowMissingColumns=True)
)

# Show the result
uniondf.show()

# Stop the Spark session
spark.stop()
```
**Output:**

```
+---+-------+-----+----+
|Key|Premium| Loss|Cost|
+---+-------+-----+----+
|  A| 100000| null|null|
|  B|  10000| null|null|
|  A|   null|70000|null|
|  B|   null|50000|null|
|  A|   null| null|5000|
|  B|   null| null|1000|
+---+-------+-----+----+
```
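Note that `unionByName` stacks the rows: each source DataFrame contributes its own rows, with nulls in the other metric columns. To reach the flat **Result** table above, group by the key and collapse the nulls. A minimal sketch (reusing `uniondf`, run before `spark.stop()`):

```python
from pyspark.sql import functions as F

# Collapse the stacked rows: one row per Key, keeping the non-null
# value of each metric (first(..., ignorenulls=True) skips the nulls)
flatdf = uniondf.groupBy("Key").agg(
    F.first("Premium", ignorenulls=True).alias("Premium"),
    F.first("Loss", ignorenulls=True).alias("Loss"),
    F.first("Cost", ignorenulls=True).alias("Cost"),
)
flatdf.show()
```

This yields one row per contract key with Premium, Loss, and Cost side by side, matching the **Result** table (row order after a `groupBy` is not guaranteed).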