# Spark unionByName method explained

Per the Spark documentation, `unionByName` "returns a new DataFrame containing union of rows in this and another DataFrame":
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html

Unlike `union`, which matches columns by position, `unionByName` resolves columns by name, so the two DataFrames may declare their columns in different orders.
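A minimal sketch of that difference (the DataFrame contents and app name here are illustrative, not from the example below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionByNameDemo").getOrCreate()

# Same two columns, declared in a different order in each DataFrame
df1 = spark.createDataFrame([("alpha", "x1")], ["name", "value"])
df2 = spark.createDataFrame([("x2", "beta")], ["value", "name"])

df1.union(df2).show()        # positional match: "x2" lands under "name"
df1.unionByName(df2).show()  # by-name match: columns line up correctly
```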
## Combine datasets
We have several datasets, each holding a different metric about contracts. Every dataset identifies a contract by a unique key.
**Premiums**
 | Contract Key | Premium     |
 | ------------ | ----------- |
 | A            | $100,000    |
 | B            | $10,000     |
**Losses**
 | Contract Key | Loss        |
 | ------------ | ----------- |
 | A            | $70,000     |
 | B            | $50,000     |
**Costs**
 | Contract Key | Cost   |
 | ------------ | ------ |
 | A            | $5,000 |
 | B            | $1,000 |
We want to combine these into a single flat table, one row per contract key.

**Result**
 | Contract Key | Premium    | Loss    | Cost   |
 | ------------ | ---------- | ------- | ------ |
 | A            | $100,000   | $70,000 | $5,000 |
 | B            | $10,000    | $50,000 | $1,000 |
Code: we use `pyspark.sql.DataFrame.unionByName` with `allowMissingColumns=True`, so columns that are absent from one input are filled with nulls instead of raising an error.
```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("UnionByNameExample") \
    .getOrCreate()

# One DataFrame per metric; values match the tables above
premiums = spark.createDataFrame([("A", 100000), ("B", 10000)], ["Key", "Premium"])
losses = spark.createDataFrame([("A", 70000), ("B", 50000)], ["Key", "Loss"])
costs = spark.createDataFrame([("A", 5000), ("B", 1000)], ["Key", "Cost"])

# Union by column name; allowMissingColumns=True fills absent columns with null
union_df = premiums.unionByName(losses, allowMissingColumns=True) \
                   .unionByName(costs, allowMissingColumns=True)

# Show the result
union_df.show()

# Stop the Spark session
spark.stop()
```
Output:

```
+---+-------+-----+----+
|Key|Premium| Loss|Cost|
+---+-------+-----+----+
|  A| 100000| null|null|
|  B|  10000| null|null|
|  A|   null|70000|null|
|  B|   null|50000|null|
|  A|   null| null|5000|
|  B|   null| null|1000|
+---+-------+-----+----+
```
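Note that `unionByName` stacks the rows; it does not join them. To reach the flat **Result** table shown earlier, one option is to group the unioned rows by `Key` and keep the first non-null value of each metric. A minimal sketch, picking up `union_df` from the example above (run it before the `spark.stop()` call):

```python
from pyspark.sql import functions as F

# Collapse the stacked rows into one row per contract key,
# keeping the first non-null value of each metric column
flat_df = union_df.groupBy("Key").agg(
    F.first("Premium", ignorenulls=True).alias("Premium"),
    F.first("Loss", ignorenulls=True).alias("Loss"),
    F.first("Cost", ignorenulls=True).alias("Cost"),
)
flat_df.orderBy("Key").show()
```

An ordinary multi-way join on `Key` would produce the same flat table; the union-then-aggregate route is handy when the inputs share many columns or when some contracts are missing from some datasets.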