Spark DataFrame Creation – Examples and Use Cases
Apache Spark’s DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame. It is optimized for big data processing and supports SQL queries, data transformations, and analytics.
Where to Use Spark DataFrames?
- ETL Pipelines: Efficient data extraction, transformation, and loading.
- Data Analytics: Performing aggregations and statistical analysis.
- Machine Learning: Preparing large datasets for ML models.
- Real-time Processing: Handling structured streaming data.
- Data Lake Processing: Querying and processing data stored in HDFS, S3, etc.
Example 1: Creating a DataFrame from a List of Tuples
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+---+
| Name|Age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 35|
+-------+---+
✅ Use case: When you need to create small DataFrames quickly from in-memory data.
Example 2: Creating a DataFrame from a Pandas DataFrame
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PandasToSpark").getOrCreate()
pdf = pd.DataFrame({"Name": ["David", "Emma"], "Age": [28, 22]})
df = spark.createDataFrame(pdf)
df.show()
✅ Use case: When transitioning from Pandas to Spark for scalability.
Example 3: Creating a DataFrame from a JSON File
df = spark.read.json("data.json")
df.show()
✅ Use case: Reading structured data stored in JSON format, commonly used in web applications and APIs.
Example 4: Creating a DataFrame from a CSV File
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
✅ Use case: Processing large CSV datasets efficiently for analytics.
Example 5: Creating a DataFrame from an RDD
rdd = spark.sparkContext.parallelize([("John", 40), ("Doe", 35)])
df = rdd.toDF(["Name", "Age"])
df.show()
✅ Use case: When working with RDDs but requiring DataFrame functionalities like SQL operations.
Conclusion
Spark DataFrames offer scalability, efficiency, and ease of use for large-scale data processing. Whether loading from structured files, databases, or in-memory lists, DataFrames provide a flexible foundation for data engineering and analytics.