Spark DataFrame Creation – Examples and Use Cases

Apache Spark’s DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame. It is optimized for big data processing and supports SQL queries, data transformations, and analytics.


Where to Use Spark DataFrames?

  • ETL Pipelines: Efficient data extraction, transformation, and loading.
  • Data Analytics: Performing aggregations and statistical analysis.
  • Machine Learning: Preparing large datasets for ML models.
  • Real-time Processing: Handling structured streaming data.
  • Data Lake Processing: Querying and processing data stored in HDFS, S3, etc.

Example 1: Creating a DataFrame from a List of Tuples

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

Use case: When you need to create small DataFrames quickly from in-memory data.


Example 2: Creating a DataFrame from a Pandas DataFrame

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasToSpark").getOrCreate()

pdf = pd.DataFrame({"Name": ["David", "Emma"], "Age": [28, 22]})
df = spark.createDataFrame(pdf)
df.show()

Use case: When transitioning from Pandas to Spark for scalability.


Example 3: Creating a DataFrame from a JSON File

df = spark.read.json("data.json")
df.show()

Use case: Reading structured data stored in JSON format, commonly used in web applications and APIs.


Example 4: Creating a DataFrame from a CSV File

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()

Use case: Processing large CSV datasets efficiently for analytics.


Example 5: Creating a DataFrame from an RDD

rdd = spark.sparkContext.parallelize([("John", 40), ("Doe", 35)])
df = rdd.toDF(["Name", "Age"])
df.show()

Use case: When working with RDDs but needing DataFrame functionality such as SQL queries and the Catalyst optimizer.


Conclusion

Spark DataFrames offer scalability, efficiency, and ease of use for large-scale data processing. Whether loading from structured files, databases, or in-memory lists, DataFrames provide a flexible foundation for data engineering and analytics.