Creating a PySpark DataFrame from Collections
Creating DataFrames from in-memory Python collections is the foundation of PySpark unit testing and quick prototyping. It lets you build and verify your transformation logic before connecting to production data sources.
From a List of Tuples
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType

spark = SparkSession.builder.appName("CollectionDemo").getOrCreate()

# Tuples: column names are passed as a list, column types are inferred from the values
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 72000),
    ("Carol", "Engineering", 110000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
# +-----+-----------+------+
# | name| department|salary|
# +-----+-----------+------+
# |Alice|Engineering| 95000|
# |  Bob|  Marketing| 72000|
# |Carol|Engineering|110000|
# +-----+-----------+------+

With an Explicit Schema
In production code, always provide an explicit schema. It avoids the cost of inferring types from the data and rules out type-inference surprises:
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", IntegerType(), nullable=True),
])

df = spark.createDataFrame(data, schema)
df.printSchema()
# root
#  |-- name: string (nullable = false)
#  |-- department: string (nullable = true)
#  |-- salary: integer (nullable = true)

From a List of Dictionaries
records = [
    {"product": "Laptop", "price": 1299.99, "in_stock": True, "quantity": 50},
    {"product": "Mouse", "price": 29.99, "in_stock": True, "quantity": 200},
    {"product": "Monitor", "price": 399.99, "in_stock": False, "quantity": 0},
]

df = spark.createDataFrame(records)
df.printSchema()
# root
#  |-- in_stock: boolean (nullable = true)
#  |-- price: double (nullable = true)
#  |-- product: string (nullable = true)
#  |-- quantity: long (nullable = true)

From Row Objects
Row allows named field access and is useful when working with heterogeneous data:
from pyspark.sql import Row
employees = [
    Row(name="Alice", department="Engineering", salary=95000, active=True),
    Row(name="Bob", department="Marketing", salary=72000, active=False),
]

df = spark.createDataFrame(employees)
df.show()
df.filter(df.active == True).select("name", "salary").show()

Handling Nulls in Collections
data_with_nulls = [
    ("Alice", "Engineering", 95000),
    ("Bob", None, 72000),            # None → null in Spark
    ("Carol", "Engineering", None),  # None → null for salary
]

schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", IntegerType(), nullable=True),
])

df = spark.createDataFrame(data_with_nulls, schema)
df.show()
df.filter(df.department.isNull()).show()
df.filter(df.salary.isNotNull()).show()

Nested Structures
from pyspark.sql.types import ArrayType, MapType
nested_schema = StructType([
    StructField("name", StringType()),
    StructField("scores", ArrayType(IntegerType())),
    StructField("tags", MapType(StringType(), StringType())),
])

nested_data = [
    ("Alice", [90, 85, 92], {"level": "senior", "team": "backend"}),
    ("Bob", [75, 80], {"level": "junior", "team": "frontend"}),
]

df = spark.createDataFrame(nested_data, nested_schema)
df.show(truncate=False)
df.select("name", df.scores[0].alias("first_score")).show()

Quick Reference
| Source | Method | Best For |
|---|---|---|
| List of tuples | createDataFrame(data, col_names) | Simple prototyping |
| Explicit schema | createDataFrame(data, schema) | Production, type safety |
| List of dicts | createDataFrame(records) | Irregular structures |
| Row objects | createDataFrame(rows) | Named field access |
| Pandas DataFrame | createDataFrame(pdf) | Data science workflows |