Saving a PySpark DataFrame as CSV

DataFrame.write.csv() writes a DataFrame as a directory of CSV part files, one file per partition by default. Use write options to control headers, delimiters, and compression, and use coalesce or repartition to control the number of output files.


Basic CSV Write

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("WriteCSV").getOrCreate()
data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
# Write CSV with header — creates a directory with multiple part files
df.write \
.mode("overwrite") \
.option("header", "true") \
.csv("output/employees/")

Output directory structure:

output/employees/
├── _SUCCESS
├── part-00000-xxxx.csv
└── part-00001-xxxx.csv
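
Downstream code often needs to locate the data files in this directory and confirm that the job finished. The following is a minimal sketch using only the Python standard library, assuming the output sits on a local filesystem. The helper name `list_part_files` is hypothetical and not part of Spark.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the sorted part files in a Spark output directory.

    Raises FileNotFoundError when the _SUCCESS marker is missing, which
    typically means the write job did not commit.
    """
    root = Path(output_dir)
    if not (root / "_SUCCESS").exists():
        raise FileNotFoundError(f"no _SUCCESS marker in {root}")
    # Data files start with "part-"; this skips _SUCCESS and hidden .crc files.
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.name.startswith("part-")
    )
```

For example, `list_part_files("output/employees/")` would return the two part files shown above. Object stores and HDFS need their own listing APIs, so treat this as a local-only illustration.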

Write Modes

# overwrite — replace existing data
df.write.mode("overwrite").csv("output/")
# append — add to existing data
df.write.mode("append").csv("output/")
# ignore — no-op if directory exists
df.write.mode("ignore").csv("output/")
# error / errorIfExists (default) — raises error if directory exists
df.write.mode("error").csv("output/")

CSV Write Options

(
    df.write
    .mode("overwrite")
    .option("header", "true")            # Write column names in first row
    .option("sep", "\t")                 # Tab-delimited (TSV)
    .option("quote", '"')                # Quote character
    .option("escape", "\\")              # Escape character
    .option("nullValue", "NULL")         # String for null values
    .option("dateFormat", "yyyy-MM-dd")  # Date formatting
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  # Timestamp formatting
    .option("encoding", "UTF-8")         # Output character encoding
    .csv("output/report.csv")            # Creates a directory named report.csv, not a single file
)

Compression

(
    df.write
    .mode("overwrite")
    .option("header", "true")
    .option("compression", "gzip")  # gzip, bzip2, lz4, snappy, deflate
    .csv("output/compressed/")
)
# Output: part-00000-xxxx.csv.gz
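
To spot-check a compressed part file without starting Spark, the standard library can read it directly. This is a minimal sketch, assuming gzip compression, a header row, the default comma separator, and simple values without embedded quotes or escapes. The helper name `read_gzip_csv` is hypothetical.

```python
import csv
import gzip


def read_gzip_csv(path):
    """Read one gzip-compressed CSV part file into a list of dicts."""
    # "rt" opens the gzip stream in text mode; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))
```

Every value comes back as a string, because the csv module does no type inference.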

Writing a Single File

By default, Spark writes one file per partition. For a single output file:

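# coalesce(1) merges all partitions into one, so a single task writes a single part file
# (the output is still a directory that contains that one file)
df.coalesce(1) \
.write \
.mode("overwrite") \
.option("header", "true") \
.csv("output/single-file/")
# ⚠️ For large datasets, avoid coalesce(1): all data flows through one executor task,
# which removes write parallelism (the data is not collected to the driver)
# Use it only when downstream systems require a single file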
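
Even with a single partition, Spark still produces a directory with a generated part file name. When a downstream system needs a fixed file name, one option is to copy the part file out afterwards. This is a minimal sketch, assuming a local filesystem. The helper name `promote_single_part_file` and the target path are hypothetical.

```python
import shutil
from pathlib import Path


def promote_single_part_file(output_dir, target_path):
    """Copy the only part file in output_dir to target_path and return the target."""
    parts = sorted(Path(output_dir).glob("part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    target = Path(target_path)
    # Create the destination directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(parts[0], target)
    return target
```

A call such as `promote_single_part_file("output/single-file/", "output/employees.csv")` would leave one plainly named CSV next to the Spark output. On S3 or HDFS the same idea applies, but the copy has to go through that storage system's own API.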

Writing to S3

# S3 write (AWS credentials must be configured)
# s3:// works on Amazon EMR; with open-source Spark and the hadoop-aws connector, use s3a://
df.write \
.mode("overwrite") \
.option("header", "true") \
.option("compression", "snappy") \
.csv("s3://my-bucket/data/employees/")

Partitioned CSV Output

# Partition by year and month for efficient downstream reads
# (assumes df has a date or timestamp column named hire_date; the sample df above does not)
df.withColumn("year", F.year(F.col("hire_date"))) \
.withColumn("month", F.month(F.col("hire_date"))) \
.write \
.mode("overwrite") \
.option("header", "true") \
.partitionBy("year", "month") \
.csv("output/partitioned/")
# Output (partition values are encoded in the directory names, not stored in the files):
# output/partitioned/year=2024/month=1/part-00000-xxxx.csv
# output/partitioned/year=2025/month=3/part-00000-xxxx.csv
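
Because the partition values live only in the directory names, a tool that reads these files without Spark has to recover them from the path. This is a minimal sketch of that parsing, assuming forward-slash paths like the ones above. The helper name `parse_partition_values` is hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_values(file_path):
    """Extract key=value directory segments from a partitioned output path."""
    values = {}
    # Only look at directories; the file name itself is not a partition segment.
    for segment in PurePosixPath(file_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values
```

For the first path above, this returns `{"year": "2024", "month": "1"}`. The values are plain strings, so cast them if numeric types are needed.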

Reading Back the Output

df_back = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("output/employees/")
df_back.show()