Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Serialization

Serialization converts Java/Python objects into a byte stream for network transmission (during shuffles) or disk storage (caching). The serialization format directly affects shuffle speed, memory usage, and caching efficiency. A poorly chosen serializer can make jobs 2-5× slower than necessary.


When Spark Serializes Data

  1. Shuffles — records move between executors. Serialized, written to disk, transferred over the network, deserialized on the receiving end.
  2. Persistence: when MEMORY_AND_DISK partitions spill to disk, or when a serialized storage level such as MEMORY_ONLY_SER keeps partitions in memory as serialized bytes.
  3. Broadcast variables — serialized on the driver, shipped to all executors.

Java Serialization (Default)

Java serialization is the default for JVM types. Every class that implements java.io.Serializable works out of the box. However, it’s slow and verbose — it embeds full class metadata in every serialized object.


Kryo Serialization

Depending on the data, Kryo can be 5-10× faster than Java serialization and typically produces 2-5× smaller output:

# Python (PySpark)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Kryo Demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # "false" is the default; "true" makes Kryo fail fast on unregistered classes
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()
)

// Scala: register domain classes for maximum benefit
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// MyEvent and TransactionRecord stand for your own domain classes
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[MyEvent],
    classOf[TransactionRecord]
  ))
val spark = SparkSession.builder().config(conf).getOrCreate()
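
To see why registering classes shrinks the output, the following is a simplified Python analogy, not Kryo's actual wire format. It assumes an unregistered record is tagged with its full class name while a registered record is tagged with a small integer ID. The `REGISTRY` mapping, the `com.example` package, and both helper functions are hypothetical names used only for illustration.

```python
import struct

# Hypothetical registry: class name -> small integer ID.
# Stands in for what class registration achieves; not Kryo's real format.
REGISTRY = {"com.example.MyEvent": 0, "com.example.TransactionRecord": 1}


def encode_unregistered(class_name: str, payload: bytes) -> bytes:
    """Tag the record with its full class name (2-byte length prefix + UTF-8)."""
    name = class_name.encode("utf-8")
    return struct.pack(">H", len(name)) + name + payload


def encode_registered(class_name: str, payload: bytes) -> bytes:
    """Tag the record with a 1-byte ID looked up in the registry."""
    return struct.pack(">B", REGISTRY[class_name]) + payload


# One 64-bit integer and one 64-bit float: 16 bytes of actual field data.
payload = struct.pack(">qd", 12345, 99.5)

unregistered = encode_unregistered("com.example.MyEvent", payload)
registered = encode_registered("com.example.MyEvent", payload)

print(len(unregistered), len(registered))  # 37 17
```

In this toy model the class tag costs 21 bytes per record without registration and 1 byte with it. For small records written millions of times during a shuffle, that per-record overhead is a large share of the total.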

Kryo vs Java Comparison

| Aspect | Java | Kryo |
| --- | --- | --- |
| Speed | Baseline | 5-10× faster |
| Output size | Large | 2-5× smaller |
| Setup | None | Register classes for max performance |
| Compatibility | Universal | Most types; some edge cases |
| When to use | Simple jobs | Production, large shuffles/caches |

Python Serialization (PySpark)

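PySpark uses pickle for Python objects and cloudpickle for functions (UDFs, lambdas, RDD operations). The spark.serializer setting only affects data on the JVM side, so Kryo does not change how Python objects are serialized: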

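from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Broadcast Demo").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(["1", "42", "missing"])  # sample keys for illustration

huge_dict = {str(i): i for i in range(1_000_000)}

# BAD: large object captured in the closure, pickled and shipped with the tasks
slow = rdd.map(lambda x: huge_dict.get(x, 0))

# GOOD: broadcast large objects so each executor receives them once
bc_huge_dict = sc.broadcast(huge_dict)
fast = rdd.map(lambda x: bc_huge_dict.value.get(x, 0))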
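
The cost difference between the two patterns can be estimated without a cluster. This is a rough back-of-the-envelope sketch using only the standard library. It assumes the object travels once per task in the closure case and once per executor in the broadcast case, and it ignores compression and how Spark distributes broadcast blocks. The function name and the task and executor counts are hypothetical.

```python
import pickle


def estimate_shipped_bytes(obj, num_tasks: int, num_executors: int) -> dict:
    """Estimate bytes sent when obj travels with every task vs. once per executor."""
    payload = len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return {
        "payload_bytes": payload,
        "per_task_total": payload * num_tasks,          # closure capture
        "per_executor_total": payload * num_executors,  # broadcast variable
    }


# A smaller lookup table than the example above, to keep the sketch quick.
lookup = {str(i): i for i in range(100_000)}

# Assumed job shape: 2,000 tasks running on 20 executors.
estimate = estimate_shipped_bytes(lookup, num_tasks=2_000, num_executors=20)
print(estimate)
```

With these assumed numbers the closure approach moves 100 times more bytes than the broadcast approach, because the ratio is simply tasks divided by executors.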

Configuration Reference

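from pyspark.sql import SparkSession

# These are startup settings: set them before the session is created
# (builder, spark-defaults.conf, or spark-submit --conf).
# Calling spark.conf.set on a running session does not apply them.
spark = (
    SparkSession.builder
    # Enable Kryo
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer", "64k")
    .config("spark.kryoserializer.buffer.max", "512m")
    # Compress shuffle output and serialized RDD partitions
    .config("spark.shuffle.compress", "true")
    .config("spark.rdd.compress", "true")
    .config("spark.io.compression.codec", "lz4")
    .getOrCreate()
)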

Compression Codecs

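| Codec | Speed | Ratio | Use Case |
| --- | --- | --- | --- |
| lz4 | Fastest | Moderate | Default for most jobs |
| snappy | Fast | Moderate | Hadoop-compatible systems |
| zstd | Moderate | Best | Storage-bound workloads |
| gzip | Slowest | Best | Cold storage archival (output files only; not a valid spark.io.compression.codec value) |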
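
The speed versus ratio trade-off in the table can be observed with any codec that exposes effort levels. The sketch below uses zlib from the Python standard library as a stand-in, since lz4, snappy, and zstd need third-party packages. The synthetic sample data and the `measure` helper are assumptions for illustration, and timings will vary by machine.

```python
import time
import zlib


def measure(data: bytes, level: int) -> dict:
    """Compress data at the given zlib level and report ratio and elapsed time."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Verify the round trip so the ratio describes lossless compression.
    assert zlib.decompress(compressed) == data
    return {
        "level": level,
        "ratio": len(data) / len(compressed),
        "seconds": elapsed,
    }


# Synthetic, repetitive rows loosely resembling serialized event records.
sample = b"".join(
    b"user_%06d,click,2024-01-01\n" % (i % 5000) for i in range(200_000)
)

# Level 1 favors speed, level 9 favors ratio, level 6 is zlib's usual default.
results = [measure(sample, level) for level in (1, 6, 9)]
for r in results:
    print(f"level={r['level']} ratio={r['ratio']:.1f} seconds={r['seconds']:.3f}")
```

Running the same measurement on a sample of real shuffle or cache data is a more reliable guide than any general ranking, since ratios depend heavily on how repetitive the data is.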

For most production Spark jobs, the Kryo serializer with LZ4 compression is a strong default; switch to zstd when shuffle or cache size matters more than CPU time.