Spark Serialization
Serialization converts Java/Python objects into a byte stream for network transmission (during shuffles) or disk storage (caching). The serialization format directly affects shuffle speed, memory usage, and caching efficiency. A poorly chosen serializer can make jobs 2-5× slower than necessary.
When Spark Serializes Data
- Shuffles — records move between executors. Serialized, written to disk, transferred over the network, deserialized on the receiving end.
- Persistence — when using
MEMORY_AND_DISKand data spills, orMEMORY_ONLY_SER(store as serialized bytes). - Broadcast variables — serialized on the driver, shipped to all executors.
Java Serialization (Default)
Java serialization is the default for JVM types. Every class that implements java.io.Serializable works out of the box. However, it’s slow and verbose — it embeds full class metadata in every serialized object.
Kryo Serialization
Kryo is 5-10× faster than Java serialization and produces 2-5× smaller output:
from pyspark.sql import SparkSession
spark = SparkSession.builder \ .appName("Kryo Demo") \ .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ .config("spark.kryo.registrationRequired", "false") \ .getOrCreate()// Scala — register domain classes for maximum benefitval conf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .registerKryoClasses(Array( classOf[MyEvent], classOf[TransactionRecord] ))val spark = SparkSession.builder().config(conf).getOrCreate()Kryo vs Java Comparison
| Aspect | Java | Kryo |
|---|---|---|
| Speed | Baseline | 5-10× faster |
| Output size | Large | 2-5× smaller |
| Setup | None | Register classes for max performance |
| Compatibility | Universal | Most types; some edge cases |
| When to use | Simple jobs | Production, large shuffles/caches |
Python Serialization (PySpark)
PySpark uses pickle for Python objects (UDFs, lambdas, RDD operations):
# BAD: large object in closure — pickled per taskhuge_dict = {str(i): i for i in range(1_000_000)}rdd.map(lambda x: huge_dict.get(x, 0)) # Pickled for each task
# GOOD: broadcast large objectsbc_huge_dict = sc.broadcast(huge_dict)rdd.map(lambda x: bc_huge_dict.value.get(x, 0)) # Sent once per executorConfiguration Reference
# Enable Kryospark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")spark.conf.set("spark.kryoserializer.buffer", "64k")spark.conf.set("spark.kryoserializer.buffer.max", "512m")
# Compress shuffle dataspark.conf.set("spark.shuffle.compress", "true")spark.conf.set("spark.rdd.compress", "true")spark.conf.set("spark.io.compression.codec", "lz4")Compression Codecs
| Codec | Speed | Ratio | Use Case |
|---|---|---|---|
lz4 | Fastest | Moderate | Default for most jobs |
snappy | Fast | Moderate | Hadoop-compatible systems |
zstd | Moderate | Best | Storage-bound workloads |
gzip | Slowest | Best | Cold storage archival |
For most production Spark jobs: Kryo serializer + LZ4 compression is the highest-performance combination.