Kryo Serialization in Apache Spark
Kryo is a fast, compact Java serialization library that reduces shuffle overhead and memory footprint compared to Java’s default serialization. Switching to Kryo is one of the highest-value single-configuration performance improvements for RDD-heavy workloads.
Enabling Kryo
from pyspark.sql import SparkSession
spark = SparkSession.builder \ .appName("KryoApp") \ .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ .config("spark.kryoserializer.buffer", "64k") \ .config("spark.kryoserializer.buffer.max", "512m") \ .getOrCreate()Class Registration (Scala / Java)
Without registration, Kryo writes the full class name alongside data — partially negating the size benefit. Registering classes tells Kryo to use a compact numeric ID:
import org.apache.spark.{SparkConf, SparkContext}import com.esotericsoftware.kryo.Kryo
// Option 1: registerKryoClasses in SparkConfval conf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .registerKryoClasses(Array( classOf[TransactionRecord], classOf[CustomerProfile] ))
// Option 2: Custom KryoRegistrator for large class setsclass MyRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo): Unit = { kryo.register(classOf[TransactionRecord]) kryo.register(classOf[CustomerProfile]) kryo.register(classOf[org.joda.time.LocalDate]) }}
val conf2 = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrator", "com.example.MyRegistrator")Buffer Tuning
Kryo uses a fixed-size buffer per thread. If an object exceeds the buffer, a Buffer Overflow exception is thrown:
spark.conf.set("spark.kryoserializer.buffer", "64k") # Initial (default: 64k)spark.conf.set("spark.kryoserializer.buffer.max", "512m") # Maximum (default: 64m)
# Error: KryoException: Buffer overflow# Fix: increase buffer.max to accommodate your largest recordPerformance Comparison
Typical results for a 1M-row shuffle-heavy job:
| Serializer | Duration | Shuffle Write Size |
|---|---|---|
| Java | ~8.2 seconds | ~420 MB |
| Kryo | ~2.1 seconds | ~95 MB |
| Improvement | 4× faster | 4.4× smaller |
PySpark: Arrow for DataFrame Performance
In PySpark, Python objects use pickle — Kryo only affects the JVM layer. For Python-to-JVM exchange:
# Enable Apache Arrow for columnar data exchangespark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Arrow speeds up:# - df.toPandas()# - spark.createDataFrame(pandas_df)# - pandas UDFs (vectorized UDFs)
pdf = spark.read.parquet("data.parquet").toPandas() # 10× faster with ArrowDebugging Kryo Issues
# 1. Log warnings for unregistered classes (instead of errors)spark.conf.set("spark.kryo.registrationRequired", "false")# Check driver/executor logs for: "Class is not registered: com.example.MyClass"
# 2. Confirm Kryo is activeprint(spark.conf.get("spark.serializer"))# "org.apache.spark.serializer.KryoSerializer"
# 3. Buffer overflow# Error: KryoException: Buffer overflow. Available: X, Required: Yspark.conf.set("spark.kryoserializer.buffer.max", "1g")
# 4. Class not found during deserialization# Ensure the class JAR is on all executors:# spark-submit --jars my-classes.jar my_app.py