Kryo Serialization in Apache Spark

Kryo is a fast, compact Java serialization library that reduces shuffle overhead and memory footprint compared to Java’s default serialization. Switching to Kryo is one of the highest-value single-configuration performance improvements for RDD-heavy workloads.

Enabling Kryo

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KryoApp") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer",     "64k") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .getOrCreate()

Class Registration (Scala / Java)

Without registration, Kryo writes the full class name alongside data — partially negating the size benefit. Registering classes tells Kryo to use a compact numeric ID:

import org.apache.spark.{SparkConf, SparkContext}
import com.esotericsoftware.kryo.Kryo

// Option 1: registerKryoClasses in SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[TransactionRecord],
    classOf[CustomerProfile]
  ))

// Option 2: Custom KryoRegistrator for large class sets
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[TransactionRecord])
    kryo.register(classOf[CustomerProfile])
    kryo.register(classOf[org.joda.time.LocalDate])
  }
}

val conf2 = new SparkConf()
  .set("spark.serializer",    "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")

Buffer Tuning

Kryo uses a fixed-size buffer per thread. If an object exceeds the buffer, a Buffer Overflow exception is thrown:

spark.conf.set("spark.kryoserializer.buffer",     "64k")   # Initial (default: 64k)
spark.conf.set("spark.kryoserializer.buffer.max", "512m")  # Maximum (default: 64m)

# Error: KryoException: Buffer overflow
# Fix: increase buffer.max to accommodate your largest record

Performance Comparison

Typical results for a 1M-row shuffle-heavy job:

Serializer	Duration	Shuffle Write Size
Java	~8.2 seconds	~420 MB
Kryo	~2.1 seconds	~95 MB
Improvement	4× faster	4.4× smaller

PySpark: Arrow for DataFrame Performance

In PySpark, Python objects use pickle — Kryo only affects the JVM layer. For Python-to-JVM exchange:

# Enable Apache Arrow for columnar data exchange
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Arrow speeds up:
# - df.toPandas()
# - spark.createDataFrame(pandas_df)
# - pandas UDFs (vectorized UDFs)

pdf = spark.read.parquet("data.parquet").toPandas()  # 10× faster with Arrow

Debugging Kryo Issues

# 1. Log warnings for unregistered classes (instead of errors)
spark.conf.set("spark.kryo.registrationRequired", "false")
# Check driver/executor logs for: "Class is not registered: com.example.MyClass"

# 2. Confirm Kryo is active
print(spark.conf.get("spark.serializer"))
# "org.apache.spark.serializer.KryoSerializer"

# 3. Buffer overflow
# Error: KryoException: Buffer overflow. Available: X, Required: Y
spark.conf.set("spark.kryoserializer.buffer.max", "1g")

# 4. Class not found during deserialization
# Ensure the class JAR is on all executors:
# spark-submit --jars my-classes.jar my_app.py