
Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Kryo Serialization in Apache Spark

Kryo is a fast, compact Java serialization library that reduces shuffle overhead and memory footprint compared to Java’s default serialization. Switching to Kryo is one of the highest-value single-configuration performance improvements for RDD-heavy workloads.


Enabling Kryo

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KryoApp") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer", "64k") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .getOrCreate()

Class Registration (Scala / Java)

Without registration, Kryo writes the full class name alongside data — partially negating the size benefit. Registering classes tells Kryo to use a compact numeric ID:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator
import com.esotericsoftware.kryo.Kryo

// Option 1: registerKryoClasses in SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[TransactionRecord],
    classOf[CustomerProfile]
  ))

// Option 2: Custom KryoRegistrator for large class sets
// (the class must live in the package named in spark.kryo.registrator below)
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[TransactionRecord])
    kryo.register(classOf[CustomerProfile])
    kryo.register(classOf[org.joda.time.LocalDate])
  }
}

val conf2 = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")
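PySpark jobs cannot call registerKryoClasses directly, but the same registration can be expressed through the spark.kryo.classesToRegister setting, which takes a comma-separated list of fully qualified JVM class names. The sketch below only builds the settings as a plain dictionary; the helper name kryo_settings and the com.example package prefix are assumptions made for illustration.

```python
KRYO_SERIALIZER = "org.apache.spark.serializer.KryoSerializer"


def kryo_settings(class_names, registration_required=False):
    """Build Spark config entries that enable Kryo and register JVM classes."""
    settings = {
        "spark.serializer": KRYO_SERIALIZER,
        # When "true", serializing an unregistered class fails instead of
        # falling back to writing the full class name.
        "spark.kryo.registrationRequired": str(registration_required).lower(),
    }
    if class_names:
        # Comma-separated, fully qualified JVM class names.
        settings["spark.kryo.classesToRegister"] = ",".join(class_names)
    return settings


# Hypothetical fully qualified names; replace with your own packages.
settings = kryo_settings([
    "com.example.TransactionRecord",
    "com.example.CustomerProfile",
])
for key, value in sorted(settings.items()):
    print(f"{key}={value}")
```

Each pair can then be passed to SparkSession.builder.config(key, value) before getOrCreate(), or to spark-submit as --conf key=value. The classes themselves still have to be on the executor classpath.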

Buffer Tuning

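Each Kryo buffer starts at spark.kryoserializer.buffer and grows as needed up to spark.kryoserializer.buffer.max; Spark keeps one buffer per core on each worker. If a single object does not fit within buffer.max, serialization fails with a Buffer overflow exception: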

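# Set these at startup (builder or spark-submit --conf); calling spark.conf.set()
# on a running session does not resize the serializer buffers.
builder = SparkSession.builder \
    .config("spark.kryoserializer.buffer", "64k") \
    .config("spark.kryoserializer.buffer.max", "512m")
# Defaults: buffer = 64k (initial size), buffer.max = 64m (ceiling, must stay below 2048m)
# Error: KryoException: Buffer overflow
# Fix: increase buffer.max to accommodate your largest record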
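One way to choose a buffer.max value is to start from an estimate of the largest serialized record and add headroom. The helper below is a rough heuristic, not a Spark API: the function name, the 2x headroom factor, and the power-of-two rounding are all assumptions. The only hard constraint it encodes is that Spark requires buffer.max to stay below 2048m.

```python
import math

MIB = 1024 * 1024
DEFAULT_MAX_MIB = 64    # Spark's default for spark.kryoserializer.buffer.max
SPARK_LIMIT_MIB = 2047  # the setting must stay below 2048m


def suggest_buffer_max(largest_record_bytes, headroom=2.0):
    """Return a candidate spark.kryoserializer.buffer.max value such as '256m'."""
    if largest_record_bytes > SPARK_LIMIT_MIB * MIB:
        raise ValueError("record cannot fit in one Kryo buffer; split it up")
    needed_mib = math.ceil(largest_record_bytes * headroom / MIB)
    size_mib = DEFAULT_MAX_MIB
    # Double from the default until the estimate (plus headroom) fits.
    while size_mib < needed_mib:
        size_mib *= 2
    return f"{min(size_mib, SPARK_LIMIT_MIB)}m"


# Assumed example: the largest record serializes to roughly 100 MiB.
print(suggest_buffer_max(100 * MIB))
```

The estimate of the largest record has to come from your own data, for example by measuring the biggest rows or broadcast values in the job.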

Performance Comparison

Typical results for a 1M-row shuffle-heavy job:

Serializer     Duration        Shuffle Write Size
Java           ~8.2 seconds    ~420 MB
Kryo           ~2.1 seconds    ~95 MB
Improvement    4× faster       4.4× smaller
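The improvement figures follow directly from the two measured rows. As a quick check, using only the numbers quoted above:

```python
# Figures from the comparison above (approximate, as quoted).
java = {"seconds": 8.2, "shuffle_mb": 420}
kryo = {"seconds": 2.1, "shuffle_mb": 95}

speedup = java["seconds"] / kryo["seconds"]          # how many times faster
size_ratio = java["shuffle_mb"] / kryo["shuffle_mb"]  # how many times smaller
saved_pct = 100 * (1 - kryo["shuffle_mb"] / java["shuffle_mb"])

print(f"{speedup:.1f}x faster, {size_ratio:.1f}x smaller, "
      f"{saved_pct:.0f}% less shuffle data")
```

The duration ratio comes out just under 4, which the comparison rounds to 4×.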

PySpark: Arrow for DataFrame Performance

In PySpark, Python objects are serialized with pickle; Kryo only affects the JVM side. To speed up data exchange between the JVM and Python, enable Apache Arrow:

# Enable Apache Arrow for columnar data exchange
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Arrow speeds up:
# - df.toPandas()
# - spark.createDataFrame(pandas_df)
# - pandas UDFs (vectorized UDFs)
pdf = spark.read.parquet("data.parquet").toPandas() # often far faster with Arrow; still collects all rows to the driver
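Unlike the serializer settings, the Arrow options are SQL configs that can be changed on a running session. A minimal sketch of a helper that turns Arrow on together with two related settings is shown below; the function name enable_arrow and its defaults are assumptions, while the three config keys are standard Spark 3.x settings.

```python
def enable_arrow(spark, fallback=True, max_records_per_batch=10_000):
    """Enable Arrow-based pandas conversion on an existing SparkSession."""
    conf = spark.conf
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # With fallback on, unsupported column types revert to the slower
    # non-Arrow path instead of raising an error.
    conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled",
             str(fallback).lower())
    # Rows per Arrow record batch; smaller batches lower peak memory use.
    conf.set("spark.sql.execution.arrow.maxRecordsPerBatch",
             str(max_records_per_batch))
    return spark
```

Calling enable_arrow(spark) once after the session is created is enough; later toPandas() and createDataFrame(pandas_df) calls pick up the settings.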

Debugging Kryo Issues

# 1. Find unregistered classes: start the job with registrationRequired=true
#    (the default, false, silently writes full class names instead of failing)
# spark-submit --conf spark.kryo.registrationRequired=true my_app.py
# Serialization then fails with: "Class is not registered: com.example.MyClass"
# 2. Confirm Kryo is active
print(spark.conf.get("spark.serializer"))
# "org.apache.spark.serializer.KryoSerializer"
# 3. Buffer overflow
# Error: KryoException: Buffer overflow. Available: X, Required: Y
# Fix: restart with a larger limit (must stay below 2048m), for example:
# spark-submit --conf spark.kryoserializer.buffer.max=1g my_app.py
# 4. Class not found during deserialization
# Ensure the class JAR is on all executors:
# spark-submit --jars my-classes.jar my_app.py
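Once "Class is not registered" messages show up in driver or executor logs, it helps to collect the distinct class names in one pass so they can all be registered together. The sketch below does this with a regular expression; the sample log lines are made up for illustration, and the exact log layout on your cluster may differ.

```python
import re

# Matches the class name that follows Kryo's "Class is not registered:" text.
UNREGISTERED = re.compile(r"Class is not registered: ([\w.$\[\]]+)")


def unregistered_classes(log_text):
    """Return the sorted, de-duplicated class names reported as unregistered."""
    names = {name.rstrip(".") for name in UNREGISTERED.findall(log_text)}
    return sorted(names)


# Hypothetical log excerpt.
sample_log = """
java.lang.IllegalArgumentException: Class is not registered: com.example.TransactionRecord
Note: To register this class use: kryo.register(com.example.TransactionRecord.class);
java.lang.IllegalArgumentException: Class is not registered: com.example.CustomerProfile
java.lang.IllegalArgumentException: Class is not registered: com.example.TransactionRecord
"""

for name in unregistered_classes(sample_log):
    print(name)
```

The resulting names can be added to a KryoRegistrator or to the spark.kryo.classesToRegister setting.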