What is Persistence/Caching in Spark?
Persistence (or caching) is Spark’s mechanism to store intermediate results across operations. When you persist an RDD or DataFrame:
- It gets stored in memory or disk across nodes
- Subsequent actions reuse the cached data instead of recomputing
- You trade memory/disk space for computation time
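A minimal sketch of that flow in PySpark (the file name and column used below are placeholders, not from this article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistBasics").getOrCreate()

df = spark.read.parquet("events.parquet")  # placeholder input path
df.persist()                               # mark for caching at the default storage level
df.count()                                 # first action computes the data and stores it
df.filter("status = 'ok'").count()         # reuses the cached rows ("status" is a placeholder column)
df.unpersist()                             # free the storage once the data is no longer needed
```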
Real-World Analogy:
Imagine baking cookies:
- Without caching: Make dough from scratch every time
- With caching: Store prepared dough in fridge (memory) or freezer (disk)
- Tradeoff: Uses storage space but saves prep time
Why Persistence is Crucial
- Performance Boost - Avoid recomputing expensive transformations
- Iterative Algorithms - Essential for ML (10-100x speedups)
- Interactive Analysis - Faster response for repeated queries
- Resource Optimization - Better cluster utilization
In practice, proper caching can make iterative Spark workloads dramatically faster, because expensive lineages are computed once instead of on every pass.
Storage Levels Explained
Spark offers multiple persistence options:
| Storage Level | Memory | Disk | Serialized | Replicated |
|---|---|---|---|---|
| MEMORY_ONLY | ✅ | ❌ | ❌ | ❌ |
| MEMORY_ONLY_SER | ✅ | ❌ | ✅ | ❌ |
| MEMORY_AND_DISK | ✅ | ✅ | ❌ | ❌ |
| DISK_ONLY | ❌ | ✅ | ❌ | ❌ |
| OFF_HEAP | ✅ (off-heap) | ✅ | ✅ | ❌ |

Each level also has a _2 variant that keeps a replica of every partition on a second node.
Pro Tip: In the Scala/Java API, MEMORY_ONLY_SER often gives the best memory efficiency. In PySpark, data is always serialized (pickled), so the _SER variants are not exposed separately.
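To make the table concrete, here is a short sketch of selecting and verifying a storage level in PySpark (illustrative values only):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageLevels").getOrCreate()

df = spark.range(1, 1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk instead of recomputing
df.count()                                # materialize the cache
print(df.storageLevel)                    # confirm which level is in effect
df.unpersist()

rdd = spark.sparkContext.parallelize(range(1_000_000))
rdd.persist(StorageLevel.DISK_ONLY)       # keep executor memory free, accept slower reads
```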
3 Practical Caching Examples
Example 1: Basic RDD Caching
```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "CachingDemo")

# Create and process the RDD
rdd = sc.parallelize(range(1, 1000000)) \
    .filter(lambda x: x % 2 == 0) \
    .map(lambda x: x ** 2)

# Cache in memory only
rdd.persist(StorageLevel.MEMORY_ONLY)

# First action computes and caches
print("Sum:", rdd.sum())

# Subsequent actions use the cache
print("Count:", rdd.count())
print("Max:", rdd.max())
```
Key Insight: The first action takes longer as it computes and caches, while subsequent actions are faster.
Example 2: DataFrame Caching with Storage Levels
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DFCaching").getOrCreate()

# Create and process the DataFrame
df = spark.range(1, 1000000) \
    .filter("id % 3 = 0") \
    .withColumn("square", col("id") ** 2)

# Cache to memory and disk
df.persist(StorageLevel.MEMORY_AND_DISK)

# First action computes and caches
df.show(5)

# Subsequent actions reuse the cache
print("Rows:", df.count())
df.groupBy().avg().show()
```
When to Use: Large datasets that might not fit fully in memory.
Example 3: Caching for Iterative Algorithms
```python
# K-means clustering example
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Load and prepare the data (inferSchema=True so the feature columns are numeric)
data = spark.read.csv("data.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["feat1", "feat2"], outputCol="features")
dataset = assembler.transform(data).cache()  # Cache the feature vectors

# Iterative training
kmeans = KMeans(k=3, maxIter=20)
model = kmeans.fit(dataset)  # Reuses the cached data in each iteration

# Cleanup
dataset.unpersist()
```
Why Cache Here? Without caching, Spark would recompute feature transformations in every iteration.
When to Cache (and When Not To)
Good Candidates for Caching:
- Iterative algorithms (ML training)
- Frequently accessed datasets
- Expensive transformation pipelines
- Interactive analysis sessions
When to Avoid Caching:
- One-time use datasets
- Very large datasets (unless using disk)
- Streaming applications
- When memory is scarce
Interview Preparation Guide
Common Questions:
- “Explain Spark’s persistence model”
  - Discuss storage levels and their tradeoffs
- “When would you use MEMORY_ONLY vs MEMORY_AND_DISK?”
  - MEMORY_ONLY for smaller datasets, MEMORY_AND_DISK when spillover is possible
- “How does caching affect garbage collection?”
  - More caching → more GC pressure → consider serialized storage (MEMORY_ONLY_SER in the JVM APIs)
Memory Aids:
- “Cache what you reuse, skip what you use once”
- “MEMORY_ONLY_SER = Space saver”
- “Unpersist is as important as persist”
Advanced Caching Techniques
1. LRU Caching Behavior
Spark automatically drops cached partitions using Least Recently Used policy when memory fills up.
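As a hedged illustration of the knobs behind that behavior: under the unified memory manager, spark.memory.fraction and spark.memory.storageFraction determine how much heap the storage (cache) region can occupy before LRU eviction kicks in. A minimal configuration sketch (the values shown are simply the defaults):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("CacheMemoryTuning")
    .config("spark.memory.fraction", "0.6")         # share of heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")  # portion of that protected for cached blocks
    .getOrCreate()
)
```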
2. Manual Uncaching
```python
rdd.unpersist()              # Marks the RDD for removal (non-blocking in recent versions)
df.unpersist(blocking=True)  # Blocks until the cached data is actually freed
```
3. Monitoring Cache Usage
```python
# The Scala/Java API exposes sc.getRDDStorageInfo (memory used, disk used, etc.).
# In PySpark, check the level on each dataset or use the Spark UI "Storage" tab:
print(rdd.getStorageLevel())
print(df.storageLevel)
```
Performance Considerations
- Serialization - serialized levels (e.g. MEMORY_ONLY_SER on the JVM) save space but add CPU overhead for encoding and decoding
- GC Overhead - many deserialized Java objects create more GC pressure than compact serialized bytes
- Disk Spill - MEMORY_AND_DISK avoids recomputation, but reading spilled partitions from disk is slower than memory
- Cluster Memory - as a rule of thumb, don't dedicate more than about 60% of available memory to cached data; leave headroom for execution
Golden Rule: Benchmark different storage levels for your specific workload!
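One way to follow that rule is a small timing harness. The sketch below is illustrative only (results depend entirely on your data and cluster); it times a cached count() under a few storage levels:

```python
import time

from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CacheBenchmark").getOrCreate()

def time_cached_count(dataset, level):
    """Persist at the given level, materialize it, then time a count served from the cache."""
    dataset.persist(level)
    dataset.count()                      # first action: compute and cache
    start = time.time()
    dataset.count()                      # second action: read from the cache
    elapsed = time.time() - start
    dataset.unpersist(blocking=True)
    return elapsed

df = spark.range(1, 10_000_000).withColumn("square", col("id") ** 2)

for level in (StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_AND_DISK, StorageLevel.DISK_ONLY):
    print(level, round(time_cached_count(df, level), 3), "seconds")
```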
Real-World Use Cases
- Machine Learning Pipelines
  ```python
  training_data.preprocess().cache()  # preprocess() stands in for your feature pipeline
  model.fit(training_data)            # Multiple passes over the same data
  ```
- Dashboard Backends
  ```python
  # Cache aggregated data for fast queries
  daily_stats = sales.groupBy("date").agg(...).cache()
  ```
- Graph Algorithms
  ```python
  # PageRank needs repeated access to edges
  graph.edges.cache()
  ```
Troubleshooting Cache Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| No speedup | Data not actually cached (no action has run yet) | Verify with df.storageLevel / rdd.getStorageLevel() or the Spark UI Storage tab |
| OOM errors | Too much data cached | Use disk-backed or serialized storage levels |
| Slow caching | Wrong storage level | Try a serialized level (MEMORY_ONLY_SER on the JVM) |
Conclusion: Key Takeaways
- Cache strategically - only datasets you actually reuse
- Choose storage level wisely - Balance speed vs memory
- Monitor usage - Check Spark UI storage tab
- Clean up - unpersist() when done
Pro Tip: For DataFrames, df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK), whereas rdd.cache() defaults to MEMORY_ONLY!
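A quick way to confirm that default on your own setup (the printed representation varies slightly across Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDefault").getOrCreate()

df = spark.range(10)
df.cache()
print(df.storageLevel)  # typically memory + disk, deserialized, 1x replicated
```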