Resilient Distributed Datasets (RDD)
An RDD is Spark’s foundational data structure — an immutable, distributed collection of elements that can be processed in parallel across a cluster. Every higher-level Spark abstraction (DataFrame, Dataset) compiles down to RDD operations at runtime.
What Makes an RDD “Resilient”
Three properties give RDDs their resilience:
- Immutability — once created, data never changes. Transformations produce new RDDs.
- Lineage — Spark remembers every transformation that built the RDD. If a partition is lost, it’s recomputed from the lineage — no data replication needed.
- Partitioning — data is split across the cluster. Failures affect only individual partitions, which are rebuilt independently.
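To make the three properties concrete, here is a minimal sketch in plain Python rather than PySpark. `ToyRDD`, `lose_partition`, `get_partition`, and `builds` are hypothetical names invented for this illustration. The toy only models one-to-one (map-style) dependencies and is not how Spark is actually implemented.

```python
class ToyRDD:
    """Toy model of an RDD: read-only partitions plus recorded lineage.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.parent = parent    # lineage: the ToyRDD this one derives from
        self.fn = fn            # lineage: per-element function applied to the parent
        self.builds = 0         # number of partitions built from lineage so far
        if parent is None:
            # Source data; tuples keep each partition immutable.
            self._partitions = [tuple(p) for p in source_partitions]
        else:
            # Derived data is materialized on demand, one partition at a time.
            self._partitions = [None] * len(parent._partitions)

    def map(self, fn):
        # A transformation never mutates self; it returns a new ToyRDD.
        return ToyRDD(parent=self, fn=fn)

    def lose_partition(self, index):
        # Simulate a failure that drops one materialized partition.
        self._partitions[index] = None

    def get_partition(self, index):
        if self._partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("toy model cannot reload lost source data")
            # Rebuild just this partition from the matching parent partition.
            parent_part = self.parent.get_partition(index)
            self._partitions[index] = tuple(self.fn(x) for x in parent_part)
            self.builds += 1
        return self._partitions[index]

    def collect(self):
        return [x for i in range(len(self._partitions))
                for x in self.get_partition(i)]


base = ToyRDD(source_partitions=[[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)

before = squared.collect()      # builds all three partitions
squared.lose_partition(1)       # one partition disappears
after = squared.collect()       # only the lost partition is rebuilt
```

After the simulated failure, `after` equals `before`, and `squared.builds` goes from 3 to 4: only the missing partition was recomputed, and `base` was never modified.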
Creating RDDs
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD Demo").getOrCreate()
sc = spark.sparkContext

# From a Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)

# From a text file
lines = sc.textFile("s3://my-bucket/logs/2025/*.log")

# From another RDD (transformation)
words = lines.flatMap(lambda line: line.split())

Transformations vs Actions
| Type | What it does | Examples | When it runs |
|---|---|---|---|
| Transformation | Produces a new RDD — lazily | map, filter, flatMap, distinct | Only when an action is called |
| Action | Triggers computation, returns a result | collect, count, reduce, saveAsTextFile | Immediately |
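The lazy/eager split in the table can be imitated in plain Python. The sketch below is a toy, not PySpark: `LazyPipeline` and the `calls` log are hypothetical names used only to show that transformations merely record work, while an action runs it.

```python
class LazyPipeline:
    """Toy model of lazy transformations (hypothetical, not a Spark API)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)   # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: record the step and return a new pipeline.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new pipeline.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the data as iterators.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: execute every recorded step and return the results.
        return list(self._run())

    def count(self):
        # Action: execute every recorded step and count the results.
        return sum(1 for _ in self._run())


calls = []  # records each element that `double` actually processes

def double(x):
    calls.append(x)
    return x * 2

pipeline = LazyPipeline([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)   # still 0: nothing has executed
result = pipeline.collect()        # the action runs filter, then map
```

Building `pipeline` leaves `calls` empty. Only `collect()` invokes `double`, and only on the elements that survive the filter.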
# Transformations (lazy: nothing runs yet)
sales = sc.parallelize([
    ("Electronics", 1200),
    ("Clothing", 450),
    ("Electronics", 800),
    ("Books", 120),
    ("Clothing", 300),
])

electronics = sales.filter(lambda x: x[0] == "Electronics")
amounts = electronics.map(lambda x: x[1])
doubled = amounts.map(lambda x: x * 2)

# Action: triggers the chain above
total = doubled.reduce(lambda a, b: a + b)
print(f"Total: {total}")  # 4000

Key Transformations
rdd = sc.parallelize(range(1, 11))

# map: apply a function to each element
squared = rdd.map(lambda x: x ** 2)
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# filter: keep elements matching a condition
evens = rdd.filter(lambda x: x % 2 == 0)
# [2, 4, 6, 8, 10]

# flatMap: one-to-many mapping
sentences = sc.parallelize(["hello world", "apache spark"])
words = sentences.flatMap(lambda s: s.split())
# ["hello", "world", "apache", "spark"]

# distinct: remove duplicates (result order is not guaranteed)
data = sc.parallelize([1, 2, 2, 3, 3, 3])
unique = data.distinct()
# [1, 2, 3] in some order

# union: combine two RDDs (duplicates are kept)
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])
combined = a.union(b)
# [1, 2, 3, 3, 4, 5]

# reduceByKey: merge the values for each key. Prefer it to groupByKey for
# aggregations, since it combines values within each partition before the shuffle.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
reduced = pairs.reduceByKey(lambda x, y: x + y)
# [("a", 4), ("b", 6)] in some order

Key Actions
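Actions such as `reduce` combine values within each partition first and then merge the partial results, which is why the function passed to `reduce` should be commutative and associative. The plain-Python sketch below imitates that behavior. `split_into_partitions` and `partitioned_reduce` are hypothetical helpers, not PySpark APIs, and the contiguous chunking is an assumption made for illustration.

```python
from functools import reduce
import operator


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks (assumed layout for this toy)."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def partitioned_reduce(partitions, fn):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


data = [10, 20, 30, 40, 50]
partitions = split_into_partitions(data, 2)   # [[10, 20, 30], [40, 50]]

# Addition is commutative and associative, so partitioning does not matter.
total = partitioned_reduce(partitions, operator.add)

# Subtraction is neither, so the partitioned result depends on the split.
sequential_diff = reduce(operator.sub, data)
partitioned_diff = partitioned_reduce(partitions, operator.sub)
```

With addition, the partitioned result matches a plain sequential reduce. With subtraction, `sequential_diff` and `partitioned_diff` disagree, which is the kind of partition-dependent result a non-associative function can produce on a real cluster.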
rdd = sc.parallelize([10, 20, 30, 40, 50])

rdd.collect()                    # [10, 20, 30, 40, 50]
rdd.count()                      # 5
rdd.first()                      # 10
rdd.take(3)                      # [10, 20, 30]
rdd.reduce(lambda a, b: a + b)   # 150
rdd.max()                        # 50
rdd.min()                        # 10
rdd.sum()                        # 150
rdd.mean()                       # 30.0

# Save results
rdd.saveAsTextFile("s3://bucket/output/")

Fault Tolerance Through Lineage
# If the "words" RDD loses a partition, Spark recomputes it
# from "lines", with no replication overhead

# Note: hdfs:/// (three slashes) uses the default NameNode; with hdfs://data/...
# the first path segment would be parsed as the NameNode host.
lines = sc.textFile("hdfs:///data/logs.txt")        # Step 1
words = lines.flatMap(lambda l: l.split())          # Step 2
pairs = words.map(lambda w: (w, 1))                 # Step 3
counts = pairs.reduceByKey(lambda a, b: a + b)      # Step 4

# Spark tracks: counts ← reduceByKey ← pairs ← map ← words ← flatMap ← lines ← textFile
# Lost partition in "counts"? Spark replays the lineage back toward "lines"
# for the data that partition needs.
counts.saveAsTextFile("hdfs:///output/wordcount")

RDD vs DataFrame: When to Use Each
| Scenario | Use RDD | Use DataFrame |
|---|---|---|
| Unstructured data (logs, text) | ✅ | ❌ |
| Custom complex logic not expressible in SQL | ✅ | ❌ |
| Structured/semi-structured data | ❌ | ✅ |
| SQL queries and aggregations | ❌ | ✅ |
| Maximum performance (Catalyst optimizer) | ❌ | ✅ |
| Python type safety not needed | ✅ | ✅ |
| Interop with ML pipelines | ❌ | ✅ |
In 2025, DataFrames are the default choice for most workloads. Use RDDs only when you need low-level control or are processing truly unstructured data.