Resilient Distributed Datasets (RDD)

An RDD is Spark’s foundational data structure — an immutable, distributed collection of elements that can be processed in parallel across a cluster. Every higher-level Spark abstraction (DataFrame, Dataset) compiles down to RDD operations at runtime.

What Makes an RDD “Resilient”

Three properties give RDDs their resilience:

Immutability — once created, data never changes. Transformations produce new RDDs.
Lineage — Spark remembers every transformation that built the RDD. If a partition is lost, it’s recomputed from the lineage — no data replication needed.
Partitioning — data is split across the cluster. Failures affect only individual partitions, which are rebuilt independently.

Creating RDDs

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD Demo").getOrCreate()
sc = spark.sparkContext

# From a Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)

# From a text file
lines = sc.textFile("s3://my-bucket/logs/2025/*.log")

# From another RDD (transformation)
words = lines.flatMap(lambda line: line.split())

Transformations vs Actions

Type	What it does	Examples	When it runs
Transformation	Produces a new RDD — lazily	`map`, `filter`, `flatMap`, `distinct`	Only when an action is called
Action	Triggers computation, returns a result	`collect`, `count`, `reduce`, `saveAsTextFile`	Immediately

# Transformations (lazy — nothing runs yet)
sales = sc.parallelize([
    ("Electronics", 1200), ("Clothing", 450), ("Electronics", 800),
    ("Books", 120), ("Clothing", 300)
])

electronics = sales.filter(lambda x: x[0] == "Electronics")
amounts     = electronics.map(lambda x: x[1])
doubled     = amounts.map(lambda x: x * 2)

# Action — triggers the chain above
total = doubled.reduce(lambda a, b: a + b)
print(f"Total: {total}")  # 4000

Key Transformations

rdd = sc.parallelize(range(1, 11))

# map — apply function to each element
squared = rdd.map(lambda x: x ** 2)
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# filter — keep elements matching condition
evens = rdd.filter(lambda x: x % 2 == 0)
# [2, 4, 6, 8, 10]

# flatMap — one-to-many mapping
sentences = sc.parallelize(["hello world", "apache spark"])
words = sentences.flatMap(lambda s: s.split())
# ["hello", "world", "apache", "spark"]

# distinct — remove duplicates
data = sc.parallelize([1, 2, 2, 3, 3, 3])
unique = data.distinct()
# [1, 2, 3]

# union — combine two RDDs
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])
combined = a.union(b)  # [1, 2, 3, 3, 4, 5]

# groupByKey / reduceByKey
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
reduced = pairs.reduceByKey(lambda x, y: x + y)
# [("a", 4), ("b", 6)]

Key Actions

rdd = sc.parallelize([10, 20, 30, 40, 50])

rdd.collect()          # [10, 20, 30, 40, 50]
rdd.count()            # 5
rdd.first()            # 10
rdd.take(3)            # [10, 20, 30]
rdd.reduce(lambda a, b: a + b)  # 150
rdd.max()              # 50
rdd.min()              # 10
rdd.sum()              # 150
rdd.mean()             # 30.0

# Save results
rdd.saveAsTextFile("s3://bucket/output/")

Fault Tolerance Through Lineage

# If the "words" RDD loses a partition, Spark recomputes it
# from "lines" — no replication overhead

lines = sc.textFile("hdfs://data/logs.txt")          # Step 1
words = lines.flatMap(lambda l: l.split())           # Step 2
pairs = words.map(lambda w: (w, 1))                  # Step 3
counts = pairs.reduceByKey(lambda a, b: a + b)       # Step 4

# Spark tracks: counts ← reduceByKey ← pairs ← map ← words ← flatMap ← lines ← textFile
# Lost partition in "counts"? Rerun from "lines" for that partition.
counts.saveAsTextFile("hdfs://output/wordcount")

RDD vs DataFrame: When to Use Each

Scenario	Use RDD	Use DataFrame
Unstructured data (logs, text)	✅	❌
Custom complex logic not expressible in SQL	✅	❌
Structured/semi-structured data	❌	✅
SQL queries and aggregations	❌	✅
Maximum performance (Catalyst optimizer)	❌	✅
Python type safety not needed	✅	✅
Interop with ML pipelines	❌	✅

In 2025, DataFrames are the default choice for most workloads. Use RDDs only when you need low-level control or are processing truly unstructured data.