What is an RDD in Apache Spark?

An RDD (Resilient Distributed Dataset) is Apache Spark’s foundational data abstraction: an immutable, distributed collection of objects that can be processed in parallel across a cluster. The higher-level DataFrame and Dataset APIs are built on top of RDDs and execute as RDD operations under the hood.

The Three RDD Properties

1. Resilient — fault-tolerant through lineage. If a partition is lost (due to node failure), Spark can recompute it from the lineage graph without requiring data replication.

2. Distributed — partitioned across multiple nodes. Each partition is processed independently by one task on one executor core.

3. Dataset — a collection of serializable objects (strings, integers, tuples, case classes, or any Java/Python objects).

RDD Characteristics

Property	Description
Immutable	Cannot be modified after creation; transformations produce new RDDs
Lazy	Transformations are not executed until an action triggers computation
Partitioned	Split into chunks distributed across the cluster
Typed	Strongly typed in Scala/Java; dynamically typed in Python
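In-memory	Not materialized by default; cache() or persist() keeps computed partitions in executor memory. With MEMORY_ONLY (what cache() uses for RDDs), partitions that do not fit are recomputed on demand; MEMORY_AND_DISK spills them to disk instead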

Creating RDDs

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD Basics").getOrCreate()
sc = spark.sparkContext

# From a Python collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize(range(1000), numSlices=8)   # 8 partitions

# From a file
rdd3 = sc.textFile("s3://bucket/logs/*.log")
rdd4 = sc.textFile("hdfs://namenode/data.txt", minPartitions=4)

# From another RDD (transformation)
rdd5 = rdd1.filter(lambda x: x % 2 == 0)
rdd6 = rdd3.map(lambda line: line.upper())

How Fault Tolerance Works

# Step 1: Create a chain of transformations
raw  = sc.textFile("s3://bucket/logs.txt")
filt = raw.filter(lambda l: "ERROR" in l)
msgs = filt.map(lambda l: l.split("|")[2])

# Step 2: If an executor fails mid-job and loses a partition of "msgs":
# - Spark does NOT need a backup copy of the data
# - Spark replays: textFile → filter → map for that partition only
# - Job continues without data loss

msgs.count()   # Completes despite a node failure, provided the source file is still readable and task retries are not exhausted

RDD Operations

rdd = sc.parallelize(["hello world", "apache spark", "big data"])

# Transformations (lazy)
words   = rdd.flatMap(lambda s: s.split())          # ["hello", "world", ...]
lengths = words.map(lambda w: (w, len(w)))          # [("hello", 5), ("world", 5), ...]
long_w  = words.filter(lambda w: len(w) > 4)       # ["hello", "world", "apache", "spark"]

# Actions (trigger execution)
long_w.collect()      # ["hello", "world", "apache", "spark"]
long_w.count()        # 4
long_w.take(2)        # ["hello", "world"]

RDD vs DataFrame: When to Use Each

In 2025, use DataFrames for the vast majority of work:

Use RDDs When…	Use DataFrames When…
Processing unstructured data (raw text, binary)	Structured/semi-structured data
Complex custom logic that doesn’t map to DataFrame API	SQL-like transformations
Low-level control over partitioning	Automatic Catalyst optimization
Legacy Spark 1.x code	New code (prefer DataFrames by default)
Per-record or per-partition logic on arbitrary Python objects (closures must still be picklable)	Standard column operations

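DataFrames are usually faster (Catalyst query optimization plus Tungsten's binary memory format and code generation), safer (schema enforcement), and more concise. The gap is largest in PySpark, where RDD operations serialize every record between the JVM and Python worker processes.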