
Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

What is an RDD in Apache Spark?

An RDD (Resilient Distributed Dataset) is Apache Spark’s foundational data abstraction — an immutable, distributed collection of objects that can be processed in parallel across a cluster. Every higher-level Spark API (DataFrame, Dataset) is built on top of RDDs.


The Three RDD Properties

1. Resilient — fault-tolerant through lineage. If a partition is lost (due to node failure), Spark can recompute it from the lineage graph without requiring data replication.

2. Distributed — partitioned across multiple nodes. Each partition is processed independently by one task on one executor core.

3. Dataset — a collection of serializable objects (strings, integers, tuples, case classes, or any Java/Python objects).
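To make the "Distributed" property concrete, here is a toy model in plain Python (not Spark code) of cutting a collection into partitions and giving each one to an independent task. The helper names split_into_partitions and run_task are hypothetical. The even-slice rule is an assumption chosen for illustration; Spark's own slicing may differ in detail.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_slices):
    """Cut a list into num_slices contiguous chunks of near-equal size."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def run_task(partition):
    """One 'task': works on a single partition and never sees the others."""
    return sum(x * x for x in partition)


partitions = split_into_partitions(list(range(10)), 3)

# Each partition is processed independently; the pool stands in for executor cores.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(run_task, partitions))

# Combining the per-partition results plays the role of an action such as sum().
total = sum(partial_sums)
```

Because no task needs data from another partition, the work parallelizes without coordination. Only the final combine step touches all the partial results.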


RDD Characteristics

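| Property | Description |
| --- | --- |
| Immutable | Cannot be modified after creation; transformations produce new RDDs |
| Lazy | Transformations are not executed until an action triggers computation |
| Partitioned | Split into chunks distributed across the cluster |
| Typed | Strongly typed in Scala/Java; dynamically typed in Python |
| In-memory | Not stored by default: each action recomputes the RDD from its lineage. cache() keeps partitions in executor memory (MEMORY_ONLY); partitions that do not fit are recomputed on demand, and spill to disk only with a storage level such as MEMORY_AND_DISK |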
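The "Immutable" and "Lazy" rows can be illustrated with a small plain-Python model. This is a sketch of the idea only, not how Spark is implemented: LazyPipeline and traced_double are hypothetical names, and the model runs on one machine with no partitioning.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self.source = source        # the input data
        self.steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a NEW pipeline and does no work yet.
        return LazyPipeline(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: same pattern, the original pipeline is left unchanged.
        return LazyPipeline(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = iter(self.source)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def traced_double(x):
    calls.append(x)   # records when the function actually runs
    return x * 2


base = LazyPipeline([1, 2, 3, 4])
doubled = base.map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has executed
result = doubled.collect()         # the action triggers the whole chain
```

Building the chain only records steps; traced_double is not called until collect() runs, and base keeps its empty step list because every transformation returns a new object.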

Creating RDDs

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD Basics").getOrCreate()
sc = spark.sparkContext
# From a Python collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize(range(1000), numSlices=8) # 8 partitions
# From a file
rdd3 = sc.textFile("s3://bucket/logs/*.log")
rdd4 = sc.textFile("hdfs://namenode/data.txt", minPartitions=4)
# From another RDD (transformation)
rdd5 = rdd1.filter(lambda x: x % 2 == 0)
rdd6 = rdd3.map(lambda line: line.upper())

How Fault Tolerance Works

# Step 1: Create a chain of transformations
raw = sc.textFile("s3://bucket/logs.txt")
filt = raw.filter(lambda l: "ERROR" in l)
msgs = filt.map(lambda l: l.split("|")[2]) # assumes pipe-delimited lines with at least 3 fields
# Step 2: If an executor fails mid-job and loses a partition of "msgs":
# - Spark does NOT need a backup copy of the RDD
# - Spark replays textFile → filter → map for that partition only
# - The job continues without data loss
msgs.count() # Still completes, provided the source file is readable and task retries are not exhausted
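The recovery steps in the comments above can be modeled in a few lines of plain Python. This is a toy sketch, not Spark's implementation: ToyRDD, transform, compute, and the reads list are hypothetical names. As a simplification the toy keeps every computed partition in memory, whereas Spark keeps none unless the RDD is cached.

```python
reads = []   # records which source partitions were read, and how often


class ToyRDD:
    """Toy model of lineage: each dataset remembers its parent and its function."""

    def __init__(self, parent=None, fn=None, source_partitions=None):
        self.parent = parent
        self.fn = fn                                 # per-partition function
        self.source_partitions = source_partitions   # only set on the root
        self.materialized = {}                       # partition index -> rows

    def transform(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if it is missing."""
        if index in self.materialized:
            return self.materialized[index]
        if self.parent is None:
            reads.append(index)                      # re-read the durable source
            rows = list(self.source_partitions[index])
        else:
            rows = self.fn(self.parent.compute(index))
        self.materialized[index] = rows
        return rows


# Made-up sample data: two partitions of pipe-delimited lines.
source = [["t1|ERROR|disk full", "t2|INFO|started"],
          ["t3|ERROR|timeout", "t4|ERROR|oom"]]
raw = ToyRDD(source_partitions=source)
filt = raw.transform(lambda rows: [r for r in rows if "ERROR" in r])
msgs = filt.transform(lambda rows: [r.split("|")[2] for r in rows])

first = [msgs.compute(i) for i in range(2)]

# Simulate an executor failure: partition 1 disappears from memory everywhere.
for rdd in (raw, filt, msgs):
    rdd.materialized.pop(1, None)

recovered = msgs.compute(1)   # replays source -> filter -> map for partition 1 only
```

After the simulated failure, only partition 1 is read from the source again; partition 0 is never recomputed. That is the property that lets lineage stand in for replication.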

RDD Operations

rdd = sc.parallelize(["hello world", "apache spark", "big data"])
# Transformations (lazy)
words = rdd.flatMap(lambda s: s.split()) # ["hello", "world", ...]
lengths = words.map(lambda w: (w, len(w))) # [("hello", 5), ("world", 5), ...]
long_w = words.filter(lambda w: len(w) > 4) # ["hello", "world", "apache", "spark"]
# Actions (trigger execution)
long_w.collect() # ["hello", "world", "apache", "spark"]
long_w.count() # 4
long_w.take(2) # ["hello", "world"]

RDD vs DataFrame: When to Use Each

For new code, DataFrames are the better choice for the vast majority of work:

| Use RDDs When… | Use DataFrames When… |
| --- | --- |
| Processing unstructured data (raw text, binary) | Structured/semi-structured data |
| Complex custom logic that doesn’t map to the DataFrame API | SQL-like transformations |
| Low-level control over partitioning | Automatic Catalyst optimization |
| Maintaining legacy Spark 1.x code | New code: prefer DataFrames |
| Per-partition setup of objects that cannot be serialized (for example, a client created inside mapPartitions) | Standard column operations |

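DataFrames are usually faster (Catalyst query optimization plus Tungsten's compact binary memory format and code generation), safer (column names and types are checked when the query is analyzed, before any data is processed), and more concise. The speed gap is widest in PySpark, where RDD lambdas run row by row in Python worker processes while built-in DataFrame expressions run inside the JVM.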