Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Resilient Distributed Datasets (RDD)

An RDD is Spark’s foundational data structure — an immutable, distributed collection of elements that can be processed in parallel across a cluster. Every higher-level Spark abstraction (DataFrame, Dataset) compiles down to RDD operations at runtime.


What Makes an RDD “Resilient”

Three properties give RDDs their resilience:

  1. Immutability — once created, data never changes. Transformations produce new RDDs.
  2. Lineage — Spark remembers every transformation that built the RDD. If a partition is lost, it’s recomputed from the lineage — no data replication needed.
  3. Partitioning — data is split across the cluster. Failures affect only individual partitions, which are rebuilt independently.

Creating RDDs

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD Demo").getOrCreate()
sc = spark.sparkContext
# From a Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)
# From a text file
lines = sc.textFile("s3://my-bucket/logs/2025/*.log")
# From another RDD (transformation)
words = lines.flatMap(lambda line: line.split())

Transformations vs Actions

TypeWhat it doesExamplesWhen it runs
TransformationProduces a new RDD — lazilymap, filter, flatMap, distinctOnly when an action is called
ActionTriggers computation, returns a resultcollect, count, reduce, saveAsTextFileImmediately
# Transformations (lazy — nothing runs yet)
sales = sc.parallelize([
("Electronics", 1200), ("Clothing", 450), ("Electronics", 800),
("Books", 120), ("Clothing", 300)
])
electronics = sales.filter(lambda x: x[0] == "Electronics")
amounts = electronics.map(lambda x: x[1])
doubled = amounts.map(lambda x: x * 2)
# Action — triggers the chain above
total = doubled.reduce(lambda a, b: a + b)
print(f"Total: {total}") # 4000

Key Transformations

rdd = sc.parallelize(range(1, 11))
# map — apply function to each element
squared = rdd.map(lambda x: x ** 2)
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
# filter — keep elements matching condition
evens = rdd.filter(lambda x: x % 2 == 0)
# [2, 4, 6, 8, 10]
# flatMap — one-to-many mapping
sentences = sc.parallelize(["hello world", "apache spark"])
words = sentences.flatMap(lambda s: s.split())
# ["hello", "world", "apache", "spark"]
# distinct — remove duplicates
data = sc.parallelize([1, 2, 2, 3, 3, 3])
unique = data.distinct()
# [1, 2, 3]
# union — combine two RDDs
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])
combined = a.union(b) # [1, 2, 3, 3, 4, 5]
# groupByKey / reduceByKey
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
reduced = pairs.reduceByKey(lambda x, y: x + y)
# [("a", 4), ("b", 6)]

Key Actions

rdd = sc.parallelize([10, 20, 30, 40, 50])
rdd.collect() # [10, 20, 30, 40, 50]
rdd.count() # 5
rdd.first() # 10
rdd.take(3) # [10, 20, 30]
rdd.reduce(lambda a, b: a + b) # 150
rdd.max() # 50
rdd.min() # 10
rdd.sum() # 150
rdd.mean() # 30.0
# Save results
rdd.saveAsTextFile("s3://bucket/output/")

Fault Tolerance Through Lineage

# If the "words" RDD loses a partition, Spark recomputes it
# from "lines" — no replication overhead
lines = sc.textFile("hdfs://data/logs.txt") # Step 1
words = lines.flatMap(lambda l: l.split()) # Step 2
pairs = words.map(lambda w: (w, 1)) # Step 3
counts = pairs.reduceByKey(lambda a, b: a + b) # Step 4
# Spark tracks: counts ← reduceByKey ← pairs ← map ← words ← flatMap ← lines ← textFile
# Lost partition in "counts"? Rerun from "lines" for that partition.
counts.saveAsTextFile("hdfs://output/wordcount")

RDD vs DataFrame: When to Use Each

ScenarioUse RDDUse DataFrame
Unstructured data (logs, text)
Custom complex logic not expressible in SQL
Structured/semi-structured data
SQL queries and aggregations
Maximum performance (Catalyst optimizer)
Python type safety not needed
Interop with ML pipelines

In 2025, DataFrames are the default choice for most workloads. Use RDDs only when you need low-level control or are processing truly unstructured data.