Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Creating an RDD from a Collection

sc.parallelize() is the simplest way to create an RDD — it distributes a Python collection (list, range, or any iterable) across the cluster’s partitions. This is the starting point for learning Spark transformations without needing real data files.


Basic parallelize()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD Collection").getOrCreate()
sc = spark.sparkContext
# From a list of integers
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
print(rdd1.collect()) # [1, 2, 3, 4, 5]
# From a range
rdd2 = sc.parallelize(range(1000))
print(rdd2.count()) # 1000
# From a list of strings
rdd3 = sc.parallelize(["apple", "banana", "cherry"])
rdd3.foreach(print)
# From a list of tuples (key-value pairs)
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
# [("a", 4), ("b", 2), ("c", 1)]

Controlling the Number of Partitions

The numSlices parameter sets how many partitions the collection is split into:

# Default: sc.defaultParallelism (usually 2 × CPU cores)
rdd_default = sc.parallelize(range(100))
print(rdd_default.getNumPartitions()) # e.g., 8 on 4 cores
# Explicit partition count
rdd_4 = sc.parallelize(range(100), numSlices=4)
print(rdd_4.getNumPartitions()) # 4
rdd_1 = sc.parallelize(range(100), numSlices=1)
print(rdd_1.getNumPartitions()) # 1
# Too few partitions → underutilizes the cluster
# Too many partitions → scheduling overhead dominates (for small data)
# Rule of thumb for small in-memory data: 2-4 × number of cores

What Each Partition Gets

# See which elements are in which partition
rdd = sc.parallelize(range(10), numSlices=4)
rdd.mapPartitionsWithIndex(
lambda i, it: [(i, list(it))]
).collect()
# [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7]), (3, [8, 9])]

Various Collection Types

# Nested lists
nested = sc.parallelize([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
nested.map(sum).collect() # [6, 9, 30]
# Dictionaries
dicts = sc.parallelize([
{"name": "Alice", "score": 92},
{"name": "Bob", "score": 78},
])
dicts.map(lambda d: (d["name"], d["score"])).collect()
# Mixed types (possible in Python, but avoid it)
mixed = sc.parallelize([1, "two", 3.0])
mixed.collect() # [1, "two", 3.0]

parallelize() vs Realistic Data Loading

parallelize() is ideal for testing but has real-world limitations:

# FINE for:
# - Unit tests
# - Learning/experimenting
# - Small lookup tables to broadcast
small_lookup = sc.parallelize(list(range(100)))
# Use file-based reading for production:
# - Data that doesn't fit in driver memory can't be parallelized from a Python list
# - sc.textFile() / spark.read.parquet() streams data directly from storage
production_rdd = sc.textFile("s3://bucket/large-data/*.txt")

Converting Parallelized RDD to DataFrame

from pyspark.sql import Row
data = [
Row(name="Alice", salary=95000, dept="Engineering"),
Row(name="Bob", salary=72000, dept="Marketing"),
]
rdd = sc.parallelize(data)
df = spark.createDataFrame(rdd)
df.show()