Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Printing and Inspecting Data from Spark RDDs

Unlike DataFrames which have show(), RDDs don’t have a built-in pretty-print method. Instead, you use actions like collect(), take(), foreach(), and others to retrieve or display data.


collect() — Bring All Data to Driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD Print").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([10, 20, 30, 40, 50])
# collect() returns a Python list
data = rdd.collect()
print(data) # [10, 20, 30, 40, 50]
for item in data:
print(item)
# ⚠️ Only safe for small RDDs — collects ALL data to driver memory

take(n) — First N Elements

# Returns first n elements as a Python list (no sorting guarantee)
rdd.take(3) # [10, 20, 30]
# More efficient than collect() — only scans until it has n elements
sc.textFile("s3://huge-log-file.txt").take(5) # Safe even for multi-TB files

first() — Single Element

rdd.first() # 10 — equivalent to take(1)[0]

top(n) — Largest N Elements

rdd.top(3) # [50, 40, 30] — descending natural order
# With a custom key
words = sc.parallelize(["banana", "apple", "cherry", "date"])
words.top(3, key=lambda w: len(w)) # ["banana", "cherry", "apple"]

foreach() — Process Each Element (on Executors)

# Runs a function on each element IN THE EXECUTOR — not the driver
# Use for side effects: writing to a file, sending to a message queue
rdd.foreach(lambda x: print(f"Processing: {x}"))
# Output appears in EXECUTOR logs, not driver console!
# For printing in the driver, use collect() first:
for item in rdd.collect():
print(f"Driver: {item}")

Inspecting Structure

# Count elements
rdd.count() # 5
# Count elements per value
rdd2 = sc.parallelize(["a", "b", "a", "c", "b", "a"])
rdd2.countByValue() # {'a': 3, 'b': 2, 'c': 1}
# Statistics (numeric RDDs)
rdd.stats()
# count: 5, mean: 30.0, stdev: 14.14, max: 50.0, min: 10.0
rdd.sum() # 150
rdd.mean() # 30.0
rdd.max() # 50
rdd.min() # 10
rdd.variance() # 200.0
rdd.stdev() # 14.14...
# Partition information
print(f"Partitions: {rdd.getNumPartitions()}")

Viewing Partition Contents

# See what's in each partition
rdd = sc.parallelize(range(10), numSlices=3)
rdd.mapPartitionsWithIndex(
lambda idx, it: [(idx, list(it))]
).collect()
# [(0, [0, 1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8, 9])]

takeSample() — Random Sample

# Take a random sample without replacement
rdd.takeSample(withReplacement=False, num=3, seed=42)
# e.g., [30, 10, 50]
# With replacement
rdd.takeSample(withReplacement=True, num=5, seed=42)

Safe Inspection Pattern

# Always check size before collect()
count = rdd.count()
if count < 10_000:
data = rdd.collect()
for item in data:
print(item)
else:
print(f"Too large to collect: {count} rows. Showing top 10:")
for item in rdd.take(10):
print(item)