Spark DataFrame: show() vs take() vs collect()

Spark provides several ways to inspect DataFrame contents. Each has a different output format, performance characteristic, and appropriate use case. Choosing the wrong one can either print unreadable output or crash your driver with an OOM error.


Quick Reference

| Method | Returns | Output format | Use when |
|---|---|---|---|
| show(n) | None | Formatted table printed to stdout | Interactive inspection |
| take(n) | List[Row] | Python list of Row objects | Programmatic access to rows |
| head(n) | List[Row] | Same as take(n) | Alias for take(n) |
| collect() | List[Row] | All rows as a Python list | Small DataFrames only |
| first() | Row | Single Row object | First row only |
| toPandas() | pandas.DataFrame | pandas DataFrame | Downstream analysis in pandas |

show()

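show() prints a formatted ASCII table to stdout. It returns None, so there is nothing to assign or iterate over. Called with no arguments, it prints the first 20 rows and truncates strings longer than 20 characters.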

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("ShowVsTake").getOrCreate()
data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
# +-----+-----------+------+
# | name| department|salary|
# +-----+-----------+------+
# |Alice|Engineering| 95000|
# |  Bob|  Marketing| 72000|
# +-----+-----------+------+
df.show(5) # Show up to 5 rows
df.show(5, truncate=False) # Don't cut long strings at 20 chars
df.show(5, truncate=30) # Truncate at 30 chars
df.show(5, vertical=True) # Print one column per line (better for wide tables)
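
Because show() returns None, writing `table = df.show()` leaves `table` empty. If the table text is needed as a string (for a log message, say), one option is to capture stdout while show() runs. The sketch below assumes show() writes through Python's print, as classic PySpark does; `show_to_string` is a hypothetical helper name, not part of the Spark API.

```python
import io
from contextlib import redirect_stdout


def show_to_string(df, n=20, truncate=True):
    """Return the text that df.show() would print (hypothetical helper).

    Assumption: df.show() writes to Python's sys.stdout.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        # show() returns None; the table exists only as printed text.
        df.show(n, truncate=truncate)
    return buffer.getvalue()
```

With the df defined above, `show_to_string(df, 5)` would return the same table that `df.show(5)` prints, as a single string.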

take()

take(n) returns a Python list of Row objects. Use it when you need to process rows programmatically.

rows = df.take(3)  # At most 3 rows; this df has only 2
# [Row(name='Alice', department='Engineering', salary=95000),
#  Row(name='Bob', department='Marketing', salary=72000)]
# Access fields by name, either by key or as an attribute
for row in rows:
    print(f"{row['name']}: {row.salary}")
# Or as a dict
rows[0].asDict()
# {'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
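
When the rows are headed for code that knows nothing about Spark (a template, a test fixture, a JSON encoder), it can help to convert them to plain dicts right after take(). A minimal sketch; `preview_as_dicts` is a hypothetical helper name, and it relies only on the take() and asDict() calls shown above.

```python
def preview_as_dicts(df, n=5):
    """Fetch at most n rows and convert each Row to a plain dict."""
    # take(n) limits the work on the cluster; asDict() then runs
    # on the driver over the handful of rows that came back.
    return [row.asDict() for row in df.take(n)]
```

The result is an ordinary list of dicts, so it no longer depends on the Row class or on the SparkSession staying alive.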

collect()

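collect() brings all rows to the driver. Only use it when the entire result fits comfortably in driver memory. By default, spark.driver.memory is 1g and spark.driver.maxResultSize caps the serialized result of an action at 1g, so a result approaching that size fails unless those settings are raised.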

all_rows = df.collect() # ← OOM risk for large DataFrames
# Safe pattern: filter first, then collect
small_result = df.filter(F.col("salary") > 100000).select("name").collect()
names = [row["name"] for row in small_result]
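
Filtering first only helps if the filtered result really is small. One way to make that assumption explicit is to wrap collect() in a row-count guard. The sketch below is one possible approach, not a Spark feature; `bounded_collect` and its `max_rows` default are hypothetical.

```python
def bounded_collect(df, max_rows=10_000):
    """Collect df only if it has at most max_rows rows (hypothetical helper)."""
    # Ask for one extra row so an oversized result can be detected
    # without pulling the whole DataFrame to the driver.
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"Result has more than {max_rows} rows; "
            "filter or aggregate further before collecting"
        )
    return rows
```

The guard bounds the number of rows, not the number of bytes, so very wide rows can still be large. Pair it with a select() of only the columns you need.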

first() and head()

# first() returns one Row object; in PySpark it returns None if the DataFrame is empty
row = df.first()
print(row["name"]) # "Alice"
# head(n) — same as take(n)
rows = df.head(3)
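
In PySpark, first() returns None on an empty DataFrame, so indexing the result directly can raise a TypeError after a filter that matches nothing. A small sketch of a defensive wrapper; `first_value` is a hypothetical helper name.

```python
def first_value(df, column, default=None):
    """Return `column` from the first row, or `default` if df is empty."""
    row = df.first()  # None when the DataFrame has no rows (PySpark)
    return default if row is None else row[column]
```

Without an explicit orderBy(), which row counts as "first" is not guaranteed, so this suits single-row results such as a global aggregate.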

toPandas()

# Converts the entire Spark DataFrame to a pandas DataFrame
# Only use when all data fits in driver memory
# Optional: use Apache Arrow to speed up the conversion (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas() # pandas.DataFrame
print(pdf.dtypes)
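
Since toPandas() materializes everything on the driver, it is usually worth shrinking the DataFrame on the cluster first. A minimal sketch, assuming a sample of rows is enough for the pandas analysis; `to_pandas_bounded` and its default cap are hypothetical.

```python
def to_pandas_bounded(df, columns, max_rows=100_000):
    """Convert only the selected columns and at most max_rows rows to pandas."""
    # Prune columns and cap rows on the cluster, before anything
    # is transferred to the driver.
    return df.select(*columns).limit(max_rows).toPandas()
```

For example, `to_pandas_bounded(df, ["name", "salary"], max_rows=1000)` converts two columns and at most 1,000 rows. Without an orderBy(), limit() returns an arbitrary subset, so treat the result as a sample rather than "the top rows".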

Performance Considerations

# show(n) and take(n) both fetch roughly the same small number of rows from the cluster,
# so their cost is similar; show() adds a little work to format the table
# collect() computes ALL rows and transfers them to the driver
# For a 1 TB DataFrame, collect() would try to move 1 TB of data to one machine
# Safe pattern for large DataFrames: reduce on the cluster, then collect
result = df.groupBy("department").agg(F.avg("salary"))  # Small result
result.collect()  # Fine: one row per department
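
When every row really does have to pass through the driver and aggregation is not an option, DataFrame.toLocalIterator() streams rows partition by partition instead of materializing them all at once. The sketch below groups that stream into fixed-size batches; `iter_batches` and its `batch_size` default are hypothetical.

```python
from itertools import islice


def iter_batches(df, batch_size=1000):
    """Yield lists of up to batch_size Rows without collecting everything."""
    rows = df.toLocalIterator()  # streams rows one partition at a time
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch
```

A typical use is `for batch in iter_batches(df, 500): ...`, writing each batch to an external system. The driver still needs enough memory for the largest single partition, and the total transfer is unchanged; only the peak memory use drops.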