Spark DataFrame: show() vs take() vs collect()

Spark provides several ways to inspect DataFrame contents. Each has a different output format, performance characteristic, and appropriate use case. Choosing the wrong one can either print unreadable output or crash your driver with an OOM error.

Quick Reference

Method	Returns	Output format	Use when
`show(n)`	`None`	Formatted table printed to stdout	Interactive inspection
`take(n)`	`List[Row]`	Python list of Row objects	Programmatic access to rows
`head(n)`	`List[Row]`	Same as `take(n)`	Alias for `take(n)`; `head()` with no argument returns a single `Row`
`collect()`	`List[Row]`	All rows as Python list	Small DataFrames only
`first()`	`Row` or `None`	Single Row object (`None` if the DataFrame is empty)	First row only
`toPandas()`	`pandas.DataFrame`	Pandas DataFrame	Pandas-downstream analysis

show()

show() prints a formatted ASCII table to stdout and returns None. With no arguments it prints the first 20 rows and truncates strings longer than 20 characters.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ShowVsTake").getOrCreate()

data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])

df.show()
# +-----+------------+------+
# | name|  department|salary|
# +-----+------------+------+
# |Alice| Engineering| 95000|
# |  Bob|   Marketing| 72000|
# +-----+------------+------+

df.show(5)                  # Show up to 5 rows
df.show(5, truncate=False)  # Don't cut long strings at 20 chars
df.show(5, truncate=30)     # Truncate at 30 chars
df.show(5, vertical=True)   # Print each row vertically, one field per line (better for wide tables)

take()

take(n) returns a Python list of at most n Row objects. Use it when you need to process rows programmatically.

rows = df.take(3)   # The DataFrame has only 2 rows, so both are returned
# [Row(name='Alice', department='Engineering', salary=95000),
#  Row(name='Bob', department='Marketing', salary=72000)]

# Access fields by name
for row in rows:
    print(f"{row['name']}: {row.salary}")

# Or as a dict
rows[0].asDict()
# {'name': 'Alice', 'department': 'Engineering', 'salary': 95000}

collect()

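collect() brings all rows to the driver as a Python list. Use it only when the entire result fits comfortably in driver memory. The driver heap (spark.driver.memory) and the cap on serialized results (spark.driver.maxResultSize) both default to 1g, so results approaching 1 GB are already too large on a default configuration.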

all_rows = df.collect()   # ← OOM risk for large DataFrames

# Safer pattern: reduce rows and columns first, then collect
small_result = df.filter(F.col("salary") > 100000).select("name").collect()
names = [row["name"] for row in small_result]   # [] for the sample data (no salary exceeds 100000)

first() and head()

# first() returns one Row, or None if the DataFrame is empty
# (take(1)[0] would raise IndexError in that case)
row = df.first()
print(row["name"])   # "Alice"

# head(n) is the same as take(n); head() with no argument behaves like first()
rows = df.head(3)

toPandas()

# Converts the entire Spark DataFrame to a pandas DataFrame
# Only use when all data fits in driver memory

# Optional: enable Arrow to speed up the conversion (requires pyarrow on the driver)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas()   # pandas.DataFrame
print(pdf.dtypes)

Performance Considerations

# show(n) and take(n) both fetch only a limited number of rows, so on a simple
# scan Spark typically reads just enough partitions to satisfy the limit.
# show() adds the small cost of formatting those rows as a table.

# collect() scans ALL rows and transfers them to the driver
# For a 1 TB DataFrame: collect() would transfer 1 TB of data to one machine

# Safe pattern for large DataFrames
result = df.groupBy("department").agg(F.avg("salary"))   # Small result
result.collect()   # Fine: one row per department