Spark DataFrame: show() vs take() vs collect()
Spark provides several ways to inspect DataFrame contents. Each has a different output format, performance characteristic, and appropriate use case. Choosing the wrong one can either print unreadable output or crash your driver with an OOM error.
Quick Reference
| Method | Returns | Output format | Use when |
|---|---|---|---|
| show(n) | None | Formatted table printed to stdout | Interactive inspection |
| take(n) | List[Row] | Python list of Row objects | Programmatic access to rows |
| head(n) | List[Row] | Same as take(n) | Alias for take |
| collect() | List[Row] | All rows as Python list | Small DataFrames only |
| first() | Row | Single Row object | First row only |
| toPandas() | pandas.DataFrame | Pandas DataFrame | Pandas-downstream analysis |
show()
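show() prints a formatted ASCII table to stdout and returns None, so there is nothing useful to assign from it. By default it prints the first 20 rows and truncates strings longer than 20 characters.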
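Because show() only prints, getting the table into a log message or a test assertion takes an extra step. The sketch below captures the printed text by redirecting stdout. `show_to_string` is a hypothetical helper name, not part of PySpark, and the approach assumes that PySpark's show() writes through Python's `print`, so that `contextlib.redirect_stdout` can intercept it.

```python
import io
from contextlib import redirect_stdout
from typing import Any


def show_to_string(df: Any, n: int = 20, truncate: bool = True) -> str:
    """Return the text that df.show(n, truncate=truncate) would print.

    Hypothetical helper: `df` is expected to be a PySpark DataFrame,
    but any object with a compatible show() method works.
    """
    buffer = io.StringIO()
    # Assumption: show() prints via Python's print(), so the redirect sees it.
    with redirect_stdout(buffer):
        df.show(n, truncate=truncate)
    return buffer.getvalue()
```

For any DataFrame `df`, `show_to_string(df, 5)` should return the same table that `df.show(5)` prints, which can then be passed to a logger instead of going to the console.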
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("ShowVsTake").getOrCreate()
data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
# +-----+-----------+------+
# | name| department|salary|
# +-----+-----------+------+
# |Alice|Engineering| 95000|
# |  Bob|  Marketing| 72000|
# +-----+-----------+------+
df.show(5)                  # Show up to 5 rows
df.show(5, truncate=False)  # Don't cut long strings at 20 chars
df.show(5, truncate=30)     # Truncate at 30 chars
df.show(5, vertical=True)   # Print one column per line (better for wide tables)
take()
take(n) returns a Python list of Row objects. Use it when you need to process rows programmatically.
rows = df.take(3)
# [Row(name='Alice', department='Engineering', salary=95000),
#  Row(name='Bob', department='Marketing', salary=72000)]
# Access fields by name
for row in rows:
    print(f"{row['name']}: {row.salary}")
# Or as a dict
rows[0].asDict()
# {'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
collect()
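collect() brings all rows to the driver as a Python list. Only use it when the entire result fits comfortably in driver memory. With default settings, both spark.driver.memory and spark.driver.maxResultSize are 1 GB, so keep collected results well under that.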
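One way to make that rule hard to break by accident is to put a row cap in front of collect(). The following is a minimal sketch, not a PySpark feature: `collect_at_most` is a hypothetical helper, and it relies only on the DataFrame methods `limit()` and `collect()`. A row cap is a rough proxy for memory use, since wide rows can still be large.

```python
from typing import Any, List


def collect_at_most(df: Any, max_rows: int) -> List[Any]:
    """Collect df only if it has at most max_rows rows; otherwise raise.

    Hypothetical helper: `df` is expected to be a PySpark DataFrame.
    """
    # Ask for one row more than allowed. If that extra row arrives,
    # the result is too large and we stop before pulling the rest.
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"result has more than {max_rows} rows; refusing to collect"
        )
    return rows
```

Fetching `max_rows + 1` rows keeps the check cheap: at most one extra row reaches the driver, and no separate count() job is needed.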
all_rows = df.collect() # ← OOM risk for large DataFrames
# Safe pattern: filter first, then collect
small_result = df.filter(F.col("salary") > 100000).select("name").collect()
names = [row["name"] for row in small_result]
first() and head()
# first(): returns one Row object (like take(1)[0]), or None if the DataFrame is empty
row = df.first()
print(row["name"])  # Alice
# head(n): same as take(n)
rows = df.head(3)
toPandas()
# Converts the entire Spark DataFrame to a pandas DataFrame
# Only use when all data fits in driver memory
# Enable Arrow-based conversion (requires pyarrow to be installed)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas()  # pandas.DataFrame
print(pdf.dtypes)
Performance Considerations
# show(n) vs take(n): both fetch only about n rows from the cluster, not the whole DataFrame
# show() does slightly more work because it also formats the rows as a table
# collect() scans ALL rows and transfers them to the driver
# For a 1 TB DataFrame, collect() would try to move 1 TB of data onto one machine
# Safe pattern for large DataFrames: aggregate on the cluster, collect the small result
result = df.groupBy("department").agg(F.avg("salary"))  # Small result
result.collect()  # Fine: one row per department
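When the rows themselves have to pass through the driver and aggregating first is not an option, `DataFrame.toLocalIterator()` is an alternative to collect(): it yields rows one partition at a time, so the driver needs room for the largest partition rather than the whole result. The sketch below groups that stream into fixed-size batches. `iter_row_chunks` is a hypothetical helper name, and `chunk_size` is an illustrative parameter, not a Spark setting.

```python
from itertools import islice
from typing import Any, Iterator, List


def iter_row_chunks(df: Any, chunk_size: int) -> Iterator[List[Any]]:
    """Yield lists of up to chunk_size rows without collecting everything.

    Hypothetical helper: `df` is expected to be a PySpark DataFrame.
    """
    # toLocalIterator() streams rows to the driver partition by partition.
    rows = iter(df.toLocalIterator())
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return  # the stream is exhausted
        yield chunk
```

A loop such as `for chunk in iter_row_chunks(df, 10_000): ...` then handles one batch at a time. This trades speed for memory: partitions are fetched sequentially, so it is usually slower than a single collect() on data that would have fit anyway.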