Apache Spark 49 guides · updated 2026

Spark DataFrame crossJoin

A cross join (also called a Cartesian product) combines every row from the left DataFrame with every row from the right DataFrame. If the left side has M rows and the right side has N rows, the result has M × N rows. Cross joins are rarely the goal of an analysis, but they are a useful building block for tasks such as generating parameter grids or checking data for completeness.
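The M × N row count can be checked without a cluster. The sketch below uses plain Python's `itertools.product` as a stand-in for a cross join; the lists `left_rows` and `right_rows` are illustrative names and are not part of the Spark API.

```python
from itertools import product

# Illustrative stand-ins for two small DataFrames (plain Python, not Spark).
left_rows = ["Red", "Green", "Blue"]  # M = 3
right_rows = ["S", "M"]               # N = 2

# Cartesian product: every left row paired with every right row.
pairs = list(product(left_rows, right_rows))

# The result size is always M * N.
print(len(pairs))  # 6
```

The same arithmetic is why the result grows so quickly: doubling either input doubles the output.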


Basic crossJoin()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CrossJoin").getOrCreate()
colors = spark.createDataFrame([("Red",), ("Green",), ("Blue",)], ["color"])
sizes = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])
# Every color paired with every size: 3 × 3 = 9 rows
product = colors.crossJoin(sizes)
product.show()
# +-----+----+
# |color|size|
# +-----+----+
# |  Red|   S|
# |  Red|   M|
# |  Red|   L|
# |Green|   S|
# |Green|   M|
# |Green|   L|
# | Blue|   S|
# | Blue|   M|
# | Blue|   L|
# +-----+----+

Cross Join in Spark SQL

colors.createOrReplaceTempView("colors")
sizes.createOrReplaceTempView("sizes")
spark.sql("""
SELECT c.color, s.size
FROM colors c
CROSS JOIN sizes s
""").show()
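# Equivalent without the explicit CROSS JOIN keyword (an implicit Cartesian product;
# on Spark 2.x this form requires spark.sql.crossJoin.enabled=true):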
spark.sql("""
SELECT c.color, s.size
FROM colors c, sizes s
""").show()

Practical Use Cases

1. Generating All Combinations

# All possible experiment configurations
algorithms = spark.createDataFrame(
    [("gradient_boost",), ("random_forest",), ("xgboost",)], ["algorithm"]
)
learning_rates = spark.createDataFrame(
    [(0.01,), (0.1,), (1.0,)], ["learning_rate"]
)
grid = algorithms.crossJoin(learning_rates)
grid.show()
# 3 algorithms × 3 rates = 9 configurations to evaluate

2. Date × Entity Matrix for Completeness Checks

from pyspark.sql import functions as F
dates = spark.sql("SELECT sequence(DATE'2025-01-01', DATE'2025-01-07', INTERVAL 1 DAY) AS d") \
    .select(F.explode(F.col("d")).alias("date"))
stores = spark.createDataFrame(
    [("S001",), ("S002",), ("S003",)], ["store_id"]
)
# All expected date-store combinations
expected = dates.crossJoin(stores)
actual = spark.read.parquet("daily_sales.parquet")
# Find missing combinations
expected.join(actual, ["date", "store_id"], "left_anti").show()

Performance Warning

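# ⚠️ A cross join on large tables is extremely expensive:
# 1M rows × 1M rows = 1 trillion rows in the result.
# On Spark 2.x, a join with no join condition fails with an AnalysisException:
# "Detected implicit cartesian product for INNER join between logical plans"
# Explicit crossJoin() and CROSS JOIN are allowed regardless of this setting.
# To permit implicit Cartesian products on Spark 2.x (only when intentional):
spark.conf.set("spark.sql.crossJoin.enabled", "true")
# Since Spark 3.0 this setting defaults to true, so the error is not raised.
# Always estimate the result size before materializing (each count() runs a job):
expected_rows = colors.count() * sizes.count()
print(f"Expected result size: {expected_rows:,} rows")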

Filtered Cross Join (Often More Appropriate)

Instead of a pure cross join, most use cases want a filtered Cartesian product:

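# A cross join followed by filters is logically equivalent to an inner join on the
# same conditions; Spark's optimizer typically pushes the filters into the join.
# Assumes `products` and `discounts` DataFrames already exist.
filtered = (
    products.crossJoin(discounts)
    .filter(F.col("category") == F.col("discount_category"))
    .filter(F.col("price") >= F.col("min_price"))
)
# For complex non-equi join conditions, this is sometimes the clearest way to write the logic.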