Spark DataFrame crossJoin

A cross join (also called a Cartesian product) combines every row from the left DataFrame with every row from the right DataFrame. If the left side has M rows and the right side has N rows, the result has M × N rows. Cross joins are rarely the end goal of an analysis, but they are a useful building block for tasks such as generating parameter grids or checking data completeness.
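
The M × N row count can be checked without a cluster. The following is a plain-Python sketch, not Spark code: the lists `left` and `right` are made-up stand-ins for two small DataFrames, and `itertools.product` pairs their elements the same way a cross join pairs rows.

```python
from itertools import product

# Hypothetical stand-ins for two small DataFrames: each tuple is one row.
left = [("Red",), ("Green",), ("Blue",)]  # M = 3 rows
right = [("S",), ("M",)]                  # N = 2 rows

# Pair every left row with every right row, concatenating their columns.
pairs = [left_row + right_row for left_row, right_row in product(left, right)]

print(len(pairs))  # 6, i.e. M * N
print(pairs[0])    # ('Red', 'S')
```

Adding one more row to either list grows the result by the full length of the other list, which is why cross joins scale so poorly.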

Basic crossJoin()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CrossJoin").getOrCreate()

colors = spark.createDataFrame([("Red",), ("Green",), ("Blue",)], ["color"])
sizes  = spark.createDataFrame([("S",), ("M",), ("L",)],          ["size"])

# Every color paired with every size: 3 × 3 = 9 rows
product = colors.crossJoin(sizes)
product.show()
# +-----+----+
# |color|size|
# +-----+----+
# |  Red|   S|
# |  Red|   M|
# |  Red|   L|
# |Green|   S|
# |Green|   M|
# |Green|   L|
# | Blue|   S|
# | Blue|   M|
# | Blue|   L|
# +-----+----+

Cross Join in Spark SQL

colors.createOrReplaceTempView("colors")
sizes.createOrReplaceTempView("sizes")

spark.sql("""
    SELECT c.color, s.size
    FROM colors c
    CROSS JOIN sizes s
""").show()

# Equivalent comma-separated form. This is an implicit cross join, which
# Spark 2.x rejects unless spark.sql.crossJoin.enabled is set to true:
spark.sql("""
    SELECT c.color, s.size
    FROM colors c, sizes s
""").show()

Practical Use Cases

1. Generating All Combinations

# All possible experiment configurations
algorithms = spark.createDataFrame(
    [("gradient_boost",), ("random_forest",), ("xgboost",)], ["algorithm"]
)
learning_rates = spark.createDataFrame(
    [(0.01,), (0.1,), (1.0,)], ["learning_rate"]
)

grid = algorithms.crossJoin(learning_rates)
grid.show()
# 3 algorithms × 3 rates = 9 configurations to evaluate

2. Date × Entity Matrix for Completeness Checks

from pyspark.sql import functions as F

dates = spark.sql("SELECT sequence(DATE'2025-01-01', DATE'2025-01-07', INTERVAL 1 DAY) AS d") \
    .select(F.explode(F.col("d")).alias("date"))

stores = spark.createDataFrame(
    [("S001",), ("S002",), ("S003",)], ["store_id"]
)

# All expected date-store combinations
expected = dates.crossJoin(stores)

actual = spark.read.parquet("daily_sales.parquet")

# Find missing combinations
expected.join(actual, ["date", "store_id"], "left_anti").show()

Performance Warning

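# ⚠️ Cross join on large tables is EXTREMELY EXPENSIVE
# 1M rows × 1M rows = 1 trillion rows in the result

# An explicit crossJoin() or CROSS JOIN always runs. An implicit Cartesian product
# (a join with no usable condition) is different: on Spark 2.x it fails with an
# AnalysisException: "Detected implicit cartesian product for INNER join between logical plans"
# To allow implicit Cartesian products on Spark 2.x (only when intentional):
# (Since Spark 3.0 this setting already defaults to true.)
spark.conf.set("spark.sql.crossJoin.enabled", "true")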

# Always check result size before materializing:
expected_rows = colors.count() * sizes.count()
print(f"Expected result size: {expected_rows:,} rows")
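
One way to act on that size check is to refuse the join when the projected row count exceeds a budget. The helper below is a hypothetical sketch: `projected_cross_join_rows`, `check_cross_join_budget`, and the default limit of 100 million rows are illustrative choices, not Spark APIs. It works on plain integers, so in practice the inputs would come from `df.count()`.

```python
def projected_cross_join_rows(left_rows: int, right_rows: int) -> int:
    """Row count of a cross join: every left row pairs with every right row."""
    return left_rows * right_rows


def check_cross_join_budget(
    left_rows: int, right_rows: int, max_rows: int = 100_000_000
) -> int:
    """Return the projected row count, or raise if it exceeds max_rows."""
    projected = projected_cross_join_rows(left_rows, right_rows)
    if projected > max_rows:
        raise ValueError(
            f"Cross join would produce {projected:,} rows "
            f"(budget: {max_rows:,}); filter or aggregate the inputs first."
        )
    return projected


print(check_cross_join_budget(3, 3))  # 9

try:
    check_cross_join_budget(1_000_000, 1_000_000)
except ValueError as err:
    print(err)  # reports 1,000,000,000,000 projected rows
```

Keep in mind that each `count()` is itself a Spark job, so for very large inputs an approximate or cached count may be a better source for these numbers.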

Filtered Cross Join (Often More Appropriate)

Most real use cases do not need the full Cartesian product, only the pairs that satisfy some condition:

# products and discounts are assumed to be existing DataFrames with these columns.
# Cross join then filter returns the same rows as an inner join on the combined condition.
products.crossJoin(discounts) \
    .filter(F.col("category") == F.col("discount_category")) \
    .filter(F.col("price") >= F.col("min_price"))
# For complex non-equi join conditions, this is sometimes the clearest approach.