Spark DataFrame crossJoin
A cross join (also called a Cartesian product) combines every row from the left DataFrame with every row from the right DataFrame. If left has M rows and right has N rows, the result has M × N rows. Cross joins are rarely used intentionally for analysis but are a building block for certain algorithms.
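The M × N row count can be checked without a Spark session. The following is a minimal plain-Python sketch of the same pairing semantics using `itertools.product`; the two lists are illustrative stand-ins for DataFrame rows, not Spark objects.

```python
from itertools import product

# Illustrative stand-ins for the rows of two DataFrames (not Spark objects).
left_rows = ["Red", "Green", "Blue"]   # M = 3
right_rows = ["S", "M"]                # N = 2

# A Cartesian product pairs every left row with every right row.
pairs = list(product(left_rows, right_rows))

print(len(pairs))   # M × N = 3 × 2 = 6
print(pairs[:2])    # [('Red', 'S'), ('Red', 'M')]
```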
Basic crossJoin()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CrossJoin").getOrCreate()
colors = spark.createDataFrame([("Red",), ("Green",), ("Blue",)], ["color"])
sizes = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])

# Every color paired with every size: 3 × 3 = 9 rows
product = colors.crossJoin(sizes)
product.show()
# +-----+----+
# |color|size|
# +-----+----+
# |  Red|   S|
# |  Red|   M|
# |  Red|   L|
# |Green|   S|
# |Green|   M|
# |Green|   L|
# | Blue|   S|
# | Blue|   M|
# | Blue|   L|
# +-----+----+

Cross Join in Spark SQL
colors.createOrReplaceTempView("colors")
sizes.createOrReplaceTempView("sizes")

spark.sql("""
    SELECT c.color, s.size
    FROM colors c
    CROSS JOIN sizes s
""").show()

# Equivalent without the explicit CROSS JOIN keyword:
spark.sql("""
    SELECT c.color, s.size
    FROM colors c, sizes s
""").show()

Practical Use Cases
1. Generating All Combinations
# All possible experiment configurations
algorithms = spark.createDataFrame(
    [("gradient_boost",), ("random_forest",), ("xgboost",)],
    ["algorithm"]
)
learning_rates = spark.createDataFrame(
    [(0.01,), (0.1,), (1.0,)],
    ["learning_rate"]
)

grid = algorithms.crossJoin(learning_rates)
grid.show()
# 3 algorithms × 3 rates = 9 configurations to evaluate

2. Date × Entity Matrix for Completeness Checks
from pyspark.sql import functions as F
dates = spark.sql(
    "SELECT sequence(DATE'2025-01-01', DATE'2025-01-07', INTERVAL 1 DAY) AS d"
).select(F.explode(F.col("d")).alias("date"))

stores = spark.createDataFrame(
    [("S001",), ("S002",), ("S003",)],
    ["store_id"]
)

# All expected date-store combinations
expected = dates.crossJoin(stores)
actual = spark.read.parquet("daily_sales.parquet")
# Find missing combinations
expected.join(actual, ["date", "store_id"], "left_anti").show()

Performance Warning
# ⚠️ Cross join on large tables is EXTREMELY EXPENSIVE
# 1M rows × 1M rows = 1 trillion rows in the result
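To make that arithmetic concrete, here is a back-of-the-envelope estimate in plain Python. The helper name `cross_join_estimate` and the 16 bytes per row figure are assumptions chosen only for illustration; real row widths depend on the schema and on how Spark encodes the data.

```python
def cross_join_estimate(left_rows: int, right_rows: int, bytes_per_row: int = 16):
    """Return (row_count, approximate_bytes) for a full Cartesian product.

    bytes_per_row is a hypothetical average width of one output row.
    """
    rows = left_rows * right_rows
    return rows, rows * bytes_per_row

rows, size_bytes = cross_join_estimate(1_000_000, 1_000_000)
print(f"{rows:,} rows")                             # 1,000,000,000,000 rows
print(f"~{size_bytes / 1e12:.0f} TB at 16 B/row")   # ~16 TB at 16 B/row
```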
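# In Spark 2.x, a join with no condition (an implicit Cartesian product) is rejected with:
# "Detected implicit cartesian product for INNER join between logical plans"
# An explicit crossJoin() is always allowed. To permit implicit ones (only when intentional):
spark.conf.set("spark.sql.crossJoin.enabled", "true")
# Since Spark 3.0 this setting defaults to true, so the call is unnecessary there.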
# Always check result size before materializing:
expected_rows = colors.count() * sizes.count()
print(f"Expected result size: {expected_rows:,} rows")

Filtered Cross Join (Often More Appropriate)
Instead of a pure cross join, most use cases want a filtered Cartesian product:
# Cross join then filter is equivalent to a join with a complex condition
# (products and discounts are assumed to be existing DataFrames)
products.crossJoin(discounts) \
    .filter(F.col("category") == F.col("discount_category")) \
    .filter(F.col("price") >= F.col("min_price"))
# For complex non-equi join conditions, this is sometimes the clearest approach
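The equivalence between "cross join, then filter" and "join with a condition" can be seen in a small plain-Python model. The rows below are made-up sample data that reuse the column names from the snippet (`name` and `pct` are extra hypothetical columns); this sketch models the semantics only, not Spark's execution plan.

```python
from itertools import product

# Hypothetical sample rows; column names follow the snippet above.
products = [
    {"name": "laptop", "category": "tech", "price": 900},
    {"name": "mouse", "category": "tech", "price": 20},
    {"name": "desk", "category": "home", "price": 300},
]
discounts = [
    {"discount_category": "tech", "min_price": 100, "pct": 10},
    {"discount_category": "home", "min_price": 500, "pct": 15},
]

def matches(p, d):
    """The combined condition: one equality plus one non-equi comparison."""
    return p["category"] == d["discount_category"] and p["price"] >= d["min_price"]

# Approach 1: build the full Cartesian product, then filter it.
cross_then_filter = [(p, d) for p, d in product(products, discounts) if matches(p, d)]

# Approach 2: a nested-loop join that applies the condition while pairing.
conditional_join = []
for p in products:
    for d in discounts:
        if matches(p, d):
            conditional_join.append((p, d))

print(cross_then_filter == conditional_join)                   # True
print([(p["name"], d["pct"]) for p, d in cross_then_filter])   # [('laptop', 10)]
```

In PySpark, the same condition can instead be passed directly as the `on` argument of `join()`, combining the two column comparisons with `&`, which avoids writing the cross join explicitly.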