Spark Local Mode vs Cluster Mode
Spark supports multiple deployment configurations. The most fundamental distinction is between local mode, where everything runs in a single JVM on your machine, and cluster mode, where the driver and executors are distributed across multiple machines.
Local Mode
In local mode, Spark runs the driver and executes all tasks as threads within a single JVM process. No cluster manager is involved.
    from pyspark.sql import SparkSession

    # local: a single worker thread (good for debugging)
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    # local[4]: 4 parallel worker threads
    spark = SparkSession.builder.master("local[4]").appName("Test").getOrCreate()

    # local[*]: use all available CPU cores
    spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()

Local mode is ideal for:
- Unit testing Spark code
- Developing and debugging transformations
- Datasets that fit in a single machine’s memory
- CI/CD pipeline tests
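The master strings used for local mode follow a small pattern, and it can help to see what each form implies. The sketch below is a plain-Python illustration, not Spark's own parser: `describe_master` is a hypothetical helper name, and it assumes only the forms `local`, `local[N]`, `local[*]` and the variant with a maximum task failure count (`local[N,F]`) count as local.

```python
import os
import re

# Hypothetical helper for illustration only; Spark parses master URLs internally.
_LOCAL_PATTERN = re.compile(r"^local(?:\[(\*|\d+)(?:\s*,\s*\d+)?\])?$")


def describe_master(master: str) -> dict:
    """Classify a master string and report how many local worker threads it implies."""
    match = _LOCAL_PATTERN.match(master)
    if match is None:
        # Anything else (yarn, k8s://..., spark://...) is treated as a cluster manager.
        return {"mode": "cluster", "threads": None}
    spec = match.group(1)
    if spec is None:
        threads = 1                    # "local": a single worker thread
    elif spec == "*":
        threads = os.cpu_count() or 1  # "local[*]": one thread per logical core
    else:
        threads = int(spec)            # "local[N]": exactly N threads
    return {"mode": "local", "threads": threads}


if __name__ == "__main__":
    for candidate in ("local", "local[4]", "local[*]", "yarn"):
        print(candidate, describe_master(candidate))
```

A check like this can be handy in test setup code, for example to fail fast if a CI job is accidentally pointed at a real cluster.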
Cluster Mode
In cluster mode, Spark connects to a cluster manager (YARN, Kubernetes, Standalone) that allocates resources across multiple machines.
    # Submit to YARN
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8g \
      --driver-memory 4g \
      my_pipeline.py

    # Submit to Kubernetes
    spark-submit \
      --master k8s://https://k8s-api:6443 \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=my-spark:3.5 \
      my_pipeline.py

    # Submit to a Standalone cluster
    # (Standalone does not support --deploy-mode cluster for Python applications, so use client)
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      my_pipeline.py

Client vs Cluster Deploy Mode
When submitting to a cluster, the --deploy-mode flag adds a further distinction: where the driver runs.
| | --deploy-mode client | --deploy-mode cluster |
|---|---|---|
| Driver location | Submitting machine | A node inside the cluster, chosen by the cluster manager |
| Best for | Interactive notebooks, debugging | Production batch jobs |
| Driver stdout/stderr | In your terminal | In cluster logs |
| Submitting machine must stay connected | For the whole run | Only at submission |
| Network | Driver on the client machine, possibly far from the executors | Driver inside the cluster, close to the executors |
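To make the relationship between master, deploy mode and resource flags concrete, here is a minimal sketch that assembles a `spark-submit` argument list. `build_submit_command` is a hypothetical helper introduced for illustration; it only builds the list and does not validate flags against any particular Spark version or launch anything.

```python
from typing import Dict, List, Optional


def build_submit_command(
    app: str,
    master: str,
    deploy_mode: str = "client",
    options: Optional[Dict[str, str]] = None,
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper; runs nothing)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for flag, value in (options or {}).items():
        cmd += [f"--{flag}", value]          # e.g. executor-memory -> --executor-memory 8g
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]  # arbitrary Spark properties
    cmd.append(app)                          # the application file goes last
    return cmd


if __name__ == "__main__":
    yarn_cmd = build_submit_command(
        "my_pipeline.py",
        master="yarn",
        deploy_mode="cluster",
        options={"num-executors": "10", "executor-cores": "4",
                 "executor-memory": "8g", "driver-memory": "4g"},
    )
    print(" ".join(yarn_cmd))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns, should you want to launch the job from Python.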
Comparison Table
| Aspect | Local Mode | Cluster Mode |
|---|---|---|
| Hardware | Single machine | Multiple machines |
| Fault tolerance | Limited (the machine is a single point of failure; there are no other nodes to retry on) | Task-level (failed tasks are retried, on other nodes if needed) |
| Scalability | Single machine | Thousands of cores |
| Data size | GB scale | TB to PB scale |
| Cluster manager | None | YARN / Kubernetes / Standalone |
| Spark UI | http://localhost:4040 | Cluster-provided URL |
| Setup complexity | None | Cluster provisioning required |
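As a rough worked example of the scalability row, the sketch below adds up what the YARN submission shown earlier asks for. The helper names are hypothetical, and the calculation counts only the requested JVM heap; cluster managers typically reserve extra per-container memory overhead on top, which this ignores.

```python
def parse_memory_mib(size: str) -> int:
    """Convert a Spark-style size such as '8g' or '512m' to MiB (hypothetical helper)."""
    units = {"m": 1, "g": 1024, "t": 1024 * 1024}
    value, unit = size[:-1], size[-1:].lower()
    if unit not in units or not value.isdigit():
        raise ValueError(f"unsupported memory size: {size!r}")
    return int(value) * units[unit]


def cluster_footprint(num_executors: int, executor_cores: int,
                      executor_memory: str, driver_memory: str) -> dict:
    """Sum the task slots and heap memory requested by a submission."""
    executor_mib = parse_memory_mib(executor_memory)
    return {
        # Each executor core can run one task at a time by default.
        "task_slots": num_executors * executor_cores,
        "executor_heap_mib": num_executors * executor_mib,
        "total_heap_mib": num_executors * executor_mib + parse_memory_mib(driver_memory),
    }


if __name__ == "__main__":
    # Values taken from the YARN example: 10 executors, 4 cores, 8g each, 4g driver.
    print(cluster_footprint(10, 4, "8g", "4g"))
```

For the YARN example this comes to 40 concurrent task slots and 84 GiB of heap (80 GiB across executors plus 4 GiB for the driver), compared with the handful of threads available to `local[*]` on a typical laptop.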
Writing Portable Code
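    import os
    from pyspark.sql import SparkSession

    # Read the master from the environment. Leave it unset when SPARK_MASTER is absent,
    # because a master set in code overrides the --master flag passed to spark-submit.
    MASTER = os.environ.get("SPARK_MASTER")

    builder = SparkSession.builder.appName("PortableApp")
    if MASTER:
        builder = builder.master(MASTER)
    spark = builder.getOrCreate()

    # Dev:  SPARK_MASTER='local[4]' python my_app.py
    # Prod: spark-submit --master yarn my_app.py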
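One detail worth keeping in mind is that properties set in application code take precedence over flags passed to `spark-submit`, so a master hard-coded (or defaulted) in code wins over `--master yarn`. A minimal sketch of keeping that decision in a small, Spark-free function that can be unit tested follows; `session_options` is a hypothetical helper, and `SPARK_MASTER` is this article's own environment variable rather than one Spark reads itself.

```python
from typing import Dict, Mapping


def session_options(env: Mapping[str, str], app_name: str = "PortableApp") -> Dict[str, str]:
    """Return Spark properties to set in code; omit the master unless explicitly requested."""
    options = {"spark.app.name": app_name}
    master = env.get("SPARK_MASTER", "").strip()
    if master:
        # Only pin the master when the environment asks for one, so that
        # spark-submit --master ... stays in control otherwise.
        options["spark.master"] = master
    return options


if __name__ == "__main__":
    print(session_options({"SPARK_MASTER": "local[4]"}))
    print(session_options({}))
```

In an application, the returned mapping could be applied with `builder.config(key, value)` for each entry before calling `getOrCreate()`, passing `os.environ` as `env`.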