Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Jobs

A job is the highest-level unit of work in Spark. Every time you call an action (.count(), .collect(), .save()) on a DataFrame or RDD, Spark creates a job to compute the result. A single Spark application can run many jobs sequentially or in parallel.


Job → Stage → Task Hierarchy

Application
└── Job (1 per action)
└── Stage (1 per shuffle boundary)
└── Task (1 per partition)
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("Jobs Demo").getOrCreate()
df = spark.read.parquet("sales.parquet")
# Each action below submits one job to the cluster
df.count() # Job 1
df.filter(F.col("amount") > 100).count() # Job 2
df.groupBy("region").sum("amount").show() # Job 3 (shuffle → 2 stages)
df.write.mode("overwrite").parquet("out/") # Job 4

Job Lifecycle

1. Action called (e.g., df.count())
2. DAGScheduler receives the job
- Builds the logical plan
- Identifies shuffle boundaries → creates stages
3. DAGScheduler submits stages in dependency order
- Stage N must complete before Stage N+1
4. TaskScheduler takes each stage
- Creates one Task per partition
- Assigns tasks to available executor cores
5. Executors run tasks in parallel
- Each task reads its partition and applies transformations
6. Driver collects results (for collect/show/count)
- Or confirms files written (for save operations)
7. Job completes

Inside a Job With Shuffles

df = spark.read.parquet("transactions.parquet") # 50 partitions
# Stage 1: narrow transformations — 50 tasks run in parallel
filtered = df.filter(F.col("year") == 2025)
# → Shuffle boundary (groupBy hashes rows by customer_id)
# Stage 2: aggregate — 200 tasks (spark.sql.shuffle.partitions)
result = filtered.groupBy("customer_id").agg(F.sum("amount"))
result.show() # 2 stages, 50 + 200 = 250 total tasks

Job Grouping for Tracking

# Tag jobs for easier identification in the Spark UI
with spark.sparkContext.setJobGroup("etl-pipeline", "Daily ETL - sales aggregation"):
result = df.groupBy("category").sum("amount")
result.count()
# All jobs inside this block appear under "etl-pipeline" group in the UI

Fair Scheduling Between Jobs

By default, Spark uses FIFO — jobs run in submission order. For interactive workloads with multiple concurrent users, enable Fair Scheduling:

spark = SparkSession.builder \
.config("spark.scheduler.mode", "FAIR") \
.getOrCreate()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")
df.count()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)

Monitoring Jobs in the Spark UI

Open http://localhost:4040 while your application runs:

Key metrics per job: