Spark Jobs
A job is the highest-level unit of work in Spark. Every time you call an action (.count(), .collect(), .save()) on a DataFrame or RDD, Spark creates a job to compute the result. A single Spark application can run many jobs sequentially or in parallel.
Job → Stage → Task Hierarchy
Application└── Job (1 per action) └── Stage (1 per shuffle boundary) └── Task (1 per partition)from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("Jobs Demo").getOrCreate()df = spark.read.parquet("sales.parquet")
# Each action below submits one job to the clusterdf.count() # Job 1df.filter(F.col("amount") > 100).count() # Job 2df.groupBy("region").sum("amount").show() # Job 3 (shuffle → 2 stages)df.write.mode("overwrite").parquet("out/") # Job 4Job Lifecycle
1. Action called (e.g., df.count()) ↓2. DAGScheduler receives the job - Builds the logical plan - Identifies shuffle boundaries → creates stages ↓3. DAGScheduler submits stages in dependency order - Stage N must complete before Stage N+1 ↓4. TaskScheduler takes each stage - Creates one Task per partition - Assigns tasks to available executor cores ↓5. Executors run tasks in parallel - Each task reads its partition and applies transformations ↓6. Driver collects results (for collect/show/count) - Or confirms files written (for save operations) ↓7. Job completesInside a Job With Shuffles
df = spark.read.parquet("transactions.parquet") # 50 partitions
# Stage 1: narrow transformations — 50 tasks run in parallelfiltered = df.filter(F.col("year") == 2025)
# → Shuffle boundary (groupBy hashes rows by customer_id)
# Stage 2: aggregate — 200 tasks (spark.sql.shuffle.partitions)result = filtered.groupBy("customer_id").agg(F.sum("amount"))
result.show() # 2 stages, 50 + 200 = 250 total tasksJob Grouping for Tracking
# Tag jobs for easier identification in the Spark UIwith spark.sparkContext.setJobGroup("etl-pipeline", "Daily ETL - sales aggregation"): result = df.groupBy("category").sum("amount") result.count()
# All jobs inside this block appear under "etl-pipeline" group in the UIFair Scheduling Between Jobs
By default, Spark uses FIFO — jobs run in submission order. For interactive workloads with multiple concurrent users, enable Fair Scheduling:
spark = SparkSession.builder \ .config("spark.scheduler.mode", "FAIR") \ .getOrCreate()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")df.count()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)Monitoring Jobs in the Spark UI
Open http://localhost:4040 while your application runs:
- Jobs tab: all completed and running jobs with duration, status, and stage counts
- Click a job → Stage detail and DAG visualization
- Click a stage → individual task durations, shuffle metrics, and GC time
Key metrics per job:
- Input Size — bytes read from storage
- Shuffle Read/Write — network data movement (high = bottleneck)
- Stage skipped — green stage; result was cached and reused