Spark Jobs

A job is the highest-level unit of work in Spark. Every time you call an action (.count(), .collect(), .save()) on a DataFrame or RDD, Spark creates a job to compute the result. A single Spark application can run many jobs sequentially or in parallel.

Job → Stage → Task Hierarchy

Application
└── Job (1 per action)
    └── Stage (1 per shuffle boundary)
        └── Task (1 per partition)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("Jobs Demo").getOrCreate()
df = spark.read.parquet("sales.parquet")

# Each action below submits one job to the cluster
df.count()                                   # Job 1
df.filter(F.col("amount") > 100).count()     # Job 2
df.groupBy("region").sum("amount").show()    # Job 3 (shuffle → 2 stages)
df.write.mode("overwrite").parquet("out/")   # Job 4

Job Lifecycle

1. Action called (e.g., df.count())
        ↓
2. DAGScheduler receives the job
   - Builds the logical plan
   - Identifies shuffle boundaries → creates stages
        ↓
3. DAGScheduler submits stages in dependency order
   - Stage N must complete before Stage N+1
        ↓
4. TaskScheduler takes each stage
   - Creates one Task per partition
   - Assigns tasks to available executor cores
        ↓
5. Executors run tasks in parallel
   - Each task reads its partition and applies transformations
        ↓
6. Driver collects results (for collect/show/count)
   - Or confirms files written (for save operations)
        ↓
7. Job completes

Inside a Job With Shuffles

df = spark.read.parquet("transactions.parquet")   # 50 partitions

# Stage 1: narrow transformations — 50 tasks run in parallel
filtered = df.filter(F.col("year") == 2025)

# → Shuffle boundary (groupBy hashes rows by customer_id)

# Stage 2: aggregate — 200 tasks (spark.sql.shuffle.partitions)
result = filtered.groupBy("customer_id").agg(F.sum("amount"))

result.show()   # 2 stages, 50 + 200 = 250 total tasks

Job Grouping for Tracking

# Tag jobs for easier identification in the Spark UI
with spark.sparkContext.setJobGroup("etl-pipeline", "Daily ETL - sales aggregation"):
    result = df.groupBy("category").sum("amount")
    result.count()

# All jobs inside this block appear under "etl-pipeline" group in the UI

Fair Scheduling Between Jobs

By default, Spark uses FIFO — jobs run in submission order. For interactive workloads with multiple concurrent users, enable Fair Scheduling:

spark = SparkSession.builder \
    .config("spark.scheduler.mode", "FAIR") \
    .getOrCreate()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")
df.count()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)

Monitoring Jobs in the Spark UI

Open http://localhost:4040 while your application runs:

Jobs tab: all completed and running jobs with duration, status, and stage counts
Click a job → Stage detail and DAG visualization
Click a stage → individual task durations, shuffle metrics, and GC time

Key metrics per job:

Input Size — bytes read from storage
Shuffle Read/Write — network data movement (high = bottleneck)
Stage skipped — green stage; result was cached and reused

Written by NPBlue Data Team — Data Platform Engineers who runs distributed Spark workloads in production and tunes them for scale.

Reviewed for technical accuracy. Spot an error? Let us know.