SparkSession
SparkSession is the unified entry point for all Spark functionality introduced in Spark 2.0. It consolidates the older SparkContext, SQLContext, and HiveContext into a single object that manages your connection to the Spark cluster, data reading, SQL execution, and configuration.
Creating a SparkSession
from pyspark.sql import SparkSession
# Minimal — good for developmentspark = SparkSession.builder \ .appName("MyPipeline") \ .getOrCreate()
# Production configurationspark = SparkSession.builder \ .appName("DataPipeline") \ .master("yarn") \ .config("spark.executor.memory", "8g") \ .config("spark.executor.cores", "4") \ .config("spark.sql.shuffle.partitions", "200") \ .config("spark.sql.adaptive.enabled", "true") \ .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ .enableHiveSupport() \ .getOrCreate()
print(spark.version) # "3.5.1"getOrCreate() returns an existing session if one already exists in the JVM — safe to call multiple times across modules and notebook cells.
Reading Data
# CSVdf = spark.read \ .option("header", "true") \ .option("inferSchema", "true") \ .csv("s3://bucket/sales.csv")
# Parquet — schema embedded in file, no options neededdf = spark.read.parquet("s3://bucket/transactions/")
# JSON (multi-line format)df = spark.read.option("multiLine", "true").json("events/*.json")
# Delta Lake (recommended for 2025 lakehouses)df = spark.read.format("delta").load("s3://bucket/delta/employees/")
# Database via JDBCdf = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql://host:5432/db") \ .option("dbtable", "public.orders") \ .option("user", "spark") \ .option("password", "secret") \ .option("numPartitions", "16") \ .load()SQL Queries
# Register DataFrame as a SQL viewdf.createOrReplaceTempView("sales")
result = spark.sql(""" SELECT region, SUM(revenue) AS total_revenue, COUNT(DISTINCT customer_id) AS unique_customers FROM sales WHERE year = 2025 GROUP BY region ORDER BY total_revenue DESC""")result.show()
# Global view — accessible from other sessionsdf.createOrReplaceGlobalTempView("global_sales")spark.sql("SELECT COUNT(*) FROM global_temp.global_sales").show()Runtime Configuration
# Get and set runtime configspark.conf.get("spark.sql.shuffle.partitions") # "200"spark.conf.set("spark.sql.shuffle.partitions", "400")
# Enable Adaptive Query Execution (auto-tunes shuffle partitions)spark.conf.set("spark.sql.adaptive.enabled", "true")spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")Accessing SparkContext
sc = spark.sparkContextrdd = sc.parallelize([1, 2, 3, 4, 5])sc.getConf().getAll() # All Spark settingsStopping the Session
try: result = spark.read.parquet("data/").count() print(f"Rows: {result}")finally: spark.stop() # Always stop in scripts; notebooks usually leave it running