Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Apache Spark APIs

Spark provides a unified multi-language API stack covering batch processing, streaming, SQL, machine learning, and graph computation. All APIs share the same execution engine and scheduler.


API Overview

Apache Spark APIs
├── Spark Core (RDD API) — Low-level distributed collections
├── Spark SQL (DataFrame API) — Optimized structured data processing
├── Structured Streaming — Real-time streaming with DataFrame API
├── MLlib — Distributed machine learning
└── GraphX (Scala only) — Graph computation

1. RDD API (Spark Core)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("APIs").getOrCreate()
sc = spark.sparkContext
# Word count — the "hello world" of Spark
counts = sc.textFile("s3://bucket/text/") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b) \
.sortBy(lambda x: x[1], ascending=False)
counts.take(10)

2. DataFrame API (Spark SQL)

from pyspark.sql import functions as F
df = spark.read.parquet("sales.parquet")
result = df \
.filter(F.col("year") == 2025) \
.groupBy("region", "product_category") \
.agg(
F.sum("revenue").alias("total"),
F.countDistinct("customer_id").alias("unique_customers")
) \
.orderBy(F.col("total").desc())
result.show(20)

3. SQL API

df.createOrReplaceTempView("sales")
spark.sql("""
WITH ranked AS (
SELECT
region,
product_category,
SUM(revenue) AS total,
RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rank
FROM sales
WHERE year = 2025
GROUP BY region, product_category
)
SELECT * FROM ranked WHERE rank <= 3
""").show()

4. Structured Streaming API

# Read from Kafka in real time
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
.option("subscribe", "user-events") \
.load()
# Parse JSON payload
from pyspark.sql.types import StructType, StringType, LongType
schema = StructType() \
.add("user_id", StringType()) \
.add("event_type", StringType()) \
.add("timestamp_ms", LongType())
events = kafka_df.select(
F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")
# Aggregate in a sliding window
windowed = events \
.withWatermark("event_time", "10 minutes") \
.groupBy(
F.window(F.col("event_time"), "5 minutes", "1 minute"),
F.col("event_type")
) \
.count()
query = windowed.writeStream \
.outputMode("update") \
.format("console") \
.trigger(processingTime="30 seconds") \
.start()
query.awaitTermination()

5. MLlib API

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
df = spark.read.parquet("customer_features.parquet")
assembler = VectorAssembler(
inputCols=["age", "tenure_months", "monthly_spend", "support_tickets"],
outputCol="raw_features"
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
gbt = GBTClassifier(featuresCol="features", labelCol="churned", maxIter=50)
pipeline = Pipeline(stages=[assembler, scaler, gbt])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="churned")
print(f"AUC: {evaluator.evaluate(predictions):.4f}")

Choosing the Right API

APIBest ForPerformance
RDDUnstructured data, complex logicGood
DataFrameStructured data, SQL-like opsBest (Catalyst optimizer)
SQLAnalysts, reportingBest (same as DataFrame)
StreamingReal-time event processingExcellent
MLlibLarge-scale distributed MLGood