Apache Spark APIs
Spark provides a unified multi-language API stack covering batch processing, streaming, SQL, machine learning, and graph computation. All APIs share the same execution engine and scheduler.
API Overview
```
Apache Spark APIs
├── Spark Core (RDD API) - Low-level distributed collections
├── Spark SQL (DataFrame API) - Optimized structured data processing
├── Structured Streaming - Real-time streaming with DataFrame API
├── MLlib - Distributed machine learning
└── GraphX (Scala only) - Graph computation
```

1. RDD API (Spark Core)

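As a point of comparison, this section's word count can be mirrored on a local list in plain Python. This is a minimal sketch rather than Spark code: `word_count` and the sample lines are hypothetical, and everything runs in a single process, whereas the RDD version applies the same steps across partitions.

```python
from collections import defaultdict


def word_count(lines):
    """Mirror flatMap -> map -> reduceByKey -> sortBy on a local list."""
    # flatMap: split every line and flatten the pieces into one word list
    words = [word for line in lines for word in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts that share the same key
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    # sortBy(..., ascending=False): order by count, largest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


sample = ["to be or not to be", "to see or not"]  # hypothetical input lines
top = word_count(sample)
print(top[0])  # the most frequent word and its count
```

The key design point carried over to Spark is that `reduceByKey` needs an associative, commutative function (here, addition), so partial sums can be combined in any order.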
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("APIs").getOrCreate()
sc = spark.sparkContext

# Word count, the "hello world" of Spark
counts = (
    sc.textFile("s3://bucket/text/")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: x[1], ascending=False)
)

counts.take(10)
```

2. DataFrame API (Spark SQL)

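To make the aggregation in this section concrete, here is a rough plain-Python equivalent of its filter, group, aggregate, and sort steps. It is only a sketch under stated assumptions: the rows are invented sample data, `summarize` is a hypothetical helper, and no Catalyst optimization is involved.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the sales data
rows = [
    {"year": 2025, "region": "EU", "product_category": "books", "revenue": 10.0, "customer_id": "c1"},
    {"year": 2025, "region": "EU", "product_category": "books", "revenue": 5.0, "customer_id": "c1"},
    {"year": 2025, "region": "EU", "product_category": "toys", "revenue": 20.0, "customer_id": "c2"},
    {"year": 2024, "region": "US", "product_category": "books", "revenue": 7.0, "customer_id": "c3"},
]


def summarize(rows, year):
    """Return (region, category, total, unique_customers), largest total first."""
    totals = defaultdict(float)
    customers = defaultdict(set)
    for row in rows:
        if row["year"] != year:  # filter(F.col("year") == year)
            continue
        key = (row["region"], row["product_category"])  # groupBy(...)
        totals[key] += row["revenue"]  # F.sum("revenue")
        customers[key].add(row["customer_id"])  # F.countDistinct("customer_id")
    result = [(k[0], k[1], totals[k], len(customers[k])) for k in totals]
    return sorted(result, key=lambda r: r[2], reverse=True)  # orderBy(total desc)


summary = summarize(rows, 2025)
for line in summary:
    print(line)
```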
```python
from pyspark.sql import functions as F

df = spark.read.parquet("sales.parquet")

result = (
    df.filter(F.col("year") == 2025)
    .groupBy("region", "product_category")
    .agg(
        F.sum("revenue").alias("total"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
    .orderBy(F.col("total").desc())
)

result.show(20)
```

3. SQL API

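One detail worth knowing about the query in this section: `RANK()` gives tied rows the same rank and then skips ahead, so a `rank <= 3` filter can return more than three rows per region when totals tie. The plain-Python sketch here imitates that numbering on a hypothetical list of totals; `sql_rank` is an illustrative helper, not a Spark function.

```python
def sql_rank(totals):
    """Return (value, rank) pairs using RANK() semantics, largest first."""
    ordered = sorted(totals, reverse=True)
    ranked = []
    for position, value in enumerate(ordered, start=1):
        if ranked and ranked[-1][0] == value:
            rank = ranked[-1][1]  # tie: reuse the previous row's rank
        else:
            rank = position  # otherwise the rank is the row position
        ranked.append((value, rank))
    return ranked


totals = [500, 300, 300, 300, 100]  # hypothetical category totals for one region
ranked = sql_rank(totals)
top3 = [pair for pair in ranked if pair[1] <= 3]
print(ranked)
print(len(top3))
```

With these values the filter keeps four rows, because three categories share rank 2. When exactly three rows per region are required, `ROW_NUMBER()` is the usual substitute.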
```python
df.createOrReplaceTempView("sales")

spark.sql("""
    WITH ranked AS (
        SELECT
            region,
            product_category,
            SUM(revenue) AS total,
            RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rank
        FROM sales
        WHERE year = 2025
        GROUP BY region, product_category
    )
    SELECT * FROM ranked WHERE rank <= 3
""").show()
```

4. Structured Streaming API

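The streaming example in this section groups events into 5-minute windows that advance every minute, which means each event is counted in several overlapping windows. The arithmetic can be sketched in plain Python as follows. The sketch assumes windows are half-open `[start, end)` intervals aligned to the epoch with no start offset; `windows_for` and the sample timestamp are hypothetical.

```python
def windows_for(ts, window=300, slide=60):
    """List the [start, end) windows, in seconds, that contain timestamp ts."""
    # Latest window start at or before ts, aligned to a multiple of `slide`
    start = ts - (ts % slide)
    found = []
    # Walk backwards while the window still covers ts
    while start > ts - window:
        found.append((start, start + window))
        start -= slide
    return sorted(found)


event_ts = 1000  # hypothetical event time, in seconds since the epoch
for start, end in windows_for(event_ts):
    print(start, end)
```

An event therefore lands in `window / slide` windows (five here), so a sliding aggregation keeps proportionally more state than a tumbling one, where the slide equals the window length.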
```python
from pyspark.sql.types import StructType, StringType, LongType

# Read from Kafka in real time
# (the Kafka source requires the spark-sql-kafka-0-10 connector package)
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "user-events")
    .load()
)

# Parse JSON payload
schema = (
    StructType()
    .add("user_id", StringType())
    .add("event_type", StringType())
    .add("timestamp_ms", LongType())
)

events = (
    kafka_df
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
    # Derive the event-time column from the epoch-millisecond field;
    # the watermark and window need a timestamp column
    .withColumn("event_time", (F.col("timestamp_ms") / 1000).cast("timestamp"))
)

# Aggregate in a sliding window: 5-minute windows that advance every minute
windowed = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        F.window(F.col("event_time"), "5 minutes", "1 minute"),
        F.col("event_type"),
    )
    .count()
)

query = (
    windowed.writeStream
    .outputMode("update")
    .format("console")
    .trigger(processingTime="30 seconds")
    .start()
)

query.awaitTermination()
```

5. MLlib API

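The evaluator used in this section reports the area under the ROC curve, which is the default metric of `BinaryClassificationEvaluator`. One way to read that number is as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch here computes that pairwise form on invented labels and scores; it illustrates the interpretation and is not how Spark computes the metric. `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive correctly scored above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]  # hypothetical churn labels
scores = [0.9, 0.4, 0.5, 0.1]  # hypothetical model scores
auc = pairwise_auc(labels, scores)
print(f"AUC: {auc:.4f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, which gives a baseline for reading the figure printed by the pipeline.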
```python
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

df = spark.read.parquet("customer_features.parquet")

assembler = VectorAssembler(
    inputCols=["age", "tenure_months", "monthly_spend", "support_tickets"],
    outputCol="raw_features",
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
gbt = GBTClassifier(featuresCol="features", labelCol="churned", maxIter=50)

pipeline = Pipeline(stages=[assembler, scaler, gbt])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="churned")
print(f"AUC: {evaluator.evaluate(predictions):.4f}")
```

Choosing the Right API

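| API | Best For | Performance |
|---|---|---|
| RDD | Unstructured data, custom low-level logic | Good, but bypasses the Catalyst optimizer; Python lambdas add serialization overhead |
| DataFrame | Structured data, SQL-like operations | Best (Catalyst optimizer) |
| SQL | Analysts, reporting | Best (same optimized plans as DataFrame) |
| Structured Streaming | Real-time event processing | Same engine as DataFrame, run incrementally (micro-batches by default) |
| MLlib | Large-scale distributed ML | Good; pipelines run on DataFrames and scale out across the cluster |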