Spark APIs with Examples

Apache Spark is an open-source, distributed computing system that offers a powerful framework for big data processing. It comes equipped with various APIs, each catering to specific tasks and functionalities. These APIs are the building blocks that empower developers to interact with Spark and harness its capabilities. Let’s explore these APIs one by one.

Spark Core API

The Spark Core API serves as the foundation for all other Spark APIs. It provides the resilient distributed dataset (RDD) abstraction along with the basic functionality for distributing data across a cluster and processing it in parallel. With the Spark Core API, you perform transformations and actions on your data.

Example: Calculating Pi with Spark Core


import random

from pyspark import SparkContext

sc = SparkContext("local", "Pi App")
n = 100000  # Number of sample points

def inside(p):
    # Draw a random point in the unit square and check whether it falls inside the unit circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(1, n + 1)).filter(inside).count()
print("Pi is roughly", 4.0 * count / n)

In this example, we use Spark Core to calculate an approximation of the mathematical constant π using a Monte Carlo method.
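
RDD operations come in two flavors: transformations (such as map and filter) are lazy and only describe the computation, while actions (such as collect and reduce) trigger it. The following minimal sketch illustrates the distinction on a small list of sample numbers:

from pyspark import SparkContext

sc = SparkContext("local", "RDD Basics")
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing executes yet
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation
print(evens.collect())                      # [4, 16]
print(squares.reduce(lambda a, b: a + b))   # 55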

Spark SQL API

Spark SQL allows you to work with structured data using SQL queries. It seamlessly integrates with data sources like Parquet, Avro, ORC, and JSON. With Spark SQL, you can perform complex data manipulations and query large datasets effortlessly.

Example: Querying Data with Spark SQL



from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load a JSON file into a DataFrame and register it as a temporary view
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

# Query the view with standard SQL
result = spark.sql("SELECT * FROM people WHERE age >= 21")
result.show()

Here, we use Spark SQL to load a JSON file, create a temporary view, and execute SQL queries on it.
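
The same filter can also be expressed through the DataFrame API instead of an SQL string, and other supported formats such as Parquet are read in the same way. A short sketch, assuming the same people.json file and an illustrative people.parquet path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameAPIExample").getOrCreate()
df = spark.read.json("people.json")

# Equivalent query expressed with DataFrame methods rather than an SQL string
df.filter(df.age >= 21).show()

# Columnar formats such as Parquet are loaded the same way (illustrative path)
parquet_df = spark.read.parquet("people.parquet")
parquet_df.printSchema()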

Spark Streaming API

The Spark Streaming API enables real-time processing of data streams. It ingests data from various sources, such as Kafka, Flume, and HDFS, and processes it in mini-batches. This makes it an ideal choice for applications that require real-time data analytics.

Example: Streaming Word Count


from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Read lines from a TCP socket (e.g. one started with: nc -lk 9999)
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()

ssc.start()
ssc.awaitTermination()


In this example, we use Spark Streaming to count the occurrences of each word arriving on a socket, processing the stream in one-second micro-batches.
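
Because DStreams are processed in mini-batches, windowed aggregations are also straightforward. The sketch below counts words over a sliding window; the 30-second window and 10-second slide are arbitrary illustrative values:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedWordCount")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")  # recommended for windowed and stateful operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

# Count each word over a 30-second window, recomputed every 10 seconds
windowedCounts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)
windowedCounts.pprint()

ssc.start()
ssc.awaitTermination()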

Spark MLlib API

The Spark MLlib API is a machine learning library that provides a wide range of machine learning algorithms and tools. It simplifies the process of building, training, and deploying machine learning models.

Example: Machine Learning with Spark MLlib



from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Load data (sample_kmeans_data.txt ships with the Spark distribution)
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train a K-Means model with two clusters
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate the clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

  

Here, we use Spark MLlib to perform clustering using the K-Means algorithm.
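
Beyond individual algorithms, MLlib lets you chain feature transformers and estimators into a Pipeline, which covers the building and training workflow end to end. A minimal sketch with logistic regression on a toy DataFrame; the column names, feature values, and labels are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

# Toy training data: two numeric features and a binary label (illustrative values)
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
    ["f1", "f2", "label"])

# Assemble the raw columns into a feature vector, then fit a logistic regression model
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train)
model.transform(train).select("f1", "f2", "label", "prediction").show()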

Spark GraphX API

Spark GraphX is a graph computation library for graph-based data processing and analysis, suited to tasks like social network analysis and recommendation systems. GraphX itself exposes Scala and Java APIs only; from Python, comparable functionality is available through the separate GraphFrames package, which builds graph operations on top of DataFrames.

Example: Graph Analysis with GraphFrames in PySpark



from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate GraphFrames package, added e.g. via --packages

spark = SparkSession.builder.appName("GraphFramesExample").getOrCreate()

# The vertex DataFrame needs an "id" column; the edge DataFrame needs "src" and "dst"
vertices = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
edges = spark.createDataFrame(
    [(1, 2, "friend"), (2, 3, "follow")], ["src", "dst", "relationship"])

graph = GraphFrame(vertices, edges)
graph.vertices.show()
graph.inDegrees.show()

In this example, we use the GraphFrames package from PySpark to build a small graph out of vertex and edge DataFrames and inspect its vertices and in-degrees.
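
For the social network and recommendation tasks mentioned above, GraphFrames also ships standard graph algorithms. A short sketch running PageRank on the graph built above; the resetProbability and maxIter values are illustrative:

# Rank vertices by influence with PageRank (continues from the graph above)
results = graph.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "name", "pagerank").show()
results.edges.select("src", "dst", "weight").show()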