Apache Spark
- Apache Spark: Big Data Processing & Analytics
- Spark DataFrames: Features, Use Cases & Optimization for Big Data
- Spark Architecture
- Creating a DataFrame from a file
- Creating a PySpark DataFrame from collections
- Saving a Spark DataFrame as CSV
- Saving a DataFrame as Parquet
- DataFrame show() vs take() methods
- Apache SparkSession
- Understanding Apache Spark RDDs
- Creating a Spark RDD from a collection
- Different methods to print data from an RDD
- Practical use of the unionByName method
- Creating Spark DataFrames: Methods & Examples
- Setup Spark in PyCharm
- Apache Spark: All APIs
- Spark word count program
- Spark Accumulators
- aggregateByKey in Apache Spark
- Spark Broadcast with Examples
- Spark combineByKey
- Using countByKey in Apache Spark
- Spark CrossJoin: All You Need to Know
- Optimizing Spark groupByKey: Usage, Best Practices, and Examples
- Mastering Spark Joins: Inner, Outer, Left, Right & Semi Joins Explained
- Apache Spark: Local Mode vs Cluster Mode - Key Differences & Examples
- Spark map vs flatMap: Key Differences with Examples
- Efficient Data Processing with Spark mapPartitionsWithIndex
- Spark reduceByKey with 5 Real-World Examples
- Spark Union vs UnionAll vs Union Available – Key Differences & Examples
Spark APIs with Examples
Apache Spark is an open-source, distributed computing system that offers a powerful framework for big data processing. It comes equipped with various APIs, each catering to specific tasks and functionalities. These APIs are the building blocks that empower developers to interact with Spark and harness its capabilities. Let’s explore these APIs one by one.
Spark Core API
The Spark Core API serves as the foundation for all other Spark APIs. It provides the basic functionality for distributing data across a cluster as resilient distributed datasets (RDDs) and processing it in parallel. With the Spark Core API, you express work as transformations, which build a lazy computation plan, and actions, which trigger execution and return results.
Example: Calculating Pi with Spark Core
from pyspark import SparkContext
import random

sc = SparkContext("local", "Pi App")
n = 100000  # number of random points to sample

def inside(p):
    # p (the element index) is unused; we just draw a random point in the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(1, n + 1)).filter(inside).count()
print("Pi is roughly", 4.0 * count / n)
In this example, we use Spark Core to calculate an approximation of the mathematical constant π using a Monte Carlo method.
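The Pi job already follows this pattern, but to make the transformation/action split explicit, here is a minimal sketch (the numbers and lambdas are purely illustrative): transformations such as map and filter only record a lazy lineage, while actions such as collect and reduce actually run the job.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)                    # transformation: recorded, not executed
even_squares = squares.filter(lambda x: x % 2 == 0)   # transformation: still lazy
print(even_squares.collect())                         # action: triggers execution, prints [4, 16]
print(squares.reduce(lambda a, b: a + b))             # action: sums the squares, prints 55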
Spark SQL API
Spark SQL allows you to work with structured data using SQL queries. It seamlessly integrates with data sources like Parquet, Avro, ORC, and JSON. With Spark SQL, you can perform complex data manipulations and query large datasets effortlessly.
Example: Querying Data with Spark SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load a JSON file into a DataFrame and register it as a temporary SQL view
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

# Run a SQL query against the view
result = spark.sql("SELECT * FROM people WHERE age >= 21")
result.show()
Here, we use Spark SQL to load a JSON file, create a temporary view, and execute SQL queries on it.
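The same filter can also be written with the DataFrame API instead of a SQL string, and the result can be saved to a columnar format such as Parquet. This is a minimal sketch that reuses the people.json file from above, assuming it has name and age columns as in Spark's standard example; the output path is a made-up placeholder:
adults = df.filter(df.age >= 21).select("name", "age")            # DataFrame API equivalent of the SQL query
adults.write.mode("overwrite").parquet("people_adults.parquet")   # placeholder output path
spark.read.parquet("people_adults.parquet").show()                # read it back to verify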
Spark Streaming API
The Spark Streaming API enables real-time processing of data streams. It ingests data from various sources, such as Kafka, Flume, and HDFS, and processes it in mini-batches. This makes it an ideal choice for applications that require real-time data analytics.
Example: Streaming Word Count
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# Read lines from a TCP socket (e.g. started with `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()

ssc.start()
ssc.awaitTermination()
In this example, we use Spark Streaming to count words in real-time data streams.
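DStream transformations can also be applied over a sliding window. The sketch below extends the example above; the 30-second window and 10-second slide are arbitrary illustrative values (both must be multiples of the batch interval):
# Recompute word counts over the last 30 seconds, refreshed every 10 seconds
windowed_words = words.window(30, 10)
windowed_counts = windowed_words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
windowed_counts.pprint()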
Spark MLlib API
The Spark MLlib API is a machine learning library that provides a wide range of machine learning algorithms and tools. It simplifies the process of building, training, and deploying machine learning models.
Example: Machine Learning with Spark MLlib
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Load sample data in libsvm format (shipped with the Spark distribution)
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train a K-Means model with two clusters
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate the clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
Here, we use Spark MLlib to perform clustering using the K-Means algorithm.
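Clustering is just one part of MLlib. As a further sketch of the API, here is a minimal linear regression example on a tiny inline dataset; the column names and values are invented for illustration:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Toy data: predict y from a single feature x (values are illustrative)
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

lr = LinearRegression(featuresCol="features", labelCol="y")
model = lr.fit(train)
print("coefficients:", model.coefficients, "intercept:", model.intercept)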
Spark GraphX API
Spark GraphX is a graph computation library for graph-based data processing and analysis, suited to tasks like social network analysis and recommendation systems. Note that GraphX is exposed through Spark's Scala and Java APIs; there is no GraphX API in PySpark, so Python users typically turn to the separate GraphFrames package instead.
Example: Graph Analysis with Spark GraphX (Scala)
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("GraphXExample").getOrCreate()
val sc = spark.sparkContext

// Vertices are (VertexId, attribute) pairs; edges carry a relationship label
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "follow")))

val graph = Graph(vertices, edges)
graph.vertices.collect().foreach(println)
In this example, we use Spark GraphX via the Scala API to build a small property graph from vertex and edge RDDs and print its vertices.
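For PySpark users, the separate GraphFrames package provides comparable graph functionality on top of DataFrames. The sketch below is an illustration under assumptions: it requires the graphframes package to be available on the cluster (for example via spark-submit --packages), and the vertex and edge data mirror the toy graph above:
from graphframes import GraphFrame

# GraphFrames expects an "id" column for vertices and "src"/"dst" columns for edges
vertices = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
edges = spark.createDataFrame([(1, 2, "friend"), (2, 3, "follow")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.vertices.show()
g.inDegrees.show()  # number of incoming edges per vertex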