Spark for the word count program

Using Spark for the word count program offers several advantages. Firstly, Spark’s ability to distribute data across multiple nodes in a cluster allows for parallel processing, dramatically reducing the processing time for large datasets. Additionally, Spark’s resilient distributed datasets (RDDs) enable fault tolerance, ensuring that processing continues even if a node fails. This makes Spark an ideal choice for handling large-scale word count tasks in a robust manner.

Program:

Step 1: Create a SparkSession

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
  .master("local[1]") \
  .appName("NpBlue.com") \
  .getOrCreate()
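
A note on the configuration: master("local[1]") runs Spark locally on a single thread, which keeps this tutorial simple but gives no parallelism. As a sketch of a common variation (not part of the original program), local[*] uses all available local cores; on a real cluster you would typically omit the master setting and let spark-submit supply it.

# Variation: use all local cores instead of a single thread
# (for local development only; omit .master() when submitting to a cluster)
spark = SparkSession.builder \
  .master("local[*]") \
  .appName("NpBlue.com") \
  .getOrCreate()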

Step 2: Create an RDD of input lines

# Create an RDD from two in-memory lines of text
# (note: the first line ends with a trailing space, which will later
# produce one empty-string "word" when splitting on spaces)
lines_rdd = spark.sparkContext.parallelize([
    "Using Spark for the word count program offers several advantages. Firstly, Spark's ability to distribute data across multiple nodes in a cluster allows for parallel processing, dramatically reducing the processing time for large datasets. ",
    "Additionally, Spark's resilient distributed datasets (RDDs) enable fault tolerance, ensuring that processing continues even if a node fails. This makes Spark an ideal choice for handling large-scale word count tasks in a robust manner."
])
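
The input is hard-coded with parallelize so the example is self-contained. In a real job the text would usually come from storage instead; a minimal sketch, assuming a hypothetical file named input.txt:

# Alternative: read lines from a text file
# ("input.txt" is a placeholder path used for illustration)
lines_rdd = spark.sparkContext.textFile("input.txt")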

Step 3: Split the lines into individual words

# Using flatMap to split lines into individual words
words_rdd = lines_rdd.flatMap(lambda line: line.split(" "))
# foreach runs on the executors; in local mode the printed words
# appear in the console, though their order is not guaranteed
words_rdd.foreach(print)
Using
Spark
for
the
word
count
program
offers
several
advantages.
Firstly,
Spark's
ability
to
distribute
data
across
multiple
nodes
in
a
cluster
allows
... and so on
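
Notice that splitting on a single space keeps punctuation attached ("advantages." and "Firstly," above) and treats differently cased words as distinct. If a normalized count is wanted, one common refinement (an assumption, not applied in this tutorial) is to lower-case each line and split on non-word characters:

import re

# Optional refinement: normalize case and strip punctuation
normalized_words_rdd = lines_rdd.flatMap(lambda line: re.split(r"\W+", line.lower())) \
    .filter(lambda word: word != "")  # drop empty strings left by the split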

Step 4: Map each word to a (word, 1) pair

# Using map to convert each word into a key-value pair, with the word as the key and count 1 as the value
word_count_pairs_rdd = words_rdd.map(lambda word: (word, 1))
word_count_pairs_rdd.foreach(print)


('Using', 1)
('Spark', 1)
('for', 1)
('the', 1)
('word', 1)
('count', 1)
('program', 1)
('offers', 1)
('several', 1)
('advantages.', 1)
('Firstly,', 1)
... and so on
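
These (word, 1) pairs set up the classic map-reduce pattern: every occurrence contributes a count of 1 under its word's key, ready to be summed per key in the next step. As an aside, for small results PySpark's countByValue() collapses Steps 4 through 6 into one driver-side call; a sketch:

# Alternative for small results: count directly on the driver
# (the reduceByKey approach below scales better for large vocabularies)
counts = words_rdd.countByValue()
for word, count in counts.items():
    print(f"{word}: {count}")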

Step 5: Sum the counts for each word

# Using reduceByKey to sum the counts for each word
word_counts_rdd = word_count_pairs_rdd.reduceByKey(lambda x, y: x + y)
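
reduceByKey combines counts within each partition before shuffling data across the cluster, which is what keeps this step efficient at scale. If the result should be ordered by frequency, you can sort before collecting; a sketch of one way to do it:

# Optional: order words by descending count before collecting
sorted_counts_rdd = word_counts_rdd.sortBy(lambda pair: pair[1], ascending=False)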

Step 6: Collect and display the result

# Collecting and displaying the word count result
result = word_counts_rdd.collect()
for word, count in result:
    print(f"{word}: {count}")
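
collect() brings the whole result back to the driver, which is fine for two sentences of input but can exhaust driver memory on a large vocabulary. For big outputs you would write to storage instead, and a standalone script should also stop the session when finished; a sketch, with a hypothetical output directory name:

# For large results, write to storage instead of collecting
# ("word_counts_out" is a placeholder; the directory must not already exist)
word_counts_rdd.saveAsTextFile("word_counts_out")

# Release the session's resources when done
spark.stop()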

Final output:

Using: 1
Spark: 2
for: 4
the: 2
word: 2
count: 2
program: 1
offers: 1
several: 1
advantages.: 1
Firstly,: 1
Spark's: 2
ability: 1
to: 1
distribute: 1
data: 1
across: 1
multiple: 1
nodes: 1
in: 2
a: 3
cluster: 1
allows: 1
parallel: 1
processing,: 1
dramatically: 1
reducing: 1
processing: 2
time: 1
large: 1
datasets.: 1
: 1
Additionally,: 1
resilient: 1
distributed: 1
datasets: 1
(RDDs): 1
enable: 1
fault: 1
tolerance,: 1
ensuring: 1
that: 1
continues: 1
even: 1
if: 1
node: 1
fails.: 1
This: 1
makes: 1
an: 1
ideal: 1
choice: 1
handling: 1
large-scale: 1
tasks: 1
robust: 1
manner.: 1
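
The blank entry with count 1 in the middle of the output is not a mistake: the first input line ends with a trailing space, so splitting on a single space produces one empty string there. For convenience, the six steps combine into the single runnable script below, identical to the steps above apart from the added spark.stop() at the end:

from pyspark.sql import SparkSession

# Step 1: Create the SparkSession
spark = SparkSession.builder \
  .master("local[1]") \
  .appName("NpBlue.com") \
  .getOrCreate()

# Step 2: Create an RDD of input lines
lines_rdd = spark.sparkContext.parallelize([
    "Using Spark for the word count program offers several advantages. Firstly, Spark's ability to distribute data across multiple nodes in a cluster allows for parallel processing, dramatically reducing the processing time for large datasets. ",
    "Additionally, Spark's resilient distributed datasets (RDDs) enable fault tolerance, ensuring that processing continues even if a node fails. This makes Spark an ideal choice for handling large-scale word count tasks in a robust manner."
])

# Step 3: Split lines into individual words
words_rdd = lines_rdd.flatMap(lambda line: line.split(" "))

# Step 4: Map each word to a (word, 1) pair
word_count_pairs_rdd = words_rdd.map(lambda word: (word, 1))

# Step 5: Sum the counts for each word
word_counts_rdd = word_count_pairs_rdd.reduceByKey(lambda x, y: x + y)

# Step 6: Collect and display the word count result
result = word_counts_rdd.collect()
for word, count in result:
    print(f"{word}: {count}")

# Stop the session when finished
spark.stop()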