What Are Transformations in Apache Spark?
In Apache Spark, transformations are operations that create a new Resilient Distributed Dataset (RDD) or DataFrame from an existing one. Unlike actions (which trigger computation), transformations are lazy—they don’t execute immediately but instead build a logical execution plan.
Key Characteristics of Transformations:
✔ Lazy Evaluation – Computations happen only when an action is called (see the sketch after this list).
✔ Immutable – Original RDD/DataFrame remains unchanged.
✔ Optimized Execution – Spark optimizes transformations before running them.
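To see lazy evaluation in action, here is a minimal sketch (the app name and sample data are purely illustrative): defining the map transformation only records it in the plan, and nothing is computed until the collect() action runs.
from pyspark import SparkContext

sc = SparkContext("local", "LazyEvalDemo")
rdd = sc.parallelize([1, 2, 3, 4])

# Defining the transformation only adds a step to the plan; no computation happens yet
doubled = rdd.map(lambda x: x * 2)

# The action triggers execution of the whole chain
print(doubled.collect())  # Output: [2, 4, 6, 8]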
Why Are Transformations Important?
- Efficiency – Lazy evaluation avoids unnecessary computations.
- Fault Tolerance – Lineage (dependency graph) helps recover lost data (see the sketch below).
- Optimization – Spark’s Catalyst Optimizer improves query execution.
- Scalability – Works on distributed big data seamlessly.
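As a rough illustration of lineage, calling toDebugString() on an RDD prints the dependency graph Spark keeps so it can recompute lost partitions. This sketch assumes the SparkContext sc from the example above; the exact output varies by Spark version.
rdd = sc.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Print the lineage (dependency) graph used for fault-tolerant recomputation
print(doubled.toDebugString().decode("utf-8"))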
Must-Know Spark Transformations
1. map() – Apply a Function to Each Element
- Processes each element of an RDD/DataFrame.
- Returns a new RDD with transformed values.
Example 1: Convert Strings to Uppercase (RDD)
from pyspark import SparkContext

sc = SparkContext("local", "MapExample")
data = ["spark", "hadoop", "flink"]
rdd = sc.parallelize(data)

# Using map to uppercase each element
upper_rdd = rdd.map(lambda x: x.upper())
print(upper_rdd.collect())  # Output: ['SPARK', 'HADOOP', 'FLINK']
Example 2: Square Numbers (RDD)
numbers = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(numbers)

squared_rdd = num_rdd.map(lambda x: x ** 2)
print(squared_rdd.collect())  # Output: [1, 4, 9, 16, 25]
Example 3: Extract First Letter (DataFrame)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapExampleDF").getOrCreate()

data = [("alice",), ("bob",), ("charlie",)]
df = spark.createDataFrame(data, ["name"])

from pyspark.sql.functions import col

df_transformed = df.select(col("name").substr(1, 1).alias("first_letter"))
df_transformed.show()
# Output:
# +------------+
# |first_letter|
# +------------+
# |           a|
# |           b|
# |           c|
# +------------+
2. filter() – Select Elements Based on a Condition
- Returns a new RDD/DataFrame with elements that meet a condition.
Example 1: Filter Even Numbers (RDD)
numbers = [1, 2, 3, 4, 5, 6]
num_rdd = sc.parallelize(numbers)

even_rdd = num_rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())  # Output: [2, 4, 6]
Example 2: Filter Names Starting with ‘a’ (RDD)
names = ["alice", "bob", "anna", "dave"]
names_rdd = sc.parallelize(names)

filtered_names = names_rdd.filter(lambda x: x.startswith('a'))
print(filtered_names.collect())  # Output: ['alice', 'anna']
Example 3: Filter DataFrame Rows (Salary > 50000)
data = [("alice", 60000), ("bob", 45000), ("charlie", 70000)]
df = spark.createDataFrame(data, ["name", "salary"])

filtered_df = df.filter(df.salary > 50000)
filtered_df.show()
# Output:
# +-------+------+
# |   name|salary|
# +-------+------+
# |  alice| 60000|
# |charlie| 70000|
# +-------+------+
3. flatMap() – Transform and Flatten Results
- Applies a function to each element and flattens the results (unlike map, which keeps structure).
Example 1: Split Sentences into Words (RDD)
sentences = ["Hello world", "Apache Spark"]
sent_rdd = sc.parallelize(sentences)

words_rdd = sent_rdd.flatMap(lambda x: x.split(" "))
print(words_rdd.collect())  # Output: ['Hello', 'world', 'Apache', 'Spark']
Example 2: Generate Pairs from Numbers (RDD)
numbers = [1, 2, 3]
num_rdd = sc.parallelize(numbers)

pairs_rdd = num_rdd.flatMap(lambda x: [(x, x*1), (x, x*2)])
print(pairs_rdd.collect())
# Output: [(1, 1), (1, 2), (2, 2), (2, 4), (3, 3), (3, 6)]
Example 3: Explode Array Column (DataFrame)
from pyspark.sql.functions import explode

data = [("alice", ["java", "python"]), ("bob", ["scala"])]
df = spark.createDataFrame(data, ["name", "skills"])

exploded_df = df.select("name", explode("skills").alias("skill"))
exploded_df.show()
# Output:
# +-----+------+
# | name| skill|
# +-----+------+
# |alice|  java|
# |alice|python|
# |  bob| scala|
# +-----+------+
How to Remember Transformations for Interviews & Exams
- Lazy vs. Eager – Transformations are lazy, actions trigger execution.
- Common Transformations – map, filter, flatMap, groupBy, join.
- Think in Stages – Each transformation builds a step in the execution plan.
- Practice with Examples – Write small Spark jobs to reinforce concepts (a short example follows below).
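For practice, a small sketch along these lines (the data and column names are made up) shows how chained DataFrame transformations stay lazy and how explain() reveals the plan built by the Catalyst optimizer:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PlanDemo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "eng", 60000), ("bob", "hr", 45000), ("charlie", "eng", 70000)],
    ["name", "dept", "salary"],
)

# Chain transformations: nothing executes yet
high_paid = (df.filter(F.col("salary") > 50000)
               .groupBy("dept")
               .agg(F.avg("salary").alias("avg_salary")))

# explain() prints the plan produced by the Catalyst optimizer
high_paid.explain()

# show() is the action that actually triggers execution
high_paid.show()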
Conclusion
Spark transformations (map, filter, flatMap, etc.) are fundamental for efficient big data processing. They enable lazy evaluation, fault tolerance, and optimized execution.
Key Takeaways:
✅ Use map() for element-wise transformations.
✅ Apply filter() to select data conditionally.
✅ flatMap() helps flatten nested structures.
✅ Always remember: Transformations are lazy until an action is called!
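Putting these takeaways together, here is a minimal end-to-end sketch (the sample lines are illustrative, and it assumes a SparkContext sc as in the earlier examples) that chains flatMap(), filter(), map(), and reduceByKey(); nothing runs until the final collect() action:
lines = sc.parallelize(["spark makes big data simple",
                        "spark transformations are lazy"])

word_counts = (lines.flatMap(lambda line: line.split(" "))  # split lines into words
                    .filter(lambda word: len(word) > 3)     # keep longer words only
                    .map(lambda word: (word, 1))            # pair each word with a count
                    .reduceByKey(lambda a, b: a + b))       # sum counts per word

# Only this action triggers the whole chain
print(word_counts.collect())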
By mastering these concepts, you’ll be well-prepared for Spark interviews, exams, and real-world big data projects!