Amazon EMR: Managed Hadoop and Spark Clusters for Big Data Processing at Scale

Running Spark at scale means dealing with cluster configuration, node failures, dependency management, and cost control — and none of that has anything to do with the actual data problem you are trying to solve. Amazon EMR takes the infrastructure management off the table so you can focus on the Spark jobs themselves.

EMR is not just a Spark runner. It is a managed platform for the entire Hadoop ecosystem — Spark, Hive, Presto, Flink, HBase, and Hadoop MapReduce — on clusters that can scale from 1 node to hundreds, with EC2 pricing and spot instance support.

What EMR Actually Is

EMR deploys a cluster of EC2 instances and installs your chosen big data frameworks on them. The cluster runs inside your VPC, uses IAM roles for access control, and reads and writes data from S3 as its primary data store.

Your VPC
  |
  +-- EMR Cluster
        |
        +-- Master Node(s)
        |     |-- YARN ResourceManager
        |     |-- HDFS NameNode
        |     |-- Spark Driver (in cluster mode)
        |
        +-- Core Nodes
        |     |-- YARN NodeManager
        |     |-- HDFS DataNode
        |     |-- Spark Executors
        |
        +-- Task Nodes (optional)
              |-- YARN NodeManager (no HDFS)
              |-- Spark Executors
              |-- Ideal for Spot Instances

        Data Storage: Amazon S3 (EMRFS)

EMR uses EMRFS (EMR File System), a connector that allows Spark and Hive jobs to read and write S3 as if it were HDFS. This is the standard pattern — store data persistently in S3, not on HDFS, so the cluster can be terminated between jobs without losing data.

Cluster Types and Deployment Options

Long-Running vs Transient Clusters

A long-running cluster stays up continuously. You submit multiple jobs to it over time. This makes sense when you have a near-constant stream of work and the startup overhead of a cluster (5-10 minutes) would add unacceptable latency.

A transient cluster starts for one job and terminates when the job completes. This is the more common and cost-efficient pattern. Combined with Spot instances for task nodes, transient clusters can reduce compute costs by 60-80% compared to always-on infrastructure.

Instance Groups vs Instance Fleets

Instance Groups: You define one instance type per group (master, core, task). Simple but inflexible. If your chosen Spot instance type is not available, the cluster cannot provision task nodes.

Instance Fleets: You specify up to five instance types per fleet, with a target capacity in vCPUs or instances. EMR picks the cheapest available combination across your listed types to meet the target. This significantly improves Spot availability and reduces interruptions.

Task Fleet (target: 20 vCPUs)
  |
  +-- m5.xlarge  (4 vCPUs) - Spot bid
  +-- m5a.xlarge (4 vCPUs) - Spot bid
  +-- m4.xlarge  (4 vCPUs) - Spot bid
  +-- r5.xlarge  (4 vCPUs) - Spot bid (higher memory if available)

EMR picks whichever instance types are cheapest and available
to meet the 20-vCPU target.

EMR Serverless

EMR Serverless (GA since 2022) removes cluster management entirely. You submit a Spark or Hive job to an EMR Serverless Application, and EMR allocates the workers, runs the job, and releases the workers when done. You pay only for vCPU-seconds and GB-seconds consumed during job execution.

You --> Submit Spark job to EMR Serverless Application
           |
           +-- EMR allocates workers automatically
           +-- Job runs
           +-- Workers are released
           +-- You pay per second of actual compute used

EMR Serverless is the right choice when:

Job frequency is low or unpredictable
You want zero cluster management overhead
You do not need specific instance types or cluster-level configuration

EMR Serverless is not ideal when:

You need very fast startup times (pre-initialised capacity helps but adds cost)
You need specific Spark configurations or custom JAR dependencies that require cluster-level setup
You are running very long jobs where per-second billing exceeds the cost of a dedicated cluster

EMR on EKS

EMR on EKS runs Spark jobs on an existing Amazon EKS (Kubernetes) cluster. This is useful when your organisation has already standardised on Kubernetes for workloads and wants to run Spark in the same environment.

EKS Cluster
  |
  +-- EMR Virtual Cluster (logical namespace)
        |
        +-- Spark Driver Pod
        +-- Spark Executor Pods (scaled dynamically)
        +-- Data: S3 via EMRFS

EMR on EKS gives you fine-grained Kubernetes resource quotas, namespace isolation between teams, and the ability to use Spot node groups for executor pods. The trade-off is that Kubernetes adds operational complexity — you need someone who knows both EMR and EKS.

Supported Frameworks

Framework	Use Case
Apache Spark	Batch ETL, ML feature engineering, streaming
Apache Hive	SQL-on-Hadoop, batch queries
Presto / Trino	Interactive SQL, low-latency queries
Apache Flink	Stateful stream processing
HBase	Wide-column NoSQL on HDFS
Hadoop MapReduce	Legacy batch processing
JupyterHub	Interactive notebooks on the cluster

Most new workloads use Spark. Presto/Trino is used when you need sub-second interactive query response. Flink is used for streaming pipelines that require exactly-once semantics.

Cost Optimisation

EMR costs come from EC2 instances plus a small EMR management fee per instance-hour.

Use Spot for task nodes: Task nodes hold no HDFS data, so Spot interruptions only slow the job, they do not corrupt it. Spot can be 60-80% cheaper than On-Demand. Always keep at least one On-Demand master and enough On-Demand core nodes for your HDFS replication factor.

Right-size the cluster: Use the Spark UI to check whether executors are memory-constrained (look for GC overhead) or CPU-constrained (look for CPU utilisation). Over-provisioning memory is a common and expensive mistake.

Use transient clusters: Spin up for the job, terminate when done. An idle cluster costs the same as a busy one.

Enable auto-scaling: For long-running clusters, EMR auto-scaling adds and removes task nodes based on YARN pending memory or a custom CloudWatch metric. This prevents paying for idle capacity during off-peak hours.

Use S3 instead of HDFS: Store all persistent data in S3. HDFS requires core nodes with storage, which are more expensive than task nodes. With S3 as the data store, you can run a minimal number of core nodes.

Real-World Scenario: Daily Log Processing Pipeline

A SaaS company processes 500 GB of application logs per day. The data arrives in S3 throughout the day in compressed JSON format. Each night at 02:00, a transient EMR cluster runs four Spark jobs:

Parse raw JSON and extract structured fields
Join events with a user dimension table
Aggregate session metrics by user and day
Write output to S3 in Parquet, partitioned by date

The cluster uses 1 On-Demand m5.xlarge master, 2 On-Demand m5.2xlarge core nodes, and 8 Spot m5.xlarge task nodes (instance fleet with three alternative types). The cluster runs for 45 minutes and terminates automatically. Total cost: approximately $2.80 per night. An equivalent always-on setup would cost$ 180+ per month.

Interview Notes

Q: What is the difference between core nodes and task nodes? Core nodes run YARN NodeManager and HDFS DataNode — they store data and process tasks. Task nodes run only YARN NodeManager — they process tasks but do not store HDFS data. Because task nodes hold no data, they can be Spot instances. Losing a task node slows the job; losing a core node can cause HDFS data loss if replication is insufficient.

Q: What is EMRFS, and why does it matter? EMRFS is a connector that allows EMR to use S3 as a distributed file system. It implements the Hadoop FileSystem API for S3, enabling Spark and Hive jobs to read and write S3 with the same API calls they would use for HDFS. EMRFS consistent view (now powered by S3 strong consistency) prevents issues where a file written by one node is not immediately visible to another.

Q: When would you choose EMR over Glue? Choose EMR when you need full control over Spark configuration, custom JARs or libraries not available in Glue, specific instance types, persistent clusters for interactive workloads (JupyterHub), or non-Spark frameworks like Flink or Presto. Choose Glue when you want zero cluster management and are building scheduled Spark batch ETL jobs.

Q: What is a bootstrap action in EMR? A shell script that runs on every node in the cluster before the framework software starts. Used to install additional libraries, set environment variables, configure system settings, or copy custom configuration files from S3.