Amazon EMR: Managed Hadoop and Spark Clusters for Big Data Processing at Scale
Running Spark at scale means dealing with cluster configuration, node failures, dependency management, and cost control — and none of that has anything to do with the actual data problem you are trying to solve. Amazon EMR takes the infrastructure management off the table so you can focus on the Spark jobs themselves.
EMR is not just a Spark runner. It is a managed platform for the entire Hadoop ecosystem — Spark, Hive, Presto, Flink, HBase, and Hadoop MapReduce — on clusters that can scale from 1 node to hundreds, with EC2 pricing and spot instance support.
What EMR Actually Is
EMR deploys a cluster of EC2 instances and installs your chosen big data frameworks on them. The cluster runs inside your VPC, uses IAM roles for access control, and reads and writes data from S3 as its primary data store.
Your VPC | +-- EMR Cluster | +-- Master Node(s) | |-- YARN ResourceManager | |-- HDFS NameNode | |-- Spark Driver (in cluster mode) | +-- Core Nodes | |-- YARN NodeManager | |-- HDFS DataNode | |-- Spark Executors | +-- Task Nodes (optional) |-- YARN NodeManager (no HDFS) |-- Spark Executors |-- Ideal for Spot Instances
Data Storage: Amazon S3 (EMRFS)EMR uses EMRFS (EMR File System), a connector that allows Spark and Hive jobs to read and write S3 as if it were HDFS. This is the standard pattern — store data persistently in S3, not on HDFS, so the cluster can be terminated between jobs without losing data.
Cluster Types and Deployment Options
Long-Running vs Transient Clusters
A long-running cluster stays up continuously. You submit multiple jobs to it over time. This makes sense when you have a near-constant stream of work and the startup overhead of a cluster (5-10 minutes) would add unacceptable latency.
A transient cluster starts for one job and terminates when the job completes. This is the more common and cost-efficient pattern. Combined with Spot instances for task nodes, transient clusters can reduce compute costs by 60-80% compared to always-on infrastructure.
Instance Groups vs Instance Fleets
Instance Groups: You define one instance type per group (master, core, task). Simple but inflexible. If your chosen Spot instance type is not available, the cluster cannot provision task nodes.
Instance Fleets: You specify up to five instance types per fleet, with a target capacity in vCPUs or instances. EMR picks the cheapest available combination across your listed types to meet the target. This significantly improves Spot availability and reduces interruptions.
Task Fleet (target: 20 vCPUs) | +-- m5.xlarge (4 vCPUs) - Spot bid +-- m5a.xlarge (4 vCPUs) - Spot bid +-- m4.xlarge (4 vCPUs) - Spot bid +-- r5.xlarge (4 vCPUs) - Spot bid (higher memory if available)
EMR picks whichever instance types are cheapest and availableto meet the 20-vCPU target.EMR Serverless
EMR Serverless (GA since 2022) removes cluster management entirely. You submit a Spark or Hive job to an EMR Serverless Application, and EMR allocates the workers, runs the job, and releases the workers when done. You pay only for vCPU-seconds and GB-seconds consumed during job execution.
You --> Submit Spark job to EMR Serverless Application | +-- EMR allocates workers automatically +-- Job runs +-- Workers are released +-- You pay per second of actual compute usedEMR Serverless is the right choice when:
- Job frequency is low or unpredictable
- You want zero cluster management overhead
- You do not need specific instance types or cluster-level configuration
EMR Serverless is not ideal when:
- You need very fast startup times (pre-initialised capacity helps but adds cost)
- You need specific Spark configurations or custom JAR dependencies that require cluster-level setup
- You are running very long jobs where per-second billing exceeds the cost of a dedicated cluster
EMR on EKS
EMR on EKS runs Spark jobs on an existing Amazon EKS (Kubernetes) cluster. This is useful when your organisation has already standardised on Kubernetes for workloads and wants to run Spark in the same environment.
EKS Cluster | +-- EMR Virtual Cluster (logical namespace) | +-- Spark Driver Pod +-- Spark Executor Pods (scaled dynamically) +-- Data: S3 via EMRFSEMR on EKS gives you fine-grained Kubernetes resource quotas, namespace isolation between teams, and the ability to use Spot node groups for executor pods. The trade-off is that Kubernetes adds operational complexity — you need someone who knows both EMR and EKS.
Supported Frameworks
| Framework | Use Case |
|---|---|
| Apache Spark | Batch ETL, ML feature engineering, streaming |
| Apache Hive | SQL-on-Hadoop, batch queries |
| Presto / Trino | Interactive SQL, low-latency queries |
| Apache Flink | Stateful stream processing |
| HBase | Wide-column NoSQL on HDFS |
| Hadoop MapReduce | Legacy batch processing |
| JupyterHub | Interactive notebooks on the cluster |
Most new workloads use Spark. Presto/Trino is used when you need sub-second interactive query response. Flink is used for streaming pipelines that require exactly-once semantics.
Cost Optimisation
EMR costs come from EC2 instances plus a small EMR management fee per instance-hour.
Use Spot for task nodes: Task nodes hold no HDFS data, so Spot interruptions only slow the job, they do not corrupt it. Spot can be 60-80% cheaper than On-Demand. Always keep at least one On-Demand master and enough On-Demand core nodes for your HDFS replication factor.
Right-size the cluster: Use the Spark UI to check whether executors are memory-constrained (look for GC overhead) or CPU-constrained (look for CPU utilisation). Over-provisioning memory is a common and expensive mistake.
Use transient clusters: Spin up for the job, terminate when done. An idle cluster costs the same as a busy one.
Enable auto-scaling: For long-running clusters, EMR auto-scaling adds and removes task nodes based on YARN pending memory or a custom CloudWatch metric. This prevents paying for idle capacity during off-peak hours.
Use S3 instead of HDFS: Store all persistent data in S3. HDFS requires core nodes with storage, which are more expensive than task nodes. With S3 as the data store, you can run a minimal number of core nodes.
Real-World Scenario: Daily Log Processing Pipeline
A SaaS company processes 500 GB of application logs per day. The data arrives in S3 throughout the day in compressed JSON format. Each night at 02:00, a transient EMR cluster runs four Spark jobs:
- Parse raw JSON and extract structured fields
- Join events with a user dimension table
- Aggregate session metrics by user and day
- Write output to S3 in Parquet, partitioned by date
The cluster uses 1 On-Demand m5.xlarge master, 2 On-Demand m5.2xlarge core nodes, and 8 Spot m5.xlarge task nodes (instance fleet with three alternative types). The cluster runs for 45 minutes and terminates automatically. Total cost: approximately 180+ per month.
Interview Notes
Q: What is the difference between core nodes and task nodes? Core nodes run YARN NodeManager and HDFS DataNode — they store data and process tasks. Task nodes run only YARN NodeManager — they process tasks but do not store HDFS data. Because task nodes hold no data, they can be Spot instances. Losing a task node slows the job; losing a core node can cause HDFS data loss if replication is insufficient.
Q: What is EMRFS, and why does it matter? EMRFS is a connector that allows EMR to use S3 as a distributed file system. It implements the Hadoop FileSystem API for S3, enabling Spark and Hive jobs to read and write S3 with the same API calls they would use for HDFS. EMRFS consistent view (now powered by S3 strong consistency) prevents issues where a file written by one node is not immediately visible to another.
Q: When would you choose EMR over Glue? Choose EMR when you need full control over Spark configuration, custom JARs or libraries not available in Glue, specific instance types, persistent clusters for interactive workloads (JupyterHub), or non-Spark frameworks like Flink or Presto. Choose Glue when you want zero cluster management and are building scheduled Spark batch ETL jobs.
Q: What is a bootstrap action in EMR? A shell script that runs on every node in the cluster before the framework software starts. Used to install additional libraries, set environment variables, configure system settings, or copy custom configuration files from S3.