Amazon Web Services
Compute
- AWS EC2
- EC2 Instance Types
- EC2 Pricing Models
- EC2 Auto Scaling
- Elastic Load Balancing (ELB)
- AWS Lambda – Serverless Computing
- Amazon Lightsail
- AWS Elastic Beanstalk
- AWS Fargate
- Amazon ECS (Elastic Container Service)
- Amazon EKS (Elastic Kubernetes Service)
Storage
- S3 vs. EBS vs. EFS
- Amazon S3 (Simple Storage Service)
- Amazon S3 Storage Classes
- Amazon EBS (Elastic Block Store)
- Amazon EFS (Elastic File System)
- AWS Storage Gateway
- AWS Snowball
- Amazon FSx
- AWS Backup
Database Services
- Amazon RDS
- Amazon Aurora
- Amazon DynamoDB
- Amazon ElastiCache
- Amazon Redshift
- AWS Database Migration Service (DMS)
- Amazon Neptune
- Amazon DocumentDB
Networking and Content Delivery
- Amazon VPC
- Subnets
- Internet Gateway
- AWS Direct Connect
- Amazon Route 53
- Amazon CloudFront
- AWS Transit Gateway
- Elastic IP Addresses
DynamoDB
- DynamoDB Global Table vs Regular DynamoDB Table
- DynamoDB Streams
- Querying DynamoDB Data with Athena
- Using Athena Query Results with DynamoDB
- PySpark DataFrame to DynamoDB
Redshift
Lambda
Glue
Security
Amazon Redshift architecture
Imagine you have a library. Not a small, cozy one, but a colossal, city-sized library containing every book, report, and receipt your massive business has ever produced. Now, a colleague asks you a complex question: “What was the total revenue from our top-selling product in the Southwest region during the last holiday season, broken down by week?”
In a traditional library (a standard database), you’d be a lone librarian. You’d sprint through endless aisles, pulling every single book (row of data) off the shelf, opening it to find the specific revenue and product lines (columns), and then adding it all up on your calculator. This would take an impossibly long time.
Amazon Redshift is not a lone librarian. It’s an army of thousands of hyper-specialized librarians, each with a specific map and a specific job, working in perfect harmony to find your answer in seconds.
This article will be your guide to this army. We will deconstruct the Amazon Redshift architecture from the ground up, exploring the core principles that make it a powerhouse for analyzing petabytes of data. We’ll avoid overly technical jargon and focus on building a clear, intuitive understanding of how it all fits together.
Part 1: The Foundational Pillars - The “Why” Behind the Design
Before we look at the components, we must understand the three core ideas that form the bedrock of Redshift’s design. These principles answer the question: “Why is it built this way?”
1. Massively Parallel Processing (MPP): Teamwork Makes the Dream Work
The single most important concept in Redshift is Massively Parallel Processing (MPP). MPP is the “army of librarians” analogy.
The core idea is simple: break a big job into many small pieces, give each piece to a separate worker to handle simultaneously, and then combine the results.
- The Big Job: A complex SQL query asking a question of your entire dataset.
- The Workers: Individual computing units called nodes.
- The Result: The final answer, delivered in a fraction of the time it would take a single machine.
This architecture is the secret sauce. It allows Redshift to scale horizontally—meaning you add more machines to the cluster to handle more data and more complex queries—rather than vertically (upgrading a single machine to be bigger and more powerful, which has physical and financial limits).
2. Columnar Data Storage: The Organized Librarian
Traditional databases store data by row. Think of a spreadsheet where each row is a complete record. For example, a row for a single sale might look like: [Sale_ID, Date, Customer_ID, Product_ID, Quantity, Price, Store_Location].
To find the total revenue (sum of Price), the database must read every single row, which means loading all the data for Sale_ID, Date, Customer_ID, etc., into memory, only to then use the one Price column. This is incredibly inefficient for analytical queries that typically scan millions of rows but only care about a few columns.
Redshift stores data by column. It keeps all the Sale_ID values together, all the Date values together, all the Price values together, and so on.
Why is this a game-changer?
- Blazing-Fast Aggregations: To calculate total revenue, Redshift only needs to read the Price column. It can stream through billions of price values sequentially without being slowed down by all the other irrelevant data. This dramatically reduces the amount of data that must be read from disk, which is often the biggest bottleneck.
- Superior Compression: Data in a single column is all of the same type (e.g., all dates, all integers, all text). Similar data compresses much more efficiently than mixed data. Redshift employs sophisticated compression encodings (like delta, run-length, and dictionary encoding) on each column, often achieving compression ratios of 3x or more. This means:
  - Less Storage: Lower cost.
  - Less I/O: Less data to read from disk, meaning faster queries.
  - More Data in Memory: More of your frequently accessed data can fit in the cache, further speeding things up.
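To make this concrete, here is a minimal sketch, assuming a hypothetical sales table modeled on the example row above, showing how per-column compression encodings can be declared, along with the built-in command that asks Redshift to recommend encodings based on a sample of your data:

```sql
-- Hypothetical sales table with explicit per-column compression encodings.
-- Each column is stored (and compressed) separately, which is what makes
-- column-only scans and high compression ratios possible.
CREATE TABLE sales (
    sale_id        BIGINT        ENCODE az64,     -- compact numeric encoding
    sale_date      DATE          ENCODE az64,
    customer_id    BIGINT        ENCODE az64,
    product_id     BIGINT        ENCODE az64,
    quantity       INTEGER       ENCODE az64,
    price          DECIMAL(12,2) ENCODE az64,
    store_location VARCHAR(64)   ENCODE bytedict  -- dictionary encoding for low-cardinality text
);

-- Ask Redshift to sample the table and suggest the best encoding per column.
ANALYZE COMPRESSION sales;
```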
 
3. Separation of Compute and Storage: The Modern Blueprint
Older data warehouses tightly coupled the compute (the brains that process queries) and storage (the disks that hold the data). If you needed more storage, you were forced to buy more expensive compute, and vice versa.
Redshift’s modern architecture, particularly with its RA3 node type, separates these two things.
- Compute: Handled by the Redshift Cluster (the leader and compute nodes we’ll discuss next).
- Storage: Handled by Amazon S3, AWS’s incredibly durable, scalable, and cheap object storage service.
The compute nodes have high-performance, solid-state drives (SSDs), but these are used as a massive local cache for the “hot” data you’re actively querying. All the persistent, long-term data lives on S3. This separation is revolutionary because it allows you to:
- Scale your compute nodes up and down independently based on your query needs.
- Pay for the compute you use and the storage you consume separately, optimizing costs.
- Have a virtually limitless data capacity, as S3 can scale to exabytes.
Part 2: The Components - The “Who’s Who” in the Redshift Army
Now that we understand the principles, let’s meet the key players inside a Redshift cluster.
1. The Leader Node: The Master Strategist
The Leader Node is the brain of the operation. It’s the general who doesn’t fight on the front lines but plans the entire battle. It has two critical jobs:
- Client Communication: It is the endpoint for your connection. When your business intelligence tool (like Tableau or Looker) sends a SQL query, it talks to the Leader Node.
- Query Planning and Coordination: This is its masterpiece. The Leader Node:
  - Parses the SQL query to understand what is being asked.
  - Develops the Execution Plan: It breaks the monolithic query into a series of discrete steps that can be distributed to the compute nodes. It decides what each node will do, how data will move between them, and how the results will be combined. This plan is the blueprint for the entire army.
  - Compiles the code for each step.
  - Distributes the compiled code to the Compute Nodes.
  - Coordinates the execution, managing the flow of data between nodes.
  - Aggregates the final results from the Compute Nodes and sends them back to the client.
 
Crucially, the Leader Node does not store any user data. Its job is pure strategy.
2. The Compute Nodes: The Frontline Soldiers
Compute Nodes are the muscle. They are the soldiers who execute the plan given to them by the Leader Node. A multi-node cluster has at least two compute nodes and can scale out to dozens more, depending on the node type (in a single-node cluster, one node handles both the leader and compute roles).
Each Compute Node has its own:
- CPU and Memory (RAM): For processing data and performing calculations.
- Local SSD Storage: Used to hold its assigned “slice” of the overall database and to serve as a high-speed cache for data from S3.
Compute Nodes work in parallel. Each node operates on its own portion of the data simultaneously. If a query can be split into 100 parts across 100 nodes, it will theoretically run 100 times faster than on a single node.
3. Node Slices: The Specialists Within the Soldier
Each Compute Node is further divided into smaller logical units called slices. Think of a soldier with multiple arms, each capable of working on a separate task. A node with 4 vCPUs might have 2 slices, meaning each slice gets dedicated CPU and memory resources.
Each slice is responsible for processing a portion of the data on that node. Data distribution and parallel processing happen at the slice level, ensuring all resources on a node are fully utilized.
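If you are curious how many slices your own cluster has and how they map to nodes, a quick sketch (assuming the standard node and slice columns of the STV_SLICES system view) looks like this:

```sql
-- List each compute node and the slices it owns; one row per slice.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```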
Part 3: Data Layout & Movement - The “How” of Speed
Having an army is one thing; organizing its supplies and movements is another. Redshift provides powerful tools to organize data for maximum efficiency.
1. Data Distribution Styles: Placing the Data Wisely
When you load data into Redshift, it doesn’t just dump it randomly. You decide how to distribute the rows of a table across all the compute slices. This is the single most important factor in tuning for performance. There are four styles:
- KEY Distribution: This is the most powerful style, but it requires thought. You choose a single column as the DISTKEY. Redshift hashes the value of this column and uses the result to determine which slice should store that row. The golden rule: for your largest fact tables (e.g., sales), set the DISTKEY to the column you most commonly join on (e.g., user_id or product_id). This ensures that when two tables are joined on that key, the matching rows are already on the same slice, allowing the join to happen locally without moving data between nodes (a collocated join). This is a huge win for performance.
- EVEN Distribution: The data is distributed round-robin across all slices. It’s simple and ensures a balanced load, but it’s inefficient for joins because related data is not stored together, forcing data movement. Use this for staging tables or small tables that don’t participate in joins.
- ALL Distribution: A full copy of the entire table is placed on every compute node. This sounds wasteful, but it’s perfect for small, critical dimension tables (e.g., a products table or a stores table). When joining a large DISTKEY fact table to an ALL dimension table, each slice has all the dimension data it needs locally, enabling a collocated join and eliminating data movement.
- AUTO Distribution: Redshift manages the distribution style for you. It might choose ALL for very small tables and EVEN for larger ones. While convenient for beginners, manually specifying KEY or ALL is almost always better for large-scale performance.
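As a rough DDL sketch of how these styles are declared (the table and column names are illustrative, echoing the earlier examples), the distribution style is simply part of the CREATE TABLE statement:

```sql
-- Large fact table: KEY distribution on the join column,
-- so rows with the same user_id land on the same slice.
CREATE TABLE sales (
    sale_id BIGINT,
    user_id BIGINT,
    amount  DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (user_id);

-- Small dimension table: ALL distribution, a full copy on every node,
-- so joins against it never need to move data.
CREATE TABLE products (
    product_id   BIGINT,
    product_name VARCHAR(128)
)
DISTSTYLE ALL;

-- Staging table that is not joined: EVEN (round-robin) distribution.
CREATE TABLE staging_events (
    event_id BIGINT,
    payload  VARCHAR(4096)
)
DISTSTYLE EVEN;
```

DISTKEY can also be written as an attribute on the column itself; the table-level form is shown here only to keep the three styles easy to compare.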
2. Data Sorting: The Zone Maps
Even after data is on the right slice, you can organize it further within the slice. This is done using Sort Keys.
A Sort Key is one or more columns you define to physically sort the data on disk. This allows Redshift to use a powerful feature called zone maps.
How it works: For each block of data on disk, Redshift keeps metadata (the zone map) that records the minimum and maximum values for the sort key in that block.
Let’s say you have a Date column as your sort key and you run a query with WHERE Date = '2023-10-26'. Instead of scanning every single block of data, Redshift can quickly check the zone maps. It can instantly skip entire blocks where the minimum and maximum dates don’t include 2023-10-26. This process of skipping irrelevant data is called block pruning and can reduce the amount of data scanned by 99% or more on well-designed tables.
There are two types:
- Compound Sort Key: Sorts data by multiple columns in a defined order (e.g., first by year, then by month, then by day). Excellent for range-restricted queries.
- Interleaved Sort Key: Gives equal weight to every column in the sort key. Ideal for tables where queries might filter on any of several different columns. It’s more flexible but has higher maintenance overhead.
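Here is a sketch of how a compound sort key is declared on the hypothetical sales table, with the interleaved variant shown as a comment; sorting by date first is what lets zone maps prune blocks for date-filtered queries:

```sql
-- Compound sort key: rows are physically ordered by sale_date, then product_id.
-- A filter such as WHERE sale_date = '2023-10-26' lets Redshift skip every
-- block whose zone map (min/max sale_date) excludes that date.
CREATE TABLE sales (
    sale_id    BIGINT,
    sale_date  DATE,
    product_id BIGINT,
    amount     DECIMAL(12,2)
)
DISTKEY (product_id)
COMPOUND SORTKEY (sale_date, product_id);

-- Interleaved alternative, when queries filter on either column with similar frequency:
-- INTERLEAVED SORTKEY (sale_date, product_id);
```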
Part 4: The Complete Journey of a Query
Let’s tie it all together by following a single query from start to finish.
- Submission: You run SELECT SUM(sales.amount) FROM sales JOIN users ON sales.user_id = users.id WHERE users.region = 'West'; from Tableau.
- Strategy (Leader Node): The query arrives at the Leader Node, which parses the SQL and develops an execution plan. It sees that:
  - The sales table is distributed (DISTKEY) on user_id.
  - The users table is distributed (DISTKEY) on id.
  - The join is on user_id = id, meaning the data is collocated!
  - There’s a filter on users.region.
- Deployment (Leader Node): The Leader Node compiles the plan and sends instructions to every compute slice: “Filter your local slice of the users table for region = 'West', then join it to your local slice of the sales table on user_id, and then sum the amount for the results.”
- Execution (Compute Nodes): Each slice on each Compute Node works on its own piece of the data, in parallel. Because of the smart DISTKEY design, all the necessary data is already local. No data needs to be moved between nodes.
- Aggregation (Compute Nodes): Each compute node sends its local sum back to the Leader Node.
- Result (Leader Node): The Leader Node adds up all the partial sums into one grand total.
- Delivery: The Leader Node returns the final result (e.g., $12,456,789) to Tableau, which displays it on your dashboard.
This entire process, involving terabytes of data, can complete in a matter of seconds because of the coordinated, parallel architecture.
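If you want to verify that a join like this is collocated on your own tables, you can ask the Leader Node for its plan with EXPLAIN; below is a sketch using the query from this walkthrough. A join step labeled DS_DIST_NONE in the output means no rows had to be redistributed between nodes.

```sql
-- Ask the Leader Node for its execution plan without running the query.
EXPLAIN
SELECT SUM(sales.amount)
FROM sales
JOIN users ON sales.user_id = users.id
WHERE users.region = 'West';
-- Look for "DS_DIST_NONE" on the join step: the matching rows were already
-- on the same slices, so no network redistribution was needed.
```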
Conclusion: An Engine Built for Scale
The architecture of Amazon Redshift is not a random collection of parts; it’s a meticulously engineered system where every component plays a specific role in service of one goal: analyzing massive datasets at incredible speed.
From the strategic Leader Node to the parallel Compute Nodes, from the efficiency of columnar storage to the brilliance of MPP, and from the smart placement of distribution keys to the data-skipping power of sort keys, every feature is designed to work in concert.
For a new learner, the key takeaway is this: Redshift’s power doesn’t come from magic; it comes from parallelism and intelligent data organization. By understanding these core principles, you can move from just using Redshift to truly harnessing its capabilities, designing your tables and writing your queries in a way that makes this powerful engine sing.