Amazon Web Services
Compute
- AWS EC2
- EC2 Instance Types
- EC2 Pricing Models
- EC2 Auto Scaling
- Elastic Load Balancing (ELB)
- AWS Lambda (Serverless Computing)
- Amazon Lightsail
- AWS Elastic Beanstalk
- AWS Fargate
- Amazon ECS (Elastic Container Service)
- Amazon EKS (Elastic Kubernetes Service)
Storage
- S3 vs. EBS vs. EFS
- Amazon S3 (Simple Storage Service)
- Amazon S3 Storage Classes
- Amazon EBS (Elastic Block Store)
- Amazon EFS (Elastic File System)
- AWS Storage Gateway
- AWS Snowball
- Amazon FSx
- AWS Backup
Database Services
- Amazon RDS
- Amazon Aurora
- Amazon DynamoDB
- Amazon ElastiCache
- Amazon Redshift
- AWS Database Migration Service (DMS)
- Amazon Neptune
- Amazon DocumentDB
Networking and Content Delivery
- Amazon VPC
- Subnets
- Internet Gateway
- AWS Direct Connect
- Amazon Route 53
- Amazon CloudFront
- AWS Transit Gateway
- Elastic IP Addresses
DynamoDB
- DynamoDB Global Table vs Regular DynamoDB Table
- DynamoDB Streams
- Querying DynamoDB Data with Athena
- Loading Athena Query Results into DynamoDB
- PySpark DataFrame to DynamoDB
Redshift
Lambda
Glue
Security
1. Core Architectural Features
These are the fundamental building blocks that define Redshift.
- Massively Parallel Processing (MPP): Redshift distributes and parallelizes data and query load across multiple nodes. This allows it to execute complex queries on large datasets incredibly fast by having many nodes work on the problem simultaneously.
- Columnar Data Storage: Unlike traditional row-based databases, Redshift stores data by column. This is ideal for data warehousing because:
  - Faster Aggregations: Queries that sum, average, or count values in a column are much faster.
  - Better Compression: Data in a single column is of the same type, leading to much higher compression rates, which reduces I/O and storage costs.
 
- Node Types & Clusters:
  - Leader Node: Manages client connections, parses queries, and develops the parallel query execution plan.
  - Compute Nodes: Execute the compiled query plans and perform the actual data processing, storing data locally in columnar format.
  - RA3 Instances (with Managed Storage): The modern standard. Redshift automatically manages data storage in Amazon S3, effectively decoupling storage from compute: you pay only for the compute you provision and the storage you actually use.
  - DC2 Instances (with Local Storage): Dense compute nodes with storage attached directly to the node. Good for performance-intensive workloads that need the lowest latency.
 
- Data Distribution Styles: To leverage its MPP architecture, you can choose how data is distributed across compute nodes:
  - EVEN: The default. Data is distributed round-robin; good for staging tables.
  - KEY: Rows with the same distribution key are stored on the same node. Ideal for large fact tables joined with other tables.
  - ALL: A full copy of the entire table is stored on every node. Perfect for small, frequently joined dimension tables.
 
- Data Sorting (Sort Keys): You can define one or more columns as a sort key. Physically sorting data on disk allows Redshift to use zone maps to skip reading large chunks of data that don’t satisfy a query’s filters, dramatically improving performance.
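The distribution styles and sort keys above can be sketched as DDL. All table and column names here (staging_events, sales, dim_region, and so on) are hypothetical:

```sql
-- EVEN (default): round-robin distribution, good for staging tables
CREATE TABLE staging_events (
    event_id BIGINT,
    payload  VARCHAR(4096)
) DISTSTYLE EVEN;

-- KEY: rows sharing customer_id land on the same node, so joins
-- on customer_id avoid cross-node data shuffling
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
) DISTSTYLE KEY DISTKEY (customer_id)
  COMPOUND SORTKEY (sale_date);   -- physically orders blocks by sale_date

-- ALL: a full copy on every node, for small dimension tables
CREATE TABLE dim_region (
    region_id   INT,
    region_name VARCHAR(64)
) DISTSTYLE ALL;
```

With the sort key in place, a range filter such as `WHERE sale_date >= '2024-01-01'` lets Redshift consult zone maps and skip every block whose min/max values fall outside the range.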
2. Performance & Optimization Features
Features designed to make queries run as fast as possible.
- Result Caching: Redshift caches the results of every query. If an identical query is submitted and the underlying data hasn’t changed, it returns the result from cache instantly (sub-millisecond response).
- Concurrency Scaling: Automatically and elastically adds additional transient compute capacity to handle a sudden, unpredictable spike in concurrent queries. You only pay for the extra capacity used during these spikes. This ensures consistent performance for all users.
- Automatic Table Optimization: Auto Vacuum and Auto Analyze automatically reclaim space and refresh table statistics after data manipulation operations (such as DELETE and UPDATE), and Redshift can automatically choose and adjust distribution and sort keys for you. More recently, the query engine can even re-optimize certain query plans mid-execution based on runtime statistics.
- Materialized Views: Pre-compute and store the result of a complex query. Subsequent queries that can use the materialized view are orders of magnitude faster. They can be configured to refresh automatically.
- Short Query Acceleration (SQA): Prioritizes short-running queries in a dedicated queue, preventing them from getting stuck behind long-running, resource-intensive queries. This improves the user experience for dashboarding and interactive analytics.
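Result caching and materialized views can both be exercised from plain SQL. The table and view names below are hypothetical:

```sql
-- Result caching: an identical repeat of this query (with unchanged
-- underlying data) is answered from the result cache.
SELECT sale_date, SUM(amount)
FROM sales
GROUP BY sale_date;

-- The cache can be disabled per session, e.g. for benchmarking:
SET enable_result_cache_for_session TO off;

-- Materialized view: pre-compute the aggregation and keep it fresh
CREATE MATERIALIZED VIEW mv_daily_revenue
AUTO REFRESH YES
AS
SELECT sale_date, SUM(amount) AS revenue
FROM sales
GROUP BY sale_date;

-- A manual refresh is also available:
REFRESH MATERIALIZED VIEW mv_daily_revenue;
```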
3. Management, Security, & Operations
Features that make Redshift easier to manage and secure.
- Fully Managed: AWS handles provisioning, patching, backups, and failure recovery. You don’t need to manage the underlying infrastructure.
- Automated Backups & Snapshots: Redshift automatically takes incremental snapshots of your data every 8 hours or 5GB of data change (whichever comes first) and retains them for 1 day by default. You can also create manual snapshots and configure cross-region or cross-account copying for disaster recovery.
- Security:
  - Encryption at Rest: Data is encrypted by default using AES-256. Keys can be managed by AWS KMS or by you (using AWS CloudHSM).
  - Encryption in Transit: SSL/TLS encrypts data in transit.
  - Network Isolation: Deploy within an Amazon VPC to control network access.
  - Fine-Grained Access Control: Integrates with AWS IAM for cluster-level security. Use SQL GRANT/REVOKE commands, or use Redshift Spectrum and Lake Formation for fine-grained column-level and row-level security on data in S3.
 
- Redshift Serverless: A fully auto-scaling option where you don’t manage nodes or clusters at all. You specify a base and maximum capacity in Redshift Processing Units (RPUs), and Redshift automatically provisions and scales compute to meet your performance needs. You pay only for the compute used while queries are running.
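The GRANT/REVOKE access control mentioned above looks like this in practice; the user, group, and table names are hypothetical:

```sql
-- Table-level access for a group of analysts
GRANT SELECT ON sales TO GROUP analysts;

-- Column-level access: bi_user may read only these two columns
GRANT SELECT (customer_id, amount) ON sales TO bi_user;

-- Revoke everything from a decommissioned user
REVOKE ALL ON sales FROM reporting_user;
```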
4. Data Lake & Machine Learning Integration
Modern features that extend Redshift beyond the traditional data warehouse.
- Redshift Spectrum: Arguably one of its most powerful features. Allows you to run SQL queries directly against exabytes of structured and semi-structured data (e.g., JSON, Parquet, ORC) in Amazon S3 without needing to load it into Redshift. You pay only for the data scanned by each query. This enables a "lakehouse" architecture.
- Federated Query: Extends the concept of Spectrum. Allows you to query and join live data across your Redshift cluster, your S3 data lake, and operational databases like Amazon Aurora/PostgreSQL and Amazon RDS for PostgreSQL.
- Machine Learning: Enables you to train and deploy machine learning models using familiar SQL commands. You can create models for forecasting, anomaly detection, or customer segmentation directly on your data within Redshift.
- Data Sharing: Securely share live, transactional data in read-only fashion across different Redshift clusters (in the same account or different AWS accounts) without needing to copy or move it. This is ideal for sharing data with business units, partners, or customers.
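A sketch of how Spectrum and Federated Query are wired up. The schema names, S3 path, database host, and IAM role/secret ARNs below are all placeholders:

```sql
-- Redshift Spectrum: register an external schema backed by the Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a table over Parquet files in S3; the data stays in the lake
CREATE EXTERNAL TABLE spectrum_demo.clicks (
    user_id INT,
    url     VARCHAR(2048),
    ts      TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clicks/';

-- Federated Query: attach a live Aurora/RDS PostgreSQL database
CREATE EXTERNAL SCHEMA pg_orders
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-db.example.internal' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/FederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';

-- Join S3 data, operational data, and local Redshift tables in one query
SELECT c.url, o.order_id, s.amount
FROM spectrum_demo.clicks c
JOIN pg_orders.orders o ON o.user_id  = c.user_id
JOIN sales s            ON s.order_id = o.order_id;
```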
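Redshift ML and data sharing are likewise driven by SQL. The model, function, share, and table names, the role ARN, and the namespace GUIDs below are placeholders:

```sql
-- Redshift ML: train a model with SQL (training runs on SageMaker
-- Autopilot behind the scenes)
CREATE MODEL churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned FROM customers)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-ml-artifacts');

-- Once trained, the model is callable as an ordinary SQL function
SELECT customer_id, predict_churn(age, tenure_months, monthly_spend)
FROM customers;

-- Data sharing (producer side): expose live tables read-only
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'consumer-namespace-guid';

-- Consumer side: mount the share as a database; no data is copied
CREATE DATABASE sales_from_producer
FROM DATASHARE sales_share OF NAMESPACE 'producer-namespace-guid';
```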
Summary Table of Key Features
| Feature Category | Key Features | Benefit | 
|---|---|---|
| Core Architecture | MPP, Columnar Storage, RA3/DC2 Nodes | Foundational speed and scalability for large datasets. | 
| Performance | Sort/Distribution Keys, Result Caching, Concurrency Scaling, SQA | Optimized query execution and consistent performance for many users. | 
| Management | Automated Backups/Snapshots, Fully Managed | Reduces operational overhead and ensures data durability. | 
| Security | Encryption (At-Rest/In-Transit), VPC, IAM Integration | Enterprise-grade security and compliance. | 
| Ecosystem | Redshift Spectrum, Federated Query, Data Sharing | Breaks down data silos; queries data anywhere (S3, RDS). | 
| Advanced Analytics | Machine Learning (SQL), Materialized Views | Enables predictive analytics and pre-computes complex results. |