Amazon Redshift
Amazon Redshift — AWS's managed cloud data warehouse for OLAP workloads. Covers architecture, cluster types, Redshift Serverless, Spectrum, and cross-cloud equivalents.
Overview
Amazon Redshift is AWS's fully managed, petabyte-scale cloud data warehouse optimized for Online Analytical Processing (OLAP) — complex queries over large, historical datasets. It is based on a columnar storage model and uses Massively Parallel Processing (MPP) to distribute and execute queries across multiple nodes simultaneously.
When to choose Redshift over other databases:
| Scenario | Service |
|---|---|
| Real-time OLTP (orders, transactions) | Amazon RDS / Aurora |
| Ad-hoc, large analytical queries over structured data | Amazon Redshift |
| Querying data directly in S3 without loading | Redshift Spectrum (see below) or Amazon Athena |
| Streaming analytics | Amazon Kinesis + Redshift or Apache Flink |
SAA/SAP Tip: Redshift is the default answer for any scenario involving a data warehouse, business intelligence (BI), or reporting over large volumes of historical data.
Architecture
Columnar Storage
Unlike row-based databases (OLTP), Redshift stores data by column, not by row:
- Only columns needed for a query are read from disk → less I/O for analytical queries
- High compression ratios — similar data types in a column compress well
- Enables vectorized query execution
Massively Parallel Processing (MPP)
A Redshift cluster consists of:
| Node Type | Role |
|---|---|
| Leader Node | Parses queries, builds execution plan, coordinates worker nodes, returns results to client |
| Compute Nodes | Execute query plan slices in parallel; each node has dedicated CPU, memory, and storage |
Data is distributed across compute node slices using a distribution key — the column that determines which node stores each row.
Distribution Styles
| Style | How Data is Distributed | Best For |
|---|---|---|
AUTO | Redshift decides (default) | General use when unsure |
EVEN | Round-robin across all nodes | Tables without a clear join key |
KEY | Rows with the same key value → same node | Large fact tables frequently joined on a specific column |
ALL | Full copy of the table on every node | Small dimension tables frequently joined to large fact tables |
Exam Trap: Choosing the wrong distribution style is a common cause of data skew — where one node is overloaded while others sit idle. This is a frequent exam scenario analysis question.
Cluster Types
RA3 Nodes (Current Generation — Recommended)
- Separates compute from storage using Redshift Managed Storage (RMS)
- Storage scales automatically in S3 without resizing the cluster
- Supports Concurrency Scaling — automatically adds transient clusters to handle query bursts
DC2 Nodes (Dense Compute — Legacy)
- High-performance SSD-based nodes
- Fixed compute + storage ratio — scaling storage requires resizing the cluster
- Being superseded by RA3 for new workloads
Redshift Serverless
A serverless option where AWS automatically provisions and scales capacity based on workload demand.
- No cluster management — no node type selection, no resizing
- Billed per Redshift Processing Unit (RPU) per second
- Ideal for: irregular workloads, development/test environments, or teams that want to avoid cluster management
SAA/SAP Tip: If a scenario mentions "no infrastructure management" or "variable analytical workloads", Redshift Serverless is a valid answer alongside Aurora Serverless. Distinguish by workload: Redshift = OLAP, Aurora = OLTP.
Key Features
Redshift Spectrum
Query data directly in Amazon S3 without loading it into Redshift. Spectrum uses external tables pointing at S3 object prefixes — useful for combining warm data (Redshift cluster) with cold/archived data (S3).
-- Example: query a Parquet file in S3 via Spectrum
SELECT * FROM spectrum_schema.my_external_table
WHERE year = 2024;Materialized Views
Pre-computed query results stored and incrementally refreshed — dramatically speeds up frequently run, complex aggregations.
Concurrency Scaling
Automatically adds transient clusters to handle bursts of concurrent queries, keeping query latency consistent. Only available with RA3 nodes.
Data Sharing
Share live data across Redshift clusters (or accounts) without copying. Enables a hub-and-spoke analytics pattern where a central producer cluster shares data with multiple consumer clusters.
Loading Data into Redshift
| Method | Description | Best For |
|---|---|---|
COPY command | Bulk load from S3, DynamoDB, EMR, or remote hosts via SSH | Best performance for large loads |
| AWS Glue | ETL (Extract, Transform, Load) service with visual editor | Complex transformations before loading |
| Amazon Kinesis Data Firehose | Stream data directly to Redshift | Near-real-time streaming ingestion |
| Federated Query | Query live data in RDS/Aurora without loading | Ad-hoc cross-database joins |
Exam Trap: Never use INSERT ... VALUES for bulk loading into Redshift —
it is extremely slow. The COPY command is the correct, optimized bulk load
mechanism.
SAA/SAP Exam Tips
SAA Tip: Redshift is single-AZ by default. For HA, enable the Multi-AZ deployment option (available for RA3) which maintains a standby in a second AZ with automatic failover.
SAP Tip: For a scenario where an application needs to query both live transactional data (RDS) and historical warehouse data (Redshift) in a single query, the answer is Redshift Federated Query.
Cross-Cloud Equivalents
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Redshift | Baseline |
| Azure | Azure Synapse Analytics | Combines data warehouse + Spark analytics; tighter integration with Power BI |
| GCP | BigQuery | Fully serverless; separates storage and compute by default; strong SQL support |
| On-Premises | Apache Hive / Teradata / Greenplum | Hive (OSS, Hadoop-based); Teradata/Greenplum for enterprise MPP warehouses |
Pricing Model
- Provisioned: per node-hour (RA3 or DC2)
- Serverless: per RPU-second consumed
- Redshift Spectrum: per TB of data scanned in S3
- Backup storage (beyond free tier of cluster size) billed per GB-month
Related Services / See Also
- Database Performance Fundamentals — OLAP vs OLTP explained
- Amazon S3 — Redshift Spectrum queries data directly from S3
- Amazon DynamoDB and DocumentDB — NoSQL alternatives
- AWS Glue — serverless ETL for data pipelines into Redshift
- Amazon Athena — serverless SQL queries on S3; simpler alternative to Spectrum for ad-hoc queries
- Amazon Kinesis and Managed Flink — streaming ingestion and real-time processing pipelines into Redshift
Amazon Kinesis and Managed Flink
AWS streaming data services — Amazon Kinesis (Data Streams, Firehose, Video Streams) and Amazon Managed Service for Apache Flink. Covers real-time ingestion, processing patterns, and cross-cloud equivalents.
AWS Auto Scaling
Automatic capacity management — target tracking, step, scheduled, and predictive scaling policies for EC2, ECS, DynamoDB, and more.