Amazon Redshift

Amazon Redshift — AWS's managed cloud data warehouse for OLAP workloads. Covers architecture, cluster types, Redshift Serverless, Spectrum, and cross-cloud equivalents.

Overview

Amazon Redshift is AWS's fully managed, petabyte-scale cloud data warehouse optimized for Online Analytical Processing (OLAP) — complex queries over large, historical datasets. It is based on a columnar storage model and uses Massively Parallel Processing (MPP) to distribute and execute queries across multiple nodes simultaneously.

When to choose Redshift over other databases:

Scenario	Service
Real-time OLTP (orders, transactions)	Amazon RDS / Aurora
Ad-hoc, large analytical queries over structured data	Amazon Redshift
Querying data directly in S3 without loading	Redshift Spectrum (see below) or Amazon Athena
Streaming analytics	Amazon Kinesis + Redshift or Apache Flink

SAA/SAP Tip: Redshift is the default answer for any scenario involving a data warehouse, business intelligence (BI), or reporting over large volumes of historical data.

Architecture

Columnar Storage

Unlike row-based databases (OLTP), Redshift stores data by column, not by row:

Only columns needed for a query are read from disk → less I/O for analytical queries
High compression ratios — similar data types in a column compress well
Enables vectorized query execution

Massively Parallel Processing (MPP)

A Redshift cluster consists of:

Node Type	Role
Leader Node	Parses queries, builds execution plan, coordinates worker nodes, returns results to client
Compute Nodes	Execute query plan slices in parallel; each node has dedicated CPU, memory, and storage

Data is distributed across compute node slices using a distribution key — the column that determines which node stores each row.

Distribution Styles

Style	How Data is Distributed	Best For
`AUTO`	Redshift decides (default)	General use when unsure
`EVEN`	Round-robin across all nodes	Tables without a clear join key
`KEY`	Rows with the same key value → same node	Large fact tables frequently joined on a specific column
`ALL`	Full copy of the table on every node	Small dimension tables frequently joined to large fact tables

Exam Trap: Choosing the wrong distribution style is a common cause of data skew — where one node is overloaded while others sit idle. This is a frequent exam scenario analysis question.

Cluster Types

RA3 Nodes (Current Generation — Recommended)

Separates compute from storage using Redshift Managed Storage (RMS)
Storage scales automatically in S3 without resizing the cluster
Supports Concurrency Scaling — automatically adds transient clusters to handle query bursts

DC2 Nodes (Dense Compute — Legacy)

High-performance SSD-based nodes
Fixed compute + storage ratio — scaling storage requires resizing the cluster
Being superseded by RA3 for new workloads

Redshift Serverless

A serverless option where AWS automatically provisions and scales capacity based on workload demand.

No cluster management — no node type selection, no resizing
Billed per Redshift Processing Unit (RPU) per second
Ideal for: irregular workloads, development/test environments, or teams that want to avoid cluster management

SAA/SAP Tip: If a scenario mentions "no infrastructure management" or "variable analytical workloads", Redshift Serverless is a valid answer alongside Aurora Serverless. Distinguish by workload: Redshift = OLAP, Aurora = OLTP.

Key Features

Redshift Spectrum

Query data directly in Amazon S3 without loading it into Redshift. Spectrum uses external tables pointing at S3 object prefixes — useful for combining warm data (Redshift cluster) with cold/archived data (S3).

-- Example: query a Parquet file in S3 via Spectrum
SELECT * FROM spectrum_schema.my_external_table
WHERE year = 2024;

Materialized Views

Pre-computed query results stored and incrementally refreshed — dramatically speeds up frequently run, complex aggregations.

Concurrency Scaling

Automatically adds transient clusters to handle bursts of concurrent queries, keeping query latency consistent. Only available with RA3 nodes.

Share live data across Redshift clusters (or accounts) without copying. Enables a hub-and-spoke analytics pattern where a central producer cluster shares data with multiple consumer clusters.

Loading Data into Redshift

Method	Description	Best For
`COPY` command	Bulk load from S3, DynamoDB, EMR, or remote hosts via SSH	Best performance for large loads
AWS Glue	ETL (Extract, Transform, Load) service with visual editor	Complex transformations before loading
Amazon Kinesis Data Firehose	Stream data directly to Redshift	Near-real-time streaming ingestion
Federated Query	Query live data in RDS/Aurora without loading	Ad-hoc cross-database joins

Exam Trap: Never use INSERT ... VALUES for bulk loading into Redshift — it is extremely slow. The COPY command is the correct, optimized bulk load mechanism.

SAA/SAP Exam Tips

SAA Tip: Redshift is single-AZ by default. For HA, enable the Multi-AZ deployment option (available for RA3) which maintains a standby in a second AZ with automatic failover.

SAP Tip: For a scenario where an application needs to query both live transactional data (RDS) and historical warehouse data (Redshift) in a single query, the answer is Redshift Federated Query.

Cross-Cloud Equivalents

Provider	Service / Solution	Notes
AWS	Amazon Redshift	Baseline
Azure	Azure Synapse Analytics	Combines data warehouse + Spark analytics; tighter integration with Power BI
GCP	BigQuery	Fully serverless; separates storage and compute by default; strong SQL support
On-Premises	Apache Hive / Teradata / Greenplum	Hive (OSS, Hadoop-based); Teradata/Greenplum for enterprise MPP warehouses

Pricing Model

Provisioned: per node-hour (RA3 or DC2)
Serverless: per RPU-second consumed
Redshift Spectrum: per TB of data scanned in S3
Backup storage (beyond free tier of cluster size) billed per GB-month

Database Performance Fundamentals — OLAP vs OLTP explained
Amazon S3 — Redshift Spectrum queries data directly from S3
Amazon DynamoDB and DocumentDB — NoSQL alternatives
AWS Glue — serverless ETL for data pipelines into Redshift
Amazon Athena — serverless SQL queries on S3; simpler alternative to Spectrum for ad-hoc queries
Amazon Kinesis and Managed Flink — streaming ingestion and real-time processing pipelines into Redshift

Amazon Redshift

On this page