Xoxoftware - XOXO Creative Studio | Web & Mobile App Development | Fred Cheung | Hong Kong
AWSAnalytics

Amazon Redshift

Amazon Redshift — AWS's managed cloud data warehouse for OLAP workloads. Covers architecture, cluster types, Redshift Serverless, Spectrum, and cross-cloud equivalents.

Overview

Amazon Redshift is AWS's fully managed, petabyte-scale cloud data warehouse optimized for Online Analytical Processing (OLAP) — complex queries over large, historical datasets. It is based on a columnar storage model and uses Massively Parallel Processing (MPP) to distribute and execute queries across multiple nodes simultaneously.

When to choose Redshift over other databases:

ScenarioService
Real-time OLTP (orders, transactions)Amazon RDS / Aurora
Ad-hoc, large analytical queries over structured dataAmazon Redshift
Querying data directly in S3 without loadingRedshift Spectrum (see below) or Amazon Athena
Streaming analyticsAmazon Kinesis + Redshift or Apache Flink

SAA/SAP Tip: Redshift is the default answer for any scenario involving a data warehouse, business intelligence (BI), or reporting over large volumes of historical data.


Architecture

Columnar Storage

Unlike row-based databases (OLTP), Redshift stores data by column, not by row:

  • Only columns needed for a query are read from disk → less I/O for analytical queries
  • High compression ratios — similar data types in a column compress well
  • Enables vectorized query execution

Massively Parallel Processing (MPP)

A Redshift cluster consists of:

Node TypeRole
Leader NodeParses queries, builds execution plan, coordinates worker nodes, returns results to client
Compute NodesExecute query plan slices in parallel; each node has dedicated CPU, memory, and storage

Data is distributed across compute node slices using a distribution key — the column that determines which node stores each row.

Distribution Styles

StyleHow Data is DistributedBest For
AUTORedshift decides (default)General use when unsure
EVENRound-robin across all nodesTables without a clear join key
KEYRows with the same key value → same nodeLarge fact tables frequently joined on a specific column
ALLFull copy of the table on every nodeSmall dimension tables frequently joined to large fact tables

Exam Trap: Choosing the wrong distribution style is a common cause of data skew — where one node is overloaded while others sit idle. This is a frequent exam scenario analysis question.


Cluster Types

  • Separates compute from storage using Redshift Managed Storage (RMS)
  • Storage scales automatically in S3 without resizing the cluster
  • Supports Concurrency Scaling — automatically adds transient clusters to handle query bursts

DC2 Nodes (Dense Compute — Legacy)

  • High-performance SSD-based nodes
  • Fixed compute + storage ratio — scaling storage requires resizing the cluster
  • Being superseded by RA3 for new workloads

Redshift Serverless

A serverless option where AWS automatically provisions and scales capacity based on workload demand.

  • No cluster management — no node type selection, no resizing
  • Billed per Redshift Processing Unit (RPU) per second
  • Ideal for: irregular workloads, development/test environments, or teams that want to avoid cluster management

SAA/SAP Tip: If a scenario mentions "no infrastructure management" or "variable analytical workloads", Redshift Serverless is a valid answer alongside Aurora Serverless. Distinguish by workload: Redshift = OLAP, Aurora = OLTP.


Key Features

Redshift Spectrum

Query data directly in Amazon S3 without loading it into Redshift. Spectrum uses external tables pointing at S3 object prefixes — useful for combining warm data (Redshift cluster) with cold/archived data (S3).

-- Example: query a Parquet file in S3 via Spectrum
SELECT * FROM spectrum_schema.my_external_table
WHERE year = 2024;

Materialized Views

Pre-computed query results stored and incrementally refreshed — dramatically speeds up frequently run, complex aggregations.

Concurrency Scaling

Automatically adds transient clusters to handle bursts of concurrent queries, keeping query latency consistent. Only available with RA3 nodes.

Data Sharing

Share live data across Redshift clusters (or accounts) without copying. Enables a hub-and-spoke analytics pattern where a central producer cluster shares data with multiple consumer clusters.


Loading Data into Redshift

MethodDescriptionBest For
COPY commandBulk load from S3, DynamoDB, EMR, or remote hosts via SSHBest performance for large loads
AWS GlueETL (Extract, Transform, Load) service with visual editorComplex transformations before loading
Amazon Kinesis Data FirehoseStream data directly to RedshiftNear-real-time streaming ingestion
Federated QueryQuery live data in RDS/Aurora without loadingAd-hoc cross-database joins

Exam Trap: Never use INSERT ... VALUES for bulk loading into Redshift — it is extremely slow. The COPY command is the correct, optimized bulk load mechanism.


SAA/SAP Exam Tips

SAA Tip: Redshift is single-AZ by default. For HA, enable the Multi-AZ deployment option (available for RA3) which maintains a standby in a second AZ with automatic failover.

SAP Tip: For a scenario where an application needs to query both live transactional data (RDS) and historical warehouse data (Redshift) in a single query, the answer is Redshift Federated Query.


Cross-Cloud Equivalents

ProviderService / SolutionNotes
AWSAmazon RedshiftBaseline
AzureAzure Synapse AnalyticsCombines data warehouse + Spark analytics; tighter integration with Power BI
GCPBigQueryFully serverless; separates storage and compute by default; strong SQL support
On-PremisesApache Hive / Teradata / GreenplumHive (OSS, Hadoop-based); Teradata/Greenplum for enterprise MPP warehouses

Pricing Model

  • Provisioned: per node-hour (RA3 or DC2)
  • Serverless: per RPU-second consumed
  • Redshift Spectrum: per TB of data scanned in S3
  • Backup storage (beyond free tier of cluster size) billed per GB-month

Built by Fred Cheung @CookedRicer · Powered by Fumadocs & Github Copilot

On this page