Amazon Athena

Amazon Athena — serverless, interactive SQL query service that queries data directly in S3. Covers use cases, Athena vs Redshift Spectrum, federated queries, and cross-cloud equivalents.

Overview

Amazon Athena is AWS's serverless, interactive SQL query service — it reads files directly in Amazon S3 and runs standard SQL queries without provisioning any infrastructure or loading data.

The simplest mental model: "SQL directly on the data lake." There is no cluster to manage, no ETL (Extract, Transform, Load) pipeline to build first, and no running costs when idle — billing is per query (per TB of data scanned).

How It Works

Data files in S3 (CSV, Parquet, JSON, ORC, Avro...)
          │
          ▼
  AWS Glue Data Catalog  ←─ defines table schemas (column names, types, partitions)
          │
          ▼
    Amazon Athena  ←─ SQL queries run here
          │
          ▼
    Query results  →  S3 output bucket  (or displayed in console / BI tool)

Data files are stored in S3 (any format — CSV, Parquet, JSON, etc.)
A table schema is defined in the AWS Glue Data Catalog (or Athena's built-in catalog)
SQL is executed directly against the S3 files using distributed compute
Results land in an S3 output bucket; Athena also integrates with BI tools (QuickSight, Tableau) via JDBC/ODBC

No server or cluster is involved. Cold start to first query result: seconds.

Key Properties

Property	Detail
Engine	Based on Presto (distributed SQL engine) + Apache Spark (for notebooks)
Data formats	CSV, TSV, JSON, Parquet, ORC, Avro, Apache Iceberg tables
Schema on read	No ETL needed — schema is applied at query time, not when data is stored
Serverless	No clusters, no capacity planning; scales automatically
Pricing	$5 per TB of data scanned (compressed, columnar formats like Parquet significantly reduce cost)

SAA/SAP Tip: Store data in Parquet or ORC format and use partitioning (e.g. by date) to reduce data scanned — this is the primary cost and performance optimization for Athena. A well-partitioned Parquet table can reduce scan cost by 10–100×.

Athena vs. Redshift Spectrum — What's the Difference?

Both query data in S3 using SQL. The key distinction is whether an existing Redshift cluster is available:

	Amazon Athena	Redshift Spectrum
Requires Redshift cluster?	No — fully standalone	Yes — Spectrum is a Redshift feature
Setup	Point at S3, define schema, query	Requires existing Redshift cluster + external table definition
Cost model	Per TB scanned	Per TB scanned (Spectrum charge) + Redshift cluster hours
Best for	Ad-hoc S3 querying; no existing Redshift	Combining cold S3 data with warm Redshift warehouse data in one query
Performance	Good for ad-hoc; slower for complex joins across huge datasets	Better for JOINs between S3 data and Redshift tables
Typical use	Data exploration, log analysis, one-off analytics	Extending a data warehouse with archived/cold data in S3

SAA/SAP Tip: No existing warehouse and just need ad-hoc S3 queries → Athena. Already using Redshift and need to query cold S3 data alongside warehouse data in the same query → Redshift Spectrum.

Athena Federated Query

Athena can query data sources beyond S3 — RDS, Aurora, DynamoDB, DocumentDB, Redis, and others — using Lambda-based data source connectors.

This allows a single SQL query to JOIN data across multiple sources:

-- Join S3 data lake with live RDS transactional data
SELECT s3.product_id, s3.revenue, rds.product_name
FROM s3_sales_data.sales s3
JOIN rds_connector.products rds ON s3.product_id = rds.id
WHERE s3.year = 2024;

Common Use Cases

Use Case	Why Athena
Log analysis	Query CloudTrail, ALB access logs, VPC Flow Logs stored in S3 — no pipeline needed
Data lake exploration	Explore raw data before deciding whether to build a full warehouse
Ad-hoc business queries	One-off analytical questions over historical CSV/Parquet dumps
Security investigation	Query CloudTrail audit logs or GuardDuty findings directly
Cost optimization	Replace a Redshift cluster when queries are infrequent (pay per query vs. per hour)

Exam Trap: Athena is not suited for high-concurrency, sub-second dashboard queries over large datasets. For that, use Redshift (materialized views, concurrency scaling) or a purpose-built OLAP service. Athena's strength is flexibility and zero infrastructure — not raw query speed.

SAA/SAP Exam Tips

SAA/SAP Tip: The canonical Athena exam scenario: "analyze Application Load Balancer (ALB) / CloudTrail / VPC Flow Logs stored in S3 with minimum operational overhead" → Amazon Athena.

SAP Tip: When a scenario requires querying data across S3, RDS, and DynamoDB in a single SQL statement without moving data → Athena Federated Query.

Cross-Cloud Equivalents

Provider	Service / Solution	Notes
AWS	Amazon Athena	Baseline serverless S3 SQL
Azure	Azure Synapse Analytics (Serverless SQL Pool)	Synapse's serverless tier queries Azure Data Lake Storage similarly
GCP	BigQuery	Fully serverless; closer in model to Athena than Redshift; per-query pricing option
On-Premises	Apache Presto / Trino	Athena is built on Presto; Trino is the open-source successor

Pricing Model

$5.00 per TB of data scanned
Cancelled or failed queries are not charged
DDL statements (CREATE TABLE, DROP TABLE) are free
Cost reduction strategies: use Parquet/ORC (columnar, compressed), partition data by common filter columns (date, region), use partition projection

Amazon Redshift — full data warehouse; use Spectrum for combined S3 + warehouse queries
Amazon S3 — Athena's primary data source
AWS Glue — catalog and ETL; Glue Data Catalog is the metadata store Athena uses
Amazon QuickSight — AWS's BI visualization tool; connects to Athena for dashboards

Amazon Athena

On this page