Amazon Athena
Amazon Athena — serverless, interactive SQL query service that queries data directly in S3. Covers use cases, Athena vs Redshift Spectrum, federated queries, and cross-cloud equivalents.
Overview
Amazon Athena is AWS's serverless, interactive SQL query service — it reads files directly in Amazon S3 and runs standard SQL queries without provisioning any infrastructure or loading data.
The simplest mental model: "SQL directly on the data lake." There is no cluster to manage, no ETL (Extract, Transform, Load) pipeline to build first, and no running costs when idle — billing is per query (per TB of data scanned).
How It Works
Data files in S3 (CSV, Parquet, JSON, ORC, Avro...)
│
▼
AWS Glue Data Catalog ←─ defines table schemas (column names, types, partitions)
│
▼
Amazon Athena ←─ SQL queries run here
│
▼
Query results → S3 output bucket (or displayed in console / BI tool)

- Data files are stored in S3 (any format — CSV, Parquet, JSON, etc.)
- A table schema is defined in the AWS Glue Data Catalog (or Athena's built-in catalog)
- SQL is executed directly against the S3 files using distributed compute
- Results land in an S3 output bucket; Athena also integrates with BI tools (QuickSight, Tableau) via JDBC/ODBC
No server or cluster is involved. Cold start to first query result: seconds.
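The flow above can be sketched in Athena SQL. This is a minimal, hypothetical example — the table name, columns, and S3 location are illustrative assumptions, not from any real account:

```sql
-- Define a table over files already sitting in S3 (schema on read; nothing is loaded).
-- Table name, columns, and the S3 path are hypothetical.
CREATE EXTERNAL TABLE sales (
  product_id string,
  revenue    double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-data-lake/sales/';

-- Query the files directly; results land in the configured S3 output bucket.
SELECT product_id, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_id;
```

Note the `CREATE EXTERNAL TABLE` only writes metadata to the catalog — the CSV files themselves are untouched, which is what "schema on read" means in practice.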
Key Properties
| Property | Detail |
|---|---|
| Engine | Based on Presto (distributed SQL engine) + Apache Spark (for notebooks) |
| Data formats | CSV, TSV, JSON, Parquet, ORC, Avro, Apache Iceberg tables |
| Schema on read | No ETL needed — schema is applied at query time, not when data is stored |
| Serverless | No clusters, no capacity planning; scales automatically |
| Pricing | $5 per TB of data scanned (compressed, columnar formats like Parquet significantly reduce cost) |
SAA/SAP Tip: Store data in Parquet or ORC format and use partitioning (e.g. by date) to reduce data scanned — this is the primary cost and performance optimization for Athena. A well-partitioned Parquet table can reduce scan cost by 10–100×.
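A sketch of what that tip looks like in DDL — a partitioned Parquet table (names and paths are assumptions for illustration):

```sql
-- Hypothetical partitioned Parquet table; data is laid out in S3 as .../dt=YYYY-MM-DD/.
CREATE EXTERNAL TABLE logs (
  request_id string,
  status     int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/logs/';

-- Filtering on the partition column lets Athena skip non-matching partitions entirely,
-- so only one day's worth of Parquet is scanned (and billed).
SELECT status, COUNT(*) AS requests
FROM logs
WHERE dt = '2024-01-15'
GROUP BY status;
```

Columnar Parquet adds a second saving on top of partition pruning: Athena reads only the columns the query references, not whole rows.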
Athena vs. Redshift Spectrum — What's the Difference?
Both query data in S3 using SQL. The key distinction is whether an existing Redshift cluster is available:
| | Amazon Athena | Redshift Spectrum |
|---|---|---|
| Requires Redshift cluster? | No — fully standalone | Yes — Spectrum is a Redshift feature |
| Setup | Point at S3, define schema, query | Requires existing Redshift cluster + external table definition |
| Cost model | Per TB scanned | Per TB scanned (Spectrum charge) + Redshift cluster hours |
| Best for | Ad-hoc S3 querying; no existing Redshift | Combining cold S3 data with warm Redshift warehouse data in one query |
| Performance | Good for ad-hoc; slower for complex joins across huge datasets | Better for JOINs between S3 data and Redshift tables |
| Typical use | Data exploration, log analysis, one-off analytics | Extending a data warehouse with archived/cold data in S3 |
SAA/SAP Tip: No existing warehouse and just need ad-hoc S3 queries → Athena. Already using Redshift and need to query cold S3 data alongside warehouse data in the same query → Redshift Spectrum.
Athena Federated Query
Athena can query data sources beyond S3 — RDS, Aurora, DynamoDB, DocumentDB, Redis, and others — using Lambda-based data source connectors.
This allows a single SQL query to JOIN data across multiple sources:
```sql
-- Join S3 data lake with live RDS transactional data
SELECT s3.product_id, s3.revenue, rds.product_name
FROM s3_sales_data.sales s3
JOIN rds_connector.products rds ON s3.product_id = rds.id
WHERE s3.year = 2024;
```
Common Use Cases
| Use Case | Why Athena |
|---|---|
| Log analysis | Query CloudTrail, ALB access logs, VPC Flow Logs stored in S3 — no pipeline needed |
| Data lake exploration | Explore raw data before deciding whether to build a full warehouse |
| Ad-hoc business queries | One-off analytical questions over historical CSV/Parquet dumps |
| Security investigation | Query CloudTrail audit logs or GuardDuty findings directly |
| Cost optimization | Replace a Redshift cluster when queries are infrequent (pay per query vs. per hour) |
Exam Trap: Athena is not suited for high-concurrency, sub-second dashboard queries over large datasets. For that, use Redshift (materialized views, concurrency scaling) or a purpose-built OLAP service. Athena's strength is flexibility and zero infrastructure — not raw query speed.
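For the log-analysis use case above, a typical ad-hoc query might look like the sketch below. It assumes a `vpc_flow_logs` table has already been defined over the flow-log files in S3, with column names following the default flow-log format:

```sql
-- Hypothetical security query: top source/destination pairs with rejected traffic.
-- Assumes an existing vpc_flow_logs table over the S3 log files.
SELECT srcaddr, dstaddr, COUNT(*) AS rejected_count
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr
ORDER BY rejected_count DESC
LIMIT 20;
```

This is the "no pipeline needed" point in concrete form: the logs are queried where they already land, with nothing ingested or transformed first.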
SAA/SAP Exam Tips
SAA/SAP Tip: The canonical Athena exam scenario: "analyze Application Load Balancer (ALB) / CloudTrail / VPC Flow Logs stored in S3 with minimum operational overhead" → Amazon Athena.
SAP Tip: When a scenario requires querying data across S3, RDS, and DynamoDB in a single SQL statement without moving data → Athena Federated Query.
Cross-Cloud Equivalents
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Athena | Baseline serverless S3 SQL |
| Azure | Azure Synapse Analytics (Serverless SQL Pool) | Synapse's serverless tier queries Azure Data Lake Storage similarly |
| GCP | BigQuery | Fully serverless; closer in model to Athena than Redshift; per-query pricing option |
| On-Premises | Apache Presto / Trino | Athena is built on Presto; Trino is the open-source successor |
Pricing Model
- $5.00 per TB of data scanned
- Failed queries are not charged; cancelled queries are charged for the data scanned up to the point of cancellation
- DDL statements (CREATE TABLE, DROP TABLE) are free
- Cost reduction strategies: use Parquet/ORC (columnar, compressed), partition data by common filter columns (date, region), use partition projection
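Partition projection, the last strategy listed, is configured through table properties rather than catalog entries. A hedged sketch (table name, columns, and S3 layout are assumptions; the `projection.*` properties are Athena's partition-projection settings):

```sql
-- Hypothetical table using partition projection: Athena computes partition values
-- from these properties instead of looking them up in the Glue Data Catalog,
-- so no MSCK REPAIR TABLE / ALTER TABLE ADD PARTITION maintenance is needed.
CREATE EXTERNAL TABLE events (
  event_id string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2023-01-01,NOW',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-data-lake/events/${dt}/'
);
```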
Related Services / See Also
- Amazon Redshift — full data warehouse; use Spectrum for combined S3 + warehouse queries
- Amazon S3 — Athena's primary data source
- AWS Glue — catalog and ETL; Glue Data Catalog is the metadata store Athena uses
- Amazon QuickSight — AWS's BI visualization tool; connects to Athena for dashboards