Study Notes
AWSAnalytics

Amazon Athena

Amazon Athena — serverless, interactive SQL query service that queries data directly in S3. Covers use cases, Athena vs Redshift Spectrum, federated queries, and cross-cloud equivalents.

Overview

Amazon Athena is AWS's serverless, interactive SQL query service — it reads files directly in Amazon S3 and runs standard SQL queries without provisioning any infrastructure or loading data.

The simplest mental model: "SQL directly on the data lake." There is no cluster to manage, no ETL (Extract, Transform, Load) pipeline to build first, and no running costs when idle — billing is per query (per TB of data scanned).


How It Works

Data files in S3 (CSV, Parquet, JSON, ORC, Avro...)


  AWS Glue Data Catalog  ←─ defines table schemas (column names, types, partitions)


    Amazon Athena  ←─ SQL queries run here


    Query results  →  S3 output bucket  (or displayed in console / BI tool)
  1. Data files are stored in S3 (any format — CSV, Parquet, JSON, etc.)
  2. A table schema is defined in the AWS Glue Data Catalog (or Athena's built-in catalog)
  3. SQL is executed directly against the S3 files using distributed compute
  4. Results land in an S3 output bucket; Athena also integrates with BI tools (QuickSight, Tableau) via JDBC/ODBC

No server or cluster is involved. Cold start to first query result: seconds.


Key Properties

PropertyDetail
EngineBased on Presto (distributed SQL engine) + Apache Spark (for notebooks)
Data formatsCSV, TSV, JSON, Parquet, ORC, Avro, Apache Iceberg tables
Schema on readNo ETL needed — schema is applied at query time, not when data is stored
ServerlessNo clusters, no capacity planning; scales automatically
Pricing$5 per TB of data scanned (compressed, columnar formats like Parquet significantly reduce cost)

SAA/SAP Tip: Store data in Parquet or ORC format and use partitioning (e.g. by date) to reduce data scanned — this is the primary cost and performance optimization for Athena. A well-partitioned Parquet table can reduce scan cost by 10–100×.


Athena vs. Redshift Spectrum — What's the Difference?

Both query data in S3 using SQL. The key distinction is whether an existing Redshift cluster is available:

Amazon AthenaRedshift Spectrum
Requires Redshift cluster?No — fully standaloneYes — Spectrum is a Redshift feature
SetupPoint at S3, define schema, queryRequires existing Redshift cluster + external table definition
Cost modelPer TB scannedPer TB scanned (Spectrum charge) + Redshift cluster hours
Best forAd-hoc S3 querying; no existing RedshiftCombining cold S3 data with warm Redshift warehouse data in one query
PerformanceGood for ad-hoc; slower for complex joins across huge datasetsBetter for JOINs between S3 data and Redshift tables
Typical useData exploration, log analysis, one-off analyticsExtending a data warehouse with archived/cold data in S3

SAA/SAP Tip: No existing warehouse and just need ad-hoc S3 queries → Athena. Already using Redshift and need to query cold S3 data alongside warehouse data in the same query → Redshift Spectrum.


Athena Federated Query

Athena can query data sources beyond S3 — RDS, Aurora, DynamoDB, DocumentDB, Redis, and others — using Lambda-based data source connectors.

This allows a single SQL query to JOIN data across multiple sources:

-- Join S3 data lake with live RDS transactional data
SELECT s3.product_id, s3.revenue, rds.product_name
FROM s3_sales_data.sales s3
JOIN rds_connector.products rds ON s3.product_id = rds.id
WHERE s3.year = 2024;

Common Use Cases

Use CaseWhy Athena
Log analysisQuery CloudTrail, ALB access logs, VPC Flow Logs stored in S3 — no pipeline needed
Data lake explorationExplore raw data before deciding whether to build a full warehouse
Ad-hoc business queriesOne-off analytical questions over historical CSV/Parquet dumps
Security investigationQuery CloudTrail audit logs or GuardDuty findings directly
Cost optimizationReplace a Redshift cluster when queries are infrequent (pay per query vs. per hour)

Exam Trap: Athena is not suited for high-concurrency, sub-second dashboard queries over large datasets. For that, use Redshift (materialized views, concurrency scaling) or a purpose-built OLAP service. Athena's strength is flexibility and zero infrastructure — not raw query speed.


SAA/SAP Exam Tips

SAA/SAP Tip: The canonical Athena exam scenario: "analyze Application Load Balancer (ALB) / CloudTrail / VPC Flow Logs stored in S3 with minimum operational overhead" → Amazon Athena.

SAP Tip: When a scenario requires querying data across S3, RDS, and DynamoDB in a single SQL statement without moving data → Athena Federated Query.


Cross-Cloud Equivalents

ProviderService / SolutionNotes
AWSAmazon AthenaBaseline serverless S3 SQL
AzureAzure Synapse Analytics (Serverless SQL Pool)Synapse's serverless tier queries Azure Data Lake Storage similarly
GCPBigQueryFully serverless; closer in model to Athena than Redshift; per-query pricing option
On-PremisesApache Presto / TrinoAthena is built on Presto; Trino is the open-source successor

Pricing Model

  • $5.00 per TB of data scanned
  • Cancelled or failed queries are not charged
  • DDL statements (CREATE TABLE, DROP TABLE) are free
  • Cost reduction strategies: use Parquet/ORC (columnar, compressed), partition data by common filter columns (date, region), use partition projection

  • Amazon Redshift — full data warehouse; use Spectrum for combined S3 + warehouse queries
  • Amazon S3 — Athena's primary data source
  • AWS Glue — catalog and ETL; Glue Data Catalog is the metadata store Athena uses
  • Amazon QuickSight — AWS's BI visualization tool; connects to Athena for dashboards

On this page