AWS Glue
AWS Glue — serverless ETL and data cataloging service. Covers Glue Data Catalog, Crawlers, ETL jobs, DataBrew, and how Glue fits into the AWS data lake and analytics ecosystem.
Overview
AWS Glue is a serverless data integration service with two distinct functions — a Data Catalog (metadata repository) and an ETL engine (Apache Spark-based data transformation). The two are independent and can be used separately or together.
The simplest mental model:
"Glue Catalog" = the map → describes WHERE data is and WHAT shape it's in
"Glue ETL" = the truck → moves and transforms the data itself
How Glue Works Under the Hood
The Catalog side of Glue never copies or stores actual data. It only reads and stores metadata — descriptions of the data.
The Glue Data Catalog is a library index, not a library
Glue Data Catalog (metadata only — tiny, cheap)
├── Database: "analytics"
│   ├── Table: "orders"
│   │   ├── column: id → INT
│   │   ├── column: amount → DECIMAL
│   │   ├── column: date → STRING
│   │   ├── location: s3://my-bucket/orders/   ← a pointer, not a copy
│   │   └── format: parquet
│   └── Table: "customers"
│       └── location: s3://my-bucket/customers/
The Catalog holds the schema and the S3 location. The actual data stays exactly where it was.
When Athena runs SELECT * FROM analytics.orders:
- Athena looks up orders in the Glue Catalog → gets the S3 path and column types
- Athena goes directly to S3 to read the actual Parquet files
- Results are returned — the Glue Catalog was consulted only for the schema; it never touched the data
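The lookup step can be illustrated in miniature with plain Python dicts standing in for the Catalog — table names, paths, and the `resolve_table` helper are all hypothetical:

```python
# Toy illustration of the catalog lookup: the Catalog returns only
# metadata, and the query engine then reads S3 itself.
catalog = {
    ("analytics", "orders"): {
        "columns": {"id": "int", "amount": "decimal", "date": "string"},
        "location": "s3://my-bucket/orders/",
        "format": "parquet",
    },
}

def resolve_table(database: str, table: str) -> dict:
    """Return metadata only — no data is read or copied here."""
    return catalog[(database, table)]

meta = resolve_table("analytics", "orders")
print(meta["location"])  # the engine scans this S3 prefix directly
```

Note that the "query" never flows through the Catalog — it is a single metadata lookup followed by direct S3 reads.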
What the Crawler actually does
A Crawler does not import or replicate data. It:
- Connects to the data source (S3, RDS via JDBC, DynamoDB, etc.)
- Samples a small portion of the files — just enough rows to infer column names and types
- Writes the discovered schema as a table definition into the Glue Catalog
- Disconnects — the source data is untouched
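The sampling-and-inference step can be sketched in plain Python — the type-widening rule here is a simplification of what real crawlers (and their classifiers) do:

```python
# Minimal sketch of crawler-style schema inference: sample a handful of
# records and infer a column → type mapping. Illustrative only.
def infer_schema(sample_rows: list[dict]) -> dict:
    schema = {}
    for row in sample_rows:
        for col, value in row.items():
            if isinstance(value, int):
                inferred = "int"
            elif isinstance(value, float):
                inferred = "double"
            else:
                inferred = "string"
            # Widen to string when types conflict across sampled rows
            if schema.get(col, inferred) != inferred:
                inferred = "string"
            schema[col] = inferred
    return schema

sample = [{"id": 1, "amount": 9.99, "date": "2024-01-05"},
          {"id": 2, "amount": 12.50, "date": "2024-01-06"}]
print(infer_schema(sample))  # {'id': 'int', 'amount': 'double', 'date': 'string'}
```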
S3: s3://bucket/orders/year=2024/month=01/data.parquet
│
│ Crawler reads ~100 rows to infer schema
│ (does NOT copy the file)
▼
Glue Catalog: table "orders" { id INT, amount DECIMAL, ... }
location: s3://bucket/orders/
partitions: year=2024/month=01
What Glue ETL Jobs actually do
ETL Jobs are where data does move — but only because the script explicitly reads from a source and writes to a destination. Glue:
- Spins up a managed Apache Spark cluster
- Runs the script (reads data, transforms it, writes output)
- Tears the Spark cluster down when done
The output goes wherever the script specifies — S3, Redshift, RDS. There is no "Glue storage." Glue is the execution environment, not the storage.
Glue ETL Job execution:
[Spark cluster spins up]
│
├── reads from: S3 CSV files / RDS via JDBC
├── transforms: rename columns, convert to Parquet, deduplicate
└── writes to: S3 Parquet / Redshift table
[Spark cluster tears down]
(billed only for cluster runtime)
Key mental model: The Glue Data Catalog = a library index card (tells where the book is and what it contains). Glue ETL = a courier service (actually moves things). The data = the books, always staying on the shelves (S3/database) unless a courier job explicitly moves them.
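The read → transform → write flow can be sketched in plain Python with the standard library — a real Glue job would express the same steps in Spark at far larger scale, and the column names here are invented:

```python
# Minimal extract-transform-load sketch: read CSV, deduplicate,
# rename columns, write the result. Illustrative only.
import csv
import io

raw_csv = "id,amount\n1,9.99\n1,9.99\n2,12.50\n"       # extract: source data

rows = list(csv.DictReader(io.StringIO(raw_csv)))       # read
deduped = {r["id"]: r for r in rows}.values()           # transform: dedupe on id
renamed = [{"order_id": r["id"], "total": r["amount"]} for r in deduped]

out = io.StringIO()                                     # load: write destination
writer = csv.DictWriter(out, fieldnames=["order_id", "total"])
writer.writeheader()
writer.writerows(renamed)
print(out.getvalue())
```

The point of the sketch: the script itself decides where data comes from and where it goes — there is no implicit Glue storage in the middle.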
The Two Jobs of AWS Glue
| Function | What It Does | Used By |
|---|---|---|
| Glue Data Catalog | Central metadata store: database/table names, schemas, column types, S3 locations, partition info | Athena, Redshift Spectrum, EMR, Lake Formation |
| Glue ETL Jobs | Serverless Spark scripts that extract, transform, and load data between sources | Redshift (loading), S3 (format conversion), any data pipeline |
This split is critical — many services use the Catalog without ever using Glue ETL, and Glue ETL jobs can run without relying on the Catalog.
Part 1: Glue Data Catalog
The AWS Glue Data Catalog is a managed metadata repository — essentially a database of databases. It stores:
- Databases — logical groupings of tables
- Tables — schema definitions (column names, data types, partition keys, S3 location of the actual data)
- Partitions — sub-paths within a table (e.g. s3://bucket/year=2024/month=01/)
- Connections — credentials and network config for JDBC sources (RDS, Redshift, etc.)
The Catalog itself stores only metadata — it never holds actual data. Data stays in S3 (or wherever it lives).
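Concretely, a Catalog table entry boils down to a structure like the following — shaped like the `TableInput` that the Glue `CreateTable` API accepts, with bucket name and columns as hypothetical examples:

```python
# Sketch of a Catalog table definition: schema + pointer, no data.
table_input = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "int"},
            {"Name": "amount", "Type": "decimal(10,2)"},
        ],
        "Location": "s3://my-bucket/orders/",   # pointer to the data, not a copy
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    },
    "PartitionKeys": [{"Name": "year", "Type": "string"},
                      {"Name": "month", "Type": "string"}],
}
# With boto3 this would be registered roughly as:
#   boto3.client("glue").create_table(DatabaseName="analytics", TableInput=table_input)
print(table_input["StorageDescriptor"]["Location"])
```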
Who Uses the Glue Data Catalog
| Service | How It Uses the Catalog |
|---|---|
| Amazon Athena | Reads table schemas from the Catalog to understand what SQL columns/types map to which S3 files |
| Redshift Spectrum | Creates external schemas pointing at Catalog databases; queries S3 through those schemas |
| Amazon EMR | Uses Catalog as a Hive Metastore replacement |
| AWS Lake Formation | Builds on the Catalog; adds column/row-level access control |
SAA/SAP Tip: The Glue Data Catalog is the central schema registry for the AWS data lake. When a scenario involves multiple services needing consistent table definitions (Athena + Redshift Spectrum + EMR all pointing at the same S3 data), the answer is the Glue Data Catalog.
Glue Crawlers
Glue Crawlers automate schema discovery, eliminating the need to write CREATE TABLE DDL manually:
- Configure a Crawler with a target S3 path (or JDBC database, DynamoDB table, etc.)
- The Crawler samples the data and infers the schema (column names, types, partition structure)
- The discovered table definition is written into the Glue Data Catalog automatically
- Crawlers can be scheduled to run periodically to detect new partitions or schema changes
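The configuration steps above map roughly onto the keyword arguments of the boto3 `create_crawler` call — the role ARN, paths, and schedule below are placeholders:

```python
# Hedged sketch of a crawler definition (names are hypothetical).
crawler_config = {
    "Name": "orders-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder ARN
    "DatabaseName": "analytics",
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake/orders/"}]},
    "Schedule": "cron(0 3 * * ? *)",  # daily at 03:00 UTC to pick up new partitions
}
# boto3.client("glue").create_crawler(**crawler_config)
print(crawler_config["Targets"]["S3Targets"][0]["Path"])
```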
S3 bucket: s3://my-data-lake/orders/year=2024/month=01/orders.parquet
─────────────────────────────────────
Crawler discovers this partition structure
and writes it to the Catalog automatically
Exam Trap: Crawlers detect new partitions added to S3 but do not auto-detect schema changes within existing files by default. For partitions added by Firehose or other automated pipelines, schedule a Crawler to run daily (or use partition projection in Athena to skip Crawlers entirely for predictable partition patterns).
Part 2: Glue ETL Jobs
A Glue ETL Job is a managed Apache Spark (or Python shell) script that extracts data from a source, transforms it, and loads it to a destination — without provisioning a Spark cluster.
How a Glue Job Works
Source              Glue ETL Job                          Destination
S3 (CSV/JSON) ──▶ [ Extract → Transform → Load         ] ──▶ S3 (Parquet)
RDS / Aurora  ──▶ [ Spark transforms (Python/Scala)    ] ──▶ Redshift
DynamoDB      ──▶ [ Generated or custom code           ] ──▶ S3 data lake
Common Transformations
| Transformation | Example |
|---|---|
| Format conversion | CSV → Parquet or ORC (reduces Athena scan cost by ~10×) |
| Schema mapping | Rename columns, cast types, drop nulls |
| Deduplication | Remove duplicate records before loading to warehouse |
| Enrichment / Join | Join two datasets (e.g. orders + products) before loading |
| Partitioning | Partition output by date column for efficient querying |
| Filtering | Drop rows that don't meet data quality rules |
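The partitioning transformation from the table above amounts to deriving a Hive-style partition path from a record's date column — a small sketch, with the bucket path as a hypothetical example:

```python
# Derive a year=/month= partition prefix from a date value.
from datetime import date

def partition_path(base: str, d: date) -> str:
    return f"{base}year={d.year}/month={d.month:02d}/"

print(partition_path("s3://my-bucket/orders/", date(2024, 1, 15)))
# s3://my-bucket/orders/year=2024/month=01/
```

Writing output under such prefixes is what lets Athena prune partitions and scan only the relevant slice of data.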
Job Types
| Type | Runtime | Best For |
|---|---|---|
| Spark | Distributed Apache Spark | Large-scale data transformations (GBs–TBs) |
| Spark Streaming | Spark Structured Streaming | Near-real-time ETL from Kinesis or Kafka |
| Python Shell | Single Python process | Small datasets, API calls, lightweight scripts |
| Ray | Distributed Python (Ray) | ML preprocessing, Python-native distributed workloads |
Glue Studio
A visual drag-and-drop interface for building Glue ETL jobs without writing Spark code. Source, transform, and destination nodes connect visually; Studio generates the underlying PySpark script.
DynamicFrame vs. DataFrame
Glue introduces its own data abstraction called a DynamicFrame (on top of Spark's DataFrame):
- Handles schema inconsistencies gracefully — different records can have different columns without crashing
- Useful for messy real-world data where CSV/JSON schemas vary between files
- Can be converted to a standard Spark DataFrame when needed for Spark SQL operations
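The schema-tolerance idea can be shown in miniature: records with differing columns are unified instead of failing. (DynamicFrames handle this far more sophisticatedly under the hood; this sketch only conveys the idea, and the record contents are invented.)

```python
# Unify records with inconsistent columns: take the union of all
# columns and fill the missing ones with None rather than erroring.
def unify(records: list[dict]) -> list[dict]:
    all_cols = sorted({col for r in records for col in r})
    return [{col: r.get(col) for col in all_cols} for r in records]

messy = [{"id": 1, "amount": 9.99},
         {"id": 2, "coupon": "SAVE10"}]   # extra/missing columns per record
print(unify(messy))
```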
Part 3: AWS Glue DataBrew
Glue DataBrew is a visual, no-code data preparation tool aimed at data analysts. Key capabilities:
- Preview a dataset sample in a spreadsheet-like UI
- Apply 250+ pre-built transformations (trim whitespace, normalize dates, mask personally identifiable information (PII), etc.)
- Profile data quality (detect nulls, outliers, distributions)
- Execute transformations as a job on the full S3 dataset
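A PII-masking transform of the kind DataBrew offers as a built-in can be sketched as a simple redaction rule — here a naive regex for email addresses, purely illustrative:

```python
# Redact email addresses from free text (a deliberately simple rule;
# DataBrew's PII transforms cover many more identifier types).
import re

def mask_emails(text: str) -> str:
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

print(mask_emails("contact: jane.doe@example.com"))  # contact: [REDACTED]
```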
DataBrew vs. Glue Studio:
| | Glue DataBrew | Glue Studio |
|---|---|---|
| User | Data analysts, business users | Data engineers |
| Coding | No code — visual point-and-click | Visual + generated PySpark code |
| Flexibility | Limited to built-in transforms | Full Spark — any transformation |
| Data profiling | Yes — built-in | No |
| Best for | Quick data cleaning, PII masking | Custom complex ETL pipelines |
How Glue Fits Into the Analytics Ecosystem
┌─────────────────────────────────────┐
│ Data Sources │
│ S3 │ RDS │ DynamoDB │ JDBC │
└──────────────┬──────────────────────┘
│
┌─────────▼──────────┐
│ Glue Crawler │ ← auto-discovers schema
│ (metadata only) │
└─────────┬──────────┘
│ writes to
┌─────────▼──────────┐
│ Glue Data Catalog │ ← central schema registry
└──┬──────────────┬───┘
│ │
┌─────────────▼──┐ ┌─────▼──────────────┐
│ Amazon Athena │ │ Redshift Spectrum │
│ (SQL on S3) │ │ (SQL via Redshift) │
└────────────────┘ └────────────────────┘
┌─────────────────────┐
│ Glue ETL Jobs │ ← transforms & moves data
│ (Spark / PySpark) │
└──────────┬──────────┘
│
┌────────────────▼────────────────┐
│ Destinations │
│ S3 (Parquet) │ Redshift │
│ RDS │ OpenSearch │
└─────────────────────────────────┘
SAA/SAP Exam Tips
SAA/SAP Tip: The most common Glue exam scenario: "automatically discover and catalog schemas from S3 data" → Glue Crawler + Data Catalog. The Catalog then enables Athena or Redshift Spectrum to query without manual DDL.
SAA/SAP Tip: "Convert CSV files in S3 to Parquet to reduce Athena query costs" → Glue ETL Job. This is a canonical pattern: raw data lands in S3 as CSV, Glue converts it to partitioned Parquet, Athena queries efficiently.
Exam Trap: Glue ETL is built on Spark — it has a 10-minute minimum billing unit and a cold-start time. It is not suitable for real-time or low-latency processing. For real-time transformation, use Lambda (simple, stateless) or Managed Flink (stateful, complex). Glue ETL is for batch.
SAP Tip: When a scenario involves PII (Personally Identifiable Information) detection and masking in S3 datasets with no coding, the answer is Glue DataBrew (has built-in PII detection transforms).
Cross-Cloud Equivalents
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | AWS Glue | Baseline; Catalog + ETL + DataBrew |
| Azure | Azure Data Factory + Purview | ADF for ETL pipelines; Purview for data catalog |
| GCP | Cloud Dataflow + Dataplex | Dataflow (Beam-based ETL); Dataplex for data catalog and governance |
| On-Premises | Apache Spark + Apache Hive Metastore | Spark for ETL; Hive Metastore is the OSS equivalent of Glue Catalog |
Pricing Model
- Glue ETL Jobs: per Data Processing Unit (DPU) per hour (1 DPU = 4 vCPUs + 16 GB RAM); 10-minute minimum per job run
- Glue Crawlers: per DPU-hour consumed during crawl
- Glue Data Catalog: first 1 million objects free; then per 100,000 objects/month stored; per million requests
- Glue DataBrew: per node-hour for job runs; per session-hour for interactive sessions
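The DPU billing mechanics above can be worked through in a few lines — the $0.44/DPU-hour rate is an assumption for illustration (check current regional pricing); the 10-minute minimum is the part worth internalizing:

```python
# Cost sketch for a Glue ETL job run under the DPU-hour model.
def glue_job_cost(dpus: int, runtime_minutes: float,
                  rate_per_dpu_hour: float = 0.44) -> float:
    billed_minutes = max(runtime_minutes, 10)   # 10-minute minimum per run
    return round(dpus * (billed_minutes / 60) * rate_per_dpu_hour, 4)

print(glue_job_cost(10, 6))    # a 6-minute job is still billed for 10 minutes
print(glue_job_cost(10, 45))
```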
Related Services / See Also
- Amazon Athena — primary consumer of the Glue Data Catalog for S3 SQL queries
- Amazon Redshift — Glue ETL loads transformed data into Redshift; Spectrum uses the Catalog
- Amazon S3 — Glue's primary data source and destination
- Amazon Kinesis and Managed Flink — streaming ETL alternative to Glue for real-time pipelines
- AWS Lake Formation — builds on Glue Catalog; adds fine-grained access control for data lakes