Xoxoftware - XOXO Creative Studio | Web & Mobile App Development | Fred Cheung | Hong Kong

AWS Glue

AWS Glue — serverless ETL and data cataloging service. Covers Glue Data Catalog, Crawlers, ETL jobs, DataBrew, and how Glue fits into the AWS data lake and analytics ecosystem.

Overview

AWS Glue is a serverless data integration service with two distinct functions: a Data Catalog (metadata repository) and an ETL engine (Apache Spark-based data transformation). The two are independent and can be used separately or together.

The simplest mental model:

"Glue Catalog" = the map  →  describes WHERE data is and WHAT shape it's in
"Glue ETL"     = the truck →  moves and transforms the data itself

How Glue Works Under the Hood

On the catalog side, Glue never copies or stores actual data. Crawlers and the Data Catalog only read and store metadata: descriptions of the data.

The Glue Data Catalog is a library index, not a library

Glue Data Catalog (metadata only — tiny, cheap)
├── Database: "analytics"
│   ├── Table: "orders"
│   │   ├── column: id        → INT
│   │   ├── column: amount    → DECIMAL
│   │   ├── column: date      → STRING
│   │   ├── location: s3://my-bucket/orders/   ← a pointer, not a copy
│   │   └── format: parquet
│   └── Table: "customers"
│       └── location: s3://my-bucket/customers/

The Catalog holds the schema and the S3 location. The actual data stays exactly where it was.

When Athena runs SELECT * FROM analytics.orders:

  1. Athena looks up orders in the Glue Catalog → gets the S3 path and column types
  2. Athena goes directly to S3 to read the actual Parquet files
  3. Results are returned. The Glue Catalog was consulted only for the schema; it never touched the data
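The metadata-vs-data split above can be sketched in a few lines of plain Python, with one dict standing in for the Catalog and another for S3 (all names here are illustrative, not real AWS APIs):

```python
# Toy model of the Athena/Glue interaction: the "catalog" holds only
# metadata; the "data" lives elsewhere (here, a separate dict standing
# in for S3). Names and paths are illustrative only.
catalog = {
    "analytics.orders": {
        "location": "s3://my-bucket/orders/",
        "columns": {"id": "INT", "amount": "DECIMAL", "date": "STRING"},
    }
}

s3 = {  # stand-in for the actual Parquet files in S3
    "s3://my-bucket/orders/": [
        {"id": 1, "amount": 19.99, "date": "2024-01-05"},
        {"id": 2, "amount": 5.00, "date": "2024-01-06"},
    ]
}

def run_query(table: str):
    """Resolve schema/location from the catalog, then read from 'S3'."""
    meta = catalog[table]        # step 1: metadata lookup only
    return s3[meta["location"]]  # step 2: data is read from storage

rows = run_query("analytics.orders")
print(len(rows))  # 2 — the catalog itself never held these rows
```

Deleting the `catalog` entry would make the table unqueryable, but the rows in `s3` would be untouched — exactly the index-card relationship described above.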

What the Crawler actually does

A Crawler does not import or replicate data. It:

  1. Connects to the data source (S3, RDS via JDBC, DynamoDB, etc.)
  2. Samples a small portion of the files — just enough rows to infer column names and types
  3. Writes the discovered schema as a table definition into the Glue Catalog
  4. Disconnects — the source data is untouched

S3: s3://bucket/orders/year=2024/month=01/data.parquet

         │  Crawler reads ~100 rows to infer schema
         │  (does NOT copy the file)

Glue Catalog: table "orders" { id INT, amount DECIMAL, ... }
              location: s3://bucket/orders/
              partitions: year=2024/month=01
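Conceptually, the inference step looks something like the following sketch. Real Glue classifiers are far more sophisticated, but the shape is the same: sampled rows in, a name-to-type mapping out.

```python
# Hedged sketch of a crawler's schema-inference step: sample a few
# records and infer a column -> type mapping. Glue's real classifiers
# handle many formats and edge cases; this only shows the concept.
def infer_schema(sample_rows):
    schema = {}
    for row in sample_rows:
        for col, value in row.items():
            if isinstance(value, bool):     # check bool before int
                inferred = "BOOLEAN"
            elif isinstance(value, int):
                inferred = "INT"
            elif isinstance(value, float):
                inferred = "DOUBLE"
            else:
                inferred = "STRING"
            # widen to STRING when rows disagree on a column's type
            if schema.get(col, inferred) != inferred:
                inferred = "STRING"
            schema[col] = inferred
    return schema

sample = [
    {"id": 1, "amount": 19.99, "date": "2024-01-05"},
    {"id": 2, "amount": 5.0, "date": "2024-01-06"},
]
print(infer_schema(sample))
# {'id': 'INT', 'amount': 'DOUBLE', 'date': 'STRING'}
```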

What Glue ETL Jobs actually do

ETL Jobs are where data does move — but only because the script explicitly reads from a source and writes to a destination. Glue:

  1. Spins up a managed Apache Spark cluster
  2. Runs the script (reads data, transforms it, writes output)
  3. Tears the Spark cluster down when done

The output goes wherever the script specifies — S3, Redshift, RDS. There is no "Glue storage." Glue is the execution environment, not the storage.

Glue ETL Job execution:
  [Spark cluster spins up]

      ├── reads from: S3 CSV files / RDS via JDBC
      ├── transforms: rename columns, convert to Parquet, deduplicate
      └── writes to:  S3 Parquet / Redshift table
  [Spark cluster tears down]
  (billed only for cluster runtime)
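The extract → transform → load shape of a job script can be sketched in plain Python, with in-memory CSV standing in for S3 and simple loops standing in for Spark. The structure, not the engine, is the point here:

```python
import csv
import io

# Minimal stand-in for a Glue job's extract -> transform -> load flow.
# A real job would use Spark DataFrames/DynamicFrames and read/write S3.
raw_csv = """order_id,amt
1,19.99
2,5.00
2,5.00
"""

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:   # deduplicate on order_id
            continue
        seen.add(r["order_id"])
        out.append({"id": int(r["order_id"]),     # rename + cast
                    "amount": float(r["amt"])})
    return out

def load(rows):
    # A real job would write Parquet to S3 or load a Redshift table;
    # here we just return the records that would be written.
    return rows

result = load(transform(extract(raw_csv)))
print(result)  # [{'id': 1, 'amount': 19.99}, {'id': 2, 'amount': 5.0}]
```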

Key mental model: The Glue Data Catalog = a library index card (tells where the book is and what it contains). Glue ETL = a courier service (actually moves things). The data = the books, always staying on the shelves (S3/database) unless a courier job explicitly moves them.


The Two Jobs of AWS Glue

| Function | What It Does | Used By |
| --- | --- | --- |
| Glue Data Catalog | Central metadata store: database/table names, schemas, column types, S3 locations, partition info | Athena, Redshift Spectrum, EMR, Lake Formation |
| Glue ETL Jobs | Serverless Spark scripts that extract, transform, and load data between sources | Redshift (loading), S3 (format conversion), any data pipeline |

This split is critical — many services use the Catalog without ever using Glue ETL, and Glue ETL jobs can run without relying on the Catalog.


Part 1: Glue Data Catalog

The AWS Glue Data Catalog is a managed metadata repository — essentially a database of databases. It stores:

  • Databases — logical groupings of tables
  • Tables — schema definitions (column names, data types, partition keys, S3 location of the actual data)
  • Partitions — sub-paths within a table (e.g. s3://bucket/year=2024/month=01/)
  • Connections — credentials and network config for JDBC sources (RDS, Redshift, etc.)

The Catalog itself stores only metadata — it never holds actual data. Data stays in S3 (or wherever it lives).
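Partition metadata comes from Hive-style `key=value` path segments. A minimal sketch of how such a path maps to partition columns:

```python
# Sketch of how Hive-style partition paths encode partition values:
# each "key=value" path segment becomes a partition column entry in
# the Catalog. Paths here are illustrative.
def parse_partitions(s3_key: str) -> dict:
    parts = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(parse_partitions("orders/year=2024/month=01/data.parquet"))
# {'year': '2024', 'month': '01'}
```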

Who Uses the Glue Data Catalog

| Service | How It Uses the Catalog |
| --- | --- |
| Amazon Athena | Reads table schemas from the Catalog to understand which SQL columns/types map to which S3 files |
| Redshift Spectrum | Creates external schemas pointing at Catalog databases; queries S3 through those schemas |
| Amazon EMR | Uses the Catalog as a Hive Metastore replacement |
| AWS Lake Formation | Builds on the Catalog; adds column/row-level access control |

SAA/SAP Tip: The Glue Data Catalog is the central schema registry for the AWS data lake. When a scenario involves multiple services needing consistent table definitions (Athena + Redshift Spectrum + EMR all pointing at the same S3 data), the answer is the Glue Data Catalog.

Glue Crawlers

Glue Crawlers automate schema discovery, eliminating the need to write CREATE TABLE DDL manually:

  1. Configure a Crawler with a target S3 path (or JDBC database, DynamoDB table, etc.)
  2. The Crawler samples the data and infers the schema (column names, types, partition structure)
  3. The discovered table definition is written into the Glue Data Catalog automatically
  4. Crawlers can be scheduled to run periodically to detect new partitions or schema changes

S3 bucket: s3://my-data-lake/orders/year=2024/month=01/orders.parquet
                                   ─────────────────────────────────────
                                   Crawler discovers this partition structure
                                   and writes it to the Catalog automatically

Exam Trap: A Crawler only discovers new partitions and schema changes when it runs; nothing is picked up automatically between runs. For partitions added continuously by Firehose or other automated pipelines, schedule the Crawler to run periodically (e.g. daily), or use partition projection in Athena to skip Crawlers entirely for predictable partition patterns.


Part 2: Glue ETL Jobs

A Glue ETL Job is a managed Apache Spark (or Python shell) script that extracts data from a source, transforms it, and loads it to a destination — without provisioning a Spark cluster.

How a Glue Job Works

Source                  Glue ETL Job                 Destination
S3 (CSV/JSON)  ──▶  [ Extract → Transform → Load ]  ──▶  S3 (Parquet)
RDS / Aurora   ──▶  [ Spark transforms (Python/Scala)]──▶  Redshift
DynamoDB       ──▶  [ Generated or custom code      ] ──▶  S3 data lake

Common Transformations

| Transformation | Example |
| --- | --- |
| Format conversion | CSV → Parquet or ORC (reduces Athena scan cost by ~10×) |
| Schema mapping | Rename columns, cast types, drop nulls |
| Deduplication | Remove duplicate records before loading to the warehouse |
| Enrichment / join | Join two datasets (e.g. orders + products) before loading |
| Partitioning | Partition output by a date column for efficient querying |
| Filtering | Drop rows that don't meet data quality rules |
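As an illustration of the partitioning transformation, here is a minimal sketch (paths and names are made up) of bucketing records under Hive-style date prefixes so a query engine can prune partitions later:

```python
from collections import defaultdict

# Sketch of "partition output by date": records are bucketed under
# Hive-style key=value prefixes. Output paths are illustrative.
def partition_by_date(rows, base="s3://out/orders"):
    buckets = defaultdict(list)
    for row in rows:
        year, month, _ = row["date"].split("-")
        buckets[f"{base}/year={year}/month={month}/"].append(row)
    return dict(buckets)

rows = [
    {"id": 1, "date": "2024-01-05"},
    {"id": 2, "date": "2024-02-10"},
    {"id": 3, "date": "2024-01-09"},
]
out = partition_by_date(rows)
print(sorted(out))
# ['s3://out/orders/year=2024/month=01/', 's3://out/orders/year=2024/month=02/']
```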

Job Types

| Type | Runtime | Best For |
| --- | --- | --- |
| Spark | Distributed Apache Spark | Large-scale data transformations (GBs–TBs) |
| Spark Streaming | Spark Structured Streaming | Near-real-time ETL from Kinesis or Kafka |
| Python Shell | Single Python process | Small datasets, API calls, lightweight scripts |
| Ray | Distributed Python (Ray) | ML preprocessing, Python-native distributed workloads |

Glue Studio

A visual drag-and-drop interface for building Glue ETL jobs without writing Spark code. Source, transform, and destination nodes connect visually; Studio generates the underlying PySpark script.

DynamicFrame vs. DataFrame

Glue introduces its own data abstraction called a DynamicFrame (on top of Spark's DataFrame):

  • Handles schema inconsistencies gracefully — different records can have different columns without crashing
  • Useful for messy real-world data where CSV/JSON schemas vary between files
  • Can be converted to a standard Spark DataFrame when needed for Spark SQL operations
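A small plain-Python illustration (not the awsglue library) of the problem a DynamicFrame solves: records whose columns differ can still be carried along, then resolved to a union schema afterwards:

```python
# Records with inconsistent columns — the messy input a DynamicFrame
# tolerates. A strict DataFrame needs one fixed schema up front; a
# DynamicFrame tracks per-record schema and resolves it later.
records = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": 5.0, "coupon": "WELCOME"},  # extra column
    {"id": 3},                                      # missing column
]

def union_schema(rows):
    """Collect every column seen across all records, in first-seen order."""
    cols = []
    for row in rows:
        for col in row:
            if col not in cols:
                cols.append(col)
    return cols

def to_uniform(rows):
    """Resolve to one schema, filling absent columns with None."""
    cols = union_schema(rows)
    return [{c: row.get(c) for c in cols} for row in rows]

print(union_schema(records))   # ['id', 'amount', 'coupon']
print(to_uniform(records)[2])  # {'id': 3, 'amount': None, 'coupon': None}
```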

Part 3: AWS Glue DataBrew

Glue DataBrew is a visual, no-code data preparation tool aimed at data analysts. Key capabilities:

  • Preview a dataset sample in a spreadsheet-like UI
  • Apply 250+ pre-built transformations (trim whitespace, normalize dates, mask personally identifiable information (PII), etc.)
  • Profile data quality (detect nulls, outliers, distributions)
  • Execute transformations as a job on the full S3 dataset
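Two of those recipe steps, trimming whitespace and masking email-shaped PII, might look like this in plain Python. DataBrew itself is no-code; this only shows the kind of transformation its built-in steps apply:

```python
import re

# Plain-Python sketch of two DataBrew-style recipe steps. The masking
# rule (replace everything before the @ with '#') is illustrative.
def trim_whitespace(value: str) -> str:
    return value.strip()

def mask_email(value: str) -> str:
    # mask the local part of an email address, keep the domain
    return re.sub(r"[^@\s]+(?=@)", lambda m: "#" * len(m.group()), value)

row = {"name": "  Ada Lovelace ", "email": "ada@example.com"}
clean = {"name": trim_whitespace(row["name"]),
         "email": mask_email(row["email"])}
print(clean)  # {'name': 'Ada Lovelace', 'email': '###@example.com'}
```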

DataBrew vs. Glue Studio:

| | Glue DataBrew | Glue Studio |
| --- | --- | --- |
| User | Data analysts, business users | Data engineers |
| Coding | No code — visual point-and-click | Visual + generated PySpark code |
| Flexibility | Limited to built-in transforms | Full Spark — any transformation |
| Data profiling | Yes — built-in | No |
| Best for | Quick data cleaning, PII masking | Custom complex ETL pipelines |

How Glue Fits Into the Analytics Ecosystem

                     ┌─────────────────────────────────────┐
                     │            Data Sources             │
                     │   S3  │  RDS  │  DynamoDB  │  JDBC  │
                     └──────────────────┬──────────────────┘
                                        │
                           ┌────────────▼────────────┐
                           │  Glue Crawler           │  ← auto-discovers schema
                           │  (metadata only)        │
                           └────────────┬────────────┘
                                        │ writes to
                           ┌────────────▼────────────┐
                           │  Glue Data Catalog      │  ← central schema registry
                           └──────┬───────────┬──────┘
                                  │           │
                  ┌───────────────▼──┐   ┌────▼────────────────┐
                  │  Amazon Athena   │   │  Redshift Spectrum  │
                  │  (SQL on S3)     │   │  (SQL via Redshift) │
                  └──────────────────┘   └─────────────────────┘

                           ┌─────────────────────────┐
                           │  Glue ETL Jobs          │  ← transforms & moves data
                           │  (Spark / PySpark)      │
                           └────────────┬────────────┘
                                        │
                        ┌───────────────▼───────────────┐
                        │  Destinations                 │
                        │  S3 (Parquet)  │  Redshift    │
                        │  RDS           │  OpenSearch  │
                        └───────────────────────────────┘

SAA/SAP Exam Tips

SAA/SAP Tip: The most common Glue exam scenario: "automatically discover and catalog schemas from S3 data" → Glue Crawler + Data Catalog. The Catalog then enables Athena or Redshift Spectrum to query without manual DDL.

SAA/SAP Tip: "Convert CSV files in S3 to Parquet to reduce Athena query costs" → Glue ETL Job. This is a canonical pattern: raw data lands in S3 as CSV, Glue converts it to partitioned Parquet, Athena queries efficiently.

Exam Trap: Glue ETL is built on Spark: jobs have a cold-start delay while the cluster spins up, plus a billing minimum (1 minute on Glue 2.0 and later; 10 minutes on older versions). It is not suitable for real-time or low-latency processing. For real-time transformation, use Lambda (simple, stateless) or Managed Flink (stateful, complex). Glue ETL is for batch.

SAP Tip: When a scenario involves PII detection and masking in S3 datasets with no coding, the answer is Glue DataBrew (it has built-in PII detection and masking transforms).


Cross-Cloud Equivalents

| Provider | Service / Solution | Notes |
| --- | --- | --- |
| AWS | AWS Glue | Baseline; Catalog + ETL + DataBrew |
| Azure | Azure Data Factory + Purview | ADF for ETL pipelines; Purview for data catalog |
| GCP | Cloud Dataflow + Dataplex | Dataflow (Beam-based ETL); Dataplex for data catalog and governance |
| On-premises | Apache Spark + Apache Hive Metastore | Spark for ETL; Hive Metastore is the OSS equivalent of the Glue Catalog |

Pricing Model

  • Glue ETL Jobs: per Data Processing Unit (DPU) per hour (1 DPU = 4 vCPUs + 16 GB RAM); 1-minute minimum per job run on Glue 2.0+ (10 minutes on older Glue versions)
  • Glue Crawlers: per DPU-hour consumed during crawl
  • Glue Data Catalog: first 1 million objects free; then per 100,000 objects/month stored; per million requests
  • Glue DataBrew: per node-hour for job runs; per session-hour for interactive sessions
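A back-of-envelope job-cost calculation follows, assuming an illustrative DPU-hour price (check the AWS pricing page for your region) and a parameterized billing minimum (newer Glue versions bill a 1-minute minimum; older versions billed 10 minutes):

```python
# Rough Glue ETL job cost estimate. The price is an assumption for
# illustration, not an authoritative figure; consult AWS pricing.
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44,
                  minimum_seconds=60):
    billed = max(runtime_seconds, minimum_seconds)  # billing minimum applies
    return dpus * (billed / 3600) * price_per_dpu_hour

# Example: 10 DPUs for 8 minutes
print(round(glue_job_cost(10, 8 * 60), 4))  # 0.5867
```

A 5-second job is billed the same as a 60-second one, which is why very short, frequent transformations are often cheaper on Lambda than on Glue ETL.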

