Xoxoftware - XOXO Creative Studio | Web & Mobile App Development | Fred Cheung | Hong Kong

AWS Glue

AWS Glue — serverless ETL and data cataloging service. Covers Glue Data Catalog, Crawlers, ETL jobs, DataBrew, and how Glue fits into the AWS data lake and analytics ecosystem.

Overview

AWS Glue is a serverless data integration service with two distinct functions: a Data Catalog (metadata repository) and an ETL engine (Apache Spark-based data transformation). The two are independent and can be used separately or together.

The simplest mental model:

"Glue Catalog" = the map  →  describes WHERE data is and WHAT shape it's in
"Glue ETL"     = the truck →  moves and transforms the data itself

How Glue Works Under the Hood

On the catalog side, Glue never copies or stores actual data. Crawlers and the Data Catalog only read and store metadata: descriptions of the data.

The Glue Data Catalog is a library index, not a library

Glue Data Catalog (metadata only — tiny, cheap)
├── Database: "analytics"
│   ├── Table: "orders"
│   │   ├── column: id        → INT
│   │   ├── column: amount    → DECIMAL
│   │   ├── column: date      → STRING
│   │   ├── location: s3://my-bucket/orders/   ← a pointer, not a copy
│   │   └── format: parquet
│   └── Table: "customers"
│       └── location: s3://my-bucket/customers/

The Catalog holds the schema and the S3 location. The actual data stays exactly where it was.

When Athena runs SELECT * FROM analytics.orders:

  1. Athena looks up orders in the Glue Catalog → gets the S3 path and column types
  2. Athena goes directly to S3 to read the actual Parquet files
  3. Results are returned. The Glue Catalog was consulted only for the schema; it never touched the data
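The metadata-vs-data split above can be sketched in a few lines of plain Python, with one dict standing in for the Catalog and another for S3 (all names here are illustrative, not real AWS APIs):

```python
# Toy model of the Athena/Glue interaction: the "catalog" holds only
# metadata; the "data" lives elsewhere (here, a separate dict standing
# in for S3). Names and paths are illustrative only.
catalog = {
    "analytics.orders": {
        "location": "s3://my-bucket/orders/",
        "columns": {"id": "INT", "amount": "DECIMAL", "date": "STRING"},
    }
}

s3 = {  # stand-in for the actual Parquet files in S3
    "s3://my-bucket/orders/": [
        {"id": 1, "amount": 19.99, "date": "2024-01-05"},
        {"id": 2, "amount": 5.00, "date": "2024-01-06"},
    ]
}

def run_query(table: str):
    """Resolve schema/location from the catalog, then read from 'S3'."""
    meta = catalog[table]        # step 1: metadata lookup only
    return s3[meta["location"]]  # step 2: data is read from storage

rows = run_query("analytics.orders")
print(len(rows))  # 2 — the catalog itself never held these rows
```

Deleting the `catalog` entry would make the table unqueryable, but the rows in `s3` would be untouched — exactly the index-card relationship described above.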

What the Crawler actually does

A Crawler does not import or replicate data. It:

  1. Connects to the data source (S3, RDS via JDBC, DynamoDB, etc.)
  2. Samples a small portion of the files — just enough rows to infer column names and types
  3. Writes the discovered schema as a table definition into the Glue Catalog
  4. Disconnects — the source data is untouched

S3: s3://bucket/orders/year=2024/month=01/data.parquet

         │  Crawler reads ~100 rows to infer schema
         │  (does NOT copy the file)

Glue Catalog: table "orders" { id INT, amount DECIMAL, ... }
              location: s3://bucket/orders/
              partitions: year=2024/month=01
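Conceptually, the inference step looks something like the following sketch. Real Glue classifiers are far more sophisticated, but the shape is the same: sampled rows in, a name-to-type mapping out.

```python
# Hedged sketch of a crawler's schema-inference step: sample a few
# records and infer a column -> type mapping. Glue's real classifiers
# handle many formats and edge cases; this only shows the concept.
def infer_schema(sample_rows):
    schema = {}
    for row in sample_rows:
        for col, value in row.items():
            if isinstance(value, bool):     # check bool before int
                inferred = "BOOLEAN"
            elif isinstance(value, int):
                inferred = "INT"
            elif isinstance(value, float):
                inferred = "DOUBLE"
            else:
                inferred = "STRING"
            # widen to STRING when rows disagree on a column's type
            if schema.get(col, inferred) != inferred:
                inferred = "STRING"
            schema[col] = inferred
    return schema

sample = [
    {"id": 1, "amount": 19.99, "date": "2024-01-05"},
    {"id": 2, "amount": 5.0, "date": "2024-01-06"},
]
print(infer_schema(sample))
# {'id': 'INT', 'amount': 'DOUBLE', 'date': 'STRING'}
```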

What Glue ETL Jobs actually do

ETL Jobs are where data does move — but only because the script explicitly reads from a source and writes to a destination. Glue:

  1. Spins up a managed Apache Spark cluster
  2. Runs the script (reads data, transforms it, writes output)
  3. Tears the Spark cluster down when done

The output goes wherever the script specifies — S3, Redshift, RDS. There is no "Glue storage." Glue is the execution environment, not the storage.

Glue ETL Job execution:
  [Spark cluster spins up]

      ├── reads from: S3 CSV files / RDS via JDBC
      ├── transforms: rename columns, convert to Parquet, deduplicate
      └── writes to:  S3 Parquet / Redshift table
  [Spark cluster tears down]
  (billed only for cluster runtime)
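The extract → transform → load shape of a job script can be sketched in plain Python, with in-memory CSV standing in for S3 and simple loops standing in for Spark. The structure, not the engine, is the point here:

```python
import csv
import io

# Minimal stand-in for a Glue job's extract -> transform -> load flow.
# A real job would use Spark DataFrames/DynamicFrames and read/write S3.
raw_csv = """order_id,amt
1,19.99
2,5.00
2,5.00
"""

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:   # deduplicate on order_id
            continue
        seen.add(r["order_id"])
        out.append({"id": int(r["order_id"]),     # rename + cast
                    "amount": float(r["amt"])})
    return out

def load(rows):
    # A real job would write Parquet to S3 or load a Redshift table;
    # here we just return the records that would be written.
    return rows

result = load(transform(extract(raw_csv)))
print(result)  # [{'id': 1, 'amount': 19.99}, {'id': 2, 'amount': 5.0}]
```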

Key mental model: The Glue Data Catalog = a library index card (tells where the book is and what it contains). Glue ETL = a courier service (actually moves things). The data = the books, always staying on the shelves (S3/database) unless a courier job explicitly moves them.


The Two Jobs of AWS Glue

| Function | What It Does | Used By |
| --- | --- | --- |
| Glue Data Catalog | Central metadata store: database/table names, schemas, column types, S3 locations, partition info | Athena, Redshift Spectrum, EMR, Lake Formation |
| Glue ETL Jobs | Serverless Spark scripts that extract, transform, and load data between sources | Redshift (loading), S3 (format conversion), any data pipeline |

This split is critical — many services use the Catalog without ever using Glue ETL, and Glue ETL jobs can run without relying on the Catalog.


Part 1: Glue Data Catalog

The AWS Glue Data Catalog is a managed metadata repository — essentially a database of databases. It stores:

  • Databases — logical groupings of tables
  • Tables — schema definitions (column names, data types, partition keys, S3 location of the actual data)
  • Partitions — sub-paths within a table (e.g. s3://bucket/year=2024/month=01/)
  • Connections — credentials and network config for JDBC sources (RDS, Redshift, etc.)

The Catalog itself stores only metadata — it never holds actual data. Data stays in S3 (or wherever it lives).
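Partition metadata comes from Hive-style `key=value` path segments. A minimal sketch of how such a path maps to partition columns:

```python
# Sketch of how Hive-style partition paths encode partition values:
# each "key=value" path segment becomes a partition column entry in
# the Catalog. Paths here are illustrative.
def parse_partitions(s3_key: str) -> dict:
    parts = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(parse_partitions("orders/year=2024/month=01/data.parquet"))
# {'year': '2024', 'month': '01'}
```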

Who Uses the Glue Data Catalog

| Service | How It Uses the Catalog |
| --- | --- |
| Amazon Athena | Reads table schemas from the Catalog to understand which SQL columns/types map to which S3 files |
| Redshift Spectrum | Creates external schemas pointing at Catalog databases; queries S3 through those schemas |
| Amazon EMR | Uses the Catalog as a Hive Metastore replacement |
| AWS Lake Formation | Builds on the Catalog; adds column/row-level access control |

SAA/SAP Tip: The Glue Data Catalog is the central schema registry for the AWS data lake. When a scenario involves multiple services needing consistent table definitions (Athena + Redshift Spectrum + EMR all pointing at the same S3 data), the answer is the Glue Data Catalog.

Glue Crawlers

Glue Crawlers automate schema discovery, eliminating the need to write CREATE TABLE DDL manually:

  1. Configure a Crawler with a target S3 path (or JDBC database, DynamoDB table, etc.)
  2. The Crawler samples the data and infers the schema (column names, types, partition structure)
  3. The discovered table definition is written into the Glue Data Catalog automatically
  4. Crawlers can be scheduled to run periodically to detect new partitions or schema changes

S3 bucket: s3://my-data-lake/orders/year=2024/month=01/orders.parquet
                                   ─────────────────────────────────────
                                   Crawler discovers this partition structure
                                   and writes it to the Catalog automatically

Exam Trap: A Crawler only discovers new partitions and schema changes when it runs; nothing is picked up automatically between runs. For partitions added continuously by Firehose or other automated pipelines, schedule the Crawler to run periodically (e.g. daily), or use partition projection in Athena to skip Crawlers entirely for predictable partition patterns.


Part 2: Glue ETL Jobs

A Glue ETL Job is a managed Apache Spark (or Python shell) script that extracts data from a source, transforms it, and loads it to a destination — without provisioning a Spark cluster.

How a Glue Job Works

Source                  Glue ETL Job                 Destination
S3 (CSV/JSON)  ──▶  [ Extract → Transform → Load ]  ──▶  S3 (Parquet)
RDS / Aurora   ──▶  [ Spark transforms (Python/Scala)]──▶  Redshift
DynamoDB       ──▶  [ Generated or custom code      ] ──▶  S3 data lake

Common Transformations

| Transformation | Example |
| --- | --- |
| Format conversion | CSV → Parquet or ORC (reduces Athena scan cost by ~10×) |
| Schema mapping | Rename columns, cast types, drop nulls |
| Deduplication | Remove duplicate records before loading to the warehouse |
| Enrichment / join | Join two datasets (e.g. orders + products) before loading |
| Partitioning | Partition output by a date column for efficient querying |
| Filtering | Drop rows that don't meet data quality rules |
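As an illustration of the partitioning transformation, here is a minimal sketch (paths and names are made up) of bucketing records under Hive-style date prefixes so a query engine can prune partitions later:

```python
from collections import defaultdict

# Sketch of "partition output by date": records are bucketed under
# Hive-style key=value prefixes. Output paths are illustrative.
def partition_by_date(rows, base="s3://out/orders"):
    buckets = defaultdict(list)
    for row in rows:
        year, month, _ = row["date"].split("-")
        buckets[f"{base}/year={year}/month={month}/"].append(row)
    return dict(buckets)

rows = [
    {"id": 1, "date": "2024-01-05"},
    {"id": 2, "date": "2024-02-10"},
    {"id": 3, "date": "2024-01-09"},
]
out = partition_by_date(rows)
print(sorted(out))
# ['s3://out/orders/year=2024/month=01/', 's3://out/orders/year=2024/month=02/']
```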

Job Types

| Type | Runtime | Best For |
| --- | --- | --- |
| Spark | Distributed Apache Spark | Large-scale data transformations (GBs–TBs) |
| Spark Streaming | Spark Structured Streaming | Near-real-time ETL from Kinesis or Kafka |
| Python Shell | Single Python process | Small datasets, API calls, lightweight scripts |
| Ray | Distributed Python (Ray) | ML preprocessing, Python-native distributed workloads |

Glue Studio

A visual drag-and-drop interface for building Glue ETL jobs without writing Spark code. Source, transform, and destination nodes connect visually; Studio generates the underlying PySpark script.

DynamicFrame vs. DataFrame

Glue introduces its own data abstraction called a DynamicFrame (on top of Spark's DataFrame):

  • Handles schema inconsistencies gracefully — different records can have different columns without crashing
  • Useful for messy real-world data where CSV/JSON schemas vary between files
  • Can be converted to a standard Spark DataFrame when needed for Spark SQL operations
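A small plain-Python illustration (not the awsglue library) of the problem a DynamicFrame solves: records whose columns differ can still be carried along, then resolved to a union schema afterwards:

```python
# Records with inconsistent columns — the messy input a DynamicFrame
# tolerates. A strict DataFrame needs one fixed schema up front; a
# DynamicFrame tracks per-record schema and resolves it later.
records = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": 5.0, "coupon": "WELCOME"},  # extra column
    {"id": 3},                                      # missing column
]

def union_schema(rows):
    """Collect every column seen across all records, in first-seen order."""
    cols = []
    for row in rows:
        for col in row:
            if col not in cols:
                cols.append(col)
    return cols

def to_uniform(rows):
    """Resolve to one schema, filling absent columns with None."""
    cols = union_schema(rows)
    return [{c: row.get(c) for c in cols} for row in rows]

print(union_schema(records))   # ['id', 'amount', 'coupon']
print(to_uniform(records)[2])  # {'id': 3, 'amount': None, 'coupon': None}
```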

Part 3: AWS Glue DataBrew

Glue DataBrew is a visual, no-code data preparation tool aimed at data analysts. Key capabilities:

  • Preview a dataset sample in a spreadsheet-like UI
  • Apply 250+ pre-built transformations (trim whitespace, normalize dates, mask personally identifiable information (PII), etc.)
  • Profile data quality (detect nulls, outliers, distributions)
  • Execute transformations as a job on the full S3 dataset
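Two of those recipe steps, trimming whitespace and masking email-shaped PII, might look like this in plain Python. DataBrew itself is no-code; this only shows the kind of transformation its built-in steps apply:

```python
import re

# Plain-Python sketch of two DataBrew-style recipe steps. The masking
# rule (replace everything before the @ with '#') is illustrative.
def trim_whitespace(value: str) -> str:
    return value.strip()

def mask_email(value: str) -> str:
    # mask the local part of an email address, keep the domain
    return re.sub(r"[^@\s]+(?=@)", lambda m: "#" * len(m.group()), value)

row = {"name": "  Ada Lovelace ", "email": "ada@example.com"}
clean = {"name": trim_whitespace(row["name"]),
         "email": mask_email(row["email"])}
print(clean)  # {'name': 'Ada Lovelace', 'email': '###@example.com'}
```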

DataBrew vs. Glue Studio:

| | Glue DataBrew | Glue Studio |
| --- | --- | --- |
| User | Data analysts, business users | Data engineers |
| Coding | No code — visual point-and-click | Visual + generated PySpark code |
| Flexibility | Limited to built-in transforms | Full Spark — any transformation |
| Data profiling | Yes — built-in | No |
| Best for | Quick data cleaning, PII masking | Custom complex ETL pipelines |

How Glue Fits Into the Analytics Ecosystem

                     ┌─────────────────────────────────────┐
                     │            Data Sources             │
                     │   S3  │  RDS  │  DynamoDB  │  JDBC  │
                     └──────────────────┬──────────────────┘
                                        │
                           ┌────────────▼────────────┐
                           │  Glue Crawler           │  ← auto-discovers schema
                           │  (metadata only)        │
                           └────────────┬────────────┘
                                        │ writes to
                           ┌────────────▼────────────┐
                           │  Glue Data Catalog      │  ← central schema registry
                           └──────┬───────────┬──────┘
                                  │           │
                  ┌───────────────▼──┐   ┌────▼────────────────┐
                  │  Amazon Athena   │   │  Redshift Spectrum  │
                  │  (SQL on S3)     │   │  (SQL via Redshift) │
                  └──────────────────┘   └─────────────────────┘

                           ┌─────────────────────────┐
                           │  Glue ETL Jobs          │  ← transforms & moves data
                           │  (Spark / PySpark)      │
                           └────────────┬────────────┘
                                        │
                        ┌───────────────▼───────────────┐
                        │  Destinations                 │
                        │  S3 (Parquet)  │  Redshift    │
                        │  RDS           │  OpenSearch  │
                        └───────────────────────────────┘

SAA/SAP Exam Tips

SAA/SAP Tip: The most common Glue exam scenario: "automatically discover and catalog schemas from S3 data" → Glue Crawler + Data Catalog. The Catalog then enables Athena or Redshift Spectrum to query without manual DDL.

SAA/SAP Tip: "Convert CSV files in S3 to Parquet to reduce Athena query costs" → Glue ETL Job. This is a canonical pattern: raw data lands in S3 as CSV, Glue converts it to partitioned Parquet, Athena queries efficiently.

Exam Trap: Glue ETL is built on Spark: jobs have a cold-start delay while the cluster spins up, plus a billing minimum (1 minute on Glue 2.0 and later; 10 minutes on older versions). It is not suitable for real-time or low-latency processing. For real-time transformation, use Lambda (simple, stateless) or Managed Flink (stateful, complex). Glue ETL is for batch.

SAP Tip: When a scenario involves PII detection and masking in S3 datasets with no coding, the answer is Glue DataBrew (it has built-in PII detection and masking transforms).


Cross-Cloud Equivalents

| Provider | Service / Solution | Notes |
| --- | --- | --- |
| AWS | AWS Glue | Baseline; Catalog + ETL + DataBrew |
| Azure | Azure Data Factory + Purview | ADF for ETL pipelines; Purview for data catalog |
| GCP | Cloud Dataflow + Dataplex | Dataflow (Beam-based ETL); Dataplex for data catalog and governance |
| On-premises | Apache Spark + Apache Hive Metastore | Spark for ETL; Hive Metastore is the OSS equivalent of the Glue Catalog |

Pricing Model

  • Glue ETL Jobs: per Data Processing Unit (DPU) per hour (1 DPU = 4 vCPUs + 16 GB RAM); 1-minute minimum per job run on Glue 2.0+ (10 minutes on older Glue versions)
  • Glue Crawlers: per DPU-hour consumed during crawl
  • Glue Data Catalog: first 1 million objects free; then per 100,000 objects/month stored; per million requests
  • Glue DataBrew: per node-hour for job runs; per session-hour for interactive sessions
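A back-of-envelope job-cost calculation follows, assuming an illustrative DPU-hour price (check the AWS pricing page for your region) and a parameterized billing minimum (newer Glue versions bill a 1-minute minimum; older versions billed 10 minutes):

```python
# Rough Glue ETL job cost estimate. The price is an assumption for
# illustration, not an authoritative figure; consult AWS pricing.
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44,
                  minimum_seconds=60):
    billed = max(runtime_seconds, minimum_seconds)  # billing minimum applies
    return dpus * (billed / 3600) * price_per_dpu_hour

# Example: 10 DPUs for 8 minutes
print(round(glue_job_cost(10, 8 * 60), 4))  # 0.5867
```

A 5-second job is billed the same as a 60-second one, which is why very short, frequent transformations are often cheaper on Lambda than on Glue ETL.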

