
ETL and ELT Fundamentals

Extract, Transform, Load (ETL) and its modern variant ELT — the core data pipeline patterns behind every data warehouse and analytics system.

Overview

Extract, Transform, Load (ETL) is a data pipeline pattern for moving data from one or more source systems into a destination — normalising and cleaning it along the way so the destination can serve analytical queries efficiently.

Every data pipeline implicitly does ETL. The value of the term is that it gives teams a shared vocabulary that maps onto tool categories, job types, and service billing models; on AWS certification exams, the keyword "ETL" points directly to AWS Glue.


The Three Phases

Extract — Read from sources

The pipeline connects to source systems and reads raw data without modifying the source.

Common sources:

  • Relational databases (MySQL, PostgreSQL, Oracle) via JDBC
  • Object storage (S3 CSVs, JSON files, Parquet)
  • REST APIs and SaaS systems (Salesforce, Stripe)
  • Event streams (Kinesis, Kafka)
  • NoSQL databases (DynamoDB, MongoDB)

Simple example:

-- Extract: read all orders placed in the last 24 hours from a MySQL source
SELECT order_id, customer_id, amount, status, created_at
FROM orders
WHERE created_at >= NOW() - INTERVAL 1 DAY;

Transform — Clean and reshape

The Transform phase applies business logic to the extracted data before it reaches the destination. This is the most variable phase — it can be trivial or arbitrarily complex.

Data cleansing examples:

| Problem in Source Data | Transform Applied |
| --- | --- |
| `amount` stored as `"$12.50"` (string) | Strip `$`, cast to `DECIMAL(10,2)` |
| `status` column has NULL rows | Replace NULL with `'unknown'` |
| `customer_id` in orders doesn't match `id` in customers | Inner join to drop orphaned rows |
| `created_at` in UTC, reporting requires AEST | Convert timezone offset |
| Duplicate rows from source replication lag | `DISTINCT` or `ROW_NUMBER()` deduplication |
| Free-text country field: "AU", "Australia", "aus" | Normalise to ISO 3166-1 alpha-2 codes |
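The cleansing rules above can be sketched in plain Python. The helper names (`clean_amount`, `COUNTRY_ALIASES`, the sample rows) are hypothetical, chosen only to illustrate each rule; a real job would run the equivalent logic in Spark or SQL:

```python
from decimal import Decimal

# Hypothetical alias map for normalising free-text country values.
COUNTRY_ALIASES = {"au": "AU", "aus": "AU", "australia": "AU"}

def clean_amount(raw: str) -> Decimal:
    """Strip the currency symbol and cast to a fixed-point value."""
    return Decimal(raw.lstrip("$"))

def clean_status(raw):
    """Replace missing statuses with an 'unknown' sentinel."""
    return raw if raw is not None else "unknown"

def clean_country(raw: str) -> str:
    """Normalise free-text country values to ISO 3166-1 alpha-2 codes."""
    return COUNTRY_ALIASES.get(raw.strip().lower(), raw.strip().upper())

def dedupe(rows):
    """Keep the first row seen per order_id (replication-lag duplicates)."""
    seen, out = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(row)
    return out

rows = [
    {"order_id": "o-1", "amount": "$12.50", "status": None, "country": "aus"},
    {"order_id": "o-1", "amount": "$12.50", "status": None, "country": "aus"},  # duplicate
]
cleaned = [
    {**r,
     "amount": clean_amount(r["amount"]),
     "status": clean_status(r["status"]),
     "country": clean_country(r["country"])}
    for r in dedupe(rows)
]
```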

AWS Glue example — rename column and cast type in a DynamicFrame:

from awsglue.transforms import ApplyMapping

mapped = ApplyMapping.apply(
    frame=raw_orders,
    mappings=[
        ("order_id",    "string",  "order_id",    "string"),
        ("customer_id", "string",  "customer_id", "string"),
        ("amount",      "string",  "amount",      "decimal(10,2)"),  # cast
        ("created_at",  "string",  "event_time",  "timestamp"),      # rename + cast
    ]
)

Load — Write to destination

The processed data is written to the target system in its final form.

Common destinations:

  • Data warehouses: Amazon Redshift, Snowflake, BigQuery
  • Data lakes: Amazon S3 (Parquet, ORC, or Avro format)
  • Search indexes: Amazon OpenSearch Service
  • Reporting stores: Amazon Redshift, RDS read replicas

Simple example:

-- Load: insert transformed rows into the Redshift fact table
INSERT INTO analytics.fact_orders (order_id, customer_id, amount, event_time)
SELECT order_id, customer_id, amount, event_time
FROM staging.orders_clean;

ETL vs. ELT

The modern variant — ELT (Extract, Load, Transform) — reverses the last two phases. Raw data lands in the warehouse first; transformations run in-place using SQL.

| Aspect | ETL | ELT |
| --- | --- | --- |
| Transform location | Outside the destination (Spark, Glue job) | Inside the destination (SQL in Redshift, BigQuery, Snowflake) |
| When to use | Complex transformations, legacy warehouses, regulated data masking before load | Cloud data warehouses with high compute capacity; SQL-first teams |
| Tooling (AWS) | AWS Glue (Spark), AWS Lambda, Step Functions | dbt on Redshift, Redshift stored procedures |
| Data latency | Higher — transformation must complete before load | Lower — raw data available immediately after load |
| Schema required at load? | Yes — data is structured before it lands | No — raw data lands as-is; schema applied at query time (schema-on-read) |
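The ELT ordering is easy to demonstrate end to end. The sketch below uses SQLite purely as a stand-in for a cloud warehouse (Redshift, BigQuery, or Snowflake in practice); table and column names are illustrative. Raw strings land first, then a SQL transform runs in-place inside the "warehouse":

```python
import sqlite3

# SQLite stands in for the warehouse; in ELT the raw data lands before
# any cleaning happens.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")

# Load: raw values land as-is — the amount column is still a string.
con.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [("o-1", "$12.50"), ("o-2", "$3.00")])

# Transform: runs after load, in SQL, inside the destination.
con.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id,
           CAST(REPLACE(amount, '$', '') AS REAL) AS amount
    FROM raw_orders
""")

total = con.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
```

In a real ELT stack the `CREATE TABLE ... AS SELECT` step would typically be a dbt model or a stored procedure scheduled inside the warehouse.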

Full End-to-End Example

A retail company ingests daily order exports from MySQL into Redshift for BI dashboards.

MySQL (OLTP)
    │  Extract: JDBC read — orders from last 24h
    ▼
AWS Glue ETL Job (Apache Spark)
    │  Transform:
    │    - cast amount string  → DECIMAL
    │    - drop NULL customer_id rows
    │    - normalise country codes
    │    - deduplicate by order_id
    ▼
S3 staging bucket (Parquet)
    │  Load: COPY command
    ▼
Amazon Redshift — fact_orders table
    │
    ▼
Amazon QuickSight dashboard
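The three phases of this pipeline can be condensed into one runnable toy. SQLite stands in for both the MySQL source and the Redshift destination, and an in-memory list stands in for the S3 staging bucket; all names and sample rows are illustrative:

```python
import sqlite3

# Stand-in source system (MySQL in the real pipeline).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("o-1", "c-1", "$12.50"),
    ("o-1", "c-1", "$12.50"),   # replication-lag duplicate
    ("o-2", None,  "$3.00"),    # orphaned row, dropped in Transform
])

# Extract: read raw rows without modifying the source.
rows = src.execute("SELECT order_id, customer_id, amount FROM orders").fetchall()

# Transform: drop NULL customer_id, dedupe by order_id, cast amount.
staging, seen = [], set()
for order_id, customer_id, amount in rows:
    if customer_id is None or order_id in seen:
        continue
    seen.add(order_id)
    staging.append((order_id, customer_id, float(amount.lstrip("$"))))

# Load: write the cleaned rows to the destination (Redshift in practice).
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE fact_orders (order_id TEXT, customer_id TEXT, amount REAL)")
dst.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", staging)
count = dst.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```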

Quick Reference

| Term | Meaning |
| --- | --- |
| ETL | Extract → Transform → Load; transform happens before load |
| ELT | Extract → Load → Transform; transform happens after load, inside the warehouse |
| Data cleansing | Removing or correcting bad, missing, or inconsistent data during Transform |
| Staging area | Intermediate storage (often S3) between extraction and final load |
| Schema-on-write | ETL pattern — data is structured before it lands in the destination |
| Schema-on-read | ELT pattern — raw data lands first; structure is applied at query time |
| Idempotency | A well-designed ETL job produces the same result if re-run on the same data |
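Idempotency is usually achieved with a delete-then-insert (or merge) pattern: the job first removes the partition it is about to reload, so re-running it leaves the table in the same state. A minimal sketch, again with SQLite standing in for the warehouse and hypothetical table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_orders (order_id TEXT PRIMARY KEY, amount REAL, load_date TEXT)"
)

def load_day(rows, day):
    """Idempotent load: wipe the day's partition, then insert it fresh."""
    with con:  # one transaction — delete and insert succeed or fail together
        con.execute("DELETE FROM fact_orders WHERE load_date = ?", (day,))
        con.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?)",
            [(order_id, amount, day) for order_id, amount in rows],
        )

batch = [("o-1", 12.5), ("o-2", 3.0)]
load_day(batch, "2024-01-01")
load_day(batch, "2024-01-01")  # re-run on the same data: no duplicates
n = con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```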

AWS Implementation Options

| Pattern | AWS Services |
| --- | --- |
| Batch ETL (Spark-based) | AWS Glue ETL Jobs |
| Streaming ETL | AWS Glue Streaming, Amazon Kinesis Data Firehose |
| Lightweight / event-driven transform | AWS Lambda + S3 event triggers |
| Orchestration | AWS Step Functions, Amazon MWAA (Apache Airflow) |
| ELT inside warehouse | dbt on Amazon Redshift, Redshift stored procedures |
| No-code ETL | AWS Glue Studio, AWS Glue DataBrew |
| Managed data ingestion (SaaS sources) | Amazon AppFlow |

Built by Fred Cheung @CookedRicer · Powered by Fumadocs & Github Copilot
