ETL and ELT Fundamentals
Extract, Transform, Load (ETL) and its modern variant ELT — the core data pipeline patterns behind every data warehouse and analytics system.
Overview
Extract, Transform, Load (ETL) is a data pipeline pattern for moving data from one or more source systems into a destination — normalising and cleaning it along the way so the destination can serve analytical queries efficiently.
Every data pipeline implicitly does ETL. The value of the term is a shared vocabulary that maps to tool categories, job types, and service billing models — including the AWS exam keyword that points directly to AWS Glue.
The Three Phases
Extract — Read from sources
The pipeline connects to source systems and reads raw data without modifying the source.
Common sources:
- Relational databases (MySQL, PostgreSQL, Oracle) via JDBC
- Object storage (S3 CSVs, JSON files, Parquet)
- REST APIs and SaaS systems (Salesforce, Stripe)
- Event streams (Kinesis, Kafka)
- NoSQL databases (DynamoDB, MongoDB)
Simple example:
-- Extract: read all orders placed in the last 24 hours from a MySQL source
SELECT order_id, customer_id, amount, status, created_at
FROM orders
WHERE created_at >= NOW() - INTERVAL 1 DAY;Transform — Clean and reshape
The Transform phase applies business logic to the extracted data before it reaches the destination. This is the most variable phase — it can be trivial or arbitrarily complex.
Data cleansing examples:
| Problem in Source Data | Transform Applied |
|---|---|
amount stored as "$12.50" (string) | Strip $, cast to DECIMAL(10,2) |
status column has NULL rows | Replace NULL with 'unknown' |
customer_id in orders doesn't match id in customers | Inner join to drop orphaned rows |
created_at in UTC, reporting requires AEST | Convert timezone offset |
| Duplicate rows from source replication lag | DISTINCT or ROW_NUMBER() deduplication |
Free-text country field: "AU", "Australia", "aus" | Normalise to ISO 3166-1 alpha-2 codes |
AWS Glue example — rename column and cast type in a DynamicFrame:
from awsglue.transforms import ApplyMapping
mapped = ApplyMapping.apply(
frame=raw_orders,
mappings=[
("order_id", "string", "order_id", "string"),
("customer_id", "string", "customer_id", "string"),
("amount", "string", "amount", "decimal(10,2)"), # cast
("created_at", "string", "event_time", "timestamp"), # rename + cast
]
)Load — Write to destination
The processed data is written to the target system in its final form.
Common destinations:
- Data warehouses: Amazon Redshift, Snowflake, BigQuery
- Data lakes: Amazon S3 (Parquet, ORC, or Avro format)
- Search indexes: Amazon OpenSearch Service
- Reporting stores: Amazon Redshift, RDS read replicas
Simple example:
-- Load: insert transformed rows into the Redshift fact table
INSERT INTO analytics.fact_orders (order_id, customer_id, amount, event_time)
SELECT order_id, customer_id, amount, event_time
FROM staging.orders_clean;ETL vs. ELT
The modern variant — ELT (Extract, Load, Transform) — reverses the last two phases. Raw data lands in the warehouse first; transformations run in-place using SQL.
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Outside the destination (Spark, Glue job) | Inside the destination (SQL in Redshift, BigQuery, Snowflake) |
| When to use | Complex transformations, legacy warehouses, regulated data masking before load | Cloud data warehouses with high compute capacity; SQL-first teams |
| Tooling (AWS) | AWS Glue (Spark), AWS Lambda, Step Functions | dbt on Redshift, Redshift stored procedures |
| Data latency | Higher — transformation must complete before load | Lower — raw data available immediately after load |
| Schema required at load? | Yes — data is structured before it lands | No — raw data lands as-is; schema applied at query time (schema-on-read) |
Full End-to-End Example
A retail company ingests daily order exports from MySQL into Redshift for BI dashboards.
MySQL (OLTP)
│
│ Extract: JDBC read — orders from last 24h
▼
AWS Glue ETL Job (Apache Spark)
│ Transform:
│ - cast amount string → DECIMAL
│ - drop NULL customer_id rows
│ - normalise country codes
│ - deduplicate by order_id
▼
S3 staging bucket (Parquet)
│
│ Load: COPY command
▼
Amazon Redshift — fact_orders table
│
▼
Amazon QuickSight dashboardQuick Reference
| Term | Meaning |
|---|---|
| ETL | Extract → Transform → Load; transform happens before load |
| ELT | Extract → Load → Transform; transform happens after load, inside the warehouse |
| Data cleansing | Removing or correcting bad, missing, or inconsistent data during Transform |
| Staging area | Intermediate storage (often S3) between extraction and final load |
| Schema-on-write | ETL pattern — data is structured before it lands in the destination |
| Schema-on-read | ELT pattern — raw data lands first; structure is applied at query time |
| Idempotency | A well-designed ETL job produces the same result if re-run on the same data |
AWS Implementation Options
| Pattern | AWS Services |
|---|---|
| Batch ETL (Spark-based) | AWS Glue ETL Jobs |
| Streaming ETL | AWS Glue Streaming, Amazon Kinesis Data Firehose |
| Lightweight / event-driven transform | AWS Lambda + S3 event triggers |
| Orchestration | AWS Step Functions, Amazon MWAA (Apache Airflow) |
| ELT inside warehouse | dbt on Amazon Redshift, Redshift stored procedures |
| No-code ETL | AWS Glue Studio, AWS Glue DataBrew |
| Managed data ingestion (SaaS sources) | Amazon AppFlow |
Related Services / See Also
- AWS Glue — serverless ETL engine and data catalog; primary AWS ETL service
- Amazon Redshift — cloud data warehouse; common ETL destination
- Amazon Kinesis and Managed Flink — streaming ingestion and real-time transformation
- Amazon Athena — query S3 directly; supports schema-on-read ELT patterns
- Database Performance Fundamentals — OLTP vs. OLAP workload differences
Disaster Recovery Strategies
DR strategy comparison (Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active/Active) with RTO and RPO definitions. Core SAA/SAP topic.
Migration Strategies
The 7 Rs of cloud migration — a decision framework for moving workloads to AWS, from simple rehosting to complete retirement.