
ETL and ELT Fundamentals

Extract, Transform, Load (ETL) and its modern variant ELT — the core data pipeline patterns behind every data warehouse and analytics system.

Overview

Extract, Transform, Load (ETL) is a data pipeline pattern for moving data from one or more source systems into a destination — normalising and cleaning it along the way so the destination can serve analytical queries efficiently.

Every data pipeline implicitly does ETL. The value of the term is that it gives teams a shared vocabulary that maps onto tool categories, job types, and service billing models; on AWS certification exams, the keyword "ETL" points directly to AWS Glue.


The Three Phases

Extract — Read from sources

The pipeline connects to source systems and reads raw data without modifying the source.

Common sources:

  • Relational databases (MySQL, PostgreSQL, Oracle) via JDBC
  • Object storage (S3 CSVs, JSON files, Parquet)
  • REST APIs and SaaS systems (Salesforce, Stripe)
  • Event streams (Kinesis, Kafka)
  • NoSQL databases (DynamoDB, MongoDB)

Simple example:

-- Extract: read all orders placed in the last 24 hours from a MySQL source
SELECT order_id, customer_id, amount, status, created_at
FROM orders
WHERE created_at >= NOW() - INTERVAL 1 DAY;

Transform — Clean and reshape

The Transform phase applies business logic to the extracted data before it reaches the destination. This is the most variable phase — it can be trivial or arbitrarily complex.

Data cleansing examples:

| Problem in Source Data | Transform Applied |
| --- | --- |
| `amount` stored as `"$12.50"` (string) | Strip `$`, cast to `DECIMAL(10,2)` |
| `status` column has NULL rows | Replace NULL with `'unknown'` |
| `customer_id` in orders doesn't match `id` in customers | Inner join to drop orphaned rows |
| `created_at` in UTC, reporting requires AEST | Convert timezone offset |
| Duplicate rows from source replication lag | `DISTINCT` or `ROW_NUMBER()` deduplication |
| Free-text country field: "AU", "Australia", "aus" | Normalise to ISO 3166-1 alpha-2 codes |
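The cleansing rules above can be sketched in plain Python. The helper names (`clean_amount`, `COUNTRY_ALIASES`, the sample rows) are hypothetical, chosen only to illustrate each rule; a real job would run the equivalent logic in Spark or SQL:

```python
from decimal import Decimal

# Hypothetical alias map for normalising free-text country values.
COUNTRY_ALIASES = {"au": "AU", "aus": "AU", "australia": "AU"}

def clean_amount(raw: str) -> Decimal:
    """Strip the currency symbol and cast to a fixed-point value."""
    return Decimal(raw.lstrip("$"))

def clean_status(raw):
    """Replace missing statuses with an 'unknown' sentinel."""
    return raw if raw is not None else "unknown"

def clean_country(raw: str) -> str:
    """Normalise free-text country values to ISO 3166-1 alpha-2 codes."""
    return COUNTRY_ALIASES.get(raw.strip().lower(), raw.strip().upper())

def dedupe(rows):
    """Keep the first row seen per order_id (replication-lag duplicates)."""
    seen, out = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(row)
    return out

rows = [
    {"order_id": "o-1", "amount": "$12.50", "status": None, "country": "aus"},
    {"order_id": "o-1", "amount": "$12.50", "status": None, "country": "aus"},  # duplicate
]
cleaned = [
    {**r,
     "amount": clean_amount(r["amount"]),
     "status": clean_status(r["status"]),
     "country": clean_country(r["country"])}
    for r in dedupe(rows)
]
```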

AWS Glue example — rename column and cast type in a DynamicFrame:

from awsglue.transforms import ApplyMapping

mapped = ApplyMapping.apply(
    frame=raw_orders,
    mappings=[
        ("order_id",    "string",  "order_id",    "string"),
        ("customer_id", "string",  "customer_id", "string"),
        ("amount",      "string",  "amount",      "decimal(10,2)"),  # cast
        ("created_at",  "string",  "event_time",  "timestamp"),      # rename + cast
    ]
)

Load — Write to destination

The processed data is written to the target system in its final form.

Common destinations:

  • Data warehouses: Amazon Redshift, Snowflake, BigQuery
  • Data lakes: Amazon S3 (Parquet, ORC, or Avro format)
  • Search indexes: Amazon OpenSearch Service
  • Reporting stores: Amazon Redshift, RDS read replicas

Simple example:

-- Load: insert transformed rows into the Redshift fact table
INSERT INTO analytics.fact_orders (order_id, customer_id, amount, event_time)
SELECT order_id, customer_id, amount, event_time
FROM staging.orders_clean;

ETL vs. ELT

The modern variant — ELT (Extract, Load, Transform) — reverses the last two phases. Raw data lands in the warehouse first; transformations run in-place using SQL.

| Aspect | ETL | ELT |
| --- | --- | --- |
| Transform location | Outside the destination (Spark, Glue job) | Inside the destination (SQL in Redshift, BigQuery, Snowflake) |
| When to use | Complex transformations, legacy warehouses, regulated data masking before load | Cloud data warehouses with high compute capacity; SQL-first teams |
| Tooling (AWS) | AWS Glue (Spark), AWS Lambda, Step Functions | dbt on Redshift, Redshift stored procedures |
| Data latency | Higher — transformation must complete before load | Lower — raw data available immediately after load |
| Schema required at load? | Yes — data is structured before it lands | No — raw data lands as-is; schema applied at query time (schema-on-read) |
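The ELT ordering is easy to demonstrate end to end. The sketch below uses SQLite purely as a stand-in for a cloud warehouse (Redshift, BigQuery, or Snowflake in practice); table and column names are illustrative. Raw strings land first, then a SQL transform runs in-place inside the "warehouse":

```python
import sqlite3

# SQLite stands in for the warehouse; in ELT the raw data lands before
# any cleaning happens.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")

# Load: raw values land as-is — the amount column is still a string.
con.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [("o-1", "$12.50"), ("o-2", "$3.00")])

# Transform: runs after load, in SQL, inside the destination.
con.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id,
           CAST(REPLACE(amount, '$', '') AS REAL) AS amount
    FROM raw_orders
""")

total = con.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
```

In a real ELT stack the `CREATE TABLE ... AS SELECT` step would typically be a dbt model or a stored procedure scheduled inside the warehouse.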

Full End-to-End Example

A retail company ingests daily order exports from MySQL into Redshift for BI dashboards.

MySQL (OLTP)
    │  Extract: JDBC read — orders from last 24h
    ▼
AWS Glue ETL Job (Apache Spark)
    │  Transform:
    │    - cast amount string  → DECIMAL
    │    - drop NULL customer_id rows
    │    - normalise country codes
    │    - deduplicate by order_id
    ▼
S3 staging bucket (Parquet)
    │  Load: COPY command
    ▼
Amazon Redshift — fact_orders table
    │
    ▼
Amazon QuickSight dashboard
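The three phases of this pipeline can be condensed into one runnable toy. SQLite stands in for both the MySQL source and the Redshift destination, and an in-memory list stands in for the S3 staging bucket; all names and sample rows are illustrative:

```python
import sqlite3

# Stand-in source system (MySQL in the real pipeline).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("o-1", "c-1", "$12.50"),
    ("o-1", "c-1", "$12.50"),   # replication-lag duplicate
    ("o-2", None,  "$3.00"),    # orphaned row, dropped in Transform
])

# Extract: read raw rows without modifying the source.
rows = src.execute("SELECT order_id, customer_id, amount FROM orders").fetchall()

# Transform: drop NULL customer_id, dedupe by order_id, cast amount.
staging, seen = [], set()
for order_id, customer_id, amount in rows:
    if customer_id is None or order_id in seen:
        continue
    seen.add(order_id)
    staging.append((order_id, customer_id, float(amount.lstrip("$"))))

# Load: write the cleaned rows to the destination (Redshift in practice).
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE fact_orders (order_id TEXT, customer_id TEXT, amount REAL)")
dst.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", staging)
count = dst.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```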

Quick Reference

| Term | Meaning |
| --- | --- |
| ETL | Extract → Transform → Load; transform happens before load |
| ELT | Extract → Load → Transform; transform happens after load, inside the warehouse |
| Data cleansing | Removing or correcting bad, missing, or inconsistent data during Transform |
| Staging area | Intermediate storage (often S3) between extraction and final load |
| Schema-on-write | ETL pattern — data is structured before it lands in the destination |
| Schema-on-read | ELT pattern — raw data lands first; structure is applied at query time |
| Idempotency | A well-designed ETL job produces the same result if re-run on the same data |
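Idempotency is usually achieved with a delete-then-insert (or merge) pattern: the job first removes the partition it is about to reload, so re-running it leaves the table in the same state. A minimal sketch, again with SQLite standing in for the warehouse and hypothetical table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_orders (order_id TEXT PRIMARY KEY, amount REAL, load_date TEXT)"
)

def load_day(rows, day):
    """Idempotent load: wipe the day's partition, then insert it fresh."""
    with con:  # one transaction — delete and insert succeed or fail together
        con.execute("DELETE FROM fact_orders WHERE load_date = ?", (day,))
        con.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?)",
            [(order_id, amount, day) for order_id, amount in rows],
        )

batch = [("o-1", 12.5), ("o-2", 3.0)]
load_day(batch, "2024-01-01")
load_day(batch, "2024-01-01")  # re-run on the same data: no duplicates
n = con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```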

AWS Implementation Options

| Pattern | AWS Services |
| --- | --- |
| Batch ETL (Spark-based) | AWS Glue ETL Jobs |
| Streaming ETL | AWS Glue Streaming, Amazon Kinesis Data Firehose |
| Lightweight / event-driven transform | AWS Lambda + S3 event triggers |
| Orchestration | AWS Step Functions, Amazon MWAA (Apache Airflow) |
| ELT inside warehouse | dbt on Amazon Redshift, Redshift stored procedures |
| No-code ETL | AWS Glue Studio, AWS Glue DataBrew |
| Managed data ingestion (SaaS sources) | Amazon AppFlow |

Built by Fred Cheung @CookedRicer · Powered by Fumadocs & Github Copilot
