AWS Step Functions
Serverless orchestration — visual workflows that coordinate AWS services using state machines defined in Amazon States Language (ASL).
Overview
AWS Step Functions is a serverless orchestration service that coordinates multiple AWS services into visual workflows. State machines are defined in Amazon States Language (ASL) and execute as a series of steps with built-in error handling, retries, and parallel branching.
Core Concepts
| Concept | Description |
|---|---|
| State Machine | A workflow definition in ASL that describes a sequence of states and transitions |
| State | A single step in the workflow (Task, Choice, Wait, Parallel, Map, Pass, Succeed, Fail) |
| Task State | Invokes a service (Lambda, ECS, DynamoDB, SNS, SQS, Glue, etc.) |
| Choice State | Branching logic — routes execution based on input conditions |
| Parallel State | Executes multiple branches concurrently and waits for all to complete |
| Map State | Iterates over an array, processing each element (inline or distributed mode) |
| Wait State | Pauses execution for a specified duration or until a timestamp |
| Execution | A single run of a state machine with its own input, history, and status |
| ASL | Amazon States Language — JSON-based language that defines state machine structure |
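To make the concepts above concrete, here is a minimal sketch of a state machine written as a Python dictionary and serialized to ASL JSON. The state names, the Lambda function ARN, and the `validate_transitions` helper are hypothetical and exist only for illustration. The helper is a simplified local check, not the validation that Step Functions itself performs.

```python
import json

# Hypothetical state machine: one Task, one Choice, and two terminal states.
definition = {
    "Comment": "Minimal illustrative workflow",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            # Placeholder ARN (assumption); replace with a real function ARN.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Rejected",
        },
        "Done": {"Type": "Succeed"},
        "Rejected": {
            "Type": "Fail",
            "Error": "OrderInvalid",
            "Cause": "Validation failed",
        },
    },
}


def validate_transitions(defn):
    """Return a sorted list of transition targets that are not defined states."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return sorted(t for t in targets if t not in states)


asl_json = json.dumps(definition, indent=2)  # the ASL document as JSON text
missing = validate_transitions(definition)   # empty list when all targets exist
```

Each run of this definition with its own input would be one execution. Building the definition in code and checking transition targets locally can catch typos in state names before deployment.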
How Step Functions Works
Start → Task (Lambda) → Choice
├── Condition A → Task (DynamoDB Put) → Succeed
├── Condition B → Parallel
│ ├── Branch 1 (SNS Notify)
│ └── Branch 2 (SQS Send)
│ → Wait (30s) → Task (ECS Run) → Succeed
└── Default → Fail
Standard vs Express Workflows
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Pricing | Per state transition | Per execution + duration + memory |
| Execution history | Full history in Step Functions console | CloudWatch Logs only |
| Throughput | Over 2,000 execution starts/s (soft quota, varies by Region, can be raised) | Over 100,000 execution starts/s |
| Best for | Long-running, auditable workflows | High-volume, short-duration event processing |
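The duration and execution-model rows of the table can be turned into a rough rule of thumb. The sketch below is a simplification under stated assumptions: the `choose_workflow_type` helper is hypothetical, and it considers only expected duration and the need for exactly-once execution, ignoring cost and execution-history requirements.

```python
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60  # 1 year, from the table above
EXPRESS_MAX_SECONDS = 5 * 60               # 5 minutes, from the table above


def choose_workflow_type(expected_duration_seconds, needs_exactly_once):
    """Pick a workflow type from duration and execution-model needs only."""
    if expected_duration_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("duration exceeds the 1-year Standard limit")
    if needs_exactly_once or expected_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"
```

For example, a 30-second event handler that tolerates duplicate runs maps to `EXPRESS`, while a one-hour job or any workflow that must run exactly once maps to `STANDARD`.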
State Types
| State | Purpose | Example |
|---|---|---|
| Task | Call an AWS service or activity | Invoke Lambda, run ECS task, query DynamoDB |
| Choice | Conditional branching | Route based on order amount or status |
| Parallel | Run branches concurrently | Send notification + update database simultaneously |
| Map | Iterate over a collection | Process each item in an S3 manifest |
| Wait | Delay execution | Wait 24 hours before sending reminder |
| Pass | Pass input to output with optional transformation | Inject static values or reshape data |
| Succeed | Terminal success state | End workflow successfully |
| Fail | Terminal failure state with error and cause | End with descriptive error message |
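As an illustration of the Choice row ("route based on order amount or status"), the following sketch defines a Choice state and a toy local evaluator. The state names and the `next_state` function are hypothetical. The evaluator handles only two comparison operators and top-level `$.field` paths, so it is a teaching aid rather than a reimplementation of the service.

```python
# A Choice state with two rules and a default (state names are hypothetical).
choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "ManualReview"},
        {"Variable": "$.status", "StringEquals": "PAID", "Next": "ShipOrder"},
    ],
    "Default": "RejectOrder",
}

# Only two of the ASL comparison operators are modelled in this sketch.
COMPARATORS = {
    "NumericGreaterThan": lambda actual, expected: actual > expected,
    "StringEquals": lambda actual, expected: actual == expected,
}


def next_state(state, data):
    """Return the name of the state the first matching rule points to."""
    for rule in state["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if field not in data:
            continue  # simplification: skip rules whose field is absent
        for name, compare in COMPARATORS.items():
            if name in rule and compare(data[field], rule[name]):
                return rule["Next"]
    return state["Default"]
```

Rules are checked in order, so an order with `amount` above 1000 is routed to `ManualReview` even if its status is `PAID`.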
Distributed Map
S3 Bucket (millions of objects)
→ Distributed Map State
→ 10,000 concurrent child executions
→ Each child: Lambda (process one object)
→ Collect results → Next State
- Processes large-scale datasets with up to 10,000 parallel child executions
- Reads items from S3 (CSV, JSON, S3 inventory) or a JSON array
- Each child execution is a separate Standard or Express workflow
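The Distributed Map flow described above might look roughly like this in ASL, built here as a Python dictionary. The bucket, prefix, and function name are placeholder arguments, and `distributed_map_state` is a hypothetical helper. The 1 to 10,000 range check reflects the concurrency ceiling mentioned above and is a choice of this sketch.

```python
import json


def distributed_map_state(bucket, prefix, function_name, max_concurrency=1000):
    """Build a Map state in distributed mode that lists S3 objects as items."""
    if not 1 <= max_concurrency <= 10_000:
        raise ValueError("max_concurrency must be between 1 and 10,000")
    return {
        "Type": "Map",
        # Item source: every object under the given bucket and prefix.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        # Each item is processed by a child workflow execution.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "MaxConcurrency": max_concurrency,
        "End": True,
    }


# Placeholder names (assumptions), serialized to ASL JSON text.
state = distributed_map_state("example-bucket", "incoming/", "process-object", 10_000)
state_json = json.dumps(state, indent=2)
```

Setting `ExecutionType` to `STANDARD` instead would run each child as a Standard workflow, which may suit longer-running items.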
Error Handling
| Mechanism | Description |
|---|---|
| Retry | Automatic retry with configurable interval, back-off rate, and max attempts |
| Catch | Route to a fallback state when an error is not retried or all retries are exhausted |
| Timeout | TimeoutSeconds on a Task state to prevent indefinite hangs |
| Heartbeat | HeartbeatSeconds — task must send heartbeats or it times out |
Built-in error codes include States.ALL, States.Timeout, States.TaskFailed, and States.Permissions.
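The mechanisms in the table can be combined on a single Task state. The sketch below shows one possible configuration and computes the wait before each retry from the interval and back-off rate. The function name, state names, and `retry_delays` helper are hypothetical, and the helper ignores optional fields such as a maximum delay or jitter.

```python
# A Task state with a timeout, one retrier, and a catch-all fallback.
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "charge-card", "Payload.$": "$"},  # hypothetical function
    "TimeoutSeconds": 60,
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "Next": "NextStep",
}


def retry_delays(retrier):
    """Seconds waited before each retry: interval * rate ** attempt_index."""
    interval = retrier.get("IntervalSeconds", 1)  # ASL default
    attempts = retrier.get("MaxAttempts", 3)      # ASL default
    rate = retrier.get("BackoffRate", 2.0)        # ASL default
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(task_state["Retry"][0])  # [2.0, 4.0, 8.0]
```

With these values the task is retried after 2, 4, and 8 seconds. If the third retry also fails, the Catch rule routes execution to `NotifyFailure`.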
Service Integrations
| Integration Type | Behaviour |
|---|---|
| Request-Response | Call service, get response, move to next state immediately |
| Run a Job (.sync) | Call service, wait for job to complete, then move to next state |
| Wait for Callback | Send a task token to a service; pause until the token is returned |
Over 220 AWS services are supported natively, including Lambda, ECS, Glue, EMR, DynamoDB, SQS, SNS, Batch, and CodeBuild.
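In a Task state, the integration type is selected by a suffix on the `Resource` ARN. The sketch below lists one example ARN per type and a hypothetical `integration_pattern` helper that classifies an ARN by its suffix. The helper is an illustration of the naming convention, not part of any AWS SDK.

```python
# One example Resource ARN per integration type.
EXAMPLE_RESOURCES = {
    "arn:aws:states:::lambda:invoke": "request-response",
    "arn:aws:states:::glue:startJobRun.sync": "run-a-job",
    "arn:aws:states:::sqs:sendMessage.waitForTaskToken": "wait-for-callback",
}


def integration_pattern(resource_arn):
    """Classify a Task Resource ARN by its integration-pattern suffix."""
    if resource_arn.endswith(".waitForTaskToken"):
        return "wait-for-callback"
    if resource_arn.endswith(".sync") or resource_arn.endswith(".sync:2"):
        return "run-a-job"
    return "request-response"


patterns = {arn: integration_pattern(arn) for arn in EXAMPLE_RESOURCES}
```

An ARN with no suffix falls through to request-response, which is the default behaviour for a service integration.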
Common Use Cases
- Order processing — Orchestrate payment validation, inventory check, shipping label creation, and notification as sequential/parallel steps.
- ETL pipeline coordination — Run Glue crawlers, Glue jobs, and Athena queries in sequence with error handling and retries.
- Human approval workflows — Pause execution with a task token; resume when an approver responds via API Gateway + Lambda.
- Large-scale data processing — Distributed Map state to process millions of S3 objects in parallel with up to 10,000 concurrent executions.
- Microservice orchestration — Central coordinator that calls multiple services and handles failures, replacing complex application-level retry logic.
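The human approval use case above can be sketched as a Task state that passes its task token to a notifier, plus a toy in-memory stand-in for the pause and resume behaviour. The function name `send-approval-email`, the state names, and the `ApprovalSimulator` class are all hypothetical. The simulator only mimics the idea of resuming by token and does not call AWS.

```python
import json
import uuid

# Task state that pauses until the task token is returned.
approval_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "send-approval-email",  # hypothetical function
        "Payload": {
            "taskToken.$": "$$.Task.Token",  # token read from the context object
            "order.$": "$.order",
        },
    },
    "TimeoutSeconds": 86400,  # give up after 24 hours without a response
    "Next": "ProcessDecision",
}


class ApprovalSimulator:
    """Local stand-in: maps task tokens to paused executions."""

    def __init__(self):
        self._paused = {}

    def pause(self, execution_name):
        """Record a paused execution and return its token."""
        token = uuid.uuid4().hex
        self._paused[token] = execution_name
        return token

    def send_task_success(self, token, output):
        """Resume the execution for this token; unknown tokens raise KeyError."""
        execution_name = self._paused.pop(token)
        return execution_name, json.loads(output)
```

In a real deployment the approver's response would typically reach a Lambda function behind API Gateway, which returns the token to Step Functions. Here the simulator's `send_task_success` method plays that role locally.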
SAA/SAP Exam Tips
SAA Tip: "Orchestrate multiple AWS services" or "coordinate Lambda functions with error handling" → Step Functions. It is the default answer for workflow orchestration.
Exam Trap: Standard workflows last up to 1 year but cost per state transition. Express workflows are cheaper for high-volume, short-lived workflows (≤ 5 minutes).
SAP Tip: "Wait for human approval" → Step Functions with a task token callback pattern (.waitForTaskToken). The execution pauses until the token is returned.
Cross-Cloud Equivalents
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | AWS Step Functions | Baseline |
| Azure | Azure Durable Functions / Azure Logic Apps | Logic Apps for low-code; Durable Functions for code-first |
| GCP | Google Cloud Workflows | YAML-based; fewer native integrations |
| On-Premises | Apache Airflow, Temporal, Camunda | Open-source workflow engines |
Pricing Model
| Dimension | Unit | Notes |
|---|---|---|
| Standard state transitions | Per 1,000 transitions | First 4,000 transitions/month free |
| Express executions | Per million executions | A duration charge applies in addition |
| Express duration | Per GB-second, billed in 100 ms increments | Based on memory used by the execution, billed in 64 MB chunks |
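The pricing dimensions above reduce to simple arithmetic. The sketch below takes every rate as a parameter because actual prices vary by Region and change over time. Both function names are hypothetical, and the rounding to 100 ms and 64 MB chunks is an assumption about how Express duration is billed.

```python
import math

FREE_TRANSITIONS_PER_MONTH = 4000  # free tier from the table above


def standard_cost(transitions, price_per_1000):
    """Monthly Standard cost: transitions beyond the free tier, per 1,000."""
    billable = max(0, transitions - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * price_per_1000


def express_cost(executions, avg_duration_ms, memory_mb,
                 price_per_million, price_per_gb_second):
    """Monthly Express cost: request charge plus duration charge."""
    billed_ms = math.ceil(avg_duration_ms / 100) * 100  # round up to 100 ms
    billed_mb = math.ceil(memory_mb / 64) * 64          # round up to 64 MB
    gb_seconds = executions * (billed_ms / 1000) * (billed_mb / 1024)
    return executions / 1_000_000 * price_per_million + gb_seconds * price_per_gb_second
```

Passing the current rates for a given Region into both functions gives a quick way to compare the two workflow types for a workload. A workflow with many states per execution tends to favour Express, since Standard is charged per transition.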
Related Services / See Also
- AWS Lambda — most common Task state target in Step Functions
- Amazon EventBridge — trigger Step Functions executions from events
- Amazon SQS and SNS — messaging targets within workflows
- AWS Glue — ETL jobs commonly orchestrated by Step Functions