Xoxoftware - XOXO Creative Studio | Web & Mobile App Development | Fred Cheung | Hong Kong

AWS Step Functions

Serverless orchestration — visual workflows that coordinate AWS services using state machines defined in Amazon States Language (ASL).

Overview

AWS Step Functions is a serverless orchestration service. Each workflow is modeled as a state machine written in Amazon States Language (ASL); its steps call AWS services and run with built-in error handling, retries, and parallel branching.


Core Concepts

| Concept | Description |
| --- | --- |
| State Machine | A workflow definition in ASL that describes a sequence of states and transitions |
| State | A single step in the workflow (Task, Choice, Wait, Parallel, Map, Pass, Succeed, Fail) |
| Task State | Invokes a service (Lambda, ECS, DynamoDB, SNS, SQS, Glue, etc.) |
| Choice State | Branching logic: routes execution based on input conditions |
| Parallel State | Executes multiple branches concurrently and waits for all to complete |
| Map State | Iterates over an array, processing each element (inline or distributed mode) |
| Wait State | Pauses execution for a specified duration or until a timestamp |
| Execution | A single run of a state machine with its own input, history, and status |
| ASL | Amazon States Language: JSON-based language that defines state machine structure |

How Step Functions Works

Start → Task (Lambda) → Choice
                          ├── Condition A → Task (DynamoDB Put) → Succeed
                          ├── Condition B → Parallel
                          │                  ├── Branch 1 (SNS Notify)
                          │                  └── Branch 2 (SQS Send)
                          │                  → Wait (30s) → Task (ECS Run) → Succeed
                          └── Default       → Fail
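The flow above could be written in ASL roughly as follows. This is a minimal sketch, built as a Python dict so it can be serialized with json.dumps. The state names, the "$.route" input field, and every function, table, topic, queue, cluster, and account identifier are placeholders assumed for illustration. The small helper only checks that each top-level transition points at a state that exists; it is not a full ASL validator.

```python
import json

# Hypothetical placeholders throughout: state names, "$.route", and all
# function/table/topic/queue/cluster identifiers are assumed for illustration.
definition = {
    "StartAt": "Classify",
    "States": {
        "Classify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "classify-order", "Payload.$": "$"},
            "OutputPath": "$.Payload",
            "Next": "Route",
        },
        "Route": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.route", "StringEquals": "A", "Next": "SaveItem"},
                {"Variable": "$.route", "StringEquals": "B", "Next": "FanOut"},
            ],
            "Default": "Rejected",
        },
        "SaveItem": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {"TableName": "orders", "Item": {"id": {"S.$": "$.id"}}},
            "Next": "Done",
        },
        "FanOut": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Notify", "States": {"Notify": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::sns:publish",
                    "Parameters": {
                        "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders",
                        "Message": "order routed",
                    },
                    "End": True}}},
                {"StartAt": "Enqueue", "States": {"Enqueue": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::sqs:sendMessage",
                    "Parameters": {
                        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/orders",
                        "MessageBody": "order routed",
                    },
                    "End": True}}},
            ],
            "Next": "Pause",
        },
        "Pause": {"Type": "Wait", "Seconds": 30, "Next": "RunJob"},
        "RunJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "example-cluster",
                "TaskDefinition": "example-task",
            },
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "UnknownRoute", "Cause": "No condition matched"},
    },
}


def transition_targets(state):
    """Collect every state name a single state can transition to."""
    targets = [state[key] for key in ("Next", "Default") if key in state]
    targets += [rule["Next"] for rule in state.get("Choices", [])]
    targets += [rule["Next"] for rule in state.get("Catch", [])]
    return targets


def dangling_targets(machine):
    """Return transition targets that are not defined as top-level states."""
    names = set(machine["States"])
    found = {t for s in machine["States"].values() for t in transition_targets(s)}
    return sorted(found - names)


asl_json = json.dumps(definition, indent=2)  # the text a deployment would upload
```

Calling dangling_targets(definition) before deploying is a cheap way to catch a misspelled Next value.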

Standard vs Express Workflows

| Feature | Standard | Express |
| --- | --- | --- |
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Pricing | Per state transition | Per execution + duration + memory |
| Execution history | Full history in Step Functions console | CloudWatch Logs only |
| Throughput | Over 2,000 execution starts/s (quota can be raised) | 100,000+ execution starts/s |
| Best for | Long-running, auditable workflows | High-volume, short-duration event processing |
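The two hard constraints in the table (maximum duration and execution model) can be turned into a small decision helper. This is a sketch under the assumption that those two factors decide the choice; cost and throughput are deliberately left out, and the function name is hypothetical. The returned strings match the values accepted by the type parameter of the CreateStateMachine API.

```python
EXPRESS_MAX_SECONDS = 5 * 60               # Express limit from the table
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60  # Standard limit from the table (1 year)


def pick_workflow_type(max_duration_seconds: float, needs_exactly_once: bool) -> str:
    """Pick STANDARD or EXPRESS from duration and execution-model needs only."""
    if max_duration_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("longer than a Standard workflow can run")
    if needs_exactly_once or max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"
```

For example, a 30-second event handler that tolerates duplicate runs maps to EXPRESS, while a one-hour approval flow maps to STANDARD.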

State Types

| State | Purpose | Example |
| --- | --- | --- |
| Task | Call an AWS service or activity | Invoke Lambda, run ECS task, query DynamoDB |
| Choice | Conditional branching | Route based on order amount or status |
| Parallel | Run branches concurrently | Send notification + update database simultaneously |
| Map | Iterate over a collection | Process each item in an S3 manifest |
| Wait | Delay execution | Wait 24 hours before sending reminder |
| Pass | Pass input to output with optional transformation | Inject static values or reshape data |
| Succeed | Terminal success state | End workflow successfully |
| Fail | Terminal failure state with error and cause | End with descriptive error message |
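To make the table concrete, here is a toy interpreter that walks a state machine made only of Choice, Pass, Succeed, and Fail states. It is a simplified sketch for reasoning about how input flows between states, not the real Step Functions engine: it supports a single comparison operator (NumericGreaterThan) on top-level fields, and the "order amount" machine and its state names are invented for illustration.

```python
def run(machine, data):
    """Toy walk through Choice, Pass, Succeed and Fail states only."""
    name = machine["StartAt"]
    while True:
        state = machine["States"][name]
        kind = state["Type"]
        if kind == "Succeed":
            return "SUCCEEDED", data
        if kind == "Fail":
            return "FAILED", {"Error": state.get("Error"), "Cause": state.get("Cause")}
        if kind == "Pass":
            # With no ResultPath, a Pass state's Result replaces the input.
            data = state.get("Result", data)
            if state.get("End"):
                return "SUCCEEDED", data
            name = state["Next"]
        elif kind == "Choice":
            # This sketch requires a Default; the first matching rule wins.
            target = state["Default"]
            for rule in state["Choices"]:
                field = rule["Variable"][2:]  # strip the leading "$."
                if data[field] > rule["NumericGreaterThan"]:
                    target = rule["Next"]
                    break
            name = target
        else:
            raise NotImplementedError(kind)


# Hypothetical machine: route an order by amount.
order_machine = {
    "StartAt": "CheckAmount",
    "States": {
        "CheckAmount": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "FlagForReview"}
            ],
            "Default": "Approve",
        },
        "FlagForReview": {"Type": "Pass", "Result": {"status": "review"}, "Next": "Done"},
        "Approve": {"Type": "Pass", "Result": {"status": "approved"}, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}
```

Running run(order_machine, {"amount": 2500}) follows the FlagForReview branch, while an amount of 10 falls through to the Default branch.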

Distributed Map

S3 Bucket (millions of objects)
    → Distributed Map State
        → 10,000 concurrent child executions
            → Each child: Lambda (process one object)
        → Collect results → Next State

  • Processes large-scale datasets with up to 10,000 parallel child executions
  • Reads items from S3 (CSV, JSON, S3 inventory) or a JSON array
  • Each child execution is a separate Standard or Express workflow
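A Distributed Map state for the S3 flow above might look like the following sketch. The bucket, prefix, function name, next-state name, and concurrency value are placeholders assumed for illustration. The helper checks only the two properties this section describes: distributed mode and the 10,000 concurrency ceiling.

```python
import json

# Hypothetical values: bucket, prefix, function name, and "CollectResults".
distributed_map = {
    "Type": "Map",
    "ItemReader": {
        # List the objects in a bucket; each object becomes one item.
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "example-input-bucket", "Prefix": "incoming/"},
    },
    "MaxConcurrency": 1000,
    "ItemProcessor": {
        # DISTRIBUTED mode runs each item as a child workflow execution.
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessObject",
        "States": {
            "ProcessObject": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "process-object", "Payload.$": "$"},
                "End": True,
            }
        },
    },
    "Next": "CollectResults",
}


def check_distributed_map(state):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if state.get("Type") != "Map":
        problems.append("Type must be Map")
    config = state.get("ItemProcessor", {}).get("ProcessorConfig", {})
    if config.get("Mode") != "DISTRIBUTED":
        problems.append("ProcessorConfig.Mode must be DISTRIBUTED")
    if config.get("ExecutionType") not in ("STANDARD", "EXPRESS"):
        problems.append("ExecutionType must be STANDARD or EXPRESS")
    if not 0 <= state.get("MaxConcurrency", 0) <= 10_000:
        problems.append("MaxConcurrency must be between 0 and 10,000")
    return problems


state_json = json.dumps(distributed_map, indent=2)
```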

Error Handling

| Mechanism | Description |
| --- | --- |
| Retry | Automatic retry with configurable interval, back-off rate, and max attempts |
| Catch | Route to a fallback state when a matching error is not retried or when all retries are exhausted |
| Timeout | TimeoutSeconds on a Task state to prevent indefinite hangs |
| Heartbeat | HeartbeatSeconds: the task must send a heartbeat within each interval or it fails with States.HeartbeatTimeout |

Built-in error names include the wildcard States.ALL, plus States.Timeout, States.TaskFailed, and States.Permissions.
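The mechanisms above combine on a single Task state. The sketch below shows one possible Retry and Catch configuration and computes the wait before each retry as IntervalSeconds multiplied by BackoffRate raised to the retry index. The function name, the state names, and the chosen numbers are assumptions for illustration; the helper ignores the optional MaxDelaySeconds and JitterStrategy fields.

```python
# Hypothetical policy values chosen for illustration.
retry_policy = {
    "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}

# A Task state using the policy; "charge-card", "NotifyFailure" and
# "ShipOrder" are placeholder names.
charge_card = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "charge-card", "Payload.$": "$"},
    "TimeoutSeconds": 30,
    "Retry": [retry_policy],
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "Next": "ShipOrder",
}


def retry_delays(policy):
    """Seconds waited before each retry, using the ASL default values."""
    interval = policy.get("IntervalSeconds", 1)
    attempts = policy.get("MaxAttempts", 3)
    rate = policy.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]
```

With this policy the task is retried after 2, 4, and 8 seconds; if the third retry also fails, the Catch rule routes the execution to the fallback state.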


Service Integrations

| Integration Type | Behaviour |
| --- | --- |
| Request-Response | Call service, get response, move to next state immediately |
| Run a Job (.sync) | Call service, wait for job to complete, then move to next state |
| Wait for Callback | Send a task token to a service; pause until the token is returned |

Task states can call more than 220 AWS services natively, including Lambda, ECS, Glue, EMR, DynamoDB, SQS, SNS, Batch, and CodeBuild.
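The integration pattern is selected by a suffix on the Task state's Resource ARN. As a rough sketch, the hypothetical helper below maps an ARN to the pattern names used in the table above; the example ARNs follow the arn:aws:states:::service:action form.

```python
def integration_pattern(resource_arn: str) -> str:
    """Classify a Task Resource ARN by its integration-pattern suffix."""
    if resource_arn.endswith(".waitForTaskToken"):
        return "Wait for Callback"
    # Some integrations use a versioned suffix such as ".sync:2".
    if resource_arn.endswith(".sync") or resource_arn.endswith(".sync:2"):
        return "Run a Job (.sync)"
    return "Request-Response"


examples = {
    "arn:aws:states:::lambda:invoke": integration_pattern("arn:aws:states:::lambda:invoke"),
    "arn:aws:states:::ecs:runTask.sync": integration_pattern("arn:aws:states:::ecs:runTask.sync"),
    "arn:aws:states:::sqs:sendMessage.waitForTaskToken": integration_pattern(
        "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
    ),
}
```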


Common Use Cases

  • Order processing — Orchestrate payment validation, inventory check, shipping label creation, and notification as sequential/parallel steps.
  • ETL pipeline coordination — Run Glue crawlers, Glue jobs, and Athena queries in sequence with error handling and retries.
  • Human approval workflows — Pause execution with a task token; resume when an approver responds via API Gateway + Lambda.
  • Large-scale data processing — Distributed Map state to process millions of S3 objects in parallel with up to 10,000 concurrent executions.
  • Microservice orchestration — Central coordinator that calls multiple services and handles failures, replacing complex application-level retry logic.
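For the human approval case in the list above, the Lambda function behind API Gateway has to hand the task token back to Step Functions. A minimal sketch follows, assuming the token reaches the handler somehow (for example in a signed link) and that sfn_client is a boto3 Step Functions client created with boto3.client("stepfunctions"). The function name, the output payload shape, and the "ApprovalRejected" error name are invented for illustration.

```python
import json


def resume_execution(sfn_client, task_token: str, approved: bool, reason: str = ""):
    """Resume a paused .waitForTaskToken task with the approver's decision."""
    if approved:
        # The output must be a JSON string; it becomes the task's result.
        return sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True}),
        )
    # A failure can be handled by Retry/Catch rules on the waiting state.
    return sfn_client.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",
        cause=reason or "Rejected by approver",
    )
```

Passing the client in as an argument keeps the function easy to exercise with a stub in tests, without AWS credentials.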

SAA/SAP Exam Tips

SAA Tip: "Orchestrate multiple AWS services" or "coordinate Lambda functions with error handling" → Step Functions. It is the default answer for workflow orchestration.

Exam Trap: Standard workflows last up to 1 year but cost per state transition. Express workflows are cheaper for high-volume, short-lived workflows (≤ 5 minutes).

SAP Tip: "Wait for human approval" → Step Functions with a task token callback pattern (.waitForTaskToken). The execution pauses until the token is returned.


Cross-Cloud Equivalents

| Provider | Service / Solution | Notes |
| --- | --- | --- |
| AWS | AWS Step Functions | Baseline |
| Azure | Azure Durable Functions / Azure Logic Apps | Logic Apps for low-code; Durable Functions for code-first |
| GCP | Google Cloud Workflows | YAML- or JSON-based; fewer native integrations |
| On-Premises | Apache Airflow, Temporal, Camunda | Open-source workflow engines |

Pricing Model

| Dimension | Unit | Notes |
| --- | --- | --- |
| Standard state transitions | Per 1,000 transitions | First 4,000 transitions/month free |
| Express executions | Per million executions | Plus a duration charge per GB-second |
| Express duration | Per GB-second, billed in 100 ms increments | Based on memory consumed by the execution, metered in 64 MB chunks |
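The pricing dimensions above translate into simple arithmetic. The sketch below is an estimate under stated assumptions: all rates are passed in by the caller (the ones in the usage line are placeholders, not published AWS prices), Express duration is rounded up to the next 100 ms per execution, and memory is given directly in GB. The function names are hypothetical.

```python
import math

FREE_STANDARD_TRANSITIONS = 4_000  # monthly free tier from the table


def standard_monthly_cost(transitions: int, price_per_1000: float) -> float:
    """Cost of Standard workflows: billable transitions times the rate."""
    billable = max(0, transitions - FREE_STANDARD_TRANSITIONS)
    return billable / 1000 * price_per_1000


def express_monthly_cost(
    executions: int,
    avg_duration_ms: float,
    memory_gb: float,
    price_per_million: float,
    price_per_gb_second: float,
) -> float:
    """Cost of Express workflows: request charge plus duration charge."""
    billed_ms = math.ceil(avg_duration_ms / 100) * 100  # 100 ms increments
    gb_seconds = executions * (billed_ms / 1000) * memory_gb
    request_charge = executions / 1_000_000 * price_per_million
    return request_charge + gb_seconds * price_per_gb_second


# Placeholder rates for illustration only; check the current price list.
example_standard = standard_monthly_cost(104_000, price_per_1000=0.025)
```

Comparing the two functions for the same workload shows why short, high-volume workflows usually favor Express: Standard cost grows with the number of states per execution, while Express cost grows with duration and memory.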

Built by Fred Cheung @CookedRicer · Powered by Fumadocs & GitHub Copilot
