Amazon Kinesis and Managed Flink
AWS streaming data services — Amazon Kinesis (Data Streams, Firehose, Video Streams) and Amazon Managed Service for Apache Flink. Covers real-time ingestion, processing patterns, and cross-cloud equivalents.
Overview
Amazon Kinesis and Amazon Managed Service for Apache Flink are AWS's real-time data streaming services — Kinesis for ingesting and delivering high-throughput event streams, and Managed Flink for stateful, complex computations on those streams.
Before diving in: understand the two-stage model that all streaming architectures follow:
Data Producers            Ingest Layer          Process / Deliver Layer
(apps, devices, logs) ──────▶ [ Kinesis ] ──────────▶ [ Flink / Lambda / Redshift ]
- Ingest — collect raw events from many sources at high speed (Kinesis's job)
- Process / Deliver — transform, aggregate, filter, or route events to a destination (Flink, Lambda, or Firehose's job)
The Kinesis Family
Kinesis is not one service — it is a family of four related services:
| Service | What It Does | Think of It As |
|---|---|---|
| Kinesis Data Streams (KDS) | Low-latency, ordered stream storage; consumers read at their own pace | A durable, replayable event log |
| Kinesis Data Firehose | Fully managed delivery pipeline; loads stream data into S3, Redshift, OpenSearch | A no-code pipe from stream to destination |
| Kinesis Video Streams | Ingest, store, and process binary video/audio streams | Kinesis for video (cameras, IoT) |
| Amazon Managed Service for Apache Flink | Run Apache Flink applications on AWS (formerly Kinesis Data Analytics) | Stateful real-time computation engine |
Kinesis Data Streams (KDS)
KDS is the core streaming buffer — a durable, ordered, replayable log of records. Producers push events in; multiple consumers read from it independently.
Key Concepts
| Concept | Description |
|---|---|
| Shard | The unit of capacity. Each shard handles 1 MB/s in, 2 MB/s out, up to 1,000 records/s in |
| Partition Key | Determines which shard a record goes to; records with the same key always go to the same shard (ordering guarantee) |
| Sequence Number | Unique ID assigned to each record within a shard |
| Retention Period | Default 24 hours; configurable up to 365 days (extended retention costs extra) |
| Consumer | Any application reading from the stream; each maintains its own position (iterator) |
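The partition-key row above can be made concrete. Kinesis hashes the partition key with MD5 to a 128-bit integer and routes the record to whichever shard owns that hash-key range. A minimal sketch of that mapping, assuming shards evenly split the 128-bit hash key space (the default after a clean split) — not the real service internals:

```python
import hashlib

def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming each shard owns
    an equal slice of the 128-bit MD5 hash key space."""
    # Kinesis hashes the partition key with MD5 to a 128-bit integer
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    range_size = 2 ** 128 // shard_count
    return min(hash_key // range_size, shard_count - 1)

# The same key always maps to the same shard -- the per-key ordering guarantee
assert shard_for_key("user-42", 10) == shard_for_key("user-42", 10)
```

This is why a skewed partition key (e.g. one hot customer ID) creates a "hot shard": every record with that key lands on the same shard regardless of how many shards exist.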
How Capacity Works
Total stream throughput = Number of Shards × (1 MB/s in / 2 MB/s out)
Example: 10 shards = 10 MB/s ingestion, 20 MB/s consumption
- On-Demand mode: AWS automatically scales shards based on traffic (simpler, higher cost)
- Provisioned mode: shard count is set manually (more control, lower cost)
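For provisioned mode, shard sizing is simple arithmetic over the three per-shard limits from the table above (1 MB/s in, 2 MB/s out, 1,000 records/s in) — take the maximum requirement. A small illustrative helper (the function name is mine, not an AWS API):

```python
import math

def shards_needed(ingest_mb_s, egress_mb_s, records_per_s):
    """Provisioned-mode sizing: each shard provides 1 MB/s in,
    2 MB/s out, and 1,000 records/s in; the binding constraint
    is whichever limit demands the most shards."""
    return max(
        math.ceil(ingest_mb_s / 1.0),
        math.ceil(egress_mb_s / 2.0),
        math.ceil(records_per_s / 1000),
    )

# 10 MB/s in, 20 MB/s out, 8,000 records/s -> 10 shards
print(shards_needed(10, 20, 8000))
```

Note the egress figure assumes consumers share the 2 MB/s per shard; with enhanced fan-out each registered consumer gets its own 2 MB/s, which changes the calculation.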
Exam Trap: With KDS provisioned mode, if shard capacity is exceeded, producers receive ProvisionedThroughputExceededException. The fix is to increase shard count (resharding) or switch to On-Demand mode.
SAA/SAP Tip: Multiple independent consumers can read the same shard simultaneously without interfering — each tracks its own position. This is fundamentally different from SQS (Simple Queue Service) where a message is consumed once and deleted.
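The multi-consumer difference from SQS is easiest to see in a toy model (this is a sketch of the concept, not the Kinesis API): records persist in the log, and each consumer tracks its own position independently.

```python
# Toy replayable log: records persist for the retention period;
# each consumer keeps its own iterator position.
stream = ["evt-1", "evt-2", "evt-3"]

class Consumer:
    def __init__(self):
        self.position = 0  # this consumer's own position, nobody else's

    def poll(self):
        batch = stream[self.position:]
        self.position = len(stream)
        return batch

analytics, audit = Consumer(), Consumer()
assert analytics.poll() == ["evt-1", "evt-2", "evt-3"]
# A second consumer reads the SAME records -- nothing was deleted
assert audit.poll() == ["evt-1", "evt-2", "evt-3"]
```

In an SQS model, the first `poll` would delete the messages and the second consumer would see nothing — that is the core pattern difference the exam tests.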
KDS vs. SQS — When to Use Which
| | Kinesis Data Streams | Amazon SQS |
|---|---|---|
| Ordering | Per-shard ordering guaranteed | Standard queue: no order; FIFO queue: per-group order |
| Replay | Yes — rewind and re-read past records | No — messages deleted after consumption |
| Multiple consumers | Yes — all consumers read independently | No — each message consumed by one consumer |
| Latency | ~200 ms (shared throughput); ~70 ms (enhanced fan-out) | Seconds |
| Use when | Multiple apps need same event; replay needed; time-series ordering matters | Task queues; decoupling; each message processed once |
Kinesis Data Firehose
Firehose is a fully managed, zero-code delivery pipeline — it reads from a stream (or accepts records directly) and delivers them to a destination, with optional transformation along the way.
Destinations
| Destination | Notes |
|---|---|
| Amazon S3 | Most common; can batch, compress (GZIP/Snappy), and convert format (to Parquet/ORC) before landing |
| Amazon Redshift | Loads via S3 staging → COPY command automatically |
| Amazon OpenSearch Service | For log analytics and search |
| HTTP endpoint / Splunk / Datadog | Third-party SIEM and observability tools |
Key Properties
- No consumer code needed — fully managed; configure a destination, and Firehose handles the rest
- Buffer — batches records before delivery (configurable: 1–900 seconds or 1–128 MB, whichever comes first)
- Lambda transformation — optionally invoke a Lambda function to transform each batch before delivery
- Not real-time — minimum ~60 second delivery latency due to buffering (this is by design)
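The buffering behavior above — flush when either the size or the time threshold is hit, whichever comes first — is the reason Firehose is near-real-time. A toy model of that logic (not Firehose internals; thresholds here are illustrative):

```python
import time

class Buffer:
    """Toy model of Firehose buffering: accumulate records, then
    flush when EITHER the size or the time threshold is reached."""

    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300):
        self.records, self.size = [], 0
        self.max_bytes, self.max_seconds = max_bytes, max_seconds
        self.opened = time.monotonic()

    def add(self, record):
        self.records.append(record)
        self.size += len(record)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.opened >= self.max_seconds):
            batch, self.records, self.size = self.records, [], 0
            self.opened = time.monotonic()
            return batch  # this batch is delivered to the destination
        return None  # still buffering -- this is the source of the latency
```

Even with the smallest thresholds, a record can sit in the buffer until a flush condition fires — hence the "minimum ~60 second" caveat above.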
Exam Trap: Firehose is near-real-time, not real-time. If a scenario requires sub-second processing or complex stateful computation, use KDS + Lambda or Managed Flink. If the goal is just reliably landing data in S3/Redshift with minimal code, Firehose is correct.
Amazon Managed Service for Apache Flink
What is Apache Flink?
Apache Flink is an open-source, distributed stream processing framework. It excels at stateful, complex, real-time computations on event streams — aggregations over time windows, anomaly detection, joining two streams together, and more.
AWS offers it as a fully managed service called Amazon Managed Service for Apache Flink (previously named Kinesis Data Analytics for Apache Flink).
What "Stateful" Means
Most stream operations are stateless (filter a record, transform a field). Stateful means the computation needs to remember past events:
- "How many orders in the last 5 minutes?" → requires a time window (state = running count)
- "Alert if the same card is used in 3 different countries within 1 hour" → requires keyed state per card
- "Join a click event stream with a user profile stream" → requires stream-to-stream join
Flink manages this state durably — even if the application restarts, it picks up where it left off.
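The second bullet — same card, 3 countries, 1 hour — shows what keyed state looks like in practice. A minimal Python sketch of the idea (Flink would manage this state durably and per-key automatically; here it is an in-memory dict, and all names are illustrative):

```python
from collections import defaultdict

WINDOW = 3600  # one hour, in seconds

# Keyed state: per card, the (timestamp, country) events inside the window
state = defaultdict(list)

def on_event(card, country, ts):
    """Return True if this card has now been seen in 3+ distinct
    countries within the last hour."""
    state[card].append((ts, country))
    # Expire events that fell out of the 1-hour window
    state[card] = [(t, c) for t, c in state[card] if ts - t < WINDOW]
    return len({c for _, c in state[card]}) >= 3
```

The point is that `state` must survive restarts for the alert to be reliable — which is exactly what Flink's checkpointing (below) provides, and what a stateless Lambda cannot do without an external store.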
Key Concepts
| Concept | Description |
|---|---|
| Source | Where data comes in (Kinesis Data Streams, MSK/Kafka, S3) |
| Operator | A transformation step (map, filter, aggregate, join) |
| Sink | Where results go (S3, Redshift, DynamoDB, OpenSearch, Kinesis) |
| Window | A logical time slice for aggregations — Tumbling (fixed, non-overlapping), Sliding (overlapping), Session (activity-based) |
| Checkpoint | Periodic snapshot of application state; enables fault tolerance and exactly-once processing |
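The three window types in the table differ only in how an event's timestamp is assigned to window boundaries. A small sketch of tumbling vs. sliding assignment (illustrative helper functions, not Flink's API; times are integer seconds):

```python
def tumbling_window(event_time, size):
    """Tumbling: fixed, non-overlapping -- each event falls in exactly one window."""
    start = event_time - (event_time % size)
    return (start, start + size)

def sliding_windows(event_time, size, slide):
    """Sliding: overlapping -- each event belongs to size/slide windows."""
    last_start = event_time - (event_time % slide)
    starts = range(last_start - size + slide, last_start + 1, slide)
    return [(s, s + size) for s in starts if s <= event_time < s + size]

# An event at t=65 falls in one 60s tumbling window but three
# 60s-wide windows sliding every 20s
print(tumbling_window(65, 60))        # (60, 120)
print(sliding_windows(65, 60, 20))    # [(20, 80), (40, 100), (60, 120)]
```

Session windows are the odd one out: their boundaries depend on gaps in the data (close after N seconds of inactivity per key), so they cannot be computed from a single timestamp.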
Example: Tumbling Window Aggregation
```sql
-- Count orders per product every 1 minute (Flink SQL API)
SELECT
  product_id,
  COUNT(*) AS order_count,
  TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM orders
GROUP BY product_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```
When to Use Flink vs. Lambda
| | Lambda on KDS | Managed Flink |
|---|---|---|
| Stateless transforms | Yes — simple and cheap | Overkill |
| Stateful aggregations (windowed counts, running totals) | No — requires external state store | Yes — built-in state management |
| Stream-to-stream joins | No — very complex | Yes — native |
| Exactly-once processing | No — at-least-once by default | Yes — built-in checkpointing |
| Complex event detection (patterns over time) | No — very complex | Yes — Flink CEP library |
The Full Analytics Pipeline — How These Services Fit Together
[Producers]          [Ingest]            [Process]             [Store / Serve]
IoT sensors     ──▶  Kinesis Data   ──▶  Managed Flink    ──▶  DynamoDB (live state)
Mobile apps          Streams             (stateful              S3 (data lake)
Web clickstream                          real-time)             Redshift (warehouse)
Log events      ──▶  Kinesis Data   ──▶  ──────────────   ──▶  S3 / Redshift /
                     Firehose            (no-code pipe)         OpenSearch
SAA/SAP Tip: The exam often presents a multi-stage analytics scenario. Map each stage: ingestion (Kinesis), real-time processing (Flink or Lambda), storage (S3/Redshift), querying (Athena/Redshift). Identify which stage the question is testing.
SAA/SAP Exam Tips
SAA Tip: "Real-time clickstream analytics" or "process events as they arrive" → Kinesis Data Streams for ingestion + Lambda or Managed Flink for processing.
SAA/SAP Tip: "Load streaming data into S3/Redshift with no custom code" → Kinesis Data Firehose.
Exam Trap: Kinesis Data Streams and SQS are both queuing/streaming services but serve different patterns. KDS = ordered, replayable, multi-consumer, time-series. SQS = task queue, message consumed once, decoupling.
SAP Tip: "Detect fraud in real-time by correlating events across a 10-minute window" → stateful stream processing → Amazon Managed Service for Apache Flink.
Cross-Cloud Equivalents
Kinesis Data Streams
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Kinesis Data Streams | Baseline |
| Azure | Azure Event Hubs | Similar partitioned log model; Kafka-compatible API option |
| GCP | Google Cloud Pub/Sub | Message-based push/pull; less partition-explicit than KDS |
| On-Premises | Apache Kafka | Industry standard; KDS is conceptually similar but proprietary |
Kinesis Data Firehose
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Data Firehose | Baseline |
| Azure | Azure Stream Analytics (output connectors) | More processing-capable; not purely a delivery pipe |
| GCP | Dataflow with GCS/BigQuery sink | More powerful but heavier; no direct Firehose equivalent |
| On-Premises | Apache Kafka Connect | Sink connectors deliver Kafka topics to external systems |
Apache Flink / Managed Flink
| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Managed Service for Apache Flink | Baseline |
| Azure | Azure Stream Analytics | Proprietary SQL-based; less powerful for complex stateful jobs |
| GCP | Google Cloud Dataflow | Built on Apache Beam; supports Flink runner; strong serverless option |
| On-Premises | Apache Flink (self-managed) | Same technology; run on YARN/Kubernetes |
Pricing Model
Kinesis Data Streams:
- On-Demand: per GB ingested + per GB retrieved
- Provisioned: per shard-hour + per PUT payload unit
Kinesis Data Firehose:
- Per GB of data ingested; additional charge for format conversion (Parquet/ORC)
Managed Flink:
- Per Kinesis Processing Unit (KPU) per hour (1 KPU = 1 vCPU + 4 GB RAM)
Related Services / See Also
- Amazon Redshift — common Firehose and Flink destination for analytical warehousing
- Amazon Athena — query the S3 data that Firehose lands
- Amazon SQS and SNS — message queuing / fan-out; different pattern from streaming
- Amazon MSK — managed Apache Kafka; alternative ingestion layer to KDS
- AWS Glue — serverless ETL and data cataloging service (Data Catalog, Crawlers, ETL jobs, DataBrew); catalogs and transforms the data lake that Firehose feeds