
Amazon Kinesis and Managed Flink

AWS streaming data services — Amazon Kinesis (Data Streams, Firehose, Video Streams) and Amazon Managed Service for Apache Flink. Covers real-time ingestion, processing patterns, and cross-cloud equivalents.

Overview

Amazon Kinesis and Amazon Managed Service for Apache Flink are AWS's real-time data streaming services — Kinesis for ingesting and delivering high-throughput event streams, and Managed Flink for stateful, complex computations on those streams.

Before diving in: understand the two-stage model that all streaming architectures follow:

Data Producers                  Ingest Layer              Process / Deliver Layer
(apps, devices, logs)  ──────▶  [ Kinesis ]  ──────────▶  [ Flink / Lambda / Redshift ]
  1. Ingest — collect raw events from many sources at high speed (Kinesis's job)
  2. Process / Deliver — transform, aggregate, filter, or route events to a destination (Flink, Lambda, or Firehose's job)

The Kinesis Family

Kinesis is not one service — it is a family of four related services:

| Service | What It Does | Think of It As |
|---|---|---|
| Kinesis Data Streams (KDS) | Low-latency, ordered stream storage; consumers read at their own pace | A durable, replayable event log |
| Kinesis Data Firehose | Fully managed delivery pipeline; loads stream data into S3, Redshift, OpenSearch | A no-code pipe from stream to destination |
| Kinesis Video Streams | Ingest, store, and process binary video/audio streams | Kinesis for video (cameras, IoT) |
| Amazon Managed Service for Apache Flink | Run Apache Flink applications on AWS (formerly Kinesis Data Analytics) | Stateful real-time computation engine |

Kinesis Data Streams (KDS)

KDS is the core streaming buffer — a durable, ordered, replayable log of records. Producers push events in; multiple consumers read from it independently.

Key Concepts

| Concept | Description |
|---|---|
| Shard | The unit of capacity. Each shard handles 1 MB/s in, 2 MB/s out, up to 1,000 records/s in |
| Partition Key | Determines which shard a record goes to; records with the same key always go to the same shard (ordering guarantee) |
| Sequence Number | Unique ID assigned to each record within a shard |
| Retention Period | Default 24 hours; configurable up to 365 days (extended retention costs extra) |
| Consumer | Any application reading from the stream; each maintains its own position (iterator) |
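The partition-key-to-shard routing can be sketched in a few lines. This is a simplified model of what the service does (MD5 the key to a 128-bit integer, then map it into evenly split shard hash-key ranges), not the exact implementation; the function name is illustrative.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index: MD5 the key to a
    128-bit integer, then scale it into one of num_shards evenly
    split hash-key ranges. Simplified model, not the service code."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2**128

# The same key always maps to the same shard, which is what gives
# Kinesis its per-key ordering guarantee.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

A hot partition key therefore concentrates traffic on one shard; spreading keys evenly is what spreads throughput evenly.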

How Capacity Works

Total stream throughput = Number of Shards × (1 MB/s in / 2 MB/s out)

Example: 10 shards = 10 MB/s ingestion, 20 MB/s consumption
  • On-Demand mode: AWS automatically scales shards based on traffic (simpler, higher cost)
  • Provisioned mode: shard count is set manually (more control, lower cost)

Exam Trap: With KDS provisioned mode, if shard capacity is exceeded, producers receive ProvisionedThroughputExceededException. The fix is to increase shard count (resharding) or switch to On-Demand mode.
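Sizing a provisioned stream is just applying the per-shard limits from the table above: take the worst case across the three constraints. A minimal sketch:

```python
import math

def required_shards(in_mb_s: float, out_mb_s: float, records_s: float) -> int:
    """Minimum shard count for a provisioned-mode stream, using the
    per-shard limits: 1 MB/s in, 2 MB/s out, 1,000 records/s in."""
    return max(
        math.ceil(in_mb_s / 1.0),     # ingest bandwidth limit
        math.ceil(out_mb_s / 2.0),    # egress bandwidth limit
        math.ceil(records_s / 1000),  # ingest record-rate limit
    )

# Matches the example above: 10 MB/s in and 20 MB/s out need 10 shards.
assert required_shards(10, 20, 5000) == 10
```

If producers start seeing ProvisionedThroughputExceededException, rerun this math with current traffic and reshard up to the new count.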

SAA/SAP Tip: Multiple independent consumers can read the same shard simultaneously without interfering — each tracks its own position. This is fundamentally different from SQS (Simple Queue Service) where a message is consumed once and deleted.

KDS vs. SQS — When to Use Which

| | Kinesis Data Streams | Amazon SQS |
|---|---|---|
| Ordering | Per-shard ordering guaranteed | Standard queue: no order; FIFO queue: per-group order |
| Replay | Yes — rewind and re-read past records | No — messages deleted after consumption |
| Multiple consumers | Yes — all consumers read independently | No — each message consumed by one consumer |
| Latency | ~70 ms (enhanced fan-out) | Seconds |
| Use when | Multiple apps need the same event; replay needed; time-series ordering matters | Task queues; decoupling; each message processed once |
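The replay and multi-consumer rows are the ones that trip people up, so here is a toy model of the KDS side: records are retained, and each consumer keeps its own iterator, so reads never interfere. The class and method names are illustrative, not an AWS API.

```python
class MiniStream:
    """Toy KDS-style log: records are retained after being read, and
    every consumer tracks its own position (iterator) independently."""
    def __init__(self):
        self.records = []   # durable, replayable log
        self.offsets = {}   # consumer name -> next read position

    def put(self, record):
        self.records.append(record)

    def read(self, consumer, limit=10):
        pos = self.offsets.get(consumer, 0)
        batch = self.records[pos:pos + limit]
        self.offsets[consumer] = pos + len(batch)
        return batch

stream = MiniStream()
for event in ["click", "view", "buy"]:
    stream.put(event)

assert stream.read("app_a") == ["click", "view", "buy"]  # consumer A reads all
assert stream.read("app_b") == ["click", "view", "buy"]  # B still sees them all
stream.offsets["app_a"] = 0                              # replay: rewind A
assert stream.read("app_a") == ["click", "view", "buy"]
```

An SQS standard queue has none of this: a message is delivered to one consumer and deleted on acknowledgement, so there is nothing to rewind to.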

Kinesis Data Firehose

Firehose is a fully managed, zero-code delivery pipeline — it reads from a stream (or accepts records directly) and delivers them to a destination, with optional transformation along the way.

Destinations

| Destination | Notes |
|---|---|
| Amazon S3 | Most common; can batch, compress (GZIP/Snappy), and convert format (to Parquet/ORC) before landing |
| Amazon Redshift | Loads via S3 staging → COPY command automatically |
| Amazon OpenSearch Service | For log analytics and search |
| HTTP endpoint / Splunk / Datadog | Third-party SIEM and observability tools |

Key Properties

  • No consumer code needed — fully managed; configure a destination, and Firehose handles the rest
  • Buffer — batches records before delivery (configurable: 1–900 seconds or 1–128 MB, whichever comes first)
  • Lambda transformation — optionally invoke a Lambda function to transform each batch before delivery
  • Not real-time — minimum ~60 second delivery latency due to buffering (this is by design)
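The buffering behavior described above (flush on size OR age, whichever comes first) is easy to model; this sketch is an illustration of the logic, not Firehose's actual internals:

```python
import time

class Buffer:
    """Firehose-style buffering: a batch is flushed when it reaches a
    size threshold OR an age threshold, whichever comes first."""
    def __init__(self, max_bytes, max_seconds, now=time.monotonic):
        self.max_bytes, self.max_seconds, self.now = max_bytes, max_seconds, now
        self.batch, self.size, self.started = [], 0, None

    def add(self, record: bytes):
        if self.started is None:          # clock starts on first record
            self.started = self.now()
        self.batch.append(record)
        self.size += len(record)

    def should_flush(self) -> bool:
        if not self.batch:
            return False
        return (self.size >= self.max_bytes
                or self.now() - self.started >= self.max_seconds)
```

Because delivery waits for one of the thresholds, latency is bounded below by the buffer interval, which is exactly why Firehose is near-real-time rather than real-time.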

Exam Trap: Firehose is near-real-time, not real-time. If a scenario requires sub-second processing or complex stateful computation, use KDS + Lambda or Managed Flink. If the goal is just reliably landing data in S3/Redshift with minimal code, Firehose is correct.

Amazon Managed Service for Apache Flink

Apache Flink is an open-source, distributed stream processing framework. It excels at stateful, complex, real-time computations on event streams — aggregations over time windows, anomaly detection, joining two streams together, and more.

AWS offers it as a fully managed service called Amazon Managed Service for Apache Flink (previously named Kinesis Data Analytics for Apache Flink).

What "Stateful" Means

Most stream operations are stateless (filter a record, transform a field). Stateful means the computation needs to remember past events:

  • "How many orders in the last 5 minutes?" → requires a time window (state = running count)
  • "Alert if the same card is used in 3 different countries within 1 hour" → requires keyed state per card
  • "Join a click event stream with a user profile stream" → requires stream-to-stream join

Flink manages this state durably — even if the application restarts, it picks up where it left off.
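The second bullet above (same card in 3 countries within 1 hour) is a good example of keyed state. The sketch below keeps a per-card deque of recent events and expires anything older than the window; Flink would manage this state durably and checkpoint it for you, while here it only lives in memory. All names are illustrative.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # 1 hour

# Keyed state: for each card, remember (timestamp, country) of recent events.
state = defaultdict(deque)

def on_transaction(card: str, country: str, ts: float) -> bool:
    """Return True if `card` has now been seen in 3+ distinct countries
    within the last hour. In-memory sketch of keyed windowed state."""
    events = state[card]
    events.append((ts, country))
    while events and ts - events[0][0] > WINDOW_SECONDS:
        events.popleft()               # expire events outside the window
    return len({c for _, c in events}) >= 3

assert not on_transaction("card1", "HK", 0)
assert not on_transaction("card1", "JP", 100)
assert on_transaction("card1", "US", 200)      # 3 countries inside 1 hour
```

The state here is tiny, but in production it is keyed across millions of cards and must survive restarts, which is precisely the problem Flink's checkpointing solves.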

Key Concepts

| Concept | Description |
|---|---|
| Source | Where data comes in (Kinesis Data Streams, MSK/Kafka, S3) |
| Operator | A transformation step (map, filter, aggregate, join) |
| Sink | Where results go (S3, Redshift, DynamoDB, OpenSearch, Kinesis) |
| Window | A logical time slice for aggregations — Tumbling (fixed, non-overlapping), Sliding (overlapping), Session (activity-based) |
| Checkpoint | Periodic snapshot of application state; enables fault tolerance and exactly-once processing |
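The difference between tumbling and sliding windows is just window assignment: a tumbling window puts each event in exactly one fixed slice, a sliding window can put it in several overlapping slices. A small sketch of the assignment math (event-time timestamps, window bounds as half-open intervals):

```python
def tumbling_window(ts, size):
    """Each event belongs to exactly one fixed, non-overlapping window."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Each event may belong to several overlapping windows
    (size / slide of them, when slide evenly divides size)."""
    first = ts - (ts % slide) - size + slide
    return [(s, s + size)
            for s in (first + i * slide for i in range(int(size // slide)))
            if s <= ts < s + size]

# An event at t=61 with 60s tumbling windows lands in [60, 120).
assert tumbling_window(61, 60) == (60, 120)
# With 10s windows sliding every 5s, t=12 is in [5, 15) and [10, 20).
assert sliding_windows(12, 10, 5) == [(5, 15), (10, 20)]
```

Session windows don't fit this formula: their bounds depend on gaps in the data itself, which is why they require state rather than arithmetic.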

Example: Tumbling Window Aggregation

```sql
-- Count orders per product every 1 minute (Flink SQL API)
SELECT
  product_id,
  COUNT(*) AS order_count,
  TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM orders
GROUP BY product_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```

Lambda on KDS vs. Managed Flink — When to Use Which

| | Lambda on KDS | Managed Flink |
|---|---|---|
| Stateless transforms | Yes — simple and cheap | Overkill |
| Stateful aggregations (windowed counts, running totals) | No — requires external state store | Yes — built-in state management |
| Stream-to-stream joins | No — very complex | Yes — native |
| Exactly-once processing | No — at-least-once by default | Yes — built-in checkpointing |
| Complex event detection (patterns over time) | No — very complex | Yes — Flink CEP library |

The Full Analytics Pipeline — How These Services Fit Together

[Producers]          [Ingest]              [Process]           [Store / Serve]
IoT sensors    ──▶  Kinesis Data    ──▶   Managed Flink   ──▶  DynamoDB (live state)
Mobile apps          Streams               (stateful               S3 (data lake)
Web clickstream                            real-time)          Redshift (warehouse)
Log events     ──▶  Kinesis Data    ──▶   ─────────────── ──▶  S3 / Redshift /
                    Firehose                (no-code pipe)       OpenSearch

SAA/SAP Tip: The exam often presents a multi-stage analytics scenario. Map each stage: ingestion (Kinesis), real-time processing (Flink or Lambda), storage (S3/Redshift), querying (Athena/Redshift). Identify which stage the question is testing.


SAA/SAP Exam Tips

SAA Tip: "Real-time clickstream analytics" or "process events as they arrive" → Kinesis Data Streams for ingestion + Lambda or Managed Flink for processing.

SAA/SAP Tip: "Load streaming data into S3/Redshift with no custom code" → Kinesis Data Firehose.

Exam Trap: Kinesis Data Streams and SQS are both queuing/streaming services but serve different patterns. KDS = ordered, replayable, multi-consumer, time-series. SQS = task queue, message consumed once, decoupling.

SAP Tip: "Detect fraud in real-time by correlating events across a 10-minute window" → stateful stream processing → Amazon Managed Service for Apache Flink.


Cross-Cloud Equivalents

Kinesis Data Streams

| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Kinesis Data Streams | Baseline |
| Azure | Azure Event Hubs | Similar partitioned log model; Kafka-compatible API option |
| GCP | Google Cloud Pub/Sub | Message-based push/pull; less partition-explicit than KDS |
| On-Premises | Apache Kafka | Industry standard; KDS is conceptually similar but proprietary |

Kinesis Data Firehose

| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Data Firehose | Baseline |
| Azure | Azure Stream Analytics (output connectors) | More processing-capable; not purely a delivery pipe |
| GCP | Dataflow with GCS/BigQuery sink | More powerful but heavier; no direct Firehose equivalent |
| On-Premises | Apache Kafka Connect | Sink connectors deliver Kafka topics to external systems |

Managed Service for Apache Flink

| Provider | Service / Solution | Notes |
|---|---|---|
| AWS | Amazon Managed Service for Apache Flink | Baseline |
| Azure | Azure Stream Analytics | Proprietary SQL-based; less powerful for complex stateful jobs |
| GCP | Google Cloud Dataflow | Built on Apache Beam; supports a Flink runner; strong serverless option |
| On-Premises | Apache Flink (self-managed) | Same technology; run on YARN/Kubernetes |

Pricing Model

Kinesis Data Streams:

  • On-Demand: per GB ingested + per GB retrieved
  • Provisioned: per shard-hour + per PUT payload unit

Kinesis Data Firehose:

  • Per GB of data ingested; additional charge for format conversion (Parquet/ORC)

Managed Flink:

  • Per Kinesis Processing Unit (KPU) per hour (1 KPU = 1 vCPU + 4 GB RAM)
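A back-of-envelope estimate for a provisioned pipeline multiplies hours per month by the hourly unit rates. The rates below are placeholder parameters, not current AWS prices; always look up the pricing page for your region.

```python
HOURS_PER_MONTH = 730  # ~24 * 365 / 12

def monthly_cost(shards, shard_hour_rate, kpus, kpu_hour_rate):
    """Rough monthly estimate for a provisioned KDS stream plus a
    Managed Flink app. Rates are hypothetical placeholders; PUT
    payload units and Firehose per-GB charges are ignored."""
    return HOURS_PER_MONTH * (shards * shard_hour_rate
                              + kpus * kpu_hour_rate)
```

For example, 10 shards and a 4-KPU Flink app at illustrative rates of $0.015/shard-hour and $0.11/KPU-hour come to about $431/month, which shows why right-sizing shard counts matters in provisioned mode.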

Related Services

  • Amazon Redshift — common Firehose and Flink destination for analytical warehousing
  • Amazon Athena — query the S3 data that Firehose lands
  • Amazon SQS and SNS — message queuing / fan-out; different pattern from streaming
  • Amazon MSK — managed Apache Kafka; alternative ingestion layer to KDS
Built by Fred Cheung @CookedRicer · Powered by Fumadocs & Github Copilot
