AWS Kinesis and Real-Time Data Streaming
AWS Kinesis is a suite of services for real-time data streaming — collecting, processing, and analyzing high volumes of data as it arrives, within milliseconds. Instead of storing data first and analyzing it later (batch processing), Kinesis processes data as it flows in — enabling instant insights and real-time reactions to events.
What Is Real-Time Data Streaming?
Real-time streaming means data is continuously produced by many sources and must be processed immediately — not collected for hours or days and then analyzed in bulk.
Examples of real-time streaming data:
- Stock prices updating every second from thousands of trading systems
- Clickstream data from millions of users browsing a website simultaneously
- IoT sensor readings from thousands of factory machines every few seconds
- Log data from hundreds of servers generated continuously
- GPS location updates from a fleet of delivery vehicles every 10 seconds
AWS Kinesis Services
| Service | Purpose |
|---|---|
| Kinesis Data Streams | Ingest and temporarily store streaming data for custom real-time processing |
| Kinesis Data Firehose | Fully managed pipeline to deliver streaming data to S3, Redshift, or Elasticsearch |
| Kinesis Data Analytics | Run SQL or Apache Flink queries on streaming data in real time |
| Kinesis Video Streams | Ingest and process live video streams from cameras and devices |
Kinesis Data Streams — Core Concepts
Producers and Consumers
[Producers] [Kinesis Data Stream] [Consumers]
Web servers ---PUT--> +-------------------+ --GET--> Lambda function
Mobile apps ---PUT--> | Shard 1 | --GET--> EC2 application
IoT devices ---PUT--> | Shard 2 | --GET--> Kinesis Analytics
Log agents ---PUT--> | Shard 3 | --GET--> Kinesis Firehose
(records) +-------------------+ (process records)
(retains for 24hrs–365 days)
- Producers write records (data) to the stream using the Kinesis Producer Library (KPL) or AWS SDK.
- Each record has a partition key (determines which shard it goes to), a sequence number (auto-assigned), and data (up to 1 MB per record).
- Consumers read records from the stream using Kinesis Client Library (KCL) or Lambda triggers.
Shards
A shard is the base throughput unit of a Kinesis Data Stream. Each shard provides:
- Write: 1,000 records/second or 1 MB/second
- Read: 2 MB/second (standard) or 2 MB/second per consumer (enhanced fan-out)
A stream starts with a defined number of shards and can be scaled by splitting or merging shards. A stream with 5 shards handles up to 5,000 records/second write and 10 MB/second read.
Data Retention
Records are retained in a Kinesis Data Stream for 24 hours by default, extendable to 365 days (at extra cost). Unlike SQS where messages are deleted after consumption, Kinesis retains all records so multiple consumers can independently process the same data.
One Stream → Multiple independent consumers Kinesis Data Stream: web-events | +-- Consumer 1: Lambda → update real-time dashboard +-- Consumer 2: Kinesis Analytics → detect fraud patterns +-- Consumer 3: Kinesis Firehose → store raw data in S3
Kinesis Data Firehose
Kinesis Data Firehose is a fully managed delivery service. It automatically receives streaming data, optionally transforms it using Lambda, buffers it briefly, and delivers it to a destination — with no code required for the delivery pipeline itself.
Supported Destinations
- Amazon S3 (most common)
- Amazon Redshift (via S3)
- Amazon OpenSearch Service (Elasticsearch)
- Splunk
- HTTP endpoint (any custom destination)
Firehose Architecture
[Data Sources: Apps, Logs, Events]
|
[Kinesis Data Firehose]
|
Optional: Lambda transforms records
(convert JSON → Parquet, filter fields, enrich data)
|
Buffering: by time (60s–900s) or by size (1MB–128MB)
|
Destination:
[S3 Bucket: raw-events/year=2024/month=03/day=15/]
Firehose handles everything automatically — no shards to manage, no consumers to write, no infrastructure to operate. Pay only for data ingested (per GB).
Kinesis Data Analytics
Kinesis Data Analytics allows running SQL queries or Apache Flink applications on streaming data in real time — without spinning up any servers.
Example: A ride-sharing platform wants to alert operations if more than 100 rides are cancelled in any 5-minute window (possible system issue). Kinesis Data Analytics runs a continuous SQL query on the incoming events stream:
CREATE OR REPLACE STREAM "ALERT_STREAM" (
window_start TIMESTAMP,
cancellations INTEGER
);
CREATE OR REPLACE PUMP "ALERT_PUMP" AS
INSERT INTO "ALERT_STREAM"
SELECT STREAM FLOOR(ROWTIME TO MINUTE) as window_start,
COUNT(*) as cancellations
FROM "SOURCE_SQL_STREAM_001"
WHERE event_type = 'CANCELLATION'
GROUP BY FLOOR(ROWTIME TO MINUTE)
HAVING COUNT(*) > 100;
When the count exceeds 100, the output is delivered to a Lambda function that sends an alert to the operations team.
Kinesis vs SQS — When to Use Which
| Feature | Kinesis Data Streams | SQS |
|---|---|---|
| Message ordering | Ordered within a shard | Standard: not ordered; FIFO: ordered |
| Multiple consumers | Yes — many consumers read same data | No — one consumer deletes the message |
| Data retention | 24 hours to 365 days | Up to 14 days |
| Replay capability | Yes — can re-read old records | No — deleted after consumption |
| Throughput | Very high (per shard basis) | Unlimited (Standard) |
| Best for | Analytics, log processing, real-time dashboards | Task queues, work distribution |
Real-World Example — E-Commerce Real-Time Analytics
An e-commerce platform uses Kinesis to power real-time business intelligence:
[Website: product views, cart adds, purchases, searches]
|
[Kinesis Data Stream: user-events]
/ | \
[Consumer 1] [Consumer 2] [Consumer 3]
Lambda: Kinesis Kinesis
Update Analytics: Firehose:
live counter detect store all
on product abandoned events in
page carts S3 for
in real time reporting
- The product detail page shows a live "X people viewing this right now" counter — updated by Consumer 1.
- Abandoned cart detection fires within seconds — if a user adds items but does not checkout within 3 minutes, Consumer 2 triggers a notification Lambda.
- Consumer 3 stores all raw events in S3, partitioned by date — available for batch analysis and machine learning the next day.
Summary
- Kinesis processes high-volume data streams in real time — enabling millisecond latency analysis and reactions.
- Kinesis Data Streams provides customizable real-time streaming with multiple independent consumers and replay capability.
- Kinesis Data Firehose delivers streaming data to S3, Redshift, or OpenSearch fully managed — zero infrastructure to operate.
- Kinesis Data Analytics runs SQL or Flink queries on live data streams without server management.
- Kinesis differs from SQS primarily in its support for multiple consumers, ordered records, data replay, and high-throughput analytics use cases.
