AWS Kinesis and Real-Time Data Streaming

AWS Kinesis is a suite of services for real-time data streaming: collecting, processing, and analyzing high volumes of data as it arrives, typically with sub-second latency. Instead of storing data first and analyzing it later (batch processing), Kinesis processes data as it flows in, enabling near-instant insights and real-time reactions to events.

What Is Real-Time Data Streaming?

Real-time streaming means data is continuously produced by many sources and must be processed immediately — not collected for hours or days and then analyzed in bulk.

Examples of real-time streaming data:

  • Stock prices updating every second from thousands of trading systems
  • Clickstream data from millions of users browsing a website simultaneously
  • IoT sensor readings from thousands of factory machines every few seconds
  • Log data from hundreds of servers generated continuously
  • GPS location updates from a fleet of delivery vehicles every 10 seconds

AWS Kinesis Services

Service                  Purpose
-----------------------  ---------------------------------------------------------------------------------
Kinesis Data Streams     Ingest and temporarily store streaming data for custom real-time processing
Kinesis Data Firehose    Fully managed pipeline to deliver streaming data to S3, Redshift, or OpenSearch Service
Kinesis Data Analytics   Run SQL or Apache Flink queries on streaming data in real time
Kinesis Video Streams    Ingest and process live video streams from cameras and devices

Kinesis Data Streams — Core Concepts

Producers and Consumers

[Producers]                  [Kinesis Data Stream]         [Consumers]
Web servers    ---PUT-->     +-------------------+  --GET--> Lambda function
Mobile apps    ---PUT-->     | Shard 1           |  --GET--> EC2 application
IoT devices    ---PUT-->     | Shard 2           |  --GET--> Kinesis Analytics
Log agents     ---PUT-->     | Shard 3           |  --GET--> Kinesis Firehose
               (records)     +-------------------+          (process records)
                             (retains records for 24 hours to 365 days)

  • Producers write records (data) to the stream using the Kinesis Producer Library (KPL) or the AWS SDK.
  • Each record has a partition key (which determines the shard it goes to), a sequence number (assigned automatically by Kinesis), and a data payload (up to 1 MB per record).
  • Consumers read records from the stream using the Kinesis Client Library (KCL), the AWS SDK, or Lambda triggers.
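The role of the partition key can be sketched in a few lines. Kinesis hashes each partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below assumes the shards split the hash space evenly, which may not hold after shards have been split or merged; the function names `hash_key` and `shard_index` are illustrative, not part of any AWS library.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer hash key


def hash_key(partition_key: str) -> int:
    """Return the 128-bit integer hash of a partition key (MD5, big-endian)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_index(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard, ASSUMING shards split the hash space evenly."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


# The same key always lands on the same shard, which is what preserves
# per-key ordering within a stream.
for key in ("user-1", "user-2", "user-3"):
    print(key, "-> shard", shard_index(key, shard_count=3))
```

A practical consequence: a partition key with very few distinct values (for example a constant string) sends all traffic to one shard, so high-cardinality keys such as user or device IDs tend to spread load better.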

Shards

A shard is the base throughput unit of a Kinesis Data Stream. Each shard provides:

  • Write: up to 1,000 records/second or 1 MB/second, whichever limit is reached first
  • Read: 2 MB/second shared by all standard consumers, or a dedicated 2 MB/second for each consumer that uses enhanced fan-out

A stream starts with a defined number of shards and can be scaled by splitting or merging shards. A stream with 5 shards handles up to 5,000 records/second (or 5 MB/second) of writes and 10 MB/second of shared reads.
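The per-shard limits turn shard sizing into simple arithmetic. The following is a rough sizing sketch using only the limits listed above; the function name `shards_needed` is hypothetical, and real sizing would also leave headroom for traffic spikes and uneven partition keys.

```python
import math

# Per-shard limits as stated above.
WRITE_RECORDS_PER_SEC = 1_000
WRITE_MB_PER_SEC = 1.0
SHARED_READ_MB_PER_SEC = 2.0  # shared by all standard consumers


def shards_needed(records_per_sec: float, write_mb_per_sec: float,
                  read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(read_mb_per_sec / SHARED_READ_MB_PER_SEC),
    )


print(shards_needed(5_000, 5.0, 10.0))  # 5
print(shards_needed(1_200, 0.5, 1.0))   # 2
```

Whichever limit is hit first drives the result: in the second call the record rate, not the byte rate, forces a second shard.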

Data Retention

Records are retained in a Kinesis Data Stream for 24 hours by default, extendable to 365 days (at extra cost). Unlike SQS, where a message is deleted once a consumer has processed it, Kinesis keeps every record for the full retention period, so multiple consumers can independently process the same data.

One Stream → Multiple independent consumers

Kinesis Data Stream: web-events
  |
  +-- Consumer 1: Lambda → update real-time dashboard
  +-- Consumer 2: Kinesis Analytics → detect fraud patterns
  +-- Consumer 3: Kinesis Firehose → store raw data in S3
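Why retention enables independent consumers and replay can be illustrated with a toy model. This is not the Kinesis API: `ShardLog` and `Consumer` are hypothetical classes in which a list position stands in for a sequence number, and each consumer keeps its own checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class ShardLog:
    """Toy append-only log standing in for one shard (not the Kinesis API)."""
    records: list = field(default_factory=list)

    def put(self, data: str) -> int:
        self.records.append(data)
        return len(self.records) - 1  # position acts as a sequence number

    def get(self, start: int, limit: int = 10) -> list:
        return self.records[start:start + limit]


@dataclass
class Consumer:
    name: str
    position: int = 0  # each consumer tracks its own checkpoint

    def poll(self, log: ShardLog, limit: int = 10) -> list:
        batch = log.get(self.position, limit)
        self.position += len(batch)
        return batch


log = ShardLog()
for event in ("view", "cart_add", "purchase"):
    log.put(event)

dashboard = Consumer("dashboard")
archiver = Consumer("archiver")
print(dashboard.poll(log))           # ['view', 'cart_add', 'purchase']
print(archiver.poll(log))            # same three records, read independently
dashboard.position = 0               # replay: rewind the checkpoint
print(dashboard.poll(log, limit=1))  # ['view']
```

Reading never removes anything from the log, so one consumer's progress has no effect on another's. That is the key difference from a queue, where consuming a message removes it.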

Kinesis Data Firehose

Kinesis Data Firehose (renamed Amazon Data Firehose in 2024) is a fully managed delivery service. It receives streaming data, optionally transforms it with a Lambda function, buffers it briefly, and delivers it to a destination, with no code required for the delivery pipeline itself.
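The optional Lambda transformation step follows a simple contract: Firehose passes a batch of base64-encoded records, and the function returns one entry per record with the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (possibly modified) base64 data. Below is a minimal sketch of such a handler; the field names `user_id` and `event_type` are assumptions made for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation sketch: keep selected fields, flag bad records."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical field names, chosen for illustration only.
            slim = {k: payload[k] for k in ("user_id", "event_type") if k in payload}
            # A trailing newline keeps delivered objects line-delimited.
            data = base64.b64encode((json.dumps(slim) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": data.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Malformed input: hand the record back unchanged and mark it failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input record must appear in the response, which is why the failure branch still echoes the `recordId` rather than skipping the record.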

Supported Destinations

  • Amazon S3 (most common)
  • Amazon Redshift (via S3)
  • Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
  • Splunk
  • HTTP endpoint (any custom destination)

Firehose Architecture

[Data Sources: Apps, Logs, Events]
           |
   [Kinesis Data Firehose]
           |
   Optional: Lambda transforms records
   (convert JSON → Parquet, filter fields, enrich data)
           |
   Buffering: by time (60s–900s) or by size (1MB–128MB)
           |
    Destination:
    [S3 Bucket: raw-events/year=2024/month=03/day=15/]

Firehose handles everything automatically — no shards to manage, no consumers to write, no infrastructure to operate. Pay only for data ingested (per GB).
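The buffering rule in the diagram (flush when either the size or the time threshold is reached) and the date-partitioned S3 prefix can be modeled as follows. This is a toy illustration, not how Firehose is implemented: `DeliveryBuffer` and `date_prefix` are hypothetical names, and the model only checks the timer when a record arrives, whereas the real service flushes on its own schedule.

```python
from datetime import datetime, timezone


class DeliveryBuffer:
    """Toy model of size-or-time buffering (not the Firehose implementation)."""

    def __init__(self, max_bytes=1_000_000, max_seconds=60.0):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.items = []
        self.size = 0
        self.opened_at = None  # time the first record of the batch arrived

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a threshold is hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.items.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            return self.flush()
        return None

    def flush(self):
        batch, self.items, self.size, self.opened_at = self.items, [], 0, None
        return batch


def date_prefix(ts: datetime, root: str = "raw-events") -> str:
    """Build a Hive-style date prefix like the one shown in the diagram."""
    return f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


print(date_prefix(datetime(2024, 3, 15, tzinfo=timezone.utc)))
# raw-events/year=2024/month=03/day=15/
```

Larger buffers produce fewer, bigger objects in S3 (cheaper to query) at the cost of higher delivery latency; smaller buffers do the opposite.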

Kinesis Data Analytics

Kinesis Data Analytics runs SQL queries or Apache Flink applications on streaming data in real time, without provisioning any servers. (The Flink offering has since been renamed Amazon Managed Service for Apache Flink.)

Example: A ride-sharing platform wants to alert operations if more than 100 rides are cancelled in any 5-minute window (possible system issue). Kinesis Data Analytics runs a continuous SQL query on the incoming events stream:

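-- Output stream: receives one row per window that breaches the threshold
CREATE OR REPLACE STREAM "ALERT_STREAM" (
  window_start TIMESTAMP,
  cancellations INTEGER
);

-- Pump: continuously counts cancellations per 5-minute tumbling window.
-- Windows are fixed and non-overlapping, aligned to 5-minute boundaries.
CREATE OR REPLACE PUMP "ALERT_PUMP" AS
INSERT INTO "ALERT_STREAM"
SELECT STREAM STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE) AS window_start,
              COUNT(*) AS cancellations
FROM "SOURCE_SQL_STREAM_001"
WHERE event_type = 'CANCELLATION'
GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE)
HAVING COUNT(*) > 100;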

When the count exceeds 100, the output is delivered to a Lambda function that sends an alert to the operations team.
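For readers less familiar with streaming SQL, the windowed count described in this example (cancellations per 5-minute window, alert above 100) can be expressed in plain Python. This sketch works on a finite list rather than a live stream, and the `(epoch_seconds, event_type)` tuple shape is an assumption made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60
THRESHOLD = 100


def cancellation_alerts(events, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """Return {window_start: count} for windows whose count exceeds threshold.

    `events` is an iterable of (epoch_seconds, event_type) tuples; this tuple
    shape is an assumption for the sketch.
    """
    counts = Counter()
    for ts, event_type in events:
        if event_type == "CANCELLATION":
            # Round the timestamp down to the start of its tumbling window.
            counts[int(ts) - int(ts) % window] += 1
    return {start: n for start, n in sorted(counts.items()) if n > threshold}


# 101 cancellations in the first window, exactly 100 in the second.
events = [(i, "CANCELLATION") for i in range(101)]
events += [(300 + i, "CANCELLATION") for i in range(100)]
print(cancellation_alerts(events))  # {0: 101}
```

Only the first window is reported because the condition is strictly "more than 100". Note also that fixed windows can miss a burst that straddles a window boundary; catching that case would need a sliding window instead.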

Kinesis vs SQS — When to Use Which

Feature             Kinesis Data Streams                              SQS
------------------  ------------------------------------------------  ----------------------------------------------
Message ordering    Ordered within a shard                            Standard: not ordered; FIFO: ordered
Multiple consumers  Yes: many consumers read the same data            No: a message is deleted once one consumer processes it
Data retention      24 hours to 365 days                              Up to 14 days
Replay capability   Yes: old records can be re-read                   No: messages are deleted after consumption
Throughput          Very high (scales with the number of shards)      Nearly unlimited (Standard queues)
Best for            Analytics, log processing, real-time dashboards   Task queues, work distribution

Real-World Example — E-Commerce Real-Time Analytics

An e-commerce platform uses Kinesis to power real-time business intelligence:

[Website: product views, cart adds, purchases, searches]
           |
   [Kinesis Data Stream: user-events]
      /          |           \
[Consumer 1] [Consumer 2] [Consumer 3]
Lambda:       Kinesis        Kinesis
Update        Analytics:     Firehose:
live counter  detect         store all
on product    abandoned      events in
page          carts          S3 for
              in real time   reporting

  • The product detail page shows a live "X people viewing this right now" counter, updated by Consumer 1.
  • Abandoned cart detection fires promptly: if a user adds items but does not check out within 3 minutes, Consumer 2 triggers a notification Lambda within seconds of that timeout.
  • Consumer 3 stores all raw events in S3, partitioned by date, where they are available for batch analysis and machine learning the next day.
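Consumer 1 could look roughly like the handler below. When Lambda reads from a Kinesis stream, each record's payload arrives base64-encoded under `Records[i].kinesis.data`. The event field names `event_type` and `product_id` are assumptions for this sketch, and the in-memory `Counter` is only a stand-in: a real counter would live in a shared store, because Lambda instances do not share memory.

```python
import base64
import json
from collections import Counter

# In-memory stand-in for a real shared store (for example a cache or table).
view_counts = Counter()


def lambda_handler(event, context):
    """Decode Kinesis records delivered to Lambda and count product views."""
    processed = 0
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # "event_type" and "product_id" are hypothetical field names.
        if payload.get("event_type") == "product_view":
            view_counts[payload["product_id"]] += 1
        processed += 1
    return {"processed": processed}
```

Using the product ID as the partition key would route all views of one product to the same shard, so the events for each product are processed in order.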

Summary

  • Kinesis processes high-volume data streams in real time, enabling analysis and reactions with sub-second latency.
  • Kinesis Data Streams provides customizable real-time streaming with multiple independent consumers and replay capability.
  • Kinesis Data Firehose is a fully managed service that delivers streaming data to S3, Redshift, or OpenSearch, with no infrastructure to operate.
  • Kinesis Data Analytics runs SQL or Flink queries on live data streams without server management.
  • Kinesis differs from SQS primarily in its support for multiple independent consumers, per-shard ordering, data replay, and high-throughput analytics use cases.
