GCP Cloud Dataflow

Cloud Dataflow is GCP's fully managed data processing service for running Apache Beam pipelines. It handles both real-time stream processing (data arriving continuously, like clicks or sensor readings) and batch processing (large historical datasets processed all at once) using the same programming model. Dataflow automatically provisions, scales, and manages the underlying compute resources.

Imagine a water treatment plant. Water (data) flows in continuously from rivers (Pub/Sub streams) and stored tanks (Cloud Storage files). The treatment plant (Dataflow pipeline) filters, cleans, and transforms the water, then delivers it to homes (BigQuery, Bigtable). The plant adjusts its capacity automatically when more water arrives.

Stream vs Batch Processing

Concept          | Batch Processing                        | Stream Processing
-----------------|-----------------------------------------|--------------------------------------------
Data arrival     | All at once (files, database exports)   | Continuously (events, messages, logs)
Latency          | Minutes to hours                        | Milliseconds to seconds
Use case         | Nightly sales reports, ETL jobs         | Real-time fraud detection, live dashboards
GCP source       | Cloud Storage, BigQuery                 | Pub/Sub, Kafka
GCP destination  | BigQuery, Cloud Storage                 | BigQuery, Bigtable, Cloud Storage

Apache Beam and Dataflow

Dataflow runs Apache Beam pipelines. Apache Beam is an open-source SDK for defining data processing pipelines. The same Beam code runs on Dataflow, Apache Spark, or locally — the execution engine is swappable. Beam supports Python, Java, and Go.

Apache Beam Pipeline Concepts:
┌───────────────────────────────────────────────────────────────────┐
│                                                                   │
│  PCollection (distributed dataset — like a stream or batch)       │
│                                                                   │
│   Source ──▶ PCollection ──▶ Transform ──▶ PCollection ──▶ Sink   │
│                                                                   │
│  Source:     Read from Cloud Storage / Pub/Sub / BigQuery         │
│  Transform:  Map, Filter, GroupBy, Combine, Window                │
│  Sink:       Write to BigQuery / Cloud Storage / Bigtable         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Building a Simple Batch Pipeline

This pipeline reads a CSV file from Cloud Storage, counts word frequencies, and writes results to BigQuery.

# requirements.txt
apache-beam[gcp]==2.53.0

# word_count_pipeline.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    staging_location="gs://my-bucket/staging",
    job_name="word-count-job"
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read text file from Cloud Storage
        | "Read File"     >> beam.io.ReadFromText("gs://my-bucket/input/text.txt")

        # Split each line into individual words
        | "Split Words"   >> beam.FlatMap(lambda line: line.lower().split())

        # Filter out empty strings
        | "Filter Empty"  >> beam.Filter(lambda word: len(word) > 0)

        # Count each word
        | "Count Words"   >> beam.combiners.Count.PerElement()

        # Format as dictionary for BigQuery
        | "Format Row"    >> beam.Map(lambda item: {"word": item[0], "count": item[1]})

        # Write to BigQuery
        | "Write BQ"      >> beam.io.WriteToBigQuery(
            "my-project:word_counts.results",
            schema="word:STRING, count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    )

# Submit the pipeline to Dataflow
python word_count_pipeline.py

Building a Streaming Pipeline (Pub/Sub to BigQuery)

This pipeline reads real-time events from a Pub/Sub topic, parses JSON messages, filters invalid ones, and writes to BigQuery.

# streaming_pipeline.py
import apache_beam as beam
import json
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    job_name="events-streaming-job"
)
options.view_as(StandardOptions).streaming = True  # Enable streaming mode

def parse_message(message):
    """Parse a Pub/Sub message (bytes) into a Python dict."""
    try:
        return json.loads(message.decode("utf-8"))
    except Exception:
        return None

def is_valid(event):
    """Filter out None values and events missing required fields."""
    return event is not None and "user_id" in event and "event_type" in event

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read from Pub/Sub topic
        | "Read Pub/Sub"    >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")

        # Parse JSON bytes into dict
        | "Parse JSON"      >> beam.Map(parse_message)

        # Filter invalid events
        | "Filter Valid"    >> beam.Filter(is_valid)

        # Write to BigQuery (streaming insert)
        | "Write BigQuery"  >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user_id:STRING, event_type:STRING, timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
        )
    )
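
The parsing and validation helpers are plain Python functions, so they can be unit-tested without running the pipeline at all. For example:

```python
import json

def parse_message(message):
    """Parse a Pub/Sub message (bytes) into a Python dict, or None on bad JSON."""
    try:
        return json.loads(message.decode("utf-8"))
    except Exception:
        return None

def is_valid(event):
    """Filter out None values and events missing required fields."""
    return event is not None and "user_id" in event and "event_type" in event

# A well-formed event passes both stages
good = parse_message(b'{"user_id": "u1", "event_type": "click"}')
assert is_valid(good)

# Malformed JSON yields None instead of crashing the pipeline
assert parse_message(b"not json") is None
assert not is_valid(parse_message(b"not json"))

# Events missing required fields are filtered out
assert not is_valid(parse_message(b'{"user_id": "u1"}'))
```

Keeping per-element logic in named functions (rather than lambdas) is what makes this kind of isolated testing possible.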

Windowing in Stream Processing

In stream processing, data arrives continuously. Windowing divides the infinite stream into finite time buckets for aggregation.

Stream of events:  --e1--e2--e3--e4--e5--e6--e7--e8--e9-->

Fixed Window (every 1 minute):
Window 1: [e1,e2,e3]  →  count=3
Window 2: [e4,e5,e6]  →  count=3
Window 3: [e7,e8,e9]  →  count=3

Sliding Window (1 min window, slides every 30 seconds):
Window 1: [e1,e2,e3,e4]        →  count=4
Window 2:       [e2,e3,e4,e5]  →  count=4
Window 3:              [e4,e5,e6,e7]  →  count=4

Session Window (groups events within 5-min gap):
User A: [e1,e2,e3]──5min gap──[e7,e8] → Two sessions

# Apply a fixed 1-minute window to a streaming PCollection
windowed = (
    events
    | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 60 seconds
    | "Count per window" >> beam.combiners.Count.PerElement()
)

Dataflow Monitoring

The Dataflow console provides a visual pipeline graph where each transform is shown as a node. Metrics like elements processed, throughput, and system lag are visible in real time.

Dataflow Job Graph (visual view in Console):
Read Pub/Sub
     │  250,000 elements/sec
     ▼
Parse JSON
     │  245,000 elements/sec (5,000 filtered)
     ▼
Filter Valid
     │  244,500 elements/sec
     ▼
Write BigQuery
     │  244,500 rows/sec inserted

Dataflow Templates

GCP provides pre-built Dataflow templates for common use cases — no coding required.

Template                             | Description
-------------------------------------|------------------------------------------------------
Pub/Sub to BigQuery                  | Stream Pub/Sub messages into a BigQuery table
Cloud Storage to BigQuery            | Load CSV/JSON files from GCS into BigQuery
BigQuery to Cloud Storage            | Export BigQuery query results to GCS files
Cloud Storage Text to Cloud Pub/Sub  | Read text files and publish to a Pub/Sub topic
Datastream to BigQuery               | Replicate database changes to BigQuery in real time

# Launch a Pub/Sub to BigQuery template (no code needed)
gcloud dataflow jobs run pubsub-to-bq \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --parameters \
    inputTopic=projects/my-project/topics/events,\
    outputTableSpec=my-project:analytics.events

Key Takeaways

  • Cloud Dataflow runs Apache Beam pipelines for both batch and stream data processing.
  • The same Beam code handles both bounded (batch) and unbounded (streaming) data.
  • Dataflow auto-scales workers up and down based on data volume.
  • Windowing divides continuous streams into time buckets for aggregation.
  • Pre-built Dataflow templates enable common ETL pipelines without writing any code.
  • Typical pipeline: Source (Pub/Sub / GCS) → Transform (filter, parse, enrich) → Sink (BigQuery / Bigtable).
