GCP BigQuery

BigQuery is GCP's fully managed, serverless data warehouse designed for analyzing massive datasets — from millions to billions of rows — using standard SQL. There are no servers to configure, no indexes to maintain, and no database tuning required. Upload data and start querying in seconds.

A data warehouse is a central repository for large amounts of structured data used for reporting and analysis. While Cloud SQL is optimized for transactional workloads (many small reads and writes), BigQuery is optimized for analytical workloads (scanning enormous datasets to answer complex business questions).

BigQuery vs Cloud SQL

Feature        | Cloud SQL                            | BigQuery
---------------|--------------------------------------|-------------------------------------------
Use Case       | Transactional (OLTP) — app database  | Analytical (OLAP) — reporting, BI
Query Speed    | Fast for small queries               | Fast for queries scanning billions of rows
Data Size      | GBs to TBs                           | TBs to PBs
Storage Model  | Row-based                            | Column-based (columnar storage)
Updates        | Frequent INSERT/UPDATE/DELETE        | Batch loads and streaming inserts

BigQuery Architecture

Data Sources
├── CSV / JSON files from Cloud Storage
├── Streaming inserts from applications
├── Google Sheets
├── Other GCP services (Pub/Sub, Dataflow)
└── External databases
        │
        ▼
BigQuery Storage (Colossus — Google's distributed file system)
        │
        ▼
BigQuery Query Engine (Dremel — distributed SQL engine)
        │
        ▼
Results in seconds (even for billions of rows)
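The fan-out above is the key idea behind Dremel: the query is split into partial aggregations that run in parallel across many workers, and a coordinator merges the partial results. A toy sketch in pure Python (no GCP dependencies; the three-shard split and the fare values are invented for illustration):

```python
# Toy sketch of a Dremel-style distributed aggregate:
# each worker computes a partial COUNT/SUM over its shard of the data,
# and a coordinator merges the partials into the final COUNT(*) and AVG().

def worker_partial(shard):
    """Partial aggregate computed by one worker over its shard."""
    return {"count": len(shard), "sum": sum(shard)}

def merge(partials):
    """Coordinator: combine per-worker partials into the final result."""
    total = sum(p["count"] for p in partials)
    fare_sum = sum(p["sum"] for p in partials)
    return {"total_rides": total, "avg_fare": round(fare_sum / total, 2)}

# The full table, split across three hypothetical workers.
shards = [[12.5, 8.0, 20.0], [15.0, 9.5], [11.0, 14.0, 10.0]]
result = merge([worker_partial(s) for s in shards])
print(result)  # {'total_rides': 8, 'avg_fare': 12.5}
```

Because COUNT and SUM compose cleanly from partial results, adding more workers shortens each shard without changing the answer, which is why scanning billions of rows can still finish in seconds.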

BigQuery Data Organization

Project: my-analytics-project
└── Dataset: sales_data
    ├── Table: orders_2023
    ├── Table: orders_2024
    ├── Table: customers
    └── View: monthly_revenue  (virtual table, defined by a query)

  • Project: The top-level container for billing and access control.
  • Dataset: A collection of tables, similar to a schema in a traditional database.
  • Table: Structured data with rows and columns. Tables can be native (stored in BigQuery), external (data stays in Cloud Storage), or partitioned.
  • View: A virtual table defined by a saved SQL query; it stores no data of its own.

Running a Query in BigQuery

Via the Console

  1. Go to BigQuery → SQL Workspace
  2. Click the + tab to open a new query editor
  3. Enter a SQL query and click Run

Example – Querying a Public Dataset

BigQuery provides many public datasets for learning. This query analyzes the NYC taxi rides dataset:

SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS hour_of_day,
  COUNT(*) AS total_rides,
  ROUND(AVG(fare_amount), 2) AS avg_fare
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
WHERE
  fare_amount > 0
  AND trip_distance > 0
GROUP BY
  hour_of_day
ORDER BY
  total_rides DESC
LIMIT 10;

Result (sample):

┌──────────────┬─────────────┬──────────┐
│ hour_of_day  │ total_rides │ avg_fare │
├──────────────┼─────────────┼──────────┤
│     18       │   1,240,832 │   15.20  │
│     17       │   1,189,456 │   14.75  │
│     19       │   1,102,109 │   15.90  │
└──────────────┴─────────────┴──────────┘
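To make the SQL concrete, here is the same hourly aggregation written in plain Python over a handful of made-up rows (the trips below are invented; the real public table holds tens of millions of rows):

```python
from collections import defaultdict
from datetime import datetime

# Invented taxi trips: (pickup_datetime, fare_amount, trip_distance)
trips = [
    ("2022-03-01T18:05:00", 14.0, 2.1),
    ("2022-03-01T18:40:00", 16.0, 3.0),
    ("2022-03-01T17:15:00", 15.0, 2.5),
    ("2022-03-01T09:00:00", 0.0, 0.0),   # removed by the WHERE clause
]

# WHERE fare_amount > 0 AND trip_distance > 0
valid = [t for t in trips if t[1] > 0 and t[2] > 0]

# GROUP BY EXTRACT(HOUR FROM pickup_datetime)
by_hour = defaultdict(list)
for ts, fare, _ in valid:
    by_hour[datetime.fromisoformat(ts).hour].append(fare)

# COUNT(*), ROUND(AVG(fare_amount), 2), ORDER BY total_rides DESC
rows = sorted(
    ((h, len(f), round(sum(f) / len(f), 2)) for h, f in by_hour.items()),
    key=lambda r: r[1],
    reverse=True,
)
print(rows)  # [(18, 2, 15.0), (17, 1, 15.0)]
```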

Loading Data into BigQuery

From Cloud Storage (Batch Load)

# Load a CSV file from Cloud Storage into a BigQuery table
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --autodetect \
  my_dataset.customers \
  gs://my-bucket/customers.csv

Streaming Insert (Real-Time)

Applications can insert rows into BigQuery in real time using the Streaming API. Rows are available for querying within seconds.

# Python example — streaming insert
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # dataset IDs may not contain hyphens

rows = [
    {"event": "page_view", "user_id": "u123", "timestamp": "2024-01-15T10:30:00"},
    {"event": "purchase",  "user_id": "u456", "timestamp": "2024-01-15T10:31:00"},
]

errors = client.insert_rows_json(table_id, rows)
if not errors:
    print("Rows inserted successfully.")
else:
    print(f"Encountered errors: {errors}")
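`insert_rows_json` reports failures per row, with each error entry carrying the index of the failed row, so a common pattern is to retry only the rows that failed. A minimal sketch, shown with a fake insert function so it runs without GCP credentials (`flaky_insert` and its failure pattern are invented):

```python
def retry_failed_rows(insert_fn, rows, max_attempts=3):
    """Call insert_fn(rows); re-submit only the rows whose index appears
    in the returned error list, up to max_attempts times."""
    pending = rows
    for _ in range(max_attempts):
        errors = insert_fn(pending)
        if not errors:
            return []  # everything inserted
        # Each error entry carries the index of the failed row.
        pending = [pending[e["index"]] for e in errors]
    return pending  # rows still failing after all attempts

# Fake insert function: fails the second row on the first call only.
calls = []
def flaky_insert(rows):
    calls.append(list(rows))
    return [{"index": 1, "errors": ["backend error"]}] if len(calls) == 1 else []

rows = [{"event": "page_view"}, {"event": "purchase"}]
leftover = retry_failed_rows(flaky_insert, rows)
print(leftover)  # [] — the failed row succeeded on retry
```

In production, `insert_fn` would be `lambda r: client.insert_rows_json(table_id, r)`.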

Partitioned Tables

Partitioning divides a large table into smaller segments (partitions) based on a date or integer column. BigQuery only reads the relevant partitions for a query — dramatically reducing cost and query time.

Table: website_events (partitioned by event_date)
├── Partition: 2024-01-01 (Jan 1 data)
├── Partition: 2024-01-02 (Jan 2 data)
├── Partition: 2024-01-03 (Jan 3 data)
└── ...

Query: WHERE event_date = '2024-01-02'
→ BigQuery only reads the 2024-01-02 partition
→ Skips all other partitions = faster and cheaper

CREATE TABLE my_dataset.website_events
PARTITION BY DATE(event_date)
AS
SELECT * FROM my_dataset.raw_events;
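Pruning can be illustrated with a toy in-memory table: partitions are keyed by date, and a query that filters on the partition column touches only the matching key (the event rows are invented):

```python
# Toy partitioned table: partition key (event_date) -> rows in that partition.
website_events = {
    "2024-01-01": [{"page": "/home"}, {"page": "/pricing"}],
    "2024-01-02": [{"page": "/docs"}],
    "2024-01-03": [{"page": "/home"}, {"page": "/blog"}, {"page": "/docs"}],
}

def query_partitioned(table, event_date):
    """WHERE event_date = ... : read exactly one partition, skip the rest."""
    rows = table.get(event_date, [])
    partitions_read = 1 if event_date in table else 0
    return rows, partitions_read

rows, scanned = query_partitioned(website_events, "2024-01-02")
print(len(rows), scanned)  # 1 row returned, 1 of 3 partitions read
```

Without the partition key, every partition would have to be scanned; with it, the lookup cost is independent of how many other days of data exist.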

Clustered Tables

Clustering sorts data within partitions by one or more columns. BigQuery skips blocks of data that do not match the query filter, further reducing bytes scanned.

CREATE TABLE my_dataset.orders
PARTITION BY DATE(order_date)
CLUSTER BY region, product_category
AS
SELECT * FROM my_dataset.raw_orders;
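Block skipping can be sketched the same way: within a partition, clustered data sits in sorted blocks, each carrying min/max metadata for the cluster column, and blocks whose range cannot contain the filter value are never read (the block contents here are invented):

```python
# Blocks within one partition, sorted by the cluster column "region".
# Each block records the min/max region value it contains.
blocks = [
    {"min": "APAC", "max": "EMEA", "rows": [{"region": "APAC"}, {"region": "EMEA"}]},
    {"min": "LATAM", "max": "NA",  "rows": [{"region": "LATAM"}, {"region": "NA"}]},
]

def query_clustered(blocks, region):
    """Read only blocks whose [min, max] range can contain the filter value."""
    matched, blocks_read = [], 0
    for b in blocks:
        if b["min"] <= region <= b["max"]:   # block might hold this region
            blocks_read += 1
            matched += [r for r in b["rows"] if r["region"] == region]
    return matched, blocks_read

rows, read = query_clustered(blocks, "NA")
print(len(rows), read)  # 1 matching row, only 1 of 2 blocks read
```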

BigQuery Cost Model

  • On-demand pricing: Pay per TB of data scanned by queries. The first 1 TB per month is free.
  • Capacity pricing (slots): Reserve a fixed number of query processing units (slots) for a monthly or annual commitment. Better for large, consistent workloads.
  • Storage costs are separate — approximately $0.02 per GB/month for active storage.

Tip: Always use SELECT specific_columns instead of SELECT *. In BigQuery's columnar storage, selecting fewer columns scans less data and costs less.
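The tip can be quantified. Because storage is columnar, a query is billed only for the columns it actually reads. A rough calculator (the per-column sizes are invented, and the $6.25-per-TiB figure is an assumed on-demand list price, so check current pricing):

```python
TIB = 1024 ** 4
PRICE_PER_TIB = 6.25  # assumed on-demand list price in USD; verify before relying on it

# Invented per-column sizes (bytes) for a wide events table.
column_bytes = {
    "event": 2 * TIB,
    "user_id": 1 * TIB,
    "timestamp": 1 * TIB,
    "payload": 12 * TIB,  # a huge JSON column most queries never need
}

def scan_cost(columns):
    """On-demand cost of a query that reads only the given columns."""
    scanned = sum(column_bytes[c] for c in columns)
    return round(scanned / TIB * PRICE_PER_TIB, 2)

print(scan_cost(list(column_bytes)))     # SELECT *          -> 100.0
print(scan_cost(["event", "user_id"]))   # two columns only  -> 18.75
```

Dropping the unused `payload` column cuts the bill by more than 80% for the same logical query, which is why column selection matters far more here than in a row-based database.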

Key Takeaways

  • BigQuery is a serverless data warehouse for analyzing billions of rows with SQL.
  • It uses columnar storage optimized for analytical queries, not transactional operations.
  • Data is organized into Projects → Datasets → Tables.
  • Partitioned tables speed up queries and reduce costs by limiting data scanned.
  • Clustered tables further optimize queries filtering on specific columns.
  • On-demand billing charges per TB scanned — avoid SELECT * to save costs.
