GCP BigQuery
BigQuery is GCP's fully managed, serverless data warehouse designed for analyzing massive datasets — from millions to billions of rows — using standard SQL. There are no servers to configure, no indexes to maintain, and no database tuning required. Upload data and start querying in seconds.
A data warehouse is a central repository for large amounts of structured data used for reporting and analysis. While Cloud SQL is optimized for transactional workloads (many small reads and writes), BigQuery is optimized for analytical workloads (scanning enormous datasets to answer complex business questions).
BigQuery vs Cloud SQL
| Feature | Cloud SQL | BigQuery |
|---|---|---|
| Use Case | Transactional (OLTP) — app database | Analytical (OLAP) — reporting, BI |
| Query Speed | Fast for small queries | Fast for queries scanning billions of rows |
| Data Size | GBs to TBs | TBs to PBs |
| Storage Model | Row-based | Column-based (columnar storage) |
| Updates | Frequent INSERT/UPDATE/DELETE | Batch loads and streaming inserts |
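The row-based vs column-based distinction drives most of this performance gap. Here is a minimal illustrative sketch in Python (toy in-memory data, not BigQuery's actual storage engine): an aggregate over one column touches every field in a row layout, but only a single array in a columnar layout.

```python
# Toy comparison of row-based vs columnar layouts (illustrative only).
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 45.5},
]

# Row-based (Cloud SQL style): computing AVG(amount) touches every field
# of every row, because whole rows are stored together.
row_scan = sum(len(r) for r in rows)  # fields read: 9

# Columnar (BigQuery style): each column is stored as its own array,
# so an aggregate over `amount` reads only that one array.
columns = {
    "order_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 80.0, 45.5],
}
col_scan = len(columns["amount"])  # values read: 3

avg_amount = sum(columns["amount"]) / len(columns["amount"])
print(row_scan, col_scan, round(avg_amount, 2))  # 9 3 81.83
```

With three columns the difference is 3x; with a wide production table it can be 100x, which is why the same query is so much cheaper on columnar storage.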
BigQuery Architecture
Data Sources
├── CSV / JSON files from Cloud Storage
├── Streaming inserts from applications
├── Google Sheets
├── Other GCP services (Pub/Sub, Dataflow)
└── External databases
│
▼
BigQuery Storage (Colossus — Google's distributed file system)
│
▼
BigQuery Query Engine (Dremel — distributed SQL engine)
│
▼
Results in seconds (even for billions of rows)
BigQuery Data Organization
Project: my-analytics-project
└── Dataset: sales_data
├── Table: orders_2023
├── Table: orders_2024
├── Table: customers
└── View: monthly_revenue (virtual table, defined by a query)
- Project: The top-level container for billing and access.
- Dataset: A collection of tables, similar to a schema in a traditional database.
- Table: Structured data with rows and columns. Tables can be native (stored in BigQuery) or external (data stays in Cloud Storage); native tables can additionally be partitioned.
Running a Query in BigQuery
Via the Console
- Go to BigQuery → SQL Workspace
- Click the + tab to open a new query editor
- Enter a SQL query and click Run
Example – Querying a Public Dataset
BigQuery provides many public datasets for learning. This query analyzes the NYC taxi rides dataset:
SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS hour_of_day,
  COUNT(*) AS total_rides,
  ROUND(AVG(fare_amount), 2) AS avg_fare
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
WHERE fare_amount > 0
  AND trip_distance > 0
GROUP BY hour_of_day
ORDER BY total_rides DESC
LIMIT 10;
Result (sample):
+──────────────+─────────────+──────────+ │ hour_of_day │ total_rides │ avg_fare │ +──────────────+─────────────+──────────+ │ 18 │ 1,240,832 │ 15.20 │ │ 17 │ 1,189,456 │ 14.75 │ │ 19 │ 1,102,109 │ 15.90 │ +──────────────+─────────────+──────────+
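To make the aggregation above concrete, here is an in-memory Python equivalent of the same filter, group-by, and ordering, run over a few made-up trips (hypothetical sample data, not the actual public dataset):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample trips standing in for the public taxi dataset.
trips = [
    {"pickup_datetime": "2022-03-01 18:05:00", "fare_amount": 14.0, "trip_distance": 2.1},
    {"pickup_datetime": "2022-03-01 18:40:00", "fare_amount": 16.0, "trip_distance": 3.0},
    {"pickup_datetime": "2022-03-01 07:15:00", "fare_amount": 9.0,  "trip_distance": 1.2},
    {"pickup_datetime": "2022-03-01 18:55:00", "fare_amount": 0.0,  "trip_distance": 0.0},  # filtered out
]

# WHERE fare_amount > 0 AND trip_distance > 0
valid = [t for t in trips if t["fare_amount"] > 0 and t["trip_distance"] > 0]

# GROUP BY EXTRACT(HOUR FROM pickup_datetime)
by_hour = defaultdict(list)
for t in valid:
    hour = datetime.strptime(t["pickup_datetime"], "%Y-%m-%d %H:%M:%S").hour
    by_hour[hour].append(t["fare_amount"])

# COUNT(*), ROUND(AVG(fare_amount), 2), ORDER BY total_rides DESC
stats = sorted(
    ((h, len(f), round(sum(f) / len(f), 2)) for h, f in by_hour.items()),
    key=lambda x: x[1],
    reverse=True,
)
print(stats)  # [(18, 2, 15.0), (7, 1, 9.0)]
```

The difference is scale: BigQuery runs this same logic across billions of rows in parallel via Dremel, instead of one Python process.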
Loading Data into BigQuery
From Cloud Storage (Batch Load)
# Load a CSV file from Cloud Storage into a BigQuery table
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --autodetect \
  my_dataset.customers \
  gs://my-bucket/customers.csv
Streaming Insert (Real-Time)
Applications can insert rows into BigQuery in real time using the Streaming API. Rows are available for querying within seconds.
# Python example — streaming insert
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"

rows = [
    {"event": "page_view", "user_id": "u123", "timestamp": "2024-01-15T10:30:00"},
    {"event": "purchase", "user_id": "u456", "timestamp": "2024-01-15T10:31:00"},
]

errors = client.insert_rows_json(table_id, rows)
if not errors:
    print("Rows inserted successfully.")
Partitioned Tables
Partitioning divides a large table into smaller segments (partitions) based on a date or integer column. BigQuery only reads the relevant partitions for a query — dramatically reducing cost and query time.
Table: website_events (partitioned by event_date)
├── Partition: 2024-01-01 (Jan 1 data)
├── Partition: 2024-01-02 (Jan 2 data)
├── Partition: 2024-01-03 (Jan 3 data)
└── ...

Query: WHERE event_date = '2024-01-02'
→ BigQuery only reads the 2024-01-02 partition
→ Skips all other partitions = faster and cheaper
CREATE TABLE my_dataset.website_events
PARTITION BY DATE(event_date) AS
SELECT * FROM my_dataset.raw_events;
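Partition pruning can be pictured as a dictionary keyed by date: a query filtered on the partition column looks up one key instead of scanning every bucket. A toy Python sketch of that idea (not BigQuery's actual planner):

```python
# Toy partitioned table: one bucket of rows per event_date.
partitions = {
    "2024-01-01": [{"event": "view", "user": "u1"}, {"event": "click", "user": "u2"}],
    "2024-01-02": [{"event": "view", "user": "u3"}],
    "2024-01-03": [{"event": "purchase", "user": "u4"}],
}

def query(event_date):
    """WHERE event_date = ... reads exactly one partition."""
    scanned = partitions.get(event_date, [])
    return len(scanned), scanned

rows_scanned, results = query("2024-01-02")
total_rows = sum(len(p) for p in partitions.values())
print(rows_scanned, "of", total_rows, "rows scanned")  # 1 of 4 rows scanned
```

Since on-demand billing is per byte scanned, pruning three of four partitions cuts both latency and cost proportionally.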
Clustered Tables
Clustering sorts data within partitions by one or more columns. BigQuery skips blocks of data that do not match the query filter, further reducing bytes scanned.
CREATE TABLE my_dataset.orders
PARTITION BY DATE(order_date)
CLUSTER BY region, product_category AS
SELECT * FROM my_dataset.raw_orders;
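Because clustering keeps rows sorted, each storage block can carry min/max metadata for the clustering column, and blocks whose range cannot match the filter are skipped entirely. A toy sketch of that block-skipping idea (illustrative data and block sizes, not BigQuery internals):

```python
# Toy clustered partition: rows sorted by region, split into blocks,
# each block carrying min/max metadata for the clustering column.
blocks = [
    {"min": "east", "max": "east", "rows": [("east", "books"), ("east", "toys")]},
    {"min": "north", "max": "south", "rows": [("north", "toys"), ("south", "books")]},
    {"min": "west", "max": "west", "rows": [("west", "games")]},
]

def scan(region):
    """Skip any block whose [min, max] range cannot contain `region`."""
    blocks_read = 0
    matches = []
    for b in blocks:
        if b["min"] <= region <= b["max"]:
            blocks_read += 1
            matches.extend(r for r in b["rows"] if r[0] == region)
    return blocks_read, matches

blocks_read, matches = scan("west")
print(blocks_read, matches)  # 1 [('west', 'games')]
```

A filter on `region = 'west'` reads one block out of three; on an unsorted table, every block would have to be read because any of them might contain a matching row.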
BigQuery Cost Model
- On-demand pricing: Pay per TB of data scanned by queries. The first 1 TB per month is free.
- Capacity pricing (slots): Reserve a fixed number of query processing units (slots) for a monthly or annual commitment. Better for large, consistent workloads.
- Storage costs are separate — approximately $0.02 per GB/month for active storage.
Tip: Always use SELECT specific_columns instead of SELECT *. In BigQuery's columnar storage, selecting fewer columns scans less data and costs less.
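The savings from selecting specific columns are easy to quantify: in columnar storage, bytes scanned is roughly rows times the width of only the columns you read. A back-of-the-envelope sketch (the table, column widths, and per-TB rate below are hypothetical; check current BigQuery pricing):

```python
# Back-of-the-envelope query cost under on-demand pricing.
PRICE_PER_TB = 6.25  # assumed USD per TB scanned (hypothetical; verify current rate)
TB = 1024 ** 4

def query_cost(rows, bytes_per_column, selected_columns):
    """Columnar storage: cost depends only on the columns actually read."""
    bytes_scanned = rows * sum(bytes_per_column[c] for c in selected_columns)
    return bytes_scanned, bytes_scanned / TB * PRICE_PER_TB

# Hypothetical 10-billion-row table with four columns.
widths = {"user_id": 8, "event": 16, "payload": 200, "ts": 8}
rows = 10_000_000_000

# SELECT * scans every column; SELECT user_id, ts scans two narrow ones.
star_bytes, star_cost = query_cost(rows, widths, widths.keys())
narrow_bytes, narrow_cost = query_cost(rows, widths, ["user_id", "ts"])
print(f"SELECT *: {star_cost:.2f} USD, SELECT user_id, ts: {narrow_cost:.2f} USD")
```

Here the wide `payload` column dominates: dropping it from the select list cuts the scan from 2.32 TB of raw bytes to 0.16 TB, a roughly 14x cost reduction for the same query shape.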
Key Takeaways
- BigQuery is a serverless data warehouse for analyzing billions of rows with SQL.
- It uses columnar storage optimized for analytical queries, not transactional operations.
- Data is organized into Projects → Datasets → Tables.
- Partitioned tables speed up queries and reduce costs by limiting data scanned.
- Clustered tables further optimize queries filtering on specific columns.
- On-demand billing charges per TB scanned — avoid SELECT * to save costs.
