Snowflake Building a Data Warehouse Project
Building a real data warehouse in Snowflake means connecting all the concepts you have learned into a working system. This page walks through a complete project: designing the architecture, setting up environments, loading data, transforming it into analytical layers, and serving it to end users. The example uses a fictional e-commerce company called ShopStream, which sells products online and needs a data warehouse to answer business questions.
Project Overview: What ShopStream Needs
ShopStream has three source systems:
- An orders database (PostgreSQL) with customer orders, order items, and payments
- A product catalog API (JSON files) with product details, categories, and pricing
- A clickstream log (CSV files, one per hour) with user behavior on the website
Business questions they need to answer:
- What is our total revenue by region and product category this month?
- Which products have the highest return rates?
- What is the average customer lifetime value by acquisition channel?
- Which customers are at risk of churning?
Step 1: Design the Architecture
Diagram: The Three-Layer Warehouse
Picture a flour mill. Raw wheat comes in from the field (source data). It goes through initial cleaning to remove dirt and debris (raw layer). Then it gets processed into different flour grades (staging layer). Finally, it gets packaged into labeled bags for consumers (presentation layer). A data warehouse works identically.
The standard Snowflake data warehouse uses three layers:
- RAW layer: Data lands here exactly as it came from the source. No transformations. Think of this as the loading dock.
- STAGING layer: Data gets cleaned, standardized, and lightly transformed. Business logic is not applied yet. This is the processing floor.
- PRESENTATION layer: Business-ready data organized for reporting and analysis. Dimensions, facts, and aggregates live here. This is the showroom.
Step 2: Set Up Snowflake Infrastructure
-- Step 2a: Create databases for each layer
CREATE DATABASE shopstream_raw COMMENT = 'Raw data as received from source systems';
CREATE DATABASE shopstream_staging COMMENT = 'Cleaned and standardized data';
CREATE DATABASE shopstream_presentation COMMENT = 'Business-ready analytical data';
CREATE DATABASE shopstream_admin COMMENT = 'Monitoring, logging, and configuration';
-- Step 2b: Create schemas within each database
USE DATABASE shopstream_raw;
CREATE SCHEMA orders;
CREATE SCHEMA products;
CREATE SCHEMA clickstream;
USE DATABASE shopstream_staging;
CREATE SCHEMA orders;
CREATE SCHEMA products;
CREATE SCHEMA clickstream;
USE DATABASE shopstream_presentation;
CREATE SCHEMA dim; -- Dimension tables
CREATE SCHEMA fact; -- Fact tables
CREATE SCHEMA mart; -- Business-specific aggregations
-- Step 2c: Create warehouses for different workloads
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
COMMENT = 'Used for data loading and transformation jobs';
CREATE WAREHOUSE reporting_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE
COMMENT = 'Used by BI tools and analyst queries';
CREATE WAREHOUSE adhoc_wh
WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 120
AUTO_RESUME = TRUE
COMMENT = 'Used for ad-hoc analyst exploration';
-- Step 2d: Create roles following least-privilege principle
CREATE ROLE etl_role COMMENT = 'Data pipeline processes';
CREATE ROLE analyst_role COMMENT = 'Business analysts';
CREATE ROLE data_engineer_role COMMENT = 'Data engineering team';
CREATE ROLE reporting_role COMMENT = 'BI tools and dashboards';
-- Grant warehouse usage
GRANT USAGE ON WAREHOUSE etl_wh TO ROLE etl_role;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE reporting_role;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst_role;
GRANT USAGE ON WAREHOUSE adhoc_wh TO ROLE analyst_role;
-- Grant database access
GRANT USAGE ON DATABASE shopstream_raw TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_staging TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE analyst_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE reporting_role;
-- Grant all database access to data engineers
GRANT USAGE ON DATABASE shopstream_raw TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE shopstream_staging TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE data_engineer_role;Step 3: Create the Raw Landing Tables
USE DATABASE shopstream_raw;
USE SCHEMA orders;
-- Raw orders table - store exactly what comes from PostgreSQL
CREATE TABLE raw_orders (
order_id VARCHAR(50),
customer_id VARCHAR(50),
order_date VARCHAR(30), -- Keep as string; clean in staging
status VARCHAR(30),
total_amount VARCHAR(20), -- Keep as string; validate in staging
currency VARCHAR(10),
region VARCHAR(50),
created_at VARCHAR(30),
updated_at VARCHAR(30),
_loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP, -- Audit column
_source_file VARCHAR(500) -- Track which file/batch this came from
);
-- Raw order items
CREATE TABLE raw_order_items (
item_id VARCHAR(50),
order_id VARCHAR(50),
product_id VARCHAR(50),
quantity VARCHAR(20),
unit_price VARCHAR(20),
discount_pct VARCHAR(20),
_loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
_source_file VARCHAR(500)
);
USE SCHEMA products;
-- Raw product data from JSON API
CREATE TABLE raw_products (
raw_json VARIANT, -- Store entire JSON document
_loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
_source_file VARCHAR(500)
);
USE DATABASE shopstream_raw;
USE SCHEMA clickstream;
-- Raw clickstream events
CREATE TABLE raw_click_events (
event_id VARCHAR(100),
session_id VARCHAR(100),
user_id VARCHAR(100),
event_type VARCHAR(50),
page_url VARCHAR(2000),
event_timestamp VARCHAR(50),
device_type VARCHAR(30),
_loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);Step 4: Create External Stages and Load Raw Data
-- Create an S3 stage for incoming files
CREATE STAGE shopstream_raw.orders.orders_stage
URL = 's3://shopstream-data/orders/'
CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret')
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);
CREATE STAGE shopstream_raw.products.products_stage
URL = 's3://shopstream-data/products/'
CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret')
FILE_FORMAT = (TYPE = 'JSON');
-- Load orders data
COPY INTO shopstream_raw.orders.raw_orders (
order_id, customer_id, order_date, status, total_amount,
currency, region, created_at, updated_at, _source_file
)
FROM (
SELECT
$1, $2, $3, $4, $5, $6, $7, $8, $9,
METADATA$FILENAME
FROM @shopstream_raw.orders.orders_stage
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE'; -- Log errors but don't stop the load
-- Load JSON products data
COPY INTO shopstream_raw.products.raw_products (raw_json, _source_file)
FROM (
SELECT $1, METADATA$FILENAME
FROM @shopstream_raw.products.products_stage
)
FILE_FORMAT = (TYPE = 'JSON')
ON_ERROR = 'CONTINUE';Step 5: Build the Staging Layer
USE DATABASE shopstream_staging;
USE SCHEMA orders;
-- Cleaned orders staging table
CREATE OR REPLACE TABLE stg_orders AS
SELECT
order_id,
customer_id,
TRY_TO_DATE(order_date, 'YYYY-MM-DD') AS order_date,
UPPER(TRIM(status)) AS order_status,
TRY_TO_NUMBER(total_amount, 10, 2) AS total_amount_usd,
UPPER(TRIM(currency)) AS currency,
UPPER(TRIM(region)) AS region,
TRY_TO_TIMESTAMP(created_at) AS created_at,
TRY_TO_TIMESTAMP(updated_at) AS updated_at,
_loaded_at,
_source_file
FROM shopstream_raw.orders.raw_orders
WHERE order_id IS NOT NULL
AND customer_id IS NOT NULL
AND TRY_TO_DATE(order_date, 'YYYY-MM-DD') IS NOT NULL; -- Drop rows with invalid dates
-- Cleaned products from JSON
USE SCHEMA products;
CREATE OR REPLACE TABLE stg_products AS
SELECT
raw_json:product_id::VARCHAR(50) AS product_id,
raw_json:product_name::VARCHAR(200) AS product_name,
UPPER(raw_json:category::VARCHAR(100)) AS category,
UPPER(raw_json:subcategory::VARCHAR(100)) AS subcategory,
raw_json:unit_cost::FLOAT AS unit_cost,
raw_json:list_price::FLOAT AS list_price,
raw_json:is_active::BOOLEAN AS is_active,
raw_json:launch_date::DATE AS launch_date,
_loaded_at
FROM shopstream_raw.products.raw_products
WHERE raw_json:product_id IS NOT NULL;Step 6: Build Dimension Tables
USE DATABASE shopstream_presentation;
USE SCHEMA dim;
-- Date dimension (covers 10 years)
CREATE OR REPLACE TABLE dim_date AS
SELECT
TO_NUMBER(TO_CHAR(d.date_val, 'YYYYMMDD')) AS date_key,
d.date_val AS full_date,
YEAR(d.date_val) AS year,
QUARTER(d.date_val) AS quarter,
MONTH(d.date_val) AS month_num,
MONTHNAME(d.date_val) AS month_name,
DAY(d.date_val) AS day_of_month,
DAYOFWEEK(d.date_val) AS day_of_week,
DAYNAME(d.date_val) AS day_name,
CASE WHEN DAYOFWEEK(d.date_val) IN (0, 6) THEN TRUE ELSE FALSE END AS is_weekend,
'Q' || QUARTER(d.date_val) || '-' || YEAR(d.date_val) AS quarter_label
FROM (
SELECT DATEADD(day, SEQ4(), '2020-01-01')::DATE AS date_val
FROM TABLE(GENERATOR(ROWCOUNT => 3650))
) d;
-- Customer dimension
CREATE OR REPLACE TABLE dim_customer AS
SELECT
customer_id AS customer_key,
customer_id,
customer_name,
email,
UPPER(region) AS region,
UPPER(country) AS country,
acquisition_channel,
first_order_date,
DATEDIFF('day', first_order_date, CURRENT_DATE) AS days_as_customer,
CURRENT_TIMESTAMP AS dw_created_at
FROM shopstream_staging.orders.stg_customers;
-- Product dimension (Type 2 SCD - track historical changes)
CREATE OR REPLACE TABLE dim_product (
product_key INTEGER AUTOINCREMENT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
unit_cost FLOAT,
list_price FLOAT,
is_active BOOLEAN,
dw_start_date DATE DEFAULT CURRENT_DATE,
dw_end_date DATE DEFAULT '9999-12-31',
dw_is_current BOOLEAN DEFAULT TRUE,
dw_created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);Step 7: Build the Fact Table
USE DATABASE shopstream_presentation;
USE SCHEMA fact;
-- Order facts (grain: one row per order item)
CREATE OR REPLACE TABLE fact_order_items AS
SELECT
-- Surrogate keys joining to dimensions
TO_NUMBER(TO_CHAR(o.order_date, 'YYYYMMDD')) AS date_key,
c.customer_key,
p.product_key,
-- Degenerate dimensions (IDs that don't need their own dimension table)
o.order_id,
oi.item_id,
-- Measures
oi.quantity,
oi.unit_price,
oi.discount_pct,
oi.unit_price * oi.quantity AS gross_revenue,
oi.unit_price * oi.quantity * (1 - oi.discount_pct / 100) AS net_revenue,
p.unit_cost * oi.quantity AS total_cost,
(oi.unit_price * oi.quantity * (1 - oi.discount_pct / 100)) - (p.unit_cost * oi.quantity) AS gross_profit,
-- Audit
CURRENT_TIMESTAMP AS dw_loaded_at
FROM shopstream_staging.orders.stg_order_items oi
JOIN shopstream_staging.orders.stg_orders o ON oi.order_id = o.order_id
JOIN shopstream_presentation.dim.dim_customer c ON o.customer_id = c.customer_id
JOIN shopstream_presentation.dim.dim_product p
ON oi.product_id = p.product_id
AND p.dw_is_current = TRUE
WHERE o.order_status = 'COMPLETED';Step 8: Build the Mart Layer (Business Views)
USE DATABASE shopstream_presentation;
USE SCHEMA mart;
-- Monthly revenue by region and category
CREATE OR REPLACE VIEW v_monthly_revenue_by_region AS
SELECT
dd.year,
dd.month_num,
dd.month_name,
dc.region,
dp.category,
SUM(f.net_revenue) AS total_net_revenue,
SUM(f.gross_profit) AS total_gross_profit,
ROUND(SUM(f.gross_profit) / NULLIF(SUM(f.net_revenue), 0) * 100, 1) AS gross_margin_pct,
COUNT(DISTINCT f.order_id) AS total_orders,
COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact.fact_order_items f
JOIN dim.dim_date dd ON f.date_key = dd.date_key
JOIN dim.dim_customer dc ON f.customer_key = dc.customer_key
JOIN dim.dim_product dp ON f.product_key = dp.product_key
GROUP BY 1, 2, 3, 4, 5;
-- Customer lifetime value
CREATE OR REPLACE VIEW v_customer_ltv AS
SELECT
c.customer_id,
c.customer_name,
c.region,
c.acquisition_channel,
c.days_as_customer,
SUM(f.net_revenue) AS lifetime_revenue,
COUNT(DISTINCT f.order_id) AS total_orders,
ROUND(SUM(f.net_revenue) / NULLIF(c.days_as_customer, 0) * 30, 2) AS monthly_revenue_rate,
MAX(dd.full_date) AS last_order_date,
DATEDIFF('day', MAX(dd.full_date), CURRENT_DATE) AS days_since_last_order
FROM fact.fact_order_items f
JOIN dim.dim_customer c ON f.customer_key = c.customer_key
JOIN dim.dim_date dd ON f.date_key = dd.date_key
GROUP BY 1, 2, 3, 4, 5;Step 9: Automate the Pipeline with Tasks
-- Task 1: Load raw data from stage (runs every hour)
CREATE OR REPLACE TASK load_raw_orders
WAREHOUSE = etl_wh
SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
COPY INTO shopstream_raw.orders.raw_orders
FROM @shopstream_raw.orders.orders_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';
-- Task 2: Refresh staging (depends on Task 1)
CREATE OR REPLACE TASK refresh_staging_orders
WAREHOUSE = etl_wh
AFTER load_raw_orders
AS
INSERT INTO shopstream_staging.orders.stg_orders
SELECT
order_id,
customer_id,
TRY_TO_DATE(order_date, 'YYYY-MM-DD'),
UPPER(TRIM(status)),
TRY_TO_NUMBER(total_amount, 10, 2),
UPPER(TRIM(currency)),
UPPER(TRIM(region)),
_loaded_at,
_source_file
FROM shopstream_raw.orders.raw_orders
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM shopstream_staging.orders.stg_orders);
-- Enable the task chain
ALTER TASK refresh_staging_orders RESUME;
ALTER TASK load_raw_orders RESUME;Step 10: Validate and Test the Pipeline
-- Row count reconciliation check
SELECT 'raw_orders' AS layer, COUNT(*) AS row_count FROM shopstream_raw.orders.raw_orders
UNION ALL
SELECT 'stg_orders', COUNT(*) FROM shopstream_staging.orders.stg_orders
UNION ALL
SELECT 'fact_order_items', COUNT(*) FROM shopstream_presentation.fact.fact_order_items;
-- Revenue reconciliation between staging and fact
SELECT
ROUND(SUM(total_amount_usd), 2) AS staging_revenue
FROM shopstream_staging.orders.stg_orders
WHERE order_status = 'COMPLETED';
SELECT
ROUND(SUM(net_revenue), 2) AS fact_revenue
FROM shopstream_presentation.fact.fact_order_items;
-- Check for null foreign keys in fact table (data quality check)
SELECT COUNT(*) AS orphaned_records
FROM shopstream_presentation.fact.fact_order_items
WHERE customer_key IS NULL OR product_key IS NULL OR date_key IS NULL;Key Points Summary
- Structure your warehouse in three layers: Raw → Staging → Presentation. Each layer serves a distinct purpose.
- Land source data in the raw layer with no transformations. Preserve exactly what the source sent, including audit columns like
_loaded_atand_source_file. - Apply cleaning, standardization, and type conversion in the staging layer. Use
TRY_TO_DATEandTRY_TO_NUMBERto handle bad data gracefully. - Build dimension and fact tables in the presentation layer to serve BI tools and analyst queries.
- Create mart views on top of fact tables to serve specific business questions without duplicating data.
- Use Snowflake Tasks with dependencies (AFTER keyword) to chain pipeline steps automatically.
- Set up separate roles and warehouses for ETL, reporting, and ad-hoc workloads to isolate costs and performance.
- Always run row count and revenue reconciliation checks to validate that data moved correctly through each layer.
