Snowflake Building a Data Warehouse Project

Building a real data warehouse in Snowflake means connecting all the concepts you have learned into a working system. This page walks through a complete project: designing the architecture, setting up environments, loading data, transforming it into analytical layers, and serving it to end users. The example uses a fictional e-commerce company called ShopStream, which sells products online and needs a data warehouse to answer business questions.

Project Overview: What ShopStream Needs

ShopStream has three source systems:

An orders database (PostgreSQL) with customer orders, order items, and payments
A product catalog API (JSON files) with product details, categories, and pricing
A clickstream log (CSV files, one per hour) with user behavior on the website

Business questions they need to answer:

What is our total revenue by region and product category this month?
Which products have the highest return rates?
What is the average customer lifetime value by acquisition channel?
Which customers are at risk of churning?

Step 1: Design the Architecture

Diagram: The Three-Layer Warehouse

Picture a flour mill. Raw wheat comes in from the field (source data). It goes through initial cleaning to remove dirt and debris (raw layer). Then it gets processed into different flour grades (staging layer). Finally, it gets packaged into labeled bags for consumers (presentation layer). A data warehouse works identically.

The standard Snowflake data warehouse uses three layers:

RAW layer: Data lands here exactly as it came from the source. No transformations. Think of this as the loading dock.
STAGING layer: Data gets cleaned, standardized, and lightly transformed. Business logic is not applied yet. This is the processing floor.
PRESENTATION layer: Business-ready data organized for reporting and analysis. Dimensions, facts, and aggregates live here. This is the showroom.

Step 2: Set Up Snowflake Infrastructure

-- Step 2a: Create databases for each layer
CREATE DATABASE shopstream_raw COMMENT = 'Raw data as received from source systems';
CREATE DATABASE shopstream_staging COMMENT = 'Cleaned and standardized data';
CREATE DATABASE shopstream_presentation COMMENT = 'Business-ready analytical data';
CREATE DATABASE shopstream_admin COMMENT = 'Monitoring, logging, and configuration';

-- Step 2b: Create schemas within each database
USE DATABASE shopstream_raw;
CREATE SCHEMA orders;
CREATE SCHEMA products;
CREATE SCHEMA clickstream;

USE DATABASE shopstream_staging;
CREATE SCHEMA orders;
CREATE SCHEMA products;
CREATE SCHEMA clickstream;

USE DATABASE shopstream_presentation;
CREATE SCHEMA dim;    -- Dimension tables
CREATE SCHEMA fact;   -- Fact tables
CREATE SCHEMA mart;   -- Business-specific aggregations

-- Step 2c: Create warehouses for different workloads
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  COMMENT = 'Used for data loading and transformation jobs';

CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  COMMENT = 'Used by BI tools and analyst queries';

CREATE WAREHOUSE adhoc_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 120
  AUTO_RESUME = TRUE
  COMMENT = 'Used for ad-hoc analyst exploration';

-- Step 2d: Create roles following least-privilege principle
CREATE ROLE etl_role COMMENT = 'Data pipeline processes';
CREATE ROLE analyst_role COMMENT = 'Business analysts';
CREATE ROLE data_engineer_role COMMENT = 'Data engineering team';
CREATE ROLE reporting_role COMMENT = 'BI tools and dashboards';

-- Grant warehouse usage
GRANT USAGE ON WAREHOUSE etl_wh TO ROLE etl_role;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE reporting_role;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst_role;
GRANT USAGE ON WAREHOUSE adhoc_wh TO ROLE analyst_role;

-- Grant database access
GRANT USAGE ON DATABASE shopstream_raw TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_staging TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE etl_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE analyst_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE reporting_role;

-- Grant all database access to data engineers
GRANT USAGE ON DATABASE shopstream_raw TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE shopstream_staging TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE shopstream_presentation TO ROLE data_engineer_role;

Step 3: Create the Raw Landing Tables

USE DATABASE shopstream_raw;
USE SCHEMA orders;

-- Raw orders table - store exactly what comes from PostgreSQL
CREATE TABLE raw_orders (
    order_id        VARCHAR(50),
    customer_id     VARCHAR(50),
    order_date      VARCHAR(30),        -- Keep as string; clean in staging
    status          VARCHAR(30),
    total_amount    VARCHAR(20),        -- Keep as string; validate in staging
    currency        VARCHAR(10),
    region          VARCHAR(50),
    created_at      VARCHAR(30),
    updated_at      VARCHAR(30),
    _loaded_at      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,  -- Audit column
    _source_file    VARCHAR(500)        -- Track which file/batch this came from
);

-- Raw order items
CREATE TABLE raw_order_items (
    item_id         VARCHAR(50),
    order_id        VARCHAR(50),
    product_id      VARCHAR(50),
    quantity        VARCHAR(20),
    unit_price      VARCHAR(20),
    discount_pct    VARCHAR(20),
    _loaded_at      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
    _source_file    VARCHAR(500)
);

USE SCHEMA products;
-- Raw product data from JSON API
CREATE TABLE raw_products (
    raw_json        VARIANT,            -- Store entire JSON document
    _loaded_at      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
    _source_file    VARCHAR(500)
);

USE DATABASE shopstream_raw;
USE SCHEMA clickstream;
-- Raw clickstream events
CREATE TABLE raw_click_events (
    event_id        VARCHAR(100),
    session_id      VARCHAR(100),
    user_id         VARCHAR(100),
    event_type      VARCHAR(50),
    page_url        VARCHAR(2000),
    event_timestamp VARCHAR(50),
    device_type     VARCHAR(30),
    _loaded_at      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);

Step 4: Create External Stages and Load Raw Data

-- Create an S3 stage for incoming files
CREATE STAGE shopstream_raw.orders.orders_stage
    URL = 's3://shopstream-data/orders/'
    CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret')
    FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);

CREATE STAGE shopstream_raw.products.products_stage
    URL = 's3://shopstream-data/products/'
    CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret')
    FILE_FORMAT = (TYPE = 'JSON');

-- Load orders data
COPY INTO shopstream_raw.orders.raw_orders (
    order_id, customer_id, order_date, status, total_amount, 
    currency, region, created_at, updated_at, _source_file
)
FROM (
    SELECT 
        $1, $2, $3, $4, $5, $6, $7, $8, $9,
        METADATA$FILENAME
    FROM @shopstream_raw.orders.orders_stage
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';  -- Log errors but don't stop the load

-- Load JSON products data
COPY INTO shopstream_raw.products.raw_products (raw_json, _source_file)
FROM (
    SELECT $1, METADATA$FILENAME
    FROM @shopstream_raw.products.products_stage
)
FILE_FORMAT = (TYPE = 'JSON')
ON_ERROR = 'CONTINUE';

Step 5: Build the Staging Layer

USE DATABASE shopstream_staging;
USE SCHEMA orders;

-- Cleaned orders staging table
CREATE OR REPLACE TABLE stg_orders AS
SELECT
    order_id,
    customer_id,
    TRY_TO_DATE(order_date, 'YYYY-MM-DD') AS order_date,
    UPPER(TRIM(status)) AS order_status,
    TRY_TO_NUMBER(total_amount, 10, 2) AS total_amount_usd,
    UPPER(TRIM(currency)) AS currency,
    UPPER(TRIM(region)) AS region,
    TRY_TO_TIMESTAMP(created_at) AS created_at,
    TRY_TO_TIMESTAMP(updated_at) AS updated_at,
    _loaded_at,
    _source_file
FROM shopstream_raw.orders.raw_orders
WHERE order_id IS NOT NULL
  AND customer_id IS NOT NULL
  AND TRY_TO_DATE(order_date, 'YYYY-MM-DD') IS NOT NULL;  -- Drop rows with invalid dates

-- Cleaned products from JSON
USE SCHEMA products;
CREATE OR REPLACE TABLE stg_products AS
SELECT
    raw_json:product_id::VARCHAR(50) AS product_id,
    raw_json:product_name::VARCHAR(200) AS product_name,
    UPPER(raw_json:category::VARCHAR(100)) AS category,
    UPPER(raw_json:subcategory::VARCHAR(100)) AS subcategory,
    raw_json:unit_cost::FLOAT AS unit_cost,
    raw_json:list_price::FLOAT AS list_price,
    raw_json:is_active::BOOLEAN AS is_active,
    raw_json:launch_date::DATE AS launch_date,
    _loaded_at
FROM shopstream_raw.products.raw_products
WHERE raw_json:product_id IS NOT NULL;

Step 6: Build Dimension Tables

USE DATABASE shopstream_presentation;
USE SCHEMA dim;

-- Date dimension (covers 10 years)
CREATE OR REPLACE TABLE dim_date AS
SELECT
    TO_NUMBER(TO_CHAR(d.date_val, 'YYYYMMDD')) AS date_key,
    d.date_val AS full_date,
    YEAR(d.date_val) AS year,
    QUARTER(d.date_val) AS quarter,
    MONTH(d.date_val) AS month_num,
    MONTHNAME(d.date_val) AS month_name,
    DAY(d.date_val) AS day_of_month,
    DAYOFWEEK(d.date_val) AS day_of_week,
    DAYNAME(d.date_val) AS day_name,
    CASE WHEN DAYOFWEEK(d.date_val) IN (0, 6) THEN TRUE ELSE FALSE END AS is_weekend,
    'Q' || QUARTER(d.date_val) || '-' || YEAR(d.date_val) AS quarter_label
FROM (
    SELECT DATEADD(day, SEQ4(), '2020-01-01')::DATE AS date_val
    FROM TABLE(GENERATOR(ROWCOUNT => 3650))
) d;

-- Customer dimension
CREATE OR REPLACE TABLE dim_customer AS
SELECT
    customer_id AS customer_key,
    customer_id,
    customer_name,
    email,
    UPPER(region) AS region,
    UPPER(country) AS country,
    acquisition_channel,
    first_order_date,
    DATEDIFF('day', first_order_date, CURRENT_DATE) AS days_as_customer,
    CURRENT_TIMESTAMP AS dw_created_at
FROM shopstream_staging.orders.stg_customers;

-- Product dimension (Type 2 SCD - track historical changes)
CREATE OR REPLACE TABLE dim_product (
    product_key         INTEGER AUTOINCREMENT PRIMARY KEY,
    product_id          VARCHAR(50),
    product_name        VARCHAR(200),
    category            VARCHAR(100),
    subcategory         VARCHAR(100),
    unit_cost           FLOAT,
    list_price          FLOAT,
    is_active           BOOLEAN,
    dw_start_date       DATE DEFAULT CURRENT_DATE,
    dw_end_date         DATE DEFAULT '9999-12-31',
    dw_is_current       BOOLEAN DEFAULT TRUE,
    dw_created_at       TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);

Step 7: Build the Fact Table

USE DATABASE shopstream_presentation;
USE SCHEMA fact;

-- Order facts (grain: one row per order item)
CREATE OR REPLACE TABLE fact_order_items AS
SELECT
    -- Surrogate keys joining to dimensions
    TO_NUMBER(TO_CHAR(o.order_date, 'YYYYMMDD')) AS date_key,
    c.customer_key,
    p.product_key,
    
    -- Degenerate dimensions (IDs that don't need their own dimension table)
    o.order_id,
    oi.item_id,
    
    -- Measures
    oi.quantity,
    oi.unit_price,
    oi.discount_pct,
    oi.unit_price * oi.quantity AS gross_revenue,
    oi.unit_price * oi.quantity * (1 - oi.discount_pct / 100) AS net_revenue,
    p.unit_cost * oi.quantity AS total_cost,
    (oi.unit_price * oi.quantity * (1 - oi.discount_pct / 100)) - (p.unit_cost * oi.quantity) AS gross_profit,
    
    -- Audit
    CURRENT_TIMESTAMP AS dw_loaded_at
    
FROM shopstream_staging.orders.stg_order_items oi
JOIN shopstream_staging.orders.stg_orders o ON oi.order_id = o.order_id
JOIN shopstream_presentation.dim.dim_customer c ON o.customer_id = c.customer_id
JOIN shopstream_presentation.dim.dim_product p 
    ON oi.product_id = p.product_id 
    AND p.dw_is_current = TRUE
WHERE o.order_status = 'COMPLETED';

Step 8: Build the Mart Layer (Business Views)

USE DATABASE shopstream_presentation;
USE SCHEMA mart;

-- Monthly revenue by region and category
CREATE OR REPLACE VIEW v_monthly_revenue_by_region AS
SELECT
    dd.year,
    dd.month_num,
    dd.month_name,
    dc.region,
    dp.category,
    SUM(f.net_revenue) AS total_net_revenue,
    SUM(f.gross_profit) AS total_gross_profit,
    ROUND(SUM(f.gross_profit) / NULLIF(SUM(f.net_revenue), 0) * 100, 1) AS gross_margin_pct,
    COUNT(DISTINCT f.order_id) AS total_orders,
    COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact.fact_order_items f
JOIN dim.dim_date dd ON f.date_key = dd.date_key
JOIN dim.dim_customer dc ON f.customer_key = dc.customer_key
JOIN dim.dim_product dp ON f.product_key = dp.product_key
GROUP BY 1, 2, 3, 4, 5;

-- Customer lifetime value
CREATE OR REPLACE VIEW v_customer_ltv AS
SELECT
    c.customer_id,
    c.customer_name,
    c.region,
    c.acquisition_channel,
    c.days_as_customer,
    SUM(f.net_revenue) AS lifetime_revenue,
    COUNT(DISTINCT f.order_id) AS total_orders,
    ROUND(SUM(f.net_revenue) / NULLIF(c.days_as_customer, 0) * 30, 2) AS monthly_revenue_rate,
    MAX(dd.full_date) AS last_order_date,
    DATEDIFF('day', MAX(dd.full_date), CURRENT_DATE) AS days_since_last_order
FROM fact.fact_order_items f
JOIN dim.dim_customer c ON f.customer_key = c.customer_key
JOIN dim.dim_date dd ON f.date_key = dd.date_key
GROUP BY 1, 2, 3, 4, 5;

Step 9: Automate the Pipeline with Tasks

-- Task 1: Load raw data from stage (runs every hour)
CREATE OR REPLACE TASK load_raw_orders
    WAREHOUSE = etl_wh
    SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
COPY INTO shopstream_raw.orders.raw_orders
FROM @shopstream_raw.orders.orders_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';

-- Task 2: Refresh staging (depends on Task 1)
CREATE OR REPLACE TASK refresh_staging_orders
    WAREHOUSE = etl_wh
    AFTER load_raw_orders
AS
INSERT INTO shopstream_staging.orders.stg_orders
SELECT
    order_id,
    customer_id,
    TRY_TO_DATE(order_date, 'YYYY-MM-DD'),
    UPPER(TRIM(status)),
    TRY_TO_NUMBER(total_amount, 10, 2),
    UPPER(TRIM(currency)),
    UPPER(TRIM(region)),
    _loaded_at,
    _source_file
FROM shopstream_raw.orders.raw_orders
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM shopstream_staging.orders.stg_orders);

-- Enable the task chain
ALTER TASK refresh_staging_orders RESUME;
ALTER TASK load_raw_orders RESUME;

Step 10: Validate and Test the Pipeline

-- Row count reconciliation check
SELECT 'raw_orders' AS layer, COUNT(*) AS row_count FROM shopstream_raw.orders.raw_orders
UNION ALL
SELECT 'stg_orders', COUNT(*) FROM shopstream_staging.orders.stg_orders
UNION ALL
SELECT 'fact_order_items', COUNT(*) FROM shopstream_presentation.fact.fact_order_items;

-- Revenue reconciliation between staging and fact
SELECT 
    ROUND(SUM(total_amount_usd), 2) AS staging_revenue
FROM shopstream_staging.orders.stg_orders
WHERE order_status = 'COMPLETED';

SELECT 
    ROUND(SUM(net_revenue), 2) AS fact_revenue
FROM shopstream_presentation.fact.fact_order_items;

-- Check for null foreign keys in fact table (data quality check)
SELECT COUNT(*) AS orphaned_records
FROM shopstream_presentation.fact.fact_order_items
WHERE customer_key IS NULL OR product_key IS NULL OR date_key IS NULL;

Key Points Summary

Structure your warehouse in three layers: Raw → Staging → Presentation. Each layer serves a distinct purpose.
Land source data in the raw layer with no transformations. Preserve exactly what the source sent, including audit columns like _loaded_at and _source_file.
Apply cleaning, standardization, and type conversion in the staging layer. Use TRY_TO_DATE and TRY_TO_NUMBER to handle bad data gracefully.
Build dimension and fact tables in the presentation layer to serve BI tools and analyst queries.
Create mart views on top of fact tables to serve specific business questions without duplicating data.
Use Snowflake Tasks with dependencies (AFTER keyword) to chain pipeline steps automatically.
Set up separate roles and warehouses for ETL, reporting, and ad-hoc workloads to isolate costs and performance.
Always run row count and revenue reconciliation checks to validate that data moved correctly through each layer.

Previous lessons

Back to courses

Next lessons