Snowflake Data Types and Semi-Structured

Snowflake handles both traditional structured data (rows and columns with fixed types) and modern semi-structured data (JSON, XML, Avro, Parquet, ORC) inside the same platform. The VARIANT data type is the bridge between these two worlds. It stores any value — a simple number, a nested JSON object, or an array of thousands of elements — in a single column. This topic explains every Snowflake data type, when to use each one, and how to work confidently with JSON stored in VARIANT columns.

Why Semi-Structured Data Matters

Modern applications produce data in JSON format constantly. A web analytics platform sends click events as JSON. A mobile app logs user actions as JSON arrays. An e-commerce checkout API returns order confirmations as nested JSON with arrays of line items inside an order object inside a customer object. Forcing all that into a rigid relational schema requires extensive ETL work every time the JSON structure changes.

Snowflake's VARIANT type stores JSON as-is. You load it without schema definition and query specific fields by name using dot notation — even when the structure varies from row to row. This flexibility is the reason Snowflake has become a dominant platform for data teams that deal with API and event data.

Complete Data Type Reference

Numeric Types

TYPE                    RANGE / PRECISION           BEST USE
----                    -----------------           --------
NUMBER(p, s)            Up to 38 digits total       Exact decimal math (money, ratios)
DECIMAL(p, s)           Same as NUMBER              Alias — interchangeable
INT / INTEGER           -2^63 to 2^63-1 (64-bit)   Whole numbers, IDs, counts
BIGINT                  Same as INT in Snowflake     Large IDs, timestamps as int
SMALLINT                -32,768 to 32,767           Small range integers
FLOAT / FLOAT4 / FLOAT8 ~15 significant digits       Scientific calculations
DOUBLE                  Same as FLOAT8              Alias — scientific data
REAL                    Same as FLOAT4              Alias

Example choices:
  Customer ID:          INT
  Price / Amount:       NUMBER(12, 2)   -- 10 digits before decimal, 2 after
  Percentage:           NUMBER(5, 4)    -- e.g., 0.9875 = 98.75%
  Scientific reading:   FLOAT
  Latitude/Longitude:   FLOAT           -- decimal degrees need floating point

String Types

TYPE              MAX LENGTH     NOTE
----              ----------     ----
VARCHAR(n)        16 MB          Variable-length text, n chars max
STRING            16 MB          Alias for VARCHAR with no limit specified
TEXT              16 MB          Alias for VARCHAR with no limit specified
CHAR(n)           Fixed n chars  Pads with spaces to fill length
NCHAR(n)          Fixed n chars  Unicode version of CHAR
NVARCHAR(n)       16 MB          Alias for VARCHAR

Snowflake stores all string types in Unicode (UTF-8) internally.
CHAR and VARCHAR behave identically in storage — no performance difference.
VARCHAR(255) and VARCHAR(16000000) consume the same storage for a 10-char value.
Prefer VARCHAR over CHAR to avoid unexpected trailing space issues.

Date and Time Types

TYPE              STORES                      TIMEZONE AWARE?
----              ------                      ---------------
DATE              Calendar date only           No
TIME              Time of day (HH:MI:SS.nnnnnnn) No
TIMESTAMP_NTZ     Date + time, no timezone     No  (NTZ = No TimeZone)
TIMESTAMP_LTZ     Date + time, local timezone  Yes (LTZ = Local TimeZone)
TIMESTAMP_TZ      Date + time + timezone offset Yes (TZ = TimeZone preserved)

TIMESTAMP_NTZ is the most commonly used for data engineering.
  - Stores values as-is, no timezone conversion
  - Predictable behaviour across sessions and users in different regions

TIMESTAMP_LTZ converts to/from the session's LOCAL_TIME_ZONE setting.
  - Values display differently depending on which user's session reads them
  - Can cause confusion in global teams

TIMESTAMP_TZ stores the original timezone offset alongside the timestamp.
  - Best when you need to preserve the exact timezone of the original event
  - Example: 2024-06-15 14:30:00 +05:30 (India Standard Time preserved)

Boolean Type

BOOLEAN accepts: TRUE, FALSE, NULL
String equivalents: 'TRUE'/'FALSE', 'YES'/'NO', 'ON'/'OFF', '1'/'0'

SELECT IFF(is_active = TRUE, 'Active', 'Inactive') FROM CUSTOMERS;
SELECT * FROM CUSTOMERS WHERE is_active;    -- implicit = TRUE
SELECT * FROM CUSTOMERS WHERE NOT is_active;

Binary Type

BINARY(n)    Fixed-length raw bytes
VARBINARY    Variable-length raw bytes

Used for: storing encrypted values, checksums, raw file bytes
Rarely needed in analytics workloads — mostly used in application data pipelines

Semi-Structured Types: VARIANT, ARRAY, OBJECT

TYPE        STORES                              EXAMPLE VALUE
----        ------                              -------------
VARIANT     Any value: scalar, array, or object {"name":"Alice","age":30}
ARRAY       Ordered list of values              ["red","green","blue"]
OBJECT      Key-value pairs                     {"city":"Delhi","zip":"110001"}

All three are stored internally as VARIANT.
ARRAY and OBJECT are convenience types with type-checking at insert time.
In practice, most teams use VARIANT for all semi-structured data.

Loading and Querying JSON in VARIANT Columns

Consider a table of customer events captured from a web application. Each event is a JSON object with varying fields depending on the event type.

-- Sample JSON data (each line is one event):
{
  "event_id": "ev_001",
  "user_id": 5042,
  "event_type": "purchase",
  "timestamp": "2024-06-15T14:30:00Z",
  "product": {
    "id": "PROD-998",
    "name": "Wireless Headphones",
    "category": "Electronics"
  },
  "payment": {
    "method": "credit_card",
    "amount": 149.99,
    "currency": "USD"
  },
  "tags": ["sale", "member-discount", "electronics"]
}

-- Create table with VARIANT column
CREATE TABLE WEB_EVENTS (
  event_raw    VARIANT,
  loaded_at    TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- After loading with COPY INTO, query specific fields:
SELECT
  event_raw:event_id::VARCHAR           AS event_id,
  event_raw:user_id::INT                AS user_id,
  event_raw:event_type::VARCHAR         AS event_type,
  event_raw:timestamp::TIMESTAMP_NTZ    AS event_time,
  event_raw:product:id::VARCHAR         AS product_id,
  event_raw:product:name::VARCHAR       AS product_name,
  event_raw:product:category::VARCHAR   AS category,
  event_raw:payment:amount::DECIMAL(10,2) AS amount,
  event_raw:payment:currency::VARCHAR   AS currency,
  event_raw:tags[0]::VARCHAR            AS first_tag,
  event_raw:tags[1]::VARCHAR            AS second_tag
FROM WEB_EVENTS
WHERE event_raw:event_type::VARCHAR = 'purchase'
  AND event_raw:payment:amount::DECIMAL(10,2) > 100
LIMIT 20;

VARIANT Notation Guide

NOTATION                    ACCESSES
--------                    --------
col:key                     Top-level key in a JSON object
col:key:nested_key          Nested key (one level deep)
col:key:level1:level2       Deeply nested key
col:array_key[0]            First element of a JSON array
col:array_key[2]            Third element of a JSON array
col:key::VARCHAR            Cast to VARCHAR (always cast before comparing)
col:key::INT                Cast to integer
col:key::DECIMAL(10,2)      Cast to decimal
col:key::TIMESTAMP_NTZ      Cast to timestamp
col:key::BOOLEAN            Cast to boolean

IMPORTANT: Always cast VARIANT values before using them in:
  - WHERE conditions
  - JOIN conditions
  - Mathematical calculations
  - ORDER BY clauses
Without a cast, comparisons use VARIANT ordering which may surprise you.

Flattening Arrays: The FLATTEN Function

When a VARIANT column contains arrays, you often need to expand each array element into its own row. FLATTEN converts an array inside a JSON field into separate rows — one row per array element.

-- Sample data: each order has an array of line items
{
  "order_id": "ORD-001",
  "customer": "Alice",
  "items": [
    {"product": "Headphones", "qty": 1, "price": 149.99},
    {"product": "Phone Case",  "qty": 2, "price": 19.99},
    {"product": "Cable",       "qty": 3, "price": 9.99}
  ]
}

FLATTEN DIAGRAM
================

Before FLATTEN:
  order_id | items (array with 3 objects)
  ORD-001  | [{headphones},{phone case},{cable}]

After FLATTEN:
  order_id | item_index | product      | qty | price
  ORD-001  | 0          | Headphones   | 1   | 149.99
  ORD-001  | 1          | Phone Case   | 2   |  19.99
  ORD-001  | 2          | Cable        | 3   |   9.99

-- SQL to achieve this:
SELECT
  o.event_raw:order_id::VARCHAR       AS order_id,
  o.event_raw:customer::VARCHAR       AS customer,
  f.index                             AS item_position,
  f.value:product::VARCHAR            AS product_name,
  f.value:qty::INT                    AS quantity,
  f.value:price::DECIMAL(10,2)        AS unit_price,
  (f.value:qty::INT * f.value:price::DECIMAL(10,2)) AS line_total
FROM ORDERS_RAW o,
  LATERAL FLATTEN(INPUT => o.event_raw:items) f
ORDER BY order_id, item_position;

Parsing and Building JSON with Snowflake Functions

-- Parse a JSON string stored as VARCHAR into VARIANT
SELECT PARSE_JSON('{"name":"Alice","score":98}') AS parsed;

-- Convert a VARIANT value back to a JSON string
SELECT TO_JSON(event_raw) AS json_string FROM WEB_EVENTS LIMIT 5;

-- Check the type of a value inside VARIANT
SELECT TYPEOF(event_raw:payment:amount) FROM WEB_EVENTS LIMIT 5;
-- Returns: 'DECIMAL', 'VARCHAR', 'BOOLEAN', 'NULL', 'ARRAY', 'OBJECT'

-- Build a JSON object from relational columns
SELECT OBJECT_CONSTRUCT(
  'customer_id', customer_id,
  'name',        first_name || ' ' || last_name,
  'email',       email,
  'country',     country
) AS customer_json
FROM CUSTOMERS
LIMIT 5;

-- Build a JSON array from column values
SELECT ARRAY_CONSTRUCT(1, 2, 3, 'hello', TRUE) AS my_array;

-- Check if a key exists in a VARIANT object
SELECT *
FROM WEB_EVENTS
WHERE event_raw:payment IS NOT NULL;

-- Count elements in a JSON array
SELECT
  event_raw:order_id::VARCHAR  AS order_id,
  ARRAY_SIZE(event_raw:items)  AS item_count
FROM ORDERS_RAW;

Schema Detection: Automatically Inferring Column Types from Semi-Structured Files

When you want to convert a VARIANT-loaded dataset into a proper relational table, Snowflake's INFER_SCHEMA function scans your staged files and recommends column names and data types automatically.

-- Automatically detect schema from Parquet or JSON files in a stage
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION     => '@s3_orders_stage/events/',
    FILE_FORMAT  => 'json_standard'
  )
);

-- Result shows:
-- COLUMN_NAME       TYPE          NULLABLE
-- event_id          TEXT          Y
-- user_id           FIXED         Y
-- event_type        TEXT          Y
-- timestamp         TEXT          Y
-- payment.amount    REAL          Y

-- Use the detected schema to create a table automatically
CREATE TABLE EVENTS_STRUCTURED
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION    => '@s3_orders_stage/events/',
        FILE_FORMAT => 'json_standard'
      )
    )
  );

Choosing Between VARIANT and Relational Columns

USE VARIANT WHEN                        USE RELATIONAL COLUMNS WHEN
-----------------------------           ----------------------------
JSON structure changes frequently       Schema is stable and well-defined
You need fast ingest without ETL        You need fast analytical queries
Fields vary per row (sparse data)       Most rows have all fields
You are prototyping or exploring        You are building production tables
Source is API/event/clickstream         Source is transactional database
You need to store the raw original      You need column-level compression

Key Points

Snowflake supports numeric, string, date/time, boolean, binary, and semi-structured data types in the same table
Use NUMBER(p,s) for monetary values to avoid floating-point rounding errors; use FLOAT for scientific calculations
TIMESTAMP_NTZ is the safest timestamp type for data engineering — it stores values exactly as given with no timezone conversion
VARIANT stores any JSON, XML, or semi-structured value in a single column — load first, define structure later
Access nested JSON fields with colon notation (col:key:nested) and array elements with bracket notation (col:array[0])
Always cast VARIANT values to a specific type before comparing, joining, or performing arithmetic
FLATTEN with LATERAL expands JSON arrays into separate rows — essential for processing line items, tags, or event lists
INFER_SCHEMA automatically detects column types from staged files, saving hours of manual schema design

Previous lesson

Back to course

Next lesson