Snowflake Data Types and Semi-Structured
Snowflake handles both traditional structured data (rows and columns with fixed types) and modern semi-structured data (JSON, XML, Avro, Parquet, ORC) inside the same platform. The VARIANT data type is the bridge between these two worlds. It stores any value — a simple number, a nested JSON object, or an array of thousands of elements — in a single column. This topic explains every Snowflake data type, when to use each one, and how to work confidently with JSON stored in VARIANT columns.
Why Semi-Structured Data Matters
Modern applications produce data in JSON format constantly. A web analytics platform sends click events as JSON. A mobile app logs user actions as JSON arrays. An e-commerce checkout API returns order confirmations as nested JSON with arrays of line items inside an order object inside a customer object. Forcing all that into a rigid relational schema requires extensive ETL work every time the JSON structure changes.
Snowflake's VARIANT type stores JSON as-is. You load it without schema definition and query specific fields by name using dot notation — even when the structure varies from row to row. This flexibility is the reason Snowflake has become a dominant platform for data teams that deal with API and event data.
Complete Data Type Reference
Numeric Types
TYPE RANGE / PRECISION BEST USE ---- ----------------- -------- NUMBER(p, s) Up to 38 digits total Exact decimal math (money, ratios) DECIMAL(p, s) Same as NUMBER Alias — interchangeable INT / INTEGER -2^63 to 2^63-1 (64-bit) Whole numbers, IDs, counts BIGINT Same as INT in Snowflake Large IDs, timestamps as int SMALLINT -32,768 to 32,767 Small range integers FLOAT / FLOAT4 / FLOAT8 ~15 significant digits Scientific calculations DOUBLE Same as FLOAT8 Alias — scientific data REAL Same as FLOAT4 Alias Example choices: Customer ID: INT Price / Amount: NUMBER(12, 2) -- 10 digits before decimal, 2 after Percentage: NUMBER(5, 4) -- e.g., 0.9875 = 98.75% Scientific reading: FLOAT Latitude/Longitude: FLOAT -- decimal degrees need floating point
String Types
TYPE MAX LENGTH NOTE ---- ---------- ---- VARCHAR(n) 16 MB Variable-length text, n chars max STRING 16 MB Alias for VARCHAR with no limit specified TEXT 16 MB Alias for VARCHAR with no limit specified CHAR(n) Fixed n chars Pads with spaces to fill length NCHAR(n) Fixed n chars Unicode version of CHAR NVARCHAR(n) 16 MB Alias for VARCHAR Snowflake stores all string types in Unicode (UTF-8) internally. CHAR and VARCHAR behave identically in storage — no performance difference. VARCHAR(255) and VARCHAR(16000000) consume the same storage for a 10-char value. Prefer VARCHAR over CHAR to avoid unexpected trailing space issues.
Date and Time Types
TYPE STORES TIMEZONE AWARE? ---- ------ --------------- DATE Calendar date only No TIME Time of day (HH:MI:SS.nnnnnnn) No TIMESTAMP_NTZ Date + time, no timezone No (NTZ = No TimeZone) TIMESTAMP_LTZ Date + time, local timezone Yes (LTZ = Local TimeZone) TIMESTAMP_TZ Date + time + timezone offset Yes (TZ = TimeZone preserved) TIMESTAMP_NTZ is the most commonly used for data engineering. - Stores values as-is, no timezone conversion - Predictable behaviour across sessions and users in different regions TIMESTAMP_LTZ converts to/from the session's LOCAL_TIME_ZONE setting. - Values display differently depending on which user's session reads them - Can cause confusion in global teams TIMESTAMP_TZ stores the original timezone offset alongside the timestamp. - Best when you need to preserve the exact timezone of the original event - Example: 2024-06-15 14:30:00 +05:30 (India Standard Time preserved)
Boolean Type
BOOLEAN accepts: TRUE, FALSE, NULL String equivalents: 'TRUE'/'FALSE', 'YES'/'NO', 'ON'/'OFF', '1'/'0' SELECT IFF(is_active = TRUE, 'Active', 'Inactive') FROM CUSTOMERS; SELECT * FROM CUSTOMERS WHERE is_active; -- implicit = TRUE SELECT * FROM CUSTOMERS WHERE NOT is_active;
Binary Type
BINARY(n) Fixed-length raw bytes VARBINARY Variable-length raw bytes Used for: storing encrypted values, checksums, raw file bytes Rarely needed in analytics workloads — mostly used in application data pipelines
Semi-Structured Types: VARIANT, ARRAY, OBJECT
TYPE STORES EXAMPLE VALUE
---- ------ -------------
VARIANT Any value: scalar, array, or object {"name":"Alice","age":30}
ARRAY Ordered list of values ["red","green","blue"]
OBJECT Key-value pairs {"city":"Delhi","zip":"110001"}
All three are stored internally as VARIANT.
ARRAY and OBJECT are convenience types with type-checking at insert time.
In practice, most teams use VARIANT for all semi-structured data.
Loading and Querying JSON in VARIANT Columns
Consider a table of customer events captured from a web application. Each event is a JSON object with varying fields depending on the event type.
-- Sample JSON data (each line is one event):
{
"event_id": "ev_001",
"user_id": 5042,
"event_type": "purchase",
"timestamp": "2024-06-15T14:30:00Z",
"product": {
"id": "PROD-998",
"name": "Wireless Headphones",
"category": "Electronics"
},
"payment": {
"method": "credit_card",
"amount": 149.99,
"currency": "USD"
},
"tags": ["sale", "member-discount", "electronics"]
}
-- Create table with VARIANT column
CREATE TABLE WEB_EVENTS (
event_raw VARIANT,
loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- After loading with COPY INTO, query specific fields:
SELECT
event_raw:event_id::VARCHAR AS event_id,
event_raw:user_id::INT AS user_id,
event_raw:event_type::VARCHAR AS event_type,
event_raw:timestamp::TIMESTAMP_NTZ AS event_time,
event_raw:product:id::VARCHAR AS product_id,
event_raw:product:name::VARCHAR AS product_name,
event_raw:product:category::VARCHAR AS category,
event_raw:payment:amount::DECIMAL(10,2) AS amount,
event_raw:payment:currency::VARCHAR AS currency,
event_raw:tags[0]::VARCHAR AS first_tag,
event_raw:tags[1]::VARCHAR AS second_tag
FROM WEB_EVENTS
WHERE event_raw:event_type::VARCHAR = 'purchase'
AND event_raw:payment:amount::DECIMAL(10,2) > 100
LIMIT 20;
VARIANT Notation Guide
NOTATION ACCESSES -------- -------- col:key Top-level key in a JSON object col:key:nested_key Nested key (one level deep) col:key:level1:level2 Deeply nested key col:array_key[0] First element of a JSON array col:array_key[2] Third element of a JSON array col:key::VARCHAR Cast to VARCHAR (always cast before comparing) col:key::INT Cast to integer col:key::DECIMAL(10,2) Cast to decimal col:key::TIMESTAMP_NTZ Cast to timestamp col:key::BOOLEAN Cast to boolean IMPORTANT: Always cast VARIANT values before using them in: - WHERE conditions - JOIN conditions - Mathematical calculations - ORDER BY clauses Without a cast, comparisons use VARIANT ordering which may surprise you.
Flattening Arrays: The FLATTEN Function
When a VARIANT column contains arrays, you often need to expand each array element into its own row. FLATTEN converts an array inside a JSON field into separate rows — one row per array element.
-- Sample data: each order has an array of line items
{
"order_id": "ORD-001",
"customer": "Alice",
"items": [
{"product": "Headphones", "qty": 1, "price": 149.99},
{"product": "Phone Case", "qty": 2, "price": 19.99},
{"product": "Cable", "qty": 3, "price": 9.99}
]
}
FLATTEN DIAGRAM
================
Before FLATTEN:
order_id | items (array with 3 objects)
ORD-001 | [{headphones},{phone case},{cable}]
After FLATTEN:
order_id | item_index | product | qty | price
ORD-001 | 0 | Headphones | 1 | 149.99
ORD-001 | 1 | Phone Case | 2 | 19.99
ORD-001 | 2 | Cable | 3 | 9.99
-- SQL to achieve this:
SELECT
o.event_raw:order_id::VARCHAR AS order_id,
o.event_raw:customer::VARCHAR AS customer,
f.index AS item_position,
f.value:product::VARCHAR AS product_name,
f.value:qty::INT AS quantity,
f.value:price::DECIMAL(10,2) AS unit_price,
(f.value:qty::INT * f.value:price::DECIMAL(10,2)) AS line_total
FROM ORDERS_RAW o,
LATERAL FLATTEN(INPUT => o.event_raw:items) f
ORDER BY order_id, item_position;
Parsing and Building JSON with Snowflake Functions
-- Parse a JSON string stored as VARCHAR into VARIANT
SELECT PARSE_JSON('{"name":"Alice","score":98}') AS parsed;
-- Convert a VARIANT value back to a JSON string
SELECT TO_JSON(event_raw) AS json_string FROM WEB_EVENTS LIMIT 5;
-- Check the type of a value inside VARIANT
SELECT TYPEOF(event_raw:payment:amount) FROM WEB_EVENTS LIMIT 5;
-- Returns: 'DECIMAL', 'VARCHAR', 'BOOLEAN', 'NULL', 'ARRAY', 'OBJECT'
-- Build a JSON object from relational columns
SELECT OBJECT_CONSTRUCT(
'customer_id', customer_id,
'name', first_name || ' ' || last_name,
'email', email,
'country', country
) AS customer_json
FROM CUSTOMERS
LIMIT 5;
-- Build a JSON array from column values
SELECT ARRAY_CONSTRUCT(1, 2, 3, 'hello', TRUE) AS my_array;
-- Check if a key exists in a VARIANT object
SELECT *
FROM WEB_EVENTS
WHERE event_raw:payment IS NOT NULL;
-- Count elements in a JSON array
SELECT
event_raw:order_id::VARCHAR AS order_id,
ARRAY_SIZE(event_raw:items) AS item_count
FROM ORDERS_RAW;
Schema Detection: Automatically Inferring Column Types from Semi-Structured Files
When you want to convert a VARIANT-loaded dataset into a proper relational table, Snowflake's INFER_SCHEMA function scans your staged files and recommends column names and data types automatically.
-- Automatically detect schema from Parquet or JSON files in a stage
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION => '@s3_orders_stage/events/',
FILE_FORMAT => 'json_standard'
)
);
-- Result shows:
-- COLUMN_NAME TYPE NULLABLE
-- event_id TEXT Y
-- user_id FIXED Y
-- event_type TEXT Y
-- timestamp TEXT Y
-- payment.amount REAL Y
-- Use the detected schema to create a table automatically
CREATE TABLE EVENTS_STRUCTURED
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION => '@s3_orders_stage/events/',
FILE_FORMAT => 'json_standard'
)
)
);
Choosing Between VARIANT and Relational Columns
USE VARIANT WHEN USE RELATIONAL COLUMNS WHEN ----------------------------- ---------------------------- JSON structure changes frequently Schema is stable and well-defined You need fast ingest without ETL You need fast analytical queries Fields vary per row (sparse data) Most rows have all fields You are prototyping or exploring You are building production tables Source is API/event/clickstream Source is transactional database You need to store the raw original You need column-level compression
Key Points
- Snowflake supports numeric, string, date/time, boolean, binary, and semi-structured data types in the same table
- Use NUMBER(p,s) for monetary values to avoid floating-point rounding errors; use FLOAT for scientific calculations
- TIMESTAMP_NTZ is the safest timestamp type for data engineering — it stores values exactly as given with no timezone conversion
- VARIANT stores any JSON, XML, or semi-structured value in a single column — load first, define structure later
- Access nested JSON fields with colon notation (col:key:nested) and array elements with bracket notation (col:array[0])
- Always cast VARIANT values to a specific type before comparing, joining, or performing arithmetic
- FLATTEN with LATERAL expands JSON arrays into separate rows — essential for processing line items, tags, or event lists
- INFER_SCHEMA automatically detects column types from staged files, saving hours of manual schema design
