Snowpark Writing Python Code Inside Snowflake
Snowpark is one of the most exciting features Snowflake has introduced in recent years. It lets you write Python (and other programming languages) that runs directly inside Snowflake's compute engine. Instead of pulling data out to your laptop or a separate server to process it with Python, you push your Python code into Snowflake where the data already lives. This eliminates slow data movement and lets you use familiar programming languages for data transformations at scale.
Why Snowpark Exists
Traditional data workflows often look like this: extract millions of rows from Snowflake, load them into a Python environment, run Pandas or NumPy operations, then write results back to Snowflake. This creates three problems:
- Data movement is slow: transferring terabytes of data across a network takes time and bandwidth
- Memory limits: your laptop or server can only hold so much data in RAM
- Separate tools to manage: you need to maintain Python environments, connection strings, and dependencies outside of Snowflake
Snowpark solves all three. Your Python code runs inside Snowflake's own compute infrastructure. The data never leaves Snowflake. You write Python; Snowflake executes it at scale.
The Snowpark Architecture
Diagram: The Kitchen vs The Restaurant
Imagine you own a restaurant far from your kitchen. Every time a customer orders food, you drive ingredients from the kitchen to a separate cooking station across town, cook the meal, and drive it back. That is the traditional approach: extract data, process it elsewhere, load it back.
With Snowpark, the cooking station moves inside the restaurant kitchen. The chef (your Python code) works right where the ingredients (your data) are stored. No driving, no delays, no separate location to manage.
Setting Up Snowpark for Python
You can use Snowpark in two ways:
- Locally: Install the Snowpark Python library on your machine and connect to Snowflake remotely. Your code runs on Snowflake's servers even though you write it locally.
- Inside Snowflake Notebooks: Write and run Python directly in Snowflake's browser-based notebook environment with no local setup.
Installing Snowpark Locally
pip install snowflake-snowpark-pythonConnecting to Snowflake with Snowpark
from snowflake.snowpark import Session
connection_parameters = {
"account": "your_account_identifier",
"user": "your_username",
"password": "your_password",
"role": "DATA_ENGINEER",
"warehouse": "COMPUTE_WH",
"database": "SALES_DB",
"schema": "PUBLIC"
}
session = Session.builder.configs(connection_parameters).create()
print("Connected to Snowflake using Snowpark!")
print(session.sql("SELECT CURRENT_VERSION()").collect())Once you create a Session, everything you do runs on Snowflake's compute, not your local machine.
The Snowpark DataFrame
The central concept in Snowpark is the DataFrame. If you have used Pandas or Apache Spark, this concept feels familiar. A Snowpark DataFrame represents a lazily evaluated query against Snowflake data. You build transformations on it, and Snowflake executes everything as a single optimized SQL query when you trigger an action.
Diagram: The Blueprint vs The Building
A blueprint is not the building. An architect draws the blueprint (adds rooms, adjusts dimensions, moves walls) and only when construction begins does the real work happen. A Snowpark DataFrame is the blueprint. You define transformations step by step, and when you call .collect() or .show(), Snowflake builds the result.
Reading Data into a Snowpark DataFrame
from snowflake.snowpark.functions import col, lit, sum as sum_, avg, count
# Read from a table
orders_df = session.table("orders")
# Read from a SQL query
revenue_df = session.sql("""
SELECT customer_id, region, SUM(amount) as total
FROM orders
WHERE status = 'COMPLETED'
GROUP BY customer_id, region
""")
# Read from a CSV file in a Snowflake stage
raw_df = session.read.option("header", True).csv("@my_stage/data/sales.csv")
# Show the first 10 rows
orders_df.show(10)Transforming Data with Snowpark
from snowflake.snowpark.functions import col, when, to_date, year, month, upper, trim
# Filter rows
completed_orders = orders_df.filter(col("status") == "COMPLETED")
# Select specific columns and rename them
clean_df = completed_orders.select(
col("order_id"),
col("customer_id"),
upper(trim(col("region"))).alias("region"),
col("amount"),
to_date(col("order_date")).alias("order_date")
)
# Add a new calculated column
enriched_df = clean_df.with_column(
"revenue_tier",
when(col("amount") < 100, lit("Low"))
.when(col("amount") < 1000, lit("Medium"))
.otherwise(lit("High"))
)
# Group and aggregate
summary_df = enriched_df.group_by("region", "revenue_tier").agg(
sum_("amount").alias("total_revenue"),
count("order_id").alias("total_orders"),
avg("amount").alias("avg_order_value")
)
summary_df.show()Every step above adds to the query blueprint. Snowflake executes all of it at once when .show() runs.
Joining DataFrames
customers_df = session.table("customers")
orders_df = session.table("orders")
# Join orders with customers on customer_id
joined_df = orders_df.join(
customers_df,
orders_df["customer_id"] == customers_df["customer_id"],
join_type="inner"
)
# Select columns from both tables
result_df = joined_df.select(
orders_df["order_id"],
customers_df["customer_name"],
customers_df["email"],
orders_df["amount"],
orders_df["order_date"]
)
result_df.show(5)Writing Results Back to Snowflake
# Write to a new table (replace if it exists)
summary_df.write.mode("overwrite").save_as_table("regional_revenue_summary")
# Append to an existing table
new_data_df.write.mode("append").save_as_table("regional_revenue_summary")
# Write to a CSV file in a Snowflake stage
summary_df.write.csv("@output_stage/reports/")
print("Data written successfully")Snowpark User-Defined Functions (UDFs)
You can register Python functions as UDFs that run inside Snowflake's SQL engine. Once registered, other team members and SQL queries can call your Python function directly.
Registering a Simple Python UDF
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, FloatType
# Define a Python function
def categorize_spending(amount):
if amount is None:
return "Unknown"
elif amount < 50:
return "Low Spender"
elif amount < 500:
return "Mid Spender"
else:
return "High Spender"
# Register it as a UDF
categorize_udf = session.udf.register(
func=categorize_spending,
return_type=StringType(),
input_types=[FloatType()],
name="categorize_spending",
is_permanent=True,
stage_location="@my_udf_stage",
replace=True
)
# Use it in a Snowpark DataFrame
orders_df.with_column(
"spending_category",
categorize_udf(col("amount"))
).show()Setting is_permanent=True saves the UDF to Snowflake permanently. Anyone with the right permissions can then call categorize_spending(amount) in their SQL queries.
UDFs with Third-Party Libraries
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType
# UDF that uses the 'phonenumbers' library (must be imported inside the function)
@udf(
return_type=StringType(),
packages=["phonenumbers"],
is_permanent=True,
stage_location="@my_udf_stage",
name="format_phone",
replace=True
)
def format_phone(raw: str) -> str:
import phonenumbers
try:
parsed = phonenumbers.parse(raw, "US")
return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.INTERNATIONAL)
except Exception:
return "Invalid"The packages=["phonenumbers"] parameter tells Snowflake to install this Python package in its execution environment automatically. You do not install it anywhere manually.
Snowpark Stored Procedures with Python
You can also write Snowflake Stored Procedures in Python using Snowpark. These are ideal for complex data engineering tasks that involve multiple steps, conditionals, and loops.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_
def run_monthly_close(session: Session, close_month: str, close_year: int) -> str:
"""Runs end-of-month revenue reconciliation"""
# Step 1: Pull current month orders
monthly_orders = session.table("orders").filter(
(col("order_month") == close_month) &
(col("order_year") == close_year)
)
# Step 2: Compute totals
totals = monthly_orders.group_by("region").agg(
sum_("amount").alias("total_revenue")
)
# Step 3: Write to reconciliation table
totals.with_column("close_month", col("region").cast("string")).write.mode("append").save_as_table("monthly_reconciliation")
row_count = totals.count()
return f"Monthly close complete: {row_count} regions processed for {close_month} {close_year}"
# Register as a stored procedure
session.sproc.register(
func=run_monthly_close,
name="run_monthly_close",
is_permanent=True,
stage_location="@my_sproc_stage",
replace=True
)
# Call it from SQL
# CALL run_monthly_close('June', 2024);Vectorized UDFs for High Performance
Standard UDFs process one row at a time. Vectorized UDFs process entire batches of rows using Pandas DataFrames, making them dramatically faster for large datasets.
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, FloatType
import pandas as pd
@pandas_udf(
return_type=PandasSeriesType(FloatType()),
input_types=[PandasSeriesType(FloatType()), PandasSeriesType(FloatType())]
)
def calculate_discounted_price(prices: pd.Series, discount_rates: pd.Series) -> pd.Series:
return prices * (1 - discount_rates / 100)
# Apply the vectorized UDF to a DataFrame
products_df.with_column(
"discounted_price",
calculate_discounted_price(col("price"), col("discount_pct"))
).show()Snowflake sends batches of data to this function as Pandas Series. Your code processes thousands of rows at once, not one by one. This can be 10x to 100x faster than a standard row-by-row UDF for large datasets.
Running Snowpark in Snowflake Notebooks
Snowflake Notebooks let you write Python directly in a browser-based IDE inside Snowflake. No local Python installation, no connection strings to configure. You open a notebook, start a cell, and write Snowpark code.
Inside a Snowflake Notebook, the session object is pre-created automatically. You just start using it:
# Inside a Snowflake Notebook - session is already available
df = session.table("sales_data")
df.filter(col("region") == "APAC").group_by("product").agg(sum_("revenue").alias("total")).show()Snowflake Notebooks also support mixing SQL cells and Python cells in the same notebook. Run SQL in one cell, manipulate the results in Python in the next cell.
Key Points Summary
- Snowpark lets you write Python (and other languages) that runs inside Snowflake's compute engine, eliminating data movement.
- The Snowpark DataFrame is lazily evaluated: transformations build a query plan and execution happens only when you call
.collect()or.show(). - You can join, filter, aggregate, and transform data using Python syntax that compiles to optimized SQL under the hood.
- Python UDFs registered via Snowpark become permanent Snowflake functions available in SQL queries.
- Vectorized UDFs use Pandas DataFrames to process batches of rows at once, far outperforming row-by-row UDFs.
- Snowpark Stored Procedures written in Python run complex multi-step workflows inside Snowflake.
- Third-party Python packages are available in Snowpark UDFs via the
packagesparameter, managed by the Anaconda-powered Snowflake Package Repository. - Snowflake Notebooks provide a browser-based Python environment with the session pre-connected, requiring zero local setup.
