Deploying Your LangChain App to the Web
Your chatbot works on your local machine. Now you want real users to access it through a browser, a mobile app, or an API. This topic walks you through every step of deployment: wrapping your LangChain logic in a web API, containerizing it with Docker, deploying to a cloud server, handling multiple users safely, and keeping costs and security under control. By the end you will have a publicly accessible AI application running in the cloud.
The Post Office Analogy
Your local chatbot is like a letter-writing service that only works in your home. Deploying it to the web is like opening a proper post office — it accepts letters (requests) from anyone, processes them, and sends replies (responses) back. The post office needs a public address (URL), staff to handle multiple customers at once (concurrency), security checks (authentication), and a record of all transactions (logging).
Local Machine:
chatbot.py → runs in your terminal → only you can use it
Deployed Web App:
Browser / Mobile App
│ HTTP request
▼
FastAPI Server (your code)
│
▼
LangChain Chain
│
▼
OpenAI API
│
▼
Response → back to user
Option 1: FastAPI Web API
FastAPI is one of the most widely used Python frameworks for building AI APIs. It is fast, supports async natively, generates interactive API documentation automatically, and integrates cleanly with LangChain.
pip install fastapi uvicorn python-multipart python-dotenv
Basic FastAPI App (api.py)
import logging
from typing import Optional

from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from fastapi.concurrency import run_in_threadpool
from fastapi.middleware.cors import CORSMiddleware
from langchain_core.messages import AIMessage, HumanMessage
from pydantic import BaseModel

load_dotenv()  # Load OPENAI_API_KEY and other settings before the chains are built

# Chains from the previous topic (retrieval + memory)
from chains import build_rag_response, rephrase_chain

logger = logging.getLogger(__name__)

app = FastAPI(
    title="NovaTech AI Assistant API",
    description="Answers questions using NovaTech's company documents.",
    version="1.0.0",
)

# Allow requests from web browsers (CORS)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],      # Restrict to your frontend domain in production
    allow_credentials=False,  # Credentials require explicit origins, not "*"
    allow_methods=["*"],
    allow_headers=["*"],
)

# Request and response models
class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = "default"

class ChatResponse(BaseModel):
    answer: str
    session_id: str

# In-memory session storage (replace with Redis for multi-server deployments)
sessions = {}

def get_session_history(session_id: str) -> list:
    if session_id not in sessions:
        sessions[session_id] = []
    return sessions[session_id]

@app.get("/")
def health_check():
    return {"status": "ok", "service": "NovaTech AI Assistant"}

@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(request: ChatRequest):
    if not request.message.strip():
        raise HTTPException(status_code=400, detail="Message cannot be empty")
    if len(request.message) > 2000:
        raise HTTPException(status_code=400, detail="Message too long. Max 2000 characters.")

    session_id = request.session_id or "default"
    history = get_session_history(session_id)

    try:
        # Rephrase with history if available
        if history:
            standalone_q = await rephrase_chain.ainvoke({
                "history": history,
                "question": request.message,
            })
        else:
            standalone_q = request.message

        # build_rag_response is synchronous; run it in a worker thread so it
        # does not block the event loop for other users
        result = await run_in_threadpool(build_rag_response, standalone_q, history)
        answer = result["answer"]
    except Exception:
        # Log the details server-side; do not leak internals to the client
        logger.exception("Chat request failed")
        raise HTTPException(status_code=500, detail="Something went wrong. Please try again.")

    # Save to session history, keeping it at a reasonable size
    history.append(HumanMessage(content=request.message))
    history.append(AIMessage(content=answer))
    if len(history) > 20:
        sessions[session_id] = history[-20:]

    return ChatResponse(answer=answer, session_id=session_id)

@app.delete("/session/{session_id}")
def clear_session(session_id: str):
    if session_id in sessions:
        del sessions[session_id]
    return {"message": f"Session {session_id} cleared"}
Run Locally
uvicorn api:app --reload --host 0.0.0.0 --port 8000
Open http://localhost:8000/docs in your browser. FastAPI generates an interactive documentation page where you can test your endpoints directly. This is one of FastAPI's best features for development.
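Besides the /docs page, you can call the endpoint from a script. The sketch below uses only the standard library and assumes the server from api.py is running on localhost:8000. The helper names build_chat_request and send_chat are hypothetical, chosen for this example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, message: str, session_id: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST request matching the ChatRequest model."""
    body = json.dumps({"message": message, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(base_url: str, message: str, session_id: str = "default") -> dict:
    """Send the request and return the decoded ChatResponse JSON."""
    request = build_chat_request(base_url, message, session_id)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, send_chat("http://localhost:8000", "What is the refund policy?", "demo-session")["answer"] should return the assistant's reply as a string.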
Streaming Responses with FastAPI
For a real-time typing effect, return a streaming response instead of waiting for the complete answer.
from fastapi.responses import StreamingResponse
from chains import answer_chain

def sse_event(text: str) -> str:
    # One "data:" line per line of text, so a newline inside a chunk
    # cannot break the Server-Sent Events framing
    return "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    history = get_session_history(request.session_id or "default")
    docs_context = ""  # Add retrieval here in full implementation

    async def generate():
        full_response = ""
        # Assumes answer_chain ends in a string output parser, so chunks are str
        async for chunk in answer_chain.astream({
            "context": docs_context,
            "history": history,
            "question": request.message,
        }):
            full_response += chunk
            yield sse_event(chunk)  # Server-Sent Events format
        # Save to history after streaming completes
        history.append(HumanMessage(content=request.message))
        history.append(AIMessage(content=full_response))
        yield sse_event("[DONE]")

    return StreamingResponse(generate(), media_type="text/event-stream")
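On the client side, the stream has to be split back into chunks. The following is a minimal sketch of a Server-Sent Events reader, assuming the server sends one event per chunk and a final [DONE] event as shown above. The function name parse_sse is hypothetical, and the input here is a simulated list of lines, not a live connection.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event until the [DONE] sentinel."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips exactly one optional space after the colon
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            # A blank line ends the event; multiple data lines join with "\n"
            payload = "\n".join(data_lines)
            data_lines = []
            if payload == "[DONE]":
                return
            yield payload


# Simulated lines, shaped like the output of the /chat/stream endpoint
stream = ["data: Our\n", "\n", "data:  refund\n", "\n", "data:  policy\n", "\n", "data: [DONE]\n", "\n"]
answer = "".join(parse_sse(stream))
print(answer)  # Our refund policy
```

In a real client you would feed parse_sse the decoded lines of the HTTP response and display each payload as it arrives.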
Adding Authentication
A public API with no authentication means anyone can call your endpoint and spend your OpenAI credits. Add API key authentication as a minimum security layer.
import os
from fastapi import Security, HTTPException, status
from fastapi.security import APIKeyHeader

API_KEY_HEADER = APIKeyHeader(name="X-API-Key", auto_error=False)

def verify_api_key(api_key: str = Security(API_KEY_HEADER)):
    # Ignore stray spaces and empty entries in VALID_API_KEYS
    valid_keys = {k.strip() for k in os.getenv("VALID_API_KEYS", "").split(",") if k.strip()}
    if not api_key or api_key not in valid_keys:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid or missing API key",
        )
    return api_key

# Protect the chat endpoint
@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(request: ChatRequest, api_key: str = Security(verify_api_key)):
    # Only requests with a valid X-API-Key header reach this code
    ...
Add valid keys to your .env file: VALID_API_KEYS=key_abc123,key_xyz789. Clients include the key in their HTTP request header: X-API-Key: key_abc123.
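If you want to harden the key check, one option is to compare keys in constant time so that response timing reveals nothing about partial matches. This is a sketch of that idea using the standard library's hmac.compare_digest; the helper names load_valid_keys and is_valid_key are hypothetical and not part of FastAPI.

```python
import hmac
import os
from typing import Optional, Set


def load_valid_keys(env_var: str = "VALID_API_KEYS") -> Set[str]:
    """Read a comma-separated key list, ignoring blanks and stray spaces."""
    raw = os.getenv(env_var, "")
    return {key.strip() for key in raw.split(",") if key.strip()}


def is_valid_key(candidate: Optional[str], valid_keys: Set[str]) -> bool:
    """Check the candidate against every known key without early exit."""
    if not candidate:
        return False
    matched = False
    for key in valid_keys:
        # compare_digest avoids leaking how many leading characters matched
        matched |= hmac.compare_digest(candidate.encode("utf-8"), key.encode("utf-8"))
    return matched
```

Inside verify_api_key you could then replace the membership test with is_valid_key(api_key, load_valid_keys()).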
Rate Limiting
Even with authentication, one user can spam your API and run up your costs. Rate limiting caps how many requests a user can make per time window.
pip install slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("20/minute")  # Max 20 requests per minute per IP
async def chat_endpoint(request: Request, body: ChatRequest):
    # slowapi needs a parameter named "request" of type Request;
    # the JSON payload now arrives as "body"
    ...
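To make the idea of a "20 per minute" window concrete, here is a small self-contained sliding-window limiter. It is an illustration of the technique only, not how slowapi works internally, and the class name SlidingWindowLimiter is hypothetical. The now parameter exists so the example can be run without waiting for real time to pass.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of recent requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter_demo = SlidingWindowLimiter(max_requests=20, window_seconds=60)
results = [limiter_demo.allow("203.0.113.7", now=0.0) for _ in range(21)]
print(results.count(True))  # 20
```

The 21st request in the same window is rejected, while a different client ID, or the same client one window later, is allowed again.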
Option 2: Docker Container
Docker packages your application and all its dependencies into a portable container. The same container runs identically on your laptop, your staging server, and your production server — no "it works on my machine" problems.
Dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Start the server
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
Build and Run
# Build the Docker image
docker build -t novatech-bot .

# Run the container (pass environment variables securely)
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-your-key \
  -e VALID_API_KEYS=key_abc123 \
  novatech-bot
Never put API keys inside the Dockerfile or commit them to source control. Always pass them as environment variables at runtime, and list .env in a .dockerignore file so that COPY . . does not copy it into the image.
Option 3: Deploy to a Cloud Server (AWS / GCP / DigitalOcean)
A virtual machine (VM) in the cloud runs your Docker container and makes it accessible on the internet. The steps are similar across providers.
General deployment steps:
1. Create a VM (t3.small on AWS, e2-small on GCP, or $6/month on DigitalOcean)
2. SSH into the VM
3. Install Docker: curl -fsSL https://get.docker.com | sh
4. Copy your project files to the VM (git clone or scp)
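5. Build the image on the VM, then run the container:
   docker build -t novatech-bot .
   docker run -d \
     -p 80:8000 \
     -e OPENAI_API_KEY=your-key \
     -e VALID_API_KEYS=key_abc123 \
     --name novatech-bot \
     --restart unless-stopped \
     novatech-bot
6. Your app is now live at http://your-vm-ip (send POST requests to /chat)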
Option 4: Managed Platform Deployment (Render or Railway)
Render and Railway are managed hosting platforms that deploy Python web applications from a GitHub repository with almost no configuration. They handle SSL certificates, deployments, and basic scaling for you. Their free or trial tiers are usually enough for side projects, but check each platform's current pricing and limits before relying on them.
Deploying to Render:
1. Push your code to a GitHub repository
2. Sign in at render.com
3. Click "New Web Service"
4. Connect your GitHub repository
5. Set build command: pip install -r requirements.txt
6. Set start command: uvicorn api:app --host 0.0.0.0 --port $PORT
7. Add environment variables (OPENAI_API_KEY, etc.) in the dashboard
8. Click Deploy
Render gives you a URL like: https://novatech-bot.onrender.com
Option 5: LangServe for Quick Deployment
LangServe is LangChain's own deployment tool. It wraps any LCEL chain as a FastAPI endpoint automatically, reducing boilerplate code significantly.
pip install "langserve[all]"
# serve.py
from fastapi import FastAPI
from langserve import add_routes
from chains import answer_chain  # Your LCEL chain from earlier

app = FastAPI(title="NovaTech Bot")

# This one call creates /chat/invoke, /chat/stream, /chat/batch, and /chat/playground
add_routes(
    app,
    answer_chain,
    path="/chat",
)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
LangServe automatically creates a /chat/playground endpoint — a built-in web UI for testing your chain without writing any frontend code. This is excellent for demos and internal tools.
Handling Multiple Users with Session IDs
Every user needs their own conversation history. Never mix histories. Use a session identifier that the client generates and sends with each request.
import uuid

# Client generates a session ID when the conversation starts
session_id = str(uuid.uuid4())  # e.g., "a1b2c3d4-..."

# Client sends it with every request (JSON body):
# {
#   "message": "What is the refund policy?",
#   "session_id": "a1b2c3d4-..."
# }

# Server stores history keyed by session_id
sessions = {}
sessions["a1b2c3d4-..."] = [HumanMessage(...), AIMessage(...)]
For production with multiple server instances, replace the in-memory sessions dictionary with Redis. All server instances connect to the same Redis, so session data is consistent no matter which server handles each request.
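Before moving to Redis, it can help to put session handling behind a small interface so that the storage backend can be swapped later. The sketch below is one way to do that in memory, with trimming and expiry of idle sessions. The SessionStore class, its method names, and the use of plain dicts in place of HumanMessage and AIMessage are all assumptions made to keep the example dependency-free.

```python
import threading
import time
from typing import Dict, List, Optional, Tuple


class SessionStore:
    """Per-session message history with a size cap and idle expiry."""

    def __init__(self, max_messages: int = 20, ttl_seconds: float = 3600.0):
        self.max_messages = max_messages
        self.ttl_seconds = ttl_seconds
        self._lock = threading.Lock()
        # session_id -> (last_seen timestamp, list of messages)
        self._data: Dict[str, Tuple[float, List[dict]]] = {}

    def append(self, session_id: str, role: str, content: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        with self._lock:
            _, messages = self._data.get(session_id, (now, []))
            messages = messages + [{"role": role, "content": content}]
            # Keep only the most recent messages
            self._data[session_id] = (now, messages[-self.max_messages:])

    def history(self, session_id: str) -> List[dict]:
        with self._lock:
            _, messages = self._data.get(session_id, (0.0, []))
            return list(messages)  # Copy, so callers cannot mutate the store

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Delete sessions idle for longer than ttl_seconds; return how many."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [sid for sid, (seen, _) in self._data.items() if now - seen > self.ttl_seconds]
            for sid in expired:
                del self._data[sid]
            return len(expired)
```

A Redis-backed class exposing the same three methods could replace this one without changing the endpoint code.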
Environment Variables in Production
Never hard-code secrets in your application code. In production, set environment variables through your cloud provider's dashboard or secrets manager.
Required environment variables:
OPENAI_API_KEY        ← Your OpenAI API key
VALID_API_KEYS        ← Comma-separated valid API keys for your users
LANGCHAIN_API_KEY     ← LangSmith key (optional but recommended)
LANGCHAIN_TRACING_V2  ← Set to "true" to enable LangSmith tracing
REDIS_URL             ← Redis connection string (if using Redis for sessions)
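A missing variable is easier to diagnose at startup than on the first user request. The following sketch fails fast when a required variable is absent. Treating only OPENAI_API_KEY and VALID_API_KEYS as mandatory is an assumption based on the list above, and the function names are hypothetical.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Assumption: these two are mandatory; the others in the list are optional
REQUIRED_VARS = ("OPENAI_API_KEY", "VALID_API_KEYS")


def missing_env_vars(required: Sequence[str] = REQUIRED_VARS,
                     environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


def check_environment(environ: Optional[Mapping[str, str]] = None) -> None:
    """Raise at startup instead of failing on the first request."""
    missing = missing_env_vars(environ=environ)
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
```

Calling check_environment() near the top of api.py, after load_dotenv(), stops the server from starting with an incomplete configuration.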
Monitoring Your Deployed App
Once deployed, you need to know if the app is running, how long requests take, and when errors occur.
Health Check Endpoint
@app.get("/health")
def health():
    return {
        "status": "ok",
        "version": "1.0.0",
        "sessions_active": len(sessions),
    }
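Configure your cloud provider or an uptime monitor to poll /health at a regular interval, for example every 30 seconds. Most platforms can mark the instance unhealthy and restart or replace it when the check stops returning HTTP 200, but the exact behavior depends on the provider and how you configure it.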
Request Logging
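import logging
import time
from fastapi import Request

# Without this, the root logger drops INFO messages and nothing is printed
logging.basicConfig(level=logging.INFO)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()  # Monotonic clock, suited to measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start
    logging.info(
        f"{request.method} {request.url.path} "
        f"→ {response.status_code} ({duration:.3f}s)"
    )
    return response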
Cost Management for Production
Production AI applications can accumulate significant API costs if not managed carefully.
Strategy                         How to Implement
──────────────────────────────────────────────────────────────────
Cache frequent answers           Store hash(question) → answer in Redis
Limit response length            Set max_tokens in model config
Limit conversation history       Keep only last 10 messages per session
Use cheaper model for simple Q   Route to gpt-3.5-turbo vs gpt-4o
Set monthly spend alert          Configure in OpenAI dashboard
Charge users per query           Track usage per API key
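The last row, tracking usage per API key, can start as a simple counter. The sketch below is one possible shape for it. The UsageTracker class and the idea of a fixed monthly quota are assumptions for illustration, and a real deployment would persist the counts in a database or Redis so they survive restarts.

```python
from collections import Counter


class UsageTracker:
    """Count queries per API key and enforce a simple quota."""

    def __init__(self, monthly_quota: int):
        self.monthly_quota = monthly_quota
        self._counts = Counter()

    def record(self, api_key: str) -> bool:
        """Count one query; return False once the key has used up its quota."""
        if self._counts[api_key] >= self.monthly_quota:
            return False
        self._counts[api_key] += 1
        return True

    def usage(self, api_key: str) -> int:
        return self._counts[api_key]

    def reset(self) -> None:
        """Call at the start of each billing period."""
        self._counts.clear()


tracker = UsageTracker(monthly_quota=3)
allowed = [tracker.record("key_abc123") for _ in range(4)]
print(allowed)  # [True, True, True, False]
```

The usage() totals can feed a billing report, and a False result from record() can be turned into an HTTP 429 response in the endpoint.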
Simple Answer Cache
import hashlib
from chains import build_rag_response

answer_cache = {}  # Replace with Redis in production

def cached_chat(question: str, history: list) -> str:
    # Create cache key from question (history excluded for simplicity, so a
    # context-dependent follow-up can receive an answer cached for another conversation)
    cache_key = hashlib.md5(question.lower().strip().encode()).hexdigest()
    if cache_key in answer_cache:
        return answer_cache[cache_key] + " _(cached)_"

    result = build_rag_response(question, history)
    answer = result["answer"]

    # Cache answers for frequently asked questions
    answer_cache[cache_key] = answer
    return answer
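The simple cache above never expires entries, grows without bound, and ignores conversation history. A slightly more careful variant is sketched here: entries expire after a time limit, the cache has a size cap, and it is bypassed whenever the conversation already has history. The names AnswerCache and cached_answer are hypothetical, and compute stands in for a function such as build_rag_response that returns the answer text.

```python
import hashlib
import time
from typing import Callable, Optional


class AnswerCache:
    """Small in-memory cache with expiry and a size cap."""

    def __init__(self, ttl_seconds: float = 3600.0, max_entries: int = 1000):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._entries = {}  # key -> (expires_at, answer)

    @staticmethod
    def make_key(question: str) -> str:
        # Normalize case and whitespace so trivial variations share one entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        key = self.make_key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if now >= expires_at:
            del self._entries[key]
            return None
        return answer

    def put(self, question: str, answer: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        key = self.make_key(question)
        if key not in self._entries and len(self._entries) >= self.max_entries:
            # Dicts keep insertion order, so the first key is the oldest entry
            del self._entries[next(iter(self._entries))]
        self._entries[key] = (now + self.ttl_seconds, answer)


def cached_answer(question: str, history: list,
                  compute: Callable[[str, list], str],
                  cache: AnswerCache, now: Optional[float] = None) -> str:
    if history:
        # Follow-up questions depend on context, so never serve them from cache
        return compute(question, history)
    hit = cache.get(question, now)
    if hit is not None:
        return hit
    answer = compute(question, history)
    cache.put(question, answer, now)
    return answer
```

Only the first message of a conversation benefits from this cache, which trades some savings for answers that always respect context.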
Full Deployment Checklist
Before going live, verify:
Security:
□ API key authentication on all endpoints
□ Rate limiting enabled
□ No secrets in source code or Dockerfile
□ CORS restricted to known origins
□ HTTPS enabled (SSL certificate)
Reliability:
□ Health check endpoint working
□ Error handling returns friendly messages
□ App restarts automatically on crash
□ Max iterations/timeout set on agents
Performance:
□ Knowledge base built and cached on disk
□ Answer caching for common questions
□ Async endpoints for concurrent requests
□ Response time tested under expected load
Cost:
□ max_tokens set on all model calls
□ Conversation history trimmed
□ Spend alerts configured in OpenAI dashboard
Monitoring:
□ Request logging active
□ LangSmith tracing enabled
□ Error alerts configured
Summary
Deploying a LangChain application means wrapping your chain logic in a FastAPI web service, containerizing it with Docker, and running it on a cloud VM or a platform like Render. FastAPI provides the API endpoints and streaming support, and works with add-ons for authentication and rate limiting. Docker ensures your app runs identically in every environment. Session IDs enable per-user conversation histories. Environment variables keep secrets out of source code. LangSmith adds production tracing once its environment variables are set, with no changes to your chain code. Caching common answers avoids repeated model calls and can cut costs noticeably.
