Deploying Your LangChain App to the Web

Your chatbot works on your local machine. Now you want real users to access it through a browser, a mobile app, or an API. This topic walks you through every step of deployment: wrapping your LangChain logic in a web API, containerizing it with Docker, deploying to a cloud server, handling multiple users safely, and keeping costs and security under control. By the end you will have a publicly accessible AI application running in the cloud.

The Post Office Analogy

Your local chatbot is like a letter-writing service that only works in your home. Deploying it to the web is like opening a proper post office — it accepts letters (requests) from anyone, processes them, and sends replies (responses) back. The post office needs a public address (URL), staff to handle multiple customers at once (concurrency), security checks (authentication), and a record of all transactions (logging).

Local Machine:
  chatbot.py → runs in your terminal → only you can use it

Deployed Web App:
  Browser / Mobile App
        │ HTTP request
        ▼
  FastAPI Server (your code)
        │
        ▼
  LangChain Chain
        │
        ▼
  OpenAI API
        │
        ▼
  Response → back to user

Option 1: FastAPI Web API

FastAPI is one of the most popular Python frameworks for building AI APIs. It is fast, supports async natively, generates automatic documentation, and integrates cleanly with LangChain.

pip install fastapi uvicorn python-multipart

Basic FastAPI App (api.py)

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
from dotenv import load_dotenv
from chains import chat  # Import your chat function from the previous topic

load_dotenv()

app = FastAPI(
    title="NovaTech AI Assistant API",
    description="Answers questions using NovaTech's company documents.",
    version="1.0.0"
)

# Allow requests from web browsers (CORS)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],          # Restrict to your frontend domain in production
    allow_credentials=False,      # Do not combine credentials with a wildcard origin
    allow_methods=["*"],
    allow_headers=["*"],
)

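# Request and response models
import uuid
from pydantic import Field

class ChatRequest(BaseModel):
    message: str
    # Default to a fresh ID so clients that omit session_id never share history
    session_id: Optional[str] = Field(default_factory=lambda: str(uuid.uuid4()))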

class ChatResponse(BaseModel):
    answer: str
    session_id: str

# In-memory session storage (replace with Redis for multi-server deployments)
from langchain_core.messages import HumanMessage, AIMessage
sessions = {}

def get_session_history(session_id: str) -> list:
    if session_id not in sessions:
        sessions[session_id] = []
    return sessions[session_id]

@app.get("/")
def health_check():
    return {"status": "ok", "service": "NovaTech AI Assistant"}

@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(request: ChatRequest):
    if not request.message.strip():
        raise HTTPException(status_code=400, detail="Message cannot be empty")

    if len(request.message) > 2000:
        raise HTTPException(status_code=400, detail="Message too long. Max 2000 characters.")

    session_id = request.session_id or "default"
    history = get_session_history(session_id)

    try:
        # Import the chain pieces that handle retrieval + memory
        from chains import rephrase_chain, build_rag_response
        from fastapi.concurrency import run_in_threadpool

        # Rephrase with history if available (ainvoke keeps the event loop free)
        if history:
            standalone_q = await rephrase_chain.ainvoke({
                "history": history,
                "question": request.message
            })
        else:
            standalone_q = request.message

        # build_rag_response is synchronous, so run it in a worker thread
        result = await run_in_threadpool(build_rag_response, standalone_q, history)
        answer = result["answer"]

        # Save to session history
        history.append(HumanMessage(content=request.message))
        history.append(AIMessage(content=answer))

        # Keep history at a reasonable size
        if len(history) > 20:
            sessions[session_id] = history[-20:]

        return ChatResponse(answer=answer, session_id=session_id)

    except Exception as e:
        # Return a generic message; internal error details should not reach clients
        raise HTTPException(status_code=500, detail="Internal error. Please try again later.") from e

@app.delete("/session/{session_id}")
def clear_session(session_id: str):
    if session_id in sessions:
        del sessions[session_id]
    return {"message": f"Session {session_id} cleared"}

Run Locally

uvicorn api:app --reload --host 0.0.0.0 --port 8000

Open http://localhost:8000/docs in your browser. FastAPI generates an interactive documentation page where you can test your endpoints directly. This is one of FastAPI's best features for development.

Streaming Responses with FastAPI

For a real-time typing effect, return a streaming response instead of waiting for the complete answer.

from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    from chains import answer_chain

    history = get_session_history(request.session_id or "default")
    docs_context = ""  # Add retrieval here in full implementation

    async def generate():
        full_response = ""
        async for chunk in answer_chain.astream({
            "context": docs_context,
            "history": history,
            "question": request.message
        }):
            full_response += chunk
            yield f"data: {chunk}\n\n"  # Server-Sent Events format

        # Save to history after streaming completes
        history.append(HumanMessage(content=request.message))
        history.append(AIMessage(content=full_response))
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
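On the client side, the stream has to be split back into chunks. The following is a minimal sketch of that parsing step, assuming the server sends each chunk as a single `data: ...` line followed by a blank line and ends with `[DONE]`, as in the handler above. The helper name `iter_sse_data` is hypothetical, and the list of lines stands in for what an HTTP client would deliver.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line until the [DONE] marker."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # one leading space after the colon is not part of the data
        if payload == "[DONE]":
            return
        yield payload


# Example with a captured stream; in practice `lines` would come from an HTTP client
stream = ["data: Hello\n", "\n", "data:  world\n", "\n", "data: [DONE]\n", "\n"]
answer = "".join(iter_sse_data(stream))
```

Note that a chunk containing a newline would break this one-line-per-chunk framing; in that case the server would need to split the chunk across several `data:` lines.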

Adding Authentication

A public API with no authentication means anyone can call your endpoint and spend your OpenAI credits. Add API key authentication as a minimum security layer.

from fastapi import Security, HTTPException, status
from fastapi.security import APIKeyHeader
import os

API_KEY_HEADER = APIKeyHeader(name="X-API-Key", auto_error=False)

def verify_api_key(api_key: str = Security(API_KEY_HEADER)):
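    # Strip whitespace and ignore empty entries in the comma-separated list
    valid_keys = {k.strip() for k in os.getenv("VALID_API_KEYS", "").split(",") if k.strip()}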
    if not api_key or api_key not in valid_keys:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid or missing API key"
        )
    return api_key

# Protect the chat endpoint
@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(request: ChatRequest, api_key: str = Security(verify_api_key)):
    # Only requests with a valid X-API-Key header reach this code
    ...

Add valid keys to your .env file: VALID_API_KEYS=key_abc123,key_xyz789. Clients include the key in their HTTP request header: X-API-Key: key_abc123.
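From the client's point of view, an authenticated call looks roughly like the sketch below. It uses only the standard library and assumes the API runs at http://localhost:8000 with the /chat route and JSON body described earlier; `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, message: str, session_id: str) -> urllib.request.Request:
    """Build a POST /chat request that carries the API key in the X-API-Key header."""
    body = json.dumps({"message": message, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000", "key_abc123", "What is the refund policy?", "demo-session"
)
# To send it: urllib.request.urlopen(req), then json.load() the response body
```

A request without the header, or with a key that is not in VALID_API_KEYS, should come back as HTTP 401 from the dependency shown above.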

Rate Limiting

Even with authentication, one user can spam your API and run up your costs. Rate limiting caps how many requests a user can make per time window.

pip install slowapi

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

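from fastapi import Request

@app.post("/chat")
@limiter.limit("20/minute")  # Max 20 requests per minute per IP
async def chat_endpoint(request: Request, body: ChatRequest):
    # slowapi needs a parameter named `request` that is the Starlette Request;
    # the JSON payload therefore moves to `body`
    ...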
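To make the idea of a "time window" concrete, here is a small sliding-window limiter in plain Python. It illustrates the concept only and is not how slowapi is implemented; the class name `SlidingWindowLimiter` and the injectable `clock` parameter are assumptions made for this sketch.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each client key."""

    def __init__(self, limit: int, window: float, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: the caller would answer with HTTP 429
        hits.append(now)
        return True


# Demonstration with a fake clock so the behavior is deterministic
fake_now = [0.0]
limiter_demo = SlidingWindowLimiter(limit=2, window=60.0, clock=lambda: fake_now[0])
results = [limiter_demo.allow("1.2.3.4") for _ in range(3)]  # third call is rejected
fake_now[0] = 61.0
after_window = limiter_demo.allow("1.2.3.4")  # old hits have expired
```

Keying by IP address, as above, treats everyone behind the same proxy as one user; keying by API key is a common alternative once authentication is in place.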

Option 2: Docker Container

Docker packages your application and all its dependencies into a portable container. The same container runs identically on your laptop, your staging server, and your production server — no "it works on my machine" problems.

Dockerfile

FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Start the server
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Build and Run

# Build the Docker image
docker build -t novatech-bot .

# Run the container (secrets are read from .env at runtime, not baked into the image)
docker run -p 8000:8000 \
  --env-file .env \
  novatech-bot

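Never put API keys inside the Dockerfile or commit them to source control. Always pass them as environment variables at runtime. Because the Dockerfile runs COPY . ., also list .env in a .dockerignore file so your local secrets file is not copied into the image.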

Option 3: Deploy to a Cloud Server (AWS / GCP / DigitalOcean)

A virtual machine (VM) in the cloud runs your Docker container and makes it accessible on the internet. The steps are similar across providers.

General deployment steps:

1. Create a VM (t3.small on AWS, e2-small on GCP, or $6/month on DigitalOcean)
2. SSH into the VM
3. Install Docker: curl -fsSL https://get.docker.com | sh
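4. Copy your project files to the VM (git clone or scp)
5. Build the image on the VM: docker build -t novatech-bot .
6. Run the Docker container:
   docker run -d \
     -p 80:8000 \
     -e OPENAI_API_KEY=your-key \
     -e VALID_API_KEYS=key_abc123 \
     --name novatech-bot \
     --restart unless-stopped \
     novatech-bot

7. Your app is now live at http://your-vm-ip/chat (make sure the VM's firewall or security group allows inbound traffic on port 80)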

Option 4: Serverless Deployment (Render or Railway)

Render and Railway deploy Python web applications from a GitHub repository with almost no configuration. They handle scaling, SSL certificates, and deployments automatically. Both have offered free or trial tiers suitable for side projects; check each provider's current pricing before relying on them.

Deploying to Render:
1. Push your code to a GitHub repository
2. Sign in at render.com
3. Click "New Web Service"
4. Connect your GitHub repository
5. Set build command: pip install -r requirements.txt
6. Set start command: uvicorn api:app --host 0.0.0.0 --port $PORT
7. Add environment variables (OPENAI_API_KEY, etc.) in the dashboard
8. Click Deploy

Render gives you a URL like: https://novatech-bot.onrender.com

Option 5: LangServe for Quick Deployment

LangServe is LangChain's own deployment tool. It wraps any LCEL chain as a FastAPI endpoint automatically, reducing boilerplate code significantly.

pip install "langserve[all]"

# serve.py
from fastapi import FastAPI
from langserve import add_routes
from chains import answer_chain  # Your LCEL chain from earlier

app = FastAPI(title="NovaTech Bot")

# This one call creates /chat/invoke, /chat/stream, /chat/batch, and /chat/playground
add_routes(
    app,
    answer_chain,
    path="/chat"
)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

LangServe automatically creates a /chat/playground endpoint — a built-in web UI for testing your chain without writing any frontend code. This is excellent for demos and internal tools.

Handling Multiple Users with Session IDs

Every user needs their own conversation history. Never mix histories. Use a session identifier that the client generates and sends with each request.

import uuid

# Client generates a session ID when the conversation starts
session_id = str(uuid.uuid4())  # e.g., "a1b2c3d4-..."

# Client sends it with every request
{
    "message": "What is the refund policy?",
    "session_id": "a1b2c3d4-..."
}

# Server stores history keyed by session_id
sessions = {}
sessions["a1b2c3d4-..."] = [HumanMessage(...), AIMessage(...)]

For production with multiple server instances, replace the in-memory sessions dictionary with Redis. All server instances connect to the same Redis, so session data is consistent no matter which server handles each request.
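Redis stores strings and bytes rather than Python objects, so the history has to be serialized before it is written. The sketch below shows one way to do that, assuming messages are kept as plain role/content dictionaries instead of LangChain message objects. The function names are hypothetical, and a dictionary stands in for the Redis client so the example runs without a server.

```python
import json
from typing import Dict, List, Optional

Message = Dict[str, str]  # e.g. {"role": "human", "content": "..."}


def dump_history(history: List[Message], max_messages: int = 20) -> str:
    """Serialize the most recent messages to a JSON string for storage."""
    return json.dumps(history[-max_messages:])


def load_history(raw: Optional[str]) -> List[Message]:
    """Parse a stored JSON string; a missing key yields an empty history."""
    return json.loads(raw) if raw else []


store: Dict[str, str] = {}  # stand-in for Redis: string keys, string values

history = load_history(store.get("session:a1b2"))
history.append({"role": "human", "content": "What is the refund policy?"})
history.append({"role": "ai", "content": "(answer text)"})
store["session:a1b2"] = dump_history(history)
```

With a real Redis client, the `store.get(...)` and `store[...] = ...` lines would become read and write calls against the shared server, ideally with an expiry so abandoned sessions are cleaned up.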

Environment Variables in Production

Never hard-code secrets in your application code. In production, set environment variables through your cloud provider's dashboard or secrets manager.

Required environment variables:
  OPENAI_API_KEY       ← Your OpenAI API key
  VALID_API_KEYS       ← Comma-separated valid API keys for your users
  LANGCHAIN_API_KEY    ← LangSmith key (optional but recommended)
  LANGCHAIN_TRACING_V2 ← Set to "true" to enable LangSmith tracing
  REDIS_URL            ← Redis connection string (if using Redis for sessions)
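A missing variable is easier to diagnose at startup than on the first user request. The following sketch fails fast when configuration is incomplete. It assumes OPENAI_API_KEY and VALID_API_KEYS are the two variables that are always required; adjust the tuple for your setup. The helper names are hypothetical.

```python
import os
from typing import Iterable, List, Mapping

# Assumption for this sketch: these two are mandatory, the rest are optional
REQUIRED_VARS = ("OPENAI_API_KEY", "VALID_API_KEYS")


def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def check_environment(env: Mapping[str, str] = os.environ) -> None:
    """Raise at startup if any required variable is missing."""
    missing = missing_env_vars(REQUIRED_VARS, env)
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")


# In api.py this would be called once, right after load_dotenv():
# check_environment()
```

The error message lists only variable names, never their values, so it is safe to write to logs.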

Monitoring Your Deployed App

Once deployed, you need to know if the app is running, how long requests take, and when errors occur.

Health Check Endpoint

@app.get("/health")
def health():
    return {
        "status": "ok",
        "version": "1.0.0",
        "sessions_active": len(sessions)
    }

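Configure your cloud provider or load balancer to check /health every 30 seconds. Many platforms can be set to restart or replace the instance automatically when the check fails (it returns anything other than HTTP 200, or times out); the exact behavior depends on the provider.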

Request Logging

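import logging
import time
from fastapi import Request

# Without this, the root logger defaults to WARNING and INFO lines are dropped
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start
    logger.info(
        f"{request.method} {request.url.path} "
        f"→ {response.status_code} ({duration:.3f}s)"
    )
    return response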

Cost Management for Production

Production AI applications can accumulate significant API costs if not managed carefully.
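A quick back-of-the-envelope estimate shows how these costs add up. The sketch below multiplies traffic by tokens per request and by price per token. The prices and traffic figures are placeholders chosen for easy arithmetic, not real rates, so substitute your provider's current pricing.

```python
def estimate_monthly_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from average tokens per request and per-million-token prices."""
    per_request = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return requests_per_day * days * per_request


# Placeholder figures: 1,000 requests/day, 2,000 input and 300 output tokens each,
# at hypothetical prices of 1.00 and 4.00 per million tokens
cost = estimate_monthly_cost(1000, 2000, 300, 1.00, 4.00)
print(f"Estimated monthly cost: {cost:.2f}")
```

Input tokens dominate in a RAG application because every request carries the retrieved context and the conversation history, which is why trimming history and limiting retrieved chunks matter as much as max_tokens.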

Strategy                       How to Implement
──────────────────────────────────────────────────────────────────
Cache frequent answers         Store hash(question) → answer in Redis
Limit response length          Set max_tokens in model config
Limit conversation history     Keep only the last 20 messages per session
Cheaper model for simple Qs    Route to gpt-4o-mini instead of gpt-4o
Set monthly spend alert        Configure in OpenAI dashboard
Charge users per query         Track usage per API key

Simple Answer Cache

import hashlib
from chains import build_rag_response  # Same helper used by the /chat endpoint

answer_cache = {}  # Replace with Redis in production

def cached_chat(question: str, history: list) -> str:
    # Create cache key from question (history excluded for simplicity)
    cache_key = hashlib.md5(question.lower().strip().encode()).hexdigest()

    if cache_key in answer_cache:
        return answer_cache[cache_key] + " _(cached)_"

    result = build_rag_response(question, history)
    answer = result["answer"]

    # Cache answers for frequently asked questions
    answer_cache[cache_key] = answer
    return answer
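The dictionary cache above never expires entries and grows without bound, so stale answers can outlive a document update. A possible refinement is a cache with a time-to-live and a size cap, sketched below. The class name `TTLCache` and its parameters are assumptions for illustration; in production the same behavior would normally come from Redis key expiry.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class TTLCache:
    """In-memory cache with expiry and a size cap (oldest entries evicted first)."""

    def __init__(self, ttl_seconds: float, max_entries: int,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._data: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()

    @staticmethod
    def key_for(question: str) -> str:
        # Normalize case and whitespace so trivially different inputs share a key
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question: str) -> Optional[str]:
        key = self.key_for(question)
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at >= self.ttl:
            del self._data[key]  # expired
            return None
        return answer

    def put(self, question: str, answer: str) -> None:
        key = self.key_for(question)
        self._data[key] = (self.clock(), answer)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry


# Demonstration with a fake clock
now = [0.0]
cache = TTLCache(ttl_seconds=3600, max_entries=2, clock=lambda: now[0])
cache.put("What is the refund policy?", "answer A")
hit = cache.get("  what is the REFUND policy? ")  # same key after normalization
now[0] = 3600.0
expired = cache.get("What is the refund policy?")  # None once the TTL has passed
```

Choose the TTL based on how often the underlying documents change: a shorter TTL means fresher answers at the cost of more model calls.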

Full Deployment Checklist

Before going live, verify:

Security:
  □ API key authentication on all endpoints
  □ Rate limiting enabled
  □ No secrets in source code or Dockerfile
  □ CORS restricted to known origins
  □ HTTPS enabled (SSL certificate)

Reliability:
  □ Health check endpoint working
  □ Error handling returns friendly messages
  □ App restarts automatically on crash
  □ Max iterations/timeout set on agents

Performance:
  □ Knowledge base built and cached on disk
  □ Answer caching for common questions
  □ Async endpoints for concurrent requests
  □ Response time tested under expected load

Cost:
  □ max_tokens set on all model calls
  □ Conversation history trimmed
  □ Spend alerts configured in OpenAI dashboard

Monitoring:
  □ Request logging active
  □ LangSmith tracing enabled
  □ Error alerts configured

Summary

Deploying a LangChain application means wrapping your chain logic in a FastAPI web service, containerizing it with Docker, and running it on a cloud VM or a platform like Render. FastAPI provides API endpoints and streaming support, and with a few additions it also handles authentication and rate limiting. Docker ensures your app runs identically in every environment. Session IDs enable per-user conversation histories. Environment variables keep secrets out of source code. LangSmith adds production tracing through environment variables rather than code changes. Caching answers to frequently repeated questions avoids paying for the same model call twice.
