Deploying AI Agents to Production
Building an agent that works on a local machine is one milestone. Deploying it so real users can access it reliably, securely, and at scale is the next. This topic covers everything needed to take an AI Agent from a local Python script to a production-ready web service.
What Does "Production" Mean for an AI Agent?
A production AI Agent must be:
- Accessible: Available via an API or web interface
- Reliable: Handles errors gracefully and recovers automatically
- Secure: API keys are protected; inputs are validated
- Scalable: Handles many users simultaneously
- Observable: Logs every request, response, and error
- Cost-managed: Monitors token usage to avoid unexpected bills
Production Architecture Overview
```
[User / Frontend App]
          |
     HTTP Request
          |
          ▼
[FastAPI Web Server]   ← Handles requests, validates input
          |
          ▼
[Agent Runner]         ← Runs the agent loop
          |
     ┌────┴─────┐
     ▼          ▼
[OpenAI API]  [Tools]  ← External services
     │          │
     └────┬─────┘
          ▼
[Response Returned]
          |
[Logging + Monitoring] ← Records everything for analysis
```
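The API layer in Step 1 imports a `run_agent` function from an `agent` module. As a reference for the interface it assumes, here is a minimal sketch with the model call stubbed out — a real implementation would call the OpenAI API in a tool-use loop rather than return a placeholder:

```python
# agent.py — minimal sketch of the interface the API layer expects.
# The model call is a stub; a real agent would call the OpenAI API
# in a loop, executing tools until a final answer emerges.

def call_model(question: str) -> str:
    """Placeholder for the LLM call (e.g. an OpenAI chat completion)."""
    return f"(stub answer for: {question})"

def run_agent(question: str, verbose: bool = False, max_steps: int = 5) -> str:
    """Run the agent loop and return the final answer as a string."""
    answer = ""
    for step in range(max_steps):
        answer = call_model(question)
        if verbose:
            print(f"Step {step + 1}: {answer}")
        # A real agent would inspect the response for tool calls here
        # and continue the loop; the stub finishes immediately.
        break
    return answer
```

The only contract the web layer relies on is "string in, string out", which keeps the agent logic swappable behind the API.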
Step 1 — Build the Agent as an API Using FastAPI
```python
# Install: pip install fastapi uvicorn pydantic python-dotenv
# api.py
import time
import logging

from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

from agent import run_agent

load_dotenv()

# ─── Setup Logging ────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[
        logging.FileHandler("agent_logs.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# ─── FastAPI App ──────────────────────────────────────────────────
app = FastAPI(
    title="AI Agent API",
    description="A production-ready AI Agent accessible via REST API",
    version="1.0.0"
)

# Allow cross-origin requests (so a frontend can call this API)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production: restrict to known domains
    allow_methods=["POST", "GET"],
    allow_headers=["*"]
)

# ─── Request / Response Models ────────────────────────────────────
class AgentRequest(BaseModel):
    question: str
    session_id: str = "default"
    max_steps: int = 5

class AgentResponse(BaseModel):
    answer: str
    session_id: str
    steps_taken: int
    processing_time_ms: float

# ─── API Endpoints ────────────────────────────────────────────────
@app.get("/health")
def health_check():
    """Health check endpoint — returns OK if the server is running."""
    return {"status": "ok", "service": "AI Agent API"}

@app.post("/agent/ask", response_model=AgentResponse)
async def ask_agent(request: AgentRequest):
    """Submit a question to the AI Agent and get a response."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    if len(request.question) > 2000:
        raise HTTPException(status_code=400, detail="Question too long (max 2000 characters)")

    start_time = time.time()
    try:
        logger.info(f"[{request.session_id}] Question: {request.question[:100]}...")
        # Run the agent
        answer = run_agent(request.question, verbose=False)
        elapsed_ms = (time.time() - start_time) * 1000
        logger.info(f"[{request.session_id}] Answer in {elapsed_ms:.0f}ms: {answer[:100]}...")
        return AgentResponse(
            answer=answer,
            session_id=request.session_id,
            steps_taken=3,  # In production: track actual steps
            processing_time_ms=round(elapsed_ms, 2)
        )
    except Exception as e:
        logger.error(f"[{request.session_id}] Error: {str(e)}")
        raise HTTPException(status_code=500, detail="Agent encountered an error. Please try again.")

@app.get("/agent/history/{session_id}")
def get_history(session_id: str):
    """Get conversation history for a session (simplified placeholder)."""
    return {"session_id": session_id, "history": []}
```
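The history endpoint above returns an empty list. One simple way to back it is an in-memory store keyed by `session_id` — a sketch with hypothetical names, suitable only for a single-process demo; a production service would use Redis or a database so history survives restarts and is shared across workers:

```python
# Hypothetical in-memory conversation store for the history endpoint.
# Single-process only; use Redis or a database in production.
from collections import defaultdict

class SessionStore:
    def __init__(self, max_turns: int = 50):
        self.max_turns = max_turns
        self._history = defaultdict(list)  # session_id -> list of turns

    def add_turn(self, session_id: str, question: str, answer: str) -> None:
        """Append a Q/A pair, keeping only the most recent max_turns."""
        self._history[session_id].append({"question": question, "answer": answer})
        self._history[session_id] = self._history[session_id][-self.max_turns:]

    def get(self, session_id: str) -> list:
        """Return a copy of the session's history (empty if unknown)."""
        return list(self._history[session_id])

store = SessionStore()
```

The `get_history` endpoint would then return `store.get(session_id)`, and `ask_agent` would call `store.add_turn(...)` after each successful response.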
Run the API Locally
```bash
# Start the server
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

# Test it:
# GET  http://localhost:8000/health
# POST http://localhost:8000/agent/ask
#   Body: {"question": "What is machine learning?"}
```
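A quick way to exercise the endpoint from Python using only the standard library — the base URL is an assumption matching the uvicorn command above, and the `build_request` helper is ours, not part of FastAPI:

```python
# Minimal stdlib client for the /agent/ask endpoint.
import json
import urllib.request

def build_request(question: str, session_id: str = "default",
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the JSON POST request that /agent/ask expects."""
    payload = json.dumps({"question": question, "session_id": session_id})
    return urllib.request.Request(
        f"{base_url}/agent/ask",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Requires the server from Step 1 to be running locally.
    with urllib.request.urlopen(build_request("What is machine learning?")) as resp:
        print(json.load(resp)["answer"])
```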
Step 2 — Rate Limiting
To prevent abuse and control costs, limit how many requests a single user can make per minute:
```python
# Install: pip install slowapi
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from fastapi import Request

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/agent/ask")
@limiter.limit("10/minute")  # Max 10 requests per minute per IP
async def ask_agent(request: Request, body: AgentRequest):
    ...
```
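Under the hood, a per-key limit like `10/minute` can be implemented as a sliding window of timestamps. A stdlib sketch of the idea (illustrative only — not slowapi's actual implementation):

```python
# Sliding-window rate limiter sketch: allow at most `limit` calls
# per `window_s` seconds for each key (e.g. a client IP address).
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    def __init__(self, limit: int = 10, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> deque of call timestamps

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Return True and record the call if under the limit, else False."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Keeping state in process memory means each worker enforces the limit independently; with multiple workers or machines, the counters belong in a shared store such as Redis.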
Step 3 — Environment Configuration for Production
```bash
# .env.production (separate from development .env)
OPENAI_API_KEY=sk-your-production-key
ENVIRONMENT=production
MAX_TOKENS=800
LOG_LEVEL=INFO
ALLOWED_ORIGINS=https://estudy247.com,https://app.estudy247.com
```
```python
# config.py
import os

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ENVIRONMENT = os.getenv("ENVIRONMENT", "development")
    MAX_TOKENS = int(os.getenv("MAX_TOKENS", "800"))
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    IS_PRODUCTION = ENVIRONMENT == "production"
```
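It is worth failing fast at startup if a required setting is missing, rather than on the first user request. A sketch extending the Config class above — the `validate` helper is our addition, not part of the original:

```python
# Fail-fast startup check: raise immediately if required settings
# are missing, instead of surfacing the error on the first request.
import os

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ENVIRONMENT = os.getenv("ENVIRONMENT", "development")
    IS_PRODUCTION = ENVIRONMENT == "production"

    @classmethod
    def validate(cls) -> None:
        """Raise RuntimeError listing any required settings that are unset."""
        missing = []
        if not cls.OPENAI_API_KEY:
            missing.append("OPENAI_API_KEY")
        if missing:
            raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
```

Calling `Config.validate()` at import time in `api.py` turns a misconfigured deployment into an immediate, visible crash instead of a stream of 500s.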
Step 4 — Deploy to the Cloud
Option A — Deploy to Railway (Easiest)
```bash
# 1. Create a Procfile (single quotes so $PORT is not expanded locally)
echo 'web: uvicorn api:app --host 0.0.0.0 --port $PORT' > Procfile

# 2. Create requirements.txt
pip freeze > requirements.txt

# 3. Deploy
# - Go to railway.app
# - Connect your GitHub repo
# - Add environment variables (OPENAI_API_KEY, etc.)
# - Click Deploy
```
Option B — Deploy to AWS EC2
```bash
# On the EC2 server:
git clone your-repo-url
cd your-repo
pip install -r requirements.txt

# Install process manager
pip install gunicorn

# Run Gunicorn managing Uvicorn ASGI workers (production setup)
gunicorn api:app -w 4 -k uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000 --daemon

# Use Nginx as reverse proxy (recommended)
```
Option C — Deploy to Google Cloud Run (Serverless)
Create a Dockerfile:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8080"]
```

Then deploy:

```bash
gcloud run deploy ai-agent --source . --platform managed --region asia-south1
```
Step 5 — Monitoring and Observability
```python
# monitoring.py — Track token usage and performance
import json
from datetime import datetime

class AgentMonitor:
    def __init__(self, log_file: str = "metrics.jsonl"):
        self.log_file = log_file

    def log_request(self, session_id: str, question: str,
                    answer: str, tokens_used: int,
                    latency_ms: float, success: bool):
        """Log every agent request for monitoring."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "session_id": session_id,
            "question_len": len(question),
            "answer_len": len(answer),
            "tokens_used": tokens_used,
            "latency_ms": latency_ms,
            "success": success
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(record) + "\n")

    def get_daily_stats(self) -> dict:
        """Calculate daily token usage and costs."""
        records = []
        try:
            with open(self.log_file, "r") as f:
                for line in f:
                    records.append(json.loads(line))
        except FileNotFoundError:
            return {}

        today = datetime.utcnow().strftime("%Y-%m-%d")
        today_records = [r for r in records if r["timestamp"].startswith(today)]
        total_tokens = sum(r["tokens_used"] for r in today_records)
        # GPT-4o: $5 per 1M input tokens, $15 per 1M output tokens
        estimated_cost = (total_tokens / 1_000_000) * 10  # Rough average
        return {
            "date": today,
            "total_requests": len(today_records),
            "total_tokens": total_tokens,
            "estimated_cost": f"${estimated_cost:.4f}",
            "avg_latency_ms": sum(r["latency_ms"] for r in today_records) / max(len(today_records), 1)
        }
```
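The blended $10-per-million figure above is a rough shortcut. When the API response reports input and output token counts separately (the OpenAI API returns both in its usage field), cost can be estimated per direction. A sketch — the prices are illustrative assumptions and should be checked against the provider's current price list:

```python
# Per-direction cost estimate. Prices are illustrative assumptions
# (USD per 1M tokens); always check the provider's current pricing.
PRICES_PER_M = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    p = PRICES_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Because output tokens typically cost several times more than input tokens, tracking the two separately makes cost spikes (e.g. unusually long answers) much easier to diagnose than a blended average does.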
Production Checklist
| Item | Status |
|---|---|
| API keys stored in environment variables (not in code) | ✅ Essential |
| Health check endpoint available | ✅ Essential |
| Input validation on all endpoints | ✅ Essential |
| Rate limiting implemented | ✅ Essential |
| Error handling returns safe messages | ✅ Essential |
| Logging set up for all requests | ✅ Essential |
| Token usage monitoring active | ⚠️ Recommended |
| CORS configured for allowed origins only | ⚠️ Recommended |
| CI/CD pipeline for automated deployment | ⚠️ Recommended |
| Evaluation suite runs on every deployment | ⚠️ Recommended |
Summary
Deploying an AI Agent to production means wrapping it in a web service (here, FastAPI) and adding input validation, error handling, rate limiting, and logging. The service can then be hosted on any cloud provider, such as Railway, AWS, or Google Cloud Run. Production deployments should always monitor token usage and latency, since both directly affect cost and user experience. With a proper deployment setup, an AI Agent can serve thousands of users reliably and safely.
