Deploying AI Agents to Production

Building an agent that works on a local machine is one milestone. Deploying it so real users can access it reliably, securely, and at scale is the next. This guide covers everything needed to take an AI Agent from a local Python script to a production-ready web service.

What Does "Production" Mean for an AI Agent?

A production AI Agent must be:

  • Accessible: Available via an API or web interface
  • Reliable: Handles errors gracefully and recovers automatically
  • Secure: API keys are protected; inputs are validated
  • Scalable: Handles many users simultaneously
  • Observable: Logs every request, response, and error
  • Cost-managed: Monitors token usage to avoid unexpected bills

Production Architecture Overview

[User / Frontend App]
         |
    HTTP Request
         |
         ▼
[FastAPI Web Server]    ← Handles requests, validates input
         |
         ▼
[Agent Runner]          ← Runs the agent loop
         |
    ┌────┴─────┐
    ▼          ▼
[OpenAI API]  [Tools]   ← External services
    │          │
    └────┬─────┘
         ▼
[Response Returned]
         |
[Logging + Monitoring]  ← Records everything for analysis

Step 1 — Build the Agent as an API Using FastAPI

# Install: pip install fastapi uvicorn pydantic

# api.py

import os
import time
import logging
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from agent import run_agent

load_dotenv()

# ─── Setup Logging ────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[
        logging.FileHandler("agent_logs.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# ─── FastAPI App ──────────────────────────────────────────────────
app = FastAPI(
    title="AI Agent API",
    description="A production-ready AI Agent accessible via REST API",
    version="1.0.0"
)

# Allow cross-origin requests (so a frontend can call this API)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production: restrict to known domains
    allow_methods=["POST", "GET"],
    allow_headers=["*"]
)

# ─── Request / Response Models ────────────────────────────────────
class AgentRequest(BaseModel):
    question: str
    session_id: str = "default"
    max_steps: int = 5

class AgentResponse(BaseModel):
    answer: str
    session_id: str
    steps_taken: int
    processing_time_ms: float


# ─── API Endpoints ────────────────────────────────────────────────
@app.get("/health")
def health_check():
    """Health check endpoint — returns OK if the server is running."""
    return {"status": "ok", "service": "AI Agent API"}


@app.post("/agent/ask", response_model=AgentResponse)
async def ask_agent(request: AgentRequest):
    """Submit a question to the AI Agent and get a response."""

    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")

    if len(request.question) > 2000:
        raise HTTPException(status_code=400, detail="Question too long (max 2000 characters)")

    start_time = time.time()

    try:
        logger.info(f"[{request.session_id}] Question: {request.question[:100]}...")

        # Run the agent (run_agent is synchronous; under heavy load, consider
        # running it in a thread pool so it doesn't block the event loop)
        answer = run_agent(request.question, verbose=False)

        elapsed_ms = (time.time() - start_time) * 1000

        logger.info(f"[{request.session_id}] Answer in {elapsed_ms:.0f}ms: {answer[:100]}...")

        return AgentResponse(
            answer=answer,
            session_id=request.session_id,
            steps_taken=3,  # In production: track actual steps
            processing_time_ms=round(elapsed_ms, 2)
        )

    except Exception as e:
        logger.error(f"[{request.session_id}] Error: {str(e)}")
        raise HTTPException(status_code=500, detail="Agent encountered an error. Please try again.")


@app.get("/agent/history/{session_id}")
def get_history(session_id: str):
    """Get conversation history for a session (simplified placeholder)."""
    return {"session_id": session_id, "history": []}
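The history endpoint above is a placeholder that always returns an empty list. A minimal in-memory backing store might look like the sketch below (hypothetical helper names; the store is lost on restart and is not shared across multiple worker processes — a real deployment would use Redis or a database):

```python
from collections import defaultdict

# In-memory store: session_id -> list of question/answer turns
session_history = defaultdict(list)

def record_turn(session_id: str, question: str, answer: str) -> None:
    """Append one Q/A turn; call this after run_agent() succeeds."""
    session_history[session_id].append({"question": question, "answer": answer})

def get_session_history(session_id: str) -> dict:
    """Backing logic for the /agent/history/{session_id} endpoint."""
    return {"session_id": session_id, "history": session_history[session_id]}
```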

Run the API Locally

# Start the server (--reload is for development only; drop it in production)
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

# Test it:
# GET  http://localhost:8000/health
# POST http://localhost:8000/agent/ask
#      Body: {"question": "What is machine learning?"}

Step 2 — Rate Limiting

To prevent abuse and control costs, limit how many requests a single user can make per minute:

# Install: pip install slowapi

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from fastapi import Request

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/agent/ask")
@limiter.limit("10/minute")  # Max 10 requests per minute per IP
async def ask_agent(request: Request, body: AgentRequest):
    ...
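Under the hood, rate limiting amounts to counting requests per client key inside a time window. The toy sliding-window version below is illustrative only (slowapi handles this, plus shared storage backends, for you):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 10

# client key (e.g. IP address) -> timestamps of recent requests
_request_log: dict = defaultdict(list)

def allow_request(client_ip: str, now: float = None) -> bool:
    """Return True if this client is under 10 requests in the last 60s."""
    now = time.time() if now is None else now
    # Keep only timestamps still inside the window
    recent = [t for t in _request_log[client_ip] if now - t < WINDOW_SECONDS]
    _request_log[client_ip] = recent
    if len(recent) >= MAX_REQUESTS:
        return False  # over the limit; caller should return HTTP 429
    _request_log[client_ip].append(now)
    return True
```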

Step 3 — Environment Configuration for Production

# .env.production (separate from development .env)
OPENAI_API_KEY=sk-your-production-key
ENVIRONMENT=production
MAX_TOKENS=800
LOG_LEVEL=INFO
ALLOWED_ORIGINS=https://estudy247.com,https://app.estudy247.com

# config.py
import os

class Config:
    OPENAI_API_KEY  = os.getenv("OPENAI_API_KEY")
    ENVIRONMENT     = os.getenv("ENVIRONMENT", "development")
    MAX_TOKENS      = int(os.getenv("MAX_TOKENS", "800"))
    LOG_LEVEL       = os.getenv("LOG_LEVEL", "INFO")
    IS_PRODUCTION   = ENVIRONMENT == "production"
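It pays to validate configuration once at startup rather than failing mid-request. A minimal fail-fast check, sketched against the attribute names in the Config class above:

```python
import os

class Config:
    """Mirrors the Config class above (same attribute names assumed)."""
    def __init__(self):
        self.OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
        self.MAX_TOKENS = int(os.getenv("MAX_TOKENS", "800"))

def validate_config(cfg) -> None:
    """Raise at startup if required settings are missing or invalid."""
    if not cfg.OPENAI_API_KEY:
        raise RuntimeError("OPENAI_API_KEY is not set")
    if cfg.MAX_TOKENS <= 0:
        raise RuntimeError("MAX_TOKENS must be positive")
```

Call validate_config() before creating the FastAPI app so a misconfigured deployment dies immediately with a clear message instead of returning 500s to users.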

Step 4 — Deploy to the Cloud

Option A — Deploy to Railway (Easiest)

# 1. Create a Procfile
echo "web: uvicorn api:app --host 0.0.0.0 --port $PORT" > Procfile

# 2. Create requirements.txt
pip freeze > requirements.txt

# 3. Deploy
# - Go to railway.app
# - Connect GitHub repo
# - Add environment variables (OPENAI_API_KEY, etc.)
# - Click Deploy

Option B — Deploy to AWS EC2

# On the EC2 server:
git clone your-repo-url
cd your-repo
pip install -r requirements.txt

# Install process manager
pip install gunicorn

# Run with Gunicorn managing Uvicorn workers (production ASGI setup)
gunicorn api:app -w 4 -k uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 --daemon

# Use Nginx as reverse proxy (recommended)

Option C — Deploy to Google Cloud Run (Serverless)

# Create Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8080"]

# Deploy
gcloud run deploy ai-agent --source . --platform managed --region asia-south1

Step 5 — Monitoring and Observability

# monitoring.py — Track token usage and performance

import json
from datetime import datetime

class AgentMonitor:
    def __init__(self, log_file: str = "metrics.jsonl"):
        self.log_file = log_file

    def log_request(self, session_id: str, question: str,
                    answer: str, tokens_used: int,
                    latency_ms: float, success: bool):
        """Log every agent request for monitoring."""
        record = {
            "timestamp":   datetime.utcnow().isoformat(),
            "session_id":  session_id,
            "question_len": len(question),
            "answer_len":  len(answer),
            "tokens_used": tokens_used,
            "latency_ms":  latency_ms,
            "success":     success
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(record) + "\n")

    def get_daily_stats(self) -> dict:
        """Calculate daily token usage and costs."""
        records = []
        try:
            with open(self.log_file, "r") as f:
                for line in f:
                    records.append(json.loads(line))
        except FileNotFoundError:
            return {}

        today = datetime.utcnow().strftime("%Y-%m-%d")
        today_records = [r for r in records if r["timestamp"].startswith(today)]

        total_tokens = sum(r["tokens_used"] for r in today_records)
        # Example prices (check your provider's pricing page; rates change):
        # GPT-4o was ~$5 per 1M input tokens and ~$15 per 1M output tokens.
        estimated_cost = (total_tokens / 1_000_000) * 10  # Rough blended average

        return {
            "date":           today,
            "total_requests": len(today_records),
            "total_tokens":   total_tokens,
            "estimated_cost": f"${estimated_cost:.4f}",
            "avg_latency_ms": sum(r["latency_ms"] for r in today_records) / max(len(today_records), 1)
        }
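The blended rate used above is crude. Once input and output tokens are tracked separately (the OpenAI response's usage field reports both), cost can be estimated per direction. The default prices below are illustrative placeholders, not current rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 5.0,
                  output_price_per_m: float = 15.0) -> float:
    """Estimate USD cost given separate per-million-token prices.

    Default prices are examples only; check the provider's pricing page.
    """
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)

print(f"${estimate_cost(100_000, 20_000):.2f}")  # → $0.80
```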

Production Checklist

Item                                                      Status
-------------------------------------------------------   --------------
API keys stored in environment variables (not in code)    ✅ Essential
Health check endpoint available                           ✅ Essential
Input validation on all endpoints                          ✅ Essential
Rate limiting implemented                                  ✅ Essential
Error handling returns safe messages                       ✅ Essential
Logging set up for all requests                            ✅ Essential
Token usage monitoring active                              ⚠️ Recommended
CORS configured for allowed origins only                   ⚠️ Recommended
CI/CD pipeline for automated deployment                    ⚠️ Recommended
Evaluation suite runs on every deployment                  ⚠️ Recommended

Summary

Deploying an AI Agent to production requires wrapping it in a FastAPI web service, adding input validation, error handling, rate limiting, and logging. The agent can then be hosted on any cloud provider — Railway, AWS, or Google Cloud. Production deployments should always include monitoring for token usage and latency, since these directly affect cost and user experience. With a proper deployment setup, an AI Agent can serve thousands of users reliably and safely.
