RAG – Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a technique that gives an AI Agent access to a private, custom knowledge base — so it can answer questions using specific documents, files, or data that were not part of the LLM's original training.
RAG is one of the most widely used patterns in real-world AI applications today. It is how companies build Q&A systems over their own documents, policies, manuals, and databases.
The Problem RAG Solves
LLMs have two major knowledge limitations:
- Training cutoff: They only know information from their training data (up to a certain date)
- Private data: They have no knowledge of internal company documents, policies, customer records, etc.
RAG solves both by retrieving relevant information at the time of the query and feeding it to the LLM as part of the prompt.
Without RAG
User: "What is the refund policy in our company handbook?"
LLM: "I don't have access to your company's internal documents.
I cannot answer this question."
With RAG
User: "What is the refund policy in our company handbook?"
RAG System:
1. Searches company handbook for "refund policy"
2. Retrieves relevant paragraph: "All refunds must be requested
within 14 days. Products in original packaging qualify for
a full refund..."
3. Sends retrieved text to LLM with the question
LLM: "According to the company handbook, all refund requests
must be submitted within 14 days of purchase. Items must
be in their original packaging to qualify for a full refund."
How RAG Works — Step by Step
Phase 1 — Indexing (Done Once, Before Queries)
1. Load documents (PDF, DOCX, website, database, etc.)
2. Split documents into small chunks (e.g., 500 words each)
3. Convert each chunk into a vector (embedding)
4. Store all vectors in a vector database
Phase 2 — Retrieval (At Query Time)
1. User sends a question
2. Question is converted into a vector (embedding)
3. Vector database finds the most similar document chunks
4. Top matching chunks are retrieved
Phase 3 — Generation
1. Retrieved chunks + user question are combined into a prompt
2. Prompt is sent to the LLM
3. LLM generates an answer grounded in the retrieved context
What Are Embeddings?
An embedding is a list of numbers that represents the meaning of a text. Similar texts have similar embeddings (numbers close together), and dissimilar texts have very different embeddings.
"What is machine learning?" → Embedding: [0.12, -0.45, 0.87, 0.23, ...] (hundreds of numbers) "Machine learning is a type of AI" → Embedding: [0.11, -0.44, 0.89, 0.21, ...] (very similar numbers) "The weather is nice today" → Embedding: [-0.67, 0.31, -0.12, 0.78, ...] (very different numbers)
When a user asks a question, its embedding is compared against all stored document embeddings to find the most relevant chunks — this is called semantic search.
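Semantic search boils down to comparing vectors, most commonly with cosine similarity. Here is a minimal sketch using the toy four-dimensional vectors from the example above (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, near 0 or negative means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" from the example above
question  = [0.12, -0.45, 0.87, 0.23]   # "What is machine learning?"
related   = [0.11, -0.44, 0.89, 0.21]   # "Machine learning is a type of AI"
unrelated = [-0.67, 0.31, -0.12, 0.78]  # "The weather is nice today"

print(cosine_similarity(question, related))    # ≈ 0.9995 — very similar
print(cosine_similarity(question, unrelated))  # ≈ -0.13 — unrelated
```

A vector database does exactly this comparison, but over millions of stored embeddings using an index that avoids checking every vector one by one.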
Building a Simple RAG System
# rag_system.py
# Install required packages:
# pip install openai chromadb python-dotenv
import os
import json
from dotenv import load_dotenv
import openai
import chromadb
load_dotenv()
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# ─── 1. Setup Vector Database ─────────────────────────────────────
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="company_knowledge_base"
)
# ─── 2. Embedding Function ────────────────────────────────────────
def get_embedding(text: str) -> list:
    """Convert text into a vector embedding using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
# ─── 3. Index Documents ───────────────────────────────────────────
def index_documents(documents: list[dict]):
    """
    Store documents in the vector database.
    Each document: {"id": "...", "text": "...", "metadata": {...}}
    """
    for doc in documents:
        embedding = get_embedding(doc["text"])
        collection.add(
            ids=[doc["id"]],
            embeddings=[embedding],
            documents=[doc["text"]],
            metadatas=[doc.get("metadata", {})]
        )
    print(f"✅ Indexed {len(documents)} documents")
# ─── 4. Retrieve Relevant Chunks ─────────────────────────────────
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Find the most relevant document chunks for a query."""
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]  # Return list of matching texts
# ─── 5. Generate Answer with Context ─────────────────────────────
def rag_query(question: str) -> str:
    """Answer a question using retrieved context from documents."""
    # Retrieve relevant chunks
    relevant_chunks = retrieve(question)
    context = "\n\n---\n\n".join(relevant_chunks)

    # Build the RAG prompt
    rag_prompt = f"""Answer the user's question using ONLY the information
provided in the context below. If the answer is not found in the context,
say "I could not find this information in the available documents."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

    # Call LLM with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
            {"role": "user", "content": rag_prompt}
        ],
        temperature=0.1,
        max_tokens=600
    )
    return response.choices[0].message.content
# ─── Test the RAG System ──────────────────────────────────────────
if __name__ == "__main__":
    # Sample company policy documents
    documents = [
        {
            "id": "policy_001",
            "text": "Refund Policy: Customers may request a refund within 14 days "
                    "of purchase. Items must be in original packaging and unused. "
                    "Digital products are non-refundable after download.",
            "metadata": {"source": "Company Handbook", "section": "Refunds"}
        },
        {
            "id": "policy_002",
            "text": "Shipping Policy: Standard delivery takes 5-7 business days. "
                    "Express delivery (1-2 days) is available for an additional fee. "
                    "Free shipping applies to orders above ₹1,000.",
            "metadata": {"source": "Company Handbook", "section": "Shipping"}
        },
        {
            "id": "policy_003",
            "text": "Customer Support: Support is available Monday to Saturday, "
                    "9 AM to 6 PM IST. Customers can reach us via email at "
                    "support@company.com or by calling 1800-XXX-XXXX.",
            "metadata": {"source": "Company Handbook", "section": "Support"}
        }
    ]

    # Step 1: Index the documents
    index_documents(documents)

    # Step 2: Ask questions
    print("\nQuestion 1:")
    print(rag_query("What is the refund policy?"))

    print("\nQuestion 2:")
    print(rag_query("How long does shipping take?"))

    print("\nQuestion 3:")
    print(rag_query("When can I contact customer support?"))
Expected Output
✅ Indexed 3 documents

Question 1:
Customers may request a refund within 14 days of purchase. Items must be in their original packaging and unused. Note that digital products are non-refundable once downloaded.

Question 2:
Standard delivery takes 5–7 business days. Express delivery (1–2 days) is available for an additional fee, and orders above ₹1,000 qualify for free shipping.

Question 3:
Customer support is available Monday to Saturday, from 9 AM to 6 PM IST. You can reach the team by email at support@company.com or by phone at 1800-XXX-XXXX.
Loading Real Documents
# Install: pip install pypdf
from pypdf import PdfReader
def load_pdf(file_path: str) -> list[dict]:
    """Load and chunk a PDF file into documents (one chunk per page)."""
    reader = PdfReader(file_path)
    documents = []
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if text.strip():  # Skip empty pages
            documents.append({
                "id": f"pdf_page_{page_num + 1}",
                "text": text,
                "metadata": {
                    "source": file_path,
                    "page": page_num + 1
                }
            })
    return documents
# Usage
docs = load_pdf("company_manual.pdf")
index_documents(docs)
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size chunking | Split into N words or characters | Simple documents, quick setup |
| Sentence chunking | Each sentence is a chunk | Short factual documents |
| Paragraph chunking | Each paragraph is a chunk | Articles, books, reports |
| Semantic chunking | Split by topic change | Complex, multi-topic documents |
| Overlapping chunking | Chunks share some text at borders | Preserving context across boundaries |
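The strategies above can be combined. As an illustration, here is a minimal sketch of fixed-size chunking with overlap; the 200-word chunk size and 50-word overlap are arbitrary example values, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word chunks, with neighbouring chunks
    sharing `overlap` words so context is preserved across boundaries."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the end of the text
    return chunks
```

Each chunk produced this way can then be passed through `get_embedding` and stored with `index_documents`, exactly as in the PDF example above.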
RAG in an AI Agent
# Add RAG as a tool in an AI Agent
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant information."""
    chunks = retrieve(query, top_k=3)
    if not chunks:
        return json.dumps({"result": "No relevant information found."})
    return json.dumps({"result": "\n\n".join(chunks)})
# Add to TOOL_MAP:
"search_knowledge_base": search_knowledge_base
Summary
RAG extends an LLM's knowledge by retrieving relevant information from a custom document store at query time. The core steps are: index documents as embeddings, retrieve similar chunks using semantic search, and feed those chunks to the LLM as context. RAG underpins most enterprise AI assistants, document Q&A systems, and knowledge-base chatbots. Combined with an AI Agent, it creates a system that can reason about, search, and answer questions over any private data.
