RAG – Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique that gives an AI Agent access to a private, custom knowledge base — so it can answer questions using specific documents, files, or data that were not part of the LLM's original training.

RAG is one of the most widely used patterns in real-world AI applications today. It is how companies build Q&A systems over their own documents, policies, manuals, and databases.

The Problem RAG Solves

LLMs have two major knowledge limitations:

  • Training cutoff: They only know information from their training data (up to a certain date)
  • Private data: They have no knowledge of internal company documents, policies, customer records, etc.

RAG solves both by retrieving relevant information at the time of the query and feeding it to the LLM as part of the prompt.

Without RAG

User: "What is the refund policy in our company handbook?"
LLM:  "I don't have access to your company's internal documents.
       I cannot answer this question."

With RAG

User: "What is the refund policy in our company handbook?"
RAG System:
  1. Searches company handbook for "refund policy"
  2. Retrieves relevant paragraph: "All refunds must be requested
     within 14 days. Products in original packaging qualify for
     a full refund..."
  3. Sends retrieved text to LLM with the question

LLM:  "According to the company handbook, all refund requests
       must be submitted within 14 days of purchase. Items must
       be in their original packaging to qualify for a full refund."

How RAG Works — Step by Step

Phase 1 — Indexing (Done Once, Before Queries)

1. Load documents (PDF, DOCX, website, database, etc.)
2. Split documents into small chunks (e.g., 500 words each)
3. Convert each chunk into a vector (embedding)
4. Store all vectors in a vector database

Phase 2 — Retrieval (At Query Time)

1. User sends a question
2. Question is converted into a vector (embedding)
3. Vector database finds the most similar document chunks
4. Top matching chunks are retrieved

Phase 3 — Generation

1. Retrieved chunks + user question are combined into a prompt
2. Prompt is sent to the LLM
3. LLM generates an answer grounded in the retrieved context

What Are Embeddings?

An embedding is a list of numbers that represents the meaning of a text. Similar texts have similar embeddings (numbers close together), and dissimilar texts have very different embeddings.

"What is machine learning?"
→ Embedding: [0.12, -0.45, 0.87, 0.23, ...]  (hundreds of numbers)

"Machine learning is a type of AI"
→ Embedding: [0.11, -0.44, 0.89, 0.21, ...]  (very similar numbers)

"The weather is nice today"
→ Embedding: [-0.67, 0.31, -0.12, 0.78, ...]  (very different numbers)

When a user asks a question, its embedding is compared against all stored document embeddings to find the most relevant chunks — this is called semantic search.
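Under the hood, "similar" usually means a high cosine similarity between the two vectors. Here is a minimal sketch of that comparison, using the toy 4-dimensional vectors from above (real embeddings such as those from text-embedding-3-small have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how close two embeddings are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" from the examples above
ml_question  = [0.12, -0.45, 0.87, 0.23]   # "What is machine learning?"
ml_statement = [0.11, -0.44, 0.89, 0.21]   # "Machine learning is a type of AI"
weather      = [-0.67, 0.31, -0.12, 0.78]  # "The weather is nice today"

print(cosine_similarity(ml_question, ml_statement))  # close to 1.0
print(cosine_similarity(ml_question, weather))       # much lower (negative here)
```

A vector database performs exactly this kind of comparison, but over millions of stored vectors using optimized index structures instead of a brute-force loop.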

Building a Simple RAG System

# rag_system.py
# Install required packages:
# pip install openai chromadb python-dotenv

import os
import json
from dotenv import load_dotenv
import openai
import chromadb

load_dotenv()
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# ─── 1. Setup Vector Database ─────────────────────────────────────
# In-memory client — data is lost when the process exits.
# Use chromadb.PersistentClient(path="./chroma_db") to persist to disk.
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="company_knowledge_base"
)

# ─── 2. Embedding Function ────────────────────────────────────────
def get_embedding(text: str) -> list:
    """Convert text into a vector embedding using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


# ─── 3. Index Documents ───────────────────────────────────────────
def index_documents(documents: list[dict]):
    """
    Store documents in the vector database.
    Each document: {"id": "...", "text": "...", "metadata": {...}}
    """
    for doc in documents:
        embedding = get_embedding(doc["text"])
        collection.add(
            ids=[doc["id"]],
            embeddings=[embedding],
            documents=[doc["text"]],
            metadatas=[doc.get("metadata", {})]
        )
    print(f"✅ Indexed {len(documents)} documents")


# ─── 4. Retrieve Relevant Chunks ─────────────────────────────────
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Find the most relevant document chunks for a query."""
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]  # Return list of matching texts


# ─── 5. Generate Answer with Context ─────────────────────────────
def rag_query(question: str) -> str:
    """Answer a question using retrieved context from documents."""

    # Retrieve relevant chunks
    relevant_chunks = retrieve(question)
    context = "\n\n---\n\n".join(relevant_chunks)

    # Build the RAG prompt
    rag_prompt = f"""Answer the user's question using ONLY the information
provided in the context below. If the answer is not found in the context,
say "I could not find this information in the available documents."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

    # Call LLM with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
            {"role": "user",   "content": rag_prompt}
        ],
        temperature=0.1,
        max_tokens=600
    )

    return response.choices[0].message.content


# ─── Test the RAG System ──────────────────────────────────────────
if __name__ == "__main__":

    # Sample company policy documents
    documents = [
        {
            "id": "policy_001",
            "text": "Refund Policy: Customers may request a refund within 14 days "
                    "of purchase. Items must be in original packaging and unused. "
                    "Digital products are non-refundable after download.",
            "metadata": {"source": "Company Handbook", "section": "Refunds"}
        },
        {
            "id": "policy_002",
            "text": "Shipping Policy: Standard delivery takes 5-7 business days. "
                    "Express delivery (1-2 days) is available for an additional fee. "
                    "Free shipping applies to orders above ₹1,000.",
            "metadata": {"source": "Company Handbook", "section": "Shipping"}
        },
        {
            "id": "policy_003",
            "text": "Customer Support: Support is available Monday to Saturday, "
                    "9 AM to 6 PM IST. Customers can reach us via email at "
                    "support@company.com or by calling 1800-XXX-XXXX.",
            "metadata": {"source": "Company Handbook", "section": "Support"}
        }
    ]

    # Step 1: Index the documents
    index_documents(documents)

    # Step 2: Ask questions
    print("\nQuestion 1:")
    print(rag_query("What is the refund policy?"))

    print("\nQuestion 2:")
    print(rag_query("How long does shipping take?"))

    print("\nQuestion 3:")
    print(rag_query("When can I contact customer support?"))

Expected Output

✅ Indexed 3 documents

Question 1:
Customers may request a refund within 14 days of purchase. Items must be in 
their original packaging and unused. Note that digital products are non-refundable 
once downloaded.

Question 2:
Standard delivery takes 5–7 business days. Express delivery (1–2 days) is 
available for an additional fee, and orders above ₹1,000 qualify for free shipping.

Question 3:
Customer support is available Monday to Saturday, from 9 AM to 6 PM IST. 
You can reach the team by email at support@company.com or by phone at 1800-XXX-XXXX.

Loading Real Documents

# Install: pip install pypdf

from pypdf import PdfReader

def load_pdf(file_path: str) -> list[dict]:
    """Load and chunk a PDF file into documents."""
    reader = PdfReader(file_path)
    documents = []

    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if text.strip():  # Skip empty pages
            documents.append({
                "id": f"pdf_page_{page_num + 1}",
                "text": text,
                "metadata": {
                    "source": file_path,
                    "page": page_num + 1
                }
            })

    return documents

# Usage
docs = load_pdf("company_manual.pdf")
index_documents(docs)

Chunking Strategies

Strategy              | How It Works                       | Best For
----------------------|------------------------------------|--------------------------------------
Fixed-size chunking   | Split into N words or characters   | Simple documents, quick setup
Sentence chunking     | Each sentence is a chunk           | Short factual documents
Paragraph chunking    | Each paragraph is a chunk          | Articles, books, reports
Semantic chunking     | Split by topic change              | Complex, multi-topic documents
Overlapping chunking  | Chunks share some text at borders  | Preserving context across boundaries
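The first and last strategies combine naturally. Here is a minimal sketch of a fixed-size word chunker with overlap (the function name and default sizes are illustrative choices, not from any library):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into fixed-size word chunks that share `overlap` words
    at their borders, so context is preserved across chunk boundaries."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already covers the end of the text
    return chunks
```

Each chunk produced this way can be passed straight to index_documents() from the earlier example, with an id derived from its position in the source document.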

RAG in an AI Agent

# Add RAG as a tool in an AI Agent

def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant information."""
    chunks = retrieve(query, top_k=3)
    if not chunks:
        return json.dumps({"result": "No relevant information found."})
    return json.dumps({"result": "\n\n".join(chunks)})

# Add to TOOL_MAP:
"search_knowledge_base": search_knowledge_base

Summary

RAG extends an LLM's knowledge by retrieving relevant information from a custom document store at query time. The core steps are: index documents as embeddings, retrieve similar chunks using semantic search, and feed those chunks to the LLM as context. RAG underpins most enterprise AI assistants, document Q&A systems, and knowledge-base chatbots. Combined with an AI Agent, it creates a system that can reason about, search, and answer questions over any private data.
