GenAI Retrieval-Augmented Generation
Large language models have a fixed knowledge cutoff and no access to private documents. Retrieval-Augmented Generation — commonly called RAG — solves both problems. It connects a language model to an external knowledge source, allowing it to retrieve relevant information before generating a response. The result is accurate, grounded answers based on current or private data.
The Problem RAG Solves
A standalone LLM has three core limitations when used in real-world applications:
- Knowledge cutoff: The model does not know about events or changes after its training ended
- Private data blindness: The model knows nothing about internal company documents, customer records, or proprietary data
- Hallucination: When unsure, the model fabricates plausible-sounding but incorrect answers
RAG addresses all three by retrieving real documents and placing them into the model's context before asking it to generate a response.
How RAG Works
RAG Pipeline
──────────────────────────────────────────────────────────────────────
Step 1: User asks a question
"What is our company's refund policy for digital products?"
│
▼
Step 2: Retriever searches the knowledge base
→ Searches the vector database for the most relevant policy documents
→ Returns: [Refund Policy v3.pdf — Section 4, paragraph 2]
│
▼
Step 3: Retrieved documents are added to the prompt
Augmented prompt:
"Using the following document excerpt, answer the user's question.
Document: [Refund Policy v3: Digital products are non-refundable
after download unless the file is corrupted or inaccessible...]
Question: What is our company's refund policy for digital products?"
│
▼
Step 4: LLM generates a grounded response
"Digital products cannot be refunded after download. The exception
applies only if the file is corrupted or cannot be accessed.
Please contact support at support@company.com for assistance."
──────────────────────────────────────────────────────────────────────
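The four steps above can be sketched end to end in plain Python. This is an illustrative stub, not a real system: `retrieve` scores documents by shared words instead of using a vector database, the sample documents are invented, and Step 4 (the actual LLM call) is left as a comment.

```python
def retrieve(query, documents, top_n=1):
    # Step 2 (stub): rank documents by word overlap with the query;
    # a real retriever would use semantic search over a vector database
    q = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_n]

def augment(query, chunks):
    # Step 3: place the retrieved text into the prompt
    context = "\n".join(chunks)
    return ("Using the following document excerpt, answer the user's question.\n"
            f"Document: {context}\n"
            f"Question: {query}")

docs = [
    "Refund Policy v3: Digital products are non-refundable after download "
    "unless the file is corrupted or inaccessible.",
    "Shipping Policy: Physical orders ship within 3 business days.",
]
question = "What is the refund policy for digital products?"
prompt = augment(question, retrieve(question, docs))
# Step 4 would send `prompt` to the LLM, which generates the grounded answer
```

Because the refund text is injected into the prompt, the model never has to recall the policy from its weights, which is the core of the RAG idea.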
The Two Core Components of RAG
1. The Retriever
The retriever searches through a knowledge base to find the most relevant documents for the user's query. Modern RAG systems use semantic search — the retriever finds documents that are conceptually similar to the query, not just keyword-matching ones.
Query: "refund policy digital downloads"
Keyword search result: Documents containing "refund" AND "digital"
Semantic search result: Documents about returns, reimbursements, and
                        digital product policies, even when the exact
                        keywords never appear
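The contrast can be shown with cosine similarity over vectors. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which have hundreds or thousands of dimensions), chosen only to illustrate why a document with no shared keywords can still be the best semantic match.

```python
import math

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

query_vec = [0.9, 0.1, 0.2]  # pretend embedding of "refund policy digital downloads"

doc_vecs = {
    "Reimbursements for purchased software": [0.85, 0.15, 0.25],
    "Office lunch menu for Friday":          [0.05, 0.90, 0.10],
}

# Keyword search: neither document title contains the word "refund"
keyword_hits = [d for d in doc_vecs if "refund" in d.lower()]

# Semantic search: the reimbursement document is nearest in vector space
best_match = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
```

Keyword search returns nothing here, while the vector comparison surfaces the reimbursement document, exactly the behavior described above.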
2. The Generator (LLM)
The generator takes the retrieved documents along with the original question and produces a final response. Because the relevant facts are provided directly in the prompt, the model generates accurate, grounded answers rather than hallucinating.
Building a RAG Knowledge Base
Indexing Phase (done once, before deployment):
──────────────────────────────────────────────────────────
Raw Documents (PDFs, web pages, Word files, databases)
│
▼
Chunking → Split each document into smaller pieces (200–500 words)
│
▼
Embedding → Convert each chunk into a vector (numerical representation)
│
▼
Vector Database → Store all chunk vectors for fast similarity search
──────────────────────────────────────────────────────────
Query Phase (happens at runtime):
──────────────────────────────────────────────────────────
User query → Converted into a vector using the same embedding model
│
▼
Vector DB → Find chunks with vectors closest to query vector
│
▼
Top N most relevant chunks returned to LLM
──────────────────────────────────────────────────────────
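Both phases can be sketched together. Everything here is a toy stand-in: `embed` is a bag-of-words count over a tiny fixed vocabulary (a real embedding model produces dense learned vectors), and the "vector database" is a plain Python list rather than a purpose-built store.

```python
import math

VOCAB = ["refund", "digital", "download", "shipping", "orders", "corrupted"]

def chunk(text, size=40):
    # Indexing: fixed-size chunking by word count
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Indexing: toy "embedding" counting vocabulary terms in the text
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

vector_db = []  # stands in for a real vector database

def index_document(text):
    for piece in chunk(text):
        vector_db.append((embed(piece), piece))

def search(query_text, top_n=2):
    # Query phase: embed the query with the SAME model, rank by similarity
    qv = embed(query_text)
    ranked = sorted(vector_db, key=lambda item: cosine(qv, item[0]), reverse=True)
    return [piece for _, piece in ranked[:top_n]]

index_document("Digital products are non-refundable after download unless the file is corrupted.")
index_document("Physical orders ship within three business days via standard shipping.")
```

The key detail the sketch preserves is that the query must pass through the same `embed` function as the chunks; mixing embedding models between indexing and querying breaks the similarity comparison.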
Document Chunking Strategies
| Chunking Method | How It Works | Best For |
|---|---|---|
| Fixed-size chunking | Split into chunks of N characters or tokens | General content; simple and fast |
| Sentence-based chunking | Keep sentence boundaries intact | Prose, articles, reports |
| Semantic chunking | Group sentences by meaning similarity | Long documents with varying topics |
| Hierarchical chunking | Store full section + summary + sentence level | Legal documents, technical manuals |
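The first two strategies in the table are simple enough to sketch directly. The defaults below are illustrative; real systems tune chunk sizes to the embedding model and content.

```python
import re

def fixed_size_chunks(text, size=300):
    # Fixed-size chunking: split on a word budget, ignoring sentence boundaries
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sentence_chunks(text, per_chunk=3):
    # Sentence-based chunking: group whole sentences so none is cut mid-way
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```

On the same input, fixed-size chunking can split a sentence across two chunks, while sentence-based chunking never does, which is why the latter suits prose, articles, and reports.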
Popular RAG Frameworks and Tools
- LangChain: End-to-end RAG pipeline builder with many integrations
- LlamaIndex: Specialized for document indexing and retrieval workflows
- Haystack: Enterprise-grade retrieval and generation pipeline
- Azure AI Search: Microsoft's managed search and vector store service
- Amazon Bedrock Knowledge Bases: AWS managed RAG solution
RAG vs Fine-Tuning — When to Use Each
| | RAG | Fine-Tuning |
|---|---|---|
| Knowledge type | External, dynamic | Encoded in weights |
| Updates | Easy — add documents | Requires retraining |
| Cost | Low (retrieval + LLM) | Medium–High (training) |
| Private data | Excellent | Risky (data in weights) |
| Real-time info | Yes (update database) | No |
| Style/behavior | Limited control | Strong control |
| Best for | Factual Q&A, search | Tone, format, domain vocab |
Advanced RAG Techniques
Hybrid Search
Combine semantic (vector) search with keyword (BM25) search for better recall across different query types.
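The blending step can be sketched as a weighted sum of the two scores. `keyword_score` here is a crude stand-in for BM25, and the vector similarities are assumed to come precomputed from the vector database; `alpha` and the sample documents are illustrative.

```python
def keyword_score(query, doc):
    # Stand-in for BM25: fraction of query words found in the document
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, vector_scores, alpha=0.5):
    # vector_scores: {doc: semantic similarity in [0, 1]} from the vector side;
    # alpha blends the signals (0 = keyword only, 1 = vector only)
    return sorted(
        vector_scores,
        key=lambda doc: alpha * vector_scores[doc]
                        + (1 - alpha) * keyword_score(query, doc),
        reverse=True,
    )

scores = {
    "reimbursement rules for software purchases": 0.82,  # semantic match only
    "refund policy for digital downloads": 0.55,         # exact keyword match
}
ranking = hybrid_rank("refund policy digital downloads", scores)
```

With `alpha=0.5` the exact-keyword document wins despite its lower vector score, showing how the keyword side catches queries that semantic search alone under-ranks.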
Re-ranking
After initial retrieval, a second model re-ranks the retrieved chunks by relevance before passing them to the LLM.
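The two-stage shape looks like this. The `overlap_score` function is a toy placeholder; in practice the second stage is a call to a cross-encoder model that is too expensive to run over the whole knowledge base but fine for a handful of candidates.

```python
def overlap_score(query, doc):
    # Toy relevance score; a real re-ranker is a cross-encoder model call
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank(query, candidates, score_fn, keep=2):
    # Second stage: rescore first-pass candidates with the stronger scorer
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]

first_pass = [
    "notes on shipping times",
    "summary of refund windows",
    "refund exceptions for digital files",
]
top = rerank("refund rules for digital files", first_pass, overlap_score)
```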
Parent-Child Chunking
Retrieve small chunks for precision, but send the full parent section to the LLM for richer context.
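A minimal sketch of the idea, with invented section text: small child chunks are what the retriever matches against, but the lookup returns the full parent section.

```python
# Parents hold the full section text that actually goes to the LLM
parents = {
    "refund-sec4": ("Section 4: Digital products are non-refundable after "
                    "download unless the file is corrupted or inaccessible. "
                    "Report corrupted files to support for a replacement."),
}

# Children are the small, precise chunks the retriever matches against,
# each tagged with its parent's id
children = [
    ("refund-sec4", "digital products are non-refundable after download"),
    ("refund-sec4", "report corrupted files to support for a replacement"),
]

def retrieve_context(query):
    q = set(query.lower().split())
    # Match against the small child chunks (word overlap stands in for
    # vector search)...
    parent_id, _ = max(children, key=lambda c: len(q & set(c[1].split())))
    # ...but return the whole parent section for richer context
    return parents[parent_id]
```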
Query Rewriting
Before searching, the LLM rewrites the user's question into a more search-friendly form to improve retrieval quality.
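In a real pipeline the LLM itself produces the rewritten query; the rule-based stub below only shows where the step sits (rewrite first, then search with the cleaned-up query). The filler-word list is invented for the example.

```python
import string

FILLER = {"hey", "can", "i", "please", "get", "my", "so", "um"}

def rewrite_query(user_query):
    # Drop punctuation and conversational filler so the retriever sees
    # only content-bearing terms; an LLM rewrite would also expand
    # abbreviations and resolve pronouns
    words = [w.strip(string.punctuation) for w in user_query.lower().split()]
    return " ".join(w for w in words if w and w not in FILLER)
```

A query like "Hey, can I please get my money back?" becomes "money back", a form far more likely to match the relevant policy chunks.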
Real-World RAG Applications
| Use Case | Knowledge Source | Benefit |
|---|---|---|
| Internal HR chatbot | Employee handbook, HR policies | Accurate policy answers without leaking training data |
| Customer support bot | Product documentation, FAQs | Grounded answers, fewer hallucinations |
| Legal research assistant | Case law, contracts, regulations | Cites specific documents, reduces fabricated citations |
| Medical information tool | Clinical guidelines, drug databases | Up-to-date, source-referenced answers |
RAG relies heavily on converting text into vectors — a concept called embeddings. The next topic explains what embeddings are and how vector databases store and search them efficiently.
