GenAI Retrieval-Augmented Generation
Large language models have a fixed knowledge cutoff and no access to private documents. Retrieval-Augmented Generation — commonly called RAG — solves both problems. It connects a language model to an external knowledge source, allowing it to retrieve relevant information before generating a response. The result is accurate, grounded answers based on current or private data.
The Problem RAG Solves
A standalone LLM has three core limitations when used in real-world applications:
- Knowledge cutoff: The model does not know about events or changes after its training ended
- Private data blindness: The model knows nothing about internal company documents, customer records, or proprietary data
- Hallucination: When unsure, the model fabricates plausible-sounding but incorrect answers
RAG addresses all three by retrieving real documents and placing them into the model's context before asking it to generate a response.
How RAG Works
RAG Pipeline
──────────────────────────────────────────────────────────────────────
Step 1: User asks a question
"What is our company's refund policy for digital products?"
│
▼
Step 2: Retriever searches the knowledge base
→ Searches the vector database for the most relevant policy documents
→ Returns: [Refund Policy v3.pdf — Section 4, paragraph 2]
│
▼
Step 3: Retrieved documents are added to the prompt
Augmented prompt:
"Using the following document excerpt, answer the user's question.
Document: [Refund Policy v3: Digital products are non-refundable
after download unless the file is corrupted or inaccessible...]
Question: What is our company's refund policy for digital products?"
│
▼
Step 4: LLM generates a grounded response
"Digital products cannot be refunded after download. The exception
applies only if the file is corrupted or cannot be accessed.
Please contact support at support@company.com for assistance."
──────────────────────────────────────────────────────────────────────
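The four steps above can be sketched end to end in plain Python. This is an illustrative stub, not a real system: `retrieve` scores documents by shared words instead of using a vector database, the sample documents are invented, and Step 4 (the actual LLM call) is left as a comment.

```python
def retrieve(query, documents, top_n=1):
    # Step 2 (stub): rank documents by word overlap with the query;
    # a real retriever would use semantic search over a vector database
    q = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_n]

def augment(query, chunks):
    # Step 3: place the retrieved text into the prompt
    context = "\n".join(chunks)
    return ("Using the following document excerpt, answer the user's question.\n"
            f"Document: {context}\n"
            f"Question: {query}")

docs = [
    "Refund Policy v3: Digital products are non-refundable after download "
    "unless the file is corrupted or inaccessible.",
    "Shipping Policy: Physical orders ship within 3 business days.",
]
question = "What is the refund policy for digital products?"
prompt = augment(question, retrieve(question, docs))
# Step 4 would send `prompt` to the LLM, which generates the grounded answer
```

Because the refund text is injected into the prompt, the model never has to recall the policy from its weights, which is the core of the RAG idea.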
The Two Core Components of RAG
1. The Retriever
The retriever searches through a knowledge base to find the most relevant documents for the user's query. Modern RAG systems use semantic search — the retriever finds documents that are conceptually similar to the query, not just keyword-matching ones.
Query: "refund policy digital downloads"
Keyword search result: Documents containing "refund" AND "digital"
Semantic search result: Documents about returns, reimbursements, and
                        digital product policies, even when the exact
                        keywords never appear
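The contrast can be shown with cosine similarity over vectors. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which have hundreds or thousands of dimensions), chosen only to illustrate why a document with no shared keywords can still be the best semantic match.

```python
import math

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

query_vec = [0.9, 0.1, 0.2]  # pretend embedding of "refund policy digital downloads"

doc_vecs = {
    "Reimbursements for purchased software": [0.85, 0.15, 0.25],
    "Office lunch menu for Friday":          [0.05, 0.90, 0.10],
}

# Keyword search: neither document title contains the word "refund"
keyword_hits = [d for d in doc_vecs if "refund" in d.lower()]

# Semantic search: the reimbursement document is nearest in vector space
best_match = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
```

Keyword search returns nothing here, while the vector comparison surfaces the reimbursement document, exactly the behavior described above.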
2. The Generator (LLM)
The generator takes the retrieved documents along with the original question and produces a final response. Because the relevant facts are provided directly in the prompt, the model generates accurate, grounded answers rather than hallucinating.
Building a RAG Knowledge Base
Indexing Phase (done once, before deployment):
──────────────────────────────────────────────────────────
Raw Documents (PDFs, web pages, Word files, databases)
│
▼
Chunking → Split each document into smaller pieces (200–500 words)
│
▼
Embedding → Convert each chunk into a vector (numerical representation)
│
▼
Vector Database → Store all chunk vectors for fast similarity search
──────────────────────────────────────────────────────────
Query Phase (happens at runtime):
──────────────────────────────────────────────────────────
User query → Converted into a vector using the same embedding model
│
▼
Vector DB → Find chunks with vectors closest to query vector
│
▼
Top N most relevant chunks returned to LLM
──────────────────────────────────────────────────────────
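Both phases can be sketched together. Everything here is a toy stand-in: `embed` is a bag-of-words count over a tiny fixed vocabulary (a real embedding model produces dense learned vectors), and the "vector database" is a plain Python list rather than a purpose-built store.

```python
import math

VOCAB = ["refund", "digital", "download", "shipping", "orders", "corrupted"]

def chunk(text, size=40):
    # Indexing: fixed-size chunking by word count
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Indexing: toy "embedding" counting vocabulary terms in the text
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

vector_db = []  # stands in for a real vector database

def index_document(text):
    for piece in chunk(text):
        vector_db.append((embed(piece), piece))

def search(query_text, top_n=2):
    # Query phase: embed the query with the SAME model, rank by similarity
    qv = embed(query_text)
    ranked = sorted(vector_db, key=lambda item: cosine(qv, item[0]), reverse=True)
    return [piece for _, piece in ranked[:top_n]]

index_document("Digital products are non-refundable after download unless the file is corrupted.")
index_document("Physical orders ship within three business days via standard shipping.")
```

The key detail the sketch preserves is that the query must pass through the same `embed` function as the chunks; mixing embedding models between indexing and querying breaks the similarity comparison.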
Document Chunking Strategies
| Chunking Method | How It Works | Best For |
|---|---|---|
| Fixed-size chunking | Split into chunks of N characters or tokens | General content; simple and fast |
| Sentence-based chunking | Keep sentence boundaries intact | Prose, articles, reports |
| Semantic chunking | Group sentences by meaning similarity | Long documents with varying topics |
| Hierarchical chunking | Store full section + summary + sentence level | Legal documents, technical manuals |
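The first two strategies in the table are simple enough to sketch directly. The defaults below are illustrative; real systems tune chunk sizes to the embedding model and content.

```python
import re

def fixed_size_chunks(text, size=300):
    # Fixed-size chunking: split on a word budget, ignoring sentence boundaries
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sentence_chunks(text, per_chunk=3):
    # Sentence-based chunking: group whole sentences so none is cut mid-way
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```

On the same input, fixed-size chunking can split a sentence across two chunks, while sentence-based chunking never does, which is why the latter suits prose, articles, and reports.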
Popular RAG Frameworks and Tools
- LangChain: End-to-end RAG pipeline builder with many integrations
- LlamaIndex: Specialized for document indexing and retrieval workflows
- Haystack: Enterprise-grade retrieval and generation pipeline
- Azure AI Search: Microsoft's managed search and vector store service
- Amazon Bedrock Knowledge Bases: AWS managed RAG solution
RAG vs Fine-Tuning — When to Use Each
| | RAG | Fine-Tuning |
|---|---|---|
| Knowledge type | External, dynamic | Encoded in weights |
| Updates | Easy — add documents | Requires retraining |
| Cost | Low (retrieval + LLM) | Medium–High (training) |
| Private data | Excellent | Risky (data in weights) |
| Real-time info | Yes (update database) | No |
| Style/behavior | Limited control | Strong control |
| Best for | Factual Q&A, search | Tone, format, domain vocab |
Advanced RAG Techniques
Hybrid Search
Combine semantic (vector) search with keyword (BM25) search for better recall across different query types.
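The blending step can be sketched as a weighted sum of the two scores. `keyword_score` here is a crude stand-in for BM25, and the vector similarities are assumed to come precomputed from the vector database; `alpha` and the sample documents are illustrative.

```python
def keyword_score(query, doc):
    # Stand-in for BM25: fraction of query words found in the document
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, vector_scores, alpha=0.5):
    # vector_scores: {doc: semantic similarity in [0, 1]} from the vector side;
    # alpha blends the signals (0 = keyword only, 1 = vector only)
    return sorted(
        vector_scores,
        key=lambda doc: alpha * vector_scores[doc]
                        + (1 - alpha) * keyword_score(query, doc),
        reverse=True,
    )

scores = {
    "reimbursement rules for software purchases": 0.82,  # semantic match only
    "refund policy for digital downloads": 0.55,         # exact keyword match
}
ranking = hybrid_rank("refund policy digital downloads", scores)
```

With `alpha=0.5` the exact-keyword document wins despite its lower vector score, showing how the keyword side catches queries that semantic search alone under-ranks.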
Re-ranking
After initial retrieval, a second model re-ranks the retrieved chunks by relevance before passing them to the LLM.
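The two-stage shape looks like this. The `overlap_score` function is a toy placeholder; in practice the second stage is a call to a cross-encoder model that is too expensive to run over the whole knowledge base but fine for a handful of candidates.

```python
def overlap_score(query, doc):
    # Toy relevance score; a real re-ranker is a cross-encoder model call
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank(query, candidates, score_fn, keep=2):
    # Second stage: rescore first-pass candidates with the stronger scorer
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]

first_pass = [
    "notes on shipping times",
    "summary of refund windows",
    "refund exceptions for digital files",
]
top = rerank("refund rules for digital files", first_pass, overlap_score)
```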
Parent-Child Chunking
Retrieve small chunks for precision, but send the full parent section to the LLM for richer context.
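A minimal sketch of the idea, with invented section text: small child chunks are what the retriever matches against, but the lookup returns the full parent section.

```python
# Parents hold the full section text that actually goes to the LLM
parents = {
    "refund-sec4": ("Section 4: Digital products are non-refundable after "
                    "download unless the file is corrupted or inaccessible. "
                    "Report corrupted files to support for a replacement."),
}

# Children are the small, precise chunks the retriever matches against,
# each tagged with its parent's id
children = [
    ("refund-sec4", "digital products are non-refundable after download"),
    ("refund-sec4", "report corrupted files to support for a replacement"),
]

def retrieve_context(query):
    q = set(query.lower().split())
    # Match against the small child chunks (word overlap stands in for
    # vector search)...
    parent_id, _ = max(children, key=lambda c: len(q & set(c[1].split())))
    # ...but return the whole parent section for richer context
    return parents[parent_id]
```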
Query Rewriting
Before searching, the LLM rewrites the user's question into a more search-friendly form to improve retrieval quality.
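In a real pipeline the LLM itself produces the rewritten query; the rule-based stub below only shows where the step sits (rewrite first, then search with the cleaned-up query). The filler-word list is invented for the example.

```python
import string

FILLER = {"hey", "can", "i", "please", "get", "my", "so", "um"}

def rewrite_query(user_query):
    # Drop punctuation and conversational filler so the retriever sees
    # only content-bearing terms; an LLM rewrite would also expand
    # abbreviations and resolve pronouns
    words = [w.strip(string.punctuation) for w in user_query.lower().split()]
    return " ".join(w for w in words if w and w not in FILLER)
```

A query like "Hey, can I please get my money back?" becomes "money back", a form far more likely to match the relevant policy chunks.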
Real-World RAG Applications
| Use Case | Knowledge Source | Benefit |
|---|---|---|
| Internal HR chatbot | Employee handbook, HR policies | Accurate policy answers without leaking training data |
| Customer support bot | Product documentation, FAQs | Grounded answers, fewer hallucinations |
| Legal research assistant | Case law, contracts, regulations | Cites specific documents, reduces fabricated citations |
| Medical information tool | Clinical guidelines, drug databases | Up-to-date, source-referenced answers |
RAG relies heavily on converting text into vectors — a concept called embeddings. The next topic explains what embeddings are and how vector databases store and search them efficiently.
