LangChain Retrieval Augmented Generation

You have learned how to load documents, split them into chunks, convert chunks into embeddings, and store those embeddings in a vector store. Now you combine all of these into one pattern called Retrieval Augmented Generation, commonly abbreviated as RAG. RAG lets an AI answer questions using your specific documents instead of relying only on what it learned during training. It is one of the most widely used LangChain patterns in production applications.

The Open-Book Exam Analogy

A closed-book exam tests only what a student already memorized. An open-book exam lets the student look up relevant pages, read the specific sections that answer each question, and write an informed answer. RAG gives the AI an open book — it retrieves relevant document sections before answering, producing accurate, document-grounded responses instead of guesses from memory.

Closed-Book AI (no RAG):
  Question: "What is Acme Corp's cancellation policy?"
  AI: "I don't have information about Acme Corp's specific policies."
  (or worse: makes something up)

Open-Book AI (with RAG):
  Question: "What is Acme Corp's cancellation policy?"
  → Retriever finds: "Customers may cancel within 14 days for a full refund..."
  → AI uses that text to answer:
  "According to Acme Corp's policy, you can cancel within 14 days for a full refund."

The RAG Pipeline Step by Step

INDEXING PHASE (done once):
Documents → Split → Embed → Store in Vector DB

QUERYING PHASE (done per user question):

User Question
     │
     ▼
[Embed the question]
     │ question vector
     ▼
[Search Vector Store]
     │ top K relevant chunks
     ▼
[Build Prompt]
  system: "Answer using only the provided context."
  context: [chunk1 text] + [chunk2 text] + [chunk3 text]
  human: [user question]
     │
     ▼
[Send to LLM]
     │
     ▼
[AI generates answer grounded in your documents]
     │
     ▼
Return answer to user
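
The querying phase above can be sketched without any external services. This is a toy illustration only: the embed function below is a bag-of-words counter standing in for a real embedding model, and the names embed, cosine, retrieve, and build_prompt are hypothetical helpers, not LangChain APIs.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question and return the k most similar chunks."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the instruction, the retrieved context, and the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [
    "Customers may cancel within 14 days for a full refund.",
    "Our office is open Monday through Friday.",
    "Shipping takes three to five business days.",
]
question = "Can customers cancel for a refund?"
top_chunks = retrieve(question, chunks, k=1)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)  # this string is what would be sent to the LLM
```

In a real pipeline the embedding model and vector store replace embed and retrieve, but the flow of data is the same: question in, ranked chunks out, prompt assembled from both.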

Building a Basic RAG Chain

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

load_dotenv()

# --- INDEXING PHASE ---
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# --- QUERYING PHASE ---
model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
parser = StrOutputParser()

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Answer the question using only the context below. "
     "If the answer is not in the context, say 'I could not find that in the provided documents.'\n\n"
     "Context:\n{context}"),
    ("human", "{question}")
])

def format_docs(docs):
    """Combine retrieved chunks into one text block."""
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

# Ask questions
answer = rag_chain.invoke("What is the remote work policy?")
print(answer)

The key line is the chain definition. The dictionary at the start of the chain receives the user's question and runs both of its entries on it: retriever | format_docs searches the vector store and formats the matching chunks into one text block, while RunnablePassthrough() passes the original question through unchanged. The resulting context and question values fill the prompt template, which produces a structured message list for the model.
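
To make the data flow concrete, here is a plain-Python analogy of that first dictionary step. It is not how LangChain implements it internally; fake_retriever, format_chunks, and fan_out are hypothetical stand-ins used only to show that each entry receives the same input and the results are collected by key.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever: returns canned chunks."""
    return [
        "Remote work is allowed two days per week.",
        "Manager approval is required.",
    ]

def format_chunks(chunks: list[str]) -> str:
    """Join chunks with the same separator style used by format_docs."""
    return "\n\n---\n\n".join(chunks)

def fan_out(steps: dict, value):
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}

steps = {
    "context": lambda q: format_chunks(fake_retriever(q)),  # retrieve, then format
    "question": lambda q: q,                                # pass through unchanged
}

prompt_inputs = fan_out(steps, "What is the remote work policy?")
print(prompt_inputs["question"])
print(prompt_inputs["context"])
```

The resulting dictionary has exactly the two keys the prompt template expects, which is why the template can be piped directly after it.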

Adding Source Citations to Answers

Users trust answers more when they can see exactly which document they came from. Modify the chain to return both the answer and the source documents.

from langchain_core.runnables import RunnableParallel

# Retrieve once, then reuse the same documents for the prompt context and the citations
setup = RunnableParallel(
    source_docs=retriever,
    question=RunnablePassthrough()
) | RunnablePassthrough.assign(
    context=lambda x: format_docs(x["source_docs"])
)

answer_chain = setup | {
    "answer": prompt | model | parser,
    "sources": lambda x: [
        {"source": doc.metadata.get("source", "unknown"),
         "page": doc.metadata.get("page", "N/A")}
        for doc in x["source_docs"]
    ]
}

result = answer_chain.invoke("How many vacation days do employees get?")
print("Answer:", result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f"  - {src['source']}, page {src['page']}")

Conversational RAG: Remembering Previous Questions

Basic RAG answers one question at a time. For chatbot-style interfaces where users ask follow-up questions, you need conversational RAG that remembers context. The challenge: if a user asks "What about part-time employees?" after asking about vacation days, the retriever sees only that short follow-up and has no way to know the topic is still vacation days.

The solution: rephrase the user's follow-up question using the conversation history before sending it to the retriever.

from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.prompts import MessagesPlaceholder

# Step 1: Rephrase follow-up questions using history
rephrase_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the conversation history and a new question, rephrase the question "
     "to be a standalone question that contains all necessary context. "
     "Return only the rephrased question, nothing else."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{question}")
])

rephrase_chain = rephrase_prompt | model | parser

# Step 2: Answer using retrieved context
answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the question using only the context provided. "
     "If not found, say so clearly.\n\nContext:\n{context}"),
    ("human", "{question}")
])

history = []

def conversational_rag(user_question: str) -> str:
    # Rephrase if there is history
    if history:
        standalone_question = rephrase_chain.invoke({
            "history": history,
            "question": user_question
        })
    else:
        standalone_question = user_question

    # Retrieve and answer
    docs = retriever.invoke(standalone_question)
    context = format_docs(docs)

    answer = (answer_prompt | model | parser).invoke({
        "context": context,
        "question": standalone_question
    })

    # Update history
    history.append(HumanMessage(content=user_question))
    history.append(AIMessage(content=answer))

    return answer

print(conversational_rag("How many vacation days do full-time employees get?"))
print(conversational_rag("What about part-time employees?"))  # Follow-up
print(conversational_rag("And sick leave?"))  # Another follow-up

Grounding and Hallucination Prevention

A common problem in RAG systems is the model adding information that is not in the retrieved documents. This is called hallucination. Two prompt engineering techniques help reduce it.

Technique 1: Explicit Grounding Instruction

"Answer ONLY using the context provided. Do not use any outside knowledge.
If the answer is not explicitly stated in the context, respond with:
'I could not find this information in the provided documents.'"

Technique 2: Ask for Quotes

"When answering, include a direct quote from the context that supports your answer.
Format: Answer: [your answer]. Source quote: '[exact text from context]'"
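
Asking for a quote also makes the answer checkable in code. The sketch below assumes the model followed the "Source quote:" format from Technique 2; extract_quote, normalize, and quote_is_grounded are hypothetical helper names, and the check is a simple substring test, so a paraphrased quote will be reported as not grounded.

```python
import re

def extract_quote(answer: str):
    """Return the text after 'Source quote:' without surrounding quotes, or None."""
    match = re.search(r"Source quote:\s*(.+)", answer, flags=re.DOTALL)
    if not match:
        return None
    # Strip whitespace, quote marks, and a trailing period around the quote.
    return match.group(1).strip(" '\".\n")

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())

def quote_is_grounded(answer: str, context: str) -> bool:
    """True only if the quoted text appears verbatim in the retrieved context."""
    quote = extract_quote(answer)
    if not quote:
        return False
    return normalize(quote) in normalize(context)

context = (
    "Customers may cancel within 14 days for a full refund. "
    "Refunds take 5 business days."
)
grounded = (
    "Answer: You can cancel within 14 days. "
    "Source quote: 'Customers may cancel within 14 days for a full refund.'"
)
invented = (
    "Answer: You can cancel within 30 days. "
    "Source quote: 'Customers may cancel within 30 days.'"
)

print(quote_is_grounded(grounded, context))
print(quote_is_grounded(invented, context))
```

An answer that fails this check can be retried or replaced with the fallback message rather than shown to the user.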

Evaluating RAG Quality

A RAG system that feels good during manual testing may perform poorly on a broader set of questions. Systematic evaluation finds weaknesses before users do.

Create a Test Question Set

test_questions = [
    {
        "question": "What is the company's remote work policy?",
        "expected_answer_contains": ["two days", "manager approval"]
    },
    {
        "question": "How many sick days per year?",
        "expected_answer_contains": ["10 days", "calendar year"]
    },
    {
        "question": "What year was the company founded?",
        # Not in the handbook, so expect the fallback wording from the system prompt
        "expected_answer_contains": ["could not find"]
    }
]

passed = 0
for test in test_questions:
    answer = rag_chain.invoke(test["question"]).lower()
    checks = [expected.lower() in answer
              for expected in test["expected_answer_contains"]]
    status = "PASS" if all(checks) else "FAIL"
    if status == "PASS":
        passed += 1
    print(f"{status}: {test['question'][:50]}...")

print(f"\nResult: {passed}/{len(test_questions)} tests passed")
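
A pass/fail line alone does not say why a test failed. One possible refinement, sketched here with a hypothetical evaluate helper and a stub_chain function standing in for rag_chain.invoke so it runs without an API key, is to record which expected phrases were missing.

```python
def evaluate(answer_fn, tests):
    """Run each test question and record which expected phrases are missing."""
    results = []
    for test in tests:
        answer = answer_fn(test["question"]).lower()
        missing = [
            phrase for phrase in test["expected_answer_contains"]
            if phrase.lower() not in answer
        ]
        results.append({
            "question": test["question"],
            "passed": not missing,
            "missing": missing,
        })
    return results

def stub_chain(question: str) -> str:
    """Stand-in for rag_chain.invoke: canned answers, with the prompt's fallback text."""
    canned = {
        "How many sick days per year?": "Employees receive 10 days of sick leave.",
    }
    return canned.get(question, "I could not find that in the provided documents.")

sample_tests = [
    {"question": "How many sick days per year?",
     "expected_answer_contains": ["10 days", "calendar year"]},
    {"question": "What year was the company founded?",
     "expected_answer_contains": ["could not find"]},
]

report = evaluate(stub_chain, sample_tests)
for row in report:
    status = "PASS" if row["passed"] else "FAIL"
    print(f"{status}: {row['question']} missing={row['missing']}")
```

With the real chain, the call would be evaluate(rag_chain.invoke, test_questions).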

Improving Retrieval Quality

When the RAG system gives wrong or incomplete answers, the problem is often retrieval — the wrong chunks are being found. These strategies improve retrieval accuracy.

Increase K

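Retrieve more chunks (k=6 or k=8 instead of k=4) so the relevant chunk has more chances to appear in the retrieved set. The trade-off is a longer prompt, which costs more tokens and can add irrelevant text, so check answer quality against your test questions after changing k.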

Adjust Chunk Size

If answers are often incomplete, the relevant information may be split across chunk boundaries. Increase chunk_size or chunk_overlap to keep related sentences together.
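
The effect of overlap can be seen with a simplified fixed-width character splitter. This is an assumption-laden sketch: chunk_text is a hypothetical helper, and RecursiveCharacterTextSplitter is separator-aware and behaves differently in detail. The point is only that a phrase cut by a boundary can survive intact in an overlapping chunk.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

text = "Employees get 15 vacation days. Unused days roll over to next year."
phrase = "15 vacation days"

no_overlap = chunk_text(text, chunk_size=20, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=20, chunk_overlap=10)

# Without overlap the phrase is cut at a chunk boundary.
print(any(phrase in chunk for chunk in no_overlap))
# With overlap at least one chunk contains the whole phrase.
print(any(phrase in chunk for chunk in with_overlap))
```

More overlap means more chunks and more stored vectors, so it trades index size for a lower chance of splitting a fact in two.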

Use MMR Search

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)

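MMR (Maximal Marginal Relevance) first fetches fetch_k candidates by similarity, then selects k of them by balancing relevance to the question against similarity to the chunks already chosen. This favors chunks that cover different aspects of the topic over five nearly identical ones.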
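
The selection rule behind MMR is short enough to sketch directly. The code below is a minimal illustration on hand-made 3-dimensional vectors, not the vector store's own implementation; the function names and the lambda_mult weighting parameter (0.5 here) are choices made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.5, 0.0]
docs = [
    [1.0, 0.0, 0.0],  # doc 0: most relevant
    [1.0, 0.0, 0.1],  # doc 1: near-duplicate of doc 0
    [0.0, 1.0, 0.0],  # doc 2: less relevant, but a different aspect
]

by_similarity = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
by_mmr = mmr(query, docs, k=2)
print(by_similarity, by_mmr)
```

Plain similarity ranking returns the two near-duplicates, while the MMR selection swaps the duplicate for the document that adds new information.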

Add Metadata Filters

# When the user asks about pricing, search only pricing documents.
# Assumes each chunk was indexed with a "document_type" metadata field.
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"document_type": "pricing"}
    }
)
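
Conceptually, a metadata filter narrows the candidate set before (or alongside) similarity ranking. The sketch below shows the idea on plain dictionaries; matches, word_overlap, and filtered_search are hypothetical helpers, and word overlap is a crude stand-in for vector similarity.

```python
def matches(metadata: dict, required: dict) -> bool:
    """True if every required key/value pair is present in the metadata."""
    return all(metadata.get(key) == value for key, value in required.items())

def word_overlap(question: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))

def filtered_search(question: str, docs: list, required: dict, k: int = 4) -> list:
    """Keep only documents whose metadata matches, then rank the rest."""
    candidates = [doc for doc in docs if matches(doc["metadata"], required)]
    ranked = sorted(
        candidates,
        key=lambda doc: word_overlap(question, doc["text"]),
        reverse=True,
    )
    return ranked[:k]

docs = [
    {"text": "The pro plan costs 20 dollars per month",
     "metadata": {"document_type": "pricing"}},
    {"text": "The pro plan includes priority support",
     "metadata": {"document_type": "support"}},
    {"text": "Basic plan has no charge",
     "metadata": {"document_type": "pricing"}},
]

hits = filtered_search("how much is the pro plan", docs, {"document_type": "pricing"}, k=2)
for hit in hits:
    print(hit["metadata"]["document_type"], "|", hit["text"])
```

The support document is excluded even though it shares words with the question, which is the behavior a filter is meant to guarantee.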

Multi-Document RAG

Many real applications have multiple document categories: product documentation, support articles, legal policies, and FAQs. A single flat vector store works, but structured organization improves accuracy.

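# Build separate retrievers per category.
# product_chunks, policy_chunks, and faq_chunks are assumed to come from the
# same load-and-split steps used to build chunks earlier.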
product_store = FAISS.from_documents(product_chunks, embeddings)
policy_store = FAISS.from_documents(policy_chunks, embeddings)
faq_store = FAISS.from_documents(faq_chunks, embeddings)

# Route question to the right retriever
def smart_retriever(question: str):
    question_lower = question.lower()

    if any(word in question_lower for word in ["price", "cost", "buy", "purchase"]):
        return product_store.as_retriever(search_kwargs={"k": 4})
    elif any(word in question_lower for word in ["policy", "rule", "allowed", "permit"]):
        return policy_store.as_retriever(search_kwargs={"k": 4})
    else:
        return faq_store.as_retriever(search_kwargs={"k": 4})

def routed_rag(question: str) -> str:
    retriever = smart_retriever(question)
    docs = retriever.invoke(question)
    context = format_docs(docs)
    return (answer_prompt | model | parser).invoke({
        "context": context,
        "question": question
    })
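
Substring checks like the ones in smart_retriever can misfire: "cost" also matches inside "Costa Rica". One possible variation, sketched here with a hypothetical ROUTES table and route function, matches whole words and returns a category name that can then be mapped to the corresponding store.

```python
import re

# Hypothetical keyword table; plural and inflected forms are listed explicitly
# because whole-word matching does not catch them automatically.
ROUTES = {
    "product": {"price", "prices", "cost", "costs", "buy", "purchase"},
    "policy": {"policy", "policies", "rule", "rules", "allowed", "permit", "permitted"},
}

def route(question: str, default: str = "faq") -> str:
    """Return the category with the most whole-word keyword hits, or the default."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    hits = {name: len(words & keywords) for name, keywords in ROUTES.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default

print(route("How much does the premium plan cost?"))
print(route("Is remote work allowed on Fridays?"))
print(route("Do you ship to Costa Rica?"))
```

Keyword routing stays brittle either way; it is a cheap first step, and the keyword sets need to be maintained as the questions users ask change.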

RAG vs Fine-Tuning: When to Use Which

Scenario                         RAG         Fine-Tuning
──────────────────────────────────────────────────────────────
Your data changes frequently     Great fit   Poor fit
You need citations               Great fit   Poor fit
You have limited budget          Great fit   Expensive
You need consistent style/tone   OK          Better fit
Data fits in documents/files     Great fit   Possible
Latency is critical              Add cache   Faster
Data is private/confidential     Great fit   Needs care

For most business use cases, such as answering questions about company documents, support articles, or product manuals, RAG is usually the better starting point. It is typically faster to build, cheaper to maintain, and easier to update when documents change.

Complete RAG Application with Caching

import hashlib

# Cache answers for identical questions (avoid repeated API calls)
answer_cache = {}

def cached_rag(question: str) -> str:
    question_hash = hashlib.md5(question.encode()).hexdigest()

    if question_hash in answer_cache:
        print("[Cache hit]")
        return answer_cache[question_hash]

    answer = rag_chain.invoke(question)
    answer_cache[question_hash] = answer
    return answer

# Identical question uses cache, different question calls the API
print(cached_rag("What is the refund policy?"))  # API call
print(cached_rag("What is the refund policy?"))  # Cache hit — instant, free
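
The cache above only hits on byte-identical questions. A small variation, sketched below, normalizes case and whitespace before hashing so trivially different spellings share one entry. The names cache_key, cached_answer, and fake_rag are hypothetical, and fake_rag stands in for rag_chain.invoke so the example runs offline.

```python
import hashlib

def cache_key(question: str) -> str:
    """Hash a normalized form of the question (lowercase, single spaces)."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

calls = []  # records every time the (fake) chain is actually invoked

def fake_rag(question: str) -> str:
    """Stand-in for rag_chain.invoke so the example needs no API key."""
    calls.append(question)
    return f"answer to: {question}"

cache = {}

def cached_answer(question: str, answer_fn=fake_rag) -> str:
    """Return the cached answer if the normalized question was seen before."""
    key = cache_key(question)
    if key not in cache:
        cache[key] = answer_fn(question)
    return cache[key]

first = cached_answer("What is the refund policy?")
second = cached_answer("  what is the   REFUND policy? ")
print(first == second, len(calls))
```

Normalization this simple will not catch rephrasings such as "refund rules?", and cached answers go stale when the underlying documents change, so the cache should be cleared whenever the index is rebuilt.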

Summary

RAG combines document retrieval with language model generation to produce answers grounded in your specific documents. The indexing phase loads, splits, embeds, and stores documents once. The querying phase embeds each user question, retrieves relevant chunks, builds a context-rich prompt, and generates an answer. Conversational RAG handles follow-up questions by rephrasing them with conversation history before retrieval. Source citations build user trust. Retrieval quality improves by tuning chunk size, increasing k, and using MMR.
