LangChain Retrieval Augmented Generation
You have learned how to load documents, split them into chunks, convert chunks into embeddings, and store those embeddings in a vector store. Now you combine all of these into one pattern called Retrieval Augmented Generation, commonly abbreviated as RAG. RAG lets an AI answer questions using your specific documents instead of relying only on what it learned during training. It is one of the most widely used LangChain patterns in real products.
The Open-Book Exam Analogy
A closed-book exam tests only what a student already memorized. An open-book exam lets the student look up relevant pages, read the specific sections that answer each question, and write an informed answer. RAG gives the AI an open book — it retrieves relevant document sections before answering, producing accurate, document-grounded responses instead of guesses from memory.
Closed-Book AI (no RAG):
    Question: "What is Acme Corp's cancellation policy?"
    AI: "I don't have information about Acme Corp's specific policies."
    (or worse: makes something up)
Open-Book AI (with RAG):
    Question: "What is Acme Corp's cancellation policy?"
    → Retriever finds: "Customers may cancel within 14 days for a full refund..."
    → AI uses that text to answer: "According to Acme Corp's policy, you can cancel within 14 days for a full refund."
The RAG Pipeline Step by Step
INDEXING PHASE (done once):
Documents → Split → Embed → Store in Vector DB
QUERYING PHASE (done per user question):
User Question
│
▼
[Embed the question]
│ question vector
▼
[Search Vector Store]
│ top K relevant chunks
▼
[Build Prompt]
system: "Answer using only the provided context."
context: [chunk1 text] + [chunk2 text] + [chunk3 text]
human: [user question]
│
▼
[Send to LLM]
│
▼
[AI generates answer grounded in your documents]
│
▼
Return answer to user
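The querying phase above can be sketched in plain Python with no API calls. This is an illustration only: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), and the chunk texts are made up.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question, score every chunk, and return the k best matches."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n\n".join(retrieved)
    return (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up chunks standing in for an indexed document set
chunks = [
    "Customers may cancel within 14 days for a full refund.",
    "Employees receive a laptop on their first day.",
    "Refund requests are processed within 5 business days.",
]
question = "Can customers cancel and get a refund within 14 days?"
retrieved = top_k(question, chunks, k=2)
print(build_prompt(question, retrieved))
```

A real system replaces `toy_embed` with a learned embedding model and the linear scan with a vector index, but the order of steps is the same.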
Building a Basic RAG Chain
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
load_dotenv()
# --- INDEXING PHASE ---
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
# --- QUERYING PHASE ---
model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
parser = StrOutputParser()
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Answer the question using only the context below. "
     "If the answer is not in the context, say 'I could not find that in the provided documents.'\n\n"
     "Context:\n{context}"),
    ("human", "{question}")
])
def format_docs(docs):
    """Combine retrieved chunks into one text block."""
    return "\n\n---\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
# Ask questions
answer = rag_chain.invoke("What is the remote work policy?")
print(answer)
The key line is the chain definition. retriever | format_docs takes the user's question, searches the vector store, and formats the matching chunks into a text block. RunnablePassthrough() passes the original question unchanged. Both feed into the prompt template, which injects context and question into a structured message list for the model.
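To make that data flow concrete, the sketch below mimics the first step of the chain with plain functions. It is an emulation for illustration, not LangChain's implementation; `fake_retriever`, `format_chunks`, and `first_step` are hypothetical names, and the returned chunk texts are made up.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns fixed chunk texts regardless of input."""
    return [
        "Remote work is allowed two days per week.",
        "Manager approval is required.",
    ]

def format_chunks(chunks: list[str]) -> str:
    """Same idea as format_docs, but over plain strings."""
    return "\n\n---\n\n".join(chunks)

def first_step(question: str) -> dict:
    """Every key receives the same input; each value is computed independently."""
    return {
        # corresponds to: retriever | format_docs
        "context": format_chunks(fake_retriever(question)),
        # corresponds to: RunnablePassthrough()
        "question": question,
    }

prompt_inputs = first_step("What is the remote work policy?")
print(prompt_inputs["question"])
print(prompt_inputs["context"])
```

The resulting dictionary has exactly the two keys the prompt template expects, `context` and `question`.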
Adding Source Citations to Answers
Users trust answers more when they can see exactly which document they came from. Modify the chain to return both the answer and the source documents.
from langchain_core.runnables import RunnableParallel
# Retrieve once, then build the formatted context from those same documents,
# so the cited sources are exactly the chunks the model saw
setup = RunnableParallel(
    source_docs=retriever,
    question=RunnablePassthrough(),
) | RunnablePassthrough.assign(
    context=lambda x: format_docs(x["source_docs"])
)
answer_chain = setup | {
    "answer": prompt | model | parser,
    "sources": lambda x: [
        {"source": doc.metadata.get("source", "unknown"),
         "page": doc.metadata.get("page", "N/A")}
        for doc in x["source_docs"]
    ],
}
result = answer_chain.invoke("How many vacation days do employees get?")
print("Answer:", result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f" - {src['source']}, page {src['page']}")
Conversational RAG: Remembering Previous Questions
Basic RAG answers one question at a time. For chatbot-style interfaces where users ask follow-up questions, you need conversational RAG that remembers context. The challenge: if a user asks "What about part-time employees?" after asking about vacation days, the retriever sees only the follow-up text, which never mentions vacation, and is likely to return irrelevant chunks.
The solution: rephrase the user's follow-up question using the conversation history before sending it to the retriever.
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.prompts import MessagesPlaceholder
# Step 1: Rephrase follow-up questions using history
rephrase_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the conversation history and a new question, rephrase the question "
     "to be a standalone question that contains all necessary context. "
     "Return only the rephrased question, nothing else."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{question}")
])
rephrase_chain = rephrase_prompt | model | parser
# Step 2: Answer using retrieved context
answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the question using only the context provided. "
     "If not found, say so clearly.\n\nContext:\n{context}"),
    ("human", "{question}")
])
history = []
def conversational_rag(user_question: str) -> str:
    # Rephrase if there is history
    if history:
        standalone_question = rephrase_chain.invoke({
            "history": history,
            "question": user_question
        })
    else:
        standalone_question = user_question
    # Retrieve and answer
    docs = retriever.invoke(standalone_question)
    context = format_docs(docs)
    answer = (answer_prompt | model | parser).invoke({
        "context": context,
        "question": standalone_question
    })
    # Update history
    history.append(HumanMessage(content=user_question))
    history.append(AIMessage(content=answer))
    return answer
print(conversational_rag("How many vacation days do full-time employees get?"))
print(conversational_rag("What about part-time employees?")) # Follow-up
print(conversational_rag("And sick leave?")) # Another follow-up
Grounding and Hallucination Prevention
A common problem in RAG systems is the model adding information that is not in the retrieved documents. This is called hallucination. Two prompt engineering techniques reduce it, though neither eliminates it entirely.
Technique 1: Explicit Grounding Instruction
"Answer ONLY using the context provided. Do not use any outside knowledge. If the answer is not explicitly stated in the context, respond with: 'I could not find this information in the provided documents.'"
Technique 2: Ask for Quotes
"When answering, include a direct quote from the context that supports your answer. Format: Answer: [your answer]. Source quote: '[exact text from context]'"
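A quote only helps if it really appears in the retrieved text, so it can be checked programmatically. The sketch below assumes the model follows the Technique 2 format exactly; `extract_quote` and `quote_is_grounded` are hypothetical helper names, and matching after lowercasing and collapsing whitespace is just one possible rule.

```python
import re
from typing import Optional

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return " ".join(text.lower().split())

def extract_quote(answer: str) -> Optional[str]:
    """Return the text after 'Source quote:' without surrounding quote marks, or None."""
    match = re.search(r"Source quote:\s*(.+)", answer, flags=re.DOTALL)
    if not match:
        return None
    return match.group(1).strip().strip("'\"")

def quote_is_grounded(answer: str, context: str) -> bool:
    """True only if the quoted text occurs (after normalization) in the context."""
    quote = extract_quote(answer)
    return bool(quote) and normalize(quote) in normalize(context)

# Made-up context and answers for illustration
context = "Customers may cancel within 14 days for a full refund."
good = ("Answer: You can cancel within 14 days. "
        "Source quote: 'cancel within 14 days for a full refund'")
bad = "Answer: You can cancel within 30 days. Source quote: 'cancel within 30 days'"

print(quote_is_grounded(good, context))
print(quote_is_grounded(bad, context))
```

An answer that fails this check can be rejected or replaced with the "could not find" fallback instead of being shown to the user.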
Evaluating RAG Quality
A RAG system that feels good during manual testing may perform poorly on a broader set of questions. Systematic evaluation finds weaknesses before users do.
Create a Test Question Set
test_questions = [
    {
        "question": "What is the company's remote work policy?",
        "expected_answer_contains": ["two days", "manager approval"]
    },
    {
        "question": "How many sick days per year?",
        "expected_answer_contains": ["10 days", "calendar year"]
    },
    {
        "question": "What year was the company founded?",
        # Not in the handbook; must match the fallback wording in the system prompt
        "expected_answer_contains": ["could not find"]
    }
]
passed = 0
for test in test_questions:
    answer = rag_chain.invoke(test["question"]).lower()
    # Every expected phrase must appear in the lowercased answer
    checks = [expected.lower() in answer
              for expected in test["expected_answer_contains"]]
    status = "PASS" if all(checks) else "FAIL"
    if status == "PASS":
        passed += 1
    print(f"{status}: {test['question'][:50]}...")
print(f"\nResult: {passed}/{len(test_questions)} tests passed")
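The test above scores only the final answer, so a failure does not say whether retrieval or generation is at fault. One way to separate the two is to check whether the expected text appears in any retrieved chunk. A minimal sketch, assuming a `retrieve` callable that returns chunk texts (with the chain above, that could be something like `lambda q: [d.page_content for d in retriever.invoke(q)]`); `retrieval_hit_rate`, the stub retriever, and the test cases are hypothetical.

```python
from typing import Callable

def retrieval_hit_rate(cases: list[dict], retrieve: Callable[[str], list[str]]) -> float:
    """Fraction of cases where some retrieved chunk contains the expected text."""
    hits = 0
    for case in cases:
        chunks = retrieve(case["question"])
        expected = case["expected_in_chunk"].lower()
        if any(expected in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases) if cases else 0.0

def stub_retrieve(question: str) -> list[str]:
    """Stub with made-up data, standing in for a real vector store lookup."""
    if "sick" in question.lower():
        return ["Employees receive 10 days of sick leave per calendar year."]
    return ["The office is closed on public holidays."]

cases = [
    {"question": "How many sick days per year?", "expected_in_chunk": "10 days"},
    {"question": "What is the remote work policy?", "expected_in_chunk": "two days"},
]
print(f"Retrieval hit rate: {retrieval_hit_rate(cases, stub_retrieve):.2f}")
```

If the hit rate is low, tuning retrieval comes first; if it is high but answers still fail, the prompt or model is the more likely cause.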
Improving Retrieval Quality
When the RAG system gives wrong or incomplete answers, the problem is often retrieval — the wrong chunks are being found. These strategies improve retrieval accuracy.
Increase K
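Retrieve more chunks (k=6 or k=8 instead of k=4). The relevant chunk has more chances to appear in the retrieved set. The trade-off is a longer prompt, which costs more tokens and can bring in irrelevant text, so verify with your test set that answer quality actually improves.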
Adjust Chunk Size
If answers are often incomplete, the relevant information may be split across chunk boundaries. Increase chunk_size or chunk_overlap to keep related sentences together.
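The effect of overlap can be seen with a simple fixed-width chunker. This is a simplified illustration, not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators such as paragraph and sentence breaks); `chunk_text` is a hypothetical helper and the sample text is made up.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Full-time staff get 20 vacation days. Part-time staff get 10."
fact = "20 vacation days"

no_overlap = chunk_text(text, chunk_size=30, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=30, chunk_overlap=15)

# Without overlap the fact is cut in two; with overlap one chunk holds it whole
print(any(fact in chunk for chunk in no_overlap))    # False
print(any(fact in chunk for chunk in with_overlap))  # True
```

Overlap duplicates text across chunks, so it increases index size and embedding cost in exchange for fewer split facts.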
Use MMR Search
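retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)
MMR (Maximal Marginal Relevance) balances relevance against diversity. It first fetches fetch_k candidates by similarity, then picks k of them so that the retrieved chunks cover different aspects of the topic rather than being five nearly identical chunks.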
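The selection step can be sketched in a few lines. This is a textbook-style version of MMR on toy 2-D vectors, not LangChain's internal code; `mmr_select` is a hypothetical function, and its `lambda_mult` parameter is named after the option LangChain accepts in `search_kwargs` (1.0 means pure relevance, lower values favor diversity).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec: list[float], doc_vecs: list[list[float]],
               k: int, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Penalize similarity to anything already picked (0 when nothing is picked yet)
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [0.8, 0.6]
docs = [
    [0.99, 0.14],  # 0: most relevant to the query
    [1.0, 0.0],    # 1: near-duplicate of document 0
    [0.0, 1.0],    # 2: less relevant, but points in a different direction
]
print(mmr_select(query, docs, k=2))                   # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With the default balance the near-duplicate is skipped in favor of the chunk that adds new information; with `lambda_mult=1.0` the result is an ordinary similarity ranking.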
Add Metadata Filters
# When the user asks about pricing, only search pricing documents.
# Assumes each chunk was stored with a "document_type" metadata field.
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"document_type": "pricing"}
    }
)
Multi-Document RAG
Many real applications have multiple document categories: product documentation, support articles, legal policies, and FAQs. A single flat vector store works, but structured organization improves accuracy.
# Build separate retrievers per category
# (product_chunks, policy_chunks, faq_chunks: chunk lists prepared per category)
product_store = FAISS.from_documents(product_chunks, embeddings)
policy_store = FAISS.from_documents(policy_chunks, embeddings)
faq_store = FAISS.from_documents(faq_chunks, embeddings)
# Route question to the right retriever
def smart_retriever(question: str):
    question_lower = question.lower()
    if any(word in question_lower for word in ["price", "cost", "buy", "purchase"]):
        return product_store.as_retriever(search_kwargs={"k": 4})
    elif any(word in question_lower for word in ["policy", "rule", "allowed", "permit"]):
        return policy_store.as_retriever(search_kwargs={"k": 4})
    else:
        return faq_store.as_retriever(search_kwargs={"k": 4})
def routed_rag(question: str) -> str:
    retriever = smart_retriever(question)
    docs = retriever.invoke(question)
    context = format_docs(docs)
    return (answer_prompt | model | parser).invoke({
        "context": context,
        "question": question
    })
RAG vs Fine-Tuning: When to Use Which
Scenario                          RAG          Fine-Tuning
──────────────────────────────────────────────────────────────
Your data changes frequently      Great fit    Poor fit
You need citations                Great fit    Poor fit
You have limited budget           Great fit    Expensive
You need consistent style/tone    OK           Better fit
Data fits in documents/files      Great fit    Possible
Latency is critical               Add cache    Faster
Data is private/confidential      Great fit    Needs care
For most business use cases — answering questions about company documents, support articles, product manuals — RAG is the right choice. It is faster to build, cheaper to maintain, and easier to update when documents change.
Complete RAG Application with Caching
import hashlib
# Cache answers for identical questions (avoid repeated API calls)
answer_cache = {}
def cached_rag(question: str) -> str:
    # MD5 serves only as a compact cache key here, not as a security measure
    question_hash = hashlib.md5(question.encode()).hexdigest()
    if question_hash in answer_cache:
        print("[Cache hit]")
        return answer_cache[question_hash]
    answer = rag_chain.invoke(question)
    answer_cache[question_hash] = answer
    return answer
# Identical question uses cache, different question calls the API
print(cached_rag("What is the refund policy?"))  # API call
print(cached_rag("What is the refund policy?"))  # Cache hit: no API call
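An exact-match cache misses when the same question differs only in case or spacing. One possible refinement is to normalize the question before hashing. The sketch below shows this with a stub in place of the real chain; `cache_key`, `stub_answer`, and `cached_answer` are hypothetical names, and the normalization rule is an assumption rather than a recommendation for every application.

```python
import hashlib

def cache_key(question: str) -> str:
    """Hash a normalized form so case and extra whitespace don't cause cache misses."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

calls: list[str] = []

def stub_answer(question: str) -> str:
    """Hypothetical stand-in for rag_chain.invoke; records each call it receives."""
    calls.append(question)
    return f"answer to: {question}"

cache: dict[str, str] = {}

def cached_answer(question: str) -> str:
    """Return the cached answer if the normalized question was seen before."""
    key = cache_key(question)
    if key not in cache:
        cache[key] = stub_answer(question)
    return cache[key]

first = cached_answer("What is the refund policy?")
second = cached_answer("  what is the REFUND policy? ")
print(len(calls))  # 1: the second call was served from the cache
```

Whichever key is used, cached answers go stale when the underlying documents change, so the cache should be cleared whenever the index is rebuilt.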
Summary
RAG combines document retrieval with language model generation to produce answers grounded in your specific documents. The indexing phase loads, splits, embeds, and stores documents once. The querying phase embeds each user question, retrieves relevant chunks, builds a context-rich prompt, and generates an answer. Conversational RAG handles follow-up questions by rephrasing them with conversation history before retrieval. Source citations build user trust. Retrieval quality improves by tuning chunk size, increasing k, and using MMR.
