LangChain Document Loaders Feeding
AI models know a lot about the world from their training data, but they know nothing about your specific files, your company's documents, your product manuals, or your personal notes. Document Loaders bridge this gap. They read files from many different sources and convert the content into a format LangChain can process and feed to the AI model. This topic covers every major document source, how loading works under the hood, and best practices for handling real-world data.
The Librarian Analogy
A librarian retrieves specific books from shelves, opens them, reads the relevant sections, and brings you the information you need. They handle the physical task of finding and reading so you do not have to. LangChain Document Loaders do the same thing for digital files — they find the file, open it, extract the text content, and hand it to you as a standardized Python object ready for further processing.
Document Loader Workflow:
File on disk (PDF, DOCX, CSV, etc.)
           │
           ▼
┌─────────────────────┐
│   Document Loader   │ ← Opens file, extracts text
│  (source-specific)  │
└──────────┬──────────┘
           │
           ▼
List of Document objects:
┌────────────────────────────────────────────┐
│ Document(                                  │
│   page_content="The text from the file",   │
│   metadata={                               │
│     "source": "report.pdf",                │
│     "page": 1,                             │
│     "author": "Jane Smith"                 │
│   }                                        │
│ )                                          │
└────────────────────────────────────────────┘
The Document object has two parts: page_content (the actual text) and metadata (information about the source). Metadata tells you where each piece of text came from, which is essential when you want to cite sources in AI responses.
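To make the two-part structure concrete, here is a minimal sketch built only on the standard library. SimpleDocument and load_text_file are hypothetical names used for illustration; LangChain's own Document class exposes the same two fields, but this is not its implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields of a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one text file and wrap it in a single SimpleDocument."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo: write a small file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "notes.txt")
    Path(demo_path).write_text("Budget approved for Q3.", encoding="utf-8")
    docs = load_text_file(demo_path)

print(docs[0].page_content)        # the text itself
print(docs[0].metadata["source"])  # where the text came from
```

Every loader in this topic follows the same shape: read a source, then return a list of objects carrying text plus a metadata dictionary.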
Installing Required Packages
Different file types require different Python packages. Install what you need:
# The loader classes used throughout this topic
pip install langchain-community
# For PDF files
pip install pypdf
# For Word documents (Docx2txtLoader relies on docx2txt)
pip install docx2txt
# For web pages
pip install requests beautifulsoup4
# For Excel and CSV handling with pandas
pip install pandas openpyxl
Loading Text Files
Plain text files (.txt) are the simplest format. TextLoader reads the file and returns it as a single Document.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("meeting_notes.txt", encoding="utf-8")
documents = loader.load()
print(f"Loaded {len(documents)} document(s)")
print(documents[0].page_content[:200]) # First 200 characters
print(documents[0].metadata)
# {'source': 'meeting_notes.txt'}
Always specify the encoding (usually "utf-8") to avoid errors with special characters like accented letters or symbols.
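The short sketch below shows the two ways a wrong encoding can go wrong, using only Python's built-in string methods. The sample text is an arbitrary assumption chosen because it contains accented letters.

```python
text = "Café résumé"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Case 1: Latin-1 bytes read as UTF-8 raise an error.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Case 2: UTF-8 bytes read as Latin-1 succeed, but the text comes out garbled.
garbled = utf8_bytes.decode("latin-1")

print(decode_failed)    # True
print(garbled == text)  # False
```

The second case is the more dangerous one in practice, because nothing fails and the garbled text flows silently into later processing.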
Loading PDF Files
PDFs are among the most common formats for business documents, reports, and books. PyPDFLoader reads a PDF and creates one Document per page, making it easy to reference exact page numbers later. The page value in the metadata is zero-based, so the first page is page 0.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("annual_report.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# Access specific pages
first_page = documents[0]
print(first_page.page_content[:300])
print(first_page.metadata)
# {'source': 'annual_report.pdf', 'page': 0}
# Find the executive summary (usually in first few pages)
for doc in documents[:5]:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:100]}\n")
Note that PDF text extraction is not perfect for all PDFs. Scanned PDFs (images of text) require OCR (optical character recognition) tools to extract text. PDFs created from digital sources (typed documents, spreadsheets exported to PDF) extract cleanly.
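One way to spot pages that probably need OCR is to look for pages where almost no text came out. The sketch below assumes a 20-character threshold, which is an arbitrary choice, and uses a hypothetical PageDoc stand-in so it runs without a real PDF; the same function would work on any objects with page_content and metadata attributes.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded page (same two fields as a Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_sparse_pages(pages, min_chars: int = 20) -> list[int]:
    """Return page numbers whose extracted text is shorter than min_chars."""
    return [
        page.metadata["page"]
        for page in pages
        if len(page.page_content.strip()) < min_chars
    ]


pages = [
    PageDoc("Annual Report 2024: revenue, costs, and outlook.", {"page": 0}),
    PageDoc("   \n", {"page": 1}),  # image-only page: no text was extracted
    PageDoc("Section 2 covers regional performance in detail.", {"page": 2}),
]

print(find_sparse_pages(pages))  # [1]
```

Pages flagged this way are candidates for an OCR pass; a short page is not proof of a scan, since title pages and dividers are also sparse.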
Loading Word Documents
from langchain_community.document_loaders import Docx2txtLoader
loader = Docx2txtLoader("employee_handbook.docx")
documents = loader.load()
print(documents[0].page_content[:500])
The entire Word document loads as a single Document. Tables and formatted lists in the document convert to plain text — formatting like bold or italic is lost, but the text content is preserved.
Loading Web Pages
WebBaseLoader fetches a URL and extracts the text content from the HTML.
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Photosynthesis")
documents = loader.load()
print(documents[0].page_content[:500])
print(documents[0].metadata)
# {'source': 'https://en.wikipedia.org/wiki/Photosynthesis', 'title': 'Photosynthesis'}
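WebBaseLoader uses BeautifulSoup to parse the HTML and, by default, returns the text of the whole page, which can include navigation menus, headers, and footers. To keep only the main content area, pass BeautifulSoup parsing options through the bs_kwargs parameter (for example, a SoupStrainer that selects the article container). For pages that use JavaScript to load content, this loader may not capture everything; those require a browser-based loader.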
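The idea of keeping page text while discarding navigation, footers, and scripts can be sketched with the standard library alone. This is an illustration of the filtering concept, not how WebBaseLoader is implemented; MainTextExtractor, SKIP_TAGS, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text, skipping anything nested inside the tags in SKIP_TAGS."""

    SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <main><h1>Photosynthesis</h1><p>Plants convert light into energy.</p></main>
  <footer>Copyright notice</footer>
  <script>loadComments();</script>
</body></html>
"""

parser = MainTextExtractor()
parser.feed(html)
main_text = "\n".join(parser.chunks)
print(main_text)
```

Note that whatever loadComments() would add in a browser never appears in the raw HTML, which is why a parser working on the fetched page cannot see JavaScript-rendered content.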
Loading Multiple URLs at Once
urls = [
    "https://docs.python.org/3/tutorial/",
    "https://docs.python.org/3/library/functions.html",
    "https://docs.python.org/3/reference/lexical_analysis.html"
]
loader = WebBaseLoader(urls)
documents = loader.load()
print(f"Loaded {len(documents)} web pages")
Loading CSV Files
CSVLoader reads tabular data and creates one Document per row. Each row becomes a small text block with column names and values clearly labeled.
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("products.csv")
documents = loader.load()
print(f"Loaded {len(documents)} rows")
print(documents[0].page_content)
# "product_id: 101
# name: Laptop Pro 15
# price: 1299.99
# category: Electronics
# stock: 45"
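This format works well when you want the AI to answer questions about individual records, such as "How much does the Laptop Pro 15 cost?" Each row is a self-contained piece of context the AI can reason about. Questions that span many rows, like "Which products cost under $500?" or "How many items are in the Electronics category?", only work if every relevant row reaches the model.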
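The row-to-document conversion can be sketched with Python's csv module. This is a simplified illustration rather than CSVLoader's actual code; rows_to_documents and the sample data are hypothetical, and plain dictionaries stand in for Document objects.

```python
import csv
import io

csv_text = """product_id,name,price,category,stock
101,Laptop Pro 15,1299.99,Electronics,45
102,Desk Lamp,39.50,Home,120
"""


def rows_to_documents(csv_file, source: str) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


docs = rows_to_documents(io.StringIO(csv_text), source="products.csv")
print(docs[0]["page_content"])
```

Repeating the column names in every row costs some extra text, but it means each document still makes sense when retrieved on its own.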
Loading Multiple Files from a Directory
DirectoryLoader loads all files matching a pattern from a folder. This is invaluable when you have dozens or hundreds of documents to process.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Load all .txt files from a folder
loader = DirectoryLoader(
    path="./documents/",
    glob="**/*.txt",        # Match all .txt files, including subfolders
    loader_cls=TextLoader,  # Which loader to use for each file
    show_progress=True      # Show a progress bar (requires the tqdm package)
)
documents = loader.load()
print(f"Loaded {len(documents)} documents from directory")
# See which files were loaded
sources = [doc.metadata["source"] for doc in documents]
for source in sources:
    print(source)
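The difference between a flat pattern and a recursive one is easy to check with pathlib, assuming the glob is interpreted the way pathlib interprets it. The folder layout below is a throwaway example created in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("nested", encoding="utf-8")
    (root / "c.pdf").write_text("not a text file", encoding="utf-8")

    # "*.txt" matches only files directly inside the folder.
    top_only = sorted(p.name for p in root.glob("*.txt"))
    # "**/*.txt" also descends into subfolders.
    recursive = sorted(p.name for p in root.glob("**/*.txt"))

print(top_only)   # ['a.txt']
print(recursive)  # ['a.txt', 'b.txt']
```

Running a pattern through pathlib first is a cheap way to confirm which files a batch load will pick up before spending time parsing them.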
Loading Mixed File Types
from langchain_community.document_loaders import (
    DirectoryLoader, PyPDFLoader, TextLoader
)
# Load PDFs from one folder
pdf_loader = DirectoryLoader("./reports/", glob="*.pdf",
                             loader_cls=PyPDFLoader)
# Load text files from another
text_loader = DirectoryLoader("./notes/", glob="*.txt",
                              loader_cls=TextLoader)
# Combine all documents
all_documents = pdf_loader.load() + text_loader.load()
print(f"Total documents loaded: {len(all_documents)}")
Loading JSON Files
from langchain_community.document_loaders import JSONLoader
# Load a JSON file, extracting content from a specific field
loader = JSONLoader(
    file_path="customer_reviews.json",
    jq_schema=".[] | .review_text",  # Extract the review_text field from each item
    text_content=False
)
documents = loader.load()
print(f"Loaded {len(documents)} reviews")
The jq_schema parameter uses jq syntax (a JSON query language) to navigate the JSON structure and extract the text you want. Install jq with pip if needed: pip install jq.
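For readers new to jq, the expression ".[] | .review_text" can be read as a plain Python list comprehension. The JSON below is hypothetical sample data, assuming the file holds a top-level array of review objects.

```python
import json

raw = """
[
  {"review_id": 1, "rating": 5, "review_text": "Fast shipping, works great."},
  {"review_id": 2, "rating": 2, "review_text": "Stopped working after a week."}
]
"""

data = json.loads(raw)

# Plain-Python reading of the jq expression ".[] | .review_text":
#   .[]           -> iterate over every item in the top-level array
#   .review_text  -> take that field from each item
review_texts = [item["review_text"] for item in data]

for text in review_texts:
    print(text)
```

Each extracted string would become the page_content of one Document, so two reviews yield two documents.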
Loading YouTube Transcripts
pip install youtube-transcript-api pytube
from langchain_community.document_loaders import YoutubeLoader
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=example_video_id",
    add_video_info=True  # Fetching video info also requires pytube
)
documents = loader.load()
print(documents[0].page_content[:500]) # Transcript text
print(documents[0].metadata) # Title, channel, duration
Metadata: The Often-Overlooked Value
Metadata is as important as the content itself. When your AI answers a question by searching through hundreds of documents, you want to know exactly which document the answer came from. Metadata enables this citation feature.
# Add custom metadata during loading
loader = TextLoader("q3_financial_report.txt")
documents = loader.load()
# Add extra metadata to each document
for doc in documents:
    doc.metadata["department"] = "Finance"
    doc.metadata["quarter"] = "Q3 2024"
    doc.metadata["confidential"] = True
# Later, when this document is retrieved:
# doc.metadata["source"] = "q3_financial_report.txt"
# doc.metadata["department"] = "Finance"
# doc.metadata["quarter"] = "Q3 2024"
Your AI application can then include this metadata in its responses: "According to the Q3 2024 Finance report, revenue grew by 12%." This makes AI responses verifiable and trustworthy.
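Turning metadata into a readable citation can be as simple as the sketch below. format_citation is a hypothetical helper, and the choice of which fields to show (source, page, quarter, department) is an assumption based on the examples in this topic.

```python
def format_citation(metadata: dict) -> str:
    """Build a short source label from whatever metadata fields are present."""
    parts = [metadata.get("source", "unknown source")]
    if "page" in metadata:
        parts.append(f"page {metadata['page']}")
    if "quarter" in metadata:
        parts.append(metadata["quarter"])
    if "department" in metadata:
        parts.append(metadata["department"])
    return "[" + ", ".join(parts) + "]"


metadata = {
    "source": "q3_financial_report.txt",
    "department": "Finance",
    "quarter": "Q3 2024",
    "confidential": True,
}

print(format_citation(metadata))  # [q3_financial_report.txt, Q3 2024, Finance]
```

Because the helper only reads fields that exist, it works equally well for a PDF page with a page number and for a text file without one.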
Lazy Loading for Large Document Sets
Loading thousands of large documents into memory at once can crash your application. Use lazy_load() to process documents one at a time without loading all of them into memory simultaneously.
loader = DirectoryLoader("./large_corpus/", glob="*.txt", loader_cls=TextLoader)
# Lazy load — processes one document at a time
total_characters = 0
for doc in loader.lazy_load():
    total_characters += len(doc.page_content)
    # Process doc here, don't keep all in memory
print(f"Total characters across all documents: {total_characters}")
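Lazy loading pairs naturally with batching when a downstream step prefers groups of documents. The sketch below uses a hypothetical generator, fake_lazy_load, in place of a real loader so it runs anywhere; in_batches is likewise an illustrative helper, not part of LangChain.

```python
from itertools import islice
from typing import Iterable, Iterator


def fake_lazy_load(count: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(): yields one item at a time."""
    for i in range(count):
        yield f"document {i}"


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group a lazy stream into lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


batch_sizes = []
for batch in in_batches(fake_lazy_load(7), batch_size=3):
    # Only the current batch is held in memory at this point.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [3, 3, 1]
```

Memory use is then bounded by the batch size rather than by the size of the whole corpus.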
Document Loader Comparison
Loader                   File Type     Notes
────────────────────────────────────────────────────────────
TextLoader               .txt          Simple, reliable
PyPDFLoader              .pdf          One doc per page
Docx2txtLoader           .docx         Single document
WebBaseLoader            URL           Strips HTML tags
CSVLoader                .csv          One doc per row
DirectoryLoader          Any type      Batch processing
JSONLoader               .json         jq path required
YoutubeLoader            YouTube URL   Transcript extraction
UnstructuredFileLoader   Many types    Universal fallback
Handling Common Loading Errors
Encoding Errors
# Try UTF-8 first, fall back to latin-1 for older files.
# TextLoader may wrap a decode failure in RuntimeError, so catch both.
try:
    loader = TextLoader("file.txt", encoding="utf-8")
    docs = loader.load()
except (UnicodeDecodeError, RuntimeError):
    loader = TextLoader("file.txt", encoding="latin-1")
    docs = loader.load()
Missing Files
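from pathlib import Path
from langchain_community.document_loaders import TextLoader
def safe_load(filepath: str) -> list:
    # Return an empty list instead of raising when the file is missing
    if not Path(filepath).exists():
        print(f"Warning: {filepath} not found")
        return []
    loader = TextLoader(filepath)
    return loader.load()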
Empty Files
def load_and_filter(filepath: str) -> list:
    docs = safe_load(filepath)
    # Remove documents with no content
    return [doc for doc in docs if doc.page_content.strip()]
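The three checks can be combined into one pass that also records why a file was skipped. This is a standard-library sketch under stated assumptions: read_with_fallback and load_many are hypothetical helpers, plain dictionaries stand in for Document objects, and the encoding list is an example rather than a recommendation.

```python
import tempfile
from pathlib import Path


def read_with_fallback(path: Path, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")


def load_many(paths) -> tuple[list[dict], dict]:
    """Load text files, recording the reason for each skipped file."""
    loaded, skipped = [], {}
    for raw_path in paths:
        path = Path(raw_path)
        if not path.exists():
            skipped[str(path)] = "missing"
            continue
        text = read_with_fallback(path)
        if not text.strip():
            skipped[str(path)] = "empty"
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, skipped


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    legacy = Path(tmp) / "legacy.txt"
    blank = Path(tmp) / "blank.txt"
    good.write_text("Meeting at 10am.", encoding="utf-8")
    legacy.write_bytes("Café menu".encode("latin-1"))  # not valid UTF-8
    blank.write_text("   \n", encoding="utf-8")
    docs, skipped = load_many([good, legacy, blank, Path(tmp) / "gone.txt"])

print(len(docs), sorted(skipped.values()))  # 2 ['empty', 'missing']
```

Keeping latin-1 last matters: it can decode any byte sequence, so it never fails and would mask a better match if tried first.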
Complete Example: Building a Document Library
from pathlib import Path
from langchain_community.document_loaders import (
PyPDFLoader, TextLoader, CSVLoader, WebBaseLoader
)
def build_document_library(config: dict) -> list:
    """
    Load documents from multiple sources.
    config = {
        "pdfs": ["report.pdf", "manual.pdf"],
        "texts": ["notes.txt", "readme.txt"],
        "urls": ["https://example.com/page1"],
        "csvs": ["data.csv"]
    }
    """
    all_docs = []
    for pdf_path in config.get("pdfs", []):
        if Path(pdf_path).exists():
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()
            print(f"PDF: {pdf_path} → {len(docs)} pages")
            all_docs.extend(docs)
    for text_path in config.get("texts", []):
        if Path(text_path).exists():
            loader = TextLoader(text_path, encoding="utf-8")
            docs = loader.load()
            print(f"Text: {text_path} → {len(docs)} doc(s)")
            all_docs.extend(docs)
    if config.get("urls"):
        loader = WebBaseLoader(config["urls"])
        docs = loader.load()
        print(f"Web: {len(config['urls'])} URL(s) → {len(docs)} doc(s)")
        all_docs.extend(docs)
    for csv_path in config.get("csvs", []):
        if Path(csv_path).exists():
            loader = CSVLoader(csv_path)
            docs = loader.load()
            print(f"CSV: {csv_path} → {len(docs)} rows")
            all_docs.extend(docs)
    print(f"\nTotal documents loaded: {len(all_docs)}")
    return all_docs
# Usage
config = {
    "pdfs": ["company_policy.pdf", "product_guide.pdf"],
    "texts": ["faq.txt"],
    "urls": ["https://company.com/about"],
    "csvs": ["products.csv"]
}
documents = build_document_library(config)
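After a mixed load like this, a quick tally by source type helps confirm that every source contributed what you expected. The sketch below is a hypothetical helper that works on the source strings found in each document's metadata; the sample list is made up to mirror the config above.

```python
from collections import Counter
from pathlib import Path


def count_by_type(sources: list[str]) -> Counter:
    """Count documents per source type, using the file suffix or 'web' for URLs."""
    counts = Counter()
    for source in sources:
        if source.startswith(("http://", "https://")):
            counts["web"] += 1
        else:
            counts[Path(source).suffix.lower() or "(no extension)"] += 1
    return counts


# In a real run this list would be [doc.metadata["source"] for doc in documents].
sources = [
    "company_policy.pdf", "company_policy.pdf", "faq.txt",
    "https://company.com/about", "products.csv", "products.csv",
]

counts = count_by_type(sources)
print(dict(counts))
```

A PDF count of zero here would point to a missing file, since the library function skips nonexistent paths without raising.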
Summary
Document Loaders read files from many sources and produce a standard list of Document objects, each containing text content and metadata. PyPDFLoader handles PDFs (one document per page), TextLoader handles plain text, Docx2txtLoader handles Word files, WebBaseLoader handles web pages, CSVLoader handles tabular data, and DirectoryLoader handles batches of files. Metadata records the source of each document piece and enables citations in AI responses. Always validate file existence, handle encoding errors, and filter empty documents.
