LangChain Document Loaders Feeding

AI models know a lot about the world from their training data, but they know nothing about your specific files, your company's documents, your product manuals, or your personal notes. Document Loaders bridge this gap. They read files from many different sources and convert the content into a format LangChain can process and feed to the AI model. This topic covers the most common document sources, how loading works under the hood, and best practices for handling real-world data.

The Librarian Analogy

A librarian retrieves specific books from shelves, opens them, reads the relevant sections, and brings you the information you need. They handle the physical task of finding and reading so you do not have to. LangChain Document Loaders do the same thing for digital files — they find the file, open it, extract the text content, and hand it to you as a standardized Python object ready for further processing.

Document Loader Workflow:

File on disk (PDF, DOCX, CSV, etc.)
          │
          ▼
┌─────────────────────┐
│  Document Loader    │  ← Opens file, extracts text
│  (source-specific)  │
└──────────┬──────────┘
           │
           ▼
List of Document objects:
┌────────────────────────────────────────────┐
│  Document(                                 │
│    page_content="The text from the file",  │
│    metadata={                              │
│      "source": "report.pdf",               │
│      "page": 1,                            │
│      "author": "Jane Smith"                │
│    }                                       │
│  )                                         │
└────────────────────────────────────────────┘

The Document object has two parts: page_content (the actual text) and metadata (information about the source). Metadata tells you where each piece of text came from, which is essential when you want to cite sources in AI responses.
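
To make the two-part structure concrete, here is a minimal sketch that mimics it with a plain dataclass. The Document class below is a stand-in written for illustration (LangChain ships its own), and the describe helper is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for illustration only; LangChain ships its own Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe(doc: Document) -> str:
    """Summarize where a piece of text came from (hypothetical helper)."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    location = source if page is None else f"{source}, page {page}"
    return f"{len(doc.page_content)} characters from {location}"


doc = Document(
    page_content="The text from the file",
    metadata={"source": "report.pdf", "page": 1},
)
summary = describe(doc)
print(summary)
```

Because metadata is an ordinary dictionary, any key can be missing, which is why the sketch reads it with get() instead of indexing directly.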

Installing Required Packages

Different file types require different Python packages. Install what you need:

# For PDF files
pip install pypdf

# For Word documents
pip install docx2txt

# For web pages
pip install requests beautifulsoup4

# For Excel files (CSVLoader itself only needs Python's built-in csv module)
pip install pandas openpyxl

# The loaders themselves live in langchain-community; install it together with the parsers you need
pip install langchain-community pypdf docx2txt beautifulsoup4

Loading Text Files

Plain text files (.txt) are the simplest format. TextLoader reads the file and returns it as a single Document.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("meeting_notes.txt", encoding="utf-8")
documents = loader.load()

print(f"Loaded {len(documents)} document(s)")
print(documents[0].page_content[:200])  # First 200 characters
print(documents[0].metadata)
# {'source': 'meeting_notes.txt'}

Always specify the encoding (usually "utf-8") to avoid errors with special characters like accented letters or symbols.
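
The following sketch shows, with the standard library only, why the encoding matters. It writes a short text in latin-1 and then tries to read it back as UTF-8. The file name is hypothetical and nothing here calls LangChain.

```python
import tempfile
from pathlib import Path

# Write "café" using latin-1, where "é" is the single byte 0xE9.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy_notes.txt"  # hypothetical file name
    path.write_bytes("café".encode("latin-1"))

    try:
        path.read_text(encoding="utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        # 0xE9 on its own is not a valid UTF-8 sequence.
        utf8_ok = False

    latin1_text = path.read_text(encoding="latin-1")

print(utf8_ok, latin1_text)
```

The same bytes are valid in one encoding and invalid in another, so stating the encoding explicitly removes the guesswork.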

Loading PDF Files

PDFs are the most common format for business documents, reports, and books. PyPDFLoader reads a PDF and creates one Document per page, making it easy to reference exact page numbers later.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("annual_report.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages")

# Access specific pages
first_page = documents[0]
print(first_page.page_content[:300])
print(first_page.metadata)
# {'source': 'annual_report.pdf', 'page': 0}  ('page' is zero-based)

# Find the executive summary (usually in first few pages)
for doc in documents[:5]:
    # Add 1 so the printed number matches the page number a reader sees
    print(f"Page {doc.metadata['page'] + 1}: {doc.page_content[:100]}\n")

Note that text extraction does not work equally well for every PDF. Scanned PDFs (images of text) contain no text layer, so they need OCR (optical character recognition) tools before any text can be extracted. PDFs created from digital sources (typed documents, spreadsheets exported to PDF) usually extract cleanly.
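
One practical consequence is that scanned pages tend to come back empty or nearly empty. A rough heuristic, sketched below, is to flag pages with very little extractable text. The function name and the 20-character threshold are assumptions for illustration, and the list of strings stands in for the page_content of each loaded page.

```python
def find_low_text_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages with almost no extractable text."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Sample data standing in for doc.page_content of each loaded page.
pages = [
    "Annual Report 2024. Revenue grew across all regions.",
    "   \n  ",   # whitespace only, as a scanned page might produce
    "",
    "Notes to the financial statements follow on the next page.",
]
suspect_pages = find_low_text_pages(pages)
print(suspect_pages)
```

Flagged pages are candidates for OCR. The heuristic can also catch pages that are legitimately blank, so treat it as a hint, not a verdict.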

Loading Word Documents

from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("employee_handbook.docx")
documents = loader.load()

print(documents[0].page_content[:500])

The entire Word document loads as a single Document. Tables and formatted lists in the document convert to plain text — formatting like bold or italic is lost, but the text content is preserved.

Loading Web Pages

WebBaseLoader fetches a URL and extracts the text content from the HTML.

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Photosynthesis")
documents = loader.load()

print(documents[0].page_content[:500])
print(documents[0].metadata)
# {'source': 'https://en.wikipedia.org/wiki/Photosynthesis', 'title': 'Photosynthesis'}

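WebBaseLoader uses BeautifulSoup to parse the HTML and returns the text of the whole page. By default that includes navigation, header, and footer text. To keep only the main content area, pass a filter through the bs_kwargs parameter (for example, a BeautifulSoup SoupStrainer that selects the article element). For pages that use JavaScript to load content, this loader may not capture everything because it does not execute scripts. Those pages require a browser-based loader.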
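
To illustrate what keeping only the main content of a page involves, here is a standard-library sketch that collects text while skipping a few boilerplate tags. It is not how WebBaseLoader is implemented. The class name, the helper, and the set of skipped tags are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose text is treated as boilerplate (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample_html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <header><h1>Site banner</h1></header>
  <main><h2>Photosynthesis</h2><p>Plants convert light into chemical energy.</p></main>
  <footer>Contact us</footer>
</body></html>
"""
main_text = extract_main_text(sample_html)
print(main_text)
```

Real pages are messier than this sample, so a tag list like this usually needs tuning per site.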

Loading Multiple URLs at Once

urls = [
    "https://docs.python.org/3/tutorial/",
    "https://docs.python.org/3/library/functions.html",
    "https://docs.python.org/3/reference/lexical_analysis.html"
]

loader = WebBaseLoader(urls)
documents = loader.load()

print(f"Loaded {len(documents)} web pages")

Loading CSV Files

CSVLoader reads tabular data and creates one Document per row. Each row becomes a small text block with column names and values clearly labeled.

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("products.csv")
documents = loader.load()

print(f"Loaded {len(documents)} rows")
print(documents[0].page_content)
# "product_id: 101
#  name: Laptop Pro 15
#  price: 1299.99
#  category: Electronics
#  stock: 45"

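This format works well for lookup-style questions such as "What does the Laptop Pro 15 cost?", because each row is a self-contained piece of context the AI can reason about. Questions that aggregate over many rows, like "Which products cost under $500?" or "How many items are in the Electronics category?", only work if every relevant row reaches the model, so for large tables they are usually better answered with a direct query over the data.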
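
The row-to-text conversion can be approximated with Python's csv module. The sketch below is an illustration of the idea, not CSVLoader's source code. The helper name is hypothetical and the inline CSV stands in for products.csv.

```python
import csv
import io

# Sample data standing in for products.csv.
csv_text = """product_id,name,price,category,stock
101,Laptop Pro 15,1299.99,Electronics,45
102,Desk Lamp,24.50,Home,120
"""


def rows_to_text_blocks(csv_file) -> list[str]:
    """Turn each CSV row into 'column: value' lines, one block per row."""
    reader = csv.DictReader(csv_file)
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


blocks = rows_to_text_blocks(io.StringIO(csv_text))
print(blocks[0])
```

Repeating the column name next to every value is what lets a model interpret a single row without seeing the header line.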

Loading Multiple Files from a Directory

DirectoryLoader loads all files matching a pattern from a folder. This is invaluable when you have dozens or hundreds of documents to process.

from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load all .txt files from a folder
loader = DirectoryLoader(
    path="./documents/",
    glob="**/*.txt",          # Match all .txt files, including subfolders
    loader_cls=TextLoader,    # Which loader to use for each file
    show_progress=True        # Show a progress bar (requires the tqdm package)
)
documents = loader.load()

print(f"Loaded {len(documents)} documents from directory")

# See which files were loaded
sources = [doc.metadata["source"] for doc in documents]
for source in sources:
    print(source)
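
If you are unsure what a glob pattern will match, you can try it with pathlib before handing it to a loader. This sketch assumes DirectoryLoader follows the same glob rules as pathlib. The folder layout is hypothetical and created in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical folder layout used only for this demonstration.
    (root / "archive").mkdir()
    (root / "notes.txt").write_text("top-level note", encoding="utf-8")
    (root / "archive" / "old.txt").write_text("archived note", encoding="utf-8")
    (root / "summary.md").write_text("not a text file", encoding="utf-8")

    recursive = sorted(p.name for p in root.glob("**/*.txt"))
    top_level_only = sorted(p.name for p in root.glob("*.txt"))

print(recursive)
print(top_level_only)
```

The "**/" prefix is what makes the pattern descend into subfolders. Without it, only files directly inside the folder match.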

Loading Mixed File Types

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# Load PDFs from one folder
pdf_loader = DirectoryLoader("./reports/", glob="*.pdf",
                              loader_cls=PyPDFLoader)

# Load text files from another
text_loader = DirectoryLoader("./notes/", glob="*.txt",
                               loader_cls=TextLoader)

# Combine all documents
all_documents = pdf_loader.load() + text_loader.load()
print(f"Total documents loaded: {len(all_documents)}")

Loading JSON Files

from langchain_community.document_loaders import JSONLoader

# Load a JSON file, extracting content from a specific field
loader = JSONLoader(
    file_path="customer_reviews.json",
    jq_schema=".[] | .review_text",   # Extract the review_text field from each item
    text_content=False
)

documents = loader.load()
print(f"Loaded {len(documents)} reviews")

The jq_schema parameter uses jq syntax (a JSON query language) to navigate the JSON structure and extract the text you want. JSONLoader depends on the jq Python package, so install it first: pip install jq.
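
If jq syntax is new to you, it can help to see what the expression ".[] | .review_text" selects, written as plain Python. The sketch below uses only the json module. The inline data is made up and stands in for customer_reviews.json.

```python
import json

# Sample data standing in for customer_reviews.json.
raw_json = """
[
  {"customer": "A. Rivera", "rating": 5, "review_text": "Fast shipping, works great."},
  {"customer": "B. Chen", "rating": 2, "review_text": "Stopped working after a week."}
]
"""

reviews = json.loads(raw_json)

# ".[]" iterates over the top-level array; ".review_text" picks one field from each item.
review_texts = [item["review_text"] for item in reviews]
print(review_texts)
```

Each selected value corresponds to the page_content of one Document, which is why the loader reports one document per review.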

Loading YouTube Transcripts

# pytube is only needed when add_video_info=True
pip install youtube-transcript-api pytube

from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=example_video_id",
    add_video_info=True
)
documents = loader.load()

print(documents[0].page_content[:500])   # Transcript text
print(documents[0].metadata)             # Title, channel, duration

Metadata: The Often-Overlooked Value

Metadata is as important as the content itself. When your AI answers a question by searching through hundreds of documents, you want to know exactly which document the answer came from. Metadata enables this citation feature.

# Load the file first, then attach custom metadata
loader = TextLoader("q3_financial_report.txt", encoding="utf-8")
documents = loader.load()

# Add extra metadata to each document
for doc in documents:
    doc.metadata["department"] = "Finance"
    doc.metadata["quarter"] = "Q3 2024"
    doc.metadata["confidential"] = True

# Later, when this document is retrieved:
# doc.metadata["source"] = "q3_financial_report.txt"
# doc.metadata["department"] = "Finance"
# doc.metadata["quarter"] = "Q3 2024"

Your AI application can then include this metadata in its responses: "According to the Q3 2024 Finance report, revenue grew by 12%." This makes AI responses easier to verify, because a reader can open the cited file and check the claim.
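
Turning metadata into a citation is ordinary string formatting. The sketch below shows one possible approach. The format_citation helper and its wording are assumptions for illustration, and the dictionary mirrors the metadata built in the example above.

```python
def format_citation(metadata: dict) -> str:
    """Build a short source label from whichever metadata keys are present."""
    parts = [metadata.get("quarter"), metadata.get("department")]
    label = " ".join(part for part in parts if part)
    source = metadata.get("source", "unknown source")
    return f"{label} report ({source})" if label else source


# Metadata as it would look after the custom keys have been added.
metadata = {
    "source": "q3_financial_report.txt",
    "department": "Finance",
    "quarter": "Q3 2024",
    "confidential": True,
}
citation = format_citation(metadata)
print(f"According to the {citation}, ...")
```

The helper falls back to the bare source when the custom keys are absent, so it also works for documents that were loaded without extra metadata.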

Lazy Loading for Large Document Sets

Loading thousands of large documents into memory at once can crash your application. Use lazy_load() to process documents one at a time without loading all of them into memory simultaneously.

loader = DirectoryLoader("./large_corpus/", glob="*.txt", loader_cls=TextLoader)

# Lazy load — processes one document at a time
total_characters = 0
for doc in loader.lazy_load():
    total_characters += len(doc.page_content)
    # Process doc here, don't keep all in memory

print(f"Total characters across all documents: {total_characters}")
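
The memory saving comes from the generator pattern: nothing is read until the loop asks for the next item. The sketch below demonstrates that behaviour with a plain Python generator. The lazy_read function is hypothetical and the strings stand in for file contents.

```python
from typing import Iterable, Iterator


def lazy_read(texts: Iterable[str], log: list[str]) -> Iterator[str]:
    """Yield one text at a time, recording when each one is actually produced."""
    for index, text in enumerate(texts):
        log.append(f"produced item {index}")
        yield text


log: list[str] = []
# Sample strings standing in for file contents.
stream = lazy_read(["alpha", "beta", "gamma"], log)

produced_before_iteration = len(log)   # nothing has been read yet
first = next(stream)
produced_after_first = len(log)        # only the first item has been read

total_characters = len(first) + sum(len(text) for text in stream)
print(produced_before_iteration, produced_after_first, total_characters)
```

Calling list() on the generator would pull everything into memory again, so keep the processing inside the loop.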

Document Loader Comparison

Loader                  File Type        Notes
────────────────────────────────────────────────────────────
TextLoader              .txt             Simple, reliable
PyPDFLoader             .pdf             One doc per page
Docx2txtLoader          .docx            Single document
WebBaseLoader           URL              Strips HTML tags
CSVLoader               .csv             One doc per row
DirectoryLoader         Any type         Batch processing
JSONLoader              .json            jq path required
YoutubeLoader           YouTube URL      Transcript extraction
UnstructuredFileLoader  Many types       Universal fallback
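
The table can double as a lookup when a script has to choose a loader per file. Here is a small sketch of that idea. The mapping and the pick_loader_name helper are hypothetical, and loader names are kept as strings so the example runs without LangChain installed.

```python
from pathlib import Path

# Mapping taken from the comparison table; values are loader class names as strings
# so the sketch runs without LangChain installed.
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".csv": "CSVLoader",
    ".json": "JSONLoader",
}


def pick_loader_name(filename: str) -> str:
    """Return the loader suggested by the table, with the fallback for other types."""
    extension = Path(filename).suffix.lower()
    return LOADER_BY_EXTENSION.get(extension, "UnstructuredFileLoader")


choices = {name: pick_loader_name(name) for name in ["Report.PDF", "notes.txt", "slides.pptx"]}
print(choices)
```

In a real project the dictionary values would be the loader classes themselves, and JSONLoader would still need its jq_schema argument.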

Handling Common Loading Errors

Encoding Errors

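# Try UTF-8 first, fall back to latin-1 for older files.
# Recent versions of TextLoader wrap decoding failures in a RuntimeError,
# so catch both exception types.
try:
    loader = TextLoader("file.txt", encoding="utf-8")
    docs = loader.load()
except (UnicodeDecodeError, RuntimeError):
    loader = TextLoader("file.txt", encoding="latin-1")
    docs = loader.load()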
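
The same fallback idea can be written as a small reusable function that also reports which encoding succeeded. This is a standard-library sketch. The read_with_fallback helper and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def read_with_fallback(path: Path, encodings=("utf-8", "latin-1")) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"None of {encodings} could decode {path}")


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"   # hypothetical file name
    legacy.write_bytes("naïve résumé".encode("latin-1"))
    text, used = read_with_fallback(legacy)

print(used, text)
```

Keep latin-1 last in the list. It can decode any byte sequence, so it never fails, but it may silently produce the wrong characters for files written in another encoding.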

Missing Files

from pathlib import Path

def safe_load(filepath: str) -> list:
    if not Path(filepath).exists():
        print(f"Warning: {filepath} not found")
        return []
    loader = TextLoader(filepath)
    return loader.load()

Empty Files

def load_and_filter(filepath: str) -> list:
    docs = safe_load(filepath)
    # Remove documents with no content
    return [doc for doc in docs if doc.page_content.strip()]

Complete Example: Building a Document Library

from pathlib import Path
from langchain_community.document_loaders import (
    PyPDFLoader, TextLoader, CSVLoader, WebBaseLoader
)

def build_document_library(config: dict) -> list:
    """
    Load documents from multiple sources.
    config = {
        "pdfs": ["report.pdf", "manual.pdf"],
        "texts": ["notes.txt", "readme.txt"],
        "urls": ["https://example.com/page1"],
        "csvs": ["data.csv"]
    }
    """
    all_docs = []

    for pdf_path in config.get("pdfs", []):
        if Path(pdf_path).exists():
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()
            print(f"PDF: {pdf_path} → {len(docs)} pages")
            all_docs.extend(docs)

    for text_path in config.get("texts", []):
        if Path(text_path).exists():
            loader = TextLoader(text_path, encoding="utf-8")
            docs = loader.load()
            print(f"Text: {text_path} → {len(docs)} doc(s)")
            all_docs.extend(docs)

    if config.get("urls"):
        loader = WebBaseLoader(config["urls"])
        docs = loader.load()
        print(f"Web: {len(config['urls'])} URL(s) → {len(docs)} doc(s)")
        all_docs.extend(docs)

    for csv_path in config.get("csvs", []):
        if Path(csv_path).exists():
            loader = CSVLoader(csv_path)
            docs = loader.load()
            print(f"CSV: {csv_path} → {len(docs)} rows")
            all_docs.extend(docs)

    print(f"\nTotal documents loaded: {len(all_docs)}")
    return all_docs

# Usage
config = {
    "pdfs": ["company_policy.pdf", "product_guide.pdf"],
    "texts": ["faq.txt"],
    "urls": ["https://company.com/about"],
    "csvs": ["products.csv"]
}

documents = build_document_library(config)
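
After building a library like this, a quick sanity check is to count how many documents each source produced. The sketch below does that with collections.Counter. The metadata dictionaries are made-up sample data standing in for doc.metadata of each loaded document.

```python
from collections import Counter

# Metadata dicts standing in for doc.metadata of each loaded document.
loaded_metadata = [
    {"source": "company_policy.pdf", "page": 0},
    {"source": "company_policy.pdf", "page": 1},
    {"source": "faq.txt"},
    {"source": "products.csv", "row": 0},
    {"source": "products.csv", "row": 1},
    {"source": "products.csv", "row": 2},
]

docs_per_source = Counter(meta["source"] for meta in loaded_metadata)
for source, count in docs_per_source.most_common():
    print(f"{source}: {count} document(s)")
```

A source that is missing from the counts, or shows far fewer documents than expected, is a sign that a file was skipped or failed to parse.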

Summary

Document Loaders read files from many sources and produce a standard list of Document objects, each containing text content and metadata. PyPDFLoader handles PDFs (one document per page), TextLoader handles plain text, Docx2txtLoader handles Word files, WebBaseLoader handles web pages, CSVLoader handles tabular data, and DirectoryLoader handles batches of files. Metadata records the source of each document piece and enables citations in AI responses. Always validate file existence, handle encoding errors, and filter empty documents.
