LangChain Text Splitters: Breaking Large Text

You loaded your documents. Now you face a practical problem: a 200-page PDF contains roughly 100,000 words, which is well over 100,000 tokens. Sending that entire document to the AI model in one call exceeds the context window of many models. Even when a model can accept it, doing so is slow and expensive, and most of the text is irrelevant to any single question. Text Splitters solve this by cutting large documents into smaller, overlapping chunks that fit comfortably within context window limits.
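As a rough check on the size problem, the sketch below converts a word count into an approximate token count and compares it with a context window. The 0.75 words-per-token ratio is a common rule of thumb for English text, and the 8,000 and 128,000 token window sizes are illustrative assumptions rather than properties of any particular model.

```python
import math

WORDS_PER_TOKEN = 0.75  # Assumed rule of thumb for English; real tokenizers vary


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a given number of English words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_window(word_count: int, context_window: int) -> bool:
    """True if the estimated token count fits in the (assumed) window size."""
    return estimate_tokens(word_count) <= context_window


tokens = estimate_tokens(100_000)
print(f"Estimated tokens: {tokens}")
print(f"Fits in an 8,000-token window: {fits_in_window(100_000, 8_000)}")
print(f"Fits in a 128,000-token window: {fits_in_window(100_000, 128_000)}")
```

Treat the result as an order-of-magnitude estimate only; an exact count requires the tokenizer of the model you actually call.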

The Newspaper Clipper Analogy

Imagine a researcher who needs to find every article about climate change from a five-year archive of newspapers. They do not read every newspaper front to back each time someone asks a question. Instead, they cut out relevant articles, organize them by topic, and file them in labeled folders. When a question comes in, they pull only the relevant clippings. LangChain Text Splitters cut your documents into clippings, and the Retriever (covered in the next topic) finds the right clippings for each question.

Full Document (too large):
┌────────────────────────────────────────────────────────────┐
│ Chapter 1 (2000 words) │ Chapter 2 (1800 words) │ Ch 3...  │
│ [Exceeds context window — cannot send all at once]         │
└────────────────────────────────────────────────────────────┘

After Splitting (manageable chunks):
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Chunk 1     │  │  Chunk 2     │  │  Chunk 3     │
│  500 words   │  │  500 words   │  │  500 words   │
│  [Fits!]     │  │  [Fits!]     │  │  [Fits!]     │
└──────────────┘  └──────────────┘  └──────────────┘

The Chunk Overlap Concept

When you split a document, sentences near the cut points belong to the context of both neighboring chunks. If a key idea spans the boundary between chunk 3 and chunk 4, and you send only chunk 3, the answer might be incomplete. Overlap addresses this by repeating a small amount of text from the end of one chunk at the start of the next, so an idea that crosses a boundary is more likely to appear whole in at least one chunk.

Without overlap:
  Chunk 1: "...The treaty was signed on June 15th."
  Chunk 2: "The implications were immediate. Trade began..."

The two sentences "The treaty was signed on June 15th. The implications
were immediate." belong together but land in different chunks. Neither
chunk tells the complete story.

With overlap (100-character overlap):
  Chunk 1: "...The treaty was signed on June 15th."
  Chunk 2: "The treaty was signed on June 15th. The implications were immediate. Trade began..."

Now Chunk 2 contains the complete idea including the context from Chunk 1's end.
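The overlap idea can be sketched in plain Python as a sliding window. This is a simplified illustration, not LangChain's implementation: it cuts at fixed character positions, whereas LangChain's splitters cut at separators, so their actual overlap is at most chunk_overlap. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # How far each new chunk starts after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks


text = "The treaty was signed on June 15th. The implications were immediate."
for i, chunk in enumerate(chunk_with_overlap(text, chunk_size=40, overlap=15), 1):
    print(f"Chunk {i}: {chunk!r}")
```

Each chunk after the first begins with the last 15 characters of the previous one, which is the repeated text the overlap setting controls.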

RecursiveCharacterTextSplitter: The Best Default

RecursiveCharacterTextSplitter is the most commonly used text splitter in LangChain and a sensible default for general text. It tries to split at natural boundaries: paragraphs first, then lines, then sentences (when a sentence separator such as ". " is included, as in the example below), then words, falling back to character-level splitting only when necessary. This tends to produce chunks that are complete thoughts rather than mid-sentence fragments.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum characters per chunk
    chunk_overlap=200,      # Characters repeated at chunk boundaries
    length_function=len,    # How to measure chunk size (character count)
    separators=["\n\n", "\n", ". ", " ", ""]  # Tried in order; the default list omits ". "
)

# Split a list of documents (from a loader)
chunks = splitter.split_documents(documents)

print(f"Original: {len(documents)} document(s)")
print(f"After splitting: {len(chunks)} chunks")
print(f"\nFirst chunk:\n{chunks[0].page_content}")
print(f"\nFirst chunk metadata: {chunks[0].metadata}")

The metadata from the original document carries over to each chunk. With PyPDFLoader, which numbers pages from 0, a chunk from the third page of "report.pdf" keeps {"source": "report.pdf", "page": 2}. This traceability is essential for citation features.

Understanding the Separators List

The separators parameter defines the order in which the splitter tries to cut text. It moves down the list, using the next separator only if splitting at the current one produces a chunk that is still too large.

separators=["\n\n", "\n", ". ", " ", ""]

Step 1: Try splitting at blank lines (\n\n) — paragraph boundaries
Step 2: If chunks are still too large, try line breaks (\n)
Step 3: If still too large, try sentence endings (". ")
Step 4: If still too large, try word boundaries (" ")
Step 5: If still too large, split at any character ("")

For most English text, step 1 or step 2 produces good chunks. The algorithm only goes deeper when it has no choice. This keeps chunks semantically meaningful — complete paragraphs or complete sentences — as much as possible.
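The step-by-step fallback described above can be sketched in plain Python. This is a deliberately simplified illustration and not LangChain's actual code: it ignores overlap, discards separators at cut points, and the function name recursive_split is hypothetical.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitting: no overlap, cut-point separators dropped."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Out of separators (or reached ""): cut at fixed character positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur in the text; move down the list
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows here."
for chunk in recursive_split(sample, 30, ["\n\n", "\n", ". ", " ", ""]):
    print(repr(chunk))
```

The first paragraph fits within 30 characters and stays whole. The second does not, so only that paragraph is retried with finer separators until word boundaries produce pieces that fit.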

Splitting Plain Text (Not Documents)

If you have a text string rather than a list of Document objects, use split_text() instead of split_documents().

text = """
Python is a high-level programming language. It was created by Guido van Rossum.
Python is known for its clear syntax and readability.

Python supports multiple programming paradigms. These include object-oriented,
functional, and procedural programming.

Python has a large standard library. It includes modules for file I/O,
networking, and data manipulation.
"""

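# Use a small chunk_size so this short sample actually splits;
# with chunk_size=1000 the whole text would fit in a single chunk.
small_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = small_splitter.split_text(text)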

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

CharacterTextSplitter: Simple Alternative

CharacterTextSplitter splits text at a single specified separator. It is simpler than RecursiveCharacterTextSplitter and useful when your text has clear, consistent structure.

from langchain_text_splitters import CharacterTextSplitter

# Split at every blank line (good for articles or blog posts)
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=100
)

chunks = splitter.split_documents(documents)

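Use this when your document has a clear structure (emails separated by blank lines, articles with clear paragraph breaks) and you want splits to happen only at that specific boundary. Note that a section with no separator inside it is not split further, so it can come out longer than chunk_size; LangChain logs a warning when that happens.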
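To make the single-separator behavior concrete, here is a minimal plain-Python sketch of the idea. It is an approximation for illustration only, with no overlap handling, and split_on_separator is a hypothetical name rather than a LangChain function.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split at one separator, merging neighbors while they fit in chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # Adding this piece would overflow the chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbb\n\ncc\n\n" + "d" * 20
for chunk in split_on_separator(text, "\n\n", chunk_size=10):
    print(len(chunk), repr(chunk))
```

The last section has no blank line inside it, so it stays in one piece even though it is twice the requested chunk size. This is the tradeoff of having no fallback separators.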

TokenTextSplitter: Split by Token Count

Chunk size measured in characters is an approximation of token count. For precision — especially when you are close to a model's context window limit — use TokenTextSplitter to split by actual token count rather than character count.

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=500,       # Maximum 500 tokens per chunk
    chunk_overlap=50      # 50 token overlap between chunks
)

chunks = splitter.split_documents(documents)

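# Each chunk now contains at most 500 tokens, as counted by the splitter's encoding

Token splitting requires the tiktoken library: pip install tiktoken. It is slower than character splitting, but it keeps every chunk within chunk_size tokens as counted by the tiktoken encoding in use. Pass encoding_name or model_name so the count matches your model's tokenizer; models that do not use a tiktoken encoding will count somewhat differently.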
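To see why character counts only approximate token counts, the sketch below applies a fixed characters-per-token ratio to a few strings. The ratio of 4 is an assumed average for English text, and approx_token_len is a hypothetical helper, not part of LangChain or tiktoken.

```python
import math

CHARS_PER_TOKEN = 4  # Assumed average for English text; real tokenizers vary


def approx_token_len(text: str) -> int:
    """Estimate the token count of `text` from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


samples = [
    "Python is a high-level programming language.",
    "supercalifragilisticexpialidocious",
    "a b c d e f g h i j",
]
for sample in samples:
    print(f"{len(sample):>3} chars ~ {approx_token_len(sample):>2} tokens (estimate)")
```

A real tokenizer would likely give quite different counts for the second and third samples, which is exactly the gap TokenTextSplitter closes. A function of this shape (string in, integer out) is also what the length_function parameter shown earlier expects.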

MarkdownHeaderTextSplitter: Structure-Aware Splitting

Markdown documents have built-in structure (# headings, ## subheadings). Splitting by structure rather than arbitrary character counts keeps related content together and adds heading information to each chunk's metadata.

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """
# Introduction
This document covers Python basics.

## Variables
Variables store data. You create them with the equals sign.

### Variable Types
Python has several built-in types: strings, integers, floats, and booleans.

## Functions
Functions are reusable blocks of code.
"""

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print(f"Content: {chunk.page_content[:100]}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
# Metadata records the heading path of each chunk. For the "Variable Types" chunk:
# {"h1": "Introduction", "h2": "Variables", "h3": "Variable Types"}

The heading context in metadata makes every chunk traceable to its section. When a user asks "How do I use variables in Python?" and the retriever finds the relevant chunk, the metadata shows exactly which section it came from, enabling precise citations.
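The way heading metadata accumulates can be sketched without LangChain. The function below is a simplified illustration under stated assumptions: it treats any line starting with a listed marker as a heading (so it would be confused by "#" lines inside code blocks), and split_markdown_by_headers is a hypothetical name.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group lines under their headings; return (content, metadata) pairs."""
    key_for = dict(headers_to_split_on)  # e.g. "##" -> "h2"
    level_of = {key: len(marker) for marker, key in headers_to_split_on}

    sections, buffer, path = [], [], {}

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            sections.append((body, dict(path)))  # Copy the current heading path
        buffer.clear()

    for line in text.splitlines():
        marker, _, title = line.strip().partition(" ")
        if marker in key_for and title.strip():
            flush()  # Close the section that was open before this heading
            level = len(marker)
            # A new heading replaces any heading at the same or a deeper level
            for key in [k for k in path if level_of[k] >= level]:
                del path[key]
            path[key_for[marker]] = title.strip()
        else:
            buffer.append(line)
    flush()
    return sections


markdown_text = """
# Introduction
This document covers Python basics.

## Variables
Variables store data. You create them with the equals sign.

### Variable Types
Python has several built-in types: strings, integers, floats, and booleans.

## Functions
Functions are reusable blocks of code.
"""

headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
for content, metadata in split_markdown_by_headers(markdown_text, headers):
    print(metadata, "->", content[:40])
```

Note how the "Functions" section drops the "Variables" and "Variable Types" entries: a heading resets everything at its own level and below, while the parent "Introduction" heading is kept.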

HTMLHeaderTextSplitter

WebBaseLoader returns plain text with the HTML tags already removed, so the heading structure is gone by the time you split. To preserve headings, give HTMLHeaderTextSplitter the raw HTML instead.

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

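# Raw HTML with tags intact; a small inline sample for illustration
html_content = """
<html><body>
<h1>Python Basics</h1>
<p>This page covers Python basics.</p>
<h2>Variables</h2>
<p>Variables store data.</p>
</body></html>
"""

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_content)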

Choosing the Right Chunk Size

Chunk size is one of the most important settings affecting answer quality. There is no universal perfect size; the right size depends on your documents and questions.

Chunk Size (chars)   Good For                       Tradeoff
──────────────────────────────────────────────────────────────────────────────
200-400              Short facts, product info      May miss context
500-1000             General documents, articles    Good balance
1000-2000            Technical docs, long topics    Better context, higher cost
2000+                Very detailed questions        High token cost per query

Guideline: Match Chunk Size to Question Complexity

If users ask specific, narrow questions ("What is the return policy?"), smaller chunks (500 characters) work well. If users ask broad, contextual questions ("Explain the company's pricing strategy"), larger chunks (1500 characters) provide better context.

Guideline: Test Different Sizes

Build your initial system with chunk_size=1000 and chunk_overlap=200. Run 10-20 representative questions. If answers feel incomplete, increase chunk_size or overlap. If answers are too verbose or slow, decrease chunk_size.
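When comparing candidate settings, it helps to estimate how many chunks each one produces before running anything. The sketch below uses an idealized model of fixed-size windows with a fixed overlap; real splitters cut at separators, so actual counts will differ somewhat. The function name and the 100,000-character document length are assumptions for illustration.

```python
import math


def estimate_chunk_count(text_len: int, chunk_size: int, overlap: int) -> int:
    """Idealized chunk count for fixed-size windows with a fixed overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    # After the first chunk, each new chunk adds (chunk_size - overlap) new characters
    return math.ceil((text_len - overlap) / (chunk_size - overlap))


doc_len = 100_000  # Hypothetical document length in characters
for chunk_size in (500, 1000, 1500):
    overlap = chunk_size // 5  # Keep overlap at 20% of chunk size
    count = estimate_chunk_count(doc_len, chunk_size, overlap)
    print(f"chunk_size={chunk_size:>4} overlap={overlap:>3} -> about {count} chunks")
```

Smaller chunks mean more embeddings to compute and store, while larger chunks mean more tokens sent per retrieved result, so the count is one input to the cost side of the tradeoff.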

Preserving Metadata Through Splitting

When you split a Document, every resulting chunk inherits the original document's metadata. You can add more metadata during or after splitting to track chunk-specific information.

chunks = splitter.split_documents(documents)

# Add chunk position metadata
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["chunk_total"] = len(chunks)
    chunk.metadata["word_count"] = len(chunk.page_content.split())

# Now each chunk knows its position in the overall list of chunks
print(chunks[5].metadata)
# {
#   "source": "report.pdf",
#   "page": 2,
#   "chunk_id": 5,
#   "chunk_total": 47,
#   "word_count": 183
# }
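One use for this metadata is building a readable citation for each retrieved chunk. The helper below is a hypothetical sketch that works on a plain metadata dictionary; it assumes the page value is a 0-based index (as PyPDFLoader produces) and that chunk_id and chunk_total were added by a loop like the one above.

```python
def format_citation(metadata: dict) -> str:
    """Build a human-readable citation string from chunk metadata."""
    parts = [metadata.get("source", "unknown source")]
    if "page" in metadata:
        # Assumes a 0-based page index, so add 1 for display
        parts.append(f"page {metadata['page'] + 1}")
    if "chunk_id" in metadata and "chunk_total" in metadata:
        # chunk_id comes from enumerate() and is 0-based, so add 1 for display
        parts.append(f"chunk {metadata['chunk_id'] + 1} of {metadata['chunk_total']}")
    return ", ".join(parts)


example = {
    "source": "report.pdf",
    "page": 2,
    "chunk_id": 5,
    "chunk_total": 47,
    "word_count": 183,
}
print(format_citation(example))  # report.pdf, page 3, chunk 6 of 47
```

With real chunks you would pass chunk.metadata instead of the example dictionary.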

Complete Splitting Pipeline

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_split(pdf_path: str, chunk_size: int = 1000, overlap: int = 200):
    """Load a PDF and split it into chunks ready for embedding."""

    # Step 1: Load
    print(f"Loading {pdf_path}...")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    print(f"  Loaded {len(documents)} pages")

    # Step 2: Filter empty pages
    documents = [doc for doc in documents if doc.page_content.strip()]
    print(f"  After filtering empty pages: {len(documents)}")

    # Step 3: Split
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    chunks = splitter.split_documents(documents)
    print(f"  Created {len(chunks)} chunks")
    if chunks:  # Avoid dividing by zero when the PDF has no extractable text
        avg_size = sum(len(c.page_content) for c in chunks) // len(chunks)
        print(f"  Average chunk size: {avg_size} chars")

    return chunks

chunks = load_and_split("company_report.pdf")
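After running a pipeline like this, a quick look at the distribution of chunk sizes can reveal problems that an average hides, such as many tiny chunks. The sketch below works on a plain list of strings; chunk_size_stats and the "tiny" threshold of one tenth of chunk_size are assumptions chosen for illustration.

```python
from statistics import mean


def chunk_size_stats(texts: list[str], chunk_size: int) -> dict:
    """Summarize chunk lengths in characters for a quick sanity check."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        # Chunks longer than the configured limit
        "oversized": sum(1 for n in lengths if n > chunk_size),
        # Chunks shorter than 10% of the limit (arbitrary illustrative threshold)
        "tiny": sum(1 for n in lengths if n < chunk_size // 10),
    }


sample_texts = ["a" * 1000, "b" * 700, "c" * 50]
print(chunk_size_stats(sample_texts, chunk_size=1000))
```

With real chunks, pass [c.page_content for c in chunks] as the first argument.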

Splitting Diagram: Full View

company_report.pdf
│
├── Page 1: "Executive Summary... [1500 chars]"
│   → Chunk 1: "Executive Summary... [1000]"
│   → Chunk 2: "...Summary continued [700]" (200 char overlap)
│
├── Page 2: "Revenue Analysis... [2200 chars]"
│   → Chunk 3: "Revenue Analysis... [1000]"
│   → Chunk 4: "...Revenue continued [1000]" (200 char overlap)
│   → Chunk 5: "...Final revenue data [600]" (200 char overlap)
│
└── Page 3: "Conclusion... [600 chars]"
    → Chunk 6: "Conclusion... [600]" (fits in one chunk)

Total: 3 pages → 6 chunks
Each chunk: max 1000 chars, up to 200 chars of overlap with the previous chunk on the same page
Each chunk metadata: {source, page}, plus any fields you add such as chunk_id

Summary

Text Splitters break large documents into smaller chunks that fit within AI model context windows. RecursiveCharacterTextSplitter is the best default because it splits at natural language boundaries (paragraphs, sentences) rather than arbitrary character positions. Chunk overlap prevents loss of context at chunk boundaries. Token-based splitting provides precise control when you need to stay within exact token limits. Structure-aware splitters for Markdown and HTML preserve heading hierarchy in chunk metadata. The right chunk size depends on your documents and the types of questions users will ask.
