Document Chunking for RAG: Dividing Content for Optimal Retrieval

You’ve ingested your documents, cleaned the text, and extracted metadata. Now comes a critical decision: how should you divide documents into chunks for embedding and retrieval?

Chunk poorly and your retriever will either miss relevant information or return irrelevant results. Get it right and your RAG system retrieves exactly what it needs.

Why Chunking Matters

The core problem: Your embedding model has context limits (typically 512-8192 tokens), and entire documents often exceed them. Documents also contain multiple ideas, so embedding a whole document as a single vector loses granularity: when you search, you retrieve the whole document even if only a small section matches the query.

The solution: Break documents into chunks small enough to embed, large enough to contain sufficient context.

Fixed-Size Chunking

The simplest approach: split documents into chunks of fixed character or token length with optional overlap.

Example: 500-character chunks with 100-character overlap

Document: "Long text spanning many topics..."
Chunk 1: [0-500]
Chunk 2: [400-900]      (overlaps with Chunk 1)
Chunk 3: [800-1300]     (overlaps with Chunk 2)
...
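The windowing above can be reproduced in a few lines. This is a minimal sketch that measures size in characters; the function name `fixed_size_chunks` and the placeholder document are illustrative assumptions, not part of any library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[tuple[int, int, str]]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), stride):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # this window already reaches the end of the text
    return chunks


document = "x" * 1300  # placeholder text, 1300 characters long
for start, end, _ in fixed_size_chunks(document):
    print(f"[{start}-{end}]")
# prints [0-500], [400-900], [800-1300], one per line
```

The stride is `chunk_size - overlap`, so a larger overlap means more chunks for the same document.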

Advantages:

Simple to implement
Predictable compute costs
Works with any content type
Deterministic and reproducible

Disadvantages:

Chunks may split mid-sentence or mid-idea
Same size works poorly across document types
Overlaps increase storage and compute

When to use: Quick prototyping, uniform content, or cases where computational simplicity matters more than retrieval quality.

Semantic Chunking

Rather than arbitrary size boundaries, semantic chunking respects document structure and meaning.

Approach 1: Sentence-based chunking

Document split by sentences, then grouped:
- Chunk 1: Sentences 1-5 (semantically related)
- Chunk 2: Sentences 6-12 (different topic)
- Chunk 3: Sentences 13-18 (another topic)
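A simple version of this splits on sentence boundaries and packs whole sentences into chunks up to a size budget. The sketch below is an assumption-laden starting point: the regex splitter is naive (it will trip on abbreviations such as "e.g."), grouping is by length rather than by topic, and both function names are made up for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def group_sentences(sentences: list[str], max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversize chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Chunking matters. It affects recall. Overlap helps at boundaries. Measure before tuning."
for chunk in group_sentences(split_sentences(text), max_chars=40):
    print(repr(chunk))
```

Grouping by topic instead of length requires a similarity signal between sentences, which is a separate step from the sentence splitting shown here.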

Approach 2: Paragraph-based chunking

Use existing paragraph breaks as natural boundaries, then group paragraphs into chunks.

Approach 3: Structure-aware chunking

Respect document hierarchy: keep sections, subsections, and lists intact within chunks rather than splitting across them.
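As one illustration, the sketch below assumes the source is Markdown with `#`-style headings and emits one chunk per section, tagged with its heading path. The function name `chunk_by_headings` is hypothetical, and other formats (HTML, PDF, DOCX) would need their own parser. It also does not special-case fenced code, so a `#` comment at the start of a line inside a code block would be misread as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per section, each tagged with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the section currently being read
    body = []  # lines collected for the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(title for _, title in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # emit the section that just ended, under its own path
            level, title = len(match.group(1)), match.group(2).strip()
            # Drop headings at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n"
for chunk in chunk_by_headings(doc):
    print(chunk["path"], "|", chunk["text"])
```

Storing the heading path with each chunk lets you prepend it at embedding time, so a short section still carries its place in the document.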

Advantages:

Respects semantic boundaries
More coherent chunks
Better for structured documents

Disadvantages:

Chunk sizes vary widely
Content-dependent, less predictable
Requires parsing document structure

When to use: Structured documents (articles, reports, documentation).

Recursive Chunking with Overlap

LangChain popularized recursive chunking: split by progressively smaller delimiters until chunks reach target size.

Process:

Level 1: Split by "\n\n" (paragraph breaks)
Level 2: Split by "\n" (line breaks)
Level 3: Split by ". " (sentence ends)
Level 4: Split by " " (words)

Apply each level only to pieces that are still too large, and stop once every chunk is within the size limit.
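To make the fallback order concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: the function name `recursive_split` is made up for illustration, sizes are in characters, and overlap is left out to keep the logic visible.

```python
def recursive_split(text: str, max_chars: int = 500,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing only into oversize pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No delimiter left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces back together
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_chars:
            # This piece is too big on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", max_chars=10))
# prints ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits as-is, while the second is too long and falls through to the word-level separator.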

Advantages:

Preserves semantic structure when possible
Falls back to smaller delimiters as needed
Good balance of quality and consistency

Disadvantages:

Depends on the content actually containing the delimiters
Delimiter list may need tuning for your content

Implementation:

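# Recent LangChain releases ship splitters in the langchain-text-splitters package;
# older releases exposed the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document = "First paragraph.\n\nSecond paragraph."  # stand-in for your cleaned text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum chunk length, measured in characters by default
    chunk_overlap=50,   # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarse to fine
)
chunks = splitter.split_text(document)  # returns a list of strings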

Specialized Chunking Strategies

Code Chunking

Programming code has its own structure (functions, classes, blocks).

Approach: Respect code structure, keep related code together, preserve context (imports, type definitions).

Tools: Tree-sitter for parsing language-specific structure.
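For Python sources specifically, the standard-library `ast` module is enough to illustrate the idea without Tree-sitter. The sketch below is one possible approach, not a general solution: it emits one chunk per top-level function or class and prepends the module's imports as context. The name `chunk_python_source` and the sample source are hypothetical.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """One chunk per top-level function or class, prefixed with the module's imports."""
    tree = ast.parse(source)
    lines = source.splitlines()
    # Collect import statements so every chunk carries the names it depends on.
    header = "\n".join(
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    )
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so decorators stay attached.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            body = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append({"name": node.name, "text": f"{header}\n\n{body}" if header else body})
    return chunks


sample = '''import math

def area(radius):
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius
'''

for chunk in chunk_python_source(sample):
    print(chunk["name"], "->", len(chunk["text"]), "characters")
```

Large classes may still exceed your size budget; in that case a second pass could split them per method while keeping the class signature as context.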

Table Chunking

Tables have rows and columns, not sentence structure.

Options:

Convert to prose descriptions
Keep table intact with row/column context
Index each row separately with table context
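The third option, one chunk per row with table context, might look like the sketch below. It assumes the table is available as CSV text; the function name, the table name, and the sample rows are made up for illustration.

```python
import csv
import io


def table_row_chunks(csv_text: str, table_name: str) -> list[str]:
    """Turn each row into a standalone chunk that repeats the table and column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        fields = "; ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append(f"Table: {table_name} | {fields}")
    return chunks


sample = "model,max_tokens\nsmall,512\nlarge,8192\n"  # illustrative data only
for chunk in table_row_chunks(sample, "embedding_models"):
    print(chunk)
```

Repeating the column names in every chunk costs some storage, but each row then embeds as a self-describing statement instead of a bare list of values.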

Long-Form Content

Books, manuals, and research papers span many thousands of tokens.

Strategy: Use multi-level chunking.

Level 1: Chapter or major section
Level 2: Subsection within chapter
Level 3: Detailed paragraph block

During retrieval, retrieve higher-level chunks first, then optionally refine with detailed chunks.
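One way to represent the three levels is a flat list of nodes that each point to their parent, so a coarse match can be expanded into its finer children. The sketch below assumes the document has already been parsed into a nested dict; `ChunkNode`, `build_levels`, and `children` are hypothetical names. Upper-level nodes store only their title here, whereas a real system would more likely embed a summary or the full section text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkNode:
    node_id: str
    level: int                 # 1 = chapter, 2 = subsection, 3 = paragraph block
    parent_id: Optional[str]
    text: str


def build_levels(book: dict[str, dict[str, list[str]]]) -> list[ChunkNode]:
    """Flatten {chapter: {subsection: [paragraph blocks]}} into linked nodes."""
    nodes = []
    for chapter, subsections in book.items():
        nodes.append(ChunkNode(chapter, 1, None, chapter))
        for subsection, blocks in subsections.items():
            sub_id = f"{chapter}/{subsection}"
            nodes.append(ChunkNode(sub_id, 2, chapter, subsection))
            for i, block in enumerate(blocks):
                nodes.append(ChunkNode(f"{sub_id}/{i}", 3, sub_id, block))
    return nodes


def children(nodes: list[ChunkNode], parent_id: str) -> list[ChunkNode]:
    """Refinement step: fetch the finer-grained chunks under a coarse match."""
    return [node for node in nodes if node.parent_id == parent_id]


book = {"Setup": {"Install": ["Run the installer.", "Verify the version."]}}
nodes = build_levels(book)
print([node.text for node in children(nodes, "Setup/Install")])
```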

Choosing Chunk Size

Factors to consider:

1. Embedding model context: What’s your model’s maximum input length?

2. Retrieval query length: Short, specific queries tend to match small, focused chunks. Complex queries benefit from larger chunks that provide more context.

3. Your domain:

Technical docs: Larger chunks (500-1000 tokens) preserve examples and explanation
News/articles: Medium chunks (300-500 tokens) align with natural paragraphs
Code: Smaller chunks (200-400 tokens) keep functions/methods intact

4. Downstream usage: What does your LLM generator expect?

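If you concatenate 5 chunks: each should be ~1000 tokens (about 5,000 tokens of retrieved context, which needs an 8K-context model; a 4K model cannot hold it)
If you concatenate 10 chunks: each should be ~500 tokens (the same ~5,000-token total)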
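A rough way to derive a per-chunk budget is to subtract a reserve for the prompt and the generated answer from the context window, then divide by the number of chunks. The helper name `max_chunk_tokens` and the 1,000-token reserve below are illustrative assumptions; pick a reserve that matches your own prompt template and expected answer length.

```python
def max_chunk_tokens(context_window: int, num_chunks: int, reserved_tokens: int = 1000) -> int:
    """Largest per-chunk size (in tokens) that lets num_chunks chunks fit in the window."""
    available = context_window - reserved_tokens
    if num_chunks <= 0 or available <= 0:
        raise ValueError("need a positive chunk count and room left after the reserve")
    return available // num_chunks


for window, k in [(4096, 5), (4096, 10), (8192, 5), (8192, 10)]:
    print(f"{window}-token window, {k} chunks: up to {max_chunk_tokens(window, k)} tokens each")
```

With these assumptions a 4,096-token window leaves roughly 600 tokens per chunk for 5 chunks, while an 8,192-token window leaves about 1,400.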

5. Overlap trade-offs:

No overlap: efficient but risks missing context at boundaries
10-20% overlap: sweet spot for most cases
High overlap (50%+): safety but costly

General starting point: 500-token chunks with 10-15% overlap. Adjust based on retrieval quality.
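The cost of overlap is easy to estimate: with fixed-size chunks, each new chunk advances by `chunk_size - overlap` tokens, so the chunk count grows as the overlap grows. The sketch below uses a hypothetical helper `chunk_count` and an illustrative 100,000-token corpus.

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks needed to cover doc_tokens with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # tokens of new content per additional chunk
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)


doc_tokens = 100_000  # illustrative corpus size
baseline = chunk_count(doc_tokens, 500, 0)
for overlap in (0, 50, 100, 250):
    n = chunk_count(doc_tokens, 500, overlap)
    print(f"overlap={overlap:>3} tokens: {n} chunks ({n / baseline:.2f}x the no-overlap count)")
```

At 10-20% overlap the index grows by roughly 11-25%, while 50% overlap about doubles it.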

Chunking Mistakes to Avoid

Chunks too small (< 100 tokens): Insufficient context, poor embedding quality, and an excessive number of chunks to index.

Chunks too large (> 2000 tokens): Exceed the input limits of many embedding models, dilute information density, and slow retrieval.

Ignoring domain structure: Chapters split mid-thought, code functions split across chunks.

Fixed size for all content: A single chunk size rarely works well across blogs, API documentation, and code alike.

Ignoring retrieval metrics: Choosing chunk size theoretically rather than empirically measuring retrieval accuracy.

Measuring Chunking Quality

Metric 1: Retrieval Recall. For known Q&A pairs, is the relevant chunk retrieved in the top K results?

Metric 2: Chunk Coherence. Do chunks stand alone, or do they reference context they don’t contain?

Metric 3: Embedding Coverage. Are query concepts covered by the chunks they should match?

Metric 4: Latency. How many chunks are retrieved per query, and how much data is transferred and embedded?
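Retrieval recall, the first metric above, is straightforward to compute once you have labeled query/chunk pairs. The sketch below assumes one relevant chunk per query and takes the retriever's ranked chunk IDs as input. The function name `recall_at_k` and the sample IDs are illustrative.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose relevant chunk ID appears in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1 for query, chunk_id in relevant.items()
        if chunk_id in results.get(query, [])[:k]
    )
    return hits / len(relevant)


# Ranked chunk IDs returned by the retriever for each query (made-up data).
results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c7", "c2"]}
# The chunk that actually answers each query.
relevant = {"q1": "c1", "q2": "c2"}

print(recall_at_k(results, relevant, k=2))  # 0.5: only q1's chunk is in the top 2
print(recall_at_k(results, relevant, k=3))  # 1.0: both chunks are in the top 3
```

Running this over the same query set for each candidate chunk size and overlap gives a direct, comparable number per configuration.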

Dynamic Chunking

Advanced systems adjust chunking based on content:

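def dynamic_chunk_size(content_type: str, complexity_score: float) -> int:
    """Return a target chunk size for the given content.

    Sizes are assumed to be in tokens; complexity_score is assumed to lie in [0, 1].
    """
    if content_type == "code":
        return 300  # small enough to keep a function or method together
    if content_type == "narrative":
        # Denser narrative gets more room for surrounding context.
        return 800 if complexity_score > 0.7 else 500
    return 600  # default for any other content type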

Modern Chunking Approaches

Sliding window with hierarchical indexing: Index at multiple levels simultaneously.

Semantic similarity-based chunking: Use embedding similarity to determine natural chunk boundaries.

Agent-based chunking: Use a model (for example, an LLM) to decide where to split based on content analysis.
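Of these, semantic similarity-based chunking is the easiest to sketch: embed each sentence, compare neighbors, and start a new chunk where similarity drops. The code below is a toy under stated assumptions. `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.2 threshold is arbitrary, and all names are hypothetical. In practice you would pass your own embedding function and tune the threshold on your data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_chunks(sentences: list[str], threshold: float = 0.2, embed=toy_embed) -> list[list[str]]:
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            chunks.append([sentence])    # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(sentence)  # still on topic: extend the current chunk
    return chunks


sentences = [
    "Chunk size affects retrieval recall.",
    "Smaller chunk size can improve recall.",
    "Tables need row and column context.",
    "Each table row keeps its column headers.",
]
for chunk in similarity_chunks(sentences):
    print(chunk)
```

With the sample sentences, the boundary falls between the second and third sentence, where the two share no words.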

Implementation Checklist

Measure baseline retrieval quality before changing your chunking strategy
Test 3-5 different chunk size/overlap combinations
Evaluate on representative queries
Document your chosen strategy and rationale
Plan for re-chunking if retrieval quality degrades
Monitor chunk coherence in production

Chunking is both art and science. Start simple, measure carefully, iterate based on real-world retrieval performance.