Document Chunking for RAG: Dividing Content for Optimal Retrieval
You’ve ingested your documents, cleaned the text, and extracted metadata. Now comes a critical decision: how should you divide documents into chunks for embedding and retrieval?
Chunk poorly and your retriever will either miss relevant information or return irrelevant results. Get it right and your RAG system retrieves exactly what it needs.
Why Chunking Matters
The core problem: Your embedding model has context limits (typically 512-8192 tokens). Entire documents often exceed this. Additionally, documents contain multiple ideas. Embedding the entire document loses granularity—when you search, you retrieve the whole document even if only a small section matches the query.
The solution: Break documents into chunks small enough to embed, large enough to contain sufficient context.
Fixed-Size Chunking
The simplest approach: split documents into chunks of fixed character or token length with optional overlap.
Example: 500-character chunks with 100-character overlap
Document: "Long text spanning many topics..."
Chunk 1: [0-500]
Chunk 2: [400-900] (overlaps with Chunk 1)
Chunk 3: [800-1300] (overlaps with Chunk 2)
...
Advantages:
- Simple to implement
- Predictable compute costs
- Works with any content type
- Deterministic and reproducible
Disadvantages:
- Chunks may split mid-sentence or mid-idea
- Same size works poorly across document types
- Overlaps increase storage and compute
When to use: Quick prototyping, uniform content, or cases where computational simplicity matters more than retrieval quality.
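A minimal sketch of fixed-size chunking by character count, using the 500/100 figures from the example above. The function name `fixed_size_chunks` and its parameters are illustrative, not taken from any library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-length character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

On a 1,300-character document this yields three windows starting at offsets 0, 400, and 800, matching the layout shown in the example. Counting tokens instead of characters would need a tokenizer in place of plain string slicing.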
Semantic Chunking
Rather than arbitrary size boundaries, semantic chunking respects document structure and meaning.
Approach 1: Sentence-based chunking
Document split by sentences, then grouped:
- Chunk 1: Sentences 1-5 (semantically related)
- Chunk 2: Sentences 6-12 (different topic)
- Chunk 3: Sentences 13-18 (another topic)
Approach 2: Paragraph-based chunking
Use existing paragraph breaks as natural boundaries, then group paragraphs into chunks.
Approach 3: Structure-aware chunking
Respect the document hierarchy: keep sections, subsections, and lists intact within chunks.
Advantages:
- Respects semantic boundaries
- More coherent chunks
- Better for structured documents
Disadvantages:
- Chunk sizes vary widely
- Content-dependent, less predictable
- Requires parsing document structure
When to use: Structured documents (articles, reports, documentation).
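As one illustration of the paragraph-based approach, the sketch below groups consecutive paragraphs until a size cap would be exceeded. It assumes paragraphs are separated by blank lines; the name `paragraph_chunks` and the `max_chars` cap are hypothetical choices for this example.

```python
def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))  # close the chunk at a paragraph boundary
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than `max_chars` is kept whole here, which shows why chunk sizes vary widely with this family of methods.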
Recursive Chunking with Overlap
LangChain popularized recursive chunking: split by progressively smaller delimiters until chunks reach target size.
Process:
Level 1: Split by "\n\n" (paragraph breaks)
Level 2: Split by "\n" (line breaks)
Level 3: Split by "." (sentences)
Level 4: Split by " " (words)
Stop when chunks are within size limits.
Advantages:
- Preserves semantic structure when possible
- Falls back to smaller delimiters as needed
- Good balance of quality and consistency
Disadvantages:
- Depends on the content containing usable delimiters
- Requires tuning the delimiter list for your content
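The fallback logic in the process above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it has no overlap, and when it merges small pieces back together it re-inserts separators only approximately. The name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, max_chars: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split with the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))
    # Merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraphs that already fit are never broken up; only oversized ones are passed down to the finer separators.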
Implementation:
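```python
# Older LangChain releases expose the splitter here; newer ones ship it in
# the separate langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # target maximum chunk length, in characters
    chunk_overlap=50,    # characters shared between neighbouring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(document)  # document: the text to split (a str)
```
Specialized Chunking Strategies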
Code Chunking
Programming code has its own structure (functions, classes, blocks).
Approach: Respect code structure, keep related code together, preserve context (imports, type definitions).
Tools: Tree-sitter for parsing language-specific structure.
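For Python source specifically, the standard-library `ast` module can stand in for Tree-sitter to show the idea. The sketch below emits one chunk per top-level function or class and prefixes each with the module's imports to preserve context. The name `python_code_chunks` is illustrative, and other module-level statements are ignored in this simplified version.

```python
import ast

def python_code_chunks(source: str) -> list[str]:
    """Return one chunk per top-level function or class, prefixed with the module's imports."""
    tree = ast.parse(source)
    lines = source.splitlines()

    def segment(node: ast.AST) -> str:
        # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
        start = node.lineno
        # decorators sit above the def/class line, so include them
        for dec in getattr(node, "decorator_list", []):
            start = min(start, dec.lineno)
        return "\n".join(lines[start - 1:node.end_lineno])

    imports = [segment(n) for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = segment(node)
            chunks.append(f"{header}\n\n{body}" if header else body)
    return chunks
```

Because each function or class is cut at its syntactic boundary, no definition is split across two chunks, at the cost of uneven chunk sizes.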
Table Chunking
Tables have rows and columns, not sentence structure.
Options:
- Convert to prose descriptions
- Keep table intact with row/column context
- Index each row separately with table context
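The third option, indexing each row with its table context, might look like the sketch below. The function name, the output format, and the sample table used to exercise it are all hypothetical.

```python
def table_row_chunks(title: str, headers: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into a standalone text chunk that repeats the table context."""
    chunks = []
    for row in rows:
        # Pair every cell with its column header so the row is readable on its own.
        cells = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(f"Table: {title}. {cells}")
    return chunks
```

Repeating the title and column names in every chunk costs some storage, but it lets a single retrieved row be understood without the rest of the table.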
Long-Form Content
Books, manuals, research papers span thousands of tokens.
Strategy: Multi-level chunking:
- Level 1: Chapter or major section
- Level 2: Subsection within chapter
- Level 3: Detailed paragraph block
During retrieval, retrieve higher-level chunks first, then optionally refine with detailed chunks.
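One way to represent the three levels is to store every chunk with a pointer to its parent, so a retrieved chapter or subsection can be refined into its children. This is a sketch under assumptions: the `Chunk` record, the id scheme, and the nested-dict input shape are invented for illustration, and the higher-level chunks hold only titles here where a real system might store summaries or full text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    chunk_id: str
    level: int            # 1 = chapter, 2 = subsection, 3 = paragraph block
    text: str
    parent_id: Optional[str] = None

def build_hierarchy(book: dict[str, dict[str, list[str]]]) -> list[Chunk]:
    """Flatten {chapter: {subsection: [paragraph blocks]}} into linked chunks."""
    chunks = []
    for c, (chapter, subsections) in enumerate(book.items()):
        chapter_id = f"c{c}"
        chunks.append(Chunk(chapter_id, 1, chapter))
        for s, (subsection, blocks) in enumerate(subsections.items()):
            sub_id = f"{chapter_id}.s{s}"
            chunks.append(Chunk(sub_id, 2, subsection, parent_id=chapter_id))
            for b, block in enumerate(blocks):
                chunks.append(Chunk(f"{sub_id}.p{b}", 3, block, parent_id=sub_id))
    return chunks

def children_of(chunks: list[Chunk], parent_id: str) -> list[Chunk]:
    """Refinement step: fetch the more detailed chunks under a retrieved parent."""
    return [c for c in chunks if c.parent_id == parent_id]
```

After a higher-level chunk is retrieved, `children_of` returns the detailed chunks beneath it for the optional refinement pass.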
Choosing Chunk Size
Factors to consider:
1. Embedding model context: What’s your model’s maximum input length?
2. Retrieval query length: Short queries need small chunks. Complex queries benefit from larger chunks providing more context.
3. Your domain:
- Technical docs: Larger chunks (500-1000 tokens) preserve examples and explanation
- News/articles: Medium chunks (300-500 tokens) align with natural paragraphs
- Code: Smaller chunks (200-400 tokens) keep functions/methods intact
4. Downstream usage: What does your LLM generator expect?
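- If you concatenate 5 chunks: each should be at most ~600 tokens for a 4K-context model (5 × 600 = 3,000 tokens, leaving roughly 1,000 for the prompt and the answer)
- If you concatenate 10 chunks: each should be at most ~300 tokens under the same budget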
5. Overlap trade-offs:
- No overlap: efficient but risks missing context at boundaries
- 10-20% overlap: sweet spot for most cases
- High overlap (50%+): safety but costly
General starting point: 500-token chunks with 10-15% overlap. Adjust based on retrieval quality.
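The downstream budget in factor 4 comes down to simple arithmetic: subtract what the prompt and the answer need from the context window, then divide the remainder by the number of chunks. The sketch below assumes 500 tokens reserved for each; those reserves and the function name are placeholder assumptions to adjust for your own prompts.

```python
def max_chunk_tokens(context_window: int, num_chunks: int,
                     prompt_tokens: int = 500, answer_tokens: int = 500) -> int:
    """Largest per-chunk size that lets num_chunks fit alongside the prompt and the answer."""
    available = context_window - prompt_tokens - answer_tokens
    if available <= 0 or num_chunks <= 0:
        raise ValueError("no room left for retrieved chunks")
    return available // num_chunks
```

With a 4,096-token window and these reserves, five chunks leave 619 tokens each and ten chunks leave 309 each.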
Chunking Mistakes to Avoid
Chunks too small (< 100 tokens): Insufficient context, poor embedding quality, an excessive number of chunks to index.
Chunks too large (> 2000 tokens): Exceed the input limit of many embedding models (some accept as few as 512 tokens), dilute information density, slow retrieval.
Ignoring domain structure: Chapters split mid-thought, code functions split across chunks.
Fixed size for all content: A single chunk size rarely works well across blogs, API documentation, and code.
Ignoring retrieval metrics: Choosing chunk size theoretically rather than empirically measuring retrieval accuracy.
Measuring Chunking Quality
Metric 1: Retrieval Recall. For known Q&A pairs, does the relevant chunk get retrieved in the top K results?
Metric 2: Chunk Coherence. Do chunks stand alone, or do they reference context they don’t contain?
Metric 3: Embedding Coverage. Are query concepts covered by the chunks they should match?
Metric 4: Latency. How many chunks are retrieved? How much data is transferred and embedded?
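Retrieval recall is the easiest of these to automate. A minimal sketch, assuming each test query has exactly one known relevant chunk id and that your retriever's ranked results have already been collected into a dict; both data shapes and the name `recall_at_k` are assumptions for illustration.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose known relevant chunk appears in the top-k results."""
    if not relevant:
        raise ValueError("need at least one labelled query")
    hits = sum(
        1 for query, chunk_id in relevant.items()
        if chunk_id in retrieved.get(query, [])[:k]  # ranked ids, best first
    )
    return hits / len(relevant)
```

Running the same labelled queries against each chunking configuration gives a directly comparable number for choosing between them.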
Dynamic Chunking
Advanced systems adjust chunking based on content:
```python
def dynamic_chunk_size(content_type, complexity_score):
    """Pick a target chunk size from the content type and a complexity score."""
    if content_type == "code":
        return 300
    elif content_type == "narrative":
        if complexity_score > 0.7:
            return 800
        else:
            return 500
    else:
        return 600
```
Modern Chunking Approaches
Sliding window with hierarchical indexing: Index at multiple levels simultaneously.
Semantic similarity-based chunking: Use embedding similarity to determine natural chunk boundaries.
Agent-based chunking: Train a model to decide where to split based on content analysis.
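Of these, semantic similarity-based chunking is the simplest to sketch: compare each sentence with the one before it and start a new chunk when similarity drops below a threshold. The example below uses a bag-of-words count vector as a stand-in for a real embedding model so that it runs without dependencies; `toy_embed`, the 0.2 threshold, and the function names are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similarity_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks
```

With a real embedding model, `toy_embed` and `cosine` would be replaced by dense vectors and their similarity, and the threshold would need tuning against your own content.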
Implementation Checklist
- Measure baseline retrieval quality before changing your chunking strategy
- Test 3-5 different chunk size/overlap combinations
- Evaluate on representative queries
- Document your chosen strategy and rationale
- Plan for re-chunking if retrieval quality degrades
- Monitor chunk coherence in production
Chunking is both art and science. Start simple, measure carefully, iterate based on real-world retrieval performance.