Semantic Chunking: Letting Meaning Decide Where Sentences Belong
Most chunking strategies answer the question: where should I cut this text? Semantic chunking asks a different question: where does the meaning actually change?
Instead of counting characters or following punctuation rules, semantic chunking embeds sentences and groups consecutive sentences that are semantically similar into the same chunk. When similarity drops below a threshold, that’s where you make the cut.
The result: chunks where every sentence belongs because it’s topically related to its neighbors, not because it happened to fall in the same 500-token window.
The Core Algorithm
    Step 1: Split the document into sentences
            S1, S2, S3, S4, ... Sn

    Step 2: Embed each sentence
            E1, E2, E3, E4, ... En

    Step 3: Compute cosine similarity between consecutive sentences
            sim(S1,S2), sim(S2,S3), sim(S3,S4), ...

    Step 4: Identify "breakpoints": places where similarity drops sharply
            ─────────────────────────────────────────────
            S1─S2: 0.91  │ Similar (same chunk)
            S2─S3: 0.88  │ Similar
            S3─S4: 0.42  ← BREAK (topic shift detected)
            S4─S5: 0.87  │ Similar (new chunk starts here)
            S5─S6: 0.85  │ Similar
            ─────────────────────────────────────────────

    Step 5: Group sentences between breakpoints → chunks
            Chunk 1: S1 + S2 + S3
            Chunk 2: S4 + S5 + S6 + ...

The breakpoint threshold is typically set dynamically from the distribution of similarity scores. A common approach is to flag any gap that exceeds the 95th percentile of score drops in the document.
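The five steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes the sentences are already split and embedded, treats a "drop" as the cosine distance (1 − similarity) between neighbors, and uses hand-made 2-D vectors in place of real embeddings. The names `semantic_chunks` and `percentile` are made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def semantic_chunks(sentences, embeddings, pct=95):
    """Group consecutive sentences, cutting where the neighbor distance is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)]
    # Step 3: one distance per consecutive pair (1 - cosine similarity).
    distances = [
        1 - cosine(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    # Step 4: the threshold comes from this document's own distribution.
    threshold = percentile(distances, pct)
    # Step 5: start a new chunk after every pair whose distance exceeds it.
    chunks, current = [], [sentences[0]]
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks

# Toy input: three sentences pointing one way, three pointing another.
sentences = ["S1", "S2", "S3", "S4", "S5", "S6"]
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.1, 0.9], [0.05, 0.95], [0.0, 1.0],
]
chunks = semantic_chunks(sentences, embeddings)
print(chunks)
```

On this toy input the only large distance sits between S3 and S4, so the sketch cuts there and nowhere else.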
Implementation with LangChain
LangChain’s SemanticChunker (introduced in 2024) implements this directly:
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai.embeddings import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    chunker = SemanticChunker(
        embeddings,
        breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
        breakpoint_threshold_amount=95,          # 95th percentile of similarity drops
        number_of_chunks=None,                   # auto-determine chunk count
    )

    chunks = chunker.split_text(document_text)

Three threshold strategies are available:
| Strategy | How It Works | Best For |
|---|---|---|
| percentile | Cut at the top X% largest drops | Long docs with clear topic shifts |
| standard_deviation | Cut when drop > mean + (N × std) | Uniform-density documents |
| interquartile | Cut at Q3 + 1.5×IQR of drops | Documents with outlier transitions |
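To make the three rules concrete, here is a small stdlib-only sketch that computes each threshold from a list of drops using the formulas in the table. It follows the table as written and is not LangChain's code; the function name `breakpoint_threshold` and its `amount` argument are assumptions made for this example.

```python
import math
import statistics

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def breakpoint_threshold(drops, strategy, amount):
    """Return the cut-off above which a drop counts as a breakpoint.

    `amount` means X (a percentile) for "percentile", N for
    "standard_deviation", and the IQR multiplier for "interquartile".
    """
    if strategy == "percentile":
        return percentile(drops, amount)
    if strategy == "standard_deviation":
        return statistics.mean(drops) + amount * statistics.pstdev(drops)
    if strategy == "interquartile":
        q1, q3 = percentile(drops, 25), percentile(drops, 75)
        return q3 + amount * (q3 - q1)
    raise ValueError(f"unknown strategy: {strategy}")

# Placeholder drops, chosen only to keep the arithmetic easy to follow.
drops = [0.1, 0.2, 0.3, 0.4, 0.5]
for strategy, amount in [("percentile", 95), ("standard_deviation", 1), ("interquartile", 1.5)]:
    print(strategy, round(breakpoint_threshold(drops, strategy, amount), 3))
```

Running all three on the same list of drops from one of your own documents is a quick way to see how many cuts each rule would make before committing to one.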
Why Chunk Quality Improves
The embedding for a fixed-size chunk that spans two topics will be a blended average of those topics. Neither topic will be well-represented. A semantic chunk, by contrast, contains sentences that all point in roughly the same semantic direction, producing a tighter, more focused embedding.
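A toy calculation shows the blending effect. It assumes, purely for illustration, that a chunk's embedding behaves like the average of its sentence vectors; real embedding models do not average this literally, and the 2-D vectors below are invented stand-ins for two topics.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Invented topic directions: axis 0 = "attention", axis 1 = "training".
attention_sentence = [1.0, 0.0]
training_sentence = [0.0, 1.0]

# A fixed-size chunk that straddles both topics.
mixed_chunk = mean_vector([attention_sentence, training_sentence])
# A semantic chunk whose sentences are all about attention.
focused_chunk = mean_vector([[1.0, 0.0], [0.9, 0.1]])

query = [1.0, 0.0]  # a query about attention
print(round(cosine(query, mixed_chunk), 3))    # about 0.707
print(round(cosine(query, focused_chunk), 3))  # close to 1.0
```

The mixed chunk scores noticeably lower against the on-topic query even though half of it is relevant, which is the dilution described above.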
    Fixed-size chunk (spans topic boundary):
      "The model uses attention to compute dependencies across tokens.
       Positional encodings let the model understand token order.          ← Topic 1
       Now let's discuss training. The loss function used is cross-entropy.
       We train with the Adam optimizer at learning rate 3e-4."            ← Topic 2

      → Embedding is a confused mix of "attention mechanisms" and "training procedure"

    Semantic chunk (stays within topic):
      "The model uses attention to compute dependencies across tokens.
       Positional encodings let the model understand token order.
       Multi-head attention allows the model to attend to information
       from different representation subspaces simultaneously."           ← Pure Topic 1

      → Embedding clearly captures "transformer attention mechanisms"

The Cost: Embedding at Chunking Time
Here’s the honest trade-off: semantic chunking requires you to embed every sentence during ingestion. For a 10M-document corpus averaging 40 sentences per document, that’s 400M sentence embeddings before you even build your vector index.
For OpenAI’s text-embedding-3-small (as of 2025 pricing), this could run into thousands of dollars at scale. Options to mitigate:
- Use a fast local model for chunking (e.g., sentence-transformers/all-MiniLM-L6-v2) and a higher-quality model for the actual chunk embeddings that go into your index.
- Batch aggressively: many embedding APIs accept large batches per request (OpenAI’s embeddings endpoint takes up to 2,048 inputs), which dramatically reduces API call overhead.
- Cache sentence embeddings: if you’re re-ingesting updated documents, many sentences remain unchanged. A content hash cache avoids redundant embedding work.
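The batching and caching ideas combine naturally. Below is a rough sketch, assuming an in-memory dict as the cache and a caller-supplied `embed_batch` function that maps a list of texts to a list of vectors. Both are placeholders: in practice the cache would live in a persistent store and `embed_batch` would wrap your embedding client. `fake_embed_batch` exists only so the example runs without a network call.

```python
import hashlib

def embed_with_cache(sentences, embed_batch, cache, batch_size=256):
    """Return one embedding per sentence, calling embed_batch only for unseen text."""
    keys = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences]

    # Collect sentences that are not cached yet, once each, in input order.
    missing = {}
    for key, sentence in zip(keys, sentences):
        if key not in cache and key not in missing:
            missing[key] = sentence

    # Send the misses in batches to keep the number of requests low.
    pending = list(missing.items())
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]

# Stand-in embedder that records how many texts each call received.
calls = []
def fake_embed_batch(texts):
    calls.append(len(texts))
    return [[float(len(text))] for text in texts]

cache = {}
first = embed_with_cache(["a", "bb", "a"], fake_embed_batch, cache)
second = embed_with_cache(["bb", "ccc"], fake_embed_batch, cache)
print(calls)  # the second run only embeds the sentence it has not seen
```

Keying on a hash of the sentence text means an edited sentence is simply a cache miss, so no invalidation logic is needed.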
Minimum Chunk Size Enforcement
Semantic splitting can produce very small chunks — sometimes a single sentence — when topic transitions are frequent. These degrade retrieval quality for the opposite reason from large chunks: they lack enough context for the embedding to be discriminative.
Add a minimum chunk size constraint:
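    def token_count(text):
        # Rough whitespace proxy; swap in your embedding model's tokenizer.
        return len(text.split())

    def merge_small_chunks(chunks, min_tokens=100):
        merged = []
        buffer = ""
        for chunk in chunks:
            # Keep accumulating until the buffer reaches the minimum size.
            buffer = f"{buffer} {chunk}".strip()
            if token_count(buffer) >= min_tokens:
                merged.append(buffer)
                buffer = ""
        if buffer:
            # Fold an undersized tail into the previous chunk if there is one.
            if merged:
                merged[-1] = f"{merged[-1]} {buffer}"
            else:
                merged.append(buffer)
        return merged

A minimum of 100–150 tokens works well in practice.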
2025 Trend: Late Chunking
Late chunking is an emerging technique that inverts the normal order of operations. Instead of chunking text first and then embedding each chunk independently, you embed the full document (or a long passage) with a long-context model first, then chunk the resulting token embeddings.
    Traditional:     Text → Chunks → Embed
    Late chunking:   Text → Embed (full) → Chunk embeddings
                              ↑
                     Token embeddings retain full document context

This means every token embedding has been informed by the full document context before chunking happens. Early experiments (ColBERT-based systems, Jina AI’s late chunking work) show improvement on multi-hop queries where context matters across chunk boundaries.
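The "chunk the embeddings" step can be sketched as pooling over token spans. This example assumes mean pooling and assumes you already have per-token vectors from a long-context model plus the token index range of each chunk. The tiny hand-written vectors and the `late_chunk` name are placeholders for illustration.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool contextual token embeddings over each (start, end) token span.

    `end` is exclusive, matching Python slice conventions.
    """
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        pooled.append([sum(column) / len(window) for column in zip(*window)])
    return pooled

# Placeholder output of a long-context model: one 2-D vector per token.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
# Chunk boundaries expressed as token index ranges, not character offsets.
spans = [(0, 2), (2, 4)]

chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)
```

The boundaries themselves can come from any chunker, including the semantic one above; late chunking only changes when the embedding happens.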
Semantic Chunking vs Recursive Chunking: Which to Choose
| Factor | Semantic Chunking | Recursive Chunking |
|---|---|---|
| Retrieval quality | Higher | Good |
| Ingestion time | 5–20× slower | Moderate |
| Ingestion cost | High (embedding step) | Low |
| Variable chunk sizes | Yes (natural) | Somewhat |
| Works on any content | Yes | Yes |
| 2025 adoption | Growing | Widely adopted |
For most teams: start with recursive chunking, then switch to semantic chunking for the content types where you can demonstrate that the retrieval improvement justifies the cost.
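One way to "demonstrate" the improvement is to score both strategies on the same labeled queries. The sketch below assumes a hypothetical evaluation set in which each query is labeled with the source passages that answer it, and each retrieved chunk is mapped back to the passage it came from. The ids and rankings are placeholder values, not measurements.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k results include at least one relevant passage."""
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant.get(query_id, set()):
            hits += 1
    return hits / len(retrieved)

# Placeholder labels: which source passage answers each query.
relevant = {"q1": {"p3"}, "q2": {"p7"}}

# Placeholder rankings from two pipelines that differ only in chunking.
recursive_run = {"q1": ["p1", "p3"], "q2": ["p2", "p4"]}
semantic_run = {"q1": ["p3", "p1"], "q2": ["p7", "p2"]}

print("recursive:", hit_rate_at_k(recursive_run, relevant, k=2))
print("semantic: ", hit_rate_at_k(semantic_run, relevant, k=2))
```

Running this per content type, rather than over the whole corpus, shows where the extra ingestion cost actually buys something.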
Practical Takeaway
Semantic chunking is not a silver bullet — it’s a precision instrument. Use it when:
- Retrieval quality is the primary bottleneck in your RAG system
- Documents are long-form with natural topic transitions
- Your ingestion pipeline runs offline (cost and time are less urgent)
- You’ve already squeezed the gains available from recursive chunking
Avoid it when you’re still iterating on your overall architecture. The increased complexity and cost don’t pay off until the rest of your pipeline is solid.