Semantic Chunking: Embedding-Driven Document Splitting for RAG

Discover semantic chunking — split documents by meaning not size, using embedding similarity to find natural topic boundaries for superior RAG retrieval.

Semantic Chunking: Letting Meaning Decide Where Sentences Belong

Most chunking strategies answer the question: where should I cut this text? Semantic chunking asks a different question: where does the meaning actually change?

Instead of counting characters or following punctuation rules, semantic chunking embeds sentences and groups consecutive sentences that are semantically similar into the same chunk. When similarity drops below a threshold, that’s where you make the cut.

The result: chunks where every sentence belongs because it’s topically related to its neighbors, not because it happened to fall in the same 500-token window.

The Core Algorithm

Step 1: Split document into sentences
    S1, S2, S3, S4, ... Sn
Step 2: Embed each sentence
    E1, E2, E3, E4, ... En
Step 3: Compute cosine similarity between consecutive sentences
    sim(S1,S2), sim(S2,S3), sim(S3,S4), ...
Step 4: Identify "breakpoints" (places where similarity drops sharply)
    ─────────────────────────────────────────────
    S1─S2: 0.91 │ Similar (same chunk)
    S2─S3: 0.88 │ Similar
    S3─S4: 0.42 ← BREAK (topic shift detected)
    S4─S5: 0.87 │ Similar (new chunk starts here)
    S5─S6: 0.85 │ Similar
    ─────────────────────────────────────────────
Step 5: Group sentences between breakpoints → chunks
    Chunk 1: S1 + S2 + S3
    Chunk 2: S4 + S5 + S6 + ...

The breakpoint threshold is typically set dynamically from the distribution of similarity scores. A common approach is to convert each similarity to a distance (1 − similarity) and flag any gap whose distance exceeds the 95th percentile of all distances in the document.
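The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not LangChain's implementation: the function name `semantic_chunks` is hypothetical, the toy 2-D vectors stand in for real sentence embeddings, and the threshold is applied to cosine distance (1 − similarity).

```python
import numpy as np

def semantic_chunks(sentences, embeddings, percentile=95):
    """Group consecutive sentences, cutting where cosine distance spikes."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Similarity between each sentence and the one that follows it.
    sims = np.sum(emb[:-1] * emb[1:], axis=1)
    distances = 1.0 - sims
    # Dynamic threshold: any gap above the chosen percentile is a breakpoint.
    threshold = np.percentile(distances, percentile)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]

    chunks, start = [], 0
    for i in breakpoints:
        chunks.append(" ".join(sentences[start:i + 1]))
        start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

# Toy data (assumed): three sentences on one topic, then three on another.
sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
embeddings = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1],
              [0.1, 1.0], [0.2, 1.0], [0.1, 0.9]]
chunks = semantic_chunks(sentences, embeddings)
print(chunks)  # ['S1. S2. S3.', 'S4. S5. S6.']
```

In a real pipeline the embeddings would come from a sentence-embedding model, and sentence splitting would need its own tokenizer; both are left out here to keep the grouping logic visible.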

Implementation with LangChain

LangChain’s SemanticChunker (introduced in 2024) implements this directly:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,  # cut where the gap exceeds the 95th percentile
    number_of_chunks=None,  # auto-determine chunk count
)
# document_text is the raw document string to split
chunks = chunker.split_text(document_text)

Three threshold strategies are covered here (recent releases also offer a gradient option):

| Strategy | How It Works | Best For |
| --- | --- | --- |
| percentile | Cut where the drop exceeds the Xth percentile of all drops | Long docs with clear topic shifts |
| standard_deviation | Cut when drop > mean + (N × std) | Uniform-density documents |
| interquartile | Cut at Q3 + 1.5×IQR of drops | Documents with outlier transitions |
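To make the three rules concrete, here is a small sketch that computes each cut-off from the same list of distances. It follows the formulas as written in the table; the helper name `breakpoint_threshold`, the sample distances, and the `amount` values are assumptions for illustration, and library implementations may differ in detail (for example, in which centre the IQR term is added to).

```python
import numpy as np

def breakpoint_threshold(distances, kind, amount):
    """Return the cut-off distance for one of three threshold strategies."""
    d = np.asarray(distances, dtype=float)
    if kind == "percentile":
        # amount is the percentile, e.g. 95
        return float(np.percentile(d, amount))
    if kind == "standard_deviation":
        # amount is N in mean + N * std
        return float(d.mean() + amount * d.std())
    if kind == "interquartile":
        # amount is the IQR multiplier, e.g. 1.5
        q1, q3 = np.percentile(d, [25, 75])
        return float(q3 + amount * (q3 - q1))
    raise ValueError(f"unknown strategy: {kind}")

# Assumed sample: mostly small gaps with one clear topic shift at index 5.
distances = [0.05, 0.06, 0.04, 0.07, 0.05, 0.60, 0.06, 0.05]
settings = {"percentile": 95, "standard_deviation": 2, "interquartile": 1.5}
cuts = {}
for kind, amount in settings.items():
    threshold = breakpoint_threshold(distances, kind, amount)
    cuts[kind] = [i for i, d in enumerate(distances) if d > threshold]
print(cuts)  # every strategy flags index 5 on this sample
```

On short documents the standard-deviation rule is the most sensitive to its multiplier: with only eight gaps, a single outlier can never sit three standard deviations above the mean, which is why the sketch uses 2.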

Why Chunk Quality Improves

The embedding for a fixed-size chunk that spans two topics will be a blended average of those topics. Neither topic will be well-represented. A semantic chunk, by contrast, contains sentences that all point in roughly the same semantic direction, producing a tighter, more focused embedding.

Fixed-size chunk (spans topic boundary):
    "The model uses attention to compute dependencies across tokens.
    Positional encodings let the model understand token order. ← Topic 1
    Now let's discuss training. The loss function used is cross-entropy.
    We train with the Adam optimizer at learning rate 3e-4." ← Topic 2
    → Embedding is a confused mix of "attention mechanisms" and "training procedure"

Semantic chunk (stays within topic):
    "The model uses attention to compute dependencies across tokens.
    Positional encodings let the model understand token order.
    Multi-head attention allows the model to attend to information
    from different representation subspaces simultaneously." ← Pure Topic 1
    → Embedding clearly captures "transformer attention mechanisms"
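A toy calculation shows why the blended embedding hurts retrieval. The sketch below assumes, as a simplification, that a chunk's embedding behaves like the average of its topic vectors, and uses two orthogonal 2-D vectors as stand-ins for the two topics; real embedding models are not literally averaging, so treat the numbers as illustrative only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

topic_attention = np.array([1.0, 0.0])  # stand-in for "attention mechanisms"
topic_training = np.array([0.0, 1.0])   # stand-in for "training procedure"

mixed_chunk = (topic_attention + topic_training) / 2  # spans both topics
focused_chunk = topic_attention                       # stays on one topic

query = topic_attention  # a query purely about attention
print(round(cosine(query, mixed_chunk), 3))    # 0.707
print(round(cosine(query, focused_chunk), 3))  # 1.0
```

Under these assumptions the mixed chunk scores noticeably lower against a single-topic query, so it is more likely to be outranked by unrelated but tightly focused chunks.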

The Cost: Embedding at Chunking Time

Here’s the honest trade-off: semantic chunking requires you to embed every sentence during ingestion. For a 10M-document corpus averaging 40 sentences per document, that’s 400M sentence embeddings before you even build your vector index.

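For OpenAI’s text-embedding-3-small (as of 2025 pricing), the bill scales linearly with token count. Assuming roughly 25 tokens per sentence, 400M sentences is about 10B tokens, which is on the order of $200 per pass at the list price of $0.02 per 1M tokens; larger models, buffered sentence windows, and repeated re-ingestion can push that into the thousands. Options to mitigate: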

  1. Use a fast local model for chunking (e.g., sentence-transformers/all-MiniLM-L6-v2) and a higher-quality model for the actual chunk embeddings that go into your index.
  2. Batch aggressively: OpenAI’s embeddings endpoint, for example, accepts up to 2,048 inputs per request, dramatically reducing API call overhead.
  3. Cache sentence embeddings: if you’re re-ingesting updated documents, many sentences remain unchanged. A content hash cache avoids redundant embedding work.
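Options 2 and 3 combine naturally. The sketch below is one possible shape for a content-hash cache with batching, not a definitive implementation: `embed_with_cache` and `fake_embed_batch` are hypothetical names, the cache is a plain dict, and the `embed_batch` callable stands in for whatever embedding client you use.

```python
import hashlib

def embed_with_cache(sentences, embed_batch, cache, batch_size=2048):
    """Embed sentences, reusing cached vectors keyed by a content hash."""
    keys = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences]
    # Collect each unseen sentence once, preserving first-seen order.
    missing = {}
    for key, sentence in zip(keys, sentences):
        if key not in cache and key not in missing:
            missing[key] = sentence
    pending = list(missing.items())
    # Send the cache misses in large batches to cut per-request overhead.
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector
    return [cache[key] for key in keys]

# Fake embedder (assumption) that records how many texts each call receives.
calls = []
def fake_embed_batch(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache = {}
embed_with_cache(["a", "bb", "a"], fake_embed_batch, cache)
second = embed_with_cache(["a", "bb", "ccc"], fake_embed_batch, cache)
print(calls)  # [2, 1]
```

The second call embeds only the one new sentence. In production the dict would be replaced by a persistent store, and the key should also include the model name so that switching models invalidates old vectors.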

Minimum Chunk Size Enforcement

Semantic splitting can produce very small chunks — sometimes a single sentence — when topic transitions are frequent. These degrade retrieval quality for the opposite reason from large chunks: they lack enough context for the embedding to be discriminative.

Add a minimum chunk size constraint:

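def token_count(text):
    # Rough proxy: whitespace-separated words. Swap in a real tokenizer if available.
    return len(text.split())

def merge_small_chunks(chunks, min_tokens=100):
    """Merge consecutive chunks until each one reaches min_tokens."""
    merged = []
    buffer = ""
    for chunk in chunks:
        # Keep accumulating until the buffer is large enough to stand alone.
        buffer = f"{buffer} {chunk}".strip()
        if token_count(buffer) >= min_tokens:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Leftover tail is still too small: fold it into the previous chunk.
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged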

A minimum of 100–150 tokens works well in practice.

2025 Trend: Late Chunking

Late chunking is an emerging technique that inverts the normal order of operations. Instead of chunking text first and then embedding each chunk independently, you embed the full document (or a long passage) with a long-context model first, then split the resulting token embeddings along chunk boundaries and pool each group into one chunk vector.

Traditional:    Text → Chunks → Embed
Late Chunking:  Text → Embed (full) → Chunk embeddings
                       ↑ Token embeddings retain full document context

This means every token embedding has been informed by the full document context before chunking happens. Early experiments (ColBERT-based systems, Jina AI’s late chunking work) report improvements on multi-hop queries where context matters across chunk boundaries.
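The pooling step can be sketched as follows. This assumes you already have contextualized token embeddings from a long-context model and the token offsets of each chunk; random vectors stand in for the model output, `late_chunk` is a hypothetical helper, and mean pooling is one common choice rather than the only one.

```python
import numpy as np

def late_chunk(token_embeddings, spans):
    """Mean-pool contextualized token embeddings over each chunk's token span."""
    tokens = np.asarray(token_embeddings, dtype=float)
    return np.stack([tokens[start:end].mean(axis=0) for start, end in spans])

# Stand-in for the output of a long-context model: 6 tokens, 4 dimensions.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(6, 4))
spans = [(0, 3), (3, 6)]  # (start, end) token offsets, end exclusive
chunk_embeddings = late_chunk(token_embeddings, spans)
print(chunk_embeddings.shape)  # (2, 4)
```

The chunk boundaries themselves can still come from any splitter, including the semantic one described earlier; late chunking changes only when the embedding happens.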

Semantic Chunking vs Recursive Chunking: Which to Choose

| Factor | Semantic Chunking | Recursive Chunking |
| --- | --- | --- |
| Retrieval quality | Higher | Good |
| Ingestion time | 5–20× slower | Moderate |
| Ingestion cost | High (embedding step) | Low |
| Variable chunk sizes | Yes (natural) | Somewhat |
| Works on any content | Yes | Yes |
| 2025 adoption | Growing | Widely adopted |

For most teams: start with recursive chunking, then switch to semantic chunking for content types where you can demonstrate that the retrieval improvement justifies the cost.

Practical Takeaway

Semantic chunking is not a silver bullet — it’s a precision instrument. Use it when:

  • Retrieval quality is the primary bottleneck in your RAG system
  • Documents are long-form with natural topic transitions
  • Your ingestion pipeline runs offline (cost and time are less urgent)
  • You’ve already squeezed the gains available from recursive chunking

Avoid it when you’re still iterating on your overall architecture. The increased complexity and cost don’t pay off until the rest of your pipeline is solid.