Semantic Chunking: Letting Meaning Decide Where Sentences Belong

Most chunking strategies answer the question: where should I cut this text? Semantic chunking asks a different question: where does the meaning actually change?

Instead of counting characters or following punctuation rules, semantic chunking embeds sentences and groups consecutive sentences that are semantically similar into the same chunk. When similarity drops below a threshold, that’s where you make the cut.

The result: chunks where every sentence belongs because it’s topically related to its neighbors, not because it happened to fall in the same 500-token window.

The Core Algorithm

Step 1: Split document into sentences
  S1, S2, S3, S4, ... Sn

Step 2: Embed each sentence
  E1, E2, E3, E4, ... En

Step 3: Compute cosine similarity between consecutive sentences
  sim(S1,S2), sim(S2,S3), sim(S3,S4), ...

Step 4: Identify "breakpoints" — places where similarity drops sharply
  ─────────────────────────────────────────────
  S1─S2: 0.91  │ Similar (same chunk)
  S2─S3: 0.88  │ Similar
  S3─S4: 0.42  ← BREAK (topic shift detected)
  S4─S5: 0.87  │ Similar (new chunk starts here)
  S5─S6: 0.85  │ Similar
  ─────────────────────────────────────────────

Step 5: Group sentences between breakpoints → chunks
  Chunk 1: S1 + S2 + S3
  Chunk 2: S4 + S5 + S6 + ...
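
Steps 3 to 5 can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: it assumes the sentences are already split and embedded, it uses a fixed similarity threshold of 0.5 chosen purely for the demo, and the function names (`cosine_similarity`, `group_by_similarity`) and the toy 2-D vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity < threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # breakpoint: topic shift detected
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]


# Toy embeddings: S1-S3 point one way, S4-S5 another.
sentences = ["S1", "S2", "S3", "S4", "S5"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(group_by_similarity(sentences, embeddings))  # ['S1 S2 S3', 'S4 S5']
```

In a real pipeline the vectors would come from an embedding model, and the threshold would be derived from the document rather than hard-coded.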

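The breakpoint threshold is typically set dynamically from the distribution of scores in the document itself. A common approach is to convert each similarity to a distance (1 − similarity) and cut wherever the distance exceeds the 95th percentile of all consecutive-sentence distances, which flags roughly the largest 5% of drops.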
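
As a rough sketch of the percentile rule, the snippet below applies it to the similarity scores from the diagram above. The helper names (`percentile`, `find_breakpoints`) are hypothetical, and the percentile uses simple linear interpolation; a library such as NumPy would normally do this part.

```python
def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in 0..100)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_breakpoints(similarities, q=95):
    """Return indices i where the gap between sentence i and i+1 is a breakpoint."""
    distances = [1 - s for s in similarities]
    cutoff = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Scores from the diagram: S1-S2, S2-S3, S3-S4, S4-S5, S5-S6
similarities = [0.91, 0.88, 0.42, 0.87, 0.85]
print(find_breakpoints(similarities))  # [2] -> cut between S3 and S4
```

Because the cutoff is relative to the document, a text with uniformly high similarity still gets cut at its largest drops, while a noisy text is not shredded into single sentences.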

Implementation with LangChain

LangChain’s SemanticChunker (introduced in 2024) implements this directly:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,          # 95th percentile of similarity drops
    number_of_chunks=None,                   # auto-determine chunk count
)

# document_text is the raw document as a single string; the result is a list of chunk strings
chunks = chunker.split_text(document_text)

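The three threshold strategies named in the code above work as follows (recent releases also accept a "gradient" option). In each case the "drop" is the cosine distance (1 − similarity) between consecutive sentences:

Strategy	How It Works	Best For
`percentile`	Cut when drop > Xth percentile of all drops (X = 95 cuts at the largest 5%)	Long docs with clear topic shifts
`standard_deviation`	Cut when drop > mean + (N × std)	Uniform-density documents
`interquartile`	Cut when drop > mean + (N × IQR), with N = 1.5 by default	Documents with outlier transitions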
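
To make the difference between the strategies concrete, here is a small sketch that computes all three cutoffs for the same set of drops, using only the standard library. The function name `breakpoint_thresholds` is hypothetical, and the multipliers (95, 3, 1.5) are assumed defaults. The interquartile rule is written as mean + N × IQR, which is how LangChain's implementation appears to compute it; the textbook outlier fence uses Q3 instead of the mean, so check the version you have installed.

```python
import statistics


def breakpoint_thresholds(distances, pct=95, n_std=3.0, n_iqr=1.5):
    """Compute the cutoff distance under each strategy (needs >= 2 distances)."""
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)  # population std, like numpy.std
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    percentiles = statistics.quantiles(distances, n=100, method="inclusive")
    return {
        "percentile": percentiles[pct - 1],            # pct must be 1..99
        "standard_deviation": mean + n_std * std,
        "interquartile": mean + n_iqr * (q3 - q1),     # assumption: mean-based variant
    }


# Distances (1 - similarity) for the five gaps in the earlier diagram
distances = [0.09, 0.12, 0.58, 0.13, 0.15]
cutoffs = breakpoint_thresholds(distances)
print(round(cutoffs["percentile"], 3))          # 0.494
print(round(cutoffs["standard_deviation"], 3))  # 0.766
print(round(cutoffs["interquartile"], 3))       # 0.259
```

On this tiny sample the 0.58 drop clears the percentile and interquartile cutoffs but not the three-sigma cutoff, which shows how much the choice of strategy (and multiplier) can change where cuts land.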

Why Chunk Quality Improves

The embedding for a fixed-size chunk that spans two topics will be a blended average of those topics. Neither topic will be well-represented. A semantic chunk, by contrast, contains sentences that all point in roughly the same semantic direction, producing a tighter, more focused embedding.

Fixed-size chunk (spans topic boundary):
  "The model uses attention to compute dependencies across tokens.
   Positional encodings let the model understand token order.   ← Topic 1
   Now let's discuss training. The loss function used is cross-entropy.
   We train with the Adam optimizer at learning rate 3e-4."     ← Topic 2

→ Embedding is a confused mix of "attention mechanisms" and "training procedure"

Semantic chunk (stays within topic):
  "The model uses attention to compute dependencies across tokens.
   Positional encodings let the model understand token order.
   Multi-head attention allows the model to attend to information
   from different representation subspaces simultaneously."      ← Pure Topic 1

→ Embedding clearly captures "transformer attention mechanisms"
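
A toy calculation shows the blending effect. It idealizes heavily: real chunk embeddings are not literal averages of sentence vectors, and the two orthogonal 2-D "topic" vectors here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


attention = [1.0, 0.0]  # stand-in for "attention mechanisms"
training = [0.0, 1.0]   # stand-in for "training procedure"

mixed_chunk = mean_vector([attention, attention, training, training])
focused_chunk = mean_vector([attention, attention, attention])

query = attention  # a query purely about attention
print(round(cosine_similarity(query, mixed_chunk), 3))    # 0.707
print(round(cosine_similarity(query, focused_chunk), 3))  # 1.0
```

The mixed chunk sits halfway between the two topics, so it scores lower against a query about either one.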

The Cost: Embedding at Chunking Time

Here’s the honest trade-off: semantic chunking requires you to embed every sentence during ingestion. For a 10M document corpus averaging 40 sentences per document, that’s 400M sentence embeddings before you even build your vector index.
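
The arithmetic is easy to rerun for your own corpus. In the sketch below, the function name is hypothetical, and both the 25 tokens per sentence and the price per million tokens are placeholder assumptions, not quoted figures; substitute your provider's current pricing.

```python
def estimate_embedding_cost(num_docs, sentences_per_doc,
                            tokens_per_sentence, usd_per_million_tokens):
    """Back-of-the-envelope cost of embedding every sentence once."""
    sentences = num_docs * sentences_per_doc
    tokens = sentences * tokens_per_sentence
    return {
        "sentences": sentences,
        "tokens": tokens,
        "usd": tokens / 1_000_000 * usd_per_million_tokens,
    }


estimate = estimate_embedding_cost(
    num_docs=10_000_000,
    sentences_per_doc=40,
    tokens_per_sentence=25,        # assumption: average sentence length
    usd_per_million_tokens=0.02,   # placeholder price, check current pricing
)
print(f"{estimate['sentences']:,} sentences")  # 400,000,000 sentences
print(f"{estimate['tokens']:,} tokens")        # 10,000,000,000 tokens
print(f"${estimate['usd']:,.2f}")              # $200.00
```

The result is very sensitive to the inputs: a pricier model, longer sentences, or repeated re-ingestion each multiply the total, and this covers only the sentence-level pass, not the chunk embeddings that go into the index.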

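For a hosted model such as OpenAI’s text-embedding-3-small (as of 2025 pricing), the bill scales linearly with the number of tokens embedded, so at this scale it can reach hundreds or even thousands of dollars depending on sentence length and how often you re-ingest, on top of the extra ingestion time. Options to mitigate: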

Use a fast local model for chunking (e.g., sentence-transformers/all-MiniLM-L6-v2) and a higher-quality model for the actual chunk embeddings that go into your index.
Batch aggressively: OpenAI’s embeddings endpoint, for example, accepts up to 2,048 inputs per request, which dramatically reduces API call overhead.
Cache sentence embeddings: if you’re re-ingesting updated documents, many sentences remain unchanged. A content hash cache avoids redundant embedding work.
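
The batching and caching ideas combine naturally. Below is a minimal sketch, assuming an in-memory dict as the cache and a caller-supplied `embed_batch` function that takes a list of strings and returns one vector per string; both names, and `embed_with_cache` itself, are hypothetical.

```python
import hashlib


def embed_with_cache(sentences, embed_batch, cache, batch_size=2048):
    """Embed sentences, calling embed_batch only for content not yet cached."""
    keys = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences]

    # Collect unique cache misses, preserving first-seen order.
    missing = {}
    for key, sentence in zip(keys, sentences):
        if key not in cache and key not in missing:
            missing[key] = sentence

    # Embed the misses in batches and store the results by content hash.
    items = list(missing.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = embed_batch([sentence for _, sentence in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]
```

For example, `embed_with_cache(sentences, my_embed_fn, cache={})` on a first run embeds everything; passing the same cache on a re-ingest embeds only sentences whose text has changed. A persistent store (a key-value database or files on disk) would replace the dict in practice.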

Minimum Chunk Size Enforcement

Semantic splitting can produce very small chunks — sometimes a single sentence — when topic transitions are frequent. These degrade retrieval quality for the opposite reason from large chunks: they lack enough context for the embedding to be discriminative.

Add a minimum chunk size constraint:

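def token_count(text):
    # Rough whitespace estimate; swap in a real tokenizer (e.g. tiktoken) for accuracy.
    return len(text.split())

def merge_small_chunks(chunks, min_tokens=100):
    """Merge consecutive chunks until each one has at least min_tokens tokens."""
    merged = []
    buffer = ""
    for chunk in chunks:
        # Keep accumulating until the buffer is large enough to stand alone.
        buffer = f"{buffer} {chunk}".strip()
        if token_count(buffer) >= min_tokens:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A trailing remainder is too small on its own: fold it into the last chunk.
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged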

A minimum of 100–150 tokens works well in practice.

2025 Trend: Late Chunking

Late chunking is an emerging technique that inverts the normal order of operations. Instead of chunking text first and then embedding each chunk independently, you embed the full document (or a long passage) with a long-context model first, then chunk the resulting token embeddings.

Traditional:              Late Chunking:
Text → Chunks → Embed     Text → Embed (full) → Chunk embeddings
                          ↑ Token embeddings retain full document context

This means every token embedding has been informed by the full document context before chunking happens; each chunk’s vector is then pooled (typically averaged) from the token embeddings inside its span. Early experiments (ColBERT-based systems, Jina AI’s late chunking work) report improvements on multi-hop queries where context matters across chunk boundaries.
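
The pooling step is the only part that differs mechanically from ordinary chunking, and it can be sketched without a model. The snippet assumes you already have one vector per token from a long-context encoder plus the token span of each chunk; the function name `late_chunk_pool` and the toy vectors are hypothetical, and mean pooling is one common choice rather than the only one.

```python
def late_chunk_pool(token_embeddings, spans):
    """Mean-pool token vectors inside each (start, end) span; end is exclusive."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        pooled.append([sum(column) / len(window) for column in zip(*window)])
    return pooled


# Four token vectors from a single full-document forward pass (toy values)
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token indices
print(late_chunk_pool(token_embeddings, spans))  # [[2.0, 0.0], [0.0, 3.0]]
```

The chunk boundaries themselves can still come from any strategy, including the semantic breakpoints described earlier; late chunking only changes where the context for each vector comes from.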

Semantic Chunking vs Recursive Chunking: Which to Choose

Factor	Semantic Chunking	Recursive Chunking
Retrieval quality	Higher	Good
Ingestion time	5–20× slower	Moderate
Ingestion cost	High (embedding step)	Low
Variable chunk sizes	Yes (natural)	Somewhat
Works on any content	Yes	Yes
2025 adoption	Growing	Widely adopted

For most teams: start with recursive chunking, then switch to semantic chunking for content types where you can demonstrate that the retrieval improvement justifies the cost.
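
One simple way to demonstrate that improvement is to run the same labeled queries against both indexes and compare a hit rate. The sketch below is only one possible metric, under the assumption that each query has a known answer passage and that a retrieved chunk counts as a hit if it contains that passage; the function name and data shapes are hypothetical.

```python
def hit_rate_at_k(retrieved, answers, k=5):
    """Fraction of queries whose top-k retrieved chunks contain the answer text.

    retrieved: dict mapping query -> ranked list of chunk strings
    answers:   dict mapping query -> passage that a good chunk should contain
    """
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for query, chunks in retrieved.items()
        if any(answers[query] in chunk for chunk in chunks[:k])
    )
    return hits / len(retrieved)


answers = {"q1": "cross-entropy", "q2": "multi-head attention"}
recursive_results = {
    "q1": ["the loss function used is cross-entropy"],
    "q2": ["we train with the Adam optimizer"],
}
print(hit_rate_at_k(recursive_results, answers, k=1))  # 0.5
```

Running the same function over the semantic-chunking results gives a like-for-like number to weigh against the extra ingestion cost.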

Practical Takeaway

Semantic chunking is not a silver bullet — it’s a precision instrument. Use it when:

Retrieval quality is the primary bottleneck in your RAG system
Documents are long-form with natural topic transitions
Your ingestion pipeline runs offline (cost and time are less urgent)
You’ve already squeezed the gains available from recursive chunking

Avoid it when you’re still iterating on your overall architecture. The increased complexity and cost don’t pay off until the rest of your pipeline is solid.