Semantic Search: When You Need Meaning, Not Keywords

A user types: “What documents do I need to bring to the appointment?”

A keyword search sees: “documents”, “bring”, “appointment” — and returns results about document management software, meeting scheduling tools, and appointment booking.

A semantic search sees the intent: someone is preparing for an official meeting and needs a checklist. It returns: visa application requirements, hospital intake forms, DMV appointment preparation guides — whatever is relevant in your corpus.

This is the fundamental difference between keyword search and semantic search, and it is why semantic search underpins most modern RAG systems.

How Semantic Search Works

Semantic search converts both documents and queries into dense vector representations (embeddings) that capture meaning. Similar meanings produce similar vectors. Retrieval then finds the document vectors closest to the query vector, typically measured by cosine similarity or dot product.

Embedding Space Example:

Text: "automobile"          → [0.23, -0.45, 0.12, ...]
Text: "car"                 → [0.22, -0.44, 0.13, ...]  ← near "automobile"
Text: "vehicle"             → [0.20, -0.42, 0.11, ...]  ← near "automobile"
Text: "motorcycle"          → [0.18, -0.38, 0.09, ...]  ← somewhat near
Text: "bicycle"             → [0.10, -0.21, 0.05, ...]  ← a bit further
Text: "banana"              → [-0.45, 0.67, -0.23, ...]  ← far away

Query: "What's the fastest two-wheeled vehicle?"
Nearest vectors: motorcycle, bicycle — found without any keyword overlap

The embedding model learns these relationships from massive text corpora. It understands synonyms, paraphrases, concepts, and even cross-lingual equivalences (for multilingual models).
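The "near" and "far" labels in the example above are usually computed with cosine similarity. The following is a minimal sketch using only the Python standard library and the truncated three-dimensional toy vectors from the example. Real embeddings have hundreds or thousands of dimensions, and the helper name `cosine_similarity` is illustrative, not part of any library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors copied from the example above (illustrative only)
vectors = {
    "automobile": [0.23, -0.45, 0.12],
    "car": [0.22, -0.44, 0.13],
    "banana": [-0.45, 0.67, -0.23],
}

# Rank every other entry by similarity to "automobile", most similar first
query = vectors["automobile"]
ranked = sorted(
    ((name, cosine_similarity(query, vec))
     for name, vec in vectors.items() if name != "automobile"),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With these toy values, "car" ranks first with a similarity close to 1, while "banana" points in nearly the opposite direction and gets a negative score.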

Dense vs Sparse Representations

Semantic search uses dense embeddings — vectors where every dimension carries meaning and most values are non-zero. This contrasts with sparse representations used in keyword search (like TF-IDF or BM25), where most dimensions are zero and only matching vocabulary terms have non-zero values.

Sparse (TF-IDF/BM25):
"The car engine overheated" →
  {"car": 0.45, "engine": 0.62, "overheat": 0.71, ...rest 100,000 terms: 0}

Dense (embedding):
"The car engine overheated" →
  [0.12, -0.34, 0.89, 0.22, -0.11, ...]  (all 768 dims non-zero)

Dense embeddings capture semantics. Sparse representations capture exact vocabulary. Both have roles — which is why hybrid search (covered in a separate section) often outperforms either alone.
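To make the contrast concrete, the sketch below scores two queries against a document using a sparse bag-of-words representation. It is a deliberately simplified stand-in (raw term counts, with no IDF weighting or BM25 length normalization), and the whitespace tokenizer and function names are assumptions made for illustration.

```python
from collections import Counter

def sparse_vector(text: str) -> Counter:
    """Bag-of-words term counts; terms absent from the text are implicitly zero."""
    return Counter(text.lower().split())

def sparse_dot(a: Counter, b: Counter) -> int:
    """Dot product over shared terms only; every other dimension contributes 0."""
    return sum(count * b[term] for term, count in a.items() if term in b)

doc = sparse_vector("the car engine overheated")
exact_query = sparse_vector("car engine")
synonym_query = sparse_vector("automobile motor")

exact_score = sparse_dot(doc, exact_query)      # shared terms: "car", "engine"
synonym_score = sparse_dot(doc, synonym_query)  # no shared terms
```

The synonym query scores zero against the document even though it means nearly the same thing, which is the gap dense embeddings are meant to close. The exact-term query, by contrast, matches reliably, which is the strength sparse methods bring to a hybrid setup.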

Embedding Model Selection

The quality of semantic search depends heavily on the embedding model. Key considerations:

Dimensionality and Quality

Model	Dims	Max input	Best For
OpenAI text-embedding-3-small	1536	8191 tokens	General purpose, cost-effective
OpenAI text-embedding-3-large	3072	8191 tokens	Maximum quality
Cohere embed-v3	1024	512 tokens	Multilingual, input-type aware
sentence-transformers/all-mpnet-base-v2	768	384 tokens	Open source, good quality
BAAI/bge-large-en-v1.5	1024	512 tokens	Open source, strong MTEB results
jinaai/jina-embeddings-v3	1024	8192 tokens	Long-context, open weights

Task-Specific Embedding

Some embedding models differentiate between “document” and “query” encoding. Documents get one type of encoding; queries get another. This asymmetric approach improves retrieval because what makes a document relevant to a query is different from what makes documents similar to each other.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode documents: no prefix
doc_embedding = model.encode("The annual report shows Q3 revenue of $4.2B")

# Encode query: prepend the retrieval instruction recommended for English BGE models
query_instruction = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(query_instruction + "What was Q3 revenue?")

Building a Basic Semantic Search Pipeline

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
import uuid

openai_client = OpenAI()
qdrant_client = QdrantClient(":memory:")

# Create collection
qdrant_client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    ).data[0].embedding

# Index documents
def index_documents(docs: list[dict]):
    points = []
    for doc in docs:
        embedding = embed(doc["text"])
        points.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={"text": doc["text"], "source": doc["source"]},
        ))
    qdrant_client.upsert(collection_name="docs", points=points)

# Semantic search
def semantic_search(query: str, k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    # query_points is the current query API; older qdrant-client releases used search()
    response = qdrant_client.query_points(
        collection_name="docs",
        query=query_embedding,
        limit=k,
    )
    return [
        {"text": p.payload["text"], "score": p.score, "source": p.payload["source"]}
        for p in response.points
    ]

Common Failure Modes

Vocabulary Mismatch (Still Exists)

Semantic search handles synonyms but can still miss highly specific technical terms, product names, or acronyms that weren’t well-represented in training data.

Query: "What is the MTR requirement for Series C investors?"
Problem: "MTR" (Minimum Transfer Ratio) may not have a strong embedding
Solution: Hybrid retrieval (semantic + BM25) captures exact term matches
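One common way to combine the semantic and BM25 result lists is reciprocal rank fusion (RRF), sketched below. The fusion method is not prescribed here, so RRF, the constant `k=60`, and the document IDs are all assumptions chosen for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the "MTR" query from each retriever
semantic_hits = ["doc_investor_rights", "doc_series_c_terms", "doc_cap_table"]
bm25_hits = ["doc_mtr_definition", "doc_series_c_terms"]

fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
```

A document that appears in both lists rises to the top, while the exact-match document that only BM25 found still makes it into the fused results.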

Out-of-Distribution Queries

Embedding models trained on general text may not capture domain-specific semantics well. For example, an embedding model trained on medical text will usually produce better results for clinical queries than a general-purpose model.

Long Query Degradation

Many open-source embedding models have short context windows (256–512 tokens), and even long-context models compress the entire input into a single vector. A long, multi-part query therefore ends up as one vector that may not represent all sub-intents equally.

Solution: Query decomposition — split complex queries into multiple sub-queries, run semantic search for each, then merge and deduplicate results.
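A minimal sketch of the merge-and-deduplicate step follows. It assumes the sub-queries have already been produced (in practice often by an LLM, which is not shown) and that the search function returns dicts with "text" and "score" keys, as `semantic_search` does above. The stand-in `fake_search` and its canned data are hypothetical.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[dict]]

def decomposed_search(sub_queries: list[str], search: SearchFn, k: int = 5) -> list[dict]:
    """Run each sub-query, keep the best-scoring copy of each text, return the top k."""
    best: dict[str, dict] = {}
    for sub_query in sub_queries:
        for hit in search(sub_query, k):
            seen = best.get(hit["text"])
            if seen is None or hit["score"] > seen["score"]:
                best[hit["text"]] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)[:k]

# Hypothetical stand-in for a real search function, used only to exercise the merge logic
def fake_search(query: str, k: int) -> list[dict]:
    canned = {
        "visa documents": [{"text": "Passport required", "score": 0.91},
                           {"text": "Two photos required", "score": 0.80}],
        "appointment fees": [{"text": "Fee is payable by card", "score": 0.88},
                             {"text": "Passport required", "score": 0.60}],
    }
    return canned.get(query, [])[:k]

merged = decomposed_search(["visa documents", "appointment fees"], fake_search, k=3)
```

Ranking the merged hits by raw score is a simplification, since scores from different sub-queries are not strictly comparable. A rank-based merge is a reasonable alternative when that matters.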

2025 Trend: Instruction-Following Embeddings

Instruction-tuned embedding models allow you to specify the retrieval task in a short instruction prefix, improving results for task-specific queries:

# Cohere embed-v3: input_type tells the model how each text will be used
from cohere import Client

co = Client("your-api-key")

# For document encoding
doc_embedding = co.embed(
    texts=["Annual report content..."],
    model="embed-english-v3.0",
    input_type="search_document"
).embeddings[0]

# For query encoding — different type
query_embedding = co.embed(
    texts=["What were the Q3 revenues?"],
    model="embed-english-v3.0",
    input_type="search_query"  # optimized for search queries
).embeddings[0]

This asymmetric approach produces better retrieval results than treating documents and queries identically.

Evaluating Semantic Search Quality

A widely used evaluation benchmark for semantic search is BEIR (Benchmarking IR), which covers a range of retrieval tasks and domains. Key metrics:

NDCG@10: Normalized Discounted Cumulative Gain — measures ranking quality
Recall@100: What percentage of relevant docs appear in top 100 results
MRR (Mean Reciprocal Rank): How high up is the first relevant result

For production evaluation, build a golden dataset of 50–200 query-document pairs from your specific corpus and use NDCG@10 as your primary metric.
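The three metrics above can be computed per query with a few lines of standard-library Python. This is a sketch under simplifying assumptions: relevance is binary (a document is either relevant or not), and the document IDs in the example are hypothetical. MRR is the mean of `reciprocal_rank` over all queries in the golden dataset.

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical golden-set entry: two relevant documents, one of them retrieved at rank 2
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
```

Averaging each function over every query in the golden dataset gives the corpus-level figure to track as you change embedding models or chunking.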

Semantic search is the entry point to RAG quality. Getting the embedding model right and understanding its failure modes is foundational before layering on more advanced retrieval techniques.