Semantic Search: When You Need Meaning, Not Keywords
A user types: “What documents do I need to bring to the appointment?”
A keyword search sees: “documents”, “bring”, “appointment” — and returns results about document management software, meeting scheduling tools, and appointment booking.
A semantic search sees the intent: someone is preparing for an official meeting and needs a checklist. It returns: visa application requirements, hospital intake forms, DMV appointment preparation guides — whatever is relevant in your corpus.
This is the fundamental difference between keyword search and semantic search, and it’s why semantic search underpins most modern RAG systems.
How Semantic Search Works
Semantic search converts both documents and queries into dense vector representations (embeddings) that capture meaning. Similar meanings produce similar vectors. Retrieval finds the vectors most similar to the query vector.
Embedding Space Example:
    Text: "automobile" → [0.23, -0.45, 0.12, ...]
    Text: "car"        → [0.22, -0.44, 0.13, ...]   ← near "automobile"
    Text: "vehicle"    → [0.20, -0.42, 0.11, ...]   ← near "automobile"
    Text: "motorcycle" → [0.18, -0.38, 0.09, ...]   ← somewhat near
    Text: "bicycle"    → [0.10, -0.21, 0.05, ...]   ← a bit further
    Text: "banana"     → [-0.45, 0.67, -0.23, ...]  ← far away

    Query: "What's the fastest two-wheeled vehicle?"
    Nearest vectors: motorcycle, bicycle (found without any keyword overlap)

The embedding model learns these relationships from massive text corpora. It understands synonyms, paraphrases, concepts, and even cross-lingual equivalences (for multilingual models).
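To make the "find the nearest vectors" step concrete, here is a minimal sketch using cosine similarity over hand-made 3-dimensional vectors. The vectors, the `toy_embeddings` and `query_vector` names, and the `nearest` helper are all illustrative assumptions; a real embedding model produces hundreds or thousands of dimensions, and a vector database replaces the brute-force loop.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for illustration (not real model output).
toy_embeddings = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.88, 0.15, 0.02],
    "motorcycle": [0.70, 0.60, 0.00],
    "bicycle":    [0.30, 0.90, 0.05],
    "banana":     [0.00, 0.05, 0.95],
}

# Hypothetical embedding of "What's the fastest two-wheeled vehicle?"
query_vector = [0.60, 0.70, 0.00]

def nearest(query: list[float], embeddings: dict[str, list[float]], k: int = 2):
    """Brute-force search: score every stored vector, return the k most similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

for name, score in nearest(query_vector, toy_embeddings):
    print(f"{name}: {score:.3f}")
```

With these made-up vectors, the two closest entries are "motorcycle" and "bicycle", even though neither word appears in the query.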
Dense vs Sparse Representations
Semantic search uses dense embeddings — vectors where every dimension carries meaning and most values are non-zero. This contrasts with sparse representations used in keyword search (like TF-IDF or BM25), where most dimensions are zero and only matching vocabulary terms have non-zero values.
    Sparse (TF-IDF/BM25):
    "The car engine overheated" → {"car": 0.45, "engine": 0.62, "overheat": 0.71, ...rest 100,000 terms: 0}

    Dense (embedding):
    "The car engine overheated" → [0.12, -0.34, 0.89, 0.22, -0.11, ...] (all 768 dims non-zero)

Dense embeddings capture semantics. Sparse representations capture exact vocabulary. Both have roles, which is why hybrid search (covered in a separate section) often outperforms either alone.
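The sketch below illustrates why sparse representations only match exact vocabulary. It is a simplification: it uses raw term counts rather than TF-IDF or BM25 weights, and the `sparse_vector` and `sparse_dot` helpers are hypothetical names introduced for illustration.

```python
from collections import Counter

def sparse_vector(text: str) -> dict[str, int]:
    """Bag-of-words term counts: only terms present in the text get an entry."""
    return dict(Counter(text.lower().split()))

def sparse_dot(a: dict[str, int], b: dict[str, int]) -> int:
    """Dot product over shared terms only; every absent term is an implicit zero."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)

doc = sparse_vector("the car engine overheated")
query_same_words = sparse_vector("car engine")
query_synonyms = sparse_vector("automobile motor")

print(sparse_dot(doc, query_same_words))  # two shared terms contribute to the score
print(sparse_dot(doc, query_synonyms))    # no shared terms, so the score is zero
```

The synonym query scores zero against the document even though it means nearly the same thing. A dense embedding model would be expected to place the two close together.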
Embedding Model Selection
The quality of semantic search depends heavily on the embedding model. Key considerations:
Dimensionality and Quality
| Model | Dims | Context | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | General purpose, cost-effective |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Maximum quality |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, instruction-based |
| sentence-transformers/all-mpnet-base-v2 | 768 | 384 tokens | Open source, good quality |
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | Open source, top MTEB performer |
| Jina AI jina-embeddings-v3 | 1024 | 8192 tokens | Long-context, open weights |
Task-Specific Embedding
Some embedding models differentiate between “document” and “query” encoding. Documents get one type of encoding; queries get another. This asymmetric approach improves retrieval because what makes a document relevant to a query is different from what makes documents similar to each other.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-large-en-v1.5")

    # Encode documents: no prefix
    doc_embedding = model.encode("The annual report shows Q3 revenue of $4.2B")

    # Encode query: prepend the retrieval instruction that the English BGE v1.5 models expect
    query_instruction = "Represent this sentence for searching relevant passages: "
    query_embedding = model.encode(query_instruction + "What was Q3 revenue?")

Building a Basic Semantic Search Pipeline
    import uuid

    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    qdrant_client = QdrantClient(":memory:")  # in-process instance, no server needed

    # Create collection; size must match the embedding model's output dimension
    qdrant_client.create_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

    def embed(text: str) -> list[float]:
        return openai_client.embeddings.create(
            input=text, model="text-embedding-3-small"
        ).data[0].embedding

    # Index documents: one embedding call per document, then a single upsert
    def index_documents(docs: list[dict]):
        points = []
        for doc in docs:
            embedding = embed(doc["text"])
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={"text": doc["text"], "source": doc["source"]},
            ))
        qdrant_client.upsert(collection_name="docs", points=points)

    # Semantic search: embed the query, then return the k nearest stored vectors
    # Note: recent qdrant-client releases deprecate `search` in favor of `query_points`.
    def semantic_search(query: str, k: int = 5) -> list[dict]:
        query_embedding = embed(query)
        results = qdrant_client.search(
            collection_name="docs",
            query_vector=query_embedding,
            limit=k,
        )
        return [
            {"text": r.payload["text"], "score": r.score, "source": r.payload["source"]}
            for r in results
        ]

Common Failure Modes
Vocabulary Mismatch (Still Exists)
Semantic search handles synonyms but can still miss highly specific technical terms, product names, or acronyms that weren’t well-represented in training data.
    Query: "What is the MTR requirement for Series C investors?"
    Problem: "MTR" (Minimum Transfer Ratio) may not have a strong embedding
    Solution: Hybrid retrieval (semantic + BM25) captures exact term matches

Out-of-Distribution Queries
Embedding models trained on general text may not capture domain-specific semantics well. A medical embedding model will typically produce better results for clinical queries than a general-purpose model.
Long Query Degradation
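Many embedding models, especially open-source sentence-transformer models, have short context windows (256–512 tokens), and input beyond that limit is typically truncated. Even within the limit, a long, multi-part query gets compressed into a single vector that may not represent all sub-intents equally.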
Solution: Query decomposition — split complex queries into multiple sub-queries, run semantic search for each, then merge and deduplicate results.
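A minimal sketch of the merge-and-deduplicate step follows. It assumes the sub-queries have already been produced (by an LLM or by hand) and that the search function returns dicts with `text`, `score`, and `source` keys, as `semantic_search` does above. The `search_decomposed` helper and the `fake_search` stub with its canned results are hypothetical and exist only so the example runs offline.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[dict]]

def search_decomposed(sub_queries: list[str], search_fn: SearchFn, k: int = 5) -> list[dict]:
    """Run search_fn per sub-query, keep the best-scoring copy of each hit, return the top k."""
    best: dict[tuple[str, str], dict] = {}
    for sub_query in sub_queries:
        for hit in search_fn(sub_query, k):
            key = (hit["source"], hit["text"])  # two hits with the same key are duplicates
            if key not in best or hit["score"] > best[key]["score"]:
                best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]

# Stub standing in for a real search function; results are invented for the demo.
def fake_search(query: str, k: int) -> list[dict]:
    canned = {
        "visa documents": [
            {"text": "Visa checklist", "source": "visa.md", "score": 0.91},
            {"text": "Passport photo rules", "source": "photo.md", "score": 0.74},
        ],
        "visa fees": [
            {"text": "Fee schedule", "source": "fees.md", "score": 0.88},
            {"text": "Visa checklist", "source": "visa.md", "score": 0.62},
        ],
    }
    return canned.get(query, [])[:k]

merged = search_decomposed(["visa documents", "visa fees"], fake_search, k=3)
for hit in merged:
    print(hit["source"], hit["score"])
```

Merging by raw score assumes the scores are comparable across sub-queries, which is reasonable when every sub-query uses the same embedding model and distance metric. When that does not hold, a rank-based merge is a common alternative.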
2025 Trend: Instruction-Following Embeddings
Instruction-tuned embedding models allow you to specify the retrieval task in a short instruction prefix, improving results for task-specific queries:
    # Cohere embed-v3 with instructions
    from cohere import Client

    co = Client("your-api-key")

    # For document encoding
    doc_embedding = co.embed(
        texts=["Annual report content..."],
        model="embed-english-v3.0",
        input_type="search_document",
    ).embeddings[0]

    # For query encoding: different input type
    query_embedding = co.embed(
        texts=["What were the Q3 revenues?"],
        model="embed-english-v3.0",
        input_type="search_query",  # optimized for search queries
    ).embeddings[0]

This asymmetric approach produces better retrieval results than treating documents and queries identically.
Evaluating Semantic Search Quality
A widely used evaluation benchmark for semantic search is BEIR (Benchmarking Information Retrieval). Key metrics:
- NDCG@10: Normalized Discounted Cumulative Gain — measures ranking quality
- Recall@100: What percentage of relevant docs appear in top 100 results
- MRR (Mean Reciprocal Rank): How high up is the first relevant result
For production evaluation, build a golden dataset of 50–200 query-document pairs from your specific corpus and use NDCG@10 as your primary metric.
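The three metrics are small enough to compute by hand against a golden dataset. The sketch below is one possible implementation, not a reference one: the function names and the example document IDs are assumptions, and NDCG here uses linear gain (the relevance grade itself), whereas some implementations use the exponential form 2^rel − 1. With binary relevance labels the two agree.

```python
import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over all queries."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical golden-set entry: one query, its relevance labels, and the system's ranking.
ranked = ["d3", "d1", "d7", "d2"]
labels = {"d1": 1.0, "d2": 1.0}

print(recall_at_k(ranked, set(labels), k=3))
print(reciprocal_rank(ranked, set(labels)))
print(round(ndcg_at_k(ranked, labels, k=10), 4))
```

Averaging each function's output over every query in the golden dataset gives the corpus-level Recall@k, MRR, and NDCG@k.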
Semantic search is the entry point to RAG quality. Getting the embedding model right and understanding its failure modes is foundational before layering on more advanced retrieval techniques.