Embeddings in RAG: From Text to Vectors
At the heart of modern RAG systems lies a deceptively simple idea: convert text into numbers (vectors) such that similar text produces similar numbers. These numerical representations—called embeddings—enable semantic search.
What Are Embeddings?
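An embedding is a dense numerical representation of text: a fixed-length list of floating-point numbers. Instead of matching keywords, embeddings capture meaning, so two texts that share no words can still be recognized as related.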
Example:
Text: "The cat sat on the mat"
Embedding: [0.2, -0.15, 0.8, ..., -0.3] (384-dimensional vector)
Text: "The dog sat on the floor"
Embedding: [0.21, -0.16, 0.78, ..., -0.31] (similar to first)
Text: "Python is a programming language"
Embedding: [0.05, 0.2, -0.1, ..., 0.4] (different from both)
Related texts have embeddings that point in similar directions in vector space.
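"Pointing in similar directions" is commonly measured with cosine similarity. The sketch below is a minimal illustration using only the standard library. The 4-dimensional vectors are made up for the example (they reuse the visible numbers from the illustration above), and the function name is an assumption, not part of any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors; real embeddings have hundreds of dimensions.
cat = [0.2, -0.15, 0.8, -0.3]
dog = [0.21, -0.16, 0.78, -0.31]
python_lang = [0.05, 0.2, -0.1, 0.4]

sim_cat_dog = cosine_similarity(cat, dog)             # close to 1.0
sim_cat_python = cosine_similarity(cat, python_lang)  # negative for these toy values

print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"cat vs python: {sim_cat_python:.3f}")
```

In a retrieval setting, the same score would be computed between the query embedding and every candidate document embedding, and the highest-scoring documents returned.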
How Embeddings Work
Modern embeddings come from neural networks trained on massive text corpora using contrastive learning or other objectives.
Training process (simplified):
- Take text triplets: (query, relevant_document, non-relevant_document)
- Encode each into an embedding
- Adjust weights so relevant documents embed closer to the query and non-relevant ones farther away
- Repeat millions of times
- Result: A model that encodes meaning
The neural network learns to extract and compress semantic information into a fixed-size vector.
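One common way to express the "pull relevant closer, push non-relevant away" objective is a triplet margin loss. The sketch below is an assumption about the kind of loss involved, since the text does not name one, and it only computes the loss value for fixed vectors. Real training would backpropagate this value through a neural encoder. The function names and the margin of 0.5 are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(query, positive, negative, margin=0.5):
    """Zero once the positive is closer to the query than the negative by at least `margin`."""
    return max(0.0, euclidean(query, positive) - euclidean(query, negative) + margin)

# Toy 2-dimensional embeddings, made up for illustration.
query = [1.0, 0.0]
relevant = [0.9, 0.1]
irrelevant = [0.0, 1.0]

good_loss = triplet_loss(query, relevant, irrelevant)  # already well separated: no penalty
bad_loss = triplet_loss(query, irrelevant, relevant)   # roles swapped: large penalty

print(good_loss, bad_loss)
```

A loss of zero means the encoder already ranks this triplet correctly with room to spare. A positive loss is the signal that drives the weight updates.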
Embedding Dimensions
Each embedding model outputs vectors of a fixed dimensionality, typically between 384 and 3072 dimensions.
Dimensionality trade-offs:
- Lower dimensions (384-512): Faster computation and less storage, but less expressive
- Higher dimensions (768-1536): Capture more semantic information, at higher storage and compute cost
- Very high (2048+): Offered by some models (for example, text-embedding-3-large at 3072d), but returns diminish while storage and search cost keep growing
Common choices:
- Sentence-Transformers all-MiniLM-L6-v2 (384d): Fast, good quality
- OpenAI text-embedding-3-small (1536d): High quality, billed per token
- Cohere/Voyage (1024d): Balanced approach
General rule: Start with 384-512 dimensions. Only increase if retrieval accuracy is poor.
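The storage side of the dimensionality trade-off is simple arithmetic. The sketch below assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead and metadata, both of which vary by vector database. The function name is illustrative.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (384, 768, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Under these assumptions, one million vectors take about 1.4 GiB at 384 dimensions and about 11.4 GiB at 3072 dimensions, and brute-force similarity search scales by the same factor.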
Embedding Models: The Landscape
OpenAI Embeddings
Models: text-embedding-3-large (3072d), text-embedding-3-small (1536d)
Pros:
- State-of-the-art quality
- Well-maintained, stable API
- Excellent documentation
Cons:
- Proprietary, no local control
- API costs accumulate with scale
- Tied to OpenAI’s updates
Use case: Companies comfortable with vendor lock-in, prioritizing quality over cost.
Sentence-Transformers (Open Source)
Popular models:
- all-MiniLM-L6-v2 (384d): Small, fast, good for CPU
- all-mpnet-base-v2 (768d): Larger, higher quality
- multilingual models: 50+ languages supported
Pros:
- Free, fully open source
- Run locally or self-hosted
- Fine-tuning possible for domain tasks
- No API costs
Cons:
- Throughput depends on your own hardware; larger models are slow without a GPU
- Require infrastructure
- Support quality varies
Use case: Privacy-conscious organizations, cost-sensitive deployments, and teams that need domain-specific tuning.
Cohere Embeddings
Models: embed-english-v3.0 (1024d), embed-multilingual-v3.0 (1024d)
Pros:
- High quality
- Reasonably priced per million tokens
- Excellent multilingual support
Cons:
- Proprietary API
- Requires API key management
- Costs similar to OpenAI
Use case: Companies needing multilingual support and value for money.
Voyage AI
Models: voyage-3 (1024d), voyage-3-lite (512d)
Pros:
- Specialized for RAG/search tasks
- Good retrieval performance
- Reasonable pricing
Cons:
- Newer company, smaller ecosystem
- Less documentation than OpenAI/Cohere
Use case: RAG-focused deployments wanting specialized models.
BGE and E5 Models
Both families are trained at large scale and are very competitive open-source options.
BGE: Developed by the Beijing Academy of Artificial Intelligence (BAAI), strong multilingual support
E5: Contrastively trained, strong zero-shot performance
Both are available via Hugging Face and can be self-hosted.
Choosing an Embedding Model
Decision framework:
1. Infrastructure: Local (on-premise) or cloud?
- Local → Sentence-Transformers
- Cloud → OpenAI, Cohere, Voyage
2. Quality requirements: Accuracy-critical or cost-critical?
- Accuracy-critical → OpenAI text-embedding-3-large
- Cost-sensitive → Sentence-Transformers or BGE
3. Domain specificity: Generic or specialized knowledge?
- Generic → Any established model
- Specialized → Fine-tune Sentence-Transformers on domain data
4. Language requirements: English-only or multilingual?
- Multilingual → Cohere or multilingual Sentence-Transformers
5. Latency requirements: Sub-100ms, or is a second or more acceptable?
- Sub-100ms → Local small models
- Flexible → API-based models fine
Starting recommendation:
- For prototyping: OpenAI text-embedding-3-small
- For production cost optimization: Sentence-Transformers all-mpnet-base-v2
- For quality at scale: text-embedding-3-large or BGE
Embedding Costs at Scale
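For a knowledge base of 1 million documents averaging 500 tokens each (500 million tokens in total):
- OpenAI (API): roughly $10-65 for the initial embedding, at list prices of about $0.02 (text-embedding-3-small) to $0.13 (text-embedding-3-large) per million tokens, plus ongoing updates
- Self-hosted Sentence-Transformers: no per-token fees; the cost is hardware and operations, amortized over time
- Cohere: roughly $50 for the initial embedding, at about $0.10 per million tokens
Provider prices change, so check current rate cards before budgeting.
Retrieval queries are much cheaper because queries are short: 100k queries/month typically costs a few dollars or less, depending on provider and query length.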
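API embedding cost is total tokens multiplied by the per-token price, so it is easy to rerun the estimate whenever prices change. In the sketch below, the function name and the example price of $0.10 per million tokens are placeholders for illustration, not any provider's quoted rate.

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens):
    """Estimated one-off cost of embedding a corpus through a per-token API."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical price: replace with your provider's current rate.
example_price = 0.10
initial_cost = embedding_cost_usd(1_000_000, 500, example_price)

print(f"Initial embedding: ${initial_cost:,.2f}")
```

The same function covers re-embedding after a model change: the full corpus cost is paid again, which is worth factoring into any decision to switch models.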
Fine-Tuning Embeddings for Your Domain
When off-the-shelf models underperform:
Approach:
- Collect domain-specific Q&A pairs (100-1000 examples)
- Fine-tune a Sentence-Transformer model on your data
- Deploy fine-tuned model in RAG pipeline
Benefits:
- Often a 5-15% improvement in retrieval accuracy, depending on the domain and the quality of the training data
- Customized to your terminology and concepts
- Fully under your control
Trade-offs:
- Requires labeled data
- Computational cost to train
- Maintenance burden increases
Embedding Quality Metrics
Retrieval quality: Test your embeddings empirically. Given known relevant documents for a set of test queries, measure:
- Recall@K: Of all the relevant documents, what fraction appear in the top K retrieved?
- NDCG: Normalized discounted cumulative gain (ranking quality)
- MRR: Mean reciprocal rank (average of 1/rank of the first relevant result across queries)
Run benchmarks before and after model changes.
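These metrics are short enough to implement directly. The sketch below assumes binary relevance (a document is either relevant or not) and unique document IDs in each result list. Recall@K is computed as the share of known relevant documents found in the top K, and MRR would be the mean of `reciprocal_rank` over all test queries. The document IDs are made up.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the best possible DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One test query: ranked results from the retriever, and the known relevant set.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

recall = recall_at_k(retrieved, relevant, k=3)  # only d1 is in the top 3
rr = reciprocal_rank(retrieved, relevant)       # first relevant result is at rank 2
ndcg = ndcg_at_k(retrieved, relevant, k=4)

print(f"Recall@3={recall:.3f}  RR={rr:.3f}  NDCG@4={ndcg:.3f}")
```

Averaging each metric over a fixed set of test queries gives a number that can be compared before and after a model change.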
Common Embedding Mistakes
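Embedding queries and documents inconsistently: Queries and documents must be embedded with the same model. Some models also expect a different prefix or input type for each (for example, E5 models use "query: " and "passage: " prefixes), so follow the model's documentation.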
Ignoring dimensionality implications: Higher dimensions don’t always mean better (diminishing returns).
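Not measuring retrieval quality: Assuming a better-ranked model means better RAG. Measure it on your own data.
Mixing embedding spaces: Vectors from different models are not comparable, so changing models means re-embedding every document in your database.
Not accounting for drift: As your content and terminology change over time, re-check retrieval quality periodically and re-embed or retrain when it degrades.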
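One way to guard against mixed embedding spaces is to record which model produced an index and reject vectors from any other model. The class below is a hypothetical in-memory sketch, not the API of any real vector database. The names `VectorStore`, `model-a`, and `model-b` are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class VectorStore:
    """Hypothetical in-memory store that pins one embedding model per index."""
    model_name: str
    dims: int
    vectors: dict = field(default_factory=dict)

    def add(self, doc_id, vector, model_name):
        # Refuse vectors produced by a different model: they live in another space.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}")
        # A dimension mismatch is a second sign that the wrong model was used.
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(vector)}")
        self.vectors[doc_id] = vector

store = VectorStore(model_name="model-a", dims=3)
store.add("doc1", [0.1, 0.2, 0.3], model_name="model-a")

try:
    store.add("doc2", [0.1, 0.2, 0.3], model_name="model-b")
    rejected = False
except ValueError:
    rejected = True

print("mismatched vector rejected:", rejected)
```

The dimension check alone is not enough, since two different models can share a dimensionality while producing incompatible vectors. That is why the model name is stored alongside the index.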
Future Embedding Directions in 2024
- Multimodal embeddings: Single vector space for text, images, video
- Long-context embeddings: Handling 8K+ token contexts in single embedding
- Specialized embeddings: Task-specific models for retrieval, ranking, classification
- Adaptive embeddings: Models that adjust representation based on query characteristics
Embeddings are the interface between human language and mathematical computation. Getting them right is essential to RAG success.