Embeddings in RAG: From Text to Vectors

At the heart of modern RAG systems lies a deceptively simple idea: convert text into numbers (vectors) such that similar text produces similar numbers. These numerical representations—called embeddings—enable semantic search.

What Are Embeddings?

An embedding is a dense numerical representation of text. Instead of keywords, embeddings capture meaning.

Example:

Text: "The cat sat on the mat"
Embedding: [0.2, -0.15, 0.8, ..., -0.3]  (384-dimensional vector)

Text: "The dog sat on the floor"
Embedding: [0.21, -0.16, 0.78, ..., -0.31]  (similar to first)

Text: "Python is a programming language"
Embedding: [0.05, 0.2, -0.1, ..., 0.4]  (different from both)

Related texts have embeddings that point in similar directions in vector space.
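
"Pointing in similar directions" is usually measured with cosine similarity. The sketch below is a minimal illustration using made-up 4-dimensional vectors (not real model output) that loosely mirror the three example sentences; the variable names are chosen here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: illustrative values only, not produced by any embedding model.
cat = [0.2, -0.15, 0.8, -0.3]
dog = [0.21, -0.16, 0.78, -0.31]
python_lang = [0.05, 0.2, -0.1, 0.4]

sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_python = cosine_similarity(cat, python_lang)

print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"cat vs python: {sim_cat_python:.3f}")
```

The two "animal" vectors score close to 1.0, while the unrelated vector scores far lower. A vector database runs this same comparison (or a dot product on normalized vectors) across the whole index.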

How Embeddings Work

Modern embeddings come from neural networks trained on massive text corpora using contrastive learning or other objectives.

Training process (simplified):

Take text triplets: (query, relevant_document, non-relevant_document)
Encode each to embeddings
Adjust weights so relevant documents embed closer to queries
Repeat millions of times
Result: A model that encodes meaning

The neural network learns to extract and compress semantic information into a fixed-size vector.
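
To make the "adjust weights" step concrete, here is a minimal sketch of one common contrastive objective, a triplet margin loss over cosine similarity. This is an assumption for illustration: the article does not say which objective any particular model uses, and the margin value and toy vectors are made up. A real training loop would backpropagate this loss through the encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, positive, negative, margin=0.2):
    """Zero once the relevant document beats the non-relevant one by `margin`."""
    sim_pos = cosine_similarity(query, positive)
    sim_neg = cosine_similarity(query, negative)
    return max(0.0, margin - sim_pos + sim_neg)


# Toy 2-dimensional embeddings (illustrative values only).
query = [1.0, 0.0]
relevant_doc = [0.9, 0.1]
non_relevant_doc = [0.0, 1.0]

# Well-ordered triplet: nothing left to learn, so the loss is zero.
good_loss = triplet_loss(query, relevant_doc, non_relevant_doc)
# Mis-ordered triplet: a large loss that training would push down.
bad_loss = triplet_loss(query, non_relevant_doc, relevant_doc)
```

The loss only becomes zero when the relevant document is closer to the query than the non-relevant one by at least the margin, which is what pulls related texts together over millions of updates.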

Embedding Dimensions

Embeddings have fixed dimensionality: typically 384 to 3072 dimensions.

Dimensionality trade-offs:

Lower dimensions (384-512): Faster computation, less storage, but less expressivity
Higher dimensions (768-1536): More semantic information captured, slower
Very high (2048+): Diminishing returns for most workloads; offered by only a few models, such as text-embedding-3-large (3072d)

Common choices:

Sentence-transformers (384d): Fast, good quality
OpenAI embeddings (1536d): Expensive but high quality
Cohere/Voyage (1024d): Balanced approach

General rule: Start with 384-512 dimensions. Only increase if retrieval accuracy is poor.
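
The storage side of this trade-off is simple arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and ignores index overhead and metadata, both of which vary by vector database; the function name is hypothetical.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage in GB, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# One million vectors at the dimensionalities mentioned in this section.
for dims in (384, 1536, 3072):
    print(f"{dims:>4}d: {index_size_gb(1_000_000, dims):.2f} GB")
```

Going from 384 to 1536 dimensions quadruples raw storage, and similarity computation scales the same way, which is why starting small and measuring is a reasonable default.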

Embedding Models: The Landscape

OpenAI Embeddings

Model: text-embedding-3-large (3072d), text-embedding-3-small (1536d)

Pros:

State-of-the-art quality
Well-maintained, stable API
Excellent documentation

Cons:

Proprietary, no local control
API costs accumulate with scale
Tied to OpenAI’s updates

Use case: Companies comfortable with vendor lock-in, prioritizing quality over cost.

Sentence-Transformers (Open Source)

Popular models:

all-MiniLM-L6-v2 (384d): Small, fast, good for CPU
all-mpnet-base-v2 (768d): Larger, higher quality
multilingual models: 50+ languages supported

Pros:

Free, fully open source
Run locally or self-hosted
Fine-tuning possible for domain tasks
No API costs

Cons:

Throughput depends on your own hardware; larger models are slow without a GPU
Require infrastructure
Support quality varies

Use case: Privacy-conscious organizations, cost-sensitive deployments, domain-specific tuning needed.

Cohere Embeddings

Models: embed-english-v3.0 (1024d), embed-multilingual-v3.0 (1024d)

Pros:

High quality
Reasonably priced per million tokens
Excellent multilingual support

Cons:

Proprietary API
Requires API key management
Costs similar to OpenAI

Use case: Companies needing multilingual support and value for money.

Voyage AI

Models: voyage-3 (1024d), voyage-3-lite (512d)

Pros:

Specialized for RAG/search tasks
Good retrieval performance
Reasonable pricing

Cons:

Newer company, smaller ecosystem
Less documentation than OpenAI/Cohere

Use case: RAG-focused deployments wanting specialized models.

BGE and E5 Models

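Both are open-source model families trained at large scale and are very competitive with commercial options.

BGE: Developed by BAAI (Beijing Academy of Artificial Intelligence), strong multilingual support
E5: Developed by Microsoft, contrastively trained, strong zero-shot performance

Both are available via Hugging Face and can be self-hosted.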

Choosing an Embedding Model

Decision framework:

1. Infrastructure: Local (on-premise) or cloud?

Local → Sentence-Transformers
Cloud → OpenAI, Cohere, Voyage

2. Quality requirements: Accuracy-critical or cost-critical?

Critical → OpenAI text-embedding-3-large
Cost-sensitive → Sentence-Transformers or BGE

3. Domain specificity: Generic or specialized knowledge?

Generic → Any established model
Specialized → Fine-tune Sentence-Transformers on domain data

4. Language requirements: English-only or multilingual?

Multilingual → Cohere or multilingual Sentence-Transformers

5. Latency requirements: Sub-100ms, or is a second or more acceptable?

Sub-100ms → Local small models
Flexible → API-based models fine

Starting recommendation:

For prototyping: OpenAI text-embedding-3-small
For production cost optimization: Sentence-Transformers all-mpnet-base-v2
For quality at scale: text-embedding-3-large or BGE

Embedding Costs at Scale

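For a 1 million document knowledge base with 500 tokens/document average (500 million tokens in total):

OpenAI (API): roughly $10 (text-embedding-3-small) to $65 (text-embedding-3-large) for the initial embedding at 2024 list prices, plus ongoing updates
Self-hosted Sentence-Transformers: No per-token fees; the cost is hardware and compute time
Cohere: roughly $50 for the initial embedding at 2024 list prices

Retrieval queries (much cheaper): 100k short queries/month typically costs well under $1 in API fees, since each query is only a few dozen tokens. Prices change often, so check current provider pricing before budgeting.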
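
Estimates like these come from multiplying total tokens by the provider's per-token rate. The sketch below shows that arithmetic; the rate used is a hypothetical placeholder, not a quoted price, and should be replaced with the figure from your provider's current pricing page.

```python
def embedding_cost_usd(num_documents, avg_tokens_per_doc, usd_per_million_tokens):
    """One-off API cost to embed a corpus at a flat per-token rate."""
    total_tokens = num_documents * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical rate for illustration only; look up the real one before budgeting.
HYPOTHETICAL_RATE = 0.10  # USD per million tokens

corpus_cost = embedding_cost_usd(1_000_000, 500, HYPOTHETICAL_RATE)
# Queries are short, so monthly query cost is tiny by comparison (assumes ~20 tokens/query).
query_cost = embedding_cost_usd(100_000, 20, HYPOTHETICAL_RATE)

print(f"initial corpus: ${corpus_cost:.2f}")
print(f"monthly queries: ${query_cost:.2f}")
```

Re-embedding after a model change costs the same as the initial run, so it is worth including at least one full re-embed in any budget.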

Fine-Tuning Embeddings for Your Domain

When off-the-shelf models underperform:

Approach:

Collect domain-specific Q&A pairs (100-1000 examples)
Fine-tune a Sentence-Transformer model on your data
Deploy fine-tuned model in RAG pipeline

Benefits:

Often a measurable gain in retrieval accuracy (5-15% is a commonly cited range, but results vary by domain and data quality)
Customized to your terminology and concepts
Fully under your control

Trade-offs:

Requires labeled data
Computational cost to train
Maintenance burden increases
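
The approach above can be sketched with the sentence-transformers library. This is a minimal sketch under stated assumptions, not a tuned recipe: the record shape (dicts with "question" and "answer" keys), the helper names, the output directory, and the hyperparameters are all hypothetical, and MultipleNegativesRankingLoss is one reasonable loss for Q&A pairs rather than the only option. It uses the library's classic `model.fit` training interface.

```python
def build_training_pairs(qa_records):
    """Turn {"question", "answer"} dicts into (question, answer) pairs, skipping blanks."""
    pairs = []
    for record in qa_records:
        question = record.get("question", "").strip()
        answer = record.get("answer", "").strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs


def fine_tune(pairs, base_model="all-MiniLM-L6-v2", output_dir="finetuned-model"):
    """Fine-tune a Sentence-Transformer on (question, answer) pairs and save it."""
    # Imported lazily so the data-prep helper works without the libraries installed.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[q, a]) for q, a in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # In-batch negatives: other answers in the batch act as non-relevant documents.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
    model.save(output_dir)
    return model


# Hypothetical domain records; the second is dropped because its answer is empty.
sample_records = [
    {"question": "What is the SLA?", "answer": "99.9% uptime."},
    {"question": "Who owns billing?", "answer": "  "},
]
pairs = build_training_pairs(sample_records)
# fine_tune(pairs)  # uncomment once you have a few hundred real pairs
```

Hold out part of the labeled pairs as a test set so the fine-tuned model can be compared against the base model on the same queries.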

Embedding Quality Metrics

Retrieval quality: Test your embeddings empirically. Given known relevant documents for test queries:

Recall@K: Of all the relevant documents, what fraction appears in the top K retrieved?
NDCG: Normalized discounted cumulative gain (ranking quality)
MRR: Mean reciprocal rank (position of first correct result)

Run benchmarks before and after model changes.
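
A minimal sketch of these three metrics for a single query follows. It assumes binary relevance (a document is either relevant or not) and takes Recall@K to mean the fraction of all relevant documents found in the top K; the document IDs are made up. Averaging the per-query values over a test set gives the benchmark numbers, and the mean of the reciprocal ranks is MRR.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical result list for one test query.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, k=4)
```

Here one of the two relevant documents is in the top 3 and the first relevant hit is at rank 2, so both Recall@3 and the reciprocal rank are 0.5.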

Common Embedding Mistakes

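Embedding queries and documents inconsistently: Both must go through the same model. Some models also distinguish the two sides (E5 expects "query: " and "passage: " prefixes; Cohere takes an input_type of search_query or search_document), and using the wrong mode silently hurts retrieval.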

Ignoring dimensionality implications: Higher dimensions don’t always mean better (diminishing returns).

Not measuring retrieval quality: Assuming better model = better RAG (measure it!).

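Mixing embedding spaces: Vectors from different models are not comparable, so switching models means re-embedding the entire corpus.

Not accounting for drift: As your corpus and vocabulary change, retrieval quality can degrade; re-run benchmarks periodically and re-embed or re-tune when it does.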

Future Embedding Directions in 2024

Multimodal embeddings: Single vector space for text, images, video
Long-context embeddings: Handling 8K+ token contexts in a single embedding
Specialized embeddings: Task-specific models for retrieval, ranking, classification
Adaptive embeddings: Models that adjust representation based on query characteristics

Embeddings are the interface between human language and mathematical computation. Getting them right is essential to RAG success.