Embeddings in RAG: Converting Text to Vectors for Semantic Search

Understand how embeddings convert text into numerical vectors. Learn embedding models, dimensions, and how to choose models for RAG systems.

Embeddings in RAG: From Text to Vectors

At the heart of modern RAG systems lies a deceptively simple idea: convert text into numbers (vectors) such that similar text produces similar numbers. These numerical representations—called embeddings—enable semantic search.

What Are Embeddings?

An embedding is a dense numerical representation of text. Instead of keywords, embeddings capture meaning.

Example:

Text: "The cat sat on the mat"
Embedding: [0.2, -0.15, 0.8, ..., -0.3] (384-dimensional vector)
Text: "The dog sat on the floor"
Embedding: [0.21, -0.16, 0.78, ..., -0.31] (similar to first)
Text: "Python is a programming language"
Embedding: [0.05, 0.2, -0.1, ..., 0.4] (different from both)

Related texts have embeddings that point in similar directions in vector space.
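
"Pointing in similar directions" is usually measured with cosine similarity. The sketch below uses invented 4-dimensional toy vectors that loosely echo the example above; real embeddings have hundreds of dimensions and come from a model, not from hand-typed numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration only (not output of any real model).
cat = [0.2, -0.15, 0.8, -0.3]
dog = [0.21, -0.16, 0.78, -0.31]
python_lang = [0.05, 0.2, -0.1, 0.4]

sim_cat_dog = cosine_similarity(cat, dog)             # close to 1: similar direction
sim_cat_python = cosine_similarity(cat, python_lang)  # much lower: different direction
print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"cat vs python: {sim_cat_python:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for text embeddings.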

How Embeddings Work

Modern embeddings come from neural networks trained on massive text corpora using contrastive learning or other objectives.

Training process (simplified):

  1. Take text triplets: (query, relevant_document, non-relevant_document)
  2. Encode each to embeddings
  3. Adjust weights so relevant documents embed closer to queries
  4. Repeat millions of times
  5. Result: A model that encodes meaning

The neural network learns to extract and compress semantic information into a fixed-size vector.

Embedding Dimensions

Embeddings have fixed dimensionality: typically 384 to 3072 dimensions.

Dimensionality trade-offs:

  • Lower dimensions (384-512): Faster computation, less storage, but less expressivity
  • Higher dimensions (768-1536): More semantic information captured, slower
  • Very high (2048+): Often diminishing returns; less common, though some models (for example text-embedding-3-large at 3072d) use them

Common choices:

  • Sentence-Transformers all-MiniLM-L6-v2 (384d): Fast, good quality
  • OpenAI text-embedding-3-small (1536d): High quality, billed per token
  • Cohere/Voyage (1024d): Balanced approach

General rule: Start with 384-512 dimensions. Only increase if retrieval accuracy is poor.
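
The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and ignores index overhead and metadata, both of which vary by vector database; the function name is illustrative.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value

# Compare the dimensionalities mentioned in this article for one million vectors.
for dims in (384, 768, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>4}d: {gib:.2f} GiB per million vectors")
```

Storage grows linearly with dimensionality, so moving from 384d to 1536d quadruples the raw vector footprint.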

Embedding Models: The Landscape

OpenAI Embeddings

Model: text-embedding-3-large (3072d), text-embedding-3-small (1536d)

Pros:

  • State-of-the-art quality
  • Well-maintained, stable API
  • Excellent documentation

Cons:

  • Proprietary, no local control
  • API costs accumulate with scale
  • Tied to OpenAI’s updates

Use case: Companies comfortable with vendor lock-in, prioritizing quality over cost.

Sentence-Transformers (Open Source)

Popular models:

  • all-MiniLM-L6-v2 (384d): Small, fast, good for CPU
  • all-mpnet-base-v2 (768d): Larger, higher quality
  • multilingual models: 50+ languages supported

Pros:

  • Free, fully open source
  • Run locally or self-hosted
  • Fine-tuning possible for domain tasks
  • No API costs

Cons:

  • Throughput depends on your hardware; can be slower than cloud APIs without a GPU
  • Require infrastructure
  • Support quality varies

Use case: Privacy-conscious organizations, cost-sensitive deployments, domain-specific tuning needed.

Cohere Embeddings

Model: embed-english-v3.0 (1024d), embed-multilingual-v3.0 (1024d)

Pros:

  • High quality
  • Reasonably priced per million tokens
  • Excellent multilingual support

Cons:

  • Proprietary API
  • Requires API key management
  • Costs similar to OpenAI

Use case: Companies needing multilingual support and value for money.

Voyage AI

Model: voyage-3 (1024d), voyage-3-lite (512d)

Pros:

  • Specialized for RAG/search tasks
  • Good retrieval performance
  • Reasonable pricing

Cons:

  • Newer company, smaller ecosystem
  • Less documentation than OpenAI/Cohere

Use case: RAG-focused deployments wanting specialized models.

BGE and E5 Models

Large-scale training, very competitive open-source options.

  • BGE: Developed by BAAI (Beijing Academy of Artificial Intelligence), strong multilingual support
  • E5: Developed by Microsoft, contrastively trained, strong zero-shot performance

Both available via Hugging Face, can be self-hosted.

Choosing an Embedding Model

Decision framework:

1. Infrastructure: Local (on-premise) or cloud?

  • Local → Sentence-Transformers
  • Cloud → OpenAI, Cohere, Voyage

2. Quality requirements: Accuracy-critical or cost-critical?

  • Critical → OpenAI text-embedding-3-large
  • Cost-sensitive → Sentence-Transformers or BGE

3. Domain specificity: Generic or specialized knowledge?

  • Generic → Any established model
  • Specialized → Fine-tune Sentence-Transformers on domain data

4. Language requirements: English-only or multilingual?

  • Multilingual → Cohere or multilingual Sentence-Transformers

5. Latency requirements: Sub-100ms, or is a second or more acceptable?

  • Sub-100ms → Local small models
  • Flexible → API-based models fine
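
As a rough sketch, the framework can be written as a small function. The order in which the questions are checked is an assumption made for this example (the framework itself does not rank them), the function and parameter names are hypothetical, and the latency question is left out for brevity.

```python
def suggest_embedding_model(local_only, accuracy_critical, multilingual, needs_domain_tuning):
    """Map answers to the framework's questions onto models named in this article.

    The precedence of the checks is an assumption, not part of the framework.
    """
    if needs_domain_tuning:
        # Specialized domains: fine-tune an open model on domain data.
        return "Sentence-Transformers (fine-tuned)"
    if local_only:
        if multilingual:
            return "Sentence-Transformers (multilingual)"
        return "Sentence-Transformers"
    if multilingual:
        return "Cohere multilingual"
    if accuracy_critical:
        return "OpenAI text-embedding-3-large"
    return "OpenAI text-embedding-3-small"

choice = suggest_embedding_model(
    local_only=False,
    accuracy_critical=True,
    multilingual=False,
    needs_domain_tuning=False,
)
print(choice)
```

Treat the result as a starting point to benchmark, not a final answer.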

Starting recommendation:

  • For prototyping: OpenAI text-embedding-3-small
  • For production cost optimization: Sentence-Transformers all-mpnet-base-v2
  • For quality at scale: OpenAI text-embedding-3-large or BGE

Embedding Costs at Scale

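For a 1 million document knowledge base averaging 500 tokens per document, the initial embedding pass covers about 500 million tokens:

  • OpenAI (API): roughly $10 with text-embedding-3-small or $65 with text-embedding-3-large at early-2024 list prices ($0.02 and $0.13 per million tokens), plus ongoing updates
  • Cohere: roughly $50 at about $0.10 per million tokens
  • Self-hosted Sentence-Transformers: no per-token fees; the cost is hardware and operations

Retrieval queries are much cheaper because they are short: 100k queries/month usually costs a few dollars at most, depending on provider and query length. Provider prices change, so recompute with current rates.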
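
The arithmetic behind an initial-embedding estimate can be sketched as follows. The function name and the price passed in are placeholders chosen for illustration, not quoted rates; substitute your provider's current price per million tokens.

```python
def embedding_cost_usd(num_documents, avg_tokens_per_document, price_per_million_tokens):
    """Estimated API cost of embedding a corpus once.

    Assumes billing is linear in tokens, with no minimums or volume discounts.
    """
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens

# Corpus from the scenario above: 1 million documents, 500 tokens each.
total_tokens = 1_000_000 * 500

# Placeholder price (assumption): 0.10 USD per million tokens.
example_cost = embedding_cost_usd(1_000_000, 500, price_per_million_tokens=0.10)
print(f"{total_tokens:,} tokens -> ${example_cost:,.2f}")
```

Re-embedding after a model change costs the same as the initial pass, so it is worth including at least one full re-embedding in any budget.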

Fine-Tuning Embeddings for Your Domain

When off-the-shelf models underperform:

Approach:

  1. Collect domain-specific Q&A pairs (100-1000 examples)
  2. Fine-tune a Sentence-Transformer model on your data
  3. Deploy fine-tuned model in RAG pipeline
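
Step 1 usually ends with turning Q&A pairs into training examples. The sketch below builds (query, positive, negative) triplets using a random other answer as the negative. This is the simplest possible choice and an assumption of this example; harder negatives often train better. The sample data is invented, and the exact input format your training library expects is not shown here.

```python
import random

def build_triplets(qa_pairs, seed=0):
    """Pair each question with its own answer and a randomly chosen other answer.

    Assumes answers are distinct; duplicated answers could yield a
    "negative" that is actually correct.
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least two Q&A pairs to sample a negative")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    triplets = []
    for i, (question, answer) in enumerate(qa_pairs):
        other_indices = [j for j in range(len(qa_pairs)) if j != i]
        negative = qa_pairs[rng.choice(other_indices)][1]
        triplets.append((question, answer, negative))
    return triplets

# Hypothetical domain data, for illustration only.
qa_pairs = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("What is the refund window?", "Refunds are accepted within 30 days of purchase."),
    ("Which plans include SSO?", "SSO is available on the Enterprise plan."),
]
triplets = build_triplets(qa_pairs)
```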

Benefits:

  • Often a 5-15% improvement in retrieval accuracy, depending on the domain and the quality of the training data
  • Customized to your terminology and concepts
  • Fully under your control

Trade-offs:

  • Requires labeled data
  • Computational cost to train
  • Maintenance burden increases

Embedding Quality Metrics

Retrieval quality: Test your embeddings empirically. Given known relevant documents for test queries:

  • Recall@K: Of all the relevant documents, what fraction appears in the top K results?
  • NDCG: Normalized discounted cumulative gain (ranking quality)
  • MRR: Mean reciprocal rank (the average over queries of 1/rank of the first relevant result)

Run benchmarks before and after model changes.
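
A minimal sketch of these metrics, using their standard definitions with binary relevance (a document is either relevant or not). Here recall@k is the fraction of all relevant documents found in the top k, and MRR is the mean of the per-query reciprocal rank. Document IDs are invented, and each result list is assumed to contain no duplicates.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One test query: the ranked results and the known relevant set.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, k=4)

# MRR is the mean of reciprocal_rank over all test queries (only one here).
mrr = sum([rr]) / 1
```

Averaging each metric over a fixed set of test queries gives a baseline to compare against after any model, chunking, or index change.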

Common Embedding Mistakes

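Embedding queries and documents inconsistently: Queries and documents must be embedded with the same model so they share one vector space. Some models also expect a different prefix or input type for each side (for example, E5's "query: " and "passage: " prefixes), so follow the model's documented usage on both sides.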

Ignoring dimensionality implications: Higher dimensions don’t always mean better (diminishing returns).

Not measuring retrieval quality: Assuming better model = better RAG (measure it!).

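Mixing embedding spaces: Vectors from different models are not comparable, so changing models means re-embedding the whole corpus rather than adding new vectors alongside the old ones.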

Not accounting for drift: Semantic drift over time requires periodic re-embedding.
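
One inexpensive guard against mixing embedding spaces is to store the model name next to every vector and check it at query time. The sketch below is a hypothetical in-memory illustration: `StoredVector` and `check_compatible` are names invented for this example, and a real vector database would hold the model name as record or collection metadata.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_name: str  # which embedding model produced `values`
    values: tuple

def check_compatible(stored, query_model_name, query_vector):
    """Raise ValueError if the query was embedded in a different vector space."""
    if stored.model_name != query_model_name:
        raise ValueError(
            f"model mismatch: index uses {stored.model_name!r}, "
            f"query uses {query_model_name!r}"
        )
    if len(stored.values) != len(query_vector):
        raise ValueError("dimension mismatch between stored and query vectors")

# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 vectors have 384 dimensions.
stored = StoredVector("doc-1", "all-MiniLM-L6-v2", (0.1, 0.2, 0.3))

check_compatible(stored, "all-MiniLM-L6-v2", (0.3, 0.1, 0.0))  # passes silently

try:
    check_compatible(stored, "all-mpnet-base-v2", (0.3, 0.1, 0.0))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Matching dimensions alone prove nothing, since two different models can share a dimensionality; that is why the check compares model names first.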

Future Embedding Directions in 2024

  • Multimodal embeddings: Single vector space for text, images, video
  • Long-context embeddings: Handling 8K+ token contexts in single embedding
  • Specialized embeddings: Task-specific models for retrieval, ranking, classification
  • Adaptive embeddings: Models that adjust representation based on query characteristics

Embeddings are the interface between human language and mathematical computation. Getting them right is essential to RAG success.