Embeddings
Embeddings are the backbone of semantic search, RAG, recommendation systems, and anomaly detection in AI applications. Once you understand them, a huge amount of modern AI architecture makes intuitive sense.
What Is an Embedding?
An embedding is a way of representing a piece of text (or image, audio, or any data) as a list of numbers, called a vector, in a high-dimensional space. The key property is that similar meanings map to vectors that are close together.
"I love dogs" → [0.23, -0.14, 0.87, 0.42, ..., 0.19] (1536 numbers)
"I adore puppies" → [0.25, -0.12, 0.84, 0.44, ..., 0.21] (very similar)
"The stock fell" → [-0.67, 0.44, -0.22, 0.11, ..., -0.38] (very different)
The numbers themselves are meaningless in isolation. What matters is the geometry: the distances and angles between vectors.
Why This Works
Before modern embeddings, we had simple representations like bag-of-words (count how many times each word appears). These ignore word order and context, so they miss meaning: “not good” and “quite good” look similar because they share the word “good”.
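To make the bag-of-words limitation concrete, here is a minimal sketch using only the standard library. The helper names (`bag_of_words`, `bow_cosine`) and the example sentences are illustrative assumptions, not part of any library.

```python
from collections import Counter
import math

def bag_of_words(text: str) -> Counter:
    """Count how many times each lowercase word appears."""
    return Counter(text.lower().split())

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    ca, cb = bag_of_words(a), bag_of_words(b)
    dot = sum(ca[w] * cb[w] for w in ca)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Opposite sentiment, but four of five words are shared.
sim_opposite = bow_cosine("the movie was not good", "the movie was quite good")
# Similar sentiment, but no words in common.
sim_paraphrase = bow_cosine("the movie was quite good", "a really enjoyable film")

print(sim_opposite)    # 0.8 (up to floating-point rounding)
print(sim_paraphrase)  # 0.0
```

Word counts rate the contradictory pair as highly similar and the paraphrase as completely unrelated, which is the gap that learned embeddings are meant to close.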
Embedding models learn from context. They’re trained to represent words and phrases in ways that capture how they’re actually used together in real text. As a result, “king” - “man” + “woman” actually gives you something close to “queen” in the vector space. This is the famous word2vec analogy property that first demonstrated the power of embedding representations.
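The analogy arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-made for illustration (the axes loosely stand for "royalty", "gender", and "fruit") and are not taken from word2vec or any real model. Real embeddings have hundreds of dimensions with no such clean interpretation.

```python
import numpy as np

# Hand-made toy vectors (an assumption for illustration, not real embeddings).
toy = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = toy[a] - toy[b] + toy[c]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the input words from the candidates mirrors how such analogy queries are usually evaluated, since the nearest vector to the result is often one of the inputs.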
Modern sentence/document embeddings (like those from OpenAI, Cohere, and Voyage) extend this to entire passages, capturing full-sentence meaning.
How Embeddings Are Trained
Most modern text embeddings come from transformer models fine-tuned for semantic similarity. Training typically uses contrastive learning:
Positive pairs (similar meaning, should be close): ("What is machine learning?", "ML is a subfield of AI...")
Negative pairs (different meaning, should be far apart): ("What is machine learning?", "Recipe for chocolate cake...")
Loss: push positive pairs together, push negatives apart
The training data often includes pairs from search queries + their relevant results, question-answer pairs, and NLI (natural language inference) datasets.
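The "push together, push apart" objective can be sketched as an in-batch softmax loss (often called InfoNCE). This is one common formulation, shown here in NumPy on random stand-in vectors. The text above does not say which loss any particular model uses, and the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: docs[i] is the positive for queries[i],
    and every other row of docs acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature                       # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # want the diagonal to win

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))                        # stand-in query embeddings
aligned_docs = queries + 0.01 * rng.normal(size=(4, 8))  # positives close to their queries
shuffled_docs = aligned_docs[[1, 2, 3, 0]]               # positives paired with the wrong query

loss_aligned = info_nce_loss(queries, aligned_docs)
loss_shuffled = info_nce_loss(queries, shuffled_docs)
print(loss_aligned, loss_shuffled)
```

The loss is low when each query is closest to its own positive and high when the pairing is wrong. Training adjusts the model so that real positive pairs end up in the low-loss arrangement.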
Measuring Similarity: Cosine and Dot Product
Once you have embeddings, you need a way to measure similarity. Two main options:
Cosine Similarity
Measures the angle between two vectors. Returns a value from -1 (pointing in opposite directions) to +1 (pointing in the same direction), with 0 meaning orthogonal, which in practice reads as unrelated.
similarity = (A · B) / (|A| × |B|)
"cat" and "kitten" → 0.87 (very similar)
"cat" and "bank" → 0.12 (mostly unrelated)
"hot" and "cold" → -0.23 (somewhat opposite)
Best for: Normalized embeddings, when you want to compare meaning independent of vector magnitude.
Dot Product
Simply multiplies and sums corresponding values. Faster to compute. If vectors are normalized to unit length, it equals cosine similarity.
Best for: When embeddings are trained with dot product similarity (check the model’s recommendation).
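The relationship between the two measures is easy to check numerically. The sketch below uses small hand-picked vectors (not real embeddings) to show that the raw dot product depends on magnitude, while the dot product of unit-length vectors matches cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0, 0.0])  # length 5
b = np.array([1.0, 2.0, 2.0])  # length 3

raw_dot = float(np.dot(a, b))                         # 11.0, grows with vector length
unit_dot = float(np.dot(normalize(a), normalize(b)))  # magnitude removed
cos = cosine_similarity(a, b)                         # 11 / (5 * 3)

print(raw_dot, unit_dot, cos)
```

If your vectors are already unit length, you can skip the division and use the plain dot product for ranking.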
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Using OpenAI embeddings
from openai import OpenAI
client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

e1 = embed("The quick brown fox")
e2 = embed("A fast tan fox")
print(cosine_similarity(np.array(e1), np.array(e2)))  # ~0.92

Choosing an Embedding Model (2025–2026)
| Model | Dimensions | Best For | Notes |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | General use, cost-effective | $0.02/1M tokens |
| text-embedding-3-large (OpenAI) | 3072 | Best OpenAI quality | $0.13/1M tokens |
| voyage-large-2 (Voyage AI) | 1536 | Top retrieval quality | Strong on BEIR benchmarks |
| embed-multilingual-v3.0 (Cohere) | 1024 | Multilingual retrieval | Excellent cross-language |
| BGE-M3 (BAAI, open source) | 1024 | Local deployment | Free, competitive quality |
| E5-mistral-7b (open source) | 4096 | Best open-source | Larger, slower, but strong |
Rule of thumb: For production, test voyage-large-2 or text-embedding-3-large. For local/private use, BGE-M3. For cost-sensitive at scale, text-embedding-3-small.
Embedding Dimensions: Bigger Isn’t Always Better
Higher-dimensional embeddings capture more nuance but:
- Cost more to store (1536 floats × 4 bytes = 6KB per vector)
- Are slower to search at scale
- Can cause the “curse of dimensionality” in nearest-neighbor search
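The storage cost in the first bullet scales linearly with both vector count and dimension. A quick back-of-the-envelope sketch, assuming float32 values and ignoring index overhead and metadata (the function name is illustrative):

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for float32 vectors, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value

per_vector = index_size_bytes(1, 1536)
one_million_full = index_size_bytes(1_000_000, 1536)
one_million_small = index_size_bytes(1_000_000, 256)

print(per_vector)               # 6144 bytes, i.e. 6 KiB per vector
print(one_million_full / 1e9)   # 6.144 (GB for one million 1536-dim vectors)
print(one_million_small / 1e9)  # 1.024 (GB for one million 256-dim vectors)
```

Real vector databases add their own index structures on top of this, so treat these numbers as a lower bound.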
Modern models like text-embedding-3-small are trained with Matryoshka Representation Learning, so you can truncate the embedding to fewer dimensions (e.g., 256) and still get reasonable quality, trading some accuracy for lower storage and faster search.
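If you already have full-length vectors stored, truncation can also be done locally: keep the leading dimensions and rescale to unit length so that cosine and dot product still agree. The sketch below uses a random stand-in vector rather than a real embedding, and `truncate_embedding` is a hypothetical helper. This only preserves quality for models trained with the Matryoshka objective.

```python
import numpy as np

def truncate_embedding(vec, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale to unit length."""
    head = np.asarray(vec, dtype=np.float64)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return head / norm

# Stand-in for a real 1536-dim embedding (random values, illustration only).
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)
print(short.shape)            # (256,)
print(np.linalg.norm(short))  # approximately 1.0
```

Vectors truncated to different lengths are not comparable with each other, so queries and documents should use the same dimension count.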
# Matryoshka embeddings: truncate to 256 dims for speed
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
    dimensions=256  # truncate from 1536 to 256
)

Beyond Text: Multimodal Embeddings
Modern embedding models increasingly handle multiple modalities in the same vector space:
CLIP (OpenAI): Images and text in the same space. A photo of a dog and the text “puppy” land near each other. Powers reverse image search and zero-shot image classification.
ImageBind (Meta): Aligns text, images, audio, video, depth, and thermal in one embedding space. Query with audio, find matching videos.
Voyage Multimodal 3: Handles text and images together for document understanding (PDFs with mixed text and images).
Practical Considerations
Chunking Affects Embedding Quality
A 512-token chunk embeds as one vector. If that chunk contains multiple topics, the embedding splits its “attention” between them, making retrieval less precise. Smaller, more focused chunks generally retrieve better.
Long Documents Lose Detail
Embedding a 10,000-word document as one vector loses tremendous granularity — everything averages together. Always chunk before embedding for retrieval purposes.
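A minimal chunking sketch, assuming whitespace-separated words as a rough proxy for tokens (a real pipeline would count tokens with the embedding model's tokenizer and might split on headings or paragraphs instead). The function name, chunk size, and overlap are illustrative choices, not recommendations.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Synthetic 500-word document for demonstration.
document = " ".join(f"word{i}" for i in range(500))
chunks = chunk_words(document, chunk_size=200, overlap=40)

print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [200, 200, 180]
```

The overlap keeps a sentence that straddles a boundary present in both neighboring chunks, at the cost of embedding some text twice.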
Domain Adaptation
Generic embeddings work well for general text. For specialized domains (legal, medical, financial), fine-tuning embeddings on domain-specific similarity pairs can improve retrieval quality significantly.
Embedding Freshness
Embeddings are static — they don’t update as knowledge evolves. When you update a document, you must re-embed and re-index it. Design your indexing pipeline with incremental updates in mind.
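One way to support incremental updates is to store a content hash next to each indexed document and re-embed only what changed. The sketch below is a hypothetical planning step: `plan_reindex`, the document IDs, and the in-memory dictionaries are assumptions standing in for your own store and index.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(documents: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare current documents with the hashes saved at the last indexing run.

    Returns (ids to embed or re-embed, ids to delete from the index)."""
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete

# Hashes recorded when the index was last built.
indexed_hashes = {
    "a": content_hash("Refund policy: 30 days."),
    "b": content_hash("Shipping takes 5 days."),
    "c": content_hash("Old page."),
}
# Current state of the source documents.
documents = {
    "a": "Refund policy: 30 days.",  # unchanged
    "b": "Shipping takes 3 days.",   # edited
    "d": "New FAQ entry.",           # added
}

to_embed, to_delete = plan_reindex(documents, indexed_hashes)
print(to_embed)   # ['b', 'd']
print(to_delete)  # ['c']
```

Only the edited and new documents are sent to the embedding model, which keeps cost proportional to the amount of change rather than the size of the corpus.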
Visualizing the Embedding Space
A practical sanity check: use UMAP or t-SNE to reduce your embeddings to 2D and plot them. Well-functioning embeddings will show clear clusters by topic:
import numpy as np
import umap  # provided by the umap-learn package
import matplotlib.pyplot as plt

# embeddings: one vector per document; labels: one integer topic id per document
# Reduce 1536-dim embeddings to 2D
reducer = umap.UMAP(n_components=2, random_state=42)
coords_2d = reducer.fit_transform(np.array(embeddings))

plt.scatter(coords_2d[:, 0], coords_2d[:, 1], c=labels, cmap='tab10')
plt.show()
# Should show distinct clusters for each document topic

If your embedding clusters don’t make intuitive sense, your chunking or embedding model choice might need adjustment.