Embeddings

Embeddings are the backbone of semantic search, RAG, recommendation systems, and anomaly detection in AI applications. Once you understand them, a huge amount of modern AI architecture makes intuitive sense.

What Is an Embedding?

An embedding is a way of representing a piece of text (or image, audio, or any data) as a list of numbers — a vector — in a high-dimensional space. The key property is that similar meanings map to vectors that are close together.

"I love dogs"     → [0.23, -0.14, 0.87, 0.42, ..., 0.19]  (1536 numbers)
"I adore puppies" → [0.25, -0.12, 0.84, 0.44, ..., 0.21]  (very similar)
"The stock fell"  → [-0.67, 0.44, -0.22, 0.11, ..., -0.38] (very different)

The numbers themselves are meaningless in isolation. What matters is the geometry — the distances and angles between vectors.

Why This Works

Before modern embeddings, we had simple representations like bag-of-words (count how many times each word appears). These miss meaning in both directions: “not good” and “quite good” look nearly identical because they share the word “good”, while “great” and “excellent” look unrelated because they share no words at all.
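To make the overlap problem concrete, here is a minimal sketch of a bag-of-words comparison. The helper names and example sentences are made up for illustration, and word overlap is measured with a simple Jaccard ratio rather than any particular library's scoring.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count how many times each lowercase word appears; word order is discarded."""
    return Counter(text.lower().split())

def shared_word_fraction(a: str, b: str) -> float:
    """Jaccard overlap of the two vocabularies: shared words / all distinct words."""
    words_a, words_b = set(bag_of_words(a)), set(bag_of_words(b))
    return len(words_a & words_b) / len(words_a | words_b)

# Opposite sentiment, but the sentences differ by a single word.
overlap_opposite = shared_word_fraction(
    "the movie was not good", "the movie was quite good"
)
# Same sentiment, but no shared vocabulary at all.
overlap_synonym = shared_word_fraction(
    "the movie was quite good", "a really enjoyable film"
)

print(overlap_opposite, overlap_synonym)
```

The pair with opposite meaning scores far higher than the pair with the same meaning, which is exactly the failure that learned embeddings are meant to fix.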

Embedding models learn from context. They’re trained to represent words and phrases in ways that capture how they’re actually used together in real text. As a result, “king” - “man” + “woman” actually gives you something close to “queen” in the vector space. This is the famous word2vec analogy property that first demonstrated the power of embedding representations.
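The analogy arithmetic can be sketched with hand-made vectors. Everything below is a toy assumption: the two axes (roughly "royalty" and "maleness") and all values are invented for illustration and do not come from word2vec or any trained model.

```python
import numpy as np

# Hand-made 2-D toy vectors (NOT model output): axis 0 ~ royalty, axis 1 ~ maleness.
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.0, 0.5]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = toy_vectors[a] - toy_vectors[b] + toy_vectors[c]
    candidates = {w: v for w, v in toy_vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman")
print(result)
```

Real embedding spaces have hundreds of dimensions and the analogy holds only approximately, but the mechanics are the same: subtract one direction, add another, then look for the nearest neighbor.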

Modern sentence/document embeddings (like those from OpenAI, Cohere, and Voyage) extend this to entire passages, capturing full-sentence meaning.

How Embeddings Are Trained

Most modern text embeddings come from transformer models fine-tuned for semantic similarity. Training typically uses contrastive learning:

Positive pairs (similar meaning, should be close):
  ("What is machine learning?", "ML is a subfield of AI...")

Negative pairs (different meaning, should be far apart):
  ("What is machine learning?", "Recipe for chocolate cake...")

Loss: push positive pairs together, push negatives apart

The training data often includes pairs from search queries + their relevant results, question-answer pairs, and NLI (natural language inference) datasets.
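One common way to write the "push together, push apart" objective is an in-batch contrastive loss in the InfoNCE style. The sketch below is an assumption about the general shape of such a loss, not the recipe of any specific model; the function name, the temperature value, and the random vectors are all illustrative.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of `queries` should match row i of `docs`;
    every other row in the batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature              # (batch, batch) scaled cosine scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability for the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # cross-entropy against the diagonal

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
aligned_queries = docs + 0.01 * rng.normal(size=(4, 8))  # each query sits next to its positive
shuffled_queries = aligned_queries[[1, 2, 3, 0]]          # each query paired with a wrong doc

loss_aligned = info_nce_loss(aligned_queries, docs)
loss_shuffled = info_nce_loss(shuffled_queries, docs)
print(loss_aligned, loss_shuffled)
```

The loss is low when each query is closest to its own positive and high when the pairing is wrong, which is the signal that training follows when it updates the model.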

Measuring Similarity: Cosine and Dot Product

Once you have embeddings, you need a way to measure similarity. Two main options:

Cosine Similarity

Measures the angle between two vectors. Returns a value from -1 (pointing in opposite directions) to +1 (pointing in the same direction), with 0 meaning the vectors are orthogonal, which in practice reads as unrelated.

similarity = (A · B) / (|A| × |B|)

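"cat" and "kitten" → 0.87  (very similar)
"cat" and "bank"   → 0.12  (mostly unrelated)
"hot" and "cold"   → -0.23 (somewhat opposite)

These scores are illustrative, not measured. Real values depend on the model, and antonyms such as "hot" and "cold" often score fairly high in practice because they appear in similar contexts.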

Best for: Comparing meaning independent of vector magnitude, especially when you cannot assume the embeddings are normalized to unit length.

Dot Product

Multiplies corresponding values and sums the results. It is cheaper to compute because it skips the norm calculation. If both vectors are normalized to unit length, it equals cosine similarity.

Best for: When embeddings are trained with dot product similarity (check the model’s recommendation).
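The relationship between the two measures can be checked with a few lines of NumPy. The vectors here are small made-up examples, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0, 0.0])   # length 5
b = np.array([1.0, 2.0, 2.0])   # length 3

raw_dot = float(np.dot(a, b))                                # depends on magnitude
cos = raw_dot / float(np.linalg.norm(a) * np.linalg.norm(b)) # magnitude-independent
unit_dot = float(np.dot(normalize(a), normalize(b)))         # dot product of unit vectors

# Scaling a vector changes the dot product but not the cosine.
scaled_dot = float(np.dot(10 * a, b))
scaled_cos = scaled_dot / float(np.linalg.norm(10 * a) * np.linalg.norm(b))

print(raw_dot, cos, unit_dot, scaled_dot, scaled_cos)
```

If your vectors are already unit length, you can store them once and use the plain dot product at query time without changing the ranking.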

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def embed(text: str) -> list[float]:
    # Request a single embedding from the OpenAI embeddings endpoint
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

e1 = embed("The quick brown fox")
e2 = embed("A fast tan fox")
# Paraphrases should score well above unrelated text; the exact value depends on the model
print(cosine_similarity(np.array(e1), np.array(e2)))
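Once every document has a vector, semantic search reduces to ranking those vectors by similarity to the query vector. The sketch below uses toy 3-dimensional vectors in place of real embeddings so it runs without an API key; `top_k` is a hypothetical helper, and a production system would typically use a vector index rather than a full scan.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows of doc_vecs closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # one cosine score per document
    best = np.argsort(scores)[::-1][:k]   # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]

# Toy 3-D vectors standing in for real embeddings.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.1, 0.9, 0.0],   # doc 1
    [0.8, 0.2, 0.1],   # doc 2
])
query_vec = np.array([1.0, 0.0, 0.0])

hits = top_k(query_vec, doc_vecs, k=2)
for index, score in hits:
    print(index, round(score, 3))
```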

Choosing an Embedding Model (2025–2026)

Model	Dimensions	Best For	Notes
text-embedding-3-small (OpenAI)	1536	General use, cost-effective	$0.02/1M tokens
text-embedding-3-large (OpenAI)	3072	Best OpenAI quality	$0.13/1M tokens
voyage-large-2 (Voyage AI)	1536	Top retrieval quality	Strong on BEIR benchmarks
embed-multilingual-v3.0 (Cohere)	1024	Multilingual retrieval	Excellent cross-language
BGE-M3 (BAAI, open source)	1024	Local deployment	Free, competitive quality
E5-mistral-7b (open source)	4096	Best open-source	Larger, slower, but strong

Rule of thumb: For production, test voyage-large-2 or text-embedding-3-large. For local/private use, BGE-M3. For cost-sensitive at scale, text-embedding-3-small.

Embedding Dimensions: Bigger Isn’t Always Better

Higher-dimensional embeddings capture more nuance but:

Cost more to store (1536 floats × 4 bytes = 6,144 bytes, about 6 KB per vector)
Are slower to search at scale
Can cause the “curse of dimensionality” in nearest-neighbor search
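The storage cost in the list above is simple arithmetic, and it adds up quickly at scale. This sketch assumes float32 values (4 bytes each) and counts only the raw vectors, ignoring index overhead and metadata; the function name is made up for illustration.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for float32 vectors, ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value

per_vector = index_size_bytes(1, 1536)             # one 1536-dim vector
million_full = index_size_bytes(1_000_000, 1536)   # one million 1536-dim vectors
million_small = index_size_bytes(1_000_000, 256)   # one million 256-dim vectors

for dims in (256, 1024, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>5} dims x 1M vectors = {gib:.2f} GiB")
```

Dropping from 1536 to 256 dimensions cuts raw storage by a factor of six, which is why the truncation trick described next is attractive for large indexes.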

Modern models like text-embedding-3-small are trained with Matryoshka Representation Learning, so you can truncate the embedding to fewer dimensions (e.g., 256) and still get reasonable quality, trading some accuracy for speed and storage.

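# Matryoshka embeddings: request 256 dims instead of the default 1536
# Reuses `client` from the earlier OpenAI example
text = "The quick brown fox"  # example input
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
    dimensions=256  # shortened from 1536 to 256
)
short_embedding = response.data[0].embedding  # list of 256 floats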
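The API parameter shortens vectors at request time. If you have already stored full-length vectors, a similar effect can be sketched locally by slicing and re-normalizing. This assumes the vectors came from a model trained so that leading dimensions carry the most information; for other models, truncation may lose far more quality. The helper name and the random stand-in vector are illustrative only.

```python
import numpy as np

def truncate_embedding(vec, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale to unit length so that
    cosine and dot-product scores stay comparable."""
    short = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(short)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return short / norm

# Random stand-in for a stored 1536-dim embedding (not real model output).
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)
print(short.shape, float(np.linalg.norm(short)))
```

Re-normalizing matters because a slice of a unit vector is shorter than unit length, which would skew dot-product scores if left uncorrected.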

Beyond Text: Multimodal Embeddings

Modern embedding models increasingly handle multiple modalities in the same vector space:

CLIP (OpenAI): Images and text in the same space. A photo of a dog and the text “puppy” land near each other. Powers reverse image search and zero-shot image classification.

ImageBind (Meta): Aligns text, images, audio, video, depth, and thermal in one embedding space. Query with audio, find matching videos.

Voyage Multimodal 3: Handles text and images together for document understanding (PDFs with mixed text and images).

Practical Considerations

Chunking Affects Embedding Quality

A 512-token chunk embeds as one vector. If that chunk contains multiple topics, the embedding splits its “attention” between them, making retrieval less precise. Smaller, more focused chunks generally retrieve better.

Long Documents Lose Detail

Embedding a 10,000-word document as one vector loses tremendous granularity — everything averages together. Always chunk before embedding for retrieval purposes.
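A minimal chunking sketch is shown below. It splits on whitespace and counts words as a rough stand-in for tokens, which is an assumption; real pipelines usually count tokens with the model's tokenizer and often split on sentence or section boundaries. The function name and the default sizes are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 500-word document: "w0 w1 ... w499".
document = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(document, chunk_size=200, overlap=40)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice.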

Domain Adaptation

Generic embeddings work well for general text. For specialized domains (legal, medical, financial), fine-tuning embeddings on domain-specific similarity pairs can improve retrieval quality significantly.

Embedding Freshness

Embeddings are static — they don’t update as knowledge evolves. When you update a document, you must re-embed and re-index it. Design your indexing pipeline with incremental updates in mind.
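One way to keep updates incremental is to store a content hash next to each vector and re-embed only when the hash changes. The sketch below covers just the planning step; the function names and the in-memory dictionaries are hypothetical stand-ins for whatever document store and vector index you use.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(
    documents: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents against hashes saved at the last indexing run.
    Returns (ids to embed or re-embed, ids to delete from the index)."""
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete

# Hashes recorded when the index was last built.
indexed_hashes = {
    "a": content_hash("Original text of A"),
    "b": content_hash("Original text of B"),
    "c": content_hash("Original text of C"),
}
# Current state of the corpus: "a" unchanged, "b" edited, "c" removed, "d" new.
documents = {
    "a": "Original text of A",
    "b": "Edited text of B",
    "d": "Brand new document",
}

to_embed, to_delete = plan_reindex(documents, indexed_hashes)
print(to_embed, to_delete)
```

Only the changed and new documents go back through the embedding model, which keeps API cost proportional to how much the corpus actually changed.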

Visualizing the Embedding Space

A practical sanity check: use UMAP or t-SNE to reduce your embeddings to 2D and plot them. Well-functioning embeddings will show clear clusters by topic:

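import numpy as np
import umap  # provided by the umap-learn package
import matplotlib.pyplot as plt

# Assumes `embeddings` is a list of vectors and `labels` holds one integer topic id per vector
# Reduce 1536-dim embeddings to 2D
reducer = umap.UMAP(n_components=2, random_state=42)
coords_2d = reducer.fit_transform(np.array(embeddings))

# Color each point by its topic label
plt.scatter(coords_2d[:, 0], coords_2d[:, 1], c=labels, cmap='tab10')
plt.show()
# Should show distinct clusters for each document topic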

If your embedding clusters don’t make intuitive sense, your chunking or embedding model choice might need adjustment.