Sentence Embeddings

Sentence embeddings convert an entire sentence — or paragraph, or document — into a single fixed-length vector that captures its meaning. Unlike word embeddings that operate on individual words, sentence embeddings encode context, order, and overall intent.
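One common way to arrive at a single fixed-length vector is mean pooling: average the per-token vectors a transformer produces, skipping padding positions. The sketch below illustrates the idea in plain Python. The function name `mean_pool` and the 3-dimensional toy vectors are assumptions for illustration; real models use hundreds of dimensions and may pool differently.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens followed by one padding position.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding, ignored because its mask entry is 0
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

However many tokens go in, the result has the same length as one token vector, which is what makes sentences of different lengths directly comparable.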

Why Sentence-Level Matters

A static word embedding assigns “bank” a single vector regardless of context, and it says nothing about the meaning of “The Federal Reserve raised interest rates to control inflation.” Sentence embeddings encode the full proposition as a point in semantic space, which makes them useful for comparing, searching, and clustering entire passages.

sentence-transformers (SBERT)

The sentence-transformers library is a widely used tool for generating sentence embeddings. The example below loads a small pretrained model and encodes five sentences in one call:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # fast, 384-dim, 80MB

sentences = [
    "The transformer architecture uses self-attention mechanisms.",
    "Self-attention allows models to relate each token to every other token.",
    "Python is a great language for data analysis.",
    "Pandas and NumPy simplify numerical computing in Python.",
    "LLMs can generate, summarize, and classify text."
]

embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 384)

Semantic Similarity

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

pairs = [
    ("How do I tokenize text in Python?", "What's the best way to split text into tokens?"),
    ("How do I tokenize text in Python?", "What's the capital of France?"),
    ("BERT uses bidirectional attention.", "Transformers are trained bidirectionally in BERT."),
]

for s1, s2 in pairs:
    emb1 = model.encode(s1)
    emb2 = model.encode(s2)
    score = util.cos_sim(emb1, emb2).item()
    print(f"Similarity {score:.4f}: '{s1[:35]}...' vs '{s2[:35]}...'")

# Example output (scores are illustrative and vary with the model version):
# Similarity 0.8921: 'How do I tokenize text in Python?...' vs 'What's the best way to split text i...'
# Similarity 0.0842: 'How do I tokenize text in Python?...' vs 'What's the capital of France?...'
# Similarity 0.9134: 'BERT uses bidirectional attention....' vs 'Transformers are trained bidirectio...'
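To make the score concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. The following is a minimal plain-Python sketch of that formula, not the library's implementation; the function name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Scores near 1 indicate vectors pointing the same way, scores near 0 indicate unrelated directions, and negative scores indicate opposing directions.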

Semantic Search

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# Knowledge base
passages = [
    "Tokenization splits text into tokens for language models.",
    "BERT is a bidirectional transformer pretrained on masked language modeling.",
    "Cosine similarity measures the angle between two vectors in high-dimensional space.",
    "Fine-tuning adapts a pretrained model to a specific downstream task.",
    "Named entity recognition identifies persons, organizations, and locations in text.",
    "Sentence embeddings encode full sentences as dense vectors for semantic tasks."
]

passage_embeddings = model.encode(passages, convert_to_tensor=True)

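def semantic_search(query, top_k=3):
    """Print the top_k passages most similar to the query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    # cos_sim returns a (1, num_passages) matrix; take its single row
    scores = util.cos_sim(query_embedding, passage_embeddings)[0]
    # Clamp k so topk does not fail when top_k exceeds the number of passages
    top_results = torch.topk(scores, k=min(top_k, len(passages)))

    print(f"\nQuery: {query}")
    for score, idx in zip(top_results.values, top_results.indices):
        print(f"  [{score.item():.4f}] {passages[int(idx)]}")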

semantic_search("How does BERT understand context?")
semantic_search("How do I compare sentence meanings?")
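For larger collections, a common optimization is to normalize every passage vector to unit length once, when it is indexed. Cosine similarity then reduces to a plain dot product at query time. The sketch below shows the idea with the standard library only; the helper names `l2_normalize` and `top_k_by_dot` and the 2-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_by_dot(query_vec, corpus_vecs, k=3):
    """Return (score, index) pairs for the k best matches, highest score first."""
    q = l2_normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, doc)), idx)
        for idx, doc in enumerate(corpus_vecs)
    )
    return heapq.nlargest(k, scored)


# Hypothetical 2-dimensional "embeddings", normalized once at index time.
corpus = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
results = top_k_by_dot([2.0, 0.1], corpus, k=2)
print([idx for _, idx in results])  # [0, 1]
```

With real models, passing `normalize_embeddings=True` to `model.encode` should give unit-length vectors directly, so only the dot product remains at query time.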

Building a Simple RAG Pipeline

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Document chunks (your knowledge base)
chunks = [
    "GPT-4 was released by OpenAI in March 2023 with multimodal capabilities.",
    "Claude 3 Opus achieved state-of-the-art performance on many benchmarks.",
    "Mistral 7B is an open-source model that outperforms Llama 2 13B.",
    "Retrieval-Augmented Generation combines a retriever with a language model.",
    "Vector databases store and index dense embeddings for fast retrieval."
]

chunk_embeddings = model.encode(chunks)

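def retrieve(query, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q_emb = model.encode([query])
    # cos_sim accepts NumPy arrays and returns a torch tensor of shape (1, num_chunks)
    scores = util.cos_sim(q_emb, chunk_embeddings)[0].numpy()
    # argsort is ascending, so reverse it to rank the highest-scoring chunks first
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]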

query = "What open-source model is competitive with larger models?"
context = retrieve(query)
print("Retrieved context:")
for c in context:
    print(f"  - {c}")

# In a real RAG pipeline, you'd pass this context to an LLM:
# prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"
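The prompt-assembly step can be sketched as a small function. This is one possible layout under stated assumptions, not a prescribed format: the name `build_prompt`, the instruction wording, and the numbered-chunk layout are all illustrative choices, and the LLM call itself is left out.

```python
def build_prompt(query, context_chunks):
    """Format retrieved chunks and the user question into a single prompt string."""
    if not context_chunks:
        context_block = "(no relevant context found)"
    else:
        # Number the chunks so individual sources can be told apart in the answer.
        context_block = "\n".join(
            f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
        )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


example_prompt = build_prompt(
    "What open-source model is competitive with larger models?",
    ["Mistral 7B is an open-source model that outperforms Llama 2 13B."],
)
print(example_prompt)
```

Separating chunks with newlines and numbers, rather than joining them with spaces, keeps chunk boundaries visible to the model.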

Model Comparison

Model	Dimensions	Speed	Quality	Use case
all-MiniLM-L6-v2	384	Fast	Good	General, low-latency
all-mpnet-base-v2	768	Medium	Better	General, higher accuracy
multi-qa-mpnet-base-dot-v1	768	Medium	Great for QA	Q&A retrieval
e5-large-v2	1024	Slow	Excellent	High-accuracy retrieval
text-embedding-3-large (OpenAI)	3072	API	Excellent	Production via API
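Dimension count also drives storage cost. As a rough back-of-the-envelope sketch, assuming each value is stored as a 4-byte float32 and ignoring any index overhead, the raw size of an embedding matrix is vectors × dimensions × 4 bytes. The helper name `index_size_bytes` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the embedding matrix, excluding any index overhead."""
    return num_vectors * dimensions * bytes_per_value


# Dimensions taken from the comparison table above.
models = [
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
    ("e5-large-v2", 1024),
    ("text-embedding-3-large", 3072),
]
for name, dims in models:
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gb:.3f} GB per million vectors")
```

Under these assumptions, one million vectors take about 1.5 GB at 384 dimensions and about 12.3 GB at 3072, an eightfold difference that matters when sizing a vector store.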

OpenAI Embeddings via API

from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

emb = get_embedding("How do sentence embeddings work?")
print(f"Embedding dimension: {len(emb)}")  # 1536 for text-embedding-3-small

OpenAI’s text-embedding-3-large produces 3072-dimensional embeddings by default. At its release in January 2024, OpenAI reported an average MTEB score of 64.6% for it, compared with 61.0% for the older text-embedding-ada-002.
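When 3072 dimensions are more than a workload needs, the text-embedding-3 models accept a `dimensions` argument on the embeddings request to return a shorter vector. The sketch below shows roughly the manual counterpart for an embedding that has already been fetched: keep the first `k` values, then rescale to unit length so cosine and dot-product scores stay comparable. The helper name `shorten_embedding` and the toy 4-dimensional vector are assumptions for illustration, and shortening generally trades some accuracy for size.

```python
import math


def shorten_embedding(embedding, k):
    """Keep the first k values of an embedding and rescale to unit length."""
    if not 0 < k <= len(embedding):
        raise ValueError("k must be between 1 and the embedding length")
    truncated = embedding[:k]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        raise ValueError("truncated embedding has zero length")
    return [x / norm for x in truncated]


short = shorten_embedding([3.0, 4.0, 12.0, 0.0], k=2)
print(short)  # [0.6, 0.8]
```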