Sentence Embeddings
Sentence embeddings convert an entire sentence (or a paragraph, or a document) into a single fixed-length vector that captures its meaning. Unlike word embeddings, which represent individual words, sentence embeddings encode context, word order, and overall intent.
Why Sentence-Level Matters
Static word embeddings give you the meaning of “bank” in isolation, but not the meaning of “The Federal Reserve raised interest rates to control inflation.” Sentence embeddings encode the full proposition as a point in semantic space, which makes them useful for comparing, searching, and clustering entire passages.
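To see why word-level vectors alone fall short, consider the common baseline of averaging word vectors. The sketch below uses a tiny hand-made vector table (hypothetical values, not taken from any trained model) and a helper we call `average_word_vectors`. It shows that averaging gives the same vector for two sentences with opposite meanings, because it discards word order.

```python
# Toy 3-dimensional word vectors; the values are made up for illustration.
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.25],
    "man": [0.5, 0.25, 1.0],
}

def average_word_vectors(sentence):
    """Bag-of-words baseline: average the vectors of the words in a sentence."""
    words = sentence.lower().split()
    vectors = [TOY_VECTORS[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

a = average_word_vectors("dog bites man")
b = average_word_vectors("man bites dog")
print(a == b)  # prints True: the average cannot tell the two sentences apart
```

A sentence encoder processes the words jointly and in order, so it is not limited in this way.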
sentence-transformers (SBERT)
The sentence-transformers library is a widely used tool for generating high-quality sentence embeddings (as of 2025):
    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer('all-MiniLM-L6-v2')  # fast, 384-dim, 80MB

    sentences = [
        "The transformer architecture uses self-attention mechanisms.",
        "Self-attention allows models to relate each token to every other token.",
        "Python is a great language for data analysis.",
        "Pandas and NumPy simplify numerical computing in Python.",
        "LLMs can generate, summarize, and classify text.",
    ]

    embeddings = model.encode(sentences)
    print(f"Embeddings shape: {embeddings.shape}")  # (5, 384)

Semantic Similarity
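The similarity scores in this section are cosine similarities: the dot product of two vectors divided by the product of their lengths. As a minimal dependency-free sketch, the computation looks like this. The helper name `cosine_similarity` is ours and the toy vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

`util.cos_sim` in sentence-transformers computes the same quantity on tensors, and can score whole batches at once.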
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('all-MiniLM-L6-v2')

    pairs = [
        ("How do I tokenize text in Python?", "What's the best way to split text into tokens?"),
        ("How do I tokenize text in Python?", "What's the capital of France?"),
        ("BERT uses bidirectional attention.", "Transformers are trained bidirectionally in BERT."),
    ]

    for s1, s2 in pairs:
        # Encode each sentence and compare the two vectors with cosine similarity
        emb1 = model.encode(s1)
        emb2 = model.encode(s2)
        score = util.cos_sim(emb1, emb2).item()
        print(f"Similarity {score:.4f}: '{s1[:35]}...' vs '{s2[:35]}...'")

    # Example output (exact scores may vary with model and library version):
    # Similarity 0.8921: 'How do I tokenize text in Python?...' vs 'What's the best way to split text i...'
    # Similarity 0.0842: 'How do I tokenize text in Python?...' vs 'What's the capital of France?...'
    # Similarity 0.9134: 'BERT uses bidirectional attention....' vs 'Transformers are trained bidirectio...'

Semantic Search
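Semantic search reduces to two steps: score every stored passage against the query, then keep the highest-scoring ones. The following dependency-free sketch shows that ranking step on made-up 2-dimensional vectors. The names `cosine`, `rank_passages`, and `toy_passages` are illustrative and not part of any library.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return (score, index) pairs for the top_k passages most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    return heapq.nlargest(top_k, scored)

# Made-up passage vectors; in practice these come from a sentence encoder.
toy_passages = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
results = rank_passages([1.0, 0.0], toy_passages, top_k=2)
for score, idx in results:
    print(f"[{score:.4f}] passage {idx}")
```

A brute-force scan like this is fine for small collections. Larger ones typically move the same scoring into batched tensor operations or a vector index.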
    from sentence_transformers import SentenceTransformer, util
    import torch

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Knowledge base
    passages = [
        "Tokenization splits text into tokens for language models.",
        "BERT is a bidirectional transformer pretrained on masked language modeling.",
        "Cosine similarity measures the angle between two vectors in high-dimensional space.",
        "Fine-tuning adapts a pretrained model to a specific downstream task.",
        "Named entity recognition identifies persons, organizations, and locations in text.",
        "Sentence embeddings encode full sentences as dense vectors for semantic tasks.",
    ]

    passage_embeddings = model.encode(passages, convert_to_tensor=True)

    def semantic_search(query, top_k=3):
        query_embedding = model.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(query_embedding, passage_embeddings)[0]
        top_results = torch.topk(scores, k=top_k)

        print(f"\nQuery: {query}")
        for score, idx in zip(top_results.values, top_results.indices):
            print(f"  [{score:.4f}] {passages[idx]}")

    semantic_search("How does BERT understand context?")
    semantic_search("How do I compare sentence meanings?")

Building a Simple RAG Pipeline
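A RAG pipeline retrieves over chunks rather than whole documents, so the first step is usually to split source text into pieces. The example in this section uses hand-written chunks. As a rough sketch of one common approach, the hypothetical helper `chunk_words` below splits text into fixed-size word windows with some overlap. The window and overlap sizes are arbitrary assumptions, not recommendations from any library.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return pieces

text = ("Retrieval-Augmented Generation combines a retriever with a language model. "
        "Vector databases store and index dense embeddings for fast retrieval.")
example_chunks = chunk_words(text)
for piece in example_chunks:
    print(piece)
```

The overlap keeps a sentence that straddles a boundary at least partly visible in both neighboring chunks, at the cost of storing some words twice.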
    from sentence_transformers import SentenceTransformer, util
    import numpy as np

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Document chunks (your knowledge base)
    chunks = [
        "GPT-4 was released by OpenAI in March 2023 with multimodal capabilities.",
        "Claude 3 Opus achieved state-of-the-art performance on many benchmarks.",
        "Mistral 7B is an open-source model that outperforms Llama 2 13B.",
        "Retrieval-Augmented Generation combines a retriever with a language model.",
        "Vector databases store and index dense embeddings for fast retrieval.",
    ]

    chunk_embeddings = model.encode(chunks)

    def retrieve(query, top_k=2):
        q_emb = model.encode([query])
        scores = util.cos_sim(q_emb, chunk_embeddings)[0].numpy()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [chunks[i] for i in top_indices]

    query = "What open-source model is competitive with larger models?"
    context = retrieve(query)
    print("Retrieved context:")
    for c in context:
        print(f"  - {c}")

    # In a real RAG pipeline, you'd pass this context to an LLM:
    # prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"

Model Comparison
| Model | Dimensions | Speed | Quality | Use case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, low-latency |
| all-mpnet-base-v2 | 768 | Medium | Better | General, higher accuracy |
| multi-qa-mpnet-base-cos-v1 | 768 | Medium | Great for QA | Q&A retrieval |
| e5-large-v2 | 1024 | Slow | Excellent | High-accuracy retrieval |
| text-embedding-3-large (OpenAI) | 3072 | API | Excellent | Production via API |
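The Dimensions column has a direct cost: raw embedding storage is roughly count × dimensions × bytes per value. As a back-of-the-envelope sketch, assuming uncompressed float32 values (4 bytes each) and a hypothetical corpus of one million passages, the sizes for the dimensions in the table can be estimated like this:

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as uncompressed float32

def index_size_gb(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Approximate raw storage for num_vectors embeddings of size dim, in GB (10**9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9

NUM_PASSAGES = 1_000_000  # hypothetical corpus size
for dim in (384, 768, 1024, 3072):
    print(f"{dim:>4} dims: {index_size_gb(NUM_PASSAGES, dim):.2f} GB")
```

Under these assumptions, moving from 384 to 3072 dimensions multiplies raw storage by eight (about 1.5 GB versus 12.3 GB), before any index overhead or quantization.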
OpenAI Embeddings via API
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

    def get_embedding(text, model="text-embedding-3-small"):
        response = client.embeddings.create(input=text, model=model)
        return response.data[0].embedding

    emb = get_embedding("How do sentence embeddings work?")
    print(f"Embedding dimension: {len(emb)}")  # 1536 for text-embedding-3-small

OpenAI’s text-embedding-3-large produces 3072-dimensional embeddings and ranks among the top performers on the MTEB benchmark for semantic similarity and retrieval tasks.