Sentence Embeddings

Sentence embeddings convert an entire sentence — or paragraph, or document — into a single fixed-length vector that captures its meaning. Unlike word embeddings that operate on individual words, sentence embeddings encode context, order, and overall intent.


Why Sentence-Level Matters

Word embeddings give you the meaning of “bank” in isolation, but not the meaning of “The Federal Reserve raised interest rates to control inflation.” Sentence embeddings encode the full proposition as a point in semantic space — useful for comparing, searching, and clustering entire passages.
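
To see why word-level vectors alone fall short, consider the simplest way to build a sentence vector from them: averaging. The sketch below uses made-up 3-dimensional word vectors (the values and the helper name average_word_vectors are illustrative assumptions, not real embeddings or a library API). It shows that averaging discards word order, so two sentences with opposite meanings end up at the same point.

```python
# Toy word vectors; the values are invented for illustration only.
WORD_VECTORS = {
    "dog":   [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man":   [0.1, 0.2, 0.9],
}

def average_word_vectors(sentence):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

a = average_word_vectors("dog bites man")
b = average_word_vectors("man bites dog")

# Averaging ignores order, so both sentences collapse to the same point
# (compared with a small tolerance to allow for floating-point rounding).
same = all(abs(x - y) < 1e-9 for x, y in zip(a, b))
print(same)
```

A trained sentence encoder reads the tokens in context, so it is not bound by this limitation in the way a plain average is.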


sentence-transformers (SBERT)

The sentence-transformers library is the standard tool for generating high-quality sentence embeddings in 2025:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # fast, 384-dim, 80MB

sentences = [
    "The transformer architecture uses self-attention mechanisms.",
    "Self-attention allows models to relate each token to every other token.",
    "Python is a great language for data analysis.",
    "Pandas and NumPy simplify numerical computing in Python.",
    "LLMs can generate, summarize, and classify text."
]

# encode() returns one row per input sentence
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 384)

Semantic Similarity

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

pairs = [
    ("How do I tokenize text in Python?", "What's the best way to split text into tokens?"),
    ("How do I tokenize text in Python?", "What's the capital of France?"),
    ("BERT uses bidirectional attention.", "Transformers are trained bidirectionally in BERT."),
]

for s1, s2 in pairs:
    # Encode each sentence and compare the two vectors with cosine similarity
    emb1 = model.encode(s1)
    emb2 = model.encode(s2)
    score = util.cos_sim(emb1, emb2).item()
    print(f"Similarity {score:.4f}: '{s1[:35]}...' vs '{s2[:35]}...'")

# Example output (exact scores depend on the model version):
# Similarity 0.8921: 'How do I tokenize text in Python?...' vs 'What's the best way to split text i...'
# Similarity 0.0842: 'How do I tokenize text in Python?...' vs 'What's the capital of France?...'
# Similarity 0.9134: 'BERT uses bidirectional attention....' vs 'Transformers are trained bidirectio...'
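
Conceptually, the cosine score used above is the dot product of two vectors divided by the product of their lengths. The following dependency-free sketch spells that out; the function name cosine_similarity is an illustrative choice, and the inputs are toy 2-dimensional vectors rather than real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal      -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> -1.0
```

Because the score depends only on direction, scaling a vector does not change it, which is why the first pair scores 1.0 despite the different lengths.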

Semantic Search

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# Knowledge base
passages = [
    "Tokenization splits text into tokens for language models.",
    "BERT is a bidirectional transformer pretrained on masked language modeling.",
    "Cosine similarity measures the angle between two vectors in high-dimensional space.",
    "Fine-tuning adapts a pretrained model to a specific downstream task.",
    "Named entity recognition identifies persons, organizations, and locations in text.",
    "Sentence embeddings encode full sentences as dense vectors for semantic tasks."
]

# Encode the passages once, up front; only the query is encoded per search
passage_embeddings = model.encode(passages, convert_to_tensor=True)

def semantic_search(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Cosine similarity between the query and every passage
    scores = util.cos_sim(query_embedding, passage_embeddings)[0]
    # Never ask for more results than there are passages
    top_results = torch.topk(scores, k=min(top_k, len(passages)))
    print(f"\nQuery: {query}")
    for score, idx in zip(top_results.values, top_results.indices):
        print(f"  [{score.item():.4f}] {passages[int(idx)]}")

semantic_search("How does BERT understand context?")
semantic_search("How do I compare sentence meanings?")
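
The ranking step of the search above can be sketched without any libraries. If the stored vectors are normalised to unit length once at index time, a plain dot product against the normalised query equals cosine similarity. The helper names normalize and top_k and the 2-dimensional vectors below are illustrative assumptions, not part of sentence-transformers.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k highest dot products."""
    q = normalize(query_vec)
    scored = [
        (i, sum(a * b for a, b in zip(q, doc)))
        for i, doc in enumerate(corpus_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-D "embeddings", normalised once at index time (values are illustrative).
corpus = [normalize(v) for v in [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]]
results = top_k([0.9, 0.1], corpus, k=2)
print([i for i, _ in results])  # [0, 1]
```

This brute-force scan is linear in the number of passages, which is usually fine for small collections; larger ones typically move to an approximate nearest-neighbour index.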

Building a Simple RAG Pipeline

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Document chunks (your knowledge base)
chunks = [
    "GPT-4 was released by OpenAI in March 2023 with multimodal capabilities.",
    "Claude 3 Opus achieved state-of-the-art performance on many benchmarks.",
    "Mistral 7B is an open-source model that outperforms Llama 2 13B.",
    "Retrieval-Augmented Generation combines a retriever with a language model.",
    "Vector databases store and index dense embeddings for fast retrieval."
]
chunk_embeddings = model.encode(chunks)

def retrieve(query, top_k=2):
    q_emb = model.encode([query])
    # Similarity of the query to every chunk, as a NumPy array
    scores = util.cos_sim(q_emb, chunk_embeddings)[0].numpy()
    # Indices of the highest-scoring chunks, best first
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]

query = "What open-source model is competitive with larger models?"
context = retrieve(query)

print("Retrieved context:")
for c in context:
    print(f"  - {c}")

# In a real RAG pipeline, you'd pass this context to an LLM:
# prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"
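
The final step, turning retrieved chunks into a prompt, might look like the sketch below. The function name build_prompt and the template wording are hypothetical; numbering the chunks is one possible design choice that makes it easier to ask the model to cite its sources.

```python
def build_prompt(query, context_chunks):
    """Assemble an LLM prompt from retrieved chunks (hypothetical template)."""
    context_block = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    "What open-source model is competitive with larger models?",
    ["Mistral 7B is an open-source model that outperforms Llama 2 13B."],
)
print(prompt)
```

The resulting string would then be sent to whichever LLM the pipeline uses; that call is deliberately left out here because it depends on the provider.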

Model Comparison

| Model | Dimensions | Speed | Quality | Use case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, low-latency |
| all-mpnet-base-v2 | 768 | Medium | Better | General, higher accuracy |
| multi-qa-mpnet-base | 768 | Medium | Great for QA | Q&A retrieval |
| e5-large-v2 | 1024 | Slow | Excellent | High-accuracy retrieval |
| text-embedding-3-large (OpenAI) | 3072 | API | Excellent | Production via API |
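
The dimension counts in the comparison above also determine storage cost. As a rough estimate, assuming uncompressed float32 values (4 bytes each) and ignoring any index overhead, the raw size of a collection is vectors × dimensions × 4 bytes. The helper index_size_mb is a hypothetical name used only for this sketch.

```python
def index_size_mb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of a dense vector collection in megabytes (1 MB = 10**6 bytes).

    Assumes uncompressed float32 storage by default; real vector stores
    add overhead or apply compression, so treat this as an estimate.
    """
    return num_vectors * dimensions * bytes_per_value / 1e6

models = [
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
    ("e5-large-v2", 1024),
    ("text-embedding-3-large", 3072),
]
for name, dims in models:
    size = index_size_mb(1_000_000, dims)
    print(f"{name}: {size:,.0f} MB per million vectors")
```

Under these assumptions, a million 384-dimensional vectors take about 1.5 GB, while the same collection at 3072 dimensions takes about 12.3 GB.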

OpenAI Embeddings via API

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

emb = get_embedding("How do sentence embeddings work?")
print(f"Embedding dimension: {len(emb)}")  # 1536 for text-embedding-3-small
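
The helper above sends one text per request. The embeddings endpoint also accepts a list of strings, so several texts can be embedded in a single call. The sketch below assumes a client object created with OpenAI() is passed in; the function name get_embeddings is an illustrative choice. It sorts the returned items by their index field so the vectors come back in input order.

```python
def get_embeddings(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order."""
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the position of its input text in `index`
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (requires the openai package and an OPENAI_API_KEY environment variable):
# from openai import OpenAI
# vectors = get_embeddings(OpenAI(), ["first text", "second text"])
```

Batching reduces the number of round trips, which tends to matter most when embedding a whole knowledge base at index time.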

OpenAI’s text-embedding-3-large produces 3072-dimensional embeddings and, when it was released in early 2024, ranked among the top performers on the MTEB benchmark for semantic similarity and retrieval tasks.