TF-IDF in NLP
TF-IDF (Term Frequency–Inverse Document Frequency) assigns a weight to each word in a document based on how often it appears in that document relative to how common it is across the entire corpus. Words that are frequent in a specific document but rare overall get high scores — they’re the meaningful, distinguishing terms.
The Formula
TF (Term Frequency): How often a word appears in a document.

TF(t, d) = count(t in d) / total_words(d)

IDF (Inverse Document Frequency): How rare a word is across the corpus.

IDF(t, D) = log(N / df(t))

where N = total documents, df(t) = documents containing term t

TF-IDF:

TFIDF(t, d, D) = TF(t, d) × IDF(t, D)

A word like “the” has high TF everywhere but very low IDF, so it gets a near-zero TF-IDF score. A word like “transformer” that appears frequently in one AI paper but rarely elsewhere gets a high score.
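To make the formulas concrete, here is a minimal from-scratch sketch in plain Python. The toy corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration, and the choice to return 0 for a term that appears in no document is an assumption, since the formula is undefined there.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace for simplicity
corpus = [
    "the transformer uses attention",
    "the cat sat on the mat",
    "the dog chased the cat",
]
docs = [text.split() for text in corpus]

def tf(term, doc):
    # TF(t, d) = count(t in d) / total_words(d)
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # IDF(t, D) = log(N / df(t)), using the natural log
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", docs[0], docs))                     # 0.0 ("the" is in every document)
print(round(tfidf("transformer", docs[0], docs), 4))   # 0.2747
```

These numbers will not match scikit-learn's output. By default `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each document vector.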
TF-IDF with scikit-learn
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "Large language models like GPT and Claude use transformer architecture.",
    "The transformer architecture relies on self-attention mechanisms.",
    "GPT models generate fluent text using autoregressive decoding.",
    "Retrieval-augmented generation combines LLMs with document search.",
    "Vector databases store embeddings for fast similarity search."
]

# Drop English stop words and keep at most 20 unigrams in the vocabulary
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=20,
    ngram_range=(1, 1)
)

X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
feature_names = vectorizer.get_feature_names_out()

# One row per document, one column per vocabulary term
df = pd.DataFrame(
    X.toarray().round(3),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)
print(df)
```

Keyword Extraction with TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords(documents, top_n=5):
    """Return the top_n highest-scoring unigrams and bigrams per document."""
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    results = []
    for i in range(len(documents)):
        scores = tfidf_matrix[i].toarray().flatten()
        top_indices = np.argsort(scores)[::-1][:top_n]  # highest scores first
        keywords = [(feature_names[idx], round(float(scores[idx]), 4))
                    for idx in top_indices if scores[idx] > 0]
        results.append({"doc": i + 1, "keywords": keywords})

    return results

docs = [
    "BERT and RoBERTa are pre-trained transformer models for NLP tasks.",
    "Python's scikit-learn provides excellent tools for classical machine learning.",
    "Vector databases like Pinecone and Weaviate power modern RAG pipelines."
]

for result in extract_keywords(docs, top_n=4):
    print(f"Doc {result['doc']}: {result['keywords']}")

# Each result is a list of (term, score) pairs that mixes unigrams with
# bigrams such as 'scikit learn' or 'vector databases'.
# Note: in this three-document corpus every term occurs once and in only one
# document, so all terms within a document tie at the same score and their
# order is arbitrary. A larger corpus is needed for a meaningful ranking.
```

TF-IDF for Text Classification
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny toy dataset: three examples per class
texts = [
    "The server crashed and the API returned a 500 error",
    "Beautiful sunset over the Pacific Ocean today",
    "Database query took 30 seconds - need to add an index",
    "The flowers in the garden are blooming after the rain",
    "The model's F1 score dropped after retraining on new data",
    "Hiking trail through the redwoods was breathtaking"
]
labels = ["tech", "nature", "tech", "nature", "tech", "nature"]

# Vectorize and classify in one estimator so each CV fold refits the vocabulary
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=5000)),
    ('clf', LogisticRegression(max_iter=200))
])

scores = cross_val_score(pipeline, texts, labels, cv=3)
print(f"Cross-validation accuracy: {scores.mean():.2f}")
```

Document Search and Ranking
TF-IDF underlies classic information retrieval. BM25, the default ranking function in Elasticsearch, builds on the same term-frequency and inverse-document-frequency ideas:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "Python is a popular programming language for data science and AI.",
    "Machine learning models learn patterns from large datasets.",
    "Neural networks use layers of neurons to process information.",
    "Deep learning is a subset of machine learning using neural networks.",
]

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)

def search(query, top_k=2):
    # Project the query into the same TF-IDF space as the documents
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]  # best matches first
    return [(documents[i], round(float(scores[i]), 4)) for i in top_indices]

results = search("neural network deep learning")
for doc, score in results:
    print(f"Score {score}: {doc}")
```

TF-IDF Tuning Parameters
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,
    analyzer='word',       # 'word', 'char', or 'char_wb'
    ngram_range=(1, 2),    # unigrams + bigrams
    max_features=10000,    # cap vocabulary size
    min_df=2,              # ignore terms that appear in fewer than 2 documents
    max_df=0.85,           # ignore terms in more than 85% of documents (near-universal terms)
    sublinear_tf=True,     # replace tf with 1 + log(tf) to dampen high counts
    use_idf=True,
    smooth_idf=True,       # add 1 to document frequencies to prevent division by zero
    norm='l2'              # normalize each document vector to unit length
)
```

TF-IDF vs Modern Embeddings
| Aspect | TF-IDF | Sentence Embeddings |
|---|---|---|
| Captures word order | No | Yes |
| Captures synonymy | No | Yes |
| Interpretable | Yes | No |
| Memory footprint | Small (sparse) | Larger (dense) |
| Training required | No | Yes (pretrained) |
| Speed | Very fast | Moderate |
| Best for | Keyword search, topic modeling, baselines | Semantic search, RAG, similarity |
TF-IDF remains the go-to choice for fast, explainable keyword-based systems. For semantic understanding — “is this customer complaint about billing?” — dense embeddings from models like sentence-transformers perform better.
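The synonymy limitation is easy to see in a small sketch. The three sentences below are invented for illustration. The first two describe the same billing problem with no words in common, while the third reuses the first sentence's vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sentences, chosen so that the first two share no tokens
sentences = [
    "Customer was charged twice on the invoice",
    "Client got billed double in their statement",   # same meaning, different words
    "Customer invoice was charged twice",            # same words as the first
]

# Default settings: no stop-word removal, so only literal token overlap counts
vectors = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(vectors)

paraphrase_score = sim[0, 1]   # no shared tokens, so the vectors are orthogonal
overlap_score = sim[0, 2]      # shared tokens produce a high score

print(f"Paraphrase similarity: {paraphrase_score:.2f}")   # 0.00
print(f"Lexical-overlap similarity: {overlap_score:.2f}")
```

TF-IDF scores the paraphrase as completely unrelated because it only sees token overlap. A dense embedding model would be expected to place the first two sentences close together.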