TF-IDF in NLP

TF-IDF (Term Frequency–Inverse Document Frequency) assigns a weight to each word in a document based on how often it appears in that document relative to how common it is across the entire corpus. Words that are frequent in a specific document but rare overall get high scores — they’re the meaningful, distinguishing terms.

The Formula

TF (Term Frequency): How often a word appears in a document.

TF(t, d) = count(t in d) / total_words(d)

IDF (Inverse Document Frequency): How rare a word is across the corpus.

IDF(t, D) = log(N / df(t))

where N = total documents, df(t) = documents containing term t

TF-IDF:

TFIDF(t, d, D) = TF(t, d) × IDF(t, D)

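A word like “the” has a high TF everywhere, but because it appears in nearly every document its IDF is close to zero (exactly zero if it appears in all of them), so its TF-IDF score is near zero. A word like “transformer” that appears frequently in one AI paper but rarely elsewhere gets a high score.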
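To make the formula concrete, here is a minimal sketch that computes it by hand. The tiny pre-tokenized corpus and the helper names are made up for illustration, and the natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math

# Hypothetical three-document corpus, already tokenized and lowercased.
corpus = [
    ["the", "transformer", "uses", "the", "attention", "transformer"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

def tf(term, doc):
    """Relative frequency of a term within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

first = corpus[0]
for term in ("the", "transformer"):
    print(term, round(tfidf(term, first, corpus), 4))
# the 0.0
# transformer 0.3662
```

“the” occurs in all three documents, so its IDF is log(3/3) = 0 and the score vanishes no matter how often it appears. “transformer” occurs in only one document, so its IDF is log(3) and its score is (2/6) × log(3).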

TF-IDF with scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

corpus = [
    "Large language models like GPT and Claude use transformer architecture.",
    "The transformer architecture relies on self-attention mechanisms.",
    "GPT models generate fluent text using autoregressive decoding.",
    "Retrieval-augmented generation combines LLMs with document search.",
    "Vector databases store embeddings for fast similarity search."
]

vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=20,
    ngram_range=(1, 1)
)

X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()

df = pd.DataFrame(
    X.toarray().round(3),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)
print(df)
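The numbers printed above will not match the textbook formula. With its default settings, TfidfVectorizer uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The pure-Python sketch below is meant to mirror that weighting; the function name and the toy documents are hypothetical, and tokenization is assumed to be done already.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """Raw-count TF x smoothed IDF, followed by L2 row normalization.

    Intended to mirror TfidfVectorizer's default weighting
    (smooth_idf=True, sublinear_tf=False, norm='l2').
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

docs = [
    ["transformer", "architecture", "attention"],
    ["transformer", "models", "text"],
]
idf, rows = sklearn_style_tfidf(docs)
print(idf["transformer"])  # term shared by both documents: ln(3/3) + 1 = 1.0
print(rows[0])
```

Because of the "+ 1", a term that appears in every document still keeps a small positive weight instead of dropping to zero, which is one reason stop-word removal or max_df is often used alongside it.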

Keyword Extraction with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords(documents, top_n=5):
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    results = []
    for i, doc in enumerate(documents):
        scores = tfidf_matrix[i].toarray().flatten()
        top_indices = np.argsort(scores)[::-1][:top_n]
        # float() keeps plain Python floats in the output instead of NumPy scalars
        keywords = [(feature_names[idx], round(float(scores[idx]), 4))
                    for idx in top_indices if scores[idx] > 0]
        results.append({"doc": i + 1, "keywords": keywords})

    return results

docs = [
    "BERT and RoBERTa are pre-trained transformer models for NLP tasks.",
    "Python's scikit-learn provides excellent tools for classical machine learning.",
    "Vector databases like Pinecone and Weaviate power modern RAG pipelines."
]

for result in extract_keywords(docs, top_n=4):
    print(f"Doc {result['doc']}: {result['keywords']}")

# Note: these three documents share no terms once stop words are removed, so
# every term within a document gets the same weight and the top-n choice is
# effectively a tie-break. Rankings become meaningful on a larger corpus whose
# documents share vocabulary.

TF-IDF for Text Classification

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = [
    "The server crashed and the API returned a 500 error",
    "Beautiful sunset over the Pacific Ocean today",
    "Database query took 30 seconds — need to add an index",
    "The flowers in the garden are blooming after the rain",
    "The model's F1 score dropped after retraining on new data",
    "Hiking trail through the redwoods was breathtaking"
]
labels = ["tech", "nature", "tech", "nature", "tech", "nature"]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=5000)),
    ('clf', LogisticRegression(max_iter=200))
])

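# With six examples and cv=3, each fold is tested on only two texts,
# so treat this score as a smoke test rather than a real accuracy estimate.
scores = cross_val_score(pipeline, texts, labels, cv=3)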
print(f"Cross-validation accuracy: {scores.mean():.2f}")

Document Search and Ranking

TF-IDF underpins classic information retrieval. BM25, the default ranking function in Elasticsearch, refines the same idea with term-frequency saturation and document-length normalization:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "Python is a popular programming language for data science and AI.",
    "Machine learning models learn patterns from large datasets.",
    "Neural networks use layers of neurons to process information.",
    "Deep learning is a subset of machine learning using neural networks.",
]

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)

def search(query, top_k=2):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], round(float(scores[i]), 4)) for i in top_indices]

# No stemming is applied, so "network" in the query does not match "networks" in the documents.
results = search("neural network deep learning")
for doc, score in results:
    print(f"Score {score}: {doc}")
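Since BM25 comes up here, a rough pure-Python sketch of Okapi BM25 scoring may help show how it extends TF-IDF. The function name and toy documents are hypothetical, k1=1.2 and b=0.75 are commonly used defaults assumed for illustration, and the IDF is the non-negative variant ln((N - df + 0.5) / (df + 0.5) + 1).

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Longer-than-average documents are penalized, shorter ones boosted.
            length_norm = 1 - b + b * len(doc) / avg_len
            # The tf factor saturates: repeated occurrences add less and less.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "python is a popular programming language".split(),
    "neural networks process information".split(),
    "deep learning uses neural networks".split(),
]
scores = bm25_scores(["neural", "networks", "deep"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The third document ranks first because it matches all three query terms, including the rarer "deep"; the first document matches nothing and scores zero.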

TF-IDF Tuning Parameters

vectorizer = TfidfVectorizer(
    lowercase=True,
    analyzer='word',           # 'word', 'char', or 'char_wb'
    ngram_range=(1, 2),        # unigrams + bigrams
    max_features=10000,        # cap vocabulary size
    min_df=2,                  # min document frequency
    max_df=0.85,               # max document frequency (removes near-universal terms)
    sublinear_tf=True,         # replace tf with 1 + log(tf) to dampen high counts
    use_idf=True,
    smooth_idf=True,           # add 1 to document frequencies to avoid division by zero
    norm='l2'                  # normalize each document vector to unit length
)
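To get a feel for two of these knobs, the sketch below shows how sublinear scaling compresses large counts and how min_df and max_df prune the vocabulary. It is a pure-Python approximation under the assumption that an integer threshold is an absolute document count and a float is a proportion of the corpus; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

# sublinear_tf: a raw count tf is replaced by 1 + ln(tf).
for tf in (1, 10, 100):
    print(tf, round(1 + math.log(tf), 2))
# 1 1.0
# 10 3.3
# 100 5.61

def prune_vocabulary(docs, min_df=2, max_df=0.85):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, d in df.items() if low <= d <= high)

docs = [
    ["model", "data", "training"],
    ["model", "data", "gpu"],
    ["model", "data", "loss"],
    ["model", "loss", "search"],
]
print(prune_vocabulary(docs))
# ['data', 'loss']
```

"model" is dropped because it appears in all four documents (more than 85% of them), while "training", "gpu", and "search" are dropped because each appears in only one.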

TF-IDF vs Modern Embeddings

Aspect	TF-IDF	Sentence Embeddings
Captures word order	Only locally, via n-grams	Yes
Captures synonymy	No	Yes
Interpretable	Yes	No
Memory footprint	Small (sparse)	Larger (dense)
Training required	No (only fits vocabulary and IDF weights)	Yes (pretrained)
Speed	Very fast	Moderate
Best for	Keyword search, topic modeling, baselines	Semantic search, RAG, similarity

TF-IDF remains a strong choice for fast, explainable keyword-based systems. For semantic understanding, such as deciding whether a customer complaint is about billing, dense embeddings from a library like sentence-transformers generally perform better.
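The synonymy gap is easy to demonstrate. The sketch below uses plain word counts rather than full TF-IDF weights, which is enough here because IDF only rescales terms and cannot create overlap where there is none; the example sentences and helper name are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = Counter("refund for double charge".split())
doc_same_words = Counter("double charge on my card".split())
doc_synonyms = Counter("billed twice please reimburse".split())

print(cosine(query, doc_same_words))  # positive: two shared words
print(cosine(query, doc_synonyms))    # 0.0: similar meaning, no shared words
```

A bag-of-words model scores the second document as completely unrelated, even though a human would read it as the same billing complaint. That is the case where dense embeddings earn their extra cost.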