Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies named entities in text — people, organizations, locations, dates, monetary values, and more. It converts unstructured text into structured facts that systems can act on.


What Counts as a Named Entity?

"Apple released the iPhone 17 in San Francisco on September 9, 2025,
with CEO Tim Cook presenting to 3,000 attendees."
Entities:
Apple → ORG (organization)
iPhone 17 → PRODUCT
San Francisco → GPE (geopolitical entity / city)
September 9, 2025 → DATE
Tim Cook → PERSON
3,000 → CARDINAL
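
One way to picture the "structured facts" an NER system emits is as records that carry the entity text, its label, and its character offsets in the source string. The sketch below hand-builds those records for the example sentence above; the `Entity` dataclass and the `locate` helper are illustrative names, not part of any NER library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


def locate(source: str, surface: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of `surface` in `source`."""
    start = source.index(surface)  # raises ValueError if the surface form is absent
    return Entity(surface, label, start, start + len(surface))


sentence = (
    "Apple released the iPhone 17 in San Francisco on September 9, 2025, "
    "with CEO Tim Cook presenting to 3,000 attendees."
)

# Hand-written annotations copied from the example above (no model involved)
annotations = [
    ("Apple", "ORG"),
    ("iPhone 17", "PRODUCT"),
    ("San Francisco", "GPE"),
    ("September 9, 2025", "DATE"),
    ("Tim Cook", "PERSON"),
    ("3,000", "CARDINAL"),
]
entities = [locate(sentence, surface, label) for surface, label in annotations]

for ent in entities:
    # Slicing the source with the offsets recovers the entity text exactly.
    assert sentence[ent.start:ent.end] == ent.text
    print(f"{ent.start:>3}-{ent.end:<3} {ent.label:<9} {ent.text}")
```

Keeping offsets alongside the text lets downstream systems highlight or link the exact span instead of searching for the string again.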

Standard Entity Types

Label     Description                         Example
────────────────────────────────────────────────────────────────────────────────────
PERSON    People and fictional characters     Marie Curie, Sherlock Holmes
ORG       Companies, agencies, institutions   Google, WHO, NASA
GPE       Countries, cities, states           France, Tokyo, California
LOC       Non-GPE locations, landmarks        Mount Everest, the Amazon
DATE      Absolute and relative dates         June 2025, last Tuesday
TIME      Times of day                        3:00 PM, midnight
MONEY     Monetary values                     $4.5 billion, €200
PERCENT   Percentages                         12.5%, three percent
PRODUCT   Named products                      Model 3, iPhone 17
EVENT     Named events                        World Cup 2026, the French Revolution

NER with spaCy

import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = """
In Q1 2025, Microsoft acquired Inflection AI for $650 million.
The deal was announced by CEO Satya Nadella at their Redmond headquarters.
"""

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:<30} {ent.label_:<12} {spacy.explain(ent.label_)}")

# Illustrative output; exact spans and labels depend on the model version
# (the model may also tag "Q1 2025" as a DATE):
# Microsoft       ORG      Companies, agencies, institutions, etc.
# Inflection AI   ORG      Companies, agencies, institutions, etc.
# $650 million    MONEY    Monetary values, including unit
# Satya Nadella   PERSON   People, including fictional
# Redmond         GPE      Countries, cities, states

Visualize in a notebook:

from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

NER with Hugging Face Transformers

Transformer-based NER models generally score higher than small non-transformer pipelines such as en_core_web_sm on standard benchmarks, usually at the cost of slower inference:

from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entity spans
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Elon Musk's SpaceX launched Starship from Boca Chica, Texas in May 2025."
results = ner(text)
for entity in results:
    print(f"{entity['word']:<20} {entity['entity_group']:<8} score: {entity['score']:.3f}")

# Illustrative output; this model only predicts PER, ORG, LOC, and MISC,
# so "May 2025" is not tagged:
# Elon Musk    PER    score: 0.999
# SpaceX       ORG    score: 0.998
# Starship     MISC   score: 0.941
# Boca Chica   LOC    score: 0.997
# Texas        LOC    score: 0.998
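
Because every prediction carries a score, a common follow-up step is to drop low-confidence entities and group the rest by type. The sketch below assumes the list-of-dicts shape that the aggregated pipeline returns (`word`, `entity_group`, `score`); the sample list is hand-written to mirror the illustrative output above rather than taken from a real model run, and both the helper name and the 0.95 threshold are arbitrary choices for illustration.

```python
from collections import defaultdict


def group_confident_entities(results, min_score=0.95):
    """Group entity strings by type, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the shape the aggregated pipeline returns
sample_results = [
    {"word": "Elon Musk", "entity_group": "PER", "score": 0.999},
    {"word": "SpaceX", "entity_group": "ORG", "score": 0.998},
    {"word": "Starship", "entity_group": "MISC", "score": 0.941},
    {"word": "Boca Chica", "entity_group": "LOC", "score": 0.997},
    {"word": "Texas", "entity_group": "LOC", "score": 0.998},
]

confident = group_confident_entities(sample_results, min_score=0.95)
print(confident)
# {'PER': ['Elon Musk'], 'ORG': ['SpaceX'], 'LOC': ['Boca Chica', 'Texas']}
```

The right threshold depends on whether missed entities or false positives cost more in the application, so it is worth tuning on held-out data.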

Fine-Tuning NER for a Custom Domain

Pre-trained models often miss domain-specific entities (drug names, legal clauses, custom product codes). Fine-tuning a base model on an annotated in-domain dataset is the usual remedy:

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# 1. Prepare annotated data in CoNLL or IOB2 format
label_list = ["O", "B-DRUG", "I-DRUG"]  # hypothetical example; use your custom entity labels

# 2. Load a base model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),  # one output class per tag
)

# 3. Fine-tune
training_args = TrainingArguments(
    output_dir="./ner-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # needs an evaluation dataset to pick the best checkpoint
)
trainer = Trainer(
    model=model,
    args=training_args,
    # ... also pass your tokenized train/eval datasets and a data collator
)
trainer.train()
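
One step the outline above leaves implicit is label alignment: annotations are per word, but BERT-style tokenizers split words into subword pieces, so each piece needs a label or an "ignore" marker. The sketch below shows one common convention (label only the first piece of each word), written in plain Python so it runs without a tokenizer. The `word_ids` list stands in for what a fast tokenizer's `word_ids()` method returns when the input is tokenized with `is_split_into_words=True`; the example sentence and its subword split are assumed for illustration.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value by default


def align_labels_with_tokens(word_ids, word_labels):
    """Map word-level label ids onto subword tokens.

    word_ids: for each token, the index of the word it came from, or None for
              special tokens such as [CLS] and [SEP].
    word_labels: one label id per word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first piece of a word keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # later pieces are ignored by the loss
        previous_word = word_id
    return aligned


# Hypothetical example: words ["Take", "ibuprofen", "daily"] labeled O, B-DRUG, O,
# where "ibuprofen" is assumed to split into three subword pieces.
label_list = ["O", "B-DRUG", "I-DRUG"]
word_labels = [0, 1, 0]
word_ids = [None, 0, 1, 1, 1, 2, None]

print(align_labels_with_tokens(word_ids, word_labels))
# [-100, 0, 1, -100, -100, 0, -100]
```

An alternative convention labels every piece of a word (using the I- tag for continuation pieces); either works as long as training and evaluation use the same scheme.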

IOB2 Annotation Format

NER training data uses the IOB2 (Inside-Outside-Beginning) tagging scheme:

Token        Tag
──────────────────────────
Apple        B-ORG        ← Beginning of an ORG entity
Inc          I-ORG        ← Inside the same entity
released     O            ← Outside any entity
iPhone       B-PRODUCT
17           I-PRODUCT
in           O
California   B-GPE
.            O
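
Reading the tags back into entities is a small state machine: a B- tag opens a span, matching I- tags extend it, and anything else closes it. The following is a minimal sketch of that decoding for the example above; `iob2_to_spans` is an illustrative helper name, and dropping stray I- tags (ones with no matching B- before them) is one possible policy, not the only one.

```python
def iob2_to_spans(tokens, tags):
    """Convert parallel token/tag lists into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    # A trailing "O" sentinel flushes an entity that ends on the last token.
    for token, tag in zip(list(tokens) + [""], list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current_label
        if continues:
            current_tokens.append(token)
            continue
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        if tag.startswith("B-"):
            current_tokens, current_label = [token], tag[2:]
        else:
            # "O", or a stray I- tag with no matching B-, which strict IOB2 forbids
            current_tokens, current_label = [], None
    return spans


tokens = ["Apple", "Inc", "released", "iPhone", "17", "in", "California", "."]
tags = ["B-ORG", "I-ORG", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-GPE", "O"]

print(iob2_to_spans(tokens, tags))
# [('Apple Inc', 'ORG'), ('iPhone 17', 'PRODUCT'), ('California', 'GPE')]
```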

NER Applications in 2025

Automated news analysis — extract who did what to whom from thousands of articles per hour.

RAG pipeline enrichment — tag documents with entities before indexing so retrieval can filter by person, organization, or location.

Medical NLP — identify drug names, dosages, symptoms, and conditions from clinical notes (using specialized models like BioBERT or clinical spaCy models).

Contract analysis — extract party names, effective dates, and monetary terms from legal documents.

Resume parsing — identify skills, companies, job titles, and education entities for recruitment pipelines.
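
As a rough illustration of the retrieval-filtering idea in the RAG item above, the sketch below builds an inverted index from entities to document ids and filters on it. It assumes the entities were already extracted by an NER model; the document ids, the helper names, and the lowercase normalization are all hypothetical choices for this example.

```python
from collections import defaultdict


def build_entity_index(docs):
    """Map (label, lowercased entity text) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for text, label in entities:
            index[(label, text.lower())].add(doc_id)
    return index


def filter_by_entity(index, label, text):
    """Return ids of documents tagged with the given entity, sorted for stable output."""
    return sorted(index.get((label, text.lower()), set()))


# Hypothetical documents with entities already extracted by an NER model
tagged_docs = {
    "doc-1": [("Microsoft", "ORG"), ("Satya Nadella", "PERSON"), ("Redmond", "GPE")],
    "doc-2": [("SpaceX", "ORG"), ("Elon Musk", "PERSON"), ("Texas", "GPE")],
    "doc-3": [("Microsoft", "ORG"), ("Texas", "GPE")],
}

index = build_entity_index(tagged_docs)
print(filter_by_entity(index, "ORG", "microsoft"))  # ['doc-1', 'doc-3']
print(filter_by_entity(index, "GPE", "Texas"))      # ['doc-2', 'doc-3']
```

In a real pipeline the same entity tags would more likely be stored as metadata in the vector store and applied as a pre-filter before similarity search.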


Current Benchmarks (2025)

On the CoNLL-2003 English benchmark:

For most production use cases with standard entities, the dslim/bert-base-NER model on Hugging Face Hub, which is fine-tuned on CoNLL-2003 and predicts four entity types (PER, ORG, LOC, MISC), works well out of the box.
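
CoNLL-2003 results are conventionally reported as entity-level F1, where a prediction counts as correct only if both its span and its type match the gold annotation exactly. The sketch below computes that metric for a single hand-made example; the `(start, end, label)` span tuples and the function name are assumptions for illustration, not a benchmark result.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical token-offset spans for one sentence
gold = [(0, 2, "ORG"), (3, 5, "PRODUCT"), (6, 7, "GPE")]
predicted = [(0, 2, "ORG"), (3, 4, "PRODUCT"), (6, 7, "GPE")]  # second span has a wrong boundary

precision, recall, f1 = entity_prf(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# P=0.667 R=0.667 F1=0.667
```

Under exact matching, a span that is off by one token earns no partial credit: it counts as both a false positive and a missed entity.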