Named Entity Recognition (NER)
Named Entity Recognition identifies and classifies named entities in text — people, organizations, locations, dates, monetary values, and more. It converts unstructured text into structured facts that systems can act on.
What Counts as a Named Entity?
"Apple released the iPhone 17 in San Francisco on September 9, 2025, with CEO Tim Cook presenting to 3,000 attendees."
Entities:

- Apple → ORG (organization)
- iPhone 17 → PRODUCT
- San Francisco → GPE (geopolitical entity / city)
- September 9, 2025 → DATE
- Tim Cook → PERSON
- 3,000 → CARDINAL

Standard Entity Types

| Label | Description | Example |
|---|---|---|
| PERSON | People and fictional characters | Marie Curie, Sherlock Holmes |
| ORG | Companies, agencies, institutions | Google, WHO, NASA |
| GPE | Countries, cities, states | France, Tokyo, California |
| LOC | Non-GPE locations, landmarks | Mount Everest, the Amazon |
| DATE | Absolute and relative dates | June 2025, last Tuesday |
| TIME | Times of day | 3:00 PM, midnight |
| MONEY | Monetary values | $4.5 billion, €200 |
| PERCENT | Percentages | 12.5%, three percent |
| PRODUCT | Named products | Model 3, iPhone 17 |
| EVENT | Named events | World Cup 2026, the French Revolution |
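
To make the idea of "structured facts" concrete, the sketch below stores the entities from the example sentence as records with character offsets. The `Entity` class and the `locate` helper are illustrative names, not part of any library, and the labels are assigned by hand rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the sentence
    label: str  # entity type, e.g. "ORG" or "DATE"
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


def locate(sentence: str, surface: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of `surface` in `sentence`."""
    start = sentence.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in sentence")
    return Entity(surface, label, start, start + len(surface))


sentence = (
    "Apple released the iPhone 17 in San Francisco on September 9, 2025, "
    "with CEO Tim Cook presenting to 3,000 attendees."
)

# Hand-assigned labels mirroring the example sentence (not model output).
annotations = [
    ("Apple", "ORG"),
    ("iPhone 17", "PRODUCT"),
    ("San Francisco", "GPE"),
    ("September 9, 2025", "DATE"),
    ("Tim Cook", "PERSON"),
    ("3,000", "CARDINAL"),
]
entities = [locate(sentence, surface, label) for surface, label in annotations]

for ent in entities:
    print(f"{ent.start:>3}-{ent.end:<3} {ent.label:<9} {ent.text}")
```

Keeping offsets alongside the text lets downstream code highlight or link the exact span instead of searching for the string again.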
NER with spaCy
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = """In Q1 2025, Microsoft acquired Inflection AI for $650 million.
The deal was announced by CEO Satya Nadella at their Redmond headquarters."""

doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:<30} {ent.label_:<12} {spacy.explain(ent.label_)}")

# Example output (exact entities and descriptions depend on the model version):
# Microsoft                      ORG          Companies, agencies, institutions
# Inflection AI                  ORG          Companies, agencies, institutions
# $650 million                   MONEY        Monetary values
# Satya Nadella                  PERSON       People, including fictional
# Redmond                        GPE          Countries, cities, states
```

Visualize in a notebook:

```python
from spacy import displacy

# Reuses the `doc` object created above.
displacy.render(doc, style="ent", jupyter=True)
```

NER with Hugging Face Transformers
Transformer-based NER models generally outperform smaller non-transformer pipelines on standard benchmarks such as CoNLL-2003:
```python
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Elon Musk's SpaceX launched Starship from Boca Chica, Texas in May 2025."
results = ner(text)

for entity in results:
    print(f"{entity['word']:<20} {entity['entity_group']:<8} score: {entity['score']:.3f}")

# Elon Musk            PER      score: 0.999
# SpaceX               ORG      score: 0.998
# Starship             MISC     score: 0.941
# Boca Chica           LOC      score: 0.997
# Texas                LOC      score: 0.998
```

Fine-Tuning NER for a Custom Domain
Pre-trained models often miss domain-specific entities (drug names, legal clauses, custom product codes). Fine-tuning on a small annotated dataset usually closes much of that gap:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments

# 1. Prepare annotated data in CoNLL or IOB2 format
label_list = ["O", "B-DRUG", "I-DRUG"]  # hypothetical labels; replace with your own

# 2. Load a base model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),  # your custom entity labels
)

# 3. Fine-tune
training_args = TrainingArguments(
    output_dir="./ner-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # named evaluation_strategy before transformers 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    # ... also pass your tokenized train_dataset and eval_dataset here
)
trainer.train()
```

IOB2 Annotation Format
NER training data commonly uses the IOB2 (Inside-Outside-Beginning) tagging scheme:
```text
Token        Tag
──────────────────────
Apple        B-ORG       ← Beginning of an ORG entity
Inc          I-ORG       ← Inside the same entity
released     O           ← Outside any entity
iPhone       B-PRODUCT
17           I-PRODUCT
in           O
California   B-GPE
.            O
```

NER Applications in 2025
- Automated news analysis: extract who did what to whom from thousands of articles per hour.
- RAG pipeline enrichment: tag documents with entities before indexing so retrieval can filter by person, organization, or location.
- Medical NLP: identify drug names, dosages, symptoms, and conditions in clinical notes, using specialized models such as BioBERT or biomedical spaCy pipelines such as scispaCy.
- Contract analysis: extract party names, effective dates, and monetary terms from legal documents.
- Resume parsing: identify skills, companies, job titles, and education entities for recruitment pipelines.
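
As one illustration of the retrieval use case, the sketch below builds a small in-memory index from entities to document ids and filters documents by entity. The document ids, the sample entities, and the function names `build_entity_index` and `filter_docs` are all hypothetical; a real pipeline would populate the entities from an NER model at indexing time and would likely store them as metadata in a search or vector database.

```python
from collections import defaultdict

# Hypothetical pre-extracted entities per document, as (text, label) pairs.
doc_entities = {
    "doc-1": [("Microsoft", "ORG"), ("Satya Nadella", "PERSON"), ("Redmond", "GPE")],
    "doc-2": [("SpaceX", "ORG"), ("Elon Musk", "PERSON"), ("Texas", "GPE")],
    "doc-3": [("Microsoft", "ORG"), ("Texas", "GPE")],
}


def build_entity_index(doc_entities):
    """Map each (label, lowercased text) key to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for text, label in entities:
            index[(label, text.lower())].add(doc_id)
    return index


def filter_docs(index, **required):
    """Return ids of documents that contain every requested entity, e.g. ORG="Microsoft"."""
    matches = [index.get((label, text.lower()), set()) for label, text in required.items()]
    return set.intersection(*matches) if matches else set()


index = build_entity_index(doc_entities)
print(sorted(filter_docs(index, ORG="Microsoft")))               # ['doc-1', 'doc-3']
print(sorted(filter_docs(index, ORG="Microsoft", GPE="Texas")))  # ['doc-3']
```

Lowercasing is the only normalization applied here; matching variants such as "Microsoft Corp." to the same key would need an entity-linking or alias step that this sketch leaves out.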
Current Benchmarks (2025)
Approximate reported F1 scores on the CoNLL-2003 English test set:
- BERT-large fine-tuned: ~92.8 F1
- DeBERTa-v3-large: ~93.5 F1
- Flair with stacked embeddings: ~93.2 F1
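
CoNLL-style F1 is computed over whole entities: a prediction counts only if both its span and its label exactly match the gold annotation. The sketch below shows that calculation for a single IOB2-tagged sentence. The function names are illustrative, stray I- tags are simply ignored as a simplification, and a real evaluation would sum the counts over the whole corpus (typically with an established tool such as seqeval) rather than score one sentence.

```python
def iob2_to_spans(tags):
    """Convert a list of IOB2 tags into a set of (label, start, end) spans, end exclusive."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes an open entity
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            # An I- tag with no matching open entity is ignored in this sketch.
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 with exact span and label matching."""
    gold, pred = iob2_to_spans(gold_tags), iob2_to_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tokens: Apple Inc released iPhone 17 in California .
gold = ["B-ORG", "I-ORG", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-GPE", "O"]
# Hypothetical prediction that cuts the ORG span short ("Apple" instead of "Apple Inc").
pred = ["B-ORG", "O", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-GPE", "O"]

precision, recall, f1 = span_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.667 f1=0.667
```

The truncated ORG span earns no partial credit, which is why small boundary errors can move entity-level F1 noticeably.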
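For production use cases limited to the standard CoNLL-2003 entity types (PER, ORG, LOC, MISC), the dslim/bert-base-NER model on the Hugging Face Hub is a strong out-of-the-box baseline. It does not tag dates, monetary values, or other types, so those need a different model or fine-tuning.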
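
When using an off-the-shelf model in production, a common post-processing step is to drop low-confidence predictions before acting on them. The sketch below assumes entity dictionaries with the `entity_group`, `score`, and `word` keys that the aggregated Hugging Face pipeline returns. The `filter_entities` helper, the 0.95 threshold, and the hand-written sample are illustrative assumptions, not recommended settings or real model output.

```python
def filter_entities(results, min_score=0.9, labels=None):
    """Keep entities scoring at least min_score and, if labels is given, with a label in it."""
    kept = []
    for ent in results:
        if ent["score"] < min_score:
            continue
        if labels is not None and ent["entity_group"] not in labels:
            continue
        kept.append(ent)
    return kept


# Hand-written sample shaped like aggregated pipeline output (only the keys used here).
sample = [
    {"word": "Elon Musk", "entity_group": "PER", "score": 0.999},
    {"word": "SpaceX", "entity_group": "ORG", "score": 0.998},
    {"word": "Starship", "entity_group": "MISC", "score": 0.941},
    {"word": "Boca Chica", "entity_group": "LOC", "score": 0.997},
    {"word": "Texas", "entity_group": "LOC", "score": 0.998},
]

confident = filter_entities(sample, min_score=0.95)
print([ent["word"] for ent in confident])
# ['Elon Musk', 'SpaceX', 'Boca Chica', 'Texas']

places = filter_entities(sample, min_score=0.95, labels={"LOC"})
print([ent["word"] for ent in places])
# ['Boca Chica', 'Texas']
```

A suitable threshold depends on the model and on the cost of false positives versus misses, so it is best chosen by checking precision and recall on a labeled sample from the target domain.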