Metadata-Aware Retrieval: When Knowing Who Said It and When Matters

Pure semantic similarity has a flat worldview — a document from three years ago is as relevant as one from last week if the text vectors are close. A document from an unofficial source ranks the same as one from the authoritative team.

In most real-world knowledge bases, that’s wrong. Recency matters. Source credibility matters. Document type matters. Metadata-aware retrieval incorporates these signals directly into how documents are ranked and selected.

The Limitation of Pure Similarity Ranking

Query: "What is the current API rate limit for our service?"

Pure semantic results:
  Rank 1: "API rate limits are 100 req/min" — from 2021 documentation (score: 0.92)
  Rank 2: "Rate limiting allows 500 req/min for enterprise" — from 2024 doc (score: 0.89)
  Rank 3: "API calls are throttled at 100 per minute" — from archived blog 2022 (score: 0.88)

The correct answer is the 2024 document. Pure semantic search ranked it second
because the 2021 doc's phrasing was slightly closer to the query.

Metadata-aware retrieval adjusts this ranking by incorporating temporal recency, source type, and other structural signals.

Recency Weighting

Time-aware retrieval boosts recent documents and penalizes stale ones:

import math
from datetime import datetime, timezone

def recency_weighted_score(
    semantic_score: float,
    doc_date: datetime,
    recency_weight: float = 0.3,     # how much recency influences final score
    half_life_days: int = 180,       # how quickly documents decay (6 months)
) -> float:
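    if doc_date.tzinfo is None:
        # Assume naive timestamps are UTC; subtracting naive from aware raises TypeError
        doc_date = doc_date.replace(tzinfo=timezone.utc)
    # Clamp at 0 so future-dated documents are not boosted above 1.0
    days_old = max((datetime.now(timezone.utc) - doc_date).days, 0)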
    # Exponential decay: score = e^(-λt), λ = ln(2)/half_life
    decay_factor = math.exp(-math.log(2) * days_old / half_life_days)
    # Combine semantic similarity with recency signal
    return (1 - recency_weight) * semantic_score + recency_weight * decay_factor

# Example:
# 2021 doc (3 years old): semantic=0.92
#   → final = 0.7 * 0.92 + 0.3 * exp(-3*365/180*ln2) = 0.644 + 0.004 = 0.648
# 2024 doc (0 days old): semantic=0.89
#   → final = 0.7 * 0.89 + 0.3 * 1.0 = 0.623 + 0.300 = 0.923 ← wins

The half-life parameter is domain-specific:

API documentation: 90–180 days (changes frequently)
Legal regulations: 365–730 days (changes infrequently)
Historical records: no decay (older is not worse)
News/current events: 7–30 days (very short half-life)
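
The domain guidance above can be captured in a small lookup table. The following is a minimal sketch rather than a definitive implementation: the `HALF_LIFE_DAYS` table, its keys, the specific values picked from within each range, and the `decay_factor` helper are all illustrative assumptions. `None` stands for "no decay".

```python
import math
from typing import Optional

# Hypothetical per-domain half-lives in days, picked from the ranges above.
# None means "no decay": age does not reduce the score.
HALF_LIFE_DAYS: dict[str, Optional[int]] = {
    "api_docs": 180,
    "legal": 730,
    "historical": None,
    "news": 14,
}

def decay_factor(days_old: float, half_life_days: Optional[int]) -> float:
    """Return a freshness value in (0, 1]; 1.0 means brand new or no decay."""
    if half_life_days is None:
        return 1.0
    days_old = max(days_old, 0.0)  # treat future-dated documents as brand new
    return math.exp(-math.log(2) * days_old / half_life_days)

# Freshness of a one-year-old document in each domain
for domain, half_life in HALF_LIFE_DAYS.items():
    print(f"{domain:>10}: {decay_factor(365, half_life):.3f}")
```

By construction, a document exactly one half-life old gets a factor of 0.5, which makes the parameter easy to reason about when tuning it per collection.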

Source Credibility Scoring

Not all sources are equal. Official documentation should rank higher than community posts for authoritative answers:

SOURCE_CREDIBILITY = {
    "official_docs": 1.0,
    "product_changelog": 0.95,
    "engineering_blog": 0.80,
    "internal_wiki": 0.75,
    "community_forum": 0.55,
    "archived_content": 0.40,
}

def credibility_weighted_score(
    semantic_score: float,
    source_type: str,
    credibility_weight: float = 0.2,
) -> float:
    credibility = SOURCE_CREDIBILITY.get(source_type, 0.5)
    return (1 - credibility_weight) * semantic_score + credibility_weight * credibility
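
To see the effect on a ranking, here is a small self-contained sketch. It repeats a trimmed copy of the table and function so it runs on its own; the three candidates and the source types assigned to them are made up for illustration.

```python
# Trimmed copy of the credibility table above, so the example runs on its own.
SOURCE_CREDIBILITY = {
    "official_docs": 1.0,
    "community_forum": 0.55,
    "archived_content": 0.40,
}

def credibility_weighted_score(semantic_score, source_type, credibility_weight=0.2):
    credibility = SOURCE_CREDIBILITY.get(source_type, 0.5)  # unknown sources get 0.5
    return (1 - credibility_weight) * semantic_score + credibility_weight * credibility

# (label, semantic score, assumed source type); all values are hypothetical
candidates = [
    ("forum answer", 0.92, "community_forum"),
    ("official reference", 0.89, "official_docs"),
    ("archived post", 0.88, "archived_content"),
]

ranked = sorted(
    ((label, credibility_weighted_score(score, source)) for label, score, source in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for label, final in ranked:
    print(f"{label}: {final:.3f}")
```

With the default weight of 0.2, the official reference moves to the top at 0.912, ahead of the forum answer at 0.846 and the archived post at 0.784, even though the forum answer had the highest semantic score.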

Combining Multiple Metadata Signals

In practice, you combine multiple metadata signals into a single reranking score:

from dataclasses import dataclass

@dataclass
class MetadataSignals:
    semantic_score: float
    created_at: datetime
    source_type: str
    doc_version: str  # "current", "deprecated", "archived"
    relevance_votes: int  # user feedback signal
    language: str
    target_audience: str  # "beginner", "advanced", "internal"

def metadata_aware_score(signals: MetadataSignals, user_context: dict) -> float:
    score = signals.semantic_score

    # Recency signal
    # created_at must be timezone-aware; clamp at 0 so future dates are not over-boosted
    days_old = max((datetime.now(timezone.utc) - signals.created_at).days, 0)
    recency = math.exp(-math.log(2) * days_old / 180)

    # Deprecation penalty
    version_penalty = {
        "current": 1.0,
        "deprecated": 0.5,
        "archived": 0.2,
    }.get(signals.doc_version, 0.8)

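    # Audience relevance: penalize only a known mismatch; "general" fits everyone
    audience_score = 1.0
    user_level = user_context.get("expertise_level")
    if user_level and signals.target_audience not in ("general", user_level):
        audience_score = 0.85  # slight penalty for mismatched audience

    # User feedback signal: ignore negative counts and cap the boost at 1.3.
    # Because the cap exceeds 1.0, final_score can reach about 1.015.
    vote_boost = min(1.0 + max(signals.relevance_votes, 0) * 0.05, 1.3)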

    # Language match
    lang_score = 1.0 if signals.language == user_context.get("language", "en") else 0.7

    # Weighted combination
    final_score = (
        0.60 * score         # semantic similarity (dominant signal)
        + 0.15 * recency     # freshness
        + 0.10 * version_penalty
        + 0.05 * audience_score
        + 0.05 * vote_boost
        + 0.05 * lang_score
    )
    return final_score

Mandatory Hard Filters vs Soft Scoring

Distinguish between signals that are hard constraints (exclude documents) and those that are soft scores (affect ranking):

Hard constraints (always exclude):
  - Document is confidential and user doesn't have access
  - Document is from a different tenant
  - Document language doesn't match user language (if strict)

Soft scores (affect ranking, not exclusion):
  - Recency (recent is better, old is not necessarily wrong)
  - Source credibility
  - Audience match
  - User feedback votes

Hard filtering should happen before vector search (at the database layer via metadata filters). Soft scoring happens after retrieval, in the application layer.

def metadata_aware_retrieval(
    query: str,
    user: dict,
    vectorstore,
    k: int = 10,
    over_fetch: int = 30,  # fetch more to allow reranking
) -> list:
    # Hard filters — applied at vector store level
    hard_filter = {
        "tenant_id": user["tenant_id"],
        "access_level": {"$in": user["access_levels"]},
        "status": {"$ne": "archived"},
    }

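    # Fetch more candidates to allow metadata-based reranking.
    # Assumes the store returns similarity scores (higher is better); some
    # stores return distances instead, which must be converted first.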
    candidates = vectorstore.similarity_search_with_score(
        query,
        k=over_fetch,
        filter=hard_filter,
    )

    # Apply soft metadata scoring
    reranked = []
    for doc, semantic_score in candidates:
        signals = MetadataSignals(
            semantic_score=semantic_score,
            created_at=doc.metadata["created_at"],
            source_type=doc.metadata["source_type"],
            doc_version=doc.metadata.get("version", "current"),
            relevance_votes=doc.metadata.get("votes", 0),
            language=doc.metadata.get("language", "en"),
            target_audience=doc.metadata.get("audience", "general"),
        )
        final_score = metadata_aware_score(signals, user)
        reranked.append((doc, final_score))

    reranked.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:k]]
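
The `over_fetch` parameter matters because reranking can only promote documents that were retrieved in the first place. The toy example below illustrates this with an in-memory list instead of a real vector store; the corpus, the freshness values, and the helper names are all made up for illustration.

```python
# Each tuple: (doc id, semantic score, freshness in [0, 1]); values are made up.
corpus = [
    ("rate-limits-2021", 0.92, 0.01),
    ("throttling-blog-2022", 0.90, 0.05),
    ("rate-limits-2024", 0.89, 1.00),
    ("changelog-2024", 0.80, 0.95),
]

def top_by_semantic(n: int) -> list[tuple[str, float, float]]:
    """Stand-in for a vector store call: the n most similar documents."""
    return sorted(corpus, key=lambda d: d[1], reverse=True)[:n]

def rerank(candidates, k: int, recency_weight: float = 0.3) -> list[str]:
    """Blend semantic score with freshness, then keep the best k."""
    scored = [
        (doc_id, (1 - recency_weight) * sem + recency_weight * fresh)
        for doc_id, sem, fresh in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

k = 2
without_over_fetch = rerank(top_by_semantic(k), k)
with_over_fetch = rerank(top_by_semantic(4), k)
print(without_over_fetch)  # ['rate-limits-2021', 'throttling-blog-2022']
print(with_over_fetch)     # ['rate-limits-2024', 'changelog-2024']
```

Fetching only `k` candidates leaves the stale documents in place, because the fresh ones never reach the reranker. Fetching a larger pool lets the metadata signals do their work.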

Temporal Query Understanding

Some queries are inherently time-scoped and should trigger recency weighting automatically:

TEMPORAL_SIGNAL_KEYWORDS = [
    "current", "latest", "recent", "new", "updated", "now",
    "today", "this year", "2024", "2025", "modern", "state of the art"
]

HISTORICAL_SIGNAL_KEYWORDS = [
    "history", "original", "first", "when was", "historically",
    "in the past", "legacy", "classic"
]

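import re

def _mentions(query: str, keywords: list[str]) -> bool:
    # Match whole words/phrases so "now" does not fire on "know" or "new" on "news"
    return any(re.search(rf"\b{re.escape(kw)}\b", query) for kw in keywords)

def detect_temporal_intent(query: str) -> str:
    query_lower = query.lower()
    if _mentions(query_lower, TEMPORAL_SIGNAL_KEYWORDS):
        return "recency_boosted"
    elif _mentions(query_lower, HISTORICAL_SIGNAL_KEYWORDS):
        return "historical"
    return "neutral"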

# Adjust recency_weight based on detected intent
intent = detect_temporal_intent("What's the current best practice for RAG chunking?")
recency_weight = {"recency_boosted": 0.40, "historical": 0.05, "neutral": 0.15}[intent]
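
To make the effect of the intent-dependent weight concrete, the sketch below ranks the same two candidates under each intent. The candidates and their freshness values are hypothetical, and `rank_for_intent` is an illustrative helper rather than part of the code above.

```python
# Hypothetical candidates: (label, semantic score, freshness in [0, 1]).
candidates = [
    ("original design doc", 0.90, 0.05),
    ("latest guidelines", 0.82, 0.95),
]

# Same intent-to-weight table as above.
RECENCY_WEIGHT_BY_INTENT = {"recency_boosted": 0.40, "historical": 0.05, "neutral": 0.15}

def rank_for_intent(intent: str) -> list[str]:
    """Rank candidates using the recency weight chosen for this intent."""
    w = RECENCY_WEIGHT_BY_INTENT[intent]
    scored = [(label, (1 - w) * sem + w * fresh) for label, sem, fresh in candidates]
    return [label for label, _ in sorted(scored, key=lambda s: s[1], reverse=True)]

for intent in RECENCY_WEIGHT_BY_INTENT:
    print(intent, "->", rank_for_intent(intent)[0])
```

With these numbers, the older document wins only under the "historical" intent (0.8575 versus 0.8265). Under "neutral" and "recency_boosted" the fresher document ranks first.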

2025 Trend: Implicit User Context Signals

Production systems increasingly incorporate implicit user context into metadata scoring — what the user has been reading recently, their stated expertise level, their organization’s active projects. A user in the “backend engineering” team asking about APIs gets internal API documentation prioritized over external user guides, even when semantic similarity scores are equal.

This personalization layer is thin but meaningful: a lightweight metadata boost based on user profile attributes that adds negligible cost at query time, since it is just score arithmetic on candidates that have already been retrieved.
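
A minimal sketch of such a boost follows. Everything here is an assumption made for illustration: the `UserProfile` fields, the `owning_team` and `topics` metadata keys, and the 0.05 increments are hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    team: str
    recent_topics: set[str] = field(default_factory=set)

def profile_boost(doc_meta: dict, profile: UserProfile, max_boost: float = 0.10) -> float:
    """Small additive boost from profile/document overlap, capped at max_boost."""
    boost = 0.0
    if doc_meta.get("owning_team") == profile.team:
        boost += 0.05  # document is owned by the user's own team
    if profile.recent_topics & set(doc_meta.get("topics", [])):
        boost += 0.05  # document covers something the user read about recently
    return min(boost, max_boost)

def personalized_score(base_score: float, doc_meta: dict, profile: UserProfile) -> float:
    return base_score + profile_boost(doc_meta, profile)

backend = UserProfile(team="backend engineering", recent_topics={"apis"})
internal_api_doc = {"owning_team": "backend engineering", "topics": ["apis", "auth"]}
external_guide = {"owning_team": "developer relations", "topics": ["onboarding"]}

# Equal base scores: the internal document ranks higher after the boost
print(personalized_score(0.80, internal_api_doc, backend)
      > personalized_score(0.80, external_guide, backend))  # True
```

Capping the boost keeps personalization a tie-breaker: it can reorder documents with similar scores, but it cannot lift a weakly relevant document over a strongly relevant one.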

Metadata-aware retrieval is where RAG becomes smart about the context of information, not just the content. For any production knowledge base where documents carry meaningful metadata, integrating these signals is one of the more direct ways to improve retrieval quality, provided the weights are checked against real queries.