Graph RAG: When Connections Matter as Much as Content

Standard RAG treats each document chunk as an isolated unit: chunk A and chunk B are retrieved independently of each other. But in complex knowledge bases, meaning often lives in the relationships between entities: “Company X acquired Company Y, which was founded by the same team that built Product Z.”

No single chunk contains that full chain. Standard RAG has to retrieve every relevant chunk and hope the LLM connects the dots. Graph RAG stores that knowledge explicitly as a graph, so a query can traverse relationships and answer questions that semantic similarity search alone handles poorly.

The Core Idea

Standard RAG knowledge base:
  [Chunk 1: Company X acquired Company Y in 2022]
  [Chunk 2: Company Y was founded by John Smith and Mary Jones]
  [Chunk 3: Mary Jones previously built Product Z at StartupCo]
  [Chunk 4: Product Z is known for its graph-based search technology]

Query: "What expertise did Company X gain through its acquisition?"

Standard RAG: May retrieve Chunk 1 (acquisition) but miss the chain
→ Incomplete answer: "Company X acquired Company Y"

Graph RAG knowledge graph:
  Company X → [ACQUIRED] → Company Y
  Company Y → [FOUNDED_BY] → Mary Jones
  Mary Jones → [CREATED] → Product Z
  Product Z → [USES_TECHNOLOGY] → Graph-Based Search

Query traversal: Company X → ACQUIRED → Company Y → FOUNDED_BY → Mary Jones → CREATED → Product Z → USES_TECHNOLOGY → Graph-Based Search
→ Complete answer: "Through the acquisition of Company Y, Company X gained co-founder Mary Jones,
  who previously built Product Z, known for its graph-based search technology."
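
To make the traversal concrete, here is a minimal sketch in plain Python, assuming the four edges from the example are already available as triples. The function names (`build_adjacency`, `find_path`) are illustrative and not part of any Graph RAG library; a real system would run the equivalent query inside a graph database.

```python
from collections import deque

# Toy knowledge graph from the example: (subject, relation, object) triples.
TRIPLES = [
    ("Company X", "ACQUIRED", "Company Y"),
    ("Company Y", "FOUNDED_BY", "Mary Jones"),
    ("Mary Jones", "CREATED", "Product Z"),
    ("Product Z", "USES_TECHNOLOGY", "Graph-Based Search"),
]


def build_adjacency(triples):
    """Index outgoing edges by subject so each hop is a dictionary lookup."""
    adjacency = {}
    for subject, relation, obj in triples:
        adjacency.setdefault(subject, []).append((relation, obj))
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search; returns the edges of a shortest path, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


adjacency = build_adjacency(TRIPLES)
path = find_path(adjacency, "Company X", "Graph-Based Search")
for subject, relation, obj in path:
    print(f"{subject} -[{relation}]-> {obj}")
```

Each edge on the returned path corresponds to one of the original chunks, which is why the graph can assemble a chain that no single chunk states.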

Microsoft GraphRAG

Microsoft Research’s GraphRAG (2024) is the most widely referenced Graph RAG implementation. It processes a document corpus through an entity and relationship extraction pipeline, builds a knowledge graph, creates community hierarchies, and generates summaries at multiple levels.

GraphRAG Pipeline:

Phase 1: Entity and Relationship Extraction
  Documents → LLM extracts → (entity, relationship, entity) triples
  "AWS launched EC2 in 2006" → (AWS, LAUNCHED, EC2), (EC2, YEAR, 2006)

Phase 2: Graph Construction
  All triples → NetworkX/Neo4j graph
  Nodes = entities, Edges = relationships with descriptions

Phase 3: Community Detection
  Leiden algorithm partitions graph into community clusters
  Community 1: AWS, EC2, S3, Lambda (AWS services)
  Community 2: Azure, AKS, CosmosDB (Azure services)
  Community 3: GCP, GKE, BigQuery (GCP services)

Phase 4: Community Summarization
  LLM generates summaries for each community
  "The AWS services community includes compute (EC2), storage (S3), and
  serverless (Lambda) offerings from Amazon Web Services..."

Phase 5: Retrieval
  Global queries: map-reduce over community summaries (corpus-wide themes)
  Local queries: start from entities that match the query, then expand to their neighbors, relationships, and source text
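
Phases 2 and 3 can be sketched with the standard library alone. This is not GraphRAG's implementation: the triples and the `OFFERS` relation are assumed Phase 1 output, and connected components serve as a much cruder stand-in for the Leiden algorithm. They happen to give the same three clusters here only because the toy graph has no edges between providers.

```python
from collections import defaultdict

# Assumed Phase 1 output: triples an LLM might extract from cloud documentation.
TRIPLES = [
    ("AWS", "OFFERS", "EC2"),
    ("AWS", "OFFERS", "S3"),
    ("AWS", "OFFERS", "Lambda"),
    ("Azure", "OFFERS", "AKS"),
    ("Azure", "OFFERS", "CosmosDB"),
    ("GCP", "OFFERS", "GKE"),
    ("GCP", "OFFERS", "BigQuery"),
]


def build_graph(triples):
    """Phase 2: undirected adjacency sets, one node per entity."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def connected_components(graph):
    """Phase 3 stand-in: group nodes that can reach each other."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


communities = connected_components(build_graph(TRIPLES))
for index, members in enumerate(communities, start=1):
    print(f"Community {index}: {', '.join(members)}")
```

In Phase 4, each member list (plus the edge descriptions) would be handed to an LLM to produce the community summary. On a realistic graph, where almost everything is connected, a modularity-based method such as Leiden is needed to find meaningful clusters.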

Setting Up GraphRAG

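# Shell: install GraphRAG and initialize a workspace
pip install graphrag
mkdir graphrag_workspace && cd graphrag_workspace
graphrag init --root .
# Releases before the unified `graphrag` CLI used `python -m graphrag.index --init --root .`
# Put your OpenAI/Azure API key in the generated .env (GRAPHRAG_API_KEY),
# then place the source text files under ./input

# settings.yaml configuration (key names differ between releases;
# treat the generated file as the source of truth)
# llm:
#   type: openai_chat
#   model: gpt-4o-mini
#   max_tokens: 4000
# embeddings:
#   llm:
#     type: openai_embedding
#     model: text-embedding-3-small

# Python: run the indexing pipeline
import subprocess

# check=True raises CalledProcessError if the command fails
subprocess.run(["graphrag", "index", "--root", "."], check=True)

# Query the graph; the "global" method answers from community summaries
result = subprocess.run(
    ["graphrag", "query", "--root", ".", "--method", "global",
     "--query", "What are the main technology companies and their relationships?"],
    capture_output=True, text=True, check=True
)
print(result.stdout)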

Custom Graph RAG with LangChain + Neo4j

For more control, build your own graph pipeline:

from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# Connect to Neo4j
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="your-password"
)

# Extract entities and relationships with an LLM, constrained to a fixed schema
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
graph_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Company", "Product", "Technology", "Event"],
    allowed_relationships=["WORKS_AT", "FOUNDED", "ACQUIRED", "BUILT", "USES"],
)

# Raw text to index; replace with your own corpus
document_texts = [
    "Company X acquired Company Y in 2022.",
    "Company Y was founded by John Smith and Mary Jones.",
]
docs = [Document(page_content=text) for text in document_texts]

# Convert each document into nodes and relationships
graph_docs = graph_transformer.convert_to_graph_documents(docs)
# baseEntityLabel adds a shared __Entity__ label to every node;
# include_source stores each source text as a Document node linked to its entities
graph.add_graph_documents(graph_docs, baseEntityLabel=True, include_source=True)
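
The allow-lists matter because an unconstrained LLM invents node types and relation names freely. The sketch below shows the idea as a standalone post-check on extracted triples. It is not how `LLMGraphTransformer` works internally; the `Triple` type, the `conforms` helper, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source: str
    source_type: str
    relation: str
    target: str
    target_type: str


# Same allow-lists as the transformer configuration above.
ALLOWED_NODES = {"Person", "Company", "Product", "Technology", "Event"}
ALLOWED_RELATIONSHIPS = {"WORKS_AT", "FOUNDED", "ACQUIRED", "BUILT", "USES"}


def conforms(triple: Triple) -> bool:
    """True when both endpoint types and the relation are in the schema."""
    return (
        triple.source_type in ALLOWED_NODES
        and triple.target_type in ALLOWED_NODES
        and triple.relation in ALLOWED_RELATIONSHIPS
    )


# Hypothetical extraction output, including two off-schema triples.
candidates = [
    Triple("Mary Jones", "Person", "FOUNDED", "Company Y", "Company"),
    Triple("Company X", "Company", "ACQUIRED", "Company Y", "Company"),
    Triple("Company X", "Company", "HAS_NEW", "Product", "Product"),
    Triple("Company Y", "Company", "LOCATED_IN", "Berlin", "City"),
]

kept = [t for t in candidates if conforms(t)]
rejected = [t for t in candidates if not conforms(t)]
print(f"kept {len(kept)}, rejected {len(rejected)}")
```

A small, closed schema also keeps later Cypher generation simpler, since the query-writing LLM only has a handful of labels and relationship types to choose from.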

Hybrid Vector + Graph Retrieval

A hybrid setup pairs vector similarity (for semantic matching) with graph traversal (for relationship chains). The simplest version, shown here, routes each query to one of the two:

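from langchain.chains import GraphCypherQAChain
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Vector index over the source Document nodes written by include_source=True
vector_index = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="your-password",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding",
)

# Graph traversal for relationship queries: the LLM writes Cypher, Neo4j runs it.
# `graph` is the Neo4jGraph connection created in the previous section.
graph_chain = GraphCypherQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,
    verbose=True,
    return_direct=False,
    # Recent langchain-community releases require this opt-in because the
    # generated Cypher runs against your database; omit it if your version rejects it
    allow_dangerous_requests=True,
)

GRAPH_KEYWORDS = ["relationship", "connected", "acquired", "founded", "related", "how are"]

def synthesize_answer(query: str, docs) -> str:
    # Minimal answer synthesis from retrieved chunks; swap in your own prompt or chain
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return ChatOpenAI(model="gpt-4o-mini", temperature=0).invoke(prompt).content

def hybrid_graph_rag(query: str) -> str:
    # Naive keyword routing; a production router would use a classifier or an LLM call
    needs_graph = any(kw in query.lower() for kw in GRAPH_KEYWORDS)

    if needs_graph:
        # The chain returns a dict; the answer text is under "result"
        return graph_chain.invoke({"query": query})["result"]
    else:
        # Vector search for semantic content queries
        docs = vector_index.similarity_search(query, k=5)
        return synthesize_answer(query, docs)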
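
The router above picks one retriever per query. A tighter combination uses similarity search to choose seed entities and then expands them through the graph. The sketch below illustrates that seed-and-expand pattern with the standard library only: token overlap (Jaccard) stands in for embedding similarity, and `ENTITY_DESCRIPTIONS`, `EDGES`, `seed_entities`, and `expand` are hypothetical names with made-up data.

```python
import re

# Hypothetical entity descriptions and edges; in practice both live in Neo4j.
ENTITY_DESCRIPTIONS = {
    "Company X": "enterprise software company that acquired Company Y",
    "Company Y": "startup founded by John Smith and Mary Jones",
    "Product Z": "search product known for graph-based search technology",
}
EDGES = [
    ("Company X", "ACQUIRED", "Company Y"),
    ("Company Y", "FOUNDED_BY", "Mary Jones"),
    ("Mary Jones", "CREATED", "Product Z"),
]


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def seed_entities(query, k=1):
    """Stand-in for vector search: rank entities by Jaccard token overlap."""
    query_tokens = tokens(query)

    def score(name):
        entity_tokens = tokens(name + " " + ENTITY_DESCRIPTIONS[name])
        return len(query_tokens & entity_tokens) / len(query_tokens | entity_tokens)

    return sorted(ENTITY_DESCRIPTIONS, key=score, reverse=True)[:k]


def expand(seeds, hops=2):
    """Graph step: collect edges reachable within `hops` of the seed entities."""
    frontier, facts = set(seeds), []
    for _ in range(hops):
        next_frontier = set()
        for subject, relation, obj in EDGES:
            if subject in frontier and (subject, relation, obj) not in facts:
                facts.append((subject, relation, obj))
                next_frontier.add(obj)
        frontier = next_frontier
    return facts


seeds = seed_entities("Which startup did Company X acquire?")
facts = expand(seeds)
for subject, relation, obj in facts:
    print(f"{subject} -[{relation}]-> {obj}")
```

The collected facts would then be placed in the LLM prompt alongside any retrieved text chunks, so the model sees both the matching passages and the relationships around them.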

When Graph RAG Wins

Graph RAG tends to outperform standard RAG on queries whose answers depend on relationships rather than on any single passage:

✓ Multi-hop relationship queries
  "Who are the investors of companies that Microsoft acquired?"

✓ Pattern queries across entities
  "Which employees worked at both Company A and Company B?"

✓ Aggregation over entity properties
  "What technologies do our top 10 enterprise customers use?"

✓ Temporal relationship chains
  "Trace the evolution of this product from original founder to current owner"

Standard RAG still wins for:
→ Direct factual lookups ("What is the refund policy?")
→ Semantic similarity queries ("Find documents about machine learning")
→ Content-dense queries that don't involve entity relationships
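
The pattern query listed above ("Which employees worked at both Company A and Company B?") shows why structure helps. Similarity search returns chunks that mention either company, whereas the answer is a set intersection over edges. A minimal sketch, assuming a hypothetical `WORKED_AT` relation and made-up employees:

```python
# Hypothetical employment edges: (person, relation, employer).
EMPLOYMENT = [
    ("Alice", "WORKED_AT", "Company A"),
    ("Alice", "WORKED_AT", "Company B"),
    ("Bob", "WORKED_AT", "Company A"),
    ("Carol", "WORKED_AT", "Company B"),
]


def employees_of(company, edges):
    """Everyone with a WORKED_AT edge pointing at the given company."""
    return {
        person
        for person, relation, employer in edges
        if relation == "WORKED_AT" and employer == company
    }


# The pattern query is an intersection of two edge lookups.
both = employees_of("Company A", EMPLOYMENT) & employees_of("Company B", EMPLOYMENT)
print(sorted(both))
```

In a graph database the same question is a single pattern match with two relationships sharing one person node.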

Knowledge Graph Quality Challenges

Graph RAG quality depends heavily on extraction accuracy:

LLM extraction errors compound:
  "Apple acquired Beats Electronics" → (Apple, ACQUIRED, Beats) ✓
  "Tim Cook leads Apple" → (Tim_Cook, LEADS, Apple) ✓

  But also:
  "Apple's new product" → (Apple, HAS_NEW, Product) ← vague
  "The company was profitable" → (unknown, WAS, profitable) ← coreference failure ("the company" is never linked to an entity)

Entity resolution (recognizing “Apple”, “Apple Inc.”, and “AAPL” as the same entity) and relationship normalization are ongoing challenges. Coreference resolution models help by linking mentions such as “the company” back to a named entity, but they don’t fully solve the problem.
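
The simplest form of entity resolution is an alias table applied before triples are written to the graph. The sketch below assumes a hand-maintained `ALIASES` mapping, which is a hypothetical stand-in; production systems usually add fuzzy matching or embedding similarity on top.

```python
# Hypothetical alias table: normalized surface form -> canonical entity name.
ALIASES = {
    "apple": "Apple Inc.",
    "apple inc": "Apple Inc.",
    "apple inc.": "Apple Inc.",
    "aapl": "Apple Inc.",
}


def canonicalize(name):
    """Map a surface form to its canonical name; unknown names pass through."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name)


def resolve_triples(triples):
    """Canonicalize both endpoints and drop triples that become duplicates."""
    resolved = {(canonicalize(s), r, canonicalize(o)) for s, r, o in triples}
    return sorted(resolved)


raw = [
    ("Apple", "ACQUIRED", "Beats Electronics"),
    ("AAPL", "ACQUIRED", "Beats Electronics"),
    ("Tim Cook", "LEADS", "Apple Inc."),
]
resolved = resolve_triples(raw)
for triple in resolved:
    print(triple)
```

Without this step the graph would hold three separate Apple nodes, and a traversal starting from one would miss edges attached to the others.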

2025 Trend: Sparse Knowledge Graphs

Rather than extracting a dense, complete knowledge graph from all documents, newer approaches keep only high-confidence relationships, such as those that appear in multiple documents or have been verified by domain experts. The reasoning is that a sparse, high-precision graph often serves retrieval better than a dense, noisy one, because every spurious edge is a path a traversal can wrongly follow. This “quality over quantity” approach to graph construction is gaining adoption in enterprise Graph RAG deployments.
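
One way to approximate "appearing in multiple documents" is a support count per triple. The sketch below is an assumption about how such a filter could look, not a description of any specific product; `sparse_graph`, the `min_support` threshold, and the sample extractions are illustrative.

```python
from collections import defaultdict


def sparse_graph(extractions, min_support=2):
    """Keep triples extracted from at least `min_support` distinct documents."""
    support = defaultdict(set)
    for doc_id, triple in extractions:
        support[triple].add(doc_id)
    return {
        triple: len(doc_ids)
        for triple, doc_ids in support.items()
        if len(doc_ids) >= min_support
    }


# Hypothetical extraction log: (document id, extracted triple).
extractions = [
    ("doc1", ("Apple", "ACQUIRED", "Beats")),
    ("doc2", ("Apple", "ACQUIRED", "Beats")),
    ("doc2", ("Apple", "HAS_NEW", "Product")),
    ("doc3", ("Tim Cook", "LEADS", "Apple")),
    ("doc4", ("Tim Cook", "LEADS", "Apple")),
    ("doc4", ("Tim Cook", "LEADS", "Apple")),  # repeat within one document
]

kept = sparse_graph(extractions)
for triple, count in kept.items():
    print(triple, "supported by", count, "documents")
```

Counting distinct documents rather than raw mentions keeps one repetitive source from promoting a weak relationship. The trade-off is recall: a true fact stated only once is dropped unless an expert adds it back.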

Graph RAG is the right tool when your knowledge base is inherently relational and your queries require traversing those relationships. For enterprise use cases involving organizational hierarchies, product ecosystems, research citation networks, or regulatory dependency chains, it delivers answers that semantic search alone struggles to produce.