Multimodal RAG: Retrieval Across Text, Images, Tables, and Documents

Build multimodal RAG systems — CLIP embeddings, image retrieval, table understanding, PDF parsing with vision models, and cross-modal search for complex documents.

Multimodal RAG: When Your Knowledge Base Has More Than Just Text

A pharmaceutical company stores research data in PDFs with molecular structure diagrams, clinical trial tables, and protocol charts. A legal team has contracts with embedded signature pages and exhibit images. An engineering team documents systems with architecture diagrams and flowcharts.

Standard text-only RAG misses all of this. Multimodal RAG extends retrieval to handle images, tables, charts, and mixed-media documents — retrieving visual content alongside text and understanding both when generating answers.

The Multimodal Challenge

Standard document:
    "See Figure 3 for the neural architecture diagram"
    Figure 3: [Complex diagram showing encoder-decoder with attention layers]

Text-only RAG:
    Chunk: "See Figure 3 for the neural architecture diagram" → meaningless without image
    Lost information: the entire architecture diagram

Multimodal RAG:
    Text chunk: "See Figure 3 for the neural architecture diagram" + Figure 3 image
    → Both indexed and retrievable
    → Can answer: "Show me the architecture used in this paper"
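
One way to make the "both indexed and retrievable" idea concrete is to record which figures each text chunk mentions, so that a retrieved chunk can bring its figure along. The following is a minimal sketch, assuming figures are referenced as "Figure N" or "Fig. N" and that a mapping from figure number to image path already exists. The helper name `link_figures` and the record layout are illustrative, not part of any library.

```python
import re

# Matches "Figure 3", "figure 12", "Fig. 4" (assumed reference style)
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)", re.IGNORECASE)

def link_figures(text_chunks: list[str], figures: dict[int, str]) -> list[dict]:
    """Attach the image paths of any referenced figures to each text chunk."""
    linked = []
    for chunk in text_chunks:
        # Collect distinct figure numbers in order of first mention
        numbers: list[int] = []
        for match in FIGURE_REF.finditer(chunk):
            n = int(match.group(1))
            if n not in numbers:
                numbers.append(n)
        linked.append({
            "content": chunk,
            # Skip references to figures we have no image for
            "figure_paths": [figures[n] for n in numbers if n in figures],
        })
    return linked

# Hypothetical input: one chunk that cites a figure, one that does not
chunks = link_figures(
    ["See Figure 3 for the neural architecture diagram", "No figures here."],
    {3: "figures/fig3.png"},
)
```

At query time, any chunk with a non-empty `figure_paths` list can have those images passed to the generation step alongside the text.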

Architecture Options for Multimodal RAG

Option 1: Extract Everything to Text

The simplest approach: use vision models to convert images, tables, and charts to text descriptions, then proceed with standard text RAG:

import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Formats accepted by image content blocks. PDFs are not images: convert
# their pages to images first, or send them as document blocks instead.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def image_to_description(image_path: str) -> str:
    """Use Claude vision to describe an image."""
    ext = Path(image_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    response = client.messages.create(
        model="claude-opus-4-8",  # best vision quality
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": MEDIA_TYPES[ext], "data": image_data},
                },
                {
                    "type": "text",
                    "text": (
                        "Describe this image in detail for a technical knowledge base.\n"
                        "Include:\n"
                        "- What type of visualization this is (diagram, chart, table, etc.)\n"
                        "- All text labels and values visible\n"
                        "- The structure and relationships shown\n"
                        "- Key insights or information conveyed"
                    ),
                },
            ],
        }],
    )
    return response.content[0].text
# Process a mixed document.
# extract_text_chunks() and extract_images_from_doc() are placeholders for your
# own parsing layer; they are not defined here.
def process_document_with_images(doc_path: str) -> list[dict]:
    chunks = []
    # Extract text chunks normally
    text_chunks = extract_text_chunks(doc_path)
    chunks.extend({"type": "text", "content": c} for c in text_chunks)
    # Extract and describe images; each item is (image path, page number, caption)
    images = extract_images_from_doc(doc_path)
    for img_path, page_num, caption in images:
        description = image_to_description(img_path)
        chunks.append({
            "type": "image",
            "content": f"[IMAGE on page {page_num}]\nCaption: {caption}\nContent: {description}",
            "image_path": img_path,  # keep the path so the original image can be shown later
        })
    return chunks

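Pros: Works with any text-only vector store; no special infrastructure needed.
Cons: Descriptions may miss visual nuance, and retrieval matches only against the description, never the image itself. Showing the original image in an answer requires keeping its path with the chunk.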

Option 2: Native Multimodal Embeddings

Use models that embed both images and text in the same vector space, enabling cross-modal retrieval:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

def embed_image(image_path: str) -> list[float]:
    """Embed an image in CLIP's shared image-text space (768 dimensions for this checkpoint)."""
    image = Image.open(image_path).convert("RGB")  # normalize grayscale, RGBA, and palette images
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.squeeze(0).tolist()

def embed_text_for_images(text: str) -> list[float]:
    """Embed text in CLIP's image-aligned space."""
    # CLIP's text encoder accepts at most 77 tokens; truncation=True cuts longer input
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return embedding.squeeze(0).tolist()

# Index image embeddings and CLIP text embeddings (captions, short descriptions)
# in the same vector store.
# Text query → CLIP text embedding → finds similar images AND text

CLIP embeddings enable “find me architecture diagrams similar to this query” directly, without needing to describe images first.
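
To illustrate what cross-modal search does once everything lives in one space, here is a minimal sketch that ranks a mixed index of image and text entries by cosine similarity to a query vector. The three-dimensional vectors and item names are toy placeholders standing in for real CLIP outputs, and `rank_by_similarity` is a hypothetical helper. In practice a vector database performs this step.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec: list[float], items: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top_k items, images and text alike, most similar to the query."""
    scored = [(cosine_similarity(query_vec, item["embedding"]), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"id": item["id"], "kind": item["kind"], "score": score}
        for score, item in scored[:top_k]
    ]

# Toy index: real entries would hold 768-D vectors from the CLIP encoders
index = [
    {"id": "fig3.png", "kind": "image", "embedding": [0.9, 0.1, 0.0]},
    {"id": "intro.txt", "kind": "text", "embedding": [0.1, 0.9, 0.0]},
    {"id": "table2.png", "kind": "image", "embedding": [0.0, 0.2, 0.9]},
]
query = [1.0, 0.0, 0.0]  # stands in for the CLIP embedding of a text query
results = rank_by_similarity(query, index, top_k=2)
```

Because the query and every indexed item share one space, the ranking does not need to know which entries are images and which are text.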

Option 3: ColPali - Document Vision Retrieval

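ColPali (2024) takes a different approach: instead of extracting text from documents and processing it separately, it treats each PDF page as an image and embeds it with a vision-language model. Each page is stored as a set of patch-level vectors rather than a single vector, and queries are matched against those patches with ColBERT-style late interaction. This handles tables, charts, figures, and mixed layouts naturally: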

# ColPali approach: every page is an image
import torch
from colpali_engine.models import ColPali, ColPaliProcessor
from pdf2image import convert_from_path  # needs poppler installed on the system

model = ColPali.from_pretrained("vidore/colpali-v1.2").eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

def index_pdf_as_images(pdf_path: str) -> list[dict]:
    pages = convert_from_path(pdf_path)  # one PIL image per page
    page_embeddings = []
    for page_idx, page_image in enumerate(pages):
        inputs = processor.process_images([page_image]).to(model.device)
        with torch.no_grad():
            embedding = model(**inputs)  # tensor of shape (1, n_patches, dim)
        page_embeddings.append({
            "page": page_idx,
            "embedding": embedding[0].cpu(),  # patch-level embeddings for this page
            "image": page_image,
        })
    return page_embeddings

# Query: text → token-level embeddings (processor.process_queries) → find pages whose patches match the query

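When it was released, ColPali reported the strongest results on ViDoRe, the visual document retrieval benchmark introduced alongside it, which makes it a strong candidate for documents with complex layouts.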
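
ColPali compares a query with a page through late interaction: each query token vector is matched to its best page patch vector, and those maxima are summed (often called MaxSim). The sketch below shows that scoring rule in plain Python with toy two-dimensional vectors. The function names and the data are illustrative only, and the colpali_engine library ships its own batched scorer that should be preferred in practice.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors: list[list[float]], page_vectors: list[list[float]]) -> float:
    """Late-interaction score: best patch match per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)

def rank_pages(query_vectors: list[list[float]], pages: dict[str, list[list[float]]]) -> list[tuple[float, str]]:
    """Score every page against the query and sort from best to worst."""
    scored = [(maxsim_score(query_vectors, vecs), page_id) for page_id, vecs in pages.items()]
    return sorted(scored, reverse=True)

# Toy data: two query token vectors, two pages with two patch vectors each
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_0": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "page_1": [[1.0, 0.0], [0.5, 0.0]],  # matches only the first query token
}
ranking = rank_pages(query_vectors, pages)
```

Because every page keeps many vectors, storage grows with the number of patches per page, which is the main cost to weigh against the retrieval quality.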

Table Extraction and RAG

Tables are particularly tricky because their row-and-column structure carries meaning that flattened text loses:

import pandas as pd
from io import StringIO

def extract_table_as_structured_text(table_html: str) -> str:
    """Convert HTML table to structured text for retrieval."""
    # read_html needs an HTML parser installed (lxml, or bs4 with html5lib)
    df = pd.read_html(StringIO(table_html))[0]
    # Describe table structure (column labels may be non-strings, so cast them)
    header = "Table with columns: " + ", ".join(str(col) for col in df.columns)
    # Convert each row to "column: value" pairs so every value keeps its header
    rows = []
    for _, row in df.iterrows():
        row_desc = " | ".join(f"{col}: {val}" for col, val in row.items())
        rows.append(row_desc)
    return header + "\n" + "\n".join(rows)

# Alternatively, keep table structure for LLM to process directly
def table_to_markdown(table_data: pd.DataFrame) -> str:
    return table_data.to_markdown(index=False)  # requires the tabulate package

For tabular data, chunk by table (not by fixed size), and include column headers in every chunk so the table is interpretable in isolation.
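
When a single table is too long for one chunk, the same advice can be applied by splitting on row boundaries and repeating the headers in each piece. Here is a minimal sketch in plain Python. The function name `chunk_table_rows`, the `rows_per_chunk` parameter, and the sample values are all made up for illustration.

```python
def chunk_table_rows(
    headers: list[str],
    rows: list[list],
    rows_per_chunk: int = 20,
    title: str = "",
) -> list[str]:
    """Split a table into markdown chunks that each repeat the header row."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join("---" for _ in headers) + " |"
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = [
            "| " + " | ".join(str(cell) for cell in row) + " |"
            for row in rows[start:start + rows_per_chunk]
        ]
        # Title (if any) and headers go into every chunk so it reads in isolation
        lines = ([title] if title else []) + [header_line, separator] + body
        chunks.append("\n".join(lines))
    return chunks

# Illustrative placeholder values, not real trial data
table_chunks = chunk_table_rows(
    headers=["Dose (mg)", "Patients", "Response rate"],
    rows=[[10, 40, "12%"], [20, 42, "31%"], [40, 39, "47%"]],
    rows_per_chunk=2,
    title="Table: response by dose",
)
```

Splitting only between rows keeps each record intact, and a small table still ends up as a single chunk.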

Multimodal Generation: Sending Images to the LLM

When retrieved content includes images, pass them directly to the generation LLM:

import base64
import mimetypes

def multimodal_generate(
    query: str,
    text_chunks: list[str],
    image_paths: list[str],
) -> str:
    # Uses the anthropic client created earlier
    content = [{"type": "text", "text": f"Question: {query}\n\nContext text:\n" + "\n\n".join(text_chunks)}]
    for img_path in image_paths[:3]:  # limit to 3 images
        # Detect the media type from the file extension instead of assuming PNG
        media_type = mimetypes.guess_type(img_path)[0] or "image/png"
        with open(img_path, "rb") as f:
            img_data = base64.standard_b64encode(f.read()).decode("utf-8")
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": img_data},
        })
    content.append({"type": "text", "text": "Answer the question using all provided context, including any images:"})
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text

Multimodal Vector Store Setup with Qdrant

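from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Assumed local instance; adjust the URL for your deployment
qdrant = QdrantClient(url="http://localhost:6333")

# Collection with two named vectors: text (1536D) and image (768D)
qdrant.create_collection(
    collection_name="multimodal_docs",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),
        "image": VectorParams(size=768, distance=Distance.COSINE),
    },
)

# Upsert with both vector types.
# text_embedding and image_embedding are computed beforehand.
qdrant.upsert(
    collection_name="multimodal_docs",
    points=[
        PointStruct(
            id=1,
            vector={
                "text": text_embedding,    # from OpenAI
                "image": image_embedding,  # from CLIP
            },
            payload={"type": "image", "path": "/docs/fig3.png", "caption": "..."},
        )
    ],
)

# Query the "text" named vector; query_embedding must come from the same model
# as the stored text vectors. The two named vectors are separate spaces: to
# match against image vectors, embed the query with CLIP's text encoder and
# search using="image" instead.
# (Older qdrant-client versions expose this as search(query_vector=("text", ...)).)
results = qdrant.query_points(
    collection_name="multimodal_docs",
    query=query_embedding,
    using="text",
    limit=10,
).points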
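
Because the "text" and "image" vectors come from different models (OpenAI and CLIP in the comments above), similarity scores from one named vector are not directly comparable with scores from the other. One common way to combine two result lists without comparing raw scores is reciprocal rank fusion (RRF), which is not something the setup above prescribes. The sketch below assumes you have already run one search per named vector and collected the point IDs in rank order. The ID lists are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists: list[list], k: int = 60) -> list[tuple]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict = {}
    for ranked in ranked_lists:
        for rank, point_id in enumerate(ranked, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))

# Hypothetical point IDs returned by two separate searches
text_hits = [7, 1, 4]   # from the "text" named vector
image_hits = [1, 9, 7]  # from the "image" named vector (CLIP text query)
fused = reciprocal_rank_fusion([text_hits, image_hits])
fused_ids = [point_id for point_id, _ in fused]
```

Points that rank well in both lists rise to the top, while a point found by only one search still appears in the fused list. The constant `k=60` is a conventional default and can be tuned.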

2025 Trend: Native Multimodal Embeddings

OpenAI’s upcoming embedding models, Google’s multimodal embeddings (Vertex AI), and Cohere’s multimodal embed are moving towards single models that handle text, images, and audio in a unified embedding space. This eliminates the need for separate CLIP embeddings and text embeddings — one model, one vector space, all modalities retrievable together.

Multimodal RAG is no longer a niche capability — it’s becoming table stakes for any RAG system that handles real-world enterprise documents. Starting with Option 1 (describe images as text) is the pragmatic path; graduating to native multimodal embeddings is the direction the field is moving.