Multimodal RAG: When Your Knowledge Base Has More Than Just Text
A pharmaceutical company stores research data in PDFs with molecular structure diagrams, clinical trial tables, and protocol charts. A legal team has contracts with embedded signature pages and exhibit images. An engineering team documents systems with architecture diagrams and flowcharts.
Standard text-only RAG misses all of this. Multimodal RAG extends retrieval to handle images, tables, charts, and mixed-media documents — retrieving visual content alongside text and understanding both when generating answers.
The Multimodal Challenge
Standard document:
"See Figure 3 for the neural architecture diagram"
↓
Figure 3: [Complex diagram showing encoder-decoder with attention layers]

Text-only RAG:
Chunk: "See Figure 3 for the neural architecture diagram" → meaningless without image
Lost information: the entire architecture diagram

Multimodal RAG:
Text chunk: "See Figure 3 for the neural architecture diagram" + Figure 3 image
→ Both indexed and retrievable
→ Can answer: "Show me the architecture used in this paper"

Architecture Options for Multimodal RAG
Option 1: Extract Everything to Text
The simplest approach: use vision models to convert images, tables, and charts to text descriptions, then proceed with standard text RAG:
import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# Formats accepted in base64 image blocks. PDFs are not images: render
# their pages to PNG first, or send them as document blocks instead.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def image_to_description(image_path: str) -> str:
    """Use Claude vision to describe an image."""
    ext = Path(image_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-opus-4-8",  # best vision quality
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": MEDIA_TYPES[ext],
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": """Describe this image in detail for a technical knowledge base.
Include:
- What type of visualization this is (diagram, chart, table, etc.)
- All text labels and values visible
- The structure and relationships shown
- Key insights or information conveyed""",
                },
            ],
        }],
    )
    return response.content[0].text
# Process a mixed document.
# extract_text_chunks and extract_images_from_doc stand in for your own
# document parser; they are not defined here.
def process_document_with_images(doc_path: str) -> list[dict]:
    chunks = []

    # Extract text chunks normally
    text_chunks = extract_text_chunks(doc_path)
    chunks.extend([{"type": "text", "content": c} for c in text_chunks])

    # Extract and describe images
    images = extract_images_from_doc(doc_path)
    for img_path, page_num, caption in images:
        description = image_to_description(img_path)
        chunks.append({
            "type": "image",
            "content": f"[IMAGE on page {page_num}]\nCaption: {caption}\nContent: {description}",
            "image_path": img_path,
        })

    return chunks

Pros: Works with any text-only vector store, no special infrastructure needed.
Cons: Descriptions may not capture all nuance; can’t display original image in answers.
Option 2: Native Multimodal Embeddings
Use models that embed both images and text in the same vector space, enabling cross-modal retrieval:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    # Convert to RGB so palette, grayscale, and RGBA files are handled uniformly
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.squeeze().numpy().tolist()

def embed_text_for_images(text: str) -> list[float]:
    """Embed text in CLIP's image-aligned space."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return embedding.squeeze().numpy().tolist()

# Index both text descriptions and image embeddings in same vector store
# Text query → CLIP text embedding → finds similar images AND text

CLIP embeddings enable “find me architecture diagrams similar to this query” directly, without needing to describe images first.
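To make the cross-modal lookup concrete, here is a minimal sketch of the ranking step, assuming the image and text embeddings have already been computed and share one vector space. The names `cosine_similarity` and `rank_by_similarity` and the 3-dimensional toy vectors are illustrative only; in practice a vector store performs this search at scale.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float],
    indexed: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (item_id, score) pairs, highest similarity first."""
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real CLIP embeddings (hypothetical values)
indexed = {
    "fig3.png": [0.9, 0.1, 0.0],
    "table2.png": [0.0, 1.0, 0.0],
    "intro_text": [0.5, 0.5, 0.0],
}
for item_id, score in rank_by_similarity([1.0, 0.0, 0.0], indexed):
    print(f"{item_id}: {score:.3f}")
```

Because images and text sit in the same space, a single ranking pass can return both kinds of item without a separate description step.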
Option 3: ColPali - Document Vision Retrieval
ColPali (2024) takes a fresh approach: instead of extracting text from documents and processing it separately, it treats each PDF page as an image and indexes it as a set of patch-level visual embeddings (one vector per image patch). This handles tables, charts, figures, and mixed layouts naturally:
# ColPali approach: every page is an image
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali-v1.2").eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

def index_pdf_as_images(pdf_path: str) -> list[dict]:
    # convert_pdf_to_images is a placeholder for your PDF renderer
    # (for example, pdf2image's convert_from_path); one PIL image per page
    pages = convert_pdf_to_images(pdf_path)
    page_embeddings = []

    for page_idx, page_image in enumerate(pages):
        inputs = processor.process_images([page_image]).to(model.device)
        with torch.no_grad():
            # The forward pass returns the patch-level embeddings directly,
            # shaped (batch, num_patches, dim); take the single page in the batch
            embedding = model(**inputs)[0]
        page_embeddings.append({
            "page": page_idx,
            "embedding": embedding,
            "image": page_image,
        })

    return page_embeddings

# Query: text → visual embedding → find pages where content matches query

On the ViDoRe benchmark introduced by its authors, ColPali outperformed text-extraction pipelines on document retrieval tasks involving complex layouts.
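ColPali compares a query to a page with late interaction: each query-token vector is matched to its most similar page-patch vector, and those maxima are summed into one page score. The sketch below illustrates that scoring rule in plain Python, assuming embeddings are available as lists of floats. The names `maxsim_score` and `rank_pages` and the 2-dimensional toy vectors are hypothetical; the library provides its own batched scoring utilities, which this sketch does not replace.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: list[list[float]], page_vectors: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the best-matching patch."""
    if not page_vectors:
        return 0.0
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(
    query_vectors: list[list[float]],
    pages: dict[str, list[list[float]]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score every page against the query and return the top_k, best first."""
    scored = [(page_id, maxsim_score(query_vectors, vecs)) for page_id, vecs in pages.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: two query tokens, two pages with two patches each (hypothetical values)
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_1": [[1.0, 0.0], [0.0, 1.0]],  # has a good match for both query tokens
    "page_2": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
}
print(rank_pages(query, pages))
```

Scoring at the patch level is what lets a query term match a small region of a page, such as one cell in a table, rather than an average over the whole page.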
Table Extraction and RAG
Tables are particularly tricky — structure carries meaning that text doesn’t capture:
from io import StringIO

import pandas as pd

def extract_table_as_structured_text(table_html: str) -> str:
    """Convert HTML table to structured text for retrieval."""
    # read_html needs an HTML parser such as lxml installed
    df = pd.read_html(StringIO(table_html))[0]

    # Describe table structure (str() guards against non-string column labels)
    header = "Table with columns: " + ", ".join(str(c) for c in df.columns)

    # Convert each row to a natural language description
    rows = []
    for _, row in df.iterrows():
        row_desc = " | ".join([f"{col}: {val}" for col, val in row.items()])
        rows.append(row_desc)

    return header + "\n" + "\n".join(rows)

# Alternatively, keep table structure for LLM to process directly
# (to_markdown requires the tabulate package)
def table_to_markdown(table_data: pd.DataFrame) -> str:
    return table_data.to_markdown(index=False)

For tabular data, chunk by table (not by fixed size), and include column headers in every chunk so the table is interpretable in isolation.
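The header rule matters most when a single table is too long for one chunk and has to be split. A minimal sketch of one way to do that, assuming the table is already parsed into a list of column names and a list of rows: split on row boundaries and repeat the header line in every chunk. The function name `chunk_table_rows` and the `rows_per_chunk` parameter are illustrative, not part of any library.

```python
def chunk_table_rows(
    columns: list[str],
    rows: list[list[object]],
    rows_per_chunk: int = 20,
    title: str = "",
) -> list[str]:
    """Split one table into row groups, repeating the column header in each chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")

    header = " | ".join(str(c) for c in columns)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = []
        if title:
            # Record which slice of the table this chunk covers (1-based, inclusive)
            lines.append(f"Table: {title} (rows {start + 1}-{start + len(group)})")
        lines.append(header)
        lines.extend(" | ".join(str(v) for v in row) for row in group)
        chunks.append("\n".join(lines))
    return chunks


# Toy table with placeholder values
columns = ["Item", "Value"]
rows = [["A", 1], ["B", 2], ["C", 3], ["D", 4], ["E", 5]]
for chunk in chunk_table_rows(columns, rows, rows_per_chunk=2, title="Example"):
    print(chunk)
    print("---")
```

Small tables still come out as a single chunk, so the split only applies when a table exceeds the row limit.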
Multimodal Generation: Sending Images to the LLM
When retrieved content includes images, pass them directly to the generation LLM:
import mimetypes

# Reuses `client` and `base64` from the Option 1 example
def multimodal_generate(
    query: str,
    text_chunks: list[str],
    image_paths: list[str],
) -> str:
    content = [{
        "type": "text",
        "text": f"Question: {query}\n\nContext text:\n" + "\n\n".join(text_chunks),
    }]

    for img_path in image_paths[:3]:  # limit to 3 images
        with open(img_path, "rb") as f:
            img_data = base64.standard_b64encode(f.read()).decode("utf-8")
        # Infer the media type from the file extension; fall back to PNG if unknown
        media_type = mimetypes.guess_type(img_path)[0] or "image/png"
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": img_data},
        })

    content.append({
        "type": "text",
        "text": "Answer the question using all provided context, including any images:",
    })

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text

Multimodal Vector Store Setup with Qdrant
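from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Assumes a Qdrant instance running locally; adjust the URL for your deployment.
# Named `qdrant` to avoid clashing with the Anthropic `client` used earlier.
qdrant = QdrantClient(url="http://localhost:6333")

# Collection supporting both text (1536D) and image (768D) vectors
qdrant.create_collection(
    collection_name="multimodal_docs",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),
        "image": VectorParams(size=768, distance=Distance.COSINE),
    },
)

# Upsert with both vector types
# (text_embedding and image_embedding are computed beforehand)
qdrant.upsert(
    collection_name="multimodal_docs",
    points=[
        PointStruct(
            id=1,
            vector={
                "text": text_embedding,    # from OpenAI
                "image": image_embedding,  # from CLIP
            },
            payload={"type": "image", "path": "/docs/fig3.png", "caption": "..."},
        )
    ],
)

# Query the "text" named vector. This compares the query only against the
# stored text embeddings; the two named vectors are separate spaces. To match
# on visual content, embed the query with CLIP's text encoder and search "image".
results = qdrant.search(
    collection_name="multimodal_docs",
    query_vector=("text", query_embedding),
    limit=10,
)

2025 Trend: Native Multimodal Embeddings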
OpenAI’s upcoming embedding models, Google’s multimodal embeddings (Vertex AI), and Cohere’s multimodal embed are moving towards single models that handle text, images, and audio in a unified embedding space. This eliminates the need for separate CLIP embeddings and text embeddings — one model, one vector space, all modalities retrievable together.
Multimodal RAG is no longer a niche capability — it’s becoming table stakes for any RAG system that handles real-world enterprise documents. Starting with Option 1 (describe images as text) is the pragmatic path; graduating to native multimodal embeddings is the direction the field is moving.