Multimodal RAG: When Your Knowledge Base Has More Than Just Text

A pharmaceutical company stores research data in PDFs with molecular structure diagrams, clinical trial tables, and protocol charts. A legal team has contracts with embedded signature pages and exhibit images. An engineering team documents systems with architecture diagrams and flowcharts.

Standard text-only RAG misses all of this. Multimodal RAG extends retrieval to handle images, tables, charts, and mixed-media documents — retrieving visual content alongside text and understanding both when generating answers.

The Multimodal Challenge

Standard document:
"See Figure 3 for the neural architecture diagram"
↓
Figure 3: [Complex diagram showing encoder-decoder with attention layers]

Text-only RAG:
  Chunk: "See Figure 3 for the neural architecture diagram" → meaningless without image
  Lost information: the entire architecture diagram

Multimodal RAG:
  Text chunk: "See Figure 3 for the neural architecture diagram" + Figure 3 image
  → Both indexed and retrievable
  → Can answer: "Show me the architecture used in this paper"
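
One way to keep a text chunk and the figure it mentions together is to scan each chunk for figure references at indexing time and attach the matching image paths. The sketch below is illustrative only: it assumes references look like "Figure 3" or "Fig. 3", and that a hypothetical `figure_paths` mapping from figure number to image file was built during extraction.

```python
import re

# Matches "Figure 3", "Fig. 3", "fig 12" (an assumed reference style).
FIGURE_REF = re.compile(r"\bfig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)

def attach_figures(chunk: str, figure_paths: dict[int, str]) -> dict:
    """Return the chunk together with the paths of every figure it references."""
    numbers = sorted({int(n) for n in FIGURE_REF.findall(chunk)})
    return {
        "content": chunk,
        # Skip references whose image was not extracted.
        "figures": [figure_paths[n] for n in numbers if n in figure_paths],
    }

record = attach_figures(
    "See Figure 3 for the neural architecture diagram",
    {3: "figures/fig3.png"},  # hypothetical extraction output
)
```

With the link stored on the chunk, a hit on the text can bring the image along to the generation step.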

Architecture Options for Multimodal RAG

Option 1: Extract Everything to Text

The simplest approach: use vision models to convert images, tables, and charts to text descriptions, then proceed with standard text RAG:

import anthropic
from pathlib import Path
import base64

client = anthropic.Anthropic()

def image_to_description(image_path: str) -> str:
    """Use Claude vision to describe an image."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

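    ext = Path(image_path).suffix.lower()
    # Image content blocks accept JPEG, PNG, GIF, and WebP.
    # PDFs are not images; render their pages to images before calling this.
    media_type = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp",
    }
    if ext not in media_type:
        raise ValueError(f"Unsupported image type: {ext}")
    img_type = media_type[ext]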

    response = client.messages.create(
        model="claude-opus-4-8",  # substitute a vision-capable model ID available to your account
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": img_type, "data": image_data},
                },
                {
                    "type": "text",
                    "text": """Describe this image in detail for a technical knowledge base.
Include:
- What type of visualization this is (diagram, chart, table, etc.)
- All text labels and values visible
- The structure and relationships shown
- Key insights or information conveyed"""
                }
            ],
        }]
    )
    return response.content[0].text

# Process a mixed document
def process_document_with_images(doc_path: str) -> list[dict]:
    chunks = []

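    # Extract text chunks normally.
    # extract_text_chunks and extract_images_from_doc are placeholders
    # for your own document parsing code.
    text_chunks = extract_text_chunks(doc_path)
    chunks.extend([{"type": "text", "content": c} for c in text_chunks])

    # Extract and describe images; each item is (image_path, page_num, caption)
    images = extract_images_from_doc(doc_path)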
    for img_path, page_num, caption in images:
        description = image_to_description(img_path)
        chunks.append({
            "type": "image",
            "content": f"[IMAGE on page {page_num}]\nCaption: {caption}\nContent: {description}",
            "image_path": img_path,
        })

    return chunks

Pros: Works with any text-only vector store; no special infrastructure needed. Cons: Descriptions may miss visual detail, and retrieval and generation see only the description, so the original image appears in answers only if you carry image_path through yourself.

Option 2: Native Multimodal Embeddings

Use models that embed both images and text in the same vector space, enabling cross-modal retrieval:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    image = Image.open(image_path).convert("RGB")  # normalize RGBA/palette/grayscale files to 3 channels
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.squeeze().numpy().tolist()

def embed_text_for_images(text: str) -> list[float]:
    """Embed text in CLIP's image-aligned space (input is truncated to CLIP's 77-token limit)."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return embedding.squeeze().numpy().tolist()

# Index CLIP image embeddings, plus CLIP text embeddings of short descriptions, in the same vector store
# Text query → CLIP text embedding → finds similar images, and any text that was also embedded with CLIP

CLIP embeddings enable “find me architecture diagrams similar to this query” directly, without needing to describe images first.
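
Because both encoders write into one space, cross-modal retrieval reduces to a nearest-neighbour search by cosine similarity. The following is a minimal sketch of that ranking step in plain Python, assuming the vectors come from `embed_text_for_images` and `embed_image` above; the 3-D toy vectors and the file names are made up to keep the example readable.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_images(query_vec: list[float], image_index: dict[str, list[float]], top_k: int = 3):
    """Rank indexed images by similarity to a text query embedding."""
    scored = [(path, cosine(query_vec, vec)) for path, vec in image_index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy 3-D vectors stand in for 768-D CLIP embeddings.
toy_index = {"fig3.png": [0.9, 0.1, 0.0], "chart1.png": [0.0, 1.0, 0.0]}
ranked = rank_images([1.0, 0.0, 0.0], toy_index)
```

In production a vector store performs this search; the brute-force loop only shows what is being computed.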

Option 3: ColPali - Document Vision Retrieval

ColPali (2024) takes a different approach: instead of extracting text from documents and processing it separately, it treats each PDF page as an image and indexes a set of patch-level embeddings per page, which are matched against the query's token embeddings at search time. This handles tables, charts, figures, and mixed layouts without a separate extraction step:

# ColPali approach: every page is an image
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali-v1.2").eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

def index_pdf_as_images(pdf_path: str) -> list[dict]:
    # convert_pdf_to_images is a placeholder that returns one PIL image per page
    # (pdf2image's convert_from_path is one option).
    pages = convert_pdf_to_images(pdf_path)
    page_embeddings = []

    for page_idx, page_image in enumerate(pages):
        inputs = processor.process_images([page_image]).to(model.device)
        with torch.no_grad():
            # colpali_engine models return the embedding tensor directly,
            # shaped (batch, n_tokens, dim): one vector per patch token
            embedding = model(**inputs)
        page_embeddings.append({
            "page": page_idx,
            "embedding": embedding[0].cpu(),  # (n_tokens, dim) for this page
            "image": page_image,
        })

    return page_embeddings

# Query: text → token embeddings (processor.process_queries) → pages ranked by late-interaction score

At release, ColPali reported stronger results than text-extraction pipelines on the ViDoRe document retrieval benchmark, especially for pages with complex layouts.
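
ColPali ranks pages with ColBERT-style late interaction: each query token takes its best-matching page patch, and those maxima are summed. The sketch below shows that scoring rule in plain Python on made-up 2-D vectors. It is an illustration of the idea, not the library's implementation, which ships its own batched scoring.

```python
def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(
        max(sum(q * p for q, p in zip(q_vec, p_vec)) for p_vec in page_vecs)
        for q_vec in query_vecs
    )

# Toy data: 2 query tokens, pages with 3 and 2 patches.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for each token
page_b = [[0.5, 0.5], [0.5, 0.5]]              # only partial matches

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)
```

Keeping one vector per patch is what lets a query term match a small region of a page, such as a single table cell or axis label, at the cost of storing many vectors per page.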

Table Extraction and RAG

Tables are particularly tricky because row and column structure carries meaning that flattened text loses:

import pandas as pd
from io import StringIO

def extract_table_as_structured_text(table_html: str) -> str:
    """Convert HTML table to structured text for retrieval."""
    # read_html needs an HTML parser installed (lxml, or bs4 with html5lib)
    df = pd.read_html(StringIO(table_html))[0]

    # Describe table structure (column labels can be non-strings, e.g. integers)
    header = "Table with columns: " + ", ".join(str(c) for c in df.columns)

    # Convert each row to "column: value" pairs
    rows = []
    for _, row in df.iterrows():
        row_desc = " | ".join([f"{col}: {val}" for col, val in row.items()])
        rows.append(row_desc)

    return header + "\n" + "\n".join(rows)

# Alternatively, keep table structure for LLM to process directly
# (DataFrame.to_markdown requires the optional `tabulate` package)
def table_to_markdown(table_data: pd.DataFrame) -> str:
    return table_data.to_markdown(index=False)

For tabular data, chunk by table (not by fixed size), and include column headers in every chunk so the table is interpretable in isolation.
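
For tables too long to fit in one chunk, the same rule can be applied by splitting on row boundaries and repeating the header in each piece. A minimal sketch, assuming the table has already been parsed into a header list and row lists; the function name, `rows_per_chunk` value, and toy values are illustrative.

```python
def chunk_table(headers: list[str], rows: list[list], rows_per_chunk: int = 20,
                title: str = "") -> list[str]:
    """Split a table on row boundaries, repeating title and header in every chunk."""
    header_line = " | ".join(headers)
    prefix = [title] if title else []
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = [
            " | ".join(str(cell) for cell in row)
            for row in rows[start:start + rows_per_chunk]
        ]
        chunks.append("\n".join(prefix + [header_line] + body))
    return chunks

# Toy table with made-up values, split two rows at a time.
chunks = chunk_table(
    ["col_a", "col_b"],
    [[10, 42], [20, 57], [40, 63]],
    rows_per_chunk=2,
    title="Table 2",
)
```

Each chunk can then be embedded independently, and a retrieved chunk still tells the LLM what every column means.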

Multimodal Generation: Sending Images to the LLM

When retrieved content includes images, pass them directly to the generation LLM:

def multimodal_generate(
    query: str,
    text_chunks: list[str],
    image_paths: list[str],
) -> str:
    content = [{"type": "text", "text": f"Question: {query}\n\nContext text:\n" + "\n\n".join(text_chunks)}]

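    # Declare each image's real type instead of assuming PNG
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png",
                   ".gif": "image/gif", ".webp": "image/webp"}

    for img_path in image_paths[:3]:  # limit to 3 images to bound request size and cost
        with open(img_path, "rb") as f:
            img_data = base64.standard_b64encode(f.read()).decode("utf-8")
        media_type = media_types.get(Path(img_path).suffix.lower(), "image/png")
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": img_data}
        })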

    content.append({"type": "text", "text": "Answer the question using all provided context, including any images:"})

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

Multimodal Vector Store Setup with Qdrant

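from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Use a separate name so this does not shadow the Anthropic `client` above.
# The URL assumes a local Qdrant instance; adjust for your deployment.
qdrant = QdrantClient(url="http://localhost:6333")

# Collection with two named vectors: text (1536D) and image (768D)
qdrant.create_collection(
    collection_name="multimodal_docs",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),
        "image": VectorParams(size=768, distance=Distance.COSINE),
    }
)

# Upsert with both vector types (embeddings are assumed to be precomputed)
qdrant.upsert(
    collection_name="multimodal_docs",
    points=[
        PointStruct(
            id=1,
            vector={
                "text": text_embedding,    # from OpenAI
                "image": image_embedding,  # from CLIP
            },
            payload={"type": "image", "path": "/docs/fig3.png", "caption": "..."},
        )
    ]
)

# Query the "text" named vector. Each named vector is its own space, so this
# compares only against text embeddings; to match on visual content, query
# using="image" with a CLIP text embedding of the query instead.
# (query_points replaces the deprecated search method in recent qdrant-client releases.)
results = qdrant.query_points(
    collection_name="multimodal_docs",
    query=query_embedding,
    using="text",
    limit=10,
).points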
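
Scores from the 1536-D text space and the 768-D image space are not directly comparable, so results from the two named vectors need to be merged by rank rather than by raw score. One common way to do this, not prescribed by this article, is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration: the point IDs are made up, and `k=60` is the conventional default constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical point IDs returned by the "text" and "image" searches, best first.
text_hits = [1, 7, 3]
image_hits = [7, 9, 1]
fused = reciprocal_rank_fusion([text_hits, image_hits])
```

Points that rank well in both spaces rise to the top, while a point found in only one space is still kept.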

2025 Trend: Native Multimodal Embeddings

OpenAI’s upcoming embedding models, Google’s multimodal embeddings (Vertex AI), and Cohere’s multimodal Embed models are moving towards a single model that places text, images, and in some cases audio or video in one embedding space. Where that holds, you no longer need separate CLIP embeddings and text embeddings: one model, one vector space, all modalities retrievable together.

Multimodal RAG is no longer a niche capability — it’s becoming table stakes for any RAG system that handles real-world enterprise documents. Starting with Option 1 (describe images as text) is the pragmatic path; graduating to native multimodal embeddings is the direction the field is moving.