Document Ingestion for RAG: Building Robust Pipelines

Document ingestion is where RAG systems meet messy reality. You have documents in different formats (PDFs, Word, HTML, images), at different quality levels, with inconsistent metadata. Ingestion pipelines transform this chaos into clean, searchable content.

The Document Ingestion Pipeline

Raw Documents → Parsing → Cleaning → Metadata → Chunking → Embedding → Storage

Each step matters. Skip quality checks early and you’ll waste resources later computing embeddings for garbage.

File Format Handling

Plain Text and Markdown

Simplest case: already clean, human-readable structure.

Challenges:

Inconsistent encoding (UTF-8, Latin-1, others)
Embedded metadata not easily extracted
Special characters and control codes

Approach: Detect and validate the encoding when reading, and preserve formatting.
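One way to validate encoding is to try a short list of candidate encodings in order and record which one succeeded. The sketch below is a minimal illustration using only the standard library; the function names and the candidate list are assumptions, not a recommendation for every corpus.

```python
from pathlib import Path

# Candidate encodings, tried in order. This list is an assumption;
# adjust it to the sources you actually ingest.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")

def decode_bytes(raw):
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so this is only reached if the list changes.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"

def read_text_file(path):
    """Read a file as bytes and decode it with the fallback chain above."""
    return decode_bytes(Path(path).read_bytes())
```

Storing the encoding that was used alongside the text makes it easier to audit documents that fell through to the last-resort decoder, since a successful latin-1 decode does not prove the guess was right.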

PDF Documents

The most common format companies provide, and the trickiest to parse correctly.

Challenges:

PDFs can be scanned images, searchable text, or both
Layout preservation affects meaning (headers, tables, footnotes)
Inconsistent text extraction quality
Security: watermarks, encryption, permissions

Tools:

pypdf (formerly PyPDF2) - Simple, reliable for well-structured PDFs
pdfplumber - Excellent for tables and layout-aware extraction
pdfminer.six - Maintained fork of PDFMiner, good at preserving structure
OCR (Tesseract, PaddleOCR) - For scanned documents

Strategy: Extract text, preserve document structure, flag images for separate processing.
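Flagging pages for separate processing can start with a simple heuristic: if a page yields almost no extracted text, it is probably a scanned image. The sketch below assumes `page_texts` is a list of per-page strings already produced by whichever PDF library you use; the function name and the `min_chars` threshold are illustrative assumptions.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is too short to trust."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters; scanned pages often yield
        # an empty string, None, or a few stray characters.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged
```

A character count will miss pages that mix a text layer with scanned figures, so treat it as a first filter, not a complete classifier.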

Microsoft Office Documents

.docx, .xlsx, and .pptx files often contain embedded images, tables, and metadata.

Tools:

python-docx - Extract text from Word documents
openpyxl, pandas - Handle Excel files
python-pptx - Parse PowerPoint slides

Strategy: Extract all content types, preserve formatting hierarchy, handle embedded objects.
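To show what text extraction from Word involves underneath, here is a minimal sketch that reads paragraph text from a .docx file using only the standard library instead of python-docx. A .docx file is a ZIP archive whose main body lives in word/document.xml. The function name is an assumption, and the sketch ignores headers, footers, comments, and embedded objects.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Yield the text of each non-empty paragraph in a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every w:t node in it.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text.strip():
            yield text
```

Paragraphs inside tables are returned in document order with no table structure, and text boxes nested inside a paragraph can appear twice. For production use, a dedicated library handles far more of these cases.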

Web Content (HTML, XML)

Websites, API responses, and XML feeds need careful parsing.

Challenges:

Boilerplate content (navigation, ads, scripts)
Inconsistent DOM structure across sites
Dynamic content requiring JavaScript execution

Tools:

BeautifulSoup - Parse and extract from HTML
lxml - Fast XML parsing
Selenium - Handle JavaScript-heavy pages
Playwright - Modern browser automation

Strategy: Extract main content, remove boilerplate, preserve semantic structure.
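Boilerplate removal can be sketched with the standard library's html.parser as a stand-in for BeautifulSoup. The example below drops text inside tags that usually hold navigation, scripts, or page chrome. The class name, function name, and the `SKIP_TAGS` set are assumptions for illustration; real sites often need per-source rules.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate. This set is an assumption.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside any boilerplate tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside boilerplate tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)

def extract_main_text(html):
    """Return visible text from html, one text node per line, minus boilerplate."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because each text node becomes its own line, inline markup such as bold or links splits a sentence across lines. A fuller implementation would join text within block-level elements before emitting it.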

Images and Scanned Documents

Documents stored as images need OCR (Optical Character Recognition).

Tools:

Tesseract - Open-source, reliable
PaddleOCR - Fast, multilingual
Cloud APIs - Google Cloud Vision, Azure Computer Vision

Challenge: OCR accuracy varies; poor quality originals yield poor results.

Strategy: Use OCR quality assessment, flag uncertain extractions for review.
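A quality assessment can be as simple as summarizing per-word confidence scores and flagging documents that fall below a threshold. The sketch below assumes the OCR engine reports a confidence per word on a 0-100 scale; the function name and all thresholds are illustrative assumptions that should be tuned on your own data.

```python
from statistics import mean

def assess_ocr(word_confidences, min_mean=80.0, low_cutoff=60.0, max_low_ratio=0.2):
    """Summarize per-word OCR confidences and decide whether to flag for review."""
    if not word_confidences:
        # No recognized words at all is itself a reason for review.
        return {"mean": 0.0, "low_ratio": 1.0, "needs_review": True}
    avg = mean(word_confidences)
    # Share of words the engine was unsure about
    low_ratio = sum(c < low_cutoff for c in word_confidences) / len(word_confidences)
    return {
        "mean": avg,
        "low_ratio": low_ratio,
        "needs_review": avg < min_mean or low_ratio > max_low_ratio,
    }
```

The resulting mean can also be stored as document metadata so retrieval can down-weight low-confidence text.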

Text Cleaning and Normalization

After extraction, text needs cleaning:

Remove noise:

Extra whitespace and line breaks
Invisible characters and encoding artifacts
HTML tags and XML markup
Boilerplate text

Normalize:

Standardize newlines (all \n)
Fix common encoding errors
Standardize dates and numbers
Consistent capitalization for special terms

Example Python approach:

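import re
import unicodedata

def clean_text(text):
    # Remove control and format characters, but keep whitespace such as \n and \t
    # (they are category "Cc" too, and dropping them would fuse adjacent words)
    text = ''.join(ch for ch in text
                   if ch.isspace() or unicodedata.category(ch)[0] != 'C')
    # Collapse all runs of whitespace, including line breaks, into one space
    text = re.sub(r'\s+', ' ', text)
    # Replace URLs with a placeholder if needed
    text = re.sub(r'https?://\S+', '[URL]', text)
    return text.strip()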
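The clean_text function above discards line breaks, which also discards paragraph boundaries that chunking may rely on. As an alternative, the sketch below standardizes newlines while keeping paragraph breaks. The function name is an assumption, and NFKC normalization is one possible choice for fixing compatibility characters such as ligatures.

```python
import re
import unicodedata

def normalize_preserving_paragraphs(text):
    # Canonicalize compatibility forms (for example, the "ﬁ" ligature becomes "fi")
    text = unicodedata.normalize("NFKC", text)
    # Standardize Windows and old Mac line endings to \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs within a line
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces on either side of a line break
    text = re.sub(r" ?\n ?", "\n", text)
    # Treat three or more newlines as a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

NFKC also rewrites characters such as superscript digits into plain ones, so skip that step for content where those distinctions carry meaning.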

Metadata Extraction and Assignment

Metadata supplies context and supports filtering and ranking during retrieval. Extract or assign:

Source: File name, URL, database ID
Date: Creation, modification, publication date
Author: Who created/updated the document
Type: Document category, format, level (executive summary vs. detailed)
Language: Detected language for multilingual systems
Version: Track document versions
Confidence: OCR confidence, automated extraction quality scores
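The fields above can be carried through the pipeline as one typed record. The sketch below is an illustration only: the class, field names, and helper function are assumptions, and it derives just the values that the filesystem can provide, leaving the rest to be supplied by the caller.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

@dataclass
class DocumentMetadata:
    source: str
    doc_type: str
    modified_at: Optional[str] = None  # ISO 8601 timestamp in UTC
    author: Optional[str] = None
    language: Optional[str] = None
    version: Optional[str] = None
    extraction_confidence: Optional[float] = None

def metadata_from_path(path, **overrides):
    """Build metadata from a file path, then apply caller-supplied fields."""
    p = Path(path)
    modified = None
    if p.exists():
        mtime = p.stat().st_mtime
        modified = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    base = DocumentMetadata(
        source=str(p),
        doc_type=p.suffix.lstrip(".").lower() or "unknown",
        modified_at=modified,
    )
    # replace() raises TypeError for unknown field names, which catches typos early.
    return replace(base, **overrides)
```

Attaching the same record to every chunk produced from a document is what makes source citation possible at query time.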

Handling Special Content

Tables

Tables contain structured information that embeddings struggle with. Options:

Convert to prose: “Column A, Row B contains value X”
Preserve structure: Keep table formatting in extracted text
Separate indexing: Index tables differently from narrative text
Vector approach: Embed table metadata separately from content
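The convert-to-prose option can be sketched as one sentence per row, pairing each header with its cell value. The function name and the sentence template are assumptions; the right phrasing depends on the domain.

```python
def table_to_sentences(headers, rows, key_column=0):
    """Render each row as one sentence: '<key>: <header> is <value>; ...'."""
    sentences = []
    for row in rows:
        key = row[key_column]
        facts = [
            f"{header} is {value}"
            for i, (header, value) in enumerate(zip(headers, row))
            # Skip the key column itself and any empty cells.
            if i != key_column and value not in ("", None)
        ]
        if facts:
            sentences.append(f"{key}: " + "; ".join(facts) + ".")
    return sentences
```

Repeating the header in every sentence is deliberate: each sentence then stays meaningful even if chunking separates it from the rest of the table.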

Lists and Hierarchies

Nested lists, outlines, and hierarchical information need careful handling.

Approach: Preserve hierarchy in extracted text:

Document Title
├─ Section 1
│  ├─ Subsection 1.1
│  └─ Subsection 1.2
└─ Section 2
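One way to preserve a hierarchy like this is to attach the path of enclosing headings to every piece of body text. The sketch below assumes the parser has already produced (level, text) pairs, with level 0 for body text and 1 and up for headings; that input shape and the function name are assumptions.

```python
def add_heading_paths(items):
    """Yield (heading_path, text) for each body-text item.

    items: iterable of (level, text); level 0 is body text, 1..n are headings.
    """
    stack = []  # enclosing headings as (level, title), outermost first
    for level, text in items:
        if level == 0:
            yield " > ".join(title for _, title in stack), text
            continue
        # A new heading closes every open heading at the same or deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
```

Prefixing each chunk with its heading path before embedding gives the chunk context it would otherwise lose when separated from its section.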

Code and Examples

Programming code and examples are common in technical documentation.

Strategy: Either preserve as-is in code blocks, or create separate embeddings for code with natural language explanations.
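Either strategy starts with separating code from the surrounding prose. The sketch below assumes the extracted text marks code with Markdown-style fences, which will not hold for every source; the function name is also an assumption.

```python
FENCE = "`" * 3  # the Markdown code-fence marker

def split_code_and_prose(markdown_text):
    """Return a list of (kind, text) segments, where kind is 'prose' or 'code'."""
    segments, buffer, in_code = [], [], False

    def flush():
        text = "\n".join(buffer).strip("\n")
        if text.strip():
            segments.append(("code" if in_code else "prose", text))

    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            # A fence line ends the current segment and toggles the mode.
            flush()
            buffer, in_code = [], not in_code
        else:
            buffer.append(line)
    flush()
    return segments
```

With the segments separated, code can be stored verbatim while the neighboring prose segment serves as its natural language explanation.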

Batch Processing Strategies

Large document sets require efficient batch processing:

Approach 1: Stream Processing

Process documents one at a time, embed immediately, stream results to database.

Pros: Low memory, incremental progress
Cons: Slower total throughput

Approach 2: Micro-batching

Process documents in small batches, optimal for GPU embedding models.

Pros: Good parallelization, efficient resource use
Cons: Memory requirements scale with batch size

Approach 3: Map-Reduce

Distribute parsing across multiple workers, aggregate results.

Pros: Handles massive datasets
Cons: Adds infrastructure complexity
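Micro-batching can be sketched with a small generator that groups any iterable into fixed-size lists. In the example below, `embed_fn` is a hypothetical callable that takes a list of texts and returns one vector per text; it stands in for whatever embedding client you use.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without loading everything into memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def embed_in_batches(chunks, embed_fn, batch_size=32):
    """Yield (chunk, vector) pairs, calling embed_fn once per batch."""
    for batch in batched(chunks, batch_size):
        vectors = embed_fn(batch)  # hypothetical: list of texts -> list of vectors
        yield from zip(batch, vectors)
```

Because both functions are generators, setting batch_size to 1 gives the stream-processing behavior, and larger values trade memory for throughput.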

Quality Gates in Ingestion

Add quality checks at each stage:

Post-parsing: Check extraction success rate, flag anomalies
Post-cleaning: Verify text readability, check for data loss
Post-chunking: Sample chunks, verify coherence
Post-embedding: Check embedding dimension, value ranges

Failed documents should trigger alerts; don’t silently ingest garbage.
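A post-parsing or post-cleaning gate can be a function that returns a list of problems, where an empty list means the document passes. The checks and thresholds below are illustrative assumptions, not established defaults.

```python
def check_extracted_text(text, min_chars=200, min_printable_ratio=0.95,
                         max_replacement_ratio=0.01):
    """Return a list of problem codes; an empty list means the text passed."""
    problems = []
    if len(text) < min_chars:
        problems.append("too_short")
    if text:
        # Newlines and tabs are legitimate even though isprintable() rejects them.
        printable = sum(ch.isprintable() or ch in "\n\t" for ch in text) / len(text)
        if printable < min_printable_ratio:
            problems.append("low_printable_ratio")
        # U+FFFD marks bytes that could not be decoded.
        if text.count("\ufffd") / len(text) > max_replacement_ratio:
            problems.append("encoding_damage")
    return problems
```

Returning codes instead of raising lets the pipeline count failures by type, which is the data an alerting rule needs.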

Incremental vs. Batch Ingestion

Batch ingestion: Process all documents at once. Good for initial setup, clean cutover.

Incremental ingestion: Add documents continuously. Better for production systems where new documents arrive constantly.

Most systems use a hybrid approach: batch load the initial corpus, then apply incremental updates.
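Incremental updates need a way to tell which documents are new or changed. One common technique is to compare content hashes against those stored from the previous run. The sketch below keeps the hashes in a plain dict for illustration; the function names and the in-memory store are assumptions, and a real system would persist the hashes.

```python
import hashlib

def content_hash(text):
    """Return a stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_incremental_update(documents, seen_hashes):
    """Decide what to re-ingest and what to delete.

    documents: dict of doc_id -> current text
    seen_hashes: dict of doc_id -> hash recorded by the previous run
    Returns (to_ingest, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or changed documents have no stored hash or a different one.
    to_ingest = [d for d, h in new_hashes.items() if seen_hashes.get(d) != h]
    # Documents that disappeared from the source should leave the index too.
    to_delete = [d for d in seen_hashes if d not in new_hashes]
    return to_ingest, to_delete, new_hashes
```

Running the same function with an empty seen_hashes performs the initial batch load, so one code path covers both modes.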

Performance Benchmarks

Rough ingestion rates (excluding embedding; actual throughput varies with hardware, document size, and parser choice):

Plain text: 1000+ documents/second
PDFs with extraction: 10-100 documents/second
OCR’d documents: 1-10 documents/second
Complex multi-format sets: 5-50 documents/second

Embedding rates depend on model size and hardware (see Embeddings section).

Common Ingestion Mistakes

Processing without validation: Assuming extraction works perfectly.

Ignoring encoding issues: Creating garbage embeddings from corrupted text.

Losing metadata: Making it impossible to cite sources later.

Over-aggressive cleaning: Removing meaningful formatting or context.

No retry logic: Letting a single problematic document fail the entire run.
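Retry logic and failure isolation can be combined in one loop: retry each document a few times with exponential backoff, then record the failure and move on. In the sketch below, `process_fn` is a hypothetical callable that ingests one document and raises on failure; the attempt count and delays are assumptions.

```python
import time

def ingest_with_retries(doc_ids, process_fn, max_attempts=3, base_delay=0.5,
                        sleep=time.sleep):
    """Process each document independently; return (succeeded, failed)."""
    succeeded, failed = [], {}
    for doc_id in doc_ids:
        for attempt in range(1, max_attempts + 1):
            try:
                process_fn(doc_id)
                succeeded.append(doc_id)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this document only; the run continues.
                    failed[doc_id] = repr(exc)
                else:
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return succeeded, failed
```

The `failed` mapping is what a monitoring step would report on. Permanent errors such as a corrupt file will exhaust their retries, so a fuller version might distinguish them from transient ones.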

Production Ingestion Pipelines

Production RAG deployments commonly use:

Airflow/Prefect: Orchestrate ingestion workflows
Kafka/RabbitMQ: Stream documents for processing
AWS Lambda/Cloud Functions: Scale parsing horizontally
Monitoring: Track ingestion metrics, failure rates, latency

Ingestion isn’t glamorous, but getting it right prevents downstream problems throughout your RAG system.