Document Ingestion for RAG: Building Robust Pipelines
Document ingestion is where RAG systems meet messy reality. You have documents in different formats (PDFs, Word, HTML, images), at different quality levels, with inconsistent metadata. Ingestion pipelines transform this chaos into clean, searchable content.
The Document Ingestion Pipeline
Raw Documents → Parsing → Cleaning → Metadata → Chunking → Embedding → Storage
Each step matters. Skip quality checks early and you’ll waste resources later computing embeddings for garbage.
File Format Handling
Plain Text and Markdown
Simplest case: already clean, human-readable structure.
Challenges:
- Inconsistent encoding (UTF-8, Latin-1, others)
- Embedded metadata not easily extracted
- Special characters and control codes
Approach: Read carefully, validate encoding, preserve formatting.
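A minimal sketch of the "validate encoding" step, assuming a short list of candidate encodings tried in order. The names decode_bytes, read_text_file, and CANDIDATE_ENCODINGS are illustrative, and the candidate list is an assumption to adapt to your corpus.

```python
from pathlib import Path

# Candidate encodings, tried in order with strict decoding. This list is an
# assumption; "utf-8-sig" handles UTF-8 with or without a byte-order mark.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def decode_bytes(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used), trying strict decodes before falling back."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this never fails. Treat the
    # result as suspect and keep the encoding name so it can be reviewed.
    return raw.decode("latin-1"), "latin-1"


def read_text_file(path: str) -> tuple[str, str]:
    """Read a file as bytes and decode it without altering its formatting."""
    return decode_bytes(Path(path).read_bytes())
```

Returning the encoding alongside the text lets a later quality gate treat fallback decodes with more suspicion than clean UTF-8.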
PDF Documents
The most common format companies provide, and the trickiest to parse correctly.
Challenges:
- PDFs can be scanned images, searchable text, or both
- Layout preservation affects meaning (headers, tables, footnotes)
- Inconsistent text extraction quality
- Security: watermarks, encryption, permissions
Tools:
- PyPDF2 - Simple, reliable for well-structured PDFs (development has since moved to the pypdf package)
- pdfplumber - Excellent for tables and layout-aware extraction
- PDFMiner - Long-established, good at preserving structure (pdfminer.six is the maintained fork)
- OCR (Tesseract, PaddleOCR) - For scanned documents
Strategy: Extract text, preserve document structure, flag images for separate processing.
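One way to separate searchable pages from scanned ones, sketched under the assumption that a PDF library has already produced one text string per page. The function name classify_pages and the min_chars threshold are hypothetical; the threshold in particular needs tuning on real documents.

```python
from typing import Optional


def classify_pages(page_texts: list[Optional[str]], min_chars: int = 20) -> dict[str, list[int]]:
    """Split page indices into pages with usable text and pages that likely need OCR."""
    result: dict[str, list[int]] = {"text": [], "needs_ocr": []}
    for index, text in enumerate(page_texts):
        # Count only visible characters; a page of whitespace is effectively empty.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        key = "text" if visible >= min_chars else "needs_ocr"
        result[key].append(index)
    return result
```

Pages listed under "needs_ocr" can then be rendered to images and sent through the OCR path, while the rest skip it.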
Microsoft Office Documents
.docx, .xlsx, and .pptx files often contain embedded images, tables, and metadata.
Tools:
- python-docx - Extract text from Word documents
- openpyxl, pandas - Handle Excel files
- python-pptx - Parse PowerPoint slides
Strategy: Extract all content types, preserve formatting hierarchy, handle embedded objects.
Web Content (HTML, XML)
Websites, API responses, and XML feeds need careful parsing.
Challenges:
- Boilerplate content (navigation, ads, scripts)
- Inconsistent DOM structure across sites
- Dynamic content requiring JavaScript execution
Tools:
- BeautifulSoup - Parse and extract from HTML
- lxml - Fast XML parsing
- Selenium - Handle JavaScript-heavy pages
- Playwright - Modern browser automation
Strategy: Extract main content, remove boilerplate, preserve semantic structure.
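As a rough illustration of boilerplate removal, the sketch below uses only the standard library's html.parser rather than the tools listed above. The SKIP_TAGS and BLOCK_TAGS sets and the class name MainTextExtractor are assumptions; real sites usually need site-specific rules, and an unclosed skipped tag would hide everything after it.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption, not a standard).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}
# Tags that should start a new line in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class MainTextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside any skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Keeping one line per block element preserves a little of the page's semantic structure for the chunking stage.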
Images and Scanned Documents
Documents stored as images need OCR (Optical Character Recognition).
Tools:
- Tesseract - Open-source, reliable
- PaddleOCR - Fast, multilingual
- Cloud APIs - Google Cloud Vision, Azure Computer Vision
Challenge: OCR accuracy varies; poor quality originals yield poor results.
Strategy: Use OCR quality assessment, flag uncertain extractions for review.
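A possible shape for the quality assessment, assuming the OCR engine exposes a per-word confidence on a 0 to 100 scale (some engines report 0 to 1 instead, so rescale as needed). The function name and both thresholds are hypothetical values chosen for illustration.

```python
from statistics import mean


def assess_ocr_quality(words, min_mean_conf=80.0, min_word_conf=60.0):
    """Summarize OCR confidence for one page.

    words: list of (text, confidence) pairs, confidence assumed to be 0-100.
    """
    scored = [(text, conf) for text, conf in words if text.strip()]
    if not scored:
        # Nothing recognized at all: always send to review.
        return {"mean_confidence": 0.0, "low_confidence_words": [], "needs_review": True}
    mean_conf = mean(conf for _, conf in scored)
    low = [text for text, conf in scored if conf < min_word_conf]
    return {
        "mean_confidence": mean_conf,
        "low_confidence_words": low,
        "needs_review": mean_conf < min_mean_conf,
    }
```

The summary can be stored with the document's metadata so that retrieval can down-weight or exclude uncertain extractions.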
Text Cleaning and Normalization
After extraction, text needs cleaning:
Remove noise:
- Extra whitespace and line breaks
- Invisible characters and encoding artifacts
- HTML tags and XML markup
- Boilerplate text
Normalize:
- Standardize newlines (all \n)
- Fix common encoding errors
- Standardize dates and numbers
- Consistent capitalization for special terms
Example Python approach:
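import re
import unicodedata

def clean_text(text):
    # Collapse all whitespace (including newlines and tabs) to single spaces.
    # This runs first because \n and \t are control characters; stripping
    # control characters first would fuse words across line breaks.
    text = re.sub(r'\s+', ' ', text)
    # Remove remaining control, format, and other "C" category characters
    text = ''.join(ch for ch in text if unicodedata.category(ch)[0] != 'C')
    # Replace URLs with a placeholder (handle emails similarly if needed)
    text = re.sub(r'https?://\S+', '[URL]', text)
    return text.strip()

Metadata Extraction and Assignment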
Metadata helps context and ranking during retrieval. Extract or assign:
- Source: File name, URL, database ID
- Date: Creation, modification, publication date
- Author: Who created/updated the document
- Type: Document category, format, level (executive summary vs. detailed)
- Language: Detected language for multilingual systems
- Version: Track document versions
- Confidence: OCR confidence, automated extraction quality scores
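The fields above could be carried in a small record like the following sketch. The class DocumentMetadata, its field names, and the helper metadata_from_file are illustrative assumptions; only the source, type, and modification date are derived from the file here, and everything else is supplied by the caller.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass
class DocumentMetadata:
    source: str
    doc_type: str
    modified_at: str                      # ISO 8601 timestamp in UTC
    author: Optional[str] = None
    language: Optional[str] = None
    version: Optional[str] = None
    extraction_confidence: Optional[float] = None


def metadata_from_file(path: str, **overrides) -> DocumentMetadata:
    """Build metadata from the file system; callers pass known fields as overrides."""
    p = Path(path)
    mtime = datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc)
    return DocumentMetadata(
        source=str(p),
        doc_type=p.suffix.lower().lstrip(".") or "unknown",
        modified_at=mtime.isoformat(),
        **overrides,
    )
```

Calling asdict on the record gives a plain dictionary that can be stored next to each chunk.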
Handling Special Content
Tables
Tables contain structured information that embeddings struggle with. Options:
- Convert to prose: “Column A, Row B contains value X”
- Preserve structure: Keep table formatting in extracted text
- Separate indexing: Index tables differently from narrative text
- Vector approach: Embed table metadata separately from content
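The "convert to prose" option can be sketched as below, assuming the table has already been extracted into a header list and row lists. The function name and the sentence template are hypothetical; a real pipeline would likely tailor the wording to the domain.

```python
def table_to_sentences(headers, rows, key_column=0):
    """Turn each table row into one sentence keyed on a chosen column."""
    sentences = []
    for row in rows:
        key = row[key_column]
        facts = [
            f"{header} is {value}"
            for i, (header, value) in enumerate(zip(headers, row))
            if i != key_column and value not in ("", None)   # skip empty cells
        ]
        if facts:
            sentences.append(f"For {headers[key_column]} {key}: " + "; ".join(facts) + ".")
    return sentences
```

For example, a row ["EMEA", "$4.2M", "12%"] under headers ["Region", "Revenue", "Growth"] becomes a single sentence that names each column, which keeps the header context attached to every value.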
Lists and Hierarchies
Nested lists, outlines, and hierarchical information need careful handling.
Approach: Preserve hierarchy in extracted text:
Document Title
├─ Section 1
│  ├─ Subsection 1.1
│  └─ Subsection 1.2
└─ Section 2
Code and Examples
Programming code and examples are common in technical documentation.
Strategy: Either preserve as-is in code blocks, or create separate embeddings for code with natural language explanations.
Batch Processing Strategies
Large document sets require efficient batch processing:
Approach 1: Stream Processing
Process documents one at a time, embed immediately, and stream results to the database.
Pros: Low memory, incremental progress
Cons: Slower total throughput
Approach 2: Micro-batching
Process documents in small batches, which suits GPU embedding models.
Pros: Good parallelization, efficient resource use
Cons: Memory requirements scale with batch size
Approach 3: Map-Reduce
Distribute parsing across multiple workers, then aggregate results.
Pros: Handles massive datasets
Cons: Adds infrastructure complexity
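Micro-batching (Approach 2) can be sketched with a small generator. Here embed_batch is a hypothetical callable standing in for whatever embedding model is used, and the default batch size of 32 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_in_batches(texts: Iterable[str], embed_batch: Callable, batch_size: int = 32):
    """Yield (text, vector) pairs, calling embed_batch once per batch."""
    for batch in batched(texts, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        yield from zip(batch, vectors)
```

Because both functions are generators, only one batch is held in memory at a time, so memory use scales with the batch size rather than with the corpus.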
Quality Gates in Ingestion
Add quality checks at each stage:
- Post-parsing: Check extraction success rate, flag anomalies
- Post-cleaning: Verify text readability, check for data loss
- Post-chunking: Sample chunks, verify coherence
- Post-embedding: Check embedding dimension, value ranges
Failed documents should trigger alerts—don’t silently ingest garbage.
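The post-parsing and post-embedding gates might look like the following sketch. The function names, the problem labels, and the thresholds (50 characters, 1% replacement characters) are assumptions for illustration, not established defaults.

```python
import math


def check_parsed_text(text, min_chars=50, max_replacement_ratio=0.01):
    """Return a list of problems found in extracted text (empty list means pass)."""
    problems = []
    if len(text.strip()) < min_chars:
        problems.append("too_short")
    # U+FFFD is the Unicode replacement character left behind by failed decodes.
    if text and text.count("\ufffd") / len(text) > max_replacement_ratio:
        problems.append("encoding_damage")
    return problems


def check_embedding(vector, expected_dim):
    """Return a list of problems found in one embedding vector."""
    problems = []
    if len(vector) != expected_dim:
        problems.append("wrong_dimension")
    if not all(math.isfinite(x) for x in vector):
        problems.append("non_finite_values")
    elif vector and all(x == 0 for x in vector):
        problems.append("all_zero")
    return problems
```

Returning problem labels instead of a boolean makes it easier to alert on specific failure types and to count them per stage.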
Incremental vs. Batch Ingestion
Batch ingestion: Process all documents at once. Good for initial setup, clean cutover.
Incremental ingestion: Add documents continuously. Better for production systems where new documents arrive constantly.
Most systems use a hybrid: batch-load the initial corpus, then apply incremental updates.
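For incremental updates, one common way to decide what needs reprocessing is to compare content hashes against the previous run. The sketch below assumes document IDs are stable across runs; plan_incremental_update and its argument shapes are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental_update(seen: dict, incoming: dict) -> dict:
    """Compare this run against the last one.

    seen:     doc_id -> content hash recorded by previous runs
    incoming: doc_id -> raw bytes available in this run
    """
    plan = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for doc_id, data in incoming.items():
        digest = content_hash(data)
        if doc_id not in seen:
            plan["new"].append(doc_id)
        elif seen[doc_id] != digest:
            plan["changed"].append(doc_id)
        else:
            plan["unchanged"].append(doc_id)
    # Documents that disappeared should have their chunks removed from the index.
    plan["deleted"] = sorted(set(seen) - set(incoming))
    return plan
```

Only the "new" and "changed" documents need to go through parsing and embedding again, which keeps incremental runs cheap.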
Performance Benchmarks
Rough, order-of-magnitude ingestion rates (excluding embedding); actual figures vary with hardware, document size, and tooling:
- Plain text: 1000+ documents/second
- PDFs with extraction: 10-100 documents/second
- OCR’d documents: 1-10 documents/second
- Complex multi-format sets: 5-50 documents/second
Embedding rates depend on model size and hardware (see Embeddings section).
Common Ingestion Mistakes
Processing without validation: Assuming extraction works perfectly.
Ignoring encoding issues: Creating garbage embeddings from corrupted text.
Losing metadata: Making it impossible to cite sources later.
Over-aggressive cleaning: Removing meaningful formatting or context.
No retry logic: Failing completely on single problematic documents.
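To address the last mistake, a sketch of per-document retries with exponential backoff, so that one bad file cannot stop the whole run. The function names, the result dictionary layout, and the default delays are assumptions; process stands in for whatever parsing function the pipeline uses.

```python
import time


def process_with_retry(doc_id, process, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call process(doc_id), retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = process(doc_id)
            return {"doc_id": doc_id, "ok": True, "result": result, "attempts": attempt}
        except Exception as exc:
            if attempt == max_attempts:
                # Give up on this document only; the caller continues with the rest.
                return {"doc_id": doc_id, "ok": False, "error": repr(exc), "attempts": attempt}
            sleep(base_delay * 2 ** (attempt - 1))


def ingest_all(doc_ids, process, **retry_options):
    """Process every document and return (succeeded, failed) result lists."""
    results = [process_with_retry(d, process, **retry_options) for d in doc_ids]
    succeeded = [r for r in results if r["ok"]]
    failed = [r for r in results if not r["ok"]]
    return succeeded, failed
```

The failed list is what should feed alerts or a dead-letter queue, so problem documents are recorded rather than silently dropped.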
Production Ingestion Pipelines
Modern RAG deployments use:
- Airflow/Prefect: Orchestrate ingestion workflows
- Kafka/RabbitMQ: Stream documents for processing
- Lambda/Cloud Functions: Scale parsing horizontally
- Monitoring: Track ingestion metrics, failure rates, latency
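The monitoring item can start as simply as a few in-process counters, as in this sketch. IngestionMetrics and its fields are hypothetical; a production system would more likely export these numbers to its existing metrics backend.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IngestionMetrics:
    processed: int = 0
    failed: int = 0
    total_seconds: float = 0.0
    failures_by_stage: Counter = field(default_factory=Counter)

    def record_success(self, seconds: float) -> None:
        self.processed += 1
        self.total_seconds += seconds

    def record_failure(self, stage: str, seconds: float) -> None:
        # Failures count toward processed so the failure rate has the right base.
        self.processed += 1
        self.failed += 1
        self.total_seconds += seconds
        self.failures_by_stage[stage] += 1

    @property
    def failure_rate(self) -> float:
        return self.failed / self.processed if self.processed else 0.0

    @property
    def mean_latency(self) -> float:
        return self.total_seconds / self.processed if self.processed else 0.0
```

Tracking failures by stage shows whether problems cluster in parsing, cleaning, or embedding.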
Ingestion isn’t glamorous, but getting it right prevents downstream problems throughout your RAG system.