PageIndex takes a fundamentally different approach to retrieval-augmented generation. Traditional RAG systems split documents into chunks, embed them as vectors, and retrieve based on semantic similarity — a process that often fails with professional documents where relevance requires multi-step reasoning, not just proximity in embedding space. PageIndex instead builds a hierarchical tree index from each document, similar to a table of contents but optimized for LLM navigation, and lets the model reason over that structure to find exactly what it needs. The result is context-aware, traceable retrieval that mirrors how a human expert would read a complex report.
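To make the idea concrete, here is a minimal sketch of tree-based retrieval: the document is indexed as a hierarchy of sections, like a table of contents, and the model descends the tree by judging node titles and summaries rather than comparing embeddings. The node fields, the `navigate` helper, and the keyword stub standing in for the LLM are all illustrative assumptions, not PageIndex's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    """One node of a table-of-contents-style index (illustrative, not the library's schema)."""
    title: str
    summary: str
    page_range: tuple[int, int]  # (start_page, end_page) in the source document
    children: list["SectionNode"] = field(default_factory=list)

def navigate(node: SectionNode, is_relevant) -> SectionNode:
    """Walk down the tree, at each level following the first child that the
    relevance judge (an LLM in a real system, a stub here) accepts."""
    while node.children:
        candidates = [c for c in node.children if is_relevant(c)]
        if not candidates:
            break  # no child matches: the answer lives at this level
        node = candidates[0]
    return node

# Toy index of a 10-K filing; a keyword check stands in for LLM reasoning.
root = SectionNode("10-K", "Annual report", (1, 120), [
    SectionNode("Item 7. MD&A", "Results, liquidity, operations", (30, 60), [
        SectionNode("Liquidity and Capital Resources", "Cash flow, debt", (45, 52)),
    ]),
    SectionNode("Item 8. Financial Statements", "Audited statements", (61, 110)),
])

hit = navigate(root, lambda n: "liquidity" in (n.title + " " + n.summary).lower())
print(hit.title, hit.page_range)  # → Liquidity and Capital Resources (45, 52)
```

The payoff over chunk embedding is that the retrieved span comes with its full path through the document (10-K → MD&A → Liquidity), which is what makes the result traceable.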
The framework is available as an open-source Python package for self-hosted use with standard PDF parsing, and as a production-grade cloud service with enhanced OCR and tree-building pipelines for complex documents. An MCP server integration (pageindex-mcp) lets Claude, Cursor, and other MCP-compatible agents query document indexes directly, with no vector database required. The system is LLM-agnostic via LiteLLM, so it works with OpenAI, Anthropic, or any supported provider, and it handles PDFs, Markdown files, and multi-document corpora through the PageIndex File System extension.
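For documents with explicit structure, the tree-building step is easy to picture. The sketch below derives a table-of-contents-style tree from Markdown headings; it illustrates the general idea only, and is not the pageindex package's actual pipeline (which also handles PDFs without clean heading markup).

```python
import re

def build_tree(markdown: str) -> dict:
    """Turn Markdown headings into a nested section tree (illustrative sketch)."""
    root = {"title": "root", "level": 0, "text": "", "children": []}
    stack = [root]  # stack of open ancestor sections
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            node = {"title": m.group(2).strip(), "level": len(m.group(1)),
                    "text": "", "children": []}
            # Pop until the top of the stack is a strict ancestor level.
            while stack[-1]["level"] >= node["level"]:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["text"] += line + "\n"  # body text belongs to the open section
    return root

doc = "# Report\n## Methods\nWe used X.\n## Results\nAccuracy was high.\n"
tree = build_tree(doc)
print([c["title"] for c in tree["children"][0]["children"]])  # → ['Methods', 'Results']
```

An agent connected through the MCP server would query a tree like this one (title, summary, page span per node) instead of a vector store, which is why no embedding step is needed.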
PageIndex powers Mafin 2.5, a financial document analysis system that achieved 98.7% accuracy on the FinanceBench benchmark — significantly above what traditional vector-based RAG systems typically reach on the same tasks. The benchmark covers SEC filings, earnings disclosures, and complex multi-page financial reports where precise section retrieval matters. For teams working with long professional documents — legal filings, technical manuals, academic papers — PageIndex offers a path to retrieval quality that embedding similarity alone cannot reliably deliver.
