Kreuzberg is a polyglot document intelligence framework built on a high-performance Rust core. It extracts text, metadata, images, and structured information from more than 91 file formats, including PDFs, Office documents, images, and spreadsheets. Originally a Python library, it has evolved into a multi-language framework with native bindings for Python, Ruby, Java, Go, PHP, C#, R, C, and TypeScript, and it can also be deployed as a CLI, a REST API, or an MCP server.
The framework offers multiple OCR backends (Tesseract, EasyOCR, and PaddleOCR), so developers can choose one based on their accuracy and speed requirements. Table extraction preserves document structure, which is particularly valuable for RAG pipelines and LLM preprocessing where layout matters. The Rust core provides native PDFium integration and SIMD optimizations for high-throughput processing.
Kreuzberg supports fully async workflows for Python developers and provides an extensible plugin system for custom format handlers. It is under active development, with regular releases and a growing contributor community, and it has become a go-to choice for teams building document-heavy AI applications that need reliable, local text extraction across diverse file types.
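
To illustrate what an async extraction workflow can look like in Python, here is a minimal sketch that processes several files concurrently while capping how many are in flight. It does not use Kreuzberg's API: `extract_text` is a hypothetical stand-in that only reads plain-text files, and `extract_many` and `max_concurrent` are names assumed for this example. In a real project the stand-in would be replaced by the library's own async extraction call, as described in its documentation.

```python
import asyncio
from pathlib import Path


async def extract_text(path: Path) -> str:
    """Hypothetical stand-in for an async extraction call; it only reads plain text."""
    # Run the blocking file read in a worker thread so the event loop stays free.
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def extract_many(paths: list[Path], max_concurrent: int = 4) -> dict[str, str]:
    """Extract several files concurrently, capping the number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path: Path) -> tuple[str, str]:
        # The semaphore limits how many extractions run at the same time.
        async with semaphore:
            return path.name, await extract_text(path)

    # gather() schedules every task and returns results in input order.
    pairs = await asyncio.gather(*(bounded(p) for p in paths))
    return dict(pairs)
```

A caller would run this with something like `asyncio.run(extract_many([Path("a.txt"), Path("b.txt")]))`. The concurrency cap is a design choice for this sketch: it keeps memory and file-handle use predictable when the input list is large.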