
Kreuzberg

Polyglot document intelligence framework with Rust core

Open Source

Kreuzberg is a polyglot document intelligence framework with a high-performance Rust core that extracts text, metadata, images, and structured data from 91+ file formats. It is available for Python, Ruby, Java, Go, PHP, C#, and TypeScript, and also ships as a CLI, REST API, and MCP server. It features multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), table extraction with structure preservation, and native async support.

Kreuzberg is a polyglot document intelligence framework built on a high-performance Rust core. It extracts text, metadata, images, and structured information from over 91 file formats, including PDFs, Office documents, images, and spreadsheets. Originally a Python library, it has evolved into a multi-language framework with native bindings for Python, Ruby, Java, Go, PHP, C#, R, C, and TypeScript, plus deployment via CLI, REST API, or MCP server.

The framework offers multiple OCR backends — Tesseract, EasyOCR, and PaddleOCR — giving developers flexibility based on accuracy and speed requirements. Table extraction preserves document structure, making it particularly valuable for RAG pipelines and LLM preprocessing where layout matters. The Rust core provides native PDFium integration and SIMD optimizations for high-throughput processing.

Kreuzberg supports fully async workflows for Python developers and provides an extensible plugin system for custom format handlers. It is under active development, with regular releases and a growing contributor community, and it suits teams building document-heavy AI applications that need reliable, local text extraction across diverse file types.
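As a rough illustration of how async extraction and pluggable format handlers can fit together, here is a minimal sketch using only the Python standard library. It does not use Kreuzberg's actual API: `register_handler`, `extract`, `extract_many`, and the `HANDLERS` registry are hypothetical names invented for this example.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Dict

# Hypothetical handler type: takes raw bytes, returns extracted text.
Handler = Callable[[bytes], Awaitable[str]]

# Hypothetical registry mapping file extensions to handlers
# (an assumption for illustration, not Kreuzberg's API).
HANDLERS: Dict[str, Handler] = {}


def register_handler(extension: str) -> Callable[[Handler], Handler]:
    """Register a custom format handler for one file extension."""
    def decorator(func: Handler) -> Handler:
        HANDLERS[extension.lower()] = func
        return func
    return decorator


@register_handler(".txt")
async def handle_plain_text(data: bytes) -> str:
    # Plain text needs no parsing beyond decoding.
    return data.decode("utf-8", errors="replace")


@register_handler(".csv")
async def handle_csv(data: bytes) -> str:
    # Flatten comma-separated rows into tab-separated lines of text.
    lines = data.decode("utf-8", errors="replace").splitlines()
    return "\n".join("\t".join(line.split(",")) for line in lines)


async def extract(name: str, data: bytes) -> str:
    """Dispatch to the handler registered for the file's extension."""
    suffix = Path(name).suffix.lower()
    handler = HANDLERS.get(suffix)
    if handler is None:
        raise ValueError(f"no handler registered for {suffix!r}")
    return await handler(data)


async def extract_many(files: Dict[str, bytes]) -> Dict[str, str]:
    """Extract several documents concurrently and map names to text."""
    names = list(files)
    texts = await asyncio.gather(*(extract(n, files[n]) for n in names))
    return dict(zip(names, texts))
```

Calling `asyncio.run(extract_many({"a.txt": b"hello", "b.csv": b"x,y\n1,2"}))` runs both handlers concurrently. A new format would be supported by registering one more decorated coroutine, which is the general idea behind a plugin-style handler system.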

Pricing

Free and open-source

Platforms

Python (cross-platform)

Alternatives


Unstructured

ETL for LLMs — preprocess any document format

Unstructured is an open-source ETL library that preprocesses and transforms documents from diverse formats into clean, structured data ready for LLM ingestion and RAG pipelines. It handles PDF, HTML, Word, PowerPoint, and many other file types through partitioning, cleaning, and chunking operations. The library offers a connector-based architecture for integrating with various data sources and destinations, making it a key component in document processing workflows for AI applications.

Freemium, Open Source

Docling

Get your documents ready for gen AI

Docling is an open-source document processing toolkit by IBM Research that converts complex documents into structured formats optimized for generative AI applications. It parses PDF, DOCX, PPTX, XLSX, HTML, images, audio, and LaTeX with advanced PDF understanding including layout analysis, reading order detection, and table structure recognition. Docling exports to Markdown, HTML, JSON, and DocTags, and integrates natively with LangChain, LlamaIndex, and other AI frameworks for RAG workflows.

Open Source

MarkItDown

Convert any file to Markdown for LLM pipelines

MarkItDown is a lightweight Python utility by Microsoft that converts files into clean Markdown optimized for LLM pipelines and text analysis. It supports PDF, Word, Excel, PowerPoint, HTML, images with OCR, audio with transcription, and text formats like CSV, JSON, and XML. The tool preserves document structure including headings, tables, lists, and links while keeping output token-efficient. It offers a CLI, a four-line Python API, Docker support, and a plugin architecture for extensions.

Open Source

txtai

All-in-one embeddings database with RAG, search, and agent capabilities

txtai is a self-contained AI search and RAG platform that combines vector embeddings, semantic search, LLM pipelines, and agent workflows in a single Python library. It handles embedding generation, similarity search, extractive QA, summarization, translation, and custom pipelines without external dependencies. It runs locally, is released under the Apache 2.0 license, and has over 12,400 GitHub stars.

Open Source

Related Tools


Hermes Agent

Top Pick

Open-source AI agent framework with persistent memory, reusable skills, tools, and messaging gateways

Hermes Agent is an open-source AI agent framework with persistent memory, reusable skills, 40+ tools, cron jobs, and messaging gateways.

Open Source

BeeAI Framework

Python and TypeScript framework for production multi-agent systems

BeeAI Framework is an Apache-2.0 toolkit for building production-ready AI agents and multi-agent systems in Python and TypeScript. Its docs cover agents, tools, RAG, memory, workflows, backend providers, serving, and A2A/MCP integration surfaces, making it a vendor-neutral option for teams comparing LangGraph, CrewAI, Mastra, and related agent runtimes.

Open Source, Telemetry

Superserve

Open-source Firecracker sandboxes for long-running AI agents

Superserve is an open-source sandbox infrastructure layer for AI agents that need durable computers instead of short-lived shells. It runs isolated Firecracker microVMs and supports pause, resume, snapshot, fork, preview URLs, MCP connectivity, SDK/API control, Docker workloads, and self-hosting. The hosted service adds pay-as-you-go agent sandboxes for teams.

Open Source

Anthropic Agent Skills

Official Claude Agent Skills examples, spec, and plugin marketplace for reusable agent capabilities

Anthropic Agent Skills is Anthropic's official reference repo and Claude Code plugin marketplace for reusable Skill folders. It packages example SKILL.md workflows, document skills, a Claude API skill, templates, and the Agent Skills spec so teams can turn repeatable instructions, scripts, and resources into on-demand Claude capabilities instead of copying prompts across sessions.

Free, Telemetry

agmsg

Cross-agent messaging for CLI coding agents

agmsg is an MIT-licensed Bash and SQLite messaging layer for CLI coding agents. It lets Claude Code, Codex, Gemini CLI, GitHub Copilot CLI, Antigravity, OpenCode, Hermes, and other terminal agents exchange messages through a shared local database instead of relying on a human copy-paste relay. It is intentionally not MCP, not a broker, and not a subagent framework.

Open Source

eve by Vercel

Filesystem-first framework for durable AI agents

Eve is Vercel's filesystem-first TypeScript framework for building durable AI agents as ordinary project files. It combines Markdown instructions and skills, typed tools, channels, connections, subagents, schedules, sandboxes, and evals with Vercel's agent runtime so teams can ship deployable agents without hand-rolling orchestration. The current beta fits Vercel-native backend agent projects.

Open Source