
# document-conversion

4 tools tagged


Kreuzberg

Polyglot document intelligence framework with Rust core

Kreuzberg is a polyglot document intelligence framework built on a high-performance Rust core. It extracts text, metadata, images, and structured data from 91+ file formats. It is available for Python, Ruby, Java, Go, PHP, C#, and TypeScript, and also ships as a CLI, a REST API, and an MCP server. Features include multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), table extraction with structure preservation, and native async support.

Open Source

Unstructured

ETL for LLMs — preprocess any document format

Unstructured is an open-source ETL library that preprocesses and transforms documents from diverse formats into clean, structured data ready for LLM ingestion and RAG pipelines. It handles PDF, HTML, Word, PowerPoint, and many other file types through partitioning, cleaning, and chunking operations. The library offers a connector-based architecture for integrating with various data sources and destinations, making it a key component in document processing workflows for AI applications.
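To make the partition, clean, and chunk stages concrete, here is a minimal standard-library sketch of the idea. It does not use or reproduce the Unstructured API: the `Element` class, the function names, the title heuristic, and the `max_chars` parameter are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """A hypothetical document element: a category plus its text."""
    category: str
    text: str


def partition(raw: str) -> list[Element]:
    """Split raw text into elements on blank lines. Short single-line blocks
    without terminal punctuation are treated as titles (a crude heuristic)."""
    elements = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        text = block.strip()
        if not text:
            continue
        is_title = len(text) < 60 and "\n" not in text and text[-1] not in ".!?:"
        elements.append(Element("Title" if is_title else "NarrativeText", text))
    return elements


def clean(element: Element) -> Element:
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return Element(element.category, re.sub(r"\s+", " ", element.text).strip())


def chunk_by_title(elements: list[Element], max_chars: int = 500) -> list[str]:
    """Group elements into chunks, starting a new chunk at each title or
    when adding an element would push the chunk past max_chars."""
    chunks, current = [], ""
    for el in elements:
        starts_section = el.category == "Title"
        too_long = len(current) + len(el.text) + 1 > max_chars
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{el.text}" if current else el.text
    if current:
        chunks.append(current)
    return chunks


raw = """Introduction

This   report covers
two topics.

Methods

We   parsed the files."""

elements = [clean(el) for el in partition(raw)]
chunks = chunk_by_title(elements)
print(chunks)
```

Each chunk here keeps a heading together with the text beneath it, which is the property that tends to matter when chunks are embedded for retrieval. A real pipeline would rely on the library's own format-aware partitioners instead of a blank-line split.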

Freemium, Open Source

Docling

Get your documents ready for gen AI

Docling is an open-source document processing toolkit by IBM Research that converts complex documents into structured formats optimized for generative AI applications. It parses PDF, DOCX, PPTX, XLSX, HTML, images, audio, and LaTeX with advanced PDF understanding including layout analysis, reading order detection, and table structure recognition. Docling exports to Markdown, HTML, JSON, and DocTags, and integrates natively with LangChain, LlamaIndex, and other AI frameworks for RAG workflows.

Open Source

MarkItDown

Convert any file to Markdown for LLM pipelines

MarkItDown is a lightweight Python utility by Microsoft that converts files into clean Markdown optimized for LLM pipelines and text analysis. It supports PDF, Word, Excel, PowerPoint, HTML, images with OCR, audio with transcription, and text formats like CSV, JSON, and XML. The tool preserves document structure including headings, tables, lists, and links while keeping output token-efficient. It offers a CLI, a four-line Python API, Docker support, and a plugin architecture for extensions.
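As a rough illustration of what converting a structured text format such as CSV into a token-efficient Markdown table involves, the sketch below uses only the Python standard library. It is not MarkItDown's implementation or API; the `csv_to_markdown` function is a hypothetical helper written for this example.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a pipe-delimited Markdown table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)

    def fmt(cells):
        # Pad or truncate to the header width, and escape pipes so a cell
        # value cannot break the table structure.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


sample = "name,language\nKreuzberg,Rust\nMarkItDown,Python\n"
print(csv_to_markdown(sample))
```

The output is a plain Markdown table with one header row, one separator row, and one row per record, which keeps the structure readable by an LLM without extra markup.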

Open Source
Tools tagged "Document Conversion" — aicoolies