3 tools tagged
Showing 3 of 3 tools
ETL for LLMs — preprocess any document format
Unstructured is an open-source ETL library that preprocesses and transforms documents from diverse formats into clean, structured data ready for LLM ingestion and RAG pipelines. It handles PDF, HTML, Word, PowerPoint, and many other file types through partitioning, cleaning, and chunking operations. The library offers connector-based architecture for integrating with various data sources and destinations, making it a key component in document processing workflows for AI applications.
Get your documents ready for gen AI
Docling is an open-source document processing toolkit by IBM Research that converts complex documents into structured formats optimized for generative AI applications. It parses PDF, DOCX, PPTX, XLSX, HTML, images, audio, and LaTeX with advanced PDF understanding including layout analysis, reading order detection, and table structure recognition. Docling exports to Markdown, HTML, JSON, and DocTags, and integrates natively with LangChain, LlamaIndex, and other AI frameworks for RAG workflows.
Convert any file to Markdown for LLM pipelines
MarkItDown is a lightweight Python utility by Microsoft that converts files into clean Markdown optimized for LLM pipelines and text analysis. It supports PDF, Word, Excel, PowerPoint, HTML, images with OCR, audio with transcription, and text formats like CSV, JSON, and XML. The tool preserves document structure including headings, tables, lists, and links while keeping output token-efficient. It offers a CLI, a four-line Python API, Docker support, and a plugin architecture for extensions.