Unstructured started in 2022 as a PDF splitter and evolved into a full ETL stack purpose-built for turning messy enterprise documents into clean, structured data that large language models can work with. The open-source core library partitions over 25 document types (PDF, HTML, Word, PowerPoint, email, images, and more) into typed elements such as titles, narrative text, tables, list items, and page breaks. Each element carries metadata including coordinates, page numbers, detected languages, and content hashes, giving downstream systems precise provenance for every piece of extracted content.
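To make the idea of a typed element with provenance metadata concrete, here is a minimal standard-library sketch of what such a record might look like. The class names, field names, and sample values (`Element`, `ElementMetadata`, `report.pdf`) are illustrative assumptions, not the library's actual classes or output.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ElementMetadata:
    # Hypothetical field names, chosen to mirror the metadata listed above.
    page_number: Optional[int] = None
    coordinates: Optional[Tuple[Tuple[int, int], ...]] = None  # bounding-box corners
    languages: List[str] = field(default_factory=list)
    filename: Optional[str] = None


@dataclass
class Element:
    category: str  # e.g. "Title", "NarrativeText", "Table", "ListItem"
    text: str
    metadata: ElementMetadata = field(default_factory=ElementMetadata)

    @property
    def content_hash(self) -> str:
        # Hash of the text alone: identical text yields an identical hash,
        # so a downstream store can deduplicate or detect changed content.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

    def to_dict(self) -> dict:
        # Flatten the element into a JSON-friendly record.
        return {
            "type": self.category,
            "text": self.text,
            "content_hash": self.content_hash,
            "metadata": {
                "page_number": self.metadata.page_number,
                "coordinates": self.metadata.coordinates,
                "languages": self.metadata.languages,
                "filename": self.metadata.filename,
            },
        }


# Made-up sample elements, as if extracted from page 1 of a PDF.
elements = [
    Element("Title", "Quarterly Report",
            ElementMetadata(1, ((72, 90), (540, 120)), ["eng"], "report.pdf")),
    Element("NarrativeText", "Revenue grew during the quarter.",
            ElementMetadata(1, ((72, 140), (540, 200)), ["eng"], "report.pdf")),
]
records = [e.to_dict() for e in elements]
```

Keeping the hash a function of the text only is one possible design choice: it makes the hash stable across re-runs that shift coordinates slightly, at the cost of not distinguishing identical text on different pages.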
The processing pipeline follows a partition-enrich-chunk-embed flow. Partitioning offers two strategies: a fast strategy extracts text with pdfminer and applies rule-based heuristics for quick results, while the hi-res strategy invokes layout ML models for accurate table boundary detection, header recognition, and image OCR, which is useful for scanned documents or complex multi-column layouts. The ingest module connects to over 40 data sources, including S3, Gmail, Jira, SharePoint, and Confluence, and crawls and partitions their contents at scale. Output formats include JSON, Markdown, HTML, and Arrow, and the same code runs locally during development and against the hosted API in production without changing the data shape.
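The four stages of that flow can be sketched end to end with plain Python. This is a toy stand-in under stated assumptions, not the library's API: the function names (`partition`, `enrich`, `chunk`, `embed`), the title heuristic, the `max_chars` limit, and the hash-based "embedding" are all hypothetical and exist only to show how data moves from one stage to the next.

```python
import hashlib
from typing import Any, Dict, List

Element = Dict[str, Any]


def partition(text: str) -> List[Element]:
    """Split raw text into typed elements using a crude heuristic:
    short blocks without terminal punctuation are treated as titles."""
    elements = []
    for block in text.split("\n\n"):
        block = " ".join(block.split())  # normalize whitespace
        if not block:
            continue
        is_title = len(block) < 60 and not block.endswith((".", "!", "?", ":"))
        elements.append({"type": "Title" if is_title else "NarrativeText",
                         "text": block})
    return elements


def enrich(elements: List[Element], source: str) -> List[Element]:
    """Attach provenance metadata (source, position, content hash)."""
    return [
        {**el, "metadata": {
            "source": source,
            "index": i,
            "content_hash": hashlib.sha256(el["text"].encode("utf-8")).hexdigest(),
        }}
        for i, el in enumerate(elements)
    ]


def chunk(elements: List[Element], max_chars: int = 200) -> List[Dict[str, Any]]:
    """Group elements into chunks, starting a new chunk at each title
    or when adding the next element would exceed max_chars."""
    groups, current = [], []
    for el in elements:
        size = sum(len(e["text"]) for e in current)
        if current and (el["type"] == "Title" or size + len(el["text"]) > max_chars):
            groups.append(current)
            current = []
        current.append(el)
    if current:
        groups.append(current)
    return [{"text": "\n".join(e["text"] for e in g), "elements": g} for g in groups]


def embed(chunks: List[Dict[str, Any]], dim: int = 8) -> List[Dict[str, Any]]:
    """Toy deterministic vector derived from a hash; a placeholder for a real model."""
    for c in chunks:
        digest = hashlib.sha256(c["text"].encode("utf-8")).digest()
        c["embedding"] = [b / 255 for b in digest[:dim]]
    return chunks


raw = ("Overview\n\nUnstructured data is hard to query.\n\n"
       "Pipeline\n\nPartition first. Then chunk and embed.")
result = embed(chunk(enrich(partition(raw), source="notes.txt")))
```

Because each chunk keeps its source elements and their metadata, a retrieval hit on an embedding can be traced back to the exact element and file it came from.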
The hosted Unstructured Platform adds a low-code UI, horizontal scaling with 300x concurrency per organization, and continuously updated image-to-text and embedding models. Pricing starts with 15,000 free pages that never expire, then moves to flat-rate pay-per-page processing. Enterprise deployments get dedicated VPC instances with full data isolation. An MCP server integration, currently in development, will let AI agents interact with Unstructured pipelines through natural language, closing the loop between document ingestion and agentic workflows.