Dolphin tackles the fundamental challenge of extracting structured information from complex documents that contain intertwined text, tables, mathematical formulas, and embedded figures. Unlike OCR-only solutions that produce raw text or generic vision-language models that lack document structure awareness, Dolphin uses a purpose-built two-stage architecture: first analyzing the full page layout to identify elements in natural reading order, then parsing each element in parallel using heterogeneous anchor prompts tailored to the specific content type.
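The two-stage flow described above can be sketched as follows. This is an illustrative stand-in, not Dolphin's actual API: the function names (`analyze_layout`, `parse_element`) and the anchor prompt strings are assumptions chosen to show the structure of a sequential layout pass followed by parallel, type-specific element parsing.

```python
# Hypothetical sketch of a two-stage parse-then-decode pipeline.
# analyze_layout / parse_element are stand-ins for model calls.
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image):
    # Stage 1 (stand-in): identify elements in natural reading order.
    return [
        {"type": "text", "bbox": (0, 0, 100, 20)},
        {"type": "table", "bbox": (0, 25, 100, 60)},
        {"type": "formula", "bbox": (0, 65, 100, 80)},
    ]

def parse_element(page_image, element):
    # Stage 2 (stand-in): decode one element with a type-specific
    # anchor prompt (prompt wording here is illustrative).
    prompt = {
        "text": "Read the text in the image.",
        "table": "Parse the table in the image.",
        "formula": "Read the formula in the image.",
    }[element["type"]]
    return {"type": element["type"], "prompt": prompt, "content": "..."}

def parse_page(page_image):
    elements = analyze_layout(page_image)  # sequential layout pass
    with ThreadPoolExecutor() as pool:     # elements parsed in parallel
        return list(pool.map(lambda e: parse_element(page_image, e), elements))

results = parse_page(page_image=None)
```

Because stage 2 has no dependencies between elements, parallel decoding is what lets the approach scale to dense pages without paying a sequential cost per element.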
A Swin Transformer-based vision encoder extracts rich visual features from document images, while an MBart-based text decoder generates structured output that preserves the semantic relationships between elements. This architecture handles both digitally rendered documents and photographed or scanned pages, maintaining accuracy across varying image quality and document formats. The heterogeneous anchor prompting system provides context-aware cues for each element type, improving parsing accuracy for complex layouts such as multi-column tables, nested lists, and inline equations.
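To make "structured output preserving semantic relationships" concrete, here is a minimal sketch of what such output might look like and how it could be flattened back into a readable document. The JSON-like schema and the `to_markdown` helper are assumptions for illustration, not Dolphin's real output format.

```python
# Illustrative structured-parse output: each element keeps its type and
# reading-order index, so downstream consumers can rebuild the document.
parsed_page = [
    {"index": 0, "type": "title",   "content": "1. Introduction"},
    {"index": 1, "type": "text",    "content": "Body paragraph text..."},
    {"index": 2, "type": "formula", "content": r"E = mc^2"},
    {"index": 3, "type": "table",   "content": "<table><tr><td>cell</td></tr></table>"},
]

def to_markdown(elements):
    """Flatten structured elements into markdown, keeping reading order."""
    parts = []
    for el in sorted(elements, key=lambda e: e["index"]):
        if el["type"] == "title":
            parts.append(f"# {el['content']}")
        elif el["type"] == "formula":
            parts.append(f"$${el['content']}$$")
        else:  # text, tables (tables often round-trip as HTML)
            parts.append(el["content"])
    return "\n\n".join(parts)

markdown = to_markdown(parsed_page)
```

Keeping type tags and reading-order indices in the output is what distinguishes this from raw OCR text: tables stay tables and formulas stay formulas, rather than collapsing into an undifferentiated character stream.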
Dolphin-v2 extends the original model with document-type awareness, enabling a single model to handle invoices, research papers, forms, receipts, and contracts without fine-tuning for each category. The model is released under the MIT license, and the accompanying paper was accepted at ACL 2025, indicating peer-reviewed quality. Integration with inference platforms such as vLLM and Replicate makes Dolphin accessible for production document processing workflows at scale.