Tools for data labeling, annotation, versioning, feature stores, synthetic data generation, and data curation for machine learning workflows.
Real-time web search and retrieval via MCP
Exa MCP Server provides AI coding agents with real-time web search and content crawling capabilities through the Model Context Protocol. It leverages Exa's neural search API for semantic understanding of queries, returning clean, structured results with full page content extraction. Supports both remote hosted MCP endpoints and local client configurations.
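MCP servers like this one are typically wired into a client through a JSON configuration block. A minimal sketch of such a config, assuming the `exa-mcp-server` npm package name and `EXA_API_KEY` environment variable (check the project's own README for the exact values):

```json
{
  "mcpServers": {
    "exa": {
      "command": "npx",
      "args": ["-y", "exa-mcp-server"],
      "env": { "EXA_API_KEY": "your-api-key-here" }
    }
  }
}
```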
Web scraping and crawling via MCP for AI agents
Firecrawl MCP Server is the official MCP integration for Firecrawl that gives AI coding agents web scraping, crawling, search, and structured data extraction capabilities. It supports batch operations, deep research mode, and agent-friendly extraction with configurable output formats across multiple AI client environments.
Entity-based synthetic data generation for enterprise
K2view is an enterprise data platform that generates synthetic data using an entity-based micro-database architecture. It ensures referential integrity across complex multi-relational datasets by treating each business entity as a self-contained unit. Used for privacy-compliant test data generation, data masking, and AI training data creation in financial services, telecom, and healthcare industries.
Open-source library for generating synthetic tabular data
Synthetic Data Vault (SDV) is an open-source Python library, originally developed at MIT, for generating synthetic tabular, relational, and time-series data. It learns statistical patterns from real datasets and produces synthetic versions that preserve distributions, correlations, and referential integrity. Supports single-table, multi-table, and sequential data with built-in privacy and quality metrics.
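The fit-then-sample workflow at the heart of such libraries can be sketched in plain Python. This toy version learns only per-column Gaussian marginals (real synthesizers like SDV also capture cross-column correlations, e.g. via copulas):

```python
import random
import statistics

def fit(rows):
    """Learn per-column mean and stdev from real data (marginals only)."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample(model, n, seed=0):
    """Draw synthetic rows from the fitted Gaussian marginals."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for mu, sd in model] for _ in range(n)]

real = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0]]
model = fit(real)
synthetic = sample(model, 100)  # same shape and marginal statistics, no real rows
```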
Data versioning and pipeline automation for ML
Pachyderm is a data versioning and pipeline automation platform that provides Git-like version control for datasets with automatic data lineage tracking. Acquired by HPE, it enables reproducible ML workflows by connecting data versioning to containerized processing pipelines. Features include automatic provenance tracking, incremental processing, and deduplication for efficient storage of large datasets.
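Pachyderm pipelines are declared as JSON specs that bind a versioned input repo to a containerized transform. A minimal sketch, with illustrative repo, image, and command values:

```json
{
  "pipeline": { "name": "word-count" },
  "input": { "pfs": { "repo": "raw-text", "glob": "/*" } },
  "transform": {
    "image": "python:3.11-slim",
    "cmd": ["python", "/app/count.py"]
  }
}
```

Because inputs are versioned, committing new data to `raw-text` triggers the pipeline only on the changed files, which is what enables incremental processing and automatic provenance.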
Data-centric AI platform for programmatic data labeling
Snorkel AI is a data-centric AI platform that enables programmatic labeling of training data through labeling functions rather than manual annotation. Spun out of Stanford AI Lab, it lets teams write Python functions that encode domain heuristics to label data at scale, with the platform combining weak labels into high-quality training sets. Used by Fortune 500 companies for text, image, and structured data labeling.
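The labeling-function idea can be illustrated with plain Python: each function encodes one heuristic and votes on a label or abstains, and the votes are combined. This sketch uses simple majority vote; Snorkel itself combines votes with a learned generative label model that weighs functions by estimated accuracy:

```python
ABSTAIN, SPAM, HAM = -1, 1, 0

# Labeling functions: heuristics that vote on a label or abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_money_words(text):
    return SPAM if any(w in text.lower() for w in ("free", "winner", "cash")) else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_money_words]

def majority_label(text):
    """Combine weak votes by majority; tie or all-abstain -> ABSTAIN."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    spam, ham = votes.count(SPAM), votes.count(HAM)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else HAM
```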
Multimodal data labeling and curation for production AI
Encord is a data labeling and curation platform for teams building production AI systems with complex multimodal data. It supports image, video, audio, DICOM medical imaging, and 3D point cloud annotation with AI-assisted labeling, advanced ontology management, and quality assurance workflows. Features active learning for prioritizing high-value samples and integrates with major ML frameworks.
Enterprise feature platform for real-time ML
Tecton is an enterprise feature platform for building and serving ML features at scale. Created by the team behind Feast, it provides managed feature engineering, real-time feature computation from streaming data, feature monitoring, and a unified feature store with offline/online consistency. Used by production ML teams to eliminate training-serving skew and accelerate model deployment cycles.
Synthetic data generation platform for privacy and ML
Gretel is a synthetic data platform that generates realistic, privacy-preserving datasets for ML training, testing, and data sharing. It supports tabular, text, and time-series data with configurable privacy guarantees including differential privacy. Features include data augmentation for imbalanced datasets, PII detection and anonymization, and API/SDK access for pipeline integration with BigQuery, Snowflake, and Databricks.
Git-like version control for data lakes and object storage
lakeFS is an open-source platform that brings Git-like branching, committing, and merging to data lakes and object storage. It works on top of S3, GCS, Azure Blob, and MinIO, enabling teams to create isolated data branches for experimentation, run CI/CD for data pipelines, and maintain full data lineage. Acquired DVC in 2025, uniting data version control for both small and enterprise-scale workloads.
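The key property of Git-like data branching is that a branch copies only metadata pointers, never the underlying objects. A toy illustration of that isolation (not lakeFS's actual API):

```python
class DataRepo:
    """Toy Git-for-data: each branch maps object paths to content hashes;
    branching copies pointers only, so it is cheap even for huge datasets."""
    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        self.branches[name] = dict(self.branches[source])  # pointer copy only

    def commit(self, branch, path, object_hash):
        self.branches[branch][path] = object_hash

    def merge(self, source, dest="main"):
        self.branches[dest].update(self.branches[source])

repo = DataRepo()
repo.commit("main", "events/2024.parquet", "abc123")
repo.branch("experiment")
repo.commit("experiment", "events/2024.parquet", "def456")  # isolated change
```

Until `merge` is called, `main` still points at the original object, which is what makes branch-based experimentation and CI for data pipelines safe.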
Open-source data curation platform for LLM fine-tuning
Argilla is an open-source platform for curating and annotating data for LLM fine-tuning and RLHF workflows. It provides collaborative annotation interfaces for text classification, ranking, and preference labeling with integrated quality metrics. Part of the Hugging Face ecosystem, Argilla supports direct dataset publishing to the Hub and integrates with major training frameworks for seamless model improvement pipelines.
Open-source feature store for machine learning
Feast is an open-source feature store that manages and serves ML features for both training and online inference. It prevents training-serving skew by providing consistent feature access across offline and real-time environments. Feast supports batch materialization from data warehouses, real-time feature retrieval, on-demand transformations, and integrates with major data platforms including BigQuery, Snowflake, Redshift, and DynamoDB.
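How a feature store prevents training-serving skew can be shown with a small sketch: a single feature definition is used both to build offline training rows and to materialize values into an online key-value store (names here are illustrative, not Feast's API):

```python
from datetime import datetime

def trip_count_7d(history, now):
    """One feature definition shared by offline training and online serving;
    sharing it is what eliminates training-serving skew."""
    cutoff = now.timestamp() - 7 * 24 * 3600
    return sum(1 for t in history if t.timestamp() >= cutoff)

online_store = {}  # stands in for Redis/DynamoDB

def materialize(entity_rows, now):
    """Precompute the same feature into the online store, keyed by entity."""
    for entity_id, history in entity_rows.items():
        online_store[entity_id] = {"trip_count_7d": trip_count_7d(history, now)}

def get_online_features(entity_id):
    return online_store[entity_id]

materialize({"u1": [datetime(2024, 1, 1), datetime(2024, 1, 7)]},
            now=datetime(2024, 1, 8))
```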
Git-based version control for ML data and pipelines
DVC (Data Version Control) is a free open-source tool that brings Git-like version control to datasets, ML models, and experiment pipelines. It stores pointer files in Git while keeping large data in remote storage like S3, GCS, or Azure. Features include reproducible ML pipelines with DAG-based dependency tracking, experiment management, metrics comparison, and a VS Code extension for visual experiment tracking.
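The pointer-file mechanism is simple enough to sketch: hash the data, push the blob to content-addressed remote storage, and commit only a tiny pointer to Git. A stdlib-only illustration of the idea (in-memory dicts stand in for Git and S3):

```python
import hashlib

remote_storage = {}   # stands in for S3/GCS, keyed by content hash
pointer_files = {}    # small pointer files that get committed to Git

def dvc_add(path, data: bytes):
    """Hash the data, push the blob to remote storage, and keep only a
    content-addressed pointer under version control."""
    digest = hashlib.sha256(data).hexdigest()
    remote_storage[digest] = data
    pointer_files[path] = {"sha256": digest, "size": len(data)}

def dvc_pull(path) -> bytes:
    """Restore the data for a pointer by fetching the blob it names."""
    return remote_storage[pointer_files[path]["sha256"]]

dvc_add("data/train.csv", b"id,label\n1,cat\n2,dog\n")
```

Because pointers are plain text, Git history stays small while every dataset version remains retrievable by hash.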
Open-source multi-type data labeling platform
Label Studio is an open-source data labeling tool by HumanSignal supporting images, text, audio, video, and time series. It offers ML-assisted pre-labeling, customizable XML-based annotation interfaces, multi-user review workflows, and REST API access. Used for computer vision, NLP, speech, and LLM fine-tuning including RLHF annotation pipelines.
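The XML-based interfaces mentioned above are declared with nested tags describing the data to show and the labels to collect. A minimal sketch of a bounding-box config in that style (label values are illustrative; see the tag reference for the full set):

```xml
<View>
  <Image name="img" value="$image"/>
  <RectangleLabels name="bbox" toName="img">
    <Label value="Car"/>
    <Label value="Pedestrian"/>
  </RectangleLabels>
</View>
```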