Tools for data labeling, annotation, versioning, feature stores, synthetic data generation, and data curation for machine learning workflows.
Real-time web search and retrieval via MCP
Exa MCP Server provides AI coding agents with real-time web search and content crawling capabilities through the Model Context Protocol. It leverages Exa's neural search API for semantic understanding of queries, returning clean, structured results with full page content extraction. Supports both remote hosted MCP endpoints and local client configurations.
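MCP servers like this one are typically wired into a client through a JSON configuration block. A minimal sketch of such a config, assuming the `exa-mcp-server` npm package name and `EXA_API_KEY` environment variable (check the project's own README for the exact values):

```json
{
  "mcpServers": {
    "exa": {
      "command": "npx",
      "args": ["-y", "exa-mcp-server"],
      "env": { "EXA_API_KEY": "your-api-key-here" }
    }
  }
}
```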
Web scraping and crawling via MCP for AI agents
Firecrawl MCP Server is the official MCP integration for Firecrawl that gives AI coding agents web scraping, crawling, search, and structured data extraction capabilities. It supports batch operations, deep research mode, and agent-friendly extraction with configurable output formats across multiple AI client environments.
Entity-based synthetic data generation for enterprise
K2view is an enterprise data platform that generates synthetic data using an entity-based micro-database architecture. It ensures referential integrity across complex multi-relational datasets by treating each business entity as a self-contained unit. Used for privacy-compliant test data generation, data masking, and AI training data creation in financial services, telecom, and healthcare industries.
Open-source library for generating synthetic tabular data
Synthetic Data Vault (SDV) is an open-source Python library, originally developed at MIT, for generating synthetic tabular, relational, and time-series data. It learns statistical patterns from real datasets and produces synthetic versions that preserve distributions, correlations, and referential integrity. Supports single-table, multi-table, and sequential data with built-in privacy and quality metrics.
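The fit-then-sample workflow at the heart of such libraries can be sketched in plain Python. This toy version learns only per-column Gaussian marginals (real synthesizers like SDV also capture cross-column correlations, e.g. via copulas):

```python
import random
import statistics

def fit(rows):
    """Learn per-column mean and stdev from real data (marginals only)."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample(model, n, seed=0):
    """Draw synthetic rows from the fitted Gaussian marginals."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for mu, sd in model] for _ in range(n)]

real = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0]]
model = fit(real)
synthetic = sample(model, 100)  # same shape and marginal statistics, no real rows
```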
Data versioning and pipeline automation for ML
Pachyderm is a data versioning and pipeline automation platform that provides Git-like version control for datasets with automatic data lineage tracking. Acquired by HPE, it enables reproducible ML workflows by connecting data versioning to containerized processing pipelines. Features include automatic provenance tracking, incremental processing, and deduplication for efficient storage of large datasets.
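Pachyderm pipelines are declared as JSON specs that bind a versioned input repo to a containerized transform. A minimal sketch, with illustrative repo, image, and command values:

```json
{
  "pipeline": { "name": "word-count" },
  "input": { "pfs": { "repo": "raw-text", "glob": "/*" } },
  "transform": {
    "image": "python:3.11-slim",
    "cmd": ["python", "/app/count.py"]
  }
}
```

Because inputs are versioned, committing new data to `raw-text` triggers the pipeline only on the changed files, which is what enables incremental processing and automatic provenance.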
Data-centric AI platform for programmatic data labeling
Snorkel AI is a data-centric AI platform that enables programmatic labeling of training data through labeling functions rather than manual annotation. Spun out of Stanford AI Lab, it lets teams write Python functions that encode domain heuristics to label data at scale, with the platform combining weak labels into high-quality training sets. Used by Fortune 500 companies for text, image, and structured data labeling.
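The labeling-function idea can be illustrated with plain Python: each function encodes one heuristic and votes on a label or abstains, and the votes are combined. This sketch uses simple majority vote; Snorkel itself combines votes with a learned generative label model that weighs functions by estimated accuracy:

```python
ABSTAIN, SPAM, HAM = -1, 1, 0

# Labeling functions: heuristics that vote on a label or abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_money_words(text):
    return SPAM if any(w in text.lower() for w in ("free", "winner", "cash")) else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_money_words]

def majority_label(text):
    """Combine weak votes by majority; tie or all-abstain -> ABSTAIN."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    spam, ham = votes.count(SPAM), votes.count(HAM)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else HAM
```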
Multimodal data labeling and curation for production AI
Encord is a data labeling and curation platform for teams building production AI systems with complex multimodal data. It supports image, video, audio, DICOM medical imaging, and 3D point cloud annotation with AI-assisted labeling, advanced ontology management, and quality assurance workflows. Features active learning for prioritizing high-value samples and integrates with major ML frameworks.
Enterprise feature platform for real-time ML
Tecton is an enterprise feature platform for building and serving ML features at scale. Created by the team behind Feast, it provides managed feature engineering, real-time feature computation from streaming data, feature monitoring, and a unified feature store with offline/online consistency. Used by production ML teams to eliminate training-serving skew and accelerate model deployment cycles.
Synthetic data generation platform for privacy and ML
Gretel is a synthetic data platform that generates realistic, privacy-preserving datasets for ML training, testing, and data sharing. It supports tabular, text, and time-series data with configurable privacy guarantees including differential privacy. Features include data augmentation for imbalanced datasets, PII detection and anonymization, and API/SDK access for pipeline integration with BigQuery, Snowflake, and Databricks.
Git-like version control for data lakes and object storage
lakeFS is an open-source platform that brings Git-like branching, committing, and merging to data lakes and object storage. It works on top of S3, GCS, Azure Blob, and MinIO, enabling teams to create isolated data branches for experimentation, run CI/CD for data pipelines, and maintain full data lineage. Acquired DVC in 2025, uniting data version control for both small and enterprise-scale workloads.
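The key property of Git-like data branching is that a branch copies only metadata pointers, never the underlying objects. A toy illustration of that isolation (not lakeFS's actual API):

```python
class DataRepo:
    """Toy Git-for-data: each branch maps object paths to content hashes;
    branching copies pointers only, so it is cheap even for huge datasets."""
    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        self.branches[name] = dict(self.branches[source])  # pointer copy only

    def commit(self, branch, path, object_hash):
        self.branches[branch][path] = object_hash

    def merge(self, source, dest="main"):
        self.branches[dest].update(self.branches[source])

repo = DataRepo()
repo.commit("main", "events/2024.parquet", "abc123")
repo.branch("experiment")
repo.commit("experiment", "events/2024.parquet", "def456")  # isolated change
```

Until `merge` is called, `main` still points at the original object, which is what makes branch-based experimentation and CI for data pipelines safe.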
Open-source data curation platform for LLM fine-tuning
Argilla is an open-source platform for curating and annotating data for LLM fine-tuning and RLHF workflows. It provides collaborative annotation interfaces for text classification, ranking, and preference labeling with integrated quality metrics. Part of the Hugging Face ecosystem, Argilla supports direct dataset publishing to the Hub and integrates with major training frameworks for seamless model improvement pipelines.
Open-source feature store for machine learning
Feast is an open-source feature store that manages and serves ML features for both training and online inference. It prevents training-serving skew by providing consistent feature access across offline and real-time environments. Feast supports batch materialization from data warehouses, real-time feature retrieval, on-demand transformations, and integrates with major data platforms including BigQuery, Snowflake, Redshift, and DynamoDB.
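How a feature store prevents training-serving skew can be shown with a small sketch: a single feature definition is used both to build offline training rows and to materialize values into an online key-value store (names here are illustrative, not Feast's API):

```python
from datetime import datetime

def trip_count_7d(history, now):
    """One feature definition shared by offline training and online serving;
    sharing it is what eliminates training-serving skew."""
    cutoff = now.timestamp() - 7 * 24 * 3600
    return sum(1 for t in history if t.timestamp() >= cutoff)

online_store = {}  # stands in for Redis/DynamoDB

def materialize(entity_rows, now):
    """Precompute the same feature into the online store, keyed by entity."""
    for entity_id, history in entity_rows.items():
        online_store[entity_id] = {"trip_count_7d": trip_count_7d(history, now)}

def get_online_features(entity_id):
    return online_store[entity_id]

materialize({"u1": [datetime(2024, 1, 1), datetime(2024, 1, 7)]},
            now=datetime(2024, 1, 8))
```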
Git-based version control for ML data and pipelines
DVC (Data Version Control) is a free open-source tool that brings Git-like version control to datasets, ML models, and experiment pipelines. It stores pointer files in Git while keeping large data in remote storage like S3, GCS, or Azure. Features include reproducible ML pipelines with DAG-based dependency tracking, experiment management, metrics comparison, and a VS Code extension for visual experiment tracking.
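The pointer-file mechanism is simple enough to sketch: hash the data, push the blob to content-addressed remote storage, and commit only a tiny pointer to Git. A stdlib-only illustration of the idea (in-memory dicts stand in for Git and S3):

```python
import hashlib

remote_storage = {}   # stands in for S3/GCS, keyed by content hash
pointer_files = {}    # small pointer files that get committed to Git

def dvc_add(path, data: bytes):
    """Hash the data, push the blob to remote storage, and keep only a
    content-addressed pointer under version control."""
    digest = hashlib.sha256(data).hexdigest()
    remote_storage[digest] = data
    pointer_files[path] = {"sha256": digest, "size": len(data)}

def dvc_pull(path) -> bytes:
    """Restore the data for a pointer by fetching the blob it names."""
    return remote_storage[pointer_files[path]["sha256"]]

dvc_add("data/train.csv", b"id,label\n1,cat\n2,dog\n")
```

Because pointers are plain text, Git history stays small while every dataset version remains retrievable by hash.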
Open-source multi-type data labeling platform
Label Studio is an open-source data labeling tool by HumanSignal supporting images, text, audio, video, and time series. It offers ML-assisted pre-labeling, customizable XML-based annotation interfaces, multi-user review workflows, and REST API access. Used for computer vision, NLP, speech, and LLM fine-tuning including RLHF annotation pipelines.
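The XML-based interfaces mentioned above are declared with nested tags describing the data to show and the labels to collect. A minimal sketch of a bounding-box config in that style (label values are illustrative; see the tag reference for the full set):

```xml
<View>
  <Image name="img" value="$image"/>
  <RectangleLabels name="bbox" toName="img">
    <Label value="Car"/>
    <Label value="Pedestrian"/>
  </RectangleLabels>
</View>
```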