# data-pipeline
10 tools tagged
Hopsworks
AI Lakehouse with Feature Store for real-time ML
Hopsworks is a data-intensive AI platform combining a Python-centric Feature Store with MLOps capabilities for production ML systems. It provides sub-millisecond feature retrieval powered by RonDB, dual offline and online storage for batch and real-time inference, experiment tracking, a model registry, and deployment pipelines. It is available as a managed cloud service on AWS, Azure, and GCP, self-hosted on Kubernetes, or as a serverless platform.
Pixeltable
Declarative multimodal AI data infrastructure
Pixeltable is a declarative data infrastructure for multimodal AI that stores video, audio, images, and documents as first-class column types. Define Python computed columns for inference and transformations, and Pixeltable auto-orchestrates execution with incremental updates. Built-in vector search eliminates the need for separate vector databases while supporting RAG and semantic search workflows.
Meltano
Declarative code-first ELT data integration
Meltano is a declarative, code-first data integration engine with 500+ Singer connectors for building ELT pipelines. It replaces custom API integration code with configuration-driven pipeline definitions that live in version control alongside application code. Integrates with dbt for transformation, supports scheduling and monitoring through a unified CLI, and powers production pipelines at scale.
Great Expectations
Data quality validation framework for Python
Great Expectations is an open-source Python framework for validating, documenting, and profiling data quality. Teams define expectations as expressive unit tests for their data using an intuitive API, then validate datasets against those rules in CI/CD pipelines or production workflows. It connects to pandas, Spark, and SQL sources, generates data documentation automatically, and integrates with orchestrators like Airflow and Prefect for continuous data quality monitoring.
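The idea of "expectations as unit tests for data" can be sketched in plain Python. This is a minimal illustration only: the helper names (`expect_not_null`, `expect_between`, `validate`) and the sample rows are hypothetical and are not the Great Expectations API.

```python
# Plain-Python sketch of data expectations; NOT the Great Expectations API.
# All function names and sample data here are hypothetical.

def expect_not_null(rows, column):
    """Succeed when no row has None in the given column."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"expectation": f"{column} is not null",
            "success": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Succeed when every non-null value lies in the closed range [low, high]."""
    failed = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"expectation": f"{column} between {low} and {high}",
            "success": not failed, "failed_rows": failed}


def validate(rows, checks):
    """Run every check and report overall success plus per-check results."""
    results = [check(rows) for check in checks]
    return all(r["success"] for r in results), results


orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},     # violates the range rule
    {"order_id": None, "amount": 10.0},  # violates the not-null rule
]

checks = [
    lambda rows: expect_not_null(rows, "order_id"),
    lambda rows: expect_between(rows, "amount", 0, 1000),
]

ok, results = validate(orders, checks)
```

A CI job could fail the build whenever `ok` is false and publish `results` as a simple data quality report.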
JuiceFS
Cloud-native POSIX filesystem on object storage
JuiceFS is a high-performance distributed POSIX filesystem built on object storage such as S3 and metadata engines such as Redis or MySQL. It enables seamless data sharing across thousands of clients with low latency and elastic throughput. JuiceFS ships with a Kubernetes CSI driver, Hadoop SDK compatibility, and FUSE mount support for AI training, big data analytics, and shared storage workloads. It is Apache 2.0 licensed and has 13K+ GitHub stars.
dlt
Python library for declarative data loading that LLMs can generate
dlt (data load tool) is a Python library for building data pipelines with declarative, schema-aware loading that is simple enough for LLMs to generate correctly. It extracts data from APIs, databases, and files, normalizes nested structures, handles schema evolution, and loads into warehouses and lakes. It supports 30+ destinations including BigQuery, Snowflake, DuckDB, and PostgreSQL, and has over 5,200 GitHub stars.
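What "normalizes nested structures" and "handles schema evolution" can mean in practice is sketched below in plain Python. This is an assumption-laden illustration, not dlt's implementation or API: the `normalize` and `infer_schema` helpers, the `__` naming scheme, and the `_parent_id` column are all hypothetical choices made for this example.

```python
# Plain-Python sketch of nested-record normalization; NOT the dlt API.
# Helper names, the "__" separator, and "_parent_id" are hypothetical.

def normalize(records, table, id_key="id"):
    """Flatten nested dicts into columns and move nested lists to child tables."""
    tables = {table: []}
    for record in records:
        row = {}
        for key, value in record.items():
            if isinstance(value, list):
                # Each list becomes a child table linked back to its parent row.
                child = tables.setdefault(f"{table}__{key}", [])
                for item in value:
                    item_row = dict(item) if isinstance(item, dict) else {"value": item}
                    item_row["_parent_id"] = record[id_key]
                    child.append(item_row)
            elif isinstance(value, dict):
                # One level of nesting is flattened into prefixed columns.
                for sub_key, sub_value in value.items():
                    row[f"{key}__{sub_key}"] = sub_value
            else:
                row[key] = value
        tables[table].append(row)
    return tables


def infer_schema(rows):
    """Union the columns seen so far; a new column simply extends the schema."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


users = [
    {"id": 1, "name": "Ada", "address": {"city": "London"},
     "orders": [{"sku": "A1"}, {"sku": "B2"}]},
    # The second record introduces a new "email" column.
    {"id": 2, "name": "Lin", "address": {"city": "Oslo"},
     "orders": [], "email": "lin@example.com"},
]

tables = normalize(users, "users")
schema = infer_schema(tables["users"])
```

Here `tables` holds a flat `users` table plus a `users__orders` child table, and `schema` picks up the `email` column when it first appears.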
Daft
High-performance data engine for multimodal AI workloads
Daft is a high-performance distributed data engine designed specifically for AI and multimodal workloads. It processes structured data alongside images, audio, video, and embeddings natively, outperforming Spark and Polars on AI-specific data pipelines. Built in Rust with a Python API, Daft handles the data engineering challenges unique to machine learning workflows.
Prefect
Modern workflow orchestration for data pipelines
Prefect is an open-source workflow orchestration framework with 18K+ GitHub stars that provides a Python-native approach to building, scheduling, and monitoring data pipelines. It turns any Python function into a schedulable, observable workflow using decorators. Features include automatic retries, caching, concurrency controls, event-driven triggers, and a modern dashboard. It is positioned as easier to adopt than Airflow, with less boilerplate. Prefect Cloud provides managed orchestration with team collaboration features.
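The decorator-plus-retries pattern described above can be illustrated without any orchestration library. The sketch below is a plain-Python stand-in, not Prefect's implementation: `retrying_task`, its parameters, and the `attempts` attribute are hypothetical names chosen for this example.

```python
# Plain-Python sketch of a retrying task decorator; NOT the Prefect API.
# `retrying_task` and the `attempts` attribute are hypothetical.
import functools
import time


def retrying_task(retries=0, retry_delay_seconds=0.0):
    """Wrap a function so that it is retried up to `retries` times on failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                attempts += 1
                wrapper.attempts = attempts  # record attempts for observability
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempts > retries:
                        raise  # retries exhausted: surface the last error
                    time.sleep(retry_delay_seconds)
        wrapper.attempts = 0
        return wrapper
    return decorator


calls = {"n": 0}


@retrying_task(retries=3)
def flaky_fetch():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "payload"


result = flaky_fetch()
```

The function body stays ordinary Python; retry policy lives entirely in the decorator arguments, which is the property that keeps boilerplate low.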
Dagster
Modern data orchestration for ML and analytics
Dagster is an open-source data orchestration platform with 12K+ GitHub stars that combines pipeline scheduling with software-defined assets, built-in data quality checks, and a modern developer experience. It defines data assets declaratively rather than imperatively. Features include asset lineage visualization, partitioned processing, sensor-based triggers, comprehensive testing, and integrated observability. It is a modern alternative to Airflow for teams that want asset-centric orchestration.
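Declaring assets rather than scripting steps can be sketched with a tiny registry. This is a plain-Python illustration under stated assumptions, not Dagster's implementation: the `asset` decorator, the `ASSETS` registry, and `materialize_all` are hypothetical, and the sketch assumes that each function parameter names an upstream asset.

```python
# Plain-Python sketch of declarative, asset-centric orchestration.
# `asset`, `ASSETS`, and `materialize_all` are hypothetical names.
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+

ASSETS = {}


def asset(fn):
    """Register a function as an asset; its parameters are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize_all():
    """Compute every registered asset in dependency order."""
    # Map each asset to the set of assets it depends on (its lineage).
    graph = {name: set(inspect.signature(fn).parameters)
             for name, fn in ASSETS.items()}
    order = list(TopologicalSorter(graph).static_order())
    values = {}
    for name in order:
        inputs = {dep: values[dep] for dep in graph[name]}
        values[name] = ASSETS[name](**inputs)
    return order, values


@asset
def raw_orders():
    return [10, 20, 30]


@asset
def order_total(raw_orders):
    return sum(raw_orders)


@asset
def report(order_total):
    return f"total={order_total}"


order, values = materialize_all()
```

No execution order is written anywhere; it is derived from the declared dependencies, and the same `graph` mapping is what a lineage view would draw.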
dbt
SQL-based data transformation framework
dbt (data build tool) is an open-source SQL transformation framework with 10K+ GitHub stars that lets analytics engineers transform data in their warehouse using select statements. It brings software engineering practices to data: version control, testing, documentation, and CI/CD for SQL. It supports Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, and more. Features include Jinja templating, incremental models, snapshots, and a package hub of reusable transformations.