# data-pipeline

10 tools tagged

Hopsworks

AI Lakehouse with Feature Store for real-time ML

Hopsworks is a data-intensive AI platform combining a Python-centric Feature Store with MLOps capabilities for production ML systems. It provides sub-millisecond feature retrieval powered by RonDB, dual offline and online storage for batch and real-time inference, experiment tracking, a model registry, and deployment pipelines. It is available as a managed cloud service on AWS, Azure, and GCP, self-hosted on Kubernetes, or as a serverless platform.

freemium · Open Source

Pixeltable

Declarative multimodal AI data infrastructure

Pixeltable is declarative data infrastructure for multimodal AI that stores video, audio, images, and documents as first-class column types. Users define computed columns in Python for inference and transformations, and Pixeltable orchestrates execution automatically with incremental updates. Built-in vector search removes the need for a separate vector database and supports RAG and semantic search workflows.

Open Source
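
The computed-column idea in the Pixeltable entry can be sketched in plain Python. This is a toy model and not Pixeltable's API: `TinyTable`, `add_computed`, and the `evaluations` counter are hypothetical names introduced only to show that existing rows are backfilled once and later inserts evaluate only the new row.

```python
from typing import Any, Callable


class TinyTable:
    """Hypothetical toy table for illustration; not the Pixeltable API."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []
        self.computed: dict[str, Callable[[dict[str, Any]], Any]] = {}
        self.evaluations = 0  # how many times a computed function has run

    def add_computed(self, name: str, fn: Callable[[dict[str, Any]], Any]) -> None:
        # Backfill existing rows once, then remember fn for future inserts.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = self._evaluate(fn, row)

    def insert(self, row: dict[str, Any]) -> None:
        new_row = dict(row)
        # Incremental update: only the new row is evaluated.
        for name, fn in self.computed.items():
            new_row[name] = self._evaluate(fn, new_row)
        self.rows.append(new_row)

    def _evaluate(self, fn: Callable[[dict[str, Any]], Any], row: dict[str, Any]) -> Any:
        self.evaluations += 1
        return fn(row)


docs = TinyTable()
docs.insert({"text": "hello world"})
docs.insert({"text": "declarative data"})
docs.add_computed("word_count", lambda r: len(r["text"].split()))
docs.insert({"text": "one more row here"})

word_counts = [row["word_count"] for row in docs.rows]
```

In this sketch the third insert triggers a single evaluation rather than a recomputation of the whole column, which is the property the description calls incremental updates.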

Meltano

Declarative code-first ELT data integration

Meltano is a declarative, code-first data integration engine with 500+ Singer connectors for building ELT pipelines. It replaces custom API integration code with configuration-driven pipeline definitions that live in version control alongside application code. It integrates with dbt for transformation, supports scheduling and monitoring through a unified CLI, and is used to run production pipelines.

Open Source

Great Expectations

Data quality validation framework for Python

Great Expectations is an open-source Python framework for validating, documenting, and profiling data quality. Teams define expectations as expressive unit tests for their data using an intuitive API, then validate datasets against those rules in CI/CD pipelines or production workflows. It connects to pandas, Spark, and SQL sources, generates data documentation automatically, and integrates with orchestrators like Airflow and Prefect for continuous data quality monitoring.

freemium · Open Source
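
The idea of expectations as unit tests for data, as described in the Great Expectations entry, can be sketched without the library. The helper names below (`not_null`, `between`, `validate`) are hypothetical and are not the Great Expectations API; the sample rows are made up for illustration.

```python
from typing import Any, Callable

Row = dict[str, Any]
Expectation = Callable[[list[Row]], dict[str, Any]]


def not_null(column: str) -> Expectation:
    """Build a check that fails for rows where the column is missing or None."""
    def check(rows: list[Row]) -> dict[str, Any]:
        failed = [i for i, row in enumerate(rows) if row.get(column) is None]
        return {"rule": f"{column} is not null", "success": not failed, "failed_rows": failed}
    return check


def between(column: str, low: float, high: float) -> Expectation:
    """Build a check that fails for non-null values outside [low, high]."""
    def check(rows: list[Row]) -> dict[str, Any]:
        failed = [
            i for i, row in enumerate(rows)
            # None values are skipped here; pair with not_null to reject them.
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
        return {"rule": f"{column} in [{low}, {high}]", "success": not failed, "failed_rows": failed}
    return check


def validate(rows: list[Row], expectations: list[Expectation]) -> list[dict[str, Any]]:
    """Run every expectation and collect one result per rule."""
    return [expectation(rows) for expectation in expectations]


orders = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": -5.0},   # violates the range rule
    {"id": 3, "amount": 310.0},
]

results = validate(orders, [not_null("id"), between("amount", 0, 1000)])
all_passed = all(result["success"] for result in results)
```

A CI job could fail the build when `all_passed` is false and print `failed_rows` for each rule, which is roughly the workflow the description refers to.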

JuiceFS

Cloud-native POSIX filesystem on object storage

JuiceFS is a high-performance distributed POSIX filesystem built on object storage such as S3, with metadata kept in engines such as Redis or MySQL. It enables data sharing across thousands of clients with low latency and elastic throughput. JuiceFS ships with a Kubernetes CSI driver, Hadoop SDK compatibility, and FUSE mount support for AI training, big data analytics, and shared storage workloads. It is Apache 2.0 licensed and has 13K+ GitHub stars.

freemium · Open Source

dlt

Python library for declarative data loading that LLMs can generate

dlt (data load tool) is a Python library for building data pipelines with declarative, schema-aware loading that is simple enough for LLMs to generate correctly. It extracts data from APIs, databases, and files, normalizes nested structures, handles schema evolution, and loads into warehouses and lakes. It supports 30+ destinations including BigQuery, Snowflake, DuckDB, and PostgreSQL, and has over 5,200 GitHub stars.

Open Source
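
What "normalizes nested structures" means in the dlt entry can be sketched in plain Python. This is not dlt's implementation: the `normalize` function, the double-underscore naming, and the `_row_id` / `_parent_id` columns are assumptions chosen for illustration. Nested objects become flattened columns, and lists become rows in a child table that points back to the parent row.

```python
from typing import Any, Optional

Tables = dict[str, list[dict[str, Any]]]


def normalize(record: dict[str, Any], table: str, tables: Tables,
              parent_id: Optional[int] = None) -> None:
    """Flatten one record into `tables`, creating child tables for lists."""
    rows = tables.setdefault(table, [])
    row: dict[str, Any] = {"_row_id": len(rows)}
    if parent_id is not None:
        row["_parent_id"] = parent_id
    rows.append(row)  # appended first so children can reference its _row_id
    _flatten(record, "", row, table, tables)


def _flatten(value: dict[str, Any], prefix: str, row: dict[str, Any],
             table: str, tables: Tables) -> None:
    for key, item in value.items():
        column = f"{prefix}{key}"
        if isinstance(item, dict):
            # Nested object: keep it in the same row under a prefixed column name.
            _flatten(item, f"{column}__", row, table, tables)
        elif isinstance(item, list):
            # List: each element becomes a row in a child table.
            for element in item:
                child = element if isinstance(element, dict) else {"value": element}
                normalize(child, f"{table}__{column}", tables, parent_id=row["_row_id"])
        else:
            row[column] = item


tables: Tables = {}
order = {
    "id": 1,
    "user": {"name": "Ada", "address": {"city": "London"}},
    "items": [{"sku": "a"}, {"sku": "b"}],
}
normalize(order, "orders", tables)
```

After the call, `tables` holds one `orders` row with `user__name` and `user__address__city` columns, plus two `orders__items` rows linked through `_parent_id`.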

Daft

High-performance data engine for multimodal AI workloads

Daft is a high-performance distributed data engine designed specifically for AI and multimodal workloads. It processes structured data alongside images, audio, video, and embeddings natively, outperforming Spark and Polars on AI-specific data pipelines. Built in Rust with a Python API, Daft handles the data engineering challenges unique to machine learning workflows.

Open Source

Prefect

Modern workflow orchestration for data pipelines

Prefect is an open-source workflow orchestration framework with 18K+ GitHub stars that provides a Python-native approach to building, scheduling, and monitoring data pipelines. It turns any Python function into a schedulable, observable workflow with decorators. It features automatic retries, caching, concurrency controls, event-driven triggers, and a modern dashboard, and aims to be easier to adopt than Airflow with less boilerplate. Prefect Cloud provides managed orchestration with team collaboration features.

Open Source
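
The decorator-plus-retries pattern mentioned in the Prefect entry can be sketched with the standard library alone. The `task` decorator below is a hypothetical stand-in written for this example and is not Prefect's implementation; `run_log` and the simulated failure are likewise assumptions.

```python
import functools
import time
from typing import Any, Callable


def task(retries: int = 0, retry_delay_seconds: float = 0.0) -> Callable:
    """Hypothetical decorator: wrap a function with retries and a run log."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            attempts = retries + 1
            for attempt in range(1, attempts + 1):
                try:
                    result = fn(*args, **kwargs)
                    wrapper.run_log.append((fn.__name__, attempt, "completed"))
                    return result
                except Exception:
                    wrapper.run_log.append((fn.__name__, attempt, "failed"))
                    if attempt == attempts:
                        raise  # out of retries: surface the original error
                    time.sleep(retry_delay_seconds)
        wrapper.run_log = []  # observable history of every attempt
        return wrapper
    return decorator


calls = {"count": 0}


@task(retries=2)
def fetch_rows() -> list[int]:
    # Simulated flaky source: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return [1, 2, 3]


rows = fetch_rows()
```

The function body stays ordinary Python; retry policy and observability live in the decorator, which is the separation the description points at.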

Dagster

Modern data orchestration for ML and analytics

Dagster is an open-source data orchestration platform with 15K+ GitHub stars that combines pipeline scheduling with software-defined assets, built-in data quality checks, and a modern developer experience. It defines data assets declaratively rather than imperatively. It features asset lineage visualization, partitioned processing, sensor-based triggers, comprehensive testing, and integrated observability. It is a modern alternative to Airflow for teams that want asset-centric orchestration.

Open Source
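
Declaring assets rather than scripting steps, as the Dagster entry describes, can be sketched with the standard library. The `asset` decorator, `ASSETS` registry, and `materialize_all` function below are hypothetical and are not Dagster's API; the sketch assumes each asset's dependencies are named by its function parameters and uses `graphlib` to derive the execution order.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Any, Callable

ASSETS: dict[str, Callable[..., Any]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders() -> list[dict[str, int]]:
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]


@asset
def order_total(raw_orders: list[dict[str, int]]) -> int:
    return sum(order["amount"] for order in raw_orders)


@asset
def report(order_total: int) -> str:
    return f"total revenue: {order_total}"


def materialize_all() -> tuple[dict[str, Any], list[str]]:
    """Compute every asset after its dependencies; return values and run order."""
    graph = {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}
    values: dict[str, Any] = {}
    order: list[str] = []
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: values[dep] for dep in graph[name]}
        values[name] = ASSETS[name](**inputs)
        order.append(name)
    return values, order


values, order = materialize_all()
```

No step ever states "run this, then that": the order falls out of the declared dependencies, and the same graph is what a lineage view would draw.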

dbt

SQL-based data transformation framework

dbt (data build tool) is an open-source SQL transformation framework with 10K+ GitHub stars that lets analytics engineers transform data in their warehouse using SQL SELECT statements. It brings software engineering practices to data: version control, testing, documentation, and CI/CD for SQL. It supports Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, and more, and features Jinja templating, incremental models, snapshots, and a package hub of reusable transformations.

Open Source
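
The idea behind the incremental models mentioned in the dbt entry can be sketched with Python's built-in `sqlite3` module. This is not dbt code or dbt's materialization logic: the table names, the `loaded_at` column, and the `run_incremental` helper are assumptions for illustration. Each run selects only the source rows newer than what the model table already contains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER, loaded_at INTEGER)")
conn.execute("CREATE TABLE events_model (id INTEGER, loaded_at INTEGER)")


def count_rows(connection: sqlite3.Connection, table: str) -> int:
    # Table names here are fixed literals from this example, not user input.
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def run_incremental(connection: sqlite3.Connection) -> int:
    """Append only source rows newer than the newest row already in the model."""
    before = count_rows(connection, "events_model")
    connection.execute(
        """
        INSERT INTO events_model (id, loaded_at)
        SELECT id, loaded_at
        FROM raw_events
        WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), -1) FROM events_model)
        """
    )
    connection.commit()
    return count_rows(connection, "events_model") - before


conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [(1, 1), (2, 2)])
first_run = run_incremental(conn)   # model is empty, so both rows are loaded

conn.execute("INSERT INTO raw_events VALUES (?, ?)", (3, 3))
second_run = run_incremental(conn)  # only the new row passes the filter

total_rows = count_rows(conn, "events_model")
```

The transformation itself is still a single SELECT; the incremental behaviour comes from the filter against the existing model table.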