lakeFS applies Git semantics — branches, commits, merges, and diffs — to object storage at petabyte scale. Rather than copying data for each experiment or pipeline run, lakeFS creates lightweight branches that share unchanged objects while isolating modifications. This lets data engineers and scientists experiment on production-scale datasets without risking the canonical data, merge validated changes back to main, and retain a complete, commit-by-commit audit trail of every change.
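The copy-on-write idea behind lightweight branching can be sketched in a few lines. This is a toy model for illustration only — not lakeFS's actual metadata format — in which a branch is just a map from logical paths to immutable object IDs, so creating a branch copies pointers, never data:

```python
class ToyRepo:
    """Toy copy-on-write branching model (illustration, not lakeFS internals)."""

    def __init__(self):
        self.objects = {}               # object_id -> bytes (immutable store)
        self.branches = {"main": {}}    # branch name -> {path: object_id}

    def put(self, branch, path, data):
        # Writes create a new immutable object; nothing is overwritten
        oid = f"obj-{len(self.objects)}"
        self.objects[oid] = data
        self.branches[branch][path] = oid

    def create_branch(self, name, source):
        # Branching copies the pointer map, not the objects themselves
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]


repo = ToyRepo()
repo.put("main", "events/2024.parquet", b"v1")
repo.create_branch("experiment", "main")

# The branch sees main's data without a copy; writes to it leave main intact
repo.put("experiment", "events/2024.parquet", b"v2")
assert repo.get("main", "events/2024.parquet") == b"v1"
assert repo.get("experiment", "events/2024.parquet") == b"v2"
```

Deleting the experiment branch in this model just drops a pointer map, which is why abandoned experiments cost little until garbage collection reclaims any objects no branch still references.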
The platform operates as a layer on top of existing object storage — S3, GCS, Azure Blob, or MinIO — without requiring data migration. Applications access data through lakeFS's S3-compatible API, so existing tools such as Spark, Trino, dbt, Airflow, and ML frameworks work unchanged. lakeFS supports pre-commit and pre-merge hooks for data quality validation, enabling CI/CD-style pipelines that prevent bad data from reaching production. The garbage collection system reclaims storage from deleted branches and unreferenced objects.
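Through the S3-compatible gateway, a lakeFS repository appears as a bucket, and the first component of the object key names the branch (or other ref), e.g. `s3://my-repo/main/datasets/events.parquet`. A minimal helper showing that addressing convention (the repo, branch, and path names here are hypothetical):

```python
def lakefs_s3_address(repo: str, ref: str, path: str) -> tuple[str, str]:
    """Map (repository, branch-or-commit ref, object path) to the
    (bucket, key) pair used against lakeFS's S3-compatible gateway."""
    return repo, f"{ref}/{path.lstrip('/')}"


bucket, key = lakefs_s3_address("analytics", "experiment", "events/2024.parquet")
assert (bucket, key) == ("analytics", "experiment/events/2024.parquet")
```

With an S3 client such as boto3, pointing `endpoint_url` at the lakeFS server and passing these as `Bucket` and `Key` is typically all that changes; reading the same path under a different ref yields that ref's version of the object.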
lakeFS acquired DVC in November 2025, creating a unified data version control ecosystem that spans from individual data science projects (DVC's strength) to enterprise-scale data lakes. The open-source edition provides full branching and versioning capabilities, while lakeFS Enterprise adds features like SSO, RBAC, and advanced garbage collection for large-scale deployments. For organizations managing data lakes where data quality and reproducibility are critical, lakeFS provides the version control infrastructure that brings software engineering discipline to data management.