DVC (Data Version Control) extends Git workflows to the parts of machine learning projects that Git alone handles poorly: tracking large datasets, versioning model artifacts, and reproducing experiments. With over 15,000 GitHub stars, it has become a standard tool for data scientists who want reproducibility without giving up Git. DVC stores lightweight .dvc pointer files in your Git repository, while the actual data lives in a configurable remote storage backend such as Amazon S3, Google Cloud Storage, Azure Blob Storage, an SSH server, or a local network drive.
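The pointer-file idea can be illustrated with a short Python sketch. This is not DVC's code: the function names add_to_cache and checkout are hypothetical, the local cache directory stands in for a remote backend, and the pointer record is a simplified version of what a .dvc file holds. It only shows the principle that the large file is stored under its content hash while a small record is what gets versioned.

```python
import hashlib
import os
import shutil
import tempfile


def add_to_cache(data_path: str, cache_dir: str) -> dict:
    """Copy a file into a content-addressed cache and return a small pointer.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    with open(data_path, "rb") as f:
        content = f.read()
    digest = hashlib.md5(content).hexdigest()

    # The file is stored under its own hash, so identical content is kept once.
    cached_path = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(cached_path), exist_ok=True)
    if not os.path.exists(cached_path):
        shutil.copyfile(data_path, cached_path)

    # Only this small record would be committed to Git (simplified layout).
    return {"md5": digest, "size": len(content), "path": os.path.basename(data_path)}


def checkout(pointer: dict, cache_dir: str, dest_dir: str) -> str:
    """Recreate the data file in dest_dir using only the pointer and the cache."""
    digest = pointer["md5"]
    cached_path = os.path.join(cache_dir, digest[:2], digest[2:])
    dest_path = os.path.join(dest_dir, pointer["path"])
    shutil.copyfile(cached_path, dest_path)
    return dest_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = os.path.join(tmp, "cache")
        os.makedirs(cache)
        source = os.path.join(tmp, "data.csv")
        with open(source, "wb") as f:
            f.write(b"a,b\n1,2\n")

        pointer = add_to_cache(source, cache)
        os.remove(source)                 # the workspace copy is gone
        checkout(pointer, cache, tmp)     # the pointer is enough to restore it
        print(pointer)
```

Because the pointer is a few dozen bytes regardless of the dataset's size, the Git repository stays small while every commit still identifies exactly which version of the data it used.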

The pipeline system lets teams define multi-stage ML workflows in dvc.yaml files. Each stage declares its dependencies and outputs, and together the stages form a directed acyclic graph linking data, code, and results. Running dvc repro re-executes only the stages affected by a change and skips the rest, which saves compute time on large pipelines. The experiment tracking system compares parameters, metrics, and plots across runs from the terminal or VS Code, making it easier to iterate on a model and share findings with teammates.
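The selective re-execution described above can be sketched in Python as a toy model. It is a conceptual illustration under stated assumptions, not DVC's implementation: the Stage class, the fingerprint and repro functions, and the in-memory data and lock dictionaries are all hypothetical stand-ins for stages in dvc.yaml, files on disk, and recorded state. Stages are assumed to be listed in dependency order already.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    deps: List[str]                      # names of the inputs this stage reads
    out: str                             # name of the single output it writes
    run: Callable[[Dict[str, str]], str]


def fingerprint(stage: Stage, data: Dict[str, str]) -> str:
    """Hash the current contents of every dependency of a stage."""
    h = hashlib.sha256()
    for dep in stage.deps:
        h.update(dep.encode() + b"\0" + data[dep].encode() + b"\0")
    return h.hexdigest()


def repro(stages: List[Stage], data: Dict[str, str], lock: Dict[str, str]) -> List[str]:
    """Run stages in order, skipping any whose dependencies are unchanged.

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for stage in stages:  # assumed to be topologically sorted
        fp = fingerprint(stage, data)
        if lock.get(stage.name) == fp and stage.out in data:
            continue  # inputs match the last recorded run, so reuse the output
        data[stage.out] = stage.run(data)
        lock[stage.name] = fp
        executed.append(stage.name)
    return executed


# A three-stage toy pipeline: prepare -> train -> evaluate.
stages = [
    Stage("prepare", ["raw"], "clean", lambda d: d["raw"].strip().lower()),
    Stage("train", ["clean", "params"], "model",
          lambda d: f"model({d['clean']},{d['params']})"),
    Stage("evaluate", ["model", "clean"], "metrics",
          lambda d: f"score:{len(d['model'])}"),
]
data = {"raw": " ABC ", "params": "lr=0.1"}
lock: Dict[str, str] = {}

print(repro(stages, data, lock))  # first run executes every stage
print(repro(stages, data, lock))  # nothing changed, so nothing runs
data["params"] = "lr=0.01"
print(repro(stages, data, lock))  # only train and evaluate run again
```

Changing only the training parameter leaves the prepare stage untouched, so its output is reused. Skipping that work is where the compute saving comes from on pipelines with expensive early stages.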

Originally created by Iterative.ai, DVC was acquired by lakeFS in November 2025, uniting two data version control pioneers. DVC remains free and open source under the Apache 2.0 license, while the lakeFS platform provides enterprise-scale data versioning for teams that need to manage petabyte-scale, multimodal object stores. DVC is language- and framework-agnostic: it works with Python, R, and Julia projects, with frameworks such as PyTorch and TensorFlow, and with CI/CD systems for automated MLOps workflows.

DVC vs lakeFS — Git-Like ML File Versioning or Data-Lake Branch and Merge

DVC and lakeFS both bring version-control ideas to data, but they operate at different layers. DVC is best for ML teams versioning datasets, models, metrics, and experiment pipelines alongside Git. lakeFS is stronger for data-platform teams that need branch, commit, merge, and CI/CD semantics across object-storage data lakes. Choose DVC for model-centric reproducibility; choose lakeFS for lake-wide data operations and isolation.