DVC extends Git workflows to handle the distinctive challenges of machine learning projects — tracking large datasets, versioning model artifacts, and reproducing experiments. With over 15,000 GitHub stars, DVC has become a standard tool for data scientists who want reproducibility without abandoning their existing Git workflows. It works by storing lightweight .dvc pointer files in your Git repository while the actual data lives in a configurable remote storage backend such as Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH servers, or local network drives.
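The basic workflow can be sketched with a few commands. This is a minimal example assuming the dvc CLI is installed and run inside an existing Git repository; the file path and bucket name are illustrative:

```shell
# Initialize DVC alongside Git (creates the .dvc/ directory, tracked by Git)
dvc init

# Track a large file: DVC moves it to its cache and writes a small
# data/train.csv.dvc pointer file for Git to version
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Configure a remote backend (-d marks it as the default) and push the data
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```

Teammates who clone the repository then run `dvc pull` to fetch the exact data version that the committed pointer files reference.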
The pipeline system lets teams define multi-stage ML workflows in dvc.yaml files, creating directed acyclic graphs of dependencies between data, code, and outputs. Running dvc repro intelligently re-executes only the stages affected by changes, saving significant compute time. The experiment tracking system enables comparing parameters, metrics, and plots across runs without leaving the terminal or VS Code, making it easy to iterate on model development and share findings with teammates.
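A pipeline of the kind described above might look like the following dvc.yaml sketch, where the script names, data paths, and parameter keys are hypothetical:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared.csv
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Because `train` declares `data/prepared.csv` as a dependency, editing only `src/train.py` and running `dvc repro` skips the `prepare` stage and re-executes `train` alone.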
Originally created by Iterative.ai, DVC was acquired by lakeFS in November 2025, uniting two data version control pioneers. DVC remains free and open-source under the Apache 2.0 license, with the lakeFS platform providing enterprise-scale data versioning for teams managing petabyte-scale multimodal data in object stores. DVC is language- and framework-agnostic, integrating with Python, R, Julia, PyTorch, TensorFlow, and CI/CD systems for fully automated MLOps workflows.