The Synthetic Data Vault (SDV) is a Python library, originating from MIT research, that generates synthetic datasets preserving the statistical properties of real data. SDV fits generative models, including Gaussian copulas, CTGAN, and TVAE, to learn the joint distributions and correlations in the source data, and it can enforce user-specified constraints. The synthetic samples it produces are drawn from the fitted model rather than copied from the source records. The library handles single tables, multi-table relational databases with foreign key relationships, and sequential (time-series) data.
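To make the Gaussian copula idea concrete, here is a minimal sketch for two numeric columns using only the Python standard library. It is not SDV's implementation or API: the function names, the toy `real_age` and `real_income` columns, and the choice of empirical marginals are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def empirical_quantile(sorted_values, p):
    """Inverse empirical CDF, interpolating linearly between observed points."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def fit_and_sample(col_a, col_b, num_rows, seed=0):
    """Fit a two-column Gaussian copula and draw num_rows synthetic pairs."""
    rng = random.Random(seed)
    # "Fit": the copula parameter is the correlation of the normal scores.
    rho = correlation(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # Draw a correlated pair of standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + max(0.0, 1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        # Map back to the original scales through the empirical marginals.
        a = empirical_quantile(sorted_a, STD_NORMAL.cdf(z1))
        b = empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))
        rows.append((a, b))
    return rows

# Hypothetical "real" data: income rises with age, plus noise.
data_rng = random.Random(42)
real_age = [data_rng.uniform(20, 65) for _ in range(500)]
real_income = [1000 * age + data_rng.gauss(0, 5000) for age in real_age]

synthetic = fit_and_sample(real_age, real_income, num_rows=500, seed=1)
synth_age = [row[0] for row in synthetic]
synth_income = [row[1] for row in synthetic]

real_corr = correlation(real_age, real_income)
synth_corr = correlation(synth_age, synth_income)
print(f"real correlation: {real_corr:.3f}, synthetic correlation: {synth_corr:.3f}")
```

The synthetic pairs are drawn from the fitted model, so the age-income correlation carries over approximately. Note that the interpolated empirical marginals used here stay within the observed range; a parametric marginal would not necessarily do so.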
For relational data, SDV's multi-table synthesizers preserve referential integrity across related tables, generating consistent synthetic databases in which parent-child relationships and cardinality distributions match the original structure. Built-in quality metrics compare synthetic data against real data on column distributions, pairwise correlations, and boundary adherence. Privacy metrics estimate disclosure risk, helping assess how likely it is that generated data could be used to re-identify individuals in the source dataset.
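The column-distribution and boundary checks described above can be sketched in plain Python. This is an illustrative approximation rather than SDV's own code: the function names `ks_complement` and `boundary_adherence` and the toy columns are assumptions, loosely modeled on the kind of per-column scores such quality reports produce (1.0 is best, 0.0 is worst).

```python
from bisect import bisect_right

def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    A score of 1.0 means the two empirical CDFs are identical.
    """
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    # Both empirical CDFs are step functions that change only at data points,
    # so the largest gap between them occurs at one of those points.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

def boundary_adherence(real, synthetic):
    """Return the fraction of synthetic values inside the real [min, max] range."""
    lo, hi = min(real), max(real)
    inside = sum(1 for value in synthetic if lo <= value <= hi)
    return inside / len(synthetic)

# Hypothetical column values for illustration.
real = [10, 12, 15, 18, 20, 22, 25, 30]
close_synthetic = [11, 13, 14, 19, 21, 23, 24, 29]   # similar distribution
shifted_synthetic = [value + 15 for value in real]   # same shape, wrong location

for name, column in [("close", close_synthetic), ("shifted", shifted_synthetic)]:
    print(name, ks_complement(real, column), boundary_adherence(real, column))
```

A synthetic column that tracks the real distribution scores near 1.0 on both checks, while the shifted column is penalized on both: its CDF diverges from the real one, and most of its values fall outside the observed range.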
SDV's source code is publicly available, and the project is backed by DataCebo, which offers commercial extensions. Early releases were published under the MIT license; more recent releases use the Business Source License, so check the repository's license file for the terms that apply to the version you use. The library integrates naturally into Python ML workflows, producing pandas DataFrames for model training, testing, and analysis. For teams needing realistic test data, privacy-safe datasets for sharing, or augmented training data for ML models, SDV is one of the more accessible entry points to synthetic data generation.