DataHub is a modern metadata platform originally built at LinkedIn to solve the challenge of discovering, understanding, and governing data assets across a sprawling data ecosystem. The platform provides a searchable catalog where engineers and analysts can find datasets, dashboards, ML models, and data pipelines along with their ownership, documentation, quality metrics, and upstream and downstream dependencies. With over 80 pre-built integrations covering Snowflake, BigQuery, Redshift, dbt, Airflow, Spark, Tableau, Looker, and many more, DataHub can ingest metadata from virtually any data tool in a modern stack.
Column-level lineage tracking is one of DataHub's standout capabilities, showing exactly how data flows from source tables through transformations into downstream dashboards and reports. This lineage graph enables impact analysis before making schema changes and root cause analysis when data quality issues surface. DataHub also provides automated data quality assertions, glossary-based governance with fine-grained access policies, and a domain-based organizational model that maps to how teams actually own and consume data.
With nearly 12,000 GitHub stars and Apache 2.0 licensing, DataHub is used in production by over 3,000 organizations ranging from startups to enterprises. The project is maintained by Acryl Data, which also offers a managed cloud version for teams that prefer not to self-host. For data teams struggling with tribal knowledge, broken pipelines, and unclear data ownership, DataHub provides a centralized platform that makes the entire data ecosystem discoverable, observable, and governable.