
lakeFS

Git-like version control for data lakes and object storage

Freemium · Open Source

lakeFS is an open-source platform that brings Git-like branching, committing, and merging to data lakes and object storage. It works on top of S3, GCS, Azure Blob, and MinIO, enabling teams to create isolated data branches for experimentation, run CI/CD for data pipelines, and maintain full data lineage. lakeFS acquired DVC in 2025, uniting data version control for both small and enterprise-scale workloads.

lakeFS applies Git semantics (branches, commits, merges, and diffs) to object storage at petabyte scale. Rather than copying data for each experiment or pipeline run, lakeFS creates lightweight branches that share unchanged objects while isolating modifications. Data engineers and scientists can therefore experiment on production-scale datasets without risk of corrupting the canonical data, merge validated changes back to main, and rely on the commit history as a complete audit trail of every data modification.
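The branching idea above can be illustrated with a toy in-memory model. This is a minimal sketch, not the lakeFS implementation or its API: the `ToyRepo` class and all of its methods are hypothetical names, and the merge is deliberately simplified. It only shows why a branch that copies pointers rather than data is cheap to create and leaves the source branch untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model of zero-copy branching (hypothetical; not lakeFS code)."""

    # Object store: physical address -> bytes. Objects are written once.
    objects: dict = field(default_factory=dict)
    # Each branch maps logical paths to physical addresses.
    branches: dict = field(default_factory=lambda: {"main": {}})

    def put(self, branch, path, data):
        # Writing always creates a new object; only this branch points to it.
        address = f"obj-{len(self.objects)}"
        self.objects[address] = data
        self.branches[branch][path] = address

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source):
        # Copy only the path -> address mapping; no object data is duplicated.
        self.branches[name] = dict(self.branches[source])

    def diff(self, branch, base):
        # Paths whose pointer differs between the two branches.
        left, right = self.branches[branch], self.branches[base]
        return sorted(p for p in left.keys() | right.keys()
                      if left.get(p) != right.get(p))

    def merge(self, source, dest):
        # Simplification: source wins on every path it contains;
        # no conflict detection and no handling of deletions.
        self.branches[dest].update(self.branches[source])


repo = ToyRepo()
repo.put("main", "events/day1.parquet", b"v1")
repo.put("main", "events/day2.parquet", b"v1")

repo.create_branch("experiment", "main")
repo.put("experiment", "events/day2.parquet", b"v2")

print(repo.diff("experiment", "main"))          # ['events/day2.parquet']
print(repo.get("main", "events/day2.parquet"))  # b'v1'
print(len(repo.objects))                        # 3
```

Only three objects exist after the experiment even though two branches each expose two paths, because the unchanged file is shared by pointer.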

The platform operates as a layer on top of existing object storage — S3, GCS, Azure Blob, or MinIO — without requiring data migration. Applications access data through lakeFS's S3-compatible API, making it transparent to existing tools like Spark, Trino, dbt, Airflow, and ML frameworks. lakeFS supports pre-commit and pre-merge hooks for data quality validation, enabling CI/CD-style pipelines that prevent bad data from reaching production. The garbage collection system automatically reclaims storage from deleted branches and unreferenced objects.
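To make the S3-compatible access pattern concrete: when a client talks to lakeFS through its S3 gateway, the repository takes the place of the bucket and the branch (or other ref) becomes the first segment of the object key. The helpers below are a small sketch of that addressing convention; the function names and the repository name `example-repo` are hypothetical and exist only for illustration.

```python
def lakefs_s3_location(repository: str, ref: str, path: str) -> tuple[str, str]:
    """Return (bucket, key) for an object addressed through an S3-compatible gateway.

    Assumes the convention: bucket = repository, key = "<ref>/<path>".
    """
    return repository, f"{ref}/{path.lstrip('/')}"


def lakefs_s3_uri(repository: str, ref: str, path: str) -> str:
    """Build an s3:// URI that an S3-aware tool could be pointed at."""
    bucket, key = lakefs_s3_location(repository, ref, path)
    return f"s3://{bucket}/{key}"


# The same logical file on two branches differs only in the key prefix.
print(lakefs_s3_uri("example-repo", "main", "events/day1.parquet"))
print(lakefs_s3_uri("example-repo", "experiment", "events/day1.parquet"))
```

Because the branch is just part of the key, an existing job can usually be pointed at an experimental branch by changing a path prefix and the S3 endpoint, without changing its code.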

lakeFS acquired DVC in November 2025, creating a unified data version control ecosystem that spans from individual data science projects (DVC's strength) to enterprise-scale data lakes. The open-source edition provides full branching and versioning capabilities, while lakeFS Enterprise adds features like SSO, RBAC, and advanced garbage collection for large-scale deployments. For organizations managing data lakes where data quality and reproducibility are critical, lakeFS provides the version control infrastructure that brings software engineering discipline to data management.

Pricing

Free open-source; lakeFS Enterprise for teams

Platforms

Server + CLI — on top of S3, GCS, Azure, MinIO

Related Tools


Deep Lake

AI data runtime for multimodal datasets and vector search

Deep Lake is an open-source AI data runtime from Activeloop for storing, versioning, and querying multimodal data and embeddings. It fits teams building RAG, training, evaluation, or dataset-heavy agent workflows that need a bridge between vector search, structured metadata, and large image, text, audio, or video collections.

Open Source

SeekDB

AI-native state store with hybrid vector and full-text search

SeekDB is an open-source AI-native state store from the OceanBase ecosystem that combines MySQL-compatible data access with hybrid vector and full-text retrieval. It targets agent and AI application teams that need embedded or server deployment, copy-on-write style sandboxes, and searchable state without gluing together several separate storage layers.

Open Source

Marqo

Embedding-first search and discovery engine for AI-powered product experiences.

Marqo is an open-source tensor search engine that combines embedding generation and vector search in a single API, removing the need to manage separate embedding pipelines and vector databases. Built for product discovery and multi-modal search, it lets teams index text, images, and structured data together, returning ranked results based on semantic similarity rather than keyword overlap.

freemium

VectorChord

High-recall Postgres vector search at billion scale

VectorChord is a Postgres extension from TensorChord (the tensorchord/VectorChord project) that brings high-recall vector search to PostgreSQL. As the spiritual successor to pgvecto.rs, it combines IVF indexes with RaBitQ quantization, aiming for performance comparable to dedicated vector databases at billion-vector scale while keeping all data inside a single Postgres database: no separate vector store, no two-system sync, and no rewrites when the workload grows.

Open Source

Infinity

AI-native database for hybrid RAG retrieval

Infinity is an AI-native database from InfiniFlow that unifies dense vectors, sparse vectors, tensors, and full-text search in a single engine. Built for retrieval-augmented generation (RAG) at scale, it powers hybrid search workflows where lexical matching, semantic similarity, and reranking all happen against one storage layer instead of four loosely coupled services.

Open Source

Magika

AI-powered file-type detection at Google scale

Magika is an open-source, AI-powered file-type detection tool from Google. It uses a custom deep-learning model of only a few megabytes to identify more than 200 binary and textual content types in milliseconds, even on a single CPU. Magika ships as a CLI, a Python package, a JavaScript/TypeScript library, and an ONNX model, achieves around 99% accuracy on its test set, and is used at Google scale across Gmail, Drive, and Safe Browsing as well as by VirusTotal and abuse.ch.

Free · Open Source

Comparisons