Building data pipelines and ETL processes, and managing large-scale data infrastructure
Modern data pipeline orchestration with built-in AI
Mage AI is an open-source data pipeline orchestration tool positioned as a modern alternative to Apache Airflow. It provides a visual pipeline editor, native AI integrations for generating pipeline code, real-time streaming support, and built-in data quality checks. Mage handles batch and streaming workloads with a developer-friendly notebook-style interface and deploys to any cloud provider.
Reusable computer vision tools for developers
Supervision is an open-source Python toolkit by Roboflow providing reusable CV utilities for detection, tracking, annotation, and dataset management. It works with any model including YOLO and Hugging Face via a standardized Detections class. Features include 20+ annotators, ByteTrack object tracking, zone counting, speed estimation, and dataset conversion between COCO, YOLO, and Pascal VOC formats.
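Most of these utilities reduce to box geometry; as a toy illustration of the primitive such toolkits build tracking and filtering on (not Supervision's own vectorized implementation), intersection-over-union between two boxes can be computed like this:

```python
# Toy IoU (intersection-over-union) for axis-aligned boxes.
# Illustrative only; Supervision provides batched, NumPy-based versions.

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```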
Deep learning optimization for distributed training
DeepSpeed is Microsoft's open-source deep learning optimization library that makes distributed training and inference easy, efficient, and effective. Its ZeRO optimizer eliminates memory redundancies across data-parallel processes, enabling training of models with trillions of parameters. DeepSpeed supports 3D parallelism combining data, pipeline, and tensor parallelism, along with mixed precision training, gradient checkpointing, and CPU/NVMe offloading for memory-constrained environments.
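The memory savings can be estimated with the ZeRO paper's back-of-envelope accounting (a rough sketch that ignores activations and communication buffers): mixed-precision Adam keeps roughly 16 bytes per parameter, and each ZeRO stage partitions another slice of that across data-parallel ranks.

```python
# Approximate model-state memory per GPU for mixed-precision Adam:
# 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes (fp32 master
# copy, momentum, variance) = 16 bytes/parameter when nothing is sharded.

def model_state_gb(params: float, gpus: int, zero_stage: int) -> float:
    """Rough model-state memory (GB) per GPU, ignoring activations."""
    p, g, opt = 2 * params, 2 * params, 12 * params  # bytes
    if zero_stage >= 1:   # ZeRO-1 shards optimizer states
        opt /= gpus
    if zero_stage >= 2:   # ZeRO-2 also shards gradients
        g /= gpus
    if zero_stage >= 3:   # ZeRO-3 also shards the parameters themselves
        p /= gpus
    return (p + g + opt) / 1e9

# A 7.5B-parameter model on 64 GPUs:
print(model_state_gb(7.5e9, 64, 0))  # 120.0 GB per GPU: does not fit
print(model_state_gb(7.5e9, 64, 3))  # 1.875 GB per GPU with ZeRO-3
```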
Python library for declarative data loading that LLMs can generate
dlt (data load tool) is a Python library for building data pipelines with declarative, schema-aware loading that is simple enough for LLMs to generate correctly. It extracts data from APIs, databases, and files, normalizes nested structures, handles schema evolution, and loads into warehouses and lakes. Supports 30+ destinations including BigQuery, Snowflake, DuckDB, and PostgreSQL. Over 5,200 GitHub stars.
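As a toy illustration of the normalization step such a loader performs (not dlt's actual code; the `__` column naming and `_parent_id` key are assumptions of this sketch), nested objects become flattened columns and nested lists become linked child tables:

```python
# Toy normalizer: flatten nested dicts into columns, split nested lists
# into child tables keyed back to the parent row. Illustrative only.

def normalize(records, table):
    tables = {table: []}
    for i, rec in enumerate(records):
        row = {}
        for key, value in rec.items():
            if isinstance(value, dict):      # nested object -> columns
                for k, v in value.items():
                    row[f"{key}__{k}"] = v
            elif isinstance(value, list):    # nested list -> child table
                child = tables.setdefault(f"{table}__{key}", [])
                child.extend({"_parent_id": i, "value": v} for v in value)
            else:
                row[key] = value
        tables[table].append(row)
    return tables

out = normalize([{"id": 1, "address": {"city": "Berlin"}, "tags": ["a", "b"]}], "users")
print(out["users"])        # [{'id': 1, 'address__city': 'Berlin'}]
print(out["users__tags"])  # two child rows linked via _parent_id
```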
Adaptive web scraping library with anti-bot evasion and smart selectors
Scrapling is a Python web scraping library that uses adaptive selectors and anti-bot evasion techniques to extract data from websites reliably. It generates selectors that survive website layout changes by understanding element context rather than relying on brittle CSS paths. Features stealth browser automation, automatic retry logic, and proxy rotation. Over 34,500 GitHub stars.
No-code AI web scraping platform with visual workflow builder
Maxun is a no-code web scraping platform that uses AI to extract structured data from websites through a visual workflow builder. Users point and click on the data they want to extract, and Maxun generates resilient scraping workflows that handle pagination, authentication, and dynamic content. Features anti-bot detection avoidance, scheduled runs, and API access for integration. Over 15,300 GitHub stars.
Managed Postgres platform with 200+ extensions as pre-built stacks
Tembo is a managed PostgreSQL platform that packages 200+ Postgres extensions into purpose-built stacks for specific workloads. Stacks include OLAP analytics, vector search, message queues, geospatial, and machine learning, turning PostgreSQL into a specialized database for each use case. Eliminates the need for separate Redis, Elasticsearch, or Kafka instances alongside Postgres.
Open-source financial data platform for quants, analysts, and AI agents
OpenBB is an open-source financial data platform that normalizes data from 100+ providers into a unified Python SDK, REST API, and Excel Add-in. Positioned as an open-source alternative to the Bloomberg Terminal, it serves developers building fintech applications, quantitative research pipelines, and AI-powered financial analysis tools. With over 65,000 GitHub stars and SOC 2 Type II certification, it is one of the most popular open-source developer tools for financial data.
Multi-model database for the AI era — document, graph, vector, and relational in one
SurrealDB is a multi-model database that natively combines document, graph, relational, key-value, and vector storage in a single engine. It eliminates the need for separate databases by handling structured queries, graph traversals, full-text search, and vector similarity in one SQL-like query language called SurrealQL. Built in Rust for performance and safety, it supports real-time subscriptions, row-level permissions, and embedded or distributed deployment modes.
Git for data — version-controlled SQL database with branch, merge, and diff
Dolt is a SQL database that implements Git-style version control directly on your data. Every write creates a commit, and you can branch, merge, diff, and revert tables just like source code. It speaks the MySQL wire protocol so existing MySQL clients, ORMs, and tools work out of the box. Dolt is used for AI training data management, reproducible analytics, collaborative data editing, and agent memory stores.
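The commit-and-diff model can be sketched in a few lines of toy Python (a simplified mental model, not Dolt's storage engine): every commit snapshots the table, and a diff compares two snapshots row by row.

```python
# Toy model of Git-style table versioning: commits are snapshots,
# diff reports added, removed, and changed primary keys.

commits = []  # each entry: (message, snapshot of the table)

def commit(table, message):
    """Snapshot the table and return the new commit's index."""
    commits.append((message, {k: dict(v) for k, v in table.items()}))
    return len(commits) - 1

def diff(a, b):
    """Compare two commits; returns (added, removed, changed)."""
    ta, tb = commits[a][1], commits[b][1]
    added   = {k: tb[k] for k in tb.keys() - ta.keys()}
    removed = {k: ta[k] for k in ta.keys() - tb.keys()}
    changed = {k: (ta[k], tb[k]) for k in ta.keys() & tb.keys() if ta[k] != tb[k]}
    return added, removed, changed

users = {1: {"name": "ada"}}
c0 = commit(users, "initial import")
users[1]["name"] = "ada lovelace"
users[2] = {"name": "grace"}
c1 = commit(users, "fix name, add grace")
added, removed, changed = diff(c0, c1)
print(sorted(added))    # [2]  row 2 was added
print(sorted(changed))  # [1]  row 1 was modified
```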
Conversational data analysis with natural language queries over databases
PandasAI enables natural-language queries against databases, data lakes, CSVs, and parquet files using LLMs and RAG pipelines. With 23,400+ GitHub stars, it bridges the gap between database tools and AI by letting developers and analysts interact with data conversationally, with support for SQL databases such as PostgreSQL as well as common file formats.
Reactive Python notebooks that version with git and deploy as apps
Marimo is a reactive Python notebook environment with 20,000+ GitHub stars and $4M seed funding. Unlike Jupyter, marimo notebooks automatically update dependent cells when values change, version cleanly with git as pure Python files, and deploy directly as interactive web applications without conversion steps.
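The reactive model can be sketched as a small dependency graph where setting a value invalidates and re-runs every downstream cell (a toy of the idea, not marimo's implementation):

```python
# Toy reactive notebook: each "cell" declares its inputs; setting a
# value invalidates everything downstream and recomputes it in
# dependency order. Illustrative only.

cells = {
    "x": (lambda env: 2, []),
    "y": (lambda env: env["x"] * 10, ["x"]),
    "z": (lambda env: env["y"] + 1, ["y"]),
}

def run_all(env=None):
    env = env or {}
    done = set(env)
    while len(done) < len(cells):           # assumes an acyclic graph
        for name, (fn, deps) in cells.items():
            if name not in done and all(d in done for d in deps):
                env[name] = fn(env)
                done.add(name)
    return env

def set_cell(env, name, value):
    """Invalidate `name` and its transitive dependents, then recompute."""
    stale, changed = {name}, True
    while changed:
        changed = False
        for n, (_, deps) in cells.items():
            if n not in stale and any(d in stale for d in deps):
                stale.add(n)
                changed = True
    cells[name] = (lambda env, v=value: v, [])
    for n in stale:
        env.pop(n, None)
    return run_all(env)

env = run_all()
print(env["z"])                 # 21
env = set_cell(env, "x", 5)
print(env["z"])                 # 51: y and z updated automatically
```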
State-of-the-art OCR toolkit supporting 100+ languages from Baidu
PaddleOCR is an open-source OCR toolkit from Baidu's PaddlePaddle ecosystem with over 73,000 GitHub stars. It provides ultra-lightweight and high-accuracy text detection and recognition for 100+ languages including CJK, Arabic, and Indic scripts. The toolkit offers pre-trained models, easy deployment via pip, and server/edge inference options for document digitization workflows.
High-performance S3-compatible object storage built in Rust
RustFS is an open-source distributed object storage system built entirely in Rust, offering 2.3x faster performance than MinIO for small object payloads. It provides full S3 API compatibility, enabling seamless migration from MinIO, Ceph, and AWS S3 with existing SDKs and CLI tools. Released under Apache 2.0 license, it avoids MinIO's restrictive AGPL terms. Features include distributed architecture, erasure coding, WORM compliance, encryption via RustyVault, and a web management console.
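The core idea behind erasure coding can be illustrated with simple XOR parity (a toy; production systems like RustFS use Reed-Solomon codes with multiple parity shards): one parity shard lets you rebuild any single lost data shard.

```python
# Toy single-parity erasure coding via XOR: parity = A ^ B ^ C, so any
# one lost shard equals the XOR of the survivors and the parity.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

shards = [b"AAAA", b"BBBB", b"CCCC"]
parity = shards[0]
for s in shards[1:]:
    parity = xor_bytes(parity, s)

# Lose shard 1, then rebuild it from the surviving shards plus parity:
rebuilt = parity
for i, s in enumerate(shards):
    if i != 1:
        rebuilt = xor_bytes(rebuilt, s)
print(rebuilt == shards[1])  # True
```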
Google's pretrained foundation model for zero-shot time-series forecasting
TimesFM is a pretrained time-series foundation model from Google Research that performs zero-shot forecasting on diverse datasets without task-specific training. It handles univariate and multivariate time series across domains including finance, logistics, energy, and infrastructure monitoring, with accuracy competitive with traditional statistical methods such as ARIMA and Prophet.
Fully managed RAG-as-a-Service platform for enterprise AI applications
Ragie is a managed retrieval-augmented generation platform that handles document ingestion, indexing, and retrieval so developers can build grounded AI applications without managing vector databases or chunking pipelines. It connects to Google Drive, Notion, Slack, Confluence, and other enterprise data sources with simple APIs for hybrid search and entity extraction.
Blazing-fast Rust-based CSV Swiss Army knife for the terminal
xan is a fast command-line tool for working with CSV files, built in Rust by the Sciences Po medialab team. It provides over 50 subcommands for filtering, sorting, joining, aggregating, and transforming CSV data directly in the terminal. With 3,900 GitHub stars and near-instant processing of multi-gigabyte files, xan replaces workflows that previously required loading data into Python or spreadsheets.
High-performance data engine for multimodal AI workloads
Daft is a high-performance distributed data engine designed specifically for AI and multimodal workloads. It processes structured data alongside images, audio, video, and embeddings natively, outperforming Spark and Polars on AI-specific data pipelines. Built in Rust with a Python API, Daft handles the data engineering challenges unique to machine learning workflows.
SQL-native memory infrastructure for AI agents and applications
Memori is an AI memory engine that provides persistent, queryable memory for agents and applications using SQL-native storage. It stores structured memories with semantic search, temporal awareness, and relationship tracking, enabling AI systems to remember user preferences, past interactions, and contextual facts across sessions. With 12,900 GitHub stars, it offers a database-native approach to the agent memory problem.
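The SQL-native idea can be sketched with SQLite (an assumed toy schema for illustration, not Memori's actual one): memories are plain rows, so recall is just a query ordered by recency.

```python
# Toy SQL-native agent memory: store facts per user with a timestamp,
# recall by keyword and recency. Schema and API are assumptions of
# this sketch, not Memori's.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (user_id TEXT, fact TEXT, created_at REAL)")

def remember(user_id, fact):
    db.execute("INSERT INTO memory VALUES (?, ?, ?)",
               (user_id, fact, time.time()))

def recall(user_id, keyword, limit=3):
    rows = db.execute(
        """SELECT fact FROM memory
           WHERE user_id = ? AND fact LIKE ?
           ORDER BY created_at DESC LIMIT ?""",
        (user_id, f"%{keyword}%", limit))
    return [fact for (fact,) in rows]

remember("u1", "prefers dark mode")
remember("u1", "timezone is UTC+2")
print(recall("u1", "dark"))  # ['prefers dark mode']
```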
Production-grade reinforcement learning framework for LLM training
verl is an open-source reinforcement learning framework designed specifically for training and aligning large language models. Built for production use with support for distributed training across multiple GPUs and nodes, it implements RLHF, DPO, and other alignment algorithms that make LLMs follow instructions, avoid harmful outputs, and generate higher quality responses. Over 580 contributors and 20,000 GitHub stars signal strong adoption.
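The DPO objective such frameworks implement is compact enough to write out directly (a sketch of the loss itself, not verl's API): it rewards the policy for preferring the chosen response over the rejected one by a larger margin than the reference model does.

```python
# DPO loss for one preference pair:
#   loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
# where logp_* are the policy's log-probs of the chosen (w) and
# rejected (l) responses, and ref_* are the reference model's.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2):
print(dpo_loss(-10.0, -20.0, -15.0, -15.0) < math.log(2))  # True
```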
Build real-time temporal knowledge graphs for AI agents
Graphiti is an open-source Python framework by Zep for building temporally-aware knowledge graphs for AI agents. It continuously integrates conversations, business data, and external information into queryable graphs with bi-temporal tracking. The hybrid retrieval combines semantic search, BM25 keywords, and graph traversal for sub-300ms queries without LLM calls at retrieval time.
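One standard way to merge several ranked result lists is reciprocal rank fusion, sketched here to illustrate the hybrid-retrieval idea (not Graphiti's exact fusion code):

```python
# Reciprocal rank fusion (RRF): each list contributes 1 / (k + rank)
# per document, so items ranked well across lists float to the top.

def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["a", "b", "c"]   # e.g. embedding similarity order
bm25     = ["b", "a", "d"]   # keyword relevance order
graph    = ["b", "c", "a"]   # graph-traversal order
print(rrf([semantic, bm25, graph]))  # 'b' wins: high in all three lists
```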
BM25 full-text search extension for PostgreSQL
pg_textsearch is a PostgreSQL extension from Timescale that adds BM25 relevance-ranked full-text search directly inside Postgres. Using the same ranking algorithm as Elasticsearch and Lucene, it provides search-engine quality results without requiring a separate search cluster — particularly valuable for developers building RAG pipelines on PostgreSQL who want semantic-quality ranking alongside pgvector.
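The BM25 formula itself is small enough to sketch in pure Python (illustrative only; the extension computes this inside Postgres): term frequency saturates via k1, and b normalizes for document length.

```python
# Compact BM25 scorer over whitespace-tokenized documents.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    docs = [d.split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["postgres full text search",
        "vector search in postgres",
        "cooking pasta at home"]
scores = bm25_scores("postgres search", docs)
print(scores[2])  # 0.0 — the cooking doc matches neither query term
```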
Serverless vector and full-text search on object storage
turbopuffer is a serverless vector and full-text search engine built on object storage that delivers 10x lower costs than traditional vector databases. Used by Anthropic, Cursor, Notion, and Atlassian for production search workloads. Manages 2+ trillion vectors across 8+ petabytes with automatic scaling and no infrastructure management. Funded by Thrive Capital.
High-performance open-source web crawler optimized for AI pipelines
Crawl4AI is an open-source Python web crawler built specifically for AI and data pipeline use cases. It features parallel crawling, heuristic-based content extraction, cosine similarity chunking for LLM context optimization, and multiple output formats including LLM-ready markdown. Frequently reaches GitHub trending and is adopted by teams building large-scale RAG datasets and training corpora.
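Cosine-similarity chunk filtering can be illustrated with bag-of-words vectors (a toy sketch of the idea, not Crawl4AI's implementation, which works on embeddings): chunks whose vector is close enough to the query are kept for the LLM context.

```python
# Toy relevance filter: cosine similarity between term-count vectors,
# keeping only chunks above a threshold.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def relevant_chunks(query, chunks, threshold=0.2):
    q = Counter(query.lower().split())
    return [c for c in chunks
            if cosine(q, Counter(c.lower().split())) >= threshold]

chunks = ["Rust borrow checker rules",
          "Site map and legal footer",
          "Borrow checker error guide"]
print(relevant_chunks("borrow checker", chunks))  # boilerplate chunk dropped
```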