# data-engineering

12 tools tagged

OpenBB

Open-source financial data platform for quants, analysts, and AI agents

OpenBB is an open-source financial data platform that normalizes data from 100+ providers into a unified Python SDK, REST API, and Excel Add-in. It positions itself as an open-source alternative to the Bloomberg Terminal for developers building fintech applications, quantitative research pipelines, and AI-powered financial analysis tools. With over 65,000 GitHub stars and SOC 2 Type II certification, it is one of the most popular open-source developer tools for financial data.

paid, Open Source

Dolt

Git for data — version-controlled SQL database with branch, merge, and diff

Dolt is a SQL database that implements Git-style version control directly on structured data. Table changes can be staged, committed, branched, merged, diffed, and reverted through SQL workflows and a Git-like CLI. It speaks the MySQL wire protocol so existing MySQL clients, ORMs, and tools can connect with minimal driver changes. Dolt is used for AI training data management, reproducible analytics, collaborative data editing, and agent-memory experiments.

open-source
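To make the idea of diffing structured data concrete, here is a minimal Python sketch of a row-level diff between two snapshots of a table keyed by primary key. It is a conceptual illustration only: the `diff_tables` function and the `main_branch` and `feature_branch` snapshots are hypothetical names, and none of this is Dolt's API or storage engine.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Table = Dict[int, Row]  # primary key -> row (hypothetical in-memory stand-in for a table)


def diff_tables(before: Table, after: Table) -> Dict[str, List[Any]]:
    """Compare two table snapshots and report added, removed, and modified rows."""
    added = [after[k] for k in sorted(after.keys() - before.keys())]
    removed = [before[k] for k in sorted(before.keys() - after.keys())]
    modified = [
        (before[k], after[k])
        for k in sorted(before.keys() & after.keys())
        if before[k] != after[k]
    ]
    return {"added": added, "removed": removed, "modified": modified}


# Two snapshots standing in for the same table on two branches.
main_branch: Table = {
    1: {"id": 1, "name": "ada"},
    2: {"id": 2, "name": "bob"},
}
feature_branch: Table = {
    1: {"id": 1, "name": "ada"},
    2: {"id": 2, "name": "robert"},
    3: {"id": 3, "name": "cy"},
}

changes = diff_tables(main_branch, feature_branch)
for kind, rows in changes.items():
    print(kind, rows)
```

A version-controlled database exposes this kind of comparison through SQL and a CLI rather than application code, so the sketch only conveys what a diff reports.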

PandasAI

Conversational data analysis with natural language queries over databases

PandasAI enables natural-language queries against databases, data lakes, CSVs, and Parquet files using LLMs and RAG pipelines. With 23,400+ GitHub stars, it lets developers and analysts interact with data conversationally, and it works with SQL databases such as PostgreSQL as well as file formats such as CSV and Parquet.

open-source

Marimo

Reactive Python notebooks that version with git and deploy as apps

Marimo is a reactive Python notebook environment with 20,000+ GitHub stars and $4M in seed funding. Unlike Jupyter notebooks, marimo notebooks automatically update dependent cells when values change, version cleanly with git as pure Python files, and deploy directly as interactive web applications without conversion steps.

open-source
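The reactive behavior described for Marimo can be pictured as a dependency graph over cells. The sketch below is a simplified model, not marimo's implementation: the `cells` mapping and the `downstream` function are hypothetical, and each cell is reduced to the set of names it defines and the set it references.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical notebook: cell name -> (names it defines, names it references).
cells: Dict[str, Tuple[Set[str], Set[str]]] = {
    "a": ({"x"}, set()),
    "b": ({"y"}, {"x"}),
    "c": ({"z"}, {"y"}),
    "d": ({"w"}, set()),
}


def downstream(changed: str) -> List[str]:
    """Return the cells that depend, directly or transitively, on `changed`."""
    queue = deque([changed])
    seen = {changed}
    stale: List[str] = []
    while queue:
        current = queue.popleft()
        defined = cells[current][0]
        for name, (_, referenced) in cells.items():
            # A cell is stale if it reads any name the current cell defines.
            if name not in seen and defined & referenced:
                seen.add(name)
                stale.append(name)
                queue.append(name)
    return stale


print(downstream("a"))  # cells to re-run after editing cell "a"
print(downstream("d"))  # nothing reads "w", so nothing is stale
```

The list is in discovery order; a real scheduler would also need a topological order when a cell has several upstream dependencies.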

Airbyte

Open-source ELT platform with 350+ data connectors

Airbyte is an open-source ELT platform with 350+ pre-built connectors for syncing data from a wide range of sources to warehouses, lakes, and AI pipelines. It handles incremental syncs, schema evolution, and change data capture, and includes a connector builder for custom integrations. It is used by DoorDash, Replit, and thousands of data teams, and has over 15,000 GitHub stars and more than $150M in funding.

freemium, Open Source
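An incremental sync, as mentioned for Airbyte, generally means reading only the records that changed since the previous run. The following is a minimal sketch of a cursor-based version of that idea. The `incremental_sync` function, the `updated_at` cursor field, and the sample rows are all assumptions for illustration and do not reflect Airbyte's connector protocol.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def incremental_sync(
    source_rows: List[Record],
    cursor_field: str,
    last_cursor: Optional[int],
) -> Tuple[List[Record], Optional[int]]:
    """Return rows newer than `last_cursor` and the cursor value to save for next time."""
    new_rows = [
        row for row in source_rows
        if last_cursor is None or row[cursor_field] > last_cursor
    ]
    new_rows.sort(key=lambda row: row[cursor_field])
    next_cursor = new_rows[-1][cursor_field] if new_rows else last_cursor
    return new_rows, next_cursor


# Hypothetical source table with an integer `updated_at` cursor column.
source = [
    {"id": 1, "updated_at": 1},
    {"id": 2, "updated_at": 2},
    {"id": 3, "updated_at": 3},
]

first_batch, cursor = incremental_sync(source, "updated_at", None)
source.append({"id": 4, "updated_at": 4})
second_batch, cursor = incremental_sync(source, "updated_at", cursor)
third_batch, cursor = incremental_sync(source, "updated_at", cursor)

print(len(first_batch), len(second_batch), len(third_batch), cursor)
```

Persisting the returned cursor between runs is what keeps later syncs small; a full refresh corresponds to passing `None`.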

K2view

Entity-based synthetic data generation for enterprise

K2view is an enterprise data platform that generates synthetic data using an entity-based micro-database architecture. It ensures referential integrity across complex multi-relational datasets by treating each business entity as a self-contained unit. It is used for privacy-compliant test data generation, data masking, and AI training data creation in the financial services, telecom, and healthcare industries.

paid

Monte Carlo

Data and AI observability for enterprise teams

Monte Carlo is a data and AI observability platform that uses ML to monitor pipelines, warehouses, and lakes for quality issues. It detects freshness delays, volume anomalies, schema changes, and distribution shifts before they impact analytics. With 500+ deployments, including at Nasdaq, Honeywell, and Roche, it provides automated root cause analysis, field-level lineage, and incident management. It is available on the AWS and Azure marketplaces.

paid
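As a rough picture of what a volume anomaly check does, the sketch below flags a daily row count that deviates strongly from recent history using a z-score. This is a simple statistical stand-in chosen for illustration, not Monte Carlo's models; `is_volume_anomaly`, the threshold of 3.0, and the sample counts are assumptions.

```python
from statistics import mean, stdev
from typing import List


def is_volume_anomaly(history: List[int], today: int, threshold: float = 3.0) -> bool:
    """Flag `today` if it is more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as anomalous.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for one table over the past week.
history = [1000, 1020, 990, 1010, 1005, 995, 1015]

print(is_volume_anomaly(history, 400))   # a sharp drop in loaded rows
print(is_volume_anomaly(history, 1008))  # within normal variation
```

A fixed threshold like this ignores seasonality and trend, which is one reason observability platforms rely on learned baselines instead.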

TaskWeaver

Code-first agent framework for data analytics tasks

TaskWeaver is Microsoft's open-source code-first agent framework that converts natural language requests into executable Python code for data analytics and workflow automation. Unlike text-based agent frameworks, it preserves rich in-memory data structures like DataFrames across conversation turns, supports custom algorithm plugins as callable functions, and verifies generated code before execution. It includes a Planner for task decomposition and a Code Interpreter for generation and execution.

open-source
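One way to picture verifying generated code before execution is a static check over its syntax tree. The sketch below uses Python's standard `ast` module to reject imports outside an allow-list. It is an illustrative assumption about what such a check could look like, not TaskWeaver's actual verifier; `ALLOWED_MODULES` and `verify` are hypothetical names.

```python
import ast
from typing import List

# Hypothetical allow-list of modules that generated code may import.
ALLOWED_MODULES = {"math", "statistics"}


def verify(code: str) -> List[str]:
    """Return a list of violations found in `code`; an empty list means it passed."""
    violations: List[str] = []
    tree = ast.parse(code)  # raises SyntaxError for unparseable code
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_MODULES:
                    violations.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_MODULES:
                violations.append(f"import from '{node.module}' is not allowed")
    return violations


safe_code = "import math\nprint(math.sqrt(4))"
unsafe_code = "import os\nos.remove('data.csv')"

print(verify(safe_code))
print(verify(unsafe_code))
```

Only code with no violations would be handed to the interpreter. A static check like this cannot catch everything, so it complements sandboxed execution rather than replacing it.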

Kestra

Declarative orchestration for data, AI, and infra

Kestra is an open-source orchestration platform that uses declarative YAML to define event-driven and scheduled workflows for data pipelines, infrastructure automation, and AI workloads. With over 1,200 plugins, it connects to databases, cloud services, APIs, and SaaS tools without custom glue code. Kestra reached version 1.0 LTS with agentic AI capabilities, SDKs for Python, TypeScript, Java, and Go, and SOC 2 compliance. Clients include Leroy Merlin, Huawei, Tencent, and Decathlon.

freemium, Open Source

Unstructured

ETL for LLMs — preprocess any document format

Unstructured is an open-source ETL library that preprocesses and transforms documents from diverse formats into clean, structured data ready for LLM ingestion and RAG pipelines. It handles PDF, HTML, Word, PowerPoint, and many other file types through partitioning, cleaning, and chunking operations. The library offers a connector-based architecture for integrating with various data sources and destinations, making it a key component in document processing workflows for AI applications.

freemium, Open Source
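The partitioning, cleaning, and chunking stages named above can be sketched on plain text with the standard library alone. The functions `partition`, `clean`, and `chunk` below are hypothetical simplifications written for this example. They are not the Unstructured library's API, which works on typed document elements rather than raw strings.

```python
import re
from typing import List


def partition(text: str) -> List[str]:
    """Split raw text into elements at blank lines."""
    return [block for block in re.split(r"\n\s*\n", text) if block.strip()]


def clean(element: str) -> str:
    """Collapse runs of whitespace, including stray line breaks, into single spaces."""
    return re.sub(r"\s+", " ", element).strip()


def chunk(elements: List[str], max_chars: int) -> List[str]:
    """Greedily pack elements into chunks of at most `max_chars` characters.

    An element longer than `max_chars` is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for element in elements:
        candidate = f"{current} {element}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


raw = "Title\n\nFirst   paragraph\nwith a line break.\n\n\nSecond paragraph."
elements = [clean(e) for e in partition(raw)]
chunks = chunk(elements, max_chars=45)

for c in chunks:
    print(repr(c))
```

Chunk boundaries here fall only between elements, so no paragraph is cut mid-sentence; the trade-off is uneven chunk sizes.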

Docling

Get your documents ready for gen AI

Docling is an open-source document processing toolkit by IBM Research that converts complex documents into structured formats optimized for generative AI applications. It parses PDF, DOCX, PPTX, XLSX, HTML, images, audio, and LaTeX with advanced PDF understanding including layout analysis, reading order detection, and table structure recognition. Docling exports to Markdown, HTML, JSON, and DocTags, and integrates natively with LangChain, LlamaIndex, and other AI frameworks for RAG workflows.

open-source

MarkItDown

Convert any file to Markdown for LLM pipelines

MarkItDown is a lightweight Python utility by Microsoft that converts files into clean Markdown optimized for LLM pipelines and text analysis. It supports PDF, Word, Excel, PowerPoint, HTML, images with OCR, audio with transcription, and text formats like CSV, JSON, and XML. The tool preserves document structure including headings, tables, lists, and links while keeping output token-efficient. It offers a CLI, a four-line Python API, Docker support, and a plugin architecture for extensions.

open-source