Building data pipelines and ETL processes, and managing large-scale data infrastructure
Modern data pipeline orchestration with built-in AI
Mage AI is an open-source data pipeline orchestration tool positioned as a modern alternative to Apache Airflow. It provides a visual pipeline editor, native AI integrations for generating pipeline code, real-time streaming support, and built-in data quality checks. Mage handles batch and streaming workloads with a developer-friendly notebook-style interface and deploys to any cloud provider.
Reusable computer vision tools for developers
Supervision is an open-source Python toolkit by Roboflow providing reusable CV utilities for detection, tracking, annotation, and dataset management. It works with any model including YOLO and Hugging Face via a standardized Detections class. Features include 20+ annotators, ByteTrack object tracking, zone counting, speed estimation, and dataset conversion between COCO, YOLO, and Pascal VOC formats.
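Most of these utilities reduce to box geometry; as a toy illustration of the primitive such toolkits build tracking and filtering on (not Supervision's own vectorized implementation), intersection-over-union between two boxes can be computed like this:

```python
# Toy IoU (intersection-over-union) for axis-aligned boxes.
# Illustrative only; Supervision provides batched, NumPy-based versions.

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```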
Deep learning optimization for distributed training
DeepSpeed is Microsoft's open-source deep learning optimization library that makes distributed training and inference easy, efficient, and effective. Its ZeRO optimizer eliminates memory redundancies across data-parallel processes, enabling training of models with trillions of parameters. DeepSpeed supports 3D parallelism combining data, pipeline, and tensor parallelism, along with mixed precision training, gradient checkpointing, and CPU/NVMe offloading for memory-constrained environments.
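The memory savings can be estimated with the ZeRO paper's back-of-envelope accounting (a rough sketch that ignores activations and communication buffers): mixed-precision Adam keeps roughly 16 bytes per parameter, and each ZeRO stage partitions another slice of that across data-parallel ranks.

```python
# Approximate model-state memory per GPU for mixed-precision Adam:
# 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes (fp32 master
# copy, momentum, variance) = 16 bytes/parameter when nothing is sharded.

def model_state_gb(params: float, gpus: int, zero_stage: int) -> float:
    """Rough model-state memory (GB) per GPU, ignoring activations."""
    p, g, opt = 2 * params, 2 * params, 12 * params  # bytes
    if zero_stage >= 1:   # ZeRO-1 shards optimizer states
        opt /= gpus
    if zero_stage >= 2:   # ZeRO-2 also shards gradients
        g /= gpus
    if zero_stage >= 3:   # ZeRO-3 also shards the parameters themselves
        p /= gpus
    return (p + g + opt) / 1e9

# A 7.5B-parameter model on 64 GPUs:
print(model_state_gb(7.5e9, 64, 0))  # 120.0 GB per GPU: does not fit
print(model_state_gb(7.5e9, 64, 3))  # 1.875 GB per GPU with ZeRO-3
```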
Python library for declarative data loading that LLMs can generate
dlt (data load tool) is a Python library for building data pipelines with declarative, schema-aware loading that is simple enough for LLMs to generate correctly. It extracts data from APIs, databases, and files, normalizes nested structures, handles schema evolution, and loads into warehouses and lakes. Supports 30+ destinations including BigQuery, Snowflake, DuckDB, and PostgreSQL. Over 5,200 GitHub stars.
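As a toy illustration of the normalization step such a loader performs (not dlt's actual code; the `__` column naming and `_parent_id` key are assumptions of this sketch), nested objects become flattened columns and nested lists become linked child tables:

```python
# Toy normalizer: flatten nested dicts into columns, split nested lists
# into child tables keyed back to the parent row. Illustrative only.

def normalize(records, table):
    tables = {table: []}
    for i, rec in enumerate(records):
        row = {}
        for key, value in rec.items():
            if isinstance(value, dict):      # nested object -> columns
                for k, v in value.items():
                    row[f"{key}__{k}"] = v
            elif isinstance(value, list):    # nested list -> child table
                child = tables.setdefault(f"{table}__{key}", [])
                child.extend({"_parent_id": i, "value": v} for v in value)
            else:
                row[key] = value
        tables[table].append(row)
    return tables

out = normalize([{"id": 1, "address": {"city": "Berlin"}, "tags": ["a", "b"]}], "users")
print(out["users"])        # [{'id': 1, 'address__city': 'Berlin'}]
print(out["users__tags"])  # two child rows linked via _parent_id
```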
Adaptive web scraping library with anti-bot evasion and smart selectors
Scrapling is a Python web scraping library that uses adaptive selectors and anti-bot evasion techniques to extract data from websites reliably. It generates selectors that survive website layout changes by understanding element context rather than relying on brittle CSS paths. Features stealth browser automation, automatic retry logic, and proxy rotation. Over 34,500 GitHub stars.
No-code AI web scraping platform with visual workflow builder
Maxun is a no-code web scraping platform that uses AI to extract structured data from websites through a visual workflow builder. Users point and click on the data they want to extract, and Maxun generates resilient scraping workflows that handle pagination, authentication, and dynamic content. Features anti-bot detection avoidance, scheduled runs, and API access for integration. Over 15,300 GitHub stars.
Managed Postgres platform with 200+ extensions as pre-built stacks
Tembo is a managed PostgreSQL platform that packages 200+ Postgres extensions into purpose-built stacks for specific workloads. Stacks include OLAP analytics, vector search, message queues, geospatial, and machine learning, turning PostgreSQL into a specialized database for each use case. Eliminates the need for separate Redis, Elasticsearch, or Kafka instances alongside Postgres.
Open-source financial data platform for quants, analysts, and AI agents
OpenBB is an open-source financial data platform that normalizes data from 100+ providers into a unified Python SDK, REST API, and Excel Add-in. Positioned as an open-source alternative to the Bloomberg Terminal, it serves developers building fintech applications, quantitative research pipelines, and AI-powered financial analysis tools. With over 65,000 GitHub stars and SOC 2 Type II certification, it is one of the most popular open-source developer tools for financial data.
Multi-model database for the AI era — document, graph, vector, and relational in one
SurrealDB is a multi-model database that natively combines document, graph, relational, key-value, and vector storage in a single engine. It eliminates the need for separate databases by handling structured queries, graph traversals, full-text search, and vector similarity in one SQL-like query language called SurrealQL. Built in Rust for performance and safety, it supports real-time subscriptions, row-level permissions, and embedded or distributed deployment modes.
Git for data — version-controlled SQL database with branch, merge, and diff
Dolt is a SQL database that implements Git-style version control directly on your data. Every write creates a commit, and you can branch, merge, diff, and revert tables just like source code. It speaks the MySQL wire protocol so existing MySQL clients, ORMs, and tools work out of the box. Dolt is used for AI training data management, reproducible analytics, collaborative data editing, and agent memory stores.
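The commit-and-diff model can be sketched in a few lines of toy Python (a simplified mental model, not Dolt's storage engine): every commit snapshots the table, and a diff compares two snapshots row by row.

```python
# Toy model of Git-style table versioning: commits are snapshots,
# diff reports added, removed, and changed primary keys.

commits = []  # each entry: (message, snapshot of the table)

def commit(table, message):
    """Snapshot the table and return the new commit's index."""
    commits.append((message, {k: dict(v) for k, v in table.items()}))
    return len(commits) - 1

def diff(a, b):
    """Compare two commits; returns (added, removed, changed)."""
    ta, tb = commits[a][1], commits[b][1]
    added   = {k: tb[k] for k in tb.keys() - ta.keys()}
    removed = {k: ta[k] for k in ta.keys() - tb.keys()}
    changed = {k: (ta[k], tb[k]) for k in ta.keys() & tb.keys() if ta[k] != tb[k]}
    return added, removed, changed

users = {1: {"name": "ada"}}
c0 = commit(users, "initial import")
users[1]["name"] = "ada lovelace"
users[2] = {"name": "grace"}
c1 = commit(users, "fix name, add grace")
added, removed, changed = diff(c0, c1)
print(sorted(added))    # [2]  row 2 was added
print(sorted(changed))  # [1]  row 1 was modified
```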
Conversational data analysis with natural language queries over databases
PandasAI enables natural-language queries against databases, data lakes, CSVs, and parquet files using LLMs and RAG pipelines. With 23,400+ GitHub stars, it bridges the gap between database tools and AI by letting developers and analysts interact with data conversationally, with support for SQL databases such as PostgreSQL as well as common file formats.
Reactive Python notebooks that version with git and deploy as apps
Marimo is a reactive Python notebook environment with 20,000+ GitHub stars and $4M seed funding. Unlike Jupyter, marimo notebooks automatically update dependent cells when values change, version cleanly with git as pure Python files, and deploy directly as interactive web applications without conversion steps.
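The reactive model can be sketched as a small dependency graph where setting a value invalidates and re-runs every downstream cell (a toy of the idea, not marimo's implementation):

```python
# Toy reactive notebook: each "cell" declares its inputs; setting a
# value invalidates everything downstream and recomputes it in
# dependency order. Illustrative only.

cells = {
    "x": (lambda env: 2, []),
    "y": (lambda env: env["x"] * 10, ["x"]),
    "z": (lambda env: env["y"] + 1, ["y"]),
}

def run_all(env=None):
    env = env or {}
    done = set(env)
    while len(done) < len(cells):           # assumes an acyclic graph
        for name, (fn, deps) in cells.items():
            if name not in done and all(d in done for d in deps):
                env[name] = fn(env)
                done.add(name)
    return env

def set_cell(env, name, value):
    """Invalidate `name` and its transitive dependents, then recompute."""
    stale, changed = {name}, True
    while changed:
        changed = False
        for n, (_, deps) in cells.items():
            if n not in stale and any(d in stale for d in deps):
                stale.add(n)
                changed = True
    cells[name] = (lambda env, v=value: v, [])
    for n in stale:
        env.pop(n, None)
    return run_all(env)

env = run_all()
print(env["z"])                 # 21
env = set_cell(env, "x", 5)
print(env["z"])                 # 51: y and z updated automatically
```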
State-of-the-art OCR toolkit supporting 100+ languages from Baidu
PaddleOCR is an open-source OCR toolkit from Baidu's PaddlePaddle ecosystem with over 73,000 GitHub stars. It provides ultra-lightweight and high-accuracy text detection and recognition for 100+ languages including CJK, Arabic, and Indic scripts. The toolkit offers pre-trained models, easy deployment via pip, and server/edge inference options for document digitization workflows.
High-performance S3-compatible object storage built in Rust
RustFS is an open-source distributed object storage system built entirely in Rust, offering 2.3x faster performance than MinIO for small object payloads. It provides full S3 API compatibility, enabling seamless migration from MinIO, Ceph, and AWS S3 with existing SDKs and CLI tools. Released under Apache 2.0 license, it avoids MinIO's restrictive AGPL terms. Features include distributed architecture, erasure coding, WORM compliance, encryption via RustyVault, and a web management console.
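The core idea behind erasure coding can be illustrated with simple XOR parity (a toy; production systems like RustFS use Reed-Solomon codes with multiple parity shards): one parity shard lets you rebuild any single lost data shard.

```python
# Toy single-parity erasure coding via XOR: parity = A ^ B ^ C, so any
# one lost shard equals the XOR of the survivors and the parity.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

shards = [b"AAAA", b"BBBB", b"CCCC"]
parity = shards[0]
for s in shards[1:]:
    parity = xor_bytes(parity, s)

# Lose shard 1, then rebuild it from the surviving shards plus parity:
rebuilt = parity
for i, s in enumerate(shards):
    if i != 1:
        rebuilt = xor_bytes(rebuilt, s)
print(rebuilt == shards[1])  # True
```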
Google's pretrained foundation model for zero-shot time-series forecasting
TimesFM is a pretrained time-series foundation model from Google Research that performs zero-shot forecasting on diverse datasets without task-specific training. It handles univariate and multivariate time series across domains including finance, logistics, energy, and infrastructure monitoring, with accuracy competitive with traditional statistical methods such as ARIMA and Prophet.
Fully managed RAG-as-a-Service platform for enterprise AI applications
Ragie is a managed retrieval-augmented generation platform that handles document ingestion, indexing, and retrieval so developers can build grounded AI applications without managing vector databases or chunking pipelines. It connects to Google Drive, Notion, Slack, Confluence, and other enterprise data sources with simple APIs for hybrid search and entity extraction.
Blazing-fast Rust-based CSV Swiss Army knife for the terminal
xan is a fast command-line tool for working with CSV files, built in Rust by the Sciences Po medialab team. It provides over 50 subcommands for filtering, sorting, joining, aggregating, and transforming CSV data directly in the terminal. With 3,900 GitHub stars and near-instant processing of multi-gigabyte files, xan replaces workflows that previously required loading data into Python or spreadsheets.
High-performance data engine for multimodal AI workloads
Daft is a high-performance distributed data engine designed specifically for AI and multimodal workloads. It processes structured data alongside images, audio, video, and embeddings natively, outperforming Spark and Polars on AI-specific data pipelines. Built in Rust with a Python API, Daft handles the data engineering challenges unique to machine learning workflows.
SQL-native memory infrastructure for AI agents and applications
Memori is an AI memory engine that provides persistent, queryable memory for agents and applications using SQL-native storage. It stores structured memories with semantic search, temporal awareness, and relationship tracking, enabling AI systems to remember user preferences, past interactions, and contextual facts across sessions. With 12,900 GitHub stars, it offers a database-native approach to the agent memory problem.
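The SQL-native idea can be sketched with SQLite (an assumed toy schema for illustration, not Memori's actual one): memories are plain rows, so recall is just a query ordered by recency.

```python
# Toy SQL-native agent memory: store facts per user with a timestamp,
# recall by keyword and recency. Schema and API are assumptions of
# this sketch, not Memori's.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (user_id TEXT, fact TEXT, created_at REAL)")

def remember(user_id, fact):
    db.execute("INSERT INTO memory VALUES (?, ?, ?)",
               (user_id, fact, time.time()))

def recall(user_id, keyword, limit=3):
    rows = db.execute(
        """SELECT fact FROM memory
           WHERE user_id = ? AND fact LIKE ?
           ORDER BY created_at DESC LIMIT ?""",
        (user_id, f"%{keyword}%", limit))
    return [fact for (fact,) in rows]

remember("u1", "prefers dark mode")
remember("u1", "timezone is UTC+2")
print(recall("u1", "dark"))  # ['prefers dark mode']
```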
Production-grade reinforcement learning framework for LLM training
verl is an open-source reinforcement learning framework designed specifically for training and aligning large language models. Built for production use with support for distributed training across multiple GPUs and nodes, it implements RLHF, DPO, and other alignment algorithms that make LLMs follow instructions, avoid harmful outputs, and generate higher quality responses. Over 580 contributors and 20,000 GitHub stars signal strong adoption.
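The DPO objective such frameworks implement is compact enough to write out directly (a sketch of the loss itself, not verl's API): it rewards the policy for preferring the chosen response over the rejected one by a larger margin than the reference model does.

```python
# DPO loss for one preference pair:
#   loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
# where logp_* are the policy's log-probs of the chosen (w) and
# rejected (l) responses, and ref_* are the reference model's.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2):
print(dpo_loss(-10.0, -20.0, -15.0, -15.0) < math.log(2))  # True
```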
Build real-time temporal knowledge graphs for AI agents
Graphiti is an open-source Python framework by Zep for building temporally-aware knowledge graphs for AI agents. It continuously integrates conversations, business data, and external information into queryable graphs with bi-temporal tracking. The hybrid retrieval combines semantic search, BM25 keywords, and graph traversal for sub-300ms queries without LLM calls at retrieval time.
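One standard way to merge several ranked result lists is reciprocal rank fusion, sketched here to illustrate the hybrid-retrieval idea (not Graphiti's exact fusion code):

```python
# Reciprocal rank fusion (RRF): each list contributes 1 / (k + rank)
# per document, so items ranked well across lists float to the top.

def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["a", "b", "c"]   # e.g. embedding similarity order
bm25     = ["b", "a", "d"]   # keyword relevance order
graph    = ["b", "c", "a"]   # graph-traversal order
print(rrf([semantic, bm25, graph]))  # 'b' wins: high in all three lists
```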
BM25 full-text search extension for PostgreSQL
pg_textsearch is a PostgreSQL extension from Timescale that adds BM25 relevance-ranked full-text search directly inside Postgres. Using the same ranking algorithm as Elasticsearch and Lucene, it provides search-engine quality results without requiring a separate search cluster — particularly valuable for developers building RAG pipelines on PostgreSQL who want semantic-quality ranking alongside pgvector.
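The BM25 formula itself is small enough to sketch in pure Python (illustrative only; the extension computes this inside Postgres): term frequency saturates via k1, and b normalizes for document length.

```python
# Compact BM25 scorer over whitespace-tokenized documents.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    docs = [d.split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["postgres full text search",
        "vector search in postgres",
        "cooking pasta at home"]
scores = bm25_scores("postgres search", docs)
print(scores[2])  # 0.0 — the cooking doc matches neither query term
```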
Serverless vector and full-text search on object storage
turbopuffer is a serverless vector and full-text search engine built on object storage that delivers 10x lower costs than traditional vector databases. Used by Anthropic, Cursor, Notion, and Atlassian for production search workloads. Manages 2+ trillion vectors across 8+ petabytes with automatic scaling and no infrastructure management. Funded by Thrive Capital.
High-performance open-source web crawler optimized for AI pipelines
Crawl4AI is an open-source Python web crawler built specifically for AI and data pipeline use cases. It features parallel crawling, heuristic-based content extraction, cosine similarity chunking for LLM context optimization, and multiple output formats including LLM-ready markdown. Frequently reaches GitHub trending and is adopted by teams building large-scale RAG datasets and training corpora.
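Cosine-similarity chunk filtering can be illustrated with bag-of-words vectors (a toy sketch of the idea, not Crawl4AI's implementation, which works on embeddings): chunks whose vector is close enough to the query are kept for the LLM context.

```python
# Toy relevance filter: cosine similarity between term-count vectors,
# keeping only chunks above a threshold.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def relevant_chunks(query, chunks, threshold=0.2):
    q = Counter(query.lower().split())
    return [c for c in chunks
            if cosine(q, Counter(c.lower().split())) >= threshold]

chunks = ["Rust borrow checker rules",
          "Site map and legal footer",
          "Borrow checker error guide"]
print(relevant_chunks("borrow checker", chunks))  # boilerplate chunk dropped
```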