
Data-Versioned AI Training Stack

Pricing: varies

This stack is for teams that need reproducible AI training pipelines with full dataset version control. It combines Dolt, a SQL database with Git-style versioning, with OpenBB for financial data ingestion and SWE-bench for agent evaluation, providing branching, diffing, and audit trails across the entire data lifecycle.


What This Stack Does

This stack addresses a common reproducibility problem in AI development: teams cannot reliably recreate training conditions because the underlying data changed between runs. Dolt serves as the versioned data store, where every dataset modification is committed with Git-style semantics. Teams branch datasets to experiment with different preprocessing strategies, diff branches to see exactly which rows changed, and merge validated modifications back to the main branch.

Versioned Data and Unified Ingestion

Dolt anchors the stack as the single source of truth for structured training data. It speaks the MySQL wire protocol, so existing MySQL clients and data engineering tools built on them can connect without modification. Data scientists use familiar SQL to query, transform, and audit datasets, and every committed change becomes part of the database's version history. When a model's performance regresses, the team can diff the current training data against the version used for the previous successful training run to identify the exact rows that changed.
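
A regression diff of this kind might be sketched as follows. The sketch assumes a DB-API cursor on a running Dolt SQL server (for example through a MySQL client library such as PyMySQL) that returns rows as tuples, a hypothetical table named `training_samples`, and a hypothetical tag `last-good-run` marking the data used for the earlier run. `DOLT_DIFF` is Dolt's table function for row-level diffs; the helper names are illustrative.

```python
from collections import Counter

# DOLT_DIFF(from_revision, to_revision, table_name) is a Dolt table function.
# Revisions can be branch names, tags, or commit hashes.
DIFF_SQL = "SELECT diff_type FROM DOLT_DIFF(%s, %s, %s)"


def fetch_diff_types(cursor, from_rev, to_rev, table):
    """Run the diff query on an open cursor; return one diff_type per changed row."""
    # Assumes a driver with the %s paramstyle (PyMySQL, mysql-connector-python).
    cursor.execute(DIFF_SQL, (from_rev, to_rev, table))
    return [row[0] for row in cursor.fetchall()]


def summarize(diff_types):
    """Count changed rows by kind: 'added', 'modified', or 'removed'."""
    counts = Counter(diff_types)
    return {kind: counts.get(kind, 0) for kind in ("added", "modified", "removed")}
```

With a live connection, a call such as `summarize(fetch_diff_types(cur, "last-good-run", "main", "training_samples"))` would give a quick count of how far the current data has drifted before anyone inspects individual rows.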

OpenBB feeds the stack with financial and market data from over 100 providers through a unified Python SDK. For teams building fintech AI models, OpenBB normalizes data from disparate sources into consistent formats that Dolt can version. Each data pull is committed with metadata about the source, timestamp, and provider, creating a complete provenance chain from raw market data through to the model that was trained on it.
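
One way to record that provenance is to put it in the commit message itself. The sketch below assumes the pulled rows have already been inserted into a hypothetical table `equity_prices`, and that provenance is stored as JSON in the commit body. That layout is an illustrative convention, not something Dolt or OpenBB prescribes, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def provenance_message(provider, dataset, pulled_at, row_count):
    """Build a commit message: a one-line subject plus a JSON body with provenance."""
    meta = {
        "provider": provider,
        "dataset": dataset,
        # Normalize to UTC so commits from different machines are comparable.
        "pulled_at": pulled_at.astimezone(timezone.utc).isoformat(),
        "rows": row_count,
    }
    return f"ingest {dataset} from {provider}\n\n{json.dumps(meta, sort_keys=True)}"


def commit_statements(table, message):
    """Return (sql, params) pairs that stage one table and commit it in Dolt."""
    return [
        ("CALL DOLT_ADD(%s)", (table,)),
        ("CALL DOLT_COMMIT('-m', %s)", (message,)),
    ]


if __name__ == "__main__":
    msg = provenance_message("yfinance", "equity_prices", datetime.now(timezone.utc), 250)
    for sql, params in commit_statements("equity_prices", msg):
        print(sql, params)  # With a live cursor: cursor.execute(sql, params)
```

Keeping the metadata machine-readable means a later audit can parse commit messages from Dolt's commit log instead of relying on free-text conventions.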

Evaluation and Branch-Per-Experiment Workflow

SWE-bench provides the evaluation harness for measuring how code generation agents perform across dataset versions. Teams can branch the training data, retrain or fine-tune their agent, and run SWE-bench evaluations to measure whether the data change improved the agent's resolution rate on real-world software engineering tasks. This creates a tight feedback loop between data curation and agent capability measurement.
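
The comparison between two dataset versions can be sketched as below. It assumes that each evaluation run has already been reduced to the set of task instance IDs the agent resolved; how those IDs are extracted from the harness output is left out, and the function name and the IDs in the example are hypothetical.

```python
def compare_runs(baseline_resolved, candidate_resolved, total_instances):
    """Compare two evaluation runs over the same set of task instances."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    baseline = set(baseline_resolved)
    candidate = set(candidate_resolved)
    return {
        "baseline_rate": len(baseline) / total_instances,
        "candidate_rate": len(candidate) / total_instances,
        # Tasks the agent trained on the branched data newly resolves.
        "gained": sorted(candidate - baseline),
        # Tasks it no longer resolves; these matter even if the rate went up.
        "lost": sorted(baseline - candidate),
    }


if __name__ == "__main__":
    report = compare_runs({"task-1", "task-2"}, {"task-2", "task-3", "task-4"}, 4)
    print(report)
```

Reporting the gained and lost instances alongside the rates guards against a data change that raises the headline number while quietly breaking tasks that used to pass.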

The workflow follows a branch-per-experiment pattern. A data scientist creates a branch, applies a preprocessing change, commits it, runs training on the branched dataset, evaluates results, and submits a pull request on DoltHub if the change improves model performance. Reviewers inspect the row-level diff, verify the preprocessing logic, and merge. The main branch always contains the team's best-validated training data.
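
The branch, change, and commit steps above might look like this as a sequence of SQL statements. This is a sketch under assumptions: the branch name, the preprocessing statement, and the remote name `origin` are hypothetical, and the review and merge steps are left to the DoltHub pull request rather than scripted. `DOLT_CHECKOUT`, `DOLT_COMMIT`, and `DOLT_PUSH` are Dolt stored procedures.

```python
def experiment_plan(branch, preprocessing_sql, commit_message, remote="origin"):
    """Return ordered (sql, params) pairs for one branch-per-experiment run."""
    plan = [("CALL DOLT_CHECKOUT('-b', %s)", (branch,))]  # create and switch to the branch
    plan += [(sql, ()) for sql in preprocessing_sql]       # apply the data change
    # '-a' stages modified existing tables; new tables would need DOLT_ADD first.
    plan.append(("CALL DOLT_COMMIT('-a', '-m', %s)", (commit_message,)))
    plan.append(("CALL DOLT_PUSH(%s, %s)", (remote, branch)))  # publish for review
    return plan


def run_plan(cursor, plan):
    """Execute the plan in order on an open DB-API cursor connected to Dolt."""
    for sql, params in plan:
        cursor.execute(sql, params)


if __name__ == "__main__":
    steps = experiment_plan(
        "exp/drop-empty-samples",
        ["DELETE FROM training_samples WHERE text = ''"],
        "Drop empty training samples",
    )
    for sql, params in steps:
        print(sql, params)
```

Training and evaluation would run against the pushed branch, and the pull request is opened only if the results justify it.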

The Bottom Line

All tools in this stack are open source. Dolt is licensed under Apache 2.0, OpenBB under AGPL-3.0, and SWE-bench under MIT. The managed offerings, Hosted Dolt and OpenBB Enterprise, provide team features for organizations that need SLA guarantees and dedicated support. The stack can also run entirely self-hosted for teams with privacy requirements around training data.

Stack Overview

| Tool | Role | Pricing | Open Source |
| --- | --- | --- | --- |
| Dolt | Versioned SQL Database for Training Data | Free open-source core; Hosted Dolt managed service available | Yes |
| OpenBB | Financial Data Ingestion from 100+ Providers | Free for individuals; enterprise tiers with premium data | Yes |
| SWE-bench | Agent Evaluation on Real Software Engineering Tasks | Free and open-source under MIT license | Yes |