Data Science & ML Stack

Python-powered data science with modern tooling for notebooks, experimentation, and deployment.

What This Stack Does

Cursor has quietly become one of the best environments for data science work, thanks to its native support for Jupyter notebooks combined with AI capabilities that are well suited to data manipulation patterns. Unlike a traditional Jupyter environment, where code completion has little awareness of your data, Cursor provides AI-assisted coding inside notebook cells: it can use your DataFrame schemas as context, suggest appropriate pandas transformations, generate matplotlib and seaborn visualizations from natural language descriptions, and explain the statistical methods it recommends. The notebook experience in Cursor retains the interactive, cell-by-cell execution that data scientists love while adding the code navigation, refactoring, and multi-file awareness that notebooks traditionally lack. When you need to extract a data processing function from a notebook cell into a reusable Python module, Cursor's AI agent can handle the refactoring in a single operation: moving the code, adding type hints, creating unit tests, and updating the notebook import. For exploratory data analysis, you can describe what you want to understand about your dataset in plain English, and Cursor will generate the corresponding pandas queries, statistical tests, and visualizations. This is particularly valuable for data scientists who are experts in their domain (biology, finance, marketing) but may not remember every pandas method or matplotlib parameter. Cursor bridges the gap between domain knowledge and Python implementation, which shortens the exploration phase of a data science project.
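To make the notebook-to-module refactoring concrete, here is a minimal sketch of what the extracted code might look like. The module name `features.py`, the function `add_zscore`, and the sample rows are all hypothetical, and the sketch works on plain lists of dicts rather than a pandas DataFrame so that it runs without extra dependencies.

```python
"""features.py: a hypothetical module extracted from a notebook cell."""
from statistics import mean, pstdev
from typing import Dict, List

Row = Dict[str, float]


def add_zscore(rows: List[Row], column: str) -> List[Row]:
    """Return copies of `rows` with a `<column>_z` field holding the z-score."""
    values = [row[column] for row in rows]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; report 0.0 instead of dividing by zero.
        return [{**row, f"{column}_z": 0.0} for row in rows]
    return [{**row, f"{column}_z": (row[column] - mu) / sigma} for row in rows]


def test_add_zscore_centers_values() -> None:
    """The kind of unit test that would accompany the extracted function."""
    rows = [{"price": 10.0}, {"price": 20.0}, {"price": 30.0}]
    scored = add_zscore(rows, "price")
    assert scored[1]["price_z"] == 0.0                     # middle value sits on the mean
    assert scored[0]["price_z"] == -scored[2]["price_z"]   # symmetric around the mean
    assert "price_z" not in rows[0]                        # input rows are not mutated
```

In the notebook, the original cell would then shrink to an import such as `from features import add_zscore`, with the test living alongside the module.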

Claude serves as the AI research assistant in this stack, fulfilling a role that goes far beyond simple code generation. Data science work involves constant decision-making about methodology — which statistical test is appropriate for your data distribution, whether to use random forests or gradient boosting for your prediction task, how to handle missing values without introducing bias, what feature engineering techniques might improve model performance. Claude excels at these methodological conversations because it can reason about the tradeoffs between different approaches, explain the assumptions underlying each method, and suggest alternatives you may not have considered. The workflow is to use Claude for high-level reasoning and strategy while using Cursor for implementation: you might discuss your experiment design with Claude, get recommendations for model architectures and evaluation metrics, then switch to Cursor to implement the pipeline. Claude is also invaluable for literature review and research synthesis — you can describe your problem domain and ask Claude to explain relevant techniques from recent machine learning research, compare different approaches, and suggest which methods are most applicable to your specific dataset characteristics. For data cleaning, one of the most time-consuming phases of any project, Claude can help you develop a strategy for handling outliers, imputing missing values, encoding categorical variables, and normalizing features based on your specific data distribution and downstream modeling goals. The combination of Claude for strategic thinking and Cursor for tactical implementation creates a workflow that is faster and produces better results than either tool alone.
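Once a cleaning strategy has been agreed on, the implementation is usually short. The sketch below assumes one possible outcome of such a discussion: median imputation for missing values, followed by Tukey's IQR rule for flagging outliers. Both choices, the function names, and the sample `ages` values are illustrative assumptions, and whether they suit a given dataset depends on its distribution and the downstream model.

```python
from statistics import median, quantiles
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in values]


def iqr_outlier_mask(values: List[float], k: float = 1.5) -> List[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


# Made-up sample column with one missing value and one suspicious entry.
ages = [23.0, 25.0, None, 24.0, 26.0, 95.0]
filled = impute_median(ages)
flags = iqr_outlier_mask(filled)
```

The median is used here because, unlike the mean, it is not pulled toward the extreme value that the second step goes on to flag.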

Data Storage and the Python Ecosystem

Supabase serves as the data storage layer in this stack, providing a PostgreSQL database that handles everything from raw data ingestion to experiment result tracking. PostgreSQL is an excellent choice for data science work because it supports advanced data types (arrays, JSON, geometric types), has powerful analytical features (window functions, CTEs, lateral joins), and can handle datasets of several hundred gigabytes efficiently with proper indexing. For experiment tracking, you can create tables that store model hyperparameters, training metrics, evaluation scores, and artifact references, building a lightweight experiment tracking system without the overhead of dedicated MLOps platforms like MLflow or Weights & Biases. Supabase's REST API makes it easy to log experiment results from Python scripts: a simple HTTP POST from your training script records the run parameters and metrics, and you can query the results through Supabase's dashboard or via SQL. For data ingestion, Supabase supports bulk inserts via CSV upload, programmatic insertion through the Python client library, and direct PostgreSQL connections for tools such as the pandas read_sql and to_sql methods. The real-time subscription feature enables live dashboards that update as new data arrives or as training runs complete, so you can build a monitoring interface that shows model performance metrics streaming in during a training run. Row-level security and role-based access control also address the data governance challenges that arise in team data science environments, helping ensure that sensitive datasets are only accessible to authorized team members.
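The HTTP POST for logging a run can be sketched with the standard library alone. The project URL, API key placeholder, table name `experiment_runs`, and column names below are hypothetical, and the metric value is made up. The sketch assumes the table already exists with matching columns (for example, a JSON column for `params`). It builds the request without sending it, so it can be inspected safely.

```python
import json
import urllib.request
from typing import Any, Dict


def build_log_request(project_url: str, api_key: str, table: str,
                      run: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a POST that inserts one row via the REST API."""
    return urllib.request.Request(
        url=f"{project_url.rstrip('/')}/rest/v1/{table}",
        data=json.dumps(run).encode("utf-8"),
        method="POST",
        headers={
            "apikey": api_key,
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Prefer": "return=minimal",  # do not echo the inserted row back
        },
    )


# Hypothetical project URL, key, table, and columns; the metric is illustrative.
request = build_log_request(
    "https://your-project.supabase.co",
    "YOUR_SUPABASE_KEY",
    "experiment_runs",
    {"model": "random_forest", "params": {"n_estimators": 200}, "val_accuracy": 0.91},
)

# To actually send it at the end of a training script:
# urllib.request.urlopen(request, timeout=10)
```

Keeping request construction separate from sending makes the logging code easy to unit test, and the key should come from an environment variable rather than being hard-coded.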

The Python ecosystem remains the foundation of data science and machine learning, and this stack leverages its most powerful libraries alongside the modern tooling layer. For data manipulation, pandas and polars (a Rust-based alternative offering speedups of up to 10-100x on some large-dataset workloads) handle tabular data processing, while NumPy provides the numerical computing foundation. For machine learning, scikit-learn covers classical algorithms (random forests, SVMs, clustering, dimensionality reduction) with a consistent API that makes experimentation fast, while PyTorch and TensorFlow serve deep learning use cases such as natural language processing, computer vision, and custom neural architectures. The visualization ecosystem includes matplotlib for publication-quality static plots, seaborn for statistical visualizations, plotly for interactive charts, and streamlit for building data apps and dashboards in pure Python. Cursor's AI is familiar with all of these libraries: it can generate a complete scikit-learn pipeline with preprocessing, feature selection, model training, cross-validation, and evaluation in response to a natural language description of your prediction task. For deployment, the stack uses FastAPI or Next.js API routes on Vercel to serve trained models as REST endpoints, with joblib or ONNX for model serialization and pydantic for request validation. The combination of these libraries with Cursor's AI assistance means that data scientists spend less time on boilerplate code and more time on the creative, intellectually demanding aspects of their work: feature engineering, model selection, and result interpretation.
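For reference, a scikit-learn pipeline of the kind described above might look like the following minimal sketch. The synthetic dataset, the step names, and the hyperparameters are assumptions chosen for illustration, not recommendations, and the example requires scikit-learn to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                          # preprocessing
    ("select", SelectKBest(score_func=f_classif, k=10)),  # feature selection
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# 5-fold cross-validation; every step is re-fit inside each training fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Wrapping the scaler and feature selector in the pipeline means they are fit only on each training fold, which avoids leaking information from the validation fold. Random forests do not strictly need scaling; the step is kept here only to show where preprocessing goes.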

Managing Research Knowledge

Obsidian fills a critical gap in the data science workflow that most tooling stacks ignore: research knowledge management. Data science projects generate enormous amounts of contextual knowledge — literature reviews, experiment hypotheses, data source documentation, methodology decisions and their rationale, meeting notes with domain experts, and lessons learned from failed approaches. This knowledge traditionally lives in scattered Google Docs, Slack messages, and the researcher's memory, making it nearly impossible to onboard new team members or revisit decisions months later. Obsidian solves this with a local-first, Markdown-based knowledge base that supports bidirectional linking, creating a personal wiki where every concept connects to related ideas. A well-structured Obsidian vault for data science might include a folder for literature notes (one note per paper with key findings and relevance to your work), a folder for experiment logs (linked to the corresponding code commits and Supabase result records), a folder for dataset documentation (schema descriptions, data quality issues, collection methodology), and a folder for methodology notes (when to use which algorithm, feature engineering techniques, evaluation strategies). The graph view in Obsidian reveals connections between concepts that you might not have noticed — a technique mentioned in one paper might be relevant to a problem documented in an experiment log, and the visual link between them sparks new ideas. Obsidian's plugin ecosystem includes tools for LaTeX rendering, Dataview (database-like queries over your notes), and Templater for standardized note formats, making it adaptable to rigorous research workflows.
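Because an Obsidian vault is just a folder of Markdown files, experiment-log notes can also be generated from a script. The sketch below is one possible approach, not an Obsidian API: the function name, the frontmatter fields, the note path, and the sample values are all hypothetical. It renders a note with YAML frontmatter and `[[wikilinks]]` to literature notes.

```python
from datetime import date
from typing import List


def render_experiment_note(title: str, run_date: date, commit: str,
                           papers: List[str], summary: str) -> str:
    """Render a Markdown experiment-log note with frontmatter and wikilinks."""
    links = "\n".join(f"- [[{paper}]]" for paper in papers)
    return (
        "---\n"
        f"date: {run_date.isoformat()}\n"
        f"commit: {commit}\n"
        "tags: [experiment]\n"
        "---\n"
        f"# {title}\n\n"
        f"## Summary\n{summary}\n\n"
        f"## Related literature\n{links}\n"
    )


# Hypothetical values; the linked note name is a placeholder, not a real paper.
note = render_experiment_note(
    title="Churn model baseline",
    run_date=date(2024, 1, 15),
    commit="abc1234",
    papers=["Literature/Example paper"],
    summary="Baseline run; see linked notes for methodology.",
)
```

Writing the returned string to a `.md` file inside the vault's experiment-log folder makes it appear in Obsidian, and frontmatter fields like `date` and `commit` are the kind of metadata that Dataview queries can filter on.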

The Bottom Line

Deploying machine learning models and data dashboards is where Vercel and GitHub Actions complete the stack, turning Jupyter notebook experiments into production services. For model serving, the recommended approach is to train your model in a notebook or Python script, serialize it using joblib or ONNX format, and create a FastAPI or Next.js API route that loads the model and serves predictions. Vercel's serverless functions handle this well for lightweight models: a serialized scikit-learn model under 50MB can be deployed as a Vercel serverless function that typically responds to prediction requests in under 100 milliseconds once the function is warm. For larger deep learning models, the API route on Vercel acts as a proxy to a dedicated inference service running on a GPU provider like Replicate or Modal, or on a self-hosted server. For dashboards, Next.js on Vercel combined with a charting library like Recharts, Nivo, or Tremor creates interactive data dashboards that query Supabase for the underlying data. GitHub Actions automates the entire pipeline: a scheduled workflow can run your data collection scripts daily, trigger model retraining when the volume of new data exceeds a threshold, run evaluation tests to ensure the new model outperforms the previous version, and deploy the updated model to Vercel, all without manual intervention. For reproducibility, a GitHub Actions workflow can pin every training run to the same Python version, the same library versions (locked via requirements.txt or poetry.lock), and the same data snapshot, making it possible to reproduce a previous result. This automation transforms data science from ad-hoc notebook exploration into a reliable, repeatable engineering process.
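The retraining trigger and the "new model must beat the old one" check reduce to two small decisions that a scheduled workflow step could call. The following is a minimal sketch under assumed names: `should_retrain`, `should_deploy`, `gate_exit_code`, the `f1` metric key, and the thresholds are all hypothetical, and the metric values in the usage note are made up.

```python
from typing import Dict


def should_retrain(new_rows: int, threshold: int) -> bool:
    """Retrain only when the amount of new data exceeds the threshold."""
    return new_rows > threshold


def should_deploy(candidate: Dict[str, float], current: Dict[str, float],
                  metric: str = "f1", min_gain: float = 0.0) -> bool:
    """Deploy only if the candidate beats the current model by more than min_gain."""
    return candidate[metric] - current[metric] > min_gain


def gate_exit_code(candidate: Dict[str, float], current: Dict[str, float]) -> int:
    """Map the decision to a process exit code: 0 continues the job, 1 stops it."""
    return 0 if should_deploy(candidate, current) else 1
```

For example, `should_deploy({"f1": 0.84}, {"f1": 0.81}, min_gain=0.01)` returns `True`, while a candidate that scores below the current model is rejected. A CI step that exits with `gate_exit_code(...)` would stop the job before the deploy step whenever the candidate does not improve on the current model.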

Stack Overview

| Tool | Role | Pricing | Open Source |
| --- | --- | --- | --- |
| Cursor | AI IDE + Notebooks | Hobby (Free) / Pro $20/mo / Pro+ $60/mo / Ultra $200/mo | No |
| Claude | AI Research Assistant | Free / Pro $20/mo / Team $25/user/mo / Max $100-200/mo / API usage-based | No |
| Supabase | Data Storage | Free tier / Pro $25/mo / Team $599/mo | Yes |
| Vercel | Dashboard Hosting | Free (Hobby) / Pro $20/mo / Enterprise custom | No |
| Obsidian | Research Notes | Free (personal) / Commercial $50/user/year / Sync $4/mo | No |
| GitHub Actions | Pipeline Automation | Free for public repos with standard runners; private repo minutes: Free 2,000/mo, Pro/Team 3,000/mo, Enterprise Cloud 50,000/mo | No |