What Weights & Biases Does
Weights & Biases (W&B) is the experiment tracking platform that became the default for serious ML teams over the past five years, and 2026 has not slowed it down. The product spans run logging, artifact versioning, hyperparameter sweeps, and the W&B Weave evaluation suite for LLM applications, all stitched together by an SDK that drops into any Python training script. PyTorch, TensorFlow, JAX, Hugging Face, Keras, and Lightning users can integrate in two lines; the dashboard takes care of the rest.
Experiment Tracking and the Run Dashboard
The core flow is unchanged from the early days: call wandb.init() at the start of training, pass hyperparameters through its config argument, log metrics with wandb.log(), and the data flows to a project dashboard where you can scrub through runs, compare configurations, and overlay metrics. Where W&B has pulled away from open-source rivals is the polish of that dashboard: parallel coordinates plots, custom panel layouts, group-by faceting, and report-style annotations that turn a dump of training runs into a document a team can actually review.
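A minimal sketch of that flow, assuming the wandb package is installed. The project name demo-project, the fake_epoch_metrics helper, and the decaying loss are hypothetical stand-ins for a real training loop, and offline mode is used so the example does not need an account.

```python
import math


def fake_epoch_metrics(epoch: int) -> dict:
    """Hypothetical stand-in for a real training step: a decaying loss."""
    return {"epoch": epoch, "train/loss": math.exp(-0.3 * epoch)}


def train(config: dict, log_fn) -> list:
    """Run a toy loop and hand each metrics dict to log_fn."""
    history = []
    for epoch in range(config["epochs"]):
        metrics = fake_epoch_metrics(epoch)
        log_fn(metrics)  # with W&B this is wandb.log / run.log
        history.append(metrics)
    return history


def run_with_wandb(config: dict) -> None:
    """Wrap the toy loop in a W&B run (requires the wandb package)."""
    import wandb  # imported lazily so the loop above works without the SDK

    # Hyperparameters go in `config`; metrics go through `run.log`.
    # mode="offline" writes locally instead of syncing to the hosted service.
    with wandb.init(project="demo-project", config=config, mode="offline") as run:
        train(config, run.log)
```

Calling run_with_wandb({"epochs": 5, "learning_rate": 1e-3}) would create one run whose config and per-epoch loss show up in the run directory, or in the project dashboard once the mode argument is dropped.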
Multi-run comparison is the killer feature for teams running hyperparameter sweeps or A/B-testing model variants. You can pick ten runs, layer their loss curves on a single chart, sort the runs table by validation accuracy, and step through configs side by side without exporting anything to a notebook. For teams that have outgrown TensorBoard but cannot quite justify building their own MLflow plus Grafana stack, the W&B dashboard is the path of least resistance.
Artifacts, Sweeps, and the Evaluation Suite
Artifacts are W&B's answer to dataset and model versioning. Every artifact carries a hash, a lineage graph showing which runs produced or consumed it, and a flexible alias system (latest, production, v1.2) that lets you reference artifacts symbolically. The lineage view is genuinely useful when an eval regression appears two months after a model ships and you need to retrace which dataset version it trained on.
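The producer and consumer sides of that lineage graph can be sketched as follows, assuming the wandb package is installed and you are logged in. The artifact name training-data, the project name demo-project, and the artifact_ref helper are hypothetical choices for illustration.

```python
def artifact_ref(name: str, alias: str = "latest") -> str:
    """Build the 'name:alias' string used to reference an artifact symbolically."""
    if ":" in name or ":" in alias:
        raise ValueError("name and alias must not contain ':'")
    return f"{name}:{alias}"


def publish_dataset(path: str, project: str = "demo-project") -> None:
    """Log a file as a dataset artifact and tag it with a 'production' alias."""
    import wandb  # imported lazily; requires the wandb package

    with wandb.init(project=project, job_type="dataset-upload") as run:
        artifact = wandb.Artifact(name="training-data", type="dataset")
        artifact.add_file(path)
        # This run is recorded as the producer of the artifact.
        run.log_artifact(artifact, aliases=["production"])


def fetch_dataset(project: str = "demo-project", alias: str = "production") -> str:
    """Declare the artifact as an input of this run and download it."""
    import wandb

    with wandb.init(project=project, job_type="train") as run:
        # use_artifact records this run as a consumer in the lineage graph.
        artifact = run.use_artifact(artifact_ref("training-data", alias))
        return artifact.download()  # local directory holding the files
```

Because the training run resolves the dataset through an alias rather than a hard-coded path, the lineage view can later show which concrete version the alias pointed to when the model was trained.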
Sweeps handle hyperparameter search with grid, random, and Bayesian strategies, with the controller running either centrally on W&B's infrastructure or locally on your hardware. The W&B Weave eval framework, layered on top, captures prompt traces for LLM applications, runs scoring functions across response samples, and routes ambiguous cases to a human review queue. Together they span the workflow from training-time experimentation to production prompt evaluation.
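As a rough illustration of the sweep side, here is a sketch of a Bayesian sweep definition and agent, assuming the wandb package is installed and you are logged in. The metric name val/loss, the parameter ranges, the project name, and the toy_val_loss objective are all hypothetical; a real trial would train a model instead.

```python
import math

# Sweep definition: search strategy, target metric, and search space.
SWEEP_CONFIG = {
    "method": "bayes",  # the other strategies are "grid" and "random"
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-1,
        },
        "batch_size": {"values": [16, 32, 64]},
    },
}


def toy_val_loss(learning_rate: float, batch_size: int) -> float:
    """Hypothetical objective with its minimum near learning_rate = 1e-3."""
    return (math.log10(learning_rate) + 3) ** 2 + 0.01 * batch_size


def train_one_trial() -> None:
    """One sweep trial: read the sampled hyperparameters, log the metric."""
    import wandb  # imported lazily; requires the wandb package

    with wandb.init() as run:
        cfg = run.config  # populated by the sweep controller
        run.log({"val/loss": toy_val_loss(cfg.learning_rate, cfg.batch_size)})


def launch_sweep(project: str = "demo-project", count: int = 10) -> None:
    """Register the sweep, then run `count` trials on this machine."""
    import wandb

    sweep_id = wandb.sweep(sweep=SWEEP_CONFIG, project=project)
    wandb.agent(sweep_id, function=train_one_trial, count=count)
```

The same sweep ID can be handed to agents on several machines, which is how a search gets parallelized without changing the training function.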
Pricing, Data Residency, and Self-Hosting
The pricing model is where W&B starts to bite for larger teams. The free tier is generous for individuals and academic users (100 GB of storage, unlimited public runs), but a Team plan starts at roughly $50 per seat per month, and the Enterprise tier (which unlocks SSO, audit logs, and W&B Server self-hosting) is a meaningful line item once you scale past a handful of researchers. Data retention is metered separately, so long-running projects with hundreds of artifact versions can produce a bill that surprises whoever owns the budget.
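To put the seat price in perspective, a back-of-the-envelope calculation helps. The sketch below uses the rough $50 per seat per month figure quoted above as its default; actual pricing may differ, and separately metered retention charges are deliberately left out.

```python
def annual_seat_cost(seats: int, price_per_seat_month: float = 50.0) -> float:
    """Estimated yearly seat spend, excluding separately metered charges."""
    if seats < 0:
        raise ValueError("seats must be non-negative")
    return seats * price_per_seat_month * 12


# An eight-person team at the quoted starting price.
team_cost = annual_seat_cost(8)
print(f"8 seats: ${team_cost:,.0f} per year")
```

At the quoted starting price, eight seats come to $4,800 per year before any retention charges, which is the baseline to compare against the engineering time a self-built stack would consume.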
Self-hosting is possible via W&B Server, but it is a heavyweight Kubernetes deployment with a separate Enterprise license. Teams that genuinely need air-gapped training metadata — defense contractors, healthcare research labs handling PHI — can run W&B fully on-prem, but the operational burden is real. Most teams either accept cloud data residency on W&B's SaaS or default to MLflow plus custom dashboards for the OSS path.
Alternatives and the LLM-Specific Angle
Alternatives split along two axes: control and polish. MLflow is the canonical OSS escape hatch (fully self-hostable, no per-seat cost, integrated with Databricks), but the UI lags W&B by a generation and multi-run comparison requires elbow grease. DVC plus Prometheus or Grafana gives you full ownership, but the assembly cost is high. Comet ML and Neptune offer similar SaaS experiences with their own pricing twists. Pick W&B when the team has more than three people, multi-run comparison is part of the weekly workflow, and artifact lineage matters; pick MLflow when budget or self-hosting is the constraint.
For LLM and prompt-engineering teams, W&B Weave is the relevant entry point rather than the classical experiment tracker. Weave focuses on prompt traces, dataset-driven evaluations, and human-in-the-loop scoring, and it overlaps with tools like Langfuse, LangSmith, and Helicone. W&B's pitch here is the unified platform — one workspace for both training runs and prompt evals — though dedicated LLM observability tools currently feel more focused for pure prompt-engineering workflows.
The Bottom Line
W&B is the right default for ML teams that take experiment tracking seriously and have a budget that absorbs per-seat SaaS pricing. The dashboard polish, multi-run comparison, and artifact lineage are class-leading, and the SDK is the lowest-friction integration in the category. Teams operating in regulated environments or running entirely on personal compute should evaluate MLflow first; teams scaling collaborative ML work will spend money on W&B and not regret it.