What Weights & Biases Does
Weights & Biases (W&B) is the experiment tracking platform that became the default for serious ML teams over the past five years, and 2026 has not slowed it down. The product spans run logging, artifact versioning, hyperparameter sweeps, and the W&B Weave evaluation suite for LLM applications, all stitched together by an SDK that drops into any Python training script. PyTorch, TensorFlow, JAX, HuggingFace, Keras, and Lightning users can integrate in two lines; the dashboard takes care of the rest.
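For the HuggingFace case, that two-line hookup looks roughly like the sketch below; the project name is a placeholder and the model and dataset setup are elided.

    import wandb
    from transformers import Trainer, TrainingArguments

    wandb.init(project="demo-project")   # line one: open a W&B run
    args = TrainingArguments(
        output_dir="out",
        report_to="wandb",               # line two: route Trainer metrics to that run
    )
    # trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    # trainer.train()  # losses and eval metrics now stream to the dashboard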
Experiment Tracking and the Run Dashboard
The core flow is unchanged from the early days: call wandb.init() at the start of training, log metrics and hyperparameters with wandb.log(), and the data flows to a project dashboard where you can scrub through runs, compare configurations, and overlay metrics. Where W&B has pulled away from open-source rivals is the polish of that dashboard — parallel coordinates plots, custom panel layouts, group-by faceting, and report-style annotations that turn a dump of training runs into a document a team can actually review.
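A minimal sketch of that loop, with a synthetic loss standing in for a real training step and placeholder hyperparameters:

    import random
    import wandb

    # Open a run; config records the hyperparameters alongside the metrics.
    run = wandb.init(project="demo-project", config={"lr": 1e-3, "epochs": 5})

    for epoch in range(run.config.epochs):
        # Synthetic decaying loss standing in for a real training step.
        loss = 1.0 / (epoch + 1) + random.random() * 0.01
        wandb.log({"epoch": epoch, "loss": loss})

    run.finish()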
Multi-run comparison is the killer feature for teams running hyperparameter sweeps or A/B-testing model variants. You can pick ten runs, layer their loss curves on a single chart, sort the runs table by validation accuracy, and step through configs side-by-side without exporting anything to a notebook. For teams that have outgrown TensorBoard but cannot quite justify building their own MLflow + Grafana stack, the W&B dashboard is the path of least resistance.
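When you do want the data outside the dashboard, the same comparison can be scripted through the public export API. A sketch, assuming a placeholder entity/project path and a val_acc summary metric:

    import itertools
    import wandb

    api = wandb.Api()
    # Fetch runs sorted by a summary metric, best validation accuracy first.
    runs = api.runs("my-entity/demo-project", order="-summary_metrics.val_acc")

    # Print the top ten configs next to their headline metric.
    for run in itertools.islice(runs, 10):
        print(run.name, run.config.get("lr"), run.summary.get("val_acc"))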
Artifacts, Sweeps, and the Evaluation Suite
Artifacts are W&B's answer to dataset and model versioning. Every artifact carries a hash, a lineage graph showing which runs produced or consumed it, and a flexible alias system (latest, production, v1.2) that lets you reference artifacts symbolically. The lineage view is genuinely useful when an eval regression appears two months after a model ships and you need to retrace which dataset version it trained on.
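A sketch of the round trip, with placeholder file paths and artifact names, and assuming the production alias was assigned earlier (for example, in the UI):

    import wandb

    # Producer side: version a dataset and record this run in its lineage.
    run = wandb.init(project="demo-project", job_type="dataset-upload")
    artifact = wandb.Artifact("training-data", type="dataset")
    artifact.add_file("data/train.csv")          # contents are hashed for dedup
    run.log_artifact(artifact, aliases=["v1.2"])
    run.finish()

    # Consumer side: a training run references the artifact symbolically.
    train_run = wandb.init(project="demo-project", job_type="train")
    dataset = train_run.use_artifact("training-data:production")
    data_dir = dataset.download()                # pulls the pinned version locally
    train_run.finish()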
Sweeps handle hyperparameter search with grid, random, and Bayesian strategies, with the controller running either centrally on W&B's infrastructure or locally on your hardware. The W&B Weave eval framework, built on the same platform, captures prompt traces for LLM applications, runs scoring functions across response samples, and routes ambiguous cases to a human review queue. Together they cover the full ML lifecycle, from training-time experimentation to production prompt evaluation.
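A minimal sweep sketch, with an assumed val_acc metric and a stubbed training function, running the trials locally through wandb.agent:

    import wandb

    # Bayesian search over learning rate and batch size, maximizing val_acc.
    sweep_config = {
        "method": "bayes",
        "metric": {"name": "val_acc", "goal": "maximize"},
        "parameters": {
            "lr": {"min": 1e-5, "max": 1e-2},
            "batch_size": {"values": [32, 64, 128]},
        },
    }

    def train():
        run = wandb.init()                   # picks up this trial's parameters
        # ... real training would use run.config.lr and run.config.batch_size ...
        wandb.log({"val_acc": 0.9})          # placeholder metric for the sketch
        run.finish()

    sweep_id = wandb.sweep(sweep_config, project="demo-project")
    wandb.agent(sweep_id, function=train, count=20)   # twenty trials on local hardware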
Pricing, Data Residency, and Self-Hosting
The pricing model is where W&B starts to bite for larger teams. The free tier is generous for individuals and academic users (100 GB of storage, unlimited public runs), but a Team plan starts at roughly $50 per seat per month, and the Enterprise tier (which unlocks SSO, audit logs, and W&B Server self-hosting) is a meaningful line item once you scale past a handful of researchers. Data retention is metered separately, so long-running projects with hundreds of artifact versions can rack up surprise storage charges for whoever owns the cloud bill.