What Together AI Does
Together AI is a full-stack open-weight model platform that sits between raw GPU rental and hosted frontier APIs. The company offers serverless inference across 200+ open models, on-demand dedicated endpoints that pin a model to reserved H100 or B200 GPUs, managed fine-tuning with LoRA and full-weight options, and an AI-native GPU cloud for teams that want bare-metal control. Every layer speaks an OpenAI-compatible API, so the same client code works whether the team is calling Llama 3.3 70B on serverless, a custom fine-tune on a dedicated endpoint, or a private checkpoint running on a reserved cluster.
The Model Catalog and Inference Speed
The serverless catalog is the broadest in the open-weight market: Llama 3.3 70B, Llama 4 Scout and Maverick, Qwen 3 32B and 235B (including the Thinking reasoning variants), Kimi K2, GPT-OSS 20B and 120B, DeepSeek V3 and R1, Mistral Saba and Mixtral families, plus image, embedding, and reranker models. Together publishes a dedicated Inference Engine stack built on top of vLLM and SGLang with custom kernels, which posts up to 2x faster throughput on demanding models compared to vanilla open-source servers.
Speed is competitive with Fireworks and a step behind Groq and Cerebras on raw tokens-per-second, but Together pulls ahead on larger context windows, longer-tail models, and reasoning-mode endpoints. Streaming, tool calling, JSON mode, structured outputs, function calling, vision, and multi-modal endpoints are all supported through the OpenAI-compatible surface, and the Python and TypeScript SDKs require only a base URL change from existing OpenAI code.
Pricing and When Dedicated Beats Serverless
Serverless prices run from $0.05 per million tokens on the smallest Llama and Qwen variants up to $7/M for the largest flagship models, with most production workloads landing between $0.20 and $2/M. New accounts get a modest free credit that is enough to test every model. The pricing page is one of the clearest in the industry — separate input/output rates per model, no hidden multipliers, and transparent batch and embedding rates.
Together's dedicated endpoints change the economics at scale. Deploying a two-H100 dedicated endpoint typically becomes cheaper than serverless around 130,000 tokens per minute of sustained traffic on a 70B model, and the break-even point is lower for smaller models. Fine-tuning adds another clean tier: LoRA runs $0.48 to $2.90 per million tokens, full fine-tuning $0.54 to $3.20, and the resulting custom weights can be deployed to serverless, dedicated, or downloaded to run anywhere — a flexibility most closed-model platforms do not offer.
Developer Experience and Fine-Tuning Pipeline
The platform documentation is strong, with clear quickstarts for inference, fine-tuning, embeddings, and dedicated endpoints. A web playground supports model comparisons side by side, and the dashboard surfaces per-model usage, latency percentiles, and monthly spend. Together's Python and TypeScript SDKs mirror the OpenAI SDK closely, and integrations cover LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, and most major agent frameworks.
Fine-tuning is where Together differentiates from Groq, Cerebras, and the hosted-model-only providers. Teams upload training data as JSONL, pick LoRA or full fine-tuning, choose a base model from the supported list, and receive a trained adapter or checkpoint that can be deployed on Together serverless, pinned to a dedicated endpoint, or exported to Hugging Face for self-hosting. Evaluation, loss curves, and safe tensors downloads are built in, removing the usual pipeline of managing W&B, Axolotl, and a GPU cluster manually.
Where Together AI Fits in the 2026 Market
Together's main competitors are Fireworks AI on serverless and fine-tuning, Replicate on hosted models, Modal and RunPod on raw GPU compute, and Groq and Cerebras on pure inference speed. Against Fireworks, Together offers a marginally larger model catalog and slightly better pricing transparency, while Fireworks tends to win on the absolute newest model availability. Against Groq and Cerebras, Together loses on raw speed but wins on model breadth, fine-tuning, and larger context windows.
The platform is best suited for teams running meaningful open-weight workloads that value flexibility — mixing serverless for variable traffic, dedicated endpoints for steady production, and fine-tuning without migrating vendors. It is a weaker fit for teams that only need one frontier model (call OpenAI or Anthropic directly) or teams whose main bottleneck is inference latency on a single open model (Groq or Cerebras will feel faster).
The Bottom Line
Together AI in 2026 is the most complete open-weight platform on the market. The catalog is the broadest, the fine-tuning pipeline is the smoothest outside of roll-your-own Axolotl setups, and the dedicated-endpoint pricing story is honest about when serverless stops being the right shape. Speed is a step behind Groq and Cerebras but good enough for the vast majority of production workloads, and the flexibility to fine-tune, pin, and export models without vendor lock-in is unusual in a world where most LLM APIs treat weights as proprietary. For any team running open-weight models at scale and planning to fine-tune within the year, Together is a strong default.