What Together AI Does
Together AI is a full-stack open-weight model platform that sits between raw GPU rental and hosted frontier APIs. The company offers serverless inference across 200+ open models, on-demand dedicated endpoints that pin a model to reserved H100 or B200 GPUs, managed fine-tuning with LoRA and full-weight options, and an AI-native GPU cloud for teams that want bare-metal control. Every layer speaks an OpenAI-compatible API, so the same client code works whether the team is calling Llama 3.3 70B on serverless, a custom fine-tune on a dedicated endpoint, or a private checkpoint running on a reserved cluster.
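That portability is easy to see in code. A minimal sketch using the OpenAI Python SDK pointed at Together's endpoint; the model id is illustrative of Together's org/model naming, and TOGETHER_API_KEY is assumed to be set:

```python
# Stock OpenAI client code, pointed at Together's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # the only change from existing OpenAI code
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative catalog id
    messages=[{"role": "user", "content": "Summarize LoRA fine-tuning in one sentence."}],
)
print(resp.choices[0].message.content)
```

Swapping the model string to a fine-tuned checkpoint or a dedicated-endpoint deployment leaves the rest of this code untouched, which is the point of the uniform surface.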
The Model Catalog and Inference Speed
The serverless catalog is the broadest in the open-weight market: Llama 3.3 70B, Llama 4 Scout and Maverick, Qwen 3 32B and 235B (including the Thinking reasoning variants), Kimi K2, GPT-OSS 20B and 120B, DeepSeek V3 and R1, Mistral Saba and Mixtral families, plus image, embedding, and reranker models. Together serves these on its own Inference Engine stack, built on top of vLLM and SGLang with custom kernels, which the company benchmarks at up to 2x the throughput of vanilla open-source servers on demanding models.
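Because the catalog changes often, the OpenAI-compatible models endpoint is the practical way to see what is currently served; a short sketch, reusing the same client setup:

```python
# List the currently served catalog via the OpenAI-compatible models endpoint.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"],
                base_url="https://api.together.xyz/v1")

for m in client.models.list():
    print(m.id)  # e.g. "meta-llama/Llama-3.3-70B-Instruct-Turbo"
```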
Speed is competitive with Fireworks and a step behind Groq and Cerebras on raw tokens per second, but Together pulls ahead on larger context windows, longer-tail models, and reasoning-mode endpoints. Streaming, tool/function calling, JSON mode, structured outputs, vision, and multimodal endpoints are all supported through the OpenAI-compatible surface, and existing OpenAI code needs only a base-URL change in the Python or TypeScript SDK.
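Streaming in particular behaves exactly as it does against OpenAI; a minimal sketch, again with an illustrative model id:

```python
# Streaming tokens through the same OpenAI-compatible surface.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"],
                base_url="https://api.together.xyz/v1")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative catalog id
    messages=[{"role": "user", "content": "Name three open-weight model families."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final usage frame) carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```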
Pricing and When Dedicated Beats Serverless
Serverless prices run from $0.05 per million tokens on the smallest Llama and Qwen variants up to $7 per million on the largest flagship models, with most production workloads landing between $0.20 and $2 per million. New accounts get a modest free credit, enough to test every model. The pricing page is one of the clearest in the industry: separate input/output rates per model, no hidden multipliers, and transparent batch and embedding rates.
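To make the rate card concrete, a back-of-the-envelope estimate; the per-token rate below is an illustrative placeholder for a 70B-class model, not a quote from the pricing page:

```python
# Back-of-the-envelope monthly cost for a chat workload.
# Rates and traffic numbers are illustrative assumptions.
input_rate = 0.88 / 1_000_000   # assumed $/input token
output_rate = 0.88 / 1_000_000  # assumed $/output token

requests_per_day = 50_000
input_tokens, output_tokens = 800, 300  # assumed per-request averages

daily = requests_per_day * (input_tokens * input_rate + output_tokens * output_rate)
print(f"~${daily:,.2f}/day, ~${daily * 30:,.2f}/month")
# ~$48/day, ~$1,452/month under these assumptions
```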
Together's dedicated endpoints change the economics at scale. A two-H100 dedicated endpoint typically becomes cheaper than serverless at around 130,000 tokens per minute of sustained traffic on a 70B model, and the break-even point is lower for smaller models. Fine-tuning adds another clean tier: LoRA runs $0.48 to $2.90 per million training tokens, full fine-tuning $0.54 to $3.20, and the resulting custom weights can be deployed to serverless, pinned to a dedicated endpoint, or downloaded to run anywhere, a flexibility most closed-model platforms do not offer.
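A sketch of where a break-even figure like that comes from; both rates below are assumptions for illustration, not Together's published prices:

```python
# Break-even traffic where a dedicated endpoint beats serverless.
# Both rates are assumed for illustration only.
gpu_hourly = 3.36          # assumed $/H100-hour
num_gpus = 2
serverless_rate = 0.88e-6  # assumed $/token, blended input+output

dedicated_per_min = num_gpus * gpu_hourly / 60
break_even_tpm = dedicated_per_min / serverless_rate
print(f"Break-even: ~{break_even_tpm:,.0f} tokens/minute sustained")
# ~127,000 tokens/min with these rates, consistent with the ~130K figure above
```

Below that sustained rate you pay only for tokens on serverless; above it, the fixed hourly GPU cost amortizes out cheaper per token.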
Developer Experience and Fine-Tuning Pipeline
The platform documentation is strong, with clear quickstarts for inference, fine-tuning, embeddings, and dedicated endpoints. A web playground supports side-by-side model comparisons, and the dashboard surfaces per-model usage, latency percentiles, and monthly spend. Together's Python and TypeScript SDKs mirror the OpenAI SDK closely, and integrations cover LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, and most major agent frameworks.
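The fine-tuning pipeline itself is an upload/launch/poll loop; a sketch with the Together Python SDK, where the method and parameter names follow the docs' quickstart pattern as best I recall, so treat them as assumptions and verify against the current SDK:

```python
# Sketch of the fine-tuning flow: upload data, launch a LoRA job, poll status.
# Method and parameter names are assumptions from the SDK quickstart;
# the model id is illustrative.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

train_file = client.files.upload(file="train.jsonl")  # chat-format JSONL examples

job = client.fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    lora=True,   # LoRA tier; full fine-tuning is the pricier alternative
    n_epochs=3,
)
print(job.id, client.fine_tuning.retrieve(job.id).status)
```

Once the job completes, the same checkpoint id can be targeted from serverless, promoted to a dedicated endpoint, or exported, matching the deployment flexibility described above.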