What Groq Does
Groq is an inference-only API provider that runs open-weight LLMs on its custom Language Processing Unit (LPU) chips instead of GPUs. The company does not train models; instead, it takes popular open-weight releases — Llama 3.3 70B, Llama 4 Scout and Maverick, GPT-OSS 20B and 120B, Mistral Saba, DeepSeek R1 distillations, Gemma 2 — and serves them through an OpenAI-compatible REST API at throughput that GPU clouds cannot match. The LPU is a deterministic streaming processor that keeps model weights in large on-die SRAM, eliminating the HBM bottleneck that caps GPU token generation.
Speed, Models, and API Surface
Speed is the headline. Groq's Llama 3.3 70B sustains around 394 output tokens per second, GPT-OSS 20B tops 950 tokens per second, and smaller Llama 8B variants clear 1,000 tokens per second on production endpoints: roughly 5 to 10 times faster than the same weights on Together or Fireworks, and an order of magnitude faster than hyperscaler GPU endpoints. Time to first token is similarly low, which matters more than peak throughput for real-time agents and voice applications.
The API is OpenAI-compatible, so dropping Groq into existing LangChain, LiteLLM, or Vercel AI SDK code usually takes a single base URL change. Streaming, tool calling, JSON mode, structured outputs, and batch processing (with a 50 percent discount) are all supported, and cached input tokens are billed at half price. The model lineup is strictly open-weight, which means no access to GPT-5, Claude 4.x, or Gemini 2.5, but every model Groq does serve is available to every paid account without tier gating.
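In practice the drop-in swap looks like this: a minimal sketch using the official OpenAI Python SDK pointed at Groq's documented base URL, with streaming left on. The model ID follows Groq's published naming; the prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # the only change from stock OpenAI code
    api_key="YOUR_GROQ_API_KEY",                # read from an env var in real code
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Groq's ID for Llama 3.3 70B
    messages=[{"role": "user", "content": "Explain the LPU in one sentence."}],
    stream=True,  # streaming works unchanged through the compatible API
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```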
Pricing and Free Tier
Groq's pricing is aggressive by 2026 standards. Llama 3.1 8B lists at $0.05 per million input tokens and $0.08 per million output, Llama 3.3 70B is $0.59 / $0.79, and the GPT-OSS and Llama 4 families sit between the two. A real free tier allows 30 requests per minute, 6,000 tokens per minute, and 14,400 requests per day on every model with no credit card required — enough to prototype a serious agent before adding a payment method.
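To make the list prices concrete, here is a back-of-the-envelope calculation at the Llama 3.3 70B rates quoted above. The request shape (800 input and 300 output tokens per call, 100,000 calls per day) is an illustrative assumption, not a measured workload.

```python
# Llama 3.3 70B list prices: $0.59 / $0.79 per million input / output tokens.
INPUT_PRICE = 0.59 / 1_000_000   # $ per input token
OUTPUT_PRICE = 0.79 / 1_000_000  # $ per output token

requests_per_day = 100_000
input_tokens, output_tokens = 800, 300  # assumed per-request shape

daily = requests_per_day * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)
print(f"${daily:.2f}/day")  # ≈ $70.90/day, before batch or cached-input discounts
```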
For teams optimizing blended cost, DeepSeek V4 on other providers is roughly half the per-token price of Llama 70B on Groq, but it is also 8–10x slower. For interactive workloads that must respond within a latency budget, the slower option drives timeouts, retries, and abandoned sessions, so the effective cost per completed request often favors Groq. Batch and cached-input discounts further close the gap for high-volume pipelines. The main pricing caveat is that LPU capacity is not as elastic as GPU fleets, so burst traffic to the newest models occasionally hits 429s until rate limits are raised.
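Since bursts surface as 429s, a retry loop with exponential backoff is the usual mitigation. This is a hedged sketch using the OpenAI SDK's RateLimitError; the retry count and sleep schedule are arbitrary choices, not Groq recommendations.

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_API_KEY")

def complete_with_backoff(messages, model="llama-3.3-70b-versatile", retries=5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            # Sleep 1s, 2s, 4s, ... plus jitter before retrying the burst.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate limited after retries")
```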
Developer Experience and Ecosystem
The console, SDKs, and docs have matured quickly. The Python and TypeScript SDKs mirror the OpenAI SDK almost exactly, the GroqCloud dashboard surfaces per-model rate limits and usage, and the Playground lets teams compare models side by side without writing code. First-party integrations include LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, Agno, and most major agent frameworks; the community treats Groq as a standard inference provider rather than a niche vendor.
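As one example of how thin those integrations are, routing through LiteLLM takes only a model-prefix change, using LiteLLM's documented "groq/" convention; the snippet assumes a GROQ_API_KEY environment variable and an illustrative prompt.

```python
import litellm

# LiteLLM reads GROQ_API_KEY from the environment and routes the call to
# Groq's endpoint based on the "groq/" model prefix.
response = litellm.completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```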