# llm-api

22 tools tagged

Showing 22 of 22 tools

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium

New API

Unified LLM API gateway and proxy hub

New API is an open-source multi-tenant AI gateway that aggregates and distributes LLM API requests across providers like OpenAI, Claude, and Gemini through a unified proxy interface. It cross-converts requests into OpenAI-compatible, Claude-compatible, or Gemini-compatible formats, with built-in channel management, quota control, token-based authentication, and billing capabilities. Deploy via Docker with SQLite or MySQL for centralized model management.

open-sourceOpen Source

Tokscale

CLI token usage tracker for AI coding agents

Tokscale is a CLI tool that tracks token usage and costs across AI coding agents including Claude Code, Codex, OpenCode, Gemini CLI, Cursor, and more. Built with a native Rust core for high-performance processing, it provides detailed breakdowns of input, output, cache, and reasoning tokens with real-time pricing calculations via LiteLLM data. Features include interactive 2D/3D contribution graphs, web visualization dashboards, global leaderboards, and JSON export for cost analysis.

open-sourceOpen Source

DeepInfra

Cost-effective AI inference platform with 86+ models from $0.02/M tokens

DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.

api-usage-based

TensorZero

Open-source LLM gateway with built-in optimization and A/B testing

TensorZero is an open-source LLMOps platform in Rust that unifies an LLM gateway, observability, prompt optimization, and A/B experimentation in a single binary. It routes requests across providers with sub-millisecond P99 latency at 10K+ QPS while capturing structured data for continuous improvement. Supports dynamic in-context learning, fine-tuning workflows, and production feedback loops. Backed by $7.3M seed funding, 11K+ GitHub stars.

open-sourceOpen Source

Tavily

Real-time search API built for AI agents

Tavily is an AI-native search API that provides real-time web search, content extraction, and crawling capabilities specifically designed for LLM applications and autonomous agents. It returns structured, citation-ready results optimized for RAG workflows with built-in safety features including prompt injection protection and PII leak prevention. Acquired by Nebius in 2026, Tavily integrates with LangChain, LlamaIndex, and major agent frameworks, serving over one million developers worldwide.

freemiumOpen Source

LM Studio

Run local LLMs with an intuitive desktop GUI and OpenAI-compatible API server.

Free desktop application by Element Labs for discovering, downloading, and running open-source LLMs locally. Features a curated Hugging Face model browser, side-by-side model comparison, parameter tuning, and an OpenAI-compatible API server on localhost:1234. Powered by llama.cpp with Metal acceleration for Apple Silicon.

free

OpenAI Assistants API

Thread-based AI assistant API with tools and file support

OpenAI's platform API for building stateful AI assistants. Manages conversation threads, supports function calling, code interpreter, and file search (RAG) out of the box. Usage-based pricing makes it accessible for startups and enterprises alike, with built-in memory and tool orchestration for production-grade conversational applications.

api-usage-based

OpenAI API

API for GPT-4, o1, DALL-E, Whisper, and embeddings

Official API platform for GPT-4o, o1/o3 reasoning models, DALL-E image generation, Whisper speech-to-text, and text embeddings. Features Assistants API, function calling, JSON mode, fine-tuning, and batch processing. The most widely used AI API in the industry, powering millions of applications from chatbots to complex multi-step agent systems across every sector.

api-usage-based

Anthropic API

Direct API access to Claude models with tool use

Official API for Claude models including Opus, Sonnet, and Haiku. Supports tool use, computer use, extended thinking, and batch processing. Features prompt caching, streaming, and Messages API with vision capabilities. Known for strong performance on complex reasoning tasks, nuanced instruction following, and safety-conscious design that makes it trusted for enterprise and production applications.

api-usage-based

Google Vertex AI

Google Cloud ML platform with Gemini and custom models

Google Cloud's end-to-end ML platform with Gemini models, Model Garden featuring 150+ models, AutoML, and custom training pipelines. Features Vertex AI Search, Conversation, and Agent Builder for enterprise AI applications. The comprehensive platform for organizations building production AI systems at scale within the Google Cloud ecosystem, with enterprise governance and compliance built in.

api-usage-based

Azure OpenAI

OpenAI models with Azure enterprise security

Microsoft's enterprise gateway to OpenAI models — GPT-5, o3, GPT-realtime, GPT-audio — with Azure security, compliance, and global infrastructure. Azure OpenAI keeps data within the customer's tenant (no OpenAI training), offers zero-trust architecture, private endpoints, dedicated capacity, Microsoft Agent Framework integration, and Azure AI Studio for orchestration. 80,000+ enterprise customers.

api-usage-based

AWS Bedrock

Managed foundation models on AWS

Fully managed AWS service providing enterprise access to 100+ foundation models from Anthropic, Meta, Mistral, Cohere, and Amazon's Nova family through a single API. Bedrock includes AgentCore for agent runtime, Knowledge Bases for RAG, Guardrails blocking 88% of harmful content, plus Model Distillation, Prompt Caching, and Intelligent Prompt Routing for cost optimization.

api-usage-based

Cohere

Enterprise AI for text generation, search, and RAG

Enterprise-focused AI platform from former Google Brain researchers offering Command (chat), Embed (semantic search), and Rerank (result ordering) model families. Cohere Embed v4 supports 100+ languages with multimodal text/image inputs, North agent workspace processes documents and spreadsheets, and Model Vault enables secure VPC or on-premises deployment for regulated enterprises.

freemium

Hugging Face

The GitHub of ML — model hub, datasets, and inference

Open-source platform for building, sharing, and deploying machine learning models and datasets. Hosts 500k+ models, 100k+ datasets, and Spaces for interactive demos. The central hub of the open-source AI ecosystem, providing model discovery, inference APIs, and collaborative tools that make it the GitHub of machine learning for researchers and developers worldwide.

freemiumOpen Source

Replicate

Run and deploy ML models via API with simple pricing

Cloud platform that lets developers run 50,000+ open-source ML models through a simple API without managing GPUs or infrastructure. Replicate hosts production-ready models like FLUX, Stable Diffusion, Llama, and Whisper for image, text, audio, and video, with custom model deployment, LoRA support, automatic scaling, version history with rollback, and pay-per-use pricing.

api-usage-based

Fireworks AI

Production-grade inference with serverless and on-demand GPUs

High-performance inference platform serving open-source and custom AI models at global scale, processing 13+ trillion tokens daily at ~180K requests per second. Fireworks AI delivers 1,000+ tokens per second on large models through quantization-aware tuning and adaptive speculation, with serverless, fine-tuning, and dedicated GPU options across text, image, and audio modalities.

freemium

Groq

Ultra-fast LPU inference with fastest token generation

AI inference company building the Language Processing Unit (LPU), purpose-built silicon that delivers the fastest LLM token generation speeds available. GroqCloud serves popular open-source models like Llama at 300+ tokens per second with sub-millisecond latency — roughly 10x faster than NVIDIA H100 GPU clusters — through a simple API without infrastructure management.

freemium

Together AI

Fast inference platform for open-source models

Cloud platform for running, fine-tuning, and training open-source AI models with optimized inference speeds up to 4x faster than traditional deployments. Together AI supports serverless endpoints and dedicated GPUs, fine-tuning of 100B+ parameter models like DeepSeek-V3 and Qwen3-235B, plus async batch processing scaling to 30B tokens for cost-effective large workloads.

api-usage-based

OpenRouter

Unified API gateway for 200+ AI models

Unified API gateway providing access to 500+ AI models from leading providers through a single OpenAI-compatible interface. OpenRouter eliminates the need to manage separate keys, billing, and integrations across providers like OpenAI, Anthropic, Google, and Meta, with built-in plugins for web search, PDF processing, automatic fallback routing, and per-model cost tracking.

api-usage-based

DeepSeek

Reasoning-focused LLM with competitive pricing

Chinese AI research lab developing high-performance open-source language models with a focus on reasoning quality, mathematical accuracy, and cost-efficient training. DeepSeek-R1 excels at step-by-step problem solving while V3 and V3.2 combine thinking and non-thinking modes via Mixture-of-Experts architecture. Free chat assistant and API access to the latest models.

freemiumTelemetry

Mistral AI

Open-weight frontier lab with a full European developer stack

Mistral AI is the French frontier-AI lab behind an integrated developer stack: open-weight and commercial models (Mistral Large 3, Small 4, Codestral, Devstral, Magistral, Voxtral), the Le Chat assistant, Studio agent platform, Vibe agentic coding suite, and the European-hosted Mistral Compute cloud. It offers a sovereign alternative to US labs with strong reasoning, coding, and multimodal performance, Apache 2.0 weights on Hugging Face, and an API priced well below incumbents.

freemium