22 tools tagged
Showing 22 of 22 tools
Wafer-scale inference at thousands of tokens per second
Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.
Unified LLM API gateway and proxy hub
New API is an open-source multi-tenant AI gateway that aggregates and distributes LLM API requests across providers like OpenAI, Claude, and Gemini through a unified proxy interface. It cross-converts requests into OpenAI-compatible, Claude-compatible, or Gemini-compatible formats, with built-in channel management, quota control, token-based authentication, and billing capabilities. Deploy via Docker with SQLite or MySQL for centralized model management.
CLI token usage tracker for AI coding agents
Tokscale is a CLI tool that tracks token usage and costs across AI coding agents including Claude Code, Codex, OpenCode, Gemini CLI, Cursor, and more. Built with a native Rust core for high-performance processing, it provides detailed breakdowns of input, output, cache, and reasoning tokens with real-time pricing calculations via LiteLLM data. Features include interactive 2D/3D contribution graphs, web visualization dashboards, global leaderboards, and JSON export for cost analysis.
Cost-effective AI inference platform with 86+ models from $0.02/M tokens
DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.
Open-source LLM gateway with built-in optimization and A/B testing
TensorZero is an open-source LLMOps platform in Rust that unifies an LLM gateway, observability, prompt optimization, and A/B experimentation in a single binary. It routes requests across providers with sub-millisecond P99 latency at 10K+ QPS while capturing structured data for continuous improvement. Supports dynamic in-context learning, fine-tuning workflows, and production feedback loops. Backed by $7.3M seed funding, 11K+ GitHub stars.
Real-time search API built for AI agents
Tavily is an AI-native search API that provides real-time web search, content extraction, and crawling capabilities specifically designed for LLM applications and autonomous agents. It returns structured, citation-ready results optimized for RAG workflows with built-in safety features including prompt injection protection and PII leak prevention. Acquired by Nebius in 2026, Tavily integrates with LangChain, LlamaIndex, and major agent frameworks, serving over one million developers worldwide.
Run local LLMs with an intuitive desktop GUI and OpenAI-compatible API server.
Free desktop application by Element Labs for discovering, downloading, and running open-source LLMs locally. Features a curated Hugging Face model browser, side-by-side model comparison, parameter tuning, and an OpenAI-compatible API server on localhost:1234. Powered by llama.cpp with Metal acceleration for Apple Silicon.
Thread-based AI assistant API with tools and file support
OpenAI's platform API for building stateful AI assistants. Manages conversation threads, supports function calling, code interpreter, and file search (RAG) out of the box. Usage-based pricing makes it accessible for startups and enterprises alike, with built-in memory and tool orchestration for production-grade conversational applications.
API for GPT-4, o1, DALL-E, Whisper, and embeddings
Official API platform for GPT-4o, o1/o3 reasoning models, DALL-E image generation, Whisper speech-to-text, and text embeddings. Features Assistants API, function calling, JSON mode, fine-tuning, and batch processing. The most widely used AI API in the industry, powering millions of applications from chatbots to complex multi-step agent systems across every sector.
Direct API access to Claude models with tool use
Official API for Claude models including Opus, Sonnet, and Haiku. Supports tool use, computer use, extended thinking, and batch processing. Features prompt caching, streaming, and Messages API with vision capabilities. Known for strong performance on complex reasoning tasks, nuanced instruction following, and safety-conscious design that makes it trusted for enterprise and production applications.
Google Cloud ML platform with Gemini and custom models
Google Cloud's end-to-end ML platform with Gemini models, Model Garden featuring 150+ models, AutoML, and custom training pipelines. Features Vertex AI Search, Conversation, and Agent Builder for enterprise AI applications. The comprehensive platform for organizations building production AI systems at scale within the Google Cloud ecosystem, with enterprise governance and compliance built in.
OpenAI models with Azure enterprise security
Microsoft's enterprise gateway to OpenAI models — GPT-5, o3, GPT-realtime, GPT-audio — with Azure security, compliance, and global infrastructure. Azure OpenAI keeps data within the customer's tenant (no OpenAI training), offers zero-trust architecture, private endpoints, dedicated capacity, Microsoft Agent Framework integration, and Azure AI Studio for orchestration. 80,000+ enterprise customers.
Managed foundation models on AWS
Fully managed AWS service providing enterprise access to 100+ foundation models from Anthropic, Meta, Mistral, Cohere, and Amazon's Nova family through a single API. Bedrock includes AgentCore for agent runtime, Knowledge Bases for RAG, Guardrails blocking 88% of harmful content, plus Model Distillation, Prompt Caching, and Intelligent Prompt Routing for cost optimization.
Enterprise AI for text generation, search, and RAG
Enterprise-focused AI platform from former Google Brain researchers offering Command (chat), Embed (semantic search), and Rerank (result ordering) model families. Cohere Embed v4 supports 100+ languages with multimodal text/image inputs, North agent workspace processes documents and spreadsheets, and Model Vault enables secure VPC or on-premises deployment for regulated enterprises.
The GitHub of ML — model hub, datasets, and inference
Open-source platform for building, sharing, and deploying machine learning models and datasets. Hosts 500k+ models, 100k+ datasets, and Spaces for interactive demos. The central hub of the open-source AI ecosystem, providing model discovery, inference APIs, and collaborative tools that make it the GitHub of machine learning for researchers and developers worldwide.
Run and deploy ML models via API with simple pricing
Cloud platform that lets developers run 50,000+ open-source ML models through a simple API without managing GPUs or infrastructure. Replicate hosts production-ready models like FLUX, Stable Diffusion, Llama, and Whisper for image, text, audio, and video, with custom model deployment, LoRA support, automatic scaling, version history with rollback, and pay-per-use pricing.
Production-grade inference with serverless and on-demand GPUs
High-performance inference platform serving open-source and custom AI models at global scale, processing 13+ trillion tokens daily at ~180K requests per second. Fireworks AI delivers 1,000+ tokens per second on large models through quantization-aware tuning and adaptive speculation, with serverless, fine-tuning, and dedicated GPU options across text, image, and audio modalities.
Ultra-fast LPU inference with fastest token generation
AI inference company building the Language Processing Unit (LPU), purpose-built silicon that delivers the fastest LLM token generation speeds available. GroqCloud serves popular open-source models like Llama at 300+ tokens per second with sub-millisecond latency — roughly 10x faster than NVIDIA H100 GPU clusters — through a simple API without infrastructure management.
Fast inference platform for open-source models
Cloud platform for running, fine-tuning, and training open-source AI models with optimized inference speeds up to 4x faster than traditional deployments. Together AI supports serverless endpoints and dedicated GPUs, fine-tuning of 100B+ parameter models like DeepSeek-V3 and Qwen3-235B, plus async batch processing scaling to 30B tokens for cost-effective large workloads.
Unified API gateway for 200+ AI models
Unified API gateway providing access to 500+ AI models from leading providers through a single OpenAI-compatible interface. OpenRouter eliminates the need to manage separate keys, billing, and integrations across providers like OpenAI, Anthropic, Google, and Meta, with built-in plugins for web search, PDF processing, automatic fallback routing, and per-model cost tracking.
Reasoning-focused LLM with competitive pricing
Chinese AI research lab developing high-performance open-source language models with a focus on reasoning quality, mathematical accuracy, and cost-efficient training. DeepSeek-R1 excels at step-by-step problem solving while V3 and V3.2 combine thinking and non-thinking modes via Mixture-of-Experts architecture. Free chat assistant and API access to the latest models.
Open-weight frontier lab with a full European developer stack
Mistral AI is the French frontier-AI lab behind an integrated developer stack: open-weight and commercial models (Mistral Large 3, Small 4, Codestral, Devstral, Magistral, Voxtral), the Le Chat assistant, Studio agent platform, Vibe agentic coding suite, and the European-hosted Mistral Compute cloud. It offers a sovereign alternative to US labs with strong reasoning, coding, and multimodal performance, Apache 2.0 weights on Hugging Face, and an API priced well below incumbents.