aicoolies logo

Model Providers

LLM APIs, hosted inference platforms, model gateways, and local model runtimes

Showing 24 of 55 tools

Claude

Claude

Top Pick

Anthropic's frontier AI assistant

Anthropic's AI assistant known for strong reasoning, nuanced writing, and extended context up to 200K tokens. Available in Opus (most capable), Sonnet (balanced), and Haiku (fast) tiers. Features web search, deep research, file analysis, code execution, artifacts, and Projects for organized workflows. Claude Code provides terminal-based agentic coding. API supports tool use, batch processing, and prompt caching. Available via claude.ai, mobile apps, and developer API.

freemium
xAI Python SDK logo

xAI Python SDK

Official Python SDK for the xAI API

The xAI Python SDK is the official Python client for the xAI API, giving developers a direct way to build Grok-powered apps without relying on community proxies or unofficial wrappers. It supports synchronous and asynchronous Python clients for chat completions, streaming responses, function/tool calling, and multimodal workflows, making it a clean fit for backend services, agents, notebooks, and developer tools that need programmatic xAI access.

open-sourceOpen Source
Cerebras logo

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium
Chatbox logo

Chatbox

One desktop app for every LLM — private, cross-platform, extensible

Chatbox is a cross-platform desktop AI client supporting OpenAI, Claude, Gemini, DeepSeek, and local models via Ollama. All chat data stays on-device, making it ideal for privacy-conscious developers. Features include document analysis, code assistance with syntax highlighting, image generation, web search, and a local knowledge base for private Q&A. Available on Windows, macOS, Linux, Android, iOS, and web.

freemiumOpen Source
Baseten logo

Baseten

ML inference platform for production AI models

Baseten is the inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. Specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. Delivers 75% latency reduction with 415ms cold starts and 3000+ concurrent scaling. Available as managed cloud or self-hosted, trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based
Nexa SDK logo

Nexa SDK

Cross-platform on-device AI model runtime

Nexa SDK enables running frontier LLMs and multimodal models locally across PC, mobile, IoT, and wearables with automatic hardware acceleration for GPU, NPU, and CPU. It supports Qwen, Gemma, Llama, DeepSeek models with Python/C++ desktop SDKs, Android/iOS mobile SDKs, and Docker for edge deployment. Includes an OpenAI-compatible API server with chat and function calling support.

open-sourceOpen Source
Triton Inference Server logo

Triton Inference Server

NVIDIA's optimized AI model serving platform

Triton Inference Server is NVIDIA's open-source inference serving platform that deploys AI models from TensorRT, PyTorch, ONNX, TensorFlow, OpenVINO, Python, and more across cloud, data center, and edge environments. It supports dynamic batching, model ensembles, concurrent model execution on GPUs and CPUs, and real-time, streaming, and batch inference patterns. Includes Model Analyzer for profiling and Model Navigator for automated optimization.

open-sourceOpen Source
fal.ai logo

fal.ai

Serverless AI inference for generative media at scale

fal.ai is a serverless AI inference platform providing ultra-low-latency APIs for generating images, videos, audio, and 3D models. With 600+ production-ready models and native Python and JavaScript SDKs, it eliminates GPU management while delivering 30-50% lower costs than alternatives. Automatic scaling with no cold starts and real-time streaming support make it ideal for interactive AI applications.

api-usage-based

RamaLama

Container-native local AI model serving with Podman

RamaLama is an open-source tool that containerizes AI model inference using Podman or Docker, eliminating host system configuration complexity. It auto-detects GPUs (NVIDIA, AMD, Intel, Apple Silicon), pulls models from HuggingFace, Ollama, and OCI registries, and runs them in isolated rootless containers with read-only mounts and network isolation. Developed under the Containers project (Red Hat ecosystem), it brings familiar container workflows to local LLM serving.

open-sourceOpen Source
Deepgram logo

Deepgram

Voice AI APIs for speech-to-text and text-to-speech

Deepgram is a voice AI infrastructure platform providing low-latency speech-to-text, text-to-speech, and conversational AI APIs. Its Nova-3 model delivers industry-leading accuracy for real-time transcription with streaming support, interruption handling, and multi-language capabilities. Used by 1,300+ organizations including Twilio and Vapi, Deepgram powers voice features in applications ranging from call centers to AI agent voice interfaces.

api-usage-based

One API

OpenAI API management gateway for 100+ LLM providers

One API is a self-hosted LLM API gateway that provides a unified OpenAI-compatible interface for managing multiple model providers including OpenAI, Azure, Anthropic, Google, and dozens of Chinese providers. It handles load balancing, quota management, rate limiting, token tracking, and channel-based routing through a web dashboard. Widely adopted in the Chinese developer ecosystem with over 18,000 GitHub stars.

open-sourceOpen Source
Xinference logo

Xinference

Local model inference engine with OpenAI-compatible API and web UI

Xinference is a local inference engine that runs LLMs, embedding models, image generation, and audio models with an OpenAI-compatible API. It provides a web dashboard for model management, supports vLLM, llama.cpp, and transformers backends, and handles multi-GPU deployment automatically. Supports 100+ models including Qwen, Llama, Mistral, and DeepSeek with over 9,200 GitHub stars.

open-sourceOpen Source

Cactus

On-device AI inference engine for mobile and wearable applications

Cactus is a YC-backed low-latency AI engine for mobile and wearable devices that runs LLMs, transcription, embedding, and TTS models locally. It achieves 16-20 tok/sec on older devices and 70+ tok/sec on flagships with ARM SIMD kernels optimized for Snapdragon, Apple, and MediaTek processors. Supports Qwen, Gemma, Llama, DeepSeek with Flutter, React Native, and Kotlin SDKs.

open-sourceOpen Source

exo

Run frontier AI models across a cluster of everyday devices

exo turns multiple local machines into a unified AI compute cluster for models that exceed a single device's memory. It automatically discovers devices, uses topology-aware auto parallelism to split work across available resources, and supports RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The project exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs plus a dashboard for cluster management.

open-sourceOpen Source
Lemonade logo

Lemonade

AMD's open-source local LLM server with GPU and NPU acceleration

Lemonade is AMD's open-source local AI serving platform for LLMs, image generation, speech recognition, and text-to-speech on your own hardware. Built in lightweight C++, it can detect CPU, GPU, and NPU backends and is extra optimized for Ryzen AI, Radeon, and Strix Halo PCs. Lemonade exposes OpenAI, Anthropic, and Ollama-compatible APIs, ships with a desktop model manager, and supports source-confirmed GGUF, FLM, and ONNX models across Windows, Linux, macOS, and Docker.

open-sourceOpen Source
DeepInfra logo

DeepInfra

Cost-effective AI inference platform with 86+ models from $0.02/M tokens

DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.

api-usage-based

llm-d

Kubernetes-native distributed LLM inference stack

llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.

open-sourceOpen Source

Llamafile

Run LLMs as a single portable executable file

Llamafile by Mozilla packages a complete LLM — model weights, inference engine, and OpenAI-compatible API server — into a single executable file that runs on Mac, Windows, Linux, FreeBSD, and OpenBSD with no installation. Built on llama.cpp and Cosmopolitan Libc for cross-platform portability, it delivers GPU-accelerated inference when available and falls back to optimized CPU execution. Supports GGUF models with a built-in web chat UI and REST API for integration.

open-sourceOpen Source
AnythingLLM logo

AnythingLLM

All-in-one self-hosted AI app with RAG, agents, and multi-user support

AnythingLLM is an open-source, privacy-first AI application that turns any document into an interactive knowledge base. It bundles document ingestion, vector storage (built-in LanceDB), RAG pipelines, AI agents, and multi-user access into a single deployable package. Supports 30+ LLM providers including OpenAI, Anthropic, Ollama, and local models. With 62K+ GitHub stars and MIT license, it runs as a desktop app or Docker container with zero configuration required out of the box.

freemiumOpen Source

Text Generation Inference

Hugging Face's production LLM serving framework

Text Generation Inference (TGI) is Hugging Face's production-ready serving framework for large language models. It features flash attention, continuous batching, tensor parallelism, quantization via GPTQ/AWQ/EETQ, and Safetensors support. Powers Hugging Face's Inference API and Inference Endpoints, with an OpenAI-compatible API and Docker deployment. Supports LLaMA, Mistral, Falcon, and other popular model architectures.

open-sourceOpen Source
MLC LLM logo

MLC LLM

Run LLMs natively on any device with ML compilation

MLC LLM is an open-source engine for deploying large language models natively across diverse platforms using machine learning compilation. It runs models on NVIDIA/AMD GPUs, Apple Silicon, mobile devices, and browsers via WebGPU without cloud dependencies. Features include OpenAI-compatible API, quantization support, and optimized backends for CUDA, Metal, Vulkan, and WebAssembly.

open-sourceOpen Source
ONNX Runtime logo

ONNX Runtime

Cross-platform high-performance ML inference engine

ONNX Runtime is Microsoft's open-source inference engine for machine learning models in ONNX format. It delivers cross-platform acceleration via execution providers for NVIDIA CUDA, TensorRT, DirectML, CoreML, OpenVINO, and more. Supports training acceleration, quantization, and GenAI workloads. Used in production across Windows, Azure, Office 365, and thousands of applications with pip-installable Python and native C++/C#/Java APIs.

open-sourceOpen Source

OpenVINO

Intel's open-source AI inference optimization toolkit

OpenVINO is Intel's open-source toolkit for optimizing and deploying AI inference across CPUs, GPUs, and NPUs. It supports models from PyTorch, TensorFlow, ONNX, and TFLite, providing graph optimizations, quantization, and hardware-specific acceleration. The toolkit includes a GenAI API for LLM deployment and runs on Intel, ARM, and x86 platforms for edge, desktop, and cloud inference workloads.

open-sourceOpen Source
ExecuTorch logo

ExecuTorch

PyTorch on-device AI for mobile and edge devices

ExecuTorch is PyTorch's official solution for deploying AI models on mobile, embedded, and edge devices. It features a 50KB base runtime, 12+ hardware backends including Apple CoreML, Qualcomm QNN, ARM, and Vulkan, and native PyTorch export without format conversions. Powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, and Ray-Ban Smart Glasses, supporting LLMs, vision, speech, and multimodal models.

open-sourceOpen Source