# inference

26 tools tagged

Showing 24 of 26 tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

open-sourceOpen Source

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium

Guidance

Constrained generation that guarantees valid LLM outputs every time

Guidance is Microsoft's structured generation library that enforces output constraints directly within LLM decoding. It supports JSON schemas, regex patterns, grammars, and interleaved generation-and-control flow to guarantee valid outputs from any compatible model. Works with local models via llama.cpp, Transformers, and remote APIs including OpenAI and Anthropic. Eliminates retry loops and post-processing for structured data extraction.

freeOpen Source

llama-swap

Hot-swap between local LLM models via OpenAI-compatible API

llama-swap is an open-source tool that manages multiple local LLM models behind a single OpenAI-compatible API endpoint. It automatically loads and unloads models on demand, letting developers hot-swap between different models without restarting services. With 3.1K+ GitHub stars, it solves the common pain point of running multiple specialized models on limited hardware.

open-sourceOpen Source

Bifrost

50x faster LLM gateway with MCP support, built in Go

Bifrost is a high-performance open-source AI gateway built from scratch in Go. Unifies access to 15+ providers and 1,000+ models through a single OpenAI-compatible API with only 11 microsecond overhead per request at 5K RPS — 50x faster than LiteLLM. Features automatic failover, load balancing, semantic caching, and functions as both MCP client and MCP server. Apache 2.0 licensed.

open-sourceOpen Source

LiteRT-LM

Google's production on-device LLM inference framework

LiteRT-LM is Google's official open-source framework for running large language models on-device across Android, iOS, Web, Desktop, and Raspberry Pi. Already deployed in Chrome and Pixel hardware, it provides production-grade on-device LLM inference with 1.4K+ GitHub stars. Apache 2.0 licensed.

open-sourceOpen Source

gemma.cpp

Lightweight C++ inference for Google Gemma models

gemma.cpp is Google's standalone C++ inference engine built specifically for running Gemma language models without Python or CUDA dependencies. It provides optimized CPU inference using SIMD instructions and Highway library, supports Gemma 2 and Gemma 3 models, and runs on x86 and ARM architectures. Designed for embedded systems, edge devices, and server deployments needing minimal overhead.

open-sourceOpen Source

Xinference

Local model inference engine with OpenAI-compatible API and web UI

Xinference is a local inference engine that runs LLMs, embedding models, image generation, and audio models with an OpenAI-compatible API. It provides a web dashboard for model management, supports vLLM, llama.cpp, and transformers backends, and handles multi-GPU deployment automatically. Supports 100+ models including Qwen, Llama, Mistral, and DeepSeek with over 9,200 GitHub stars.

open-sourceOpen Source

RouteLLM

Intelligent model router that balances cost and quality across LLM providers

RouteLLM by LMSYS routes LLM requests to the most cost-effective model that can handle each query's complexity. It uses learned routing models to classify whether a query needs a powerful expensive model or can be handled by a cheaper alternative, reducing costs by up to 85% while maintaining quality. Supports OpenAI, Anthropic, and other providers through an OpenAI-compatible API.

open-sourceOpen Source

LoRAX

Multi-LoRA inference server for serving hundreds of fine-tuned models

LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with OpenAI-compatible API. Enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.

open-sourceOpen Source

Modal

Serverless GPU compute platform for AI inference and training

Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with decorators, auto-scale from zero to thousands of containers, and bill per second. It supports LLM inference, fine-tuning, batch jobs, and sandboxes, with current GPU options including B200, H200, H100, A100, L40S, A10, L4, and T4. Modal’s 2026 Series C valued the company at $4.65B.

freemium

Ray

Distributed AI compute engine for scaling Python and ML workloads

Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Ray's public site highlights OpenAI and other enterprise users. Maintained by Anyscale with Apache-2.0 open-source licensing.

open-sourceOpen Source

Cactus

On-device AI inference engine for mobile and wearable applications

Cactus is a YC-backed low-latency AI engine for mobile and wearable devices that runs LLMs, transcription, embedding, and TTS models locally. It achieves 16-20 tok/sec on older devices and 70+ tok/sec on flagships with ARM SIMD kernels optimized for Snapdragon, Apple, and MediaTek processors. Supports Qwen, Gemma, Llama, DeepSeek with Flutter, React Native, and Kotlin SDKs.

open-sourceOpen Source

BitNet

Microsoft's framework for running 1-bit large language models on consumer CPUs

BitNet is Microsoft's official inference framework for 1-bit quantized large language models that enables running models with up to 100 billion parameters on standard consumer CPUs without requiring a GPU. By leveraging extreme quantization where weights use only 1.58 bits on average, BitNet achieves dramatic reductions in memory footprint and computational cost while maintaining competitive output quality for many practical use cases.

open-sourceOpen Source

exo

Run frontier AI models across a cluster of everyday devices

exo turns multiple local machines into a unified AI compute cluster for models that exceed a single device's memory. It automatically discovers devices, uses topology-aware auto parallelism to split work across available resources, and supports RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The project exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs plus a dashboard for cluster management.

open-sourceOpen Source

Lemonade

AMD's open-source local LLM server with GPU and NPU acceleration

Lemonade is AMD's open-source local AI serving platform for LLMs, image generation, speech recognition, and text-to-speech on your own hardware. Built in lightweight C++, it can detect CPU, GPU, and NPU backends and is extra optimized for Ryzen AI, Radeon, and Strix Halo PCs. Lemonade exposes OpenAI, Anthropic, and Ollama-compatible APIs, ships with a desktop model manager, and supports source-confirmed GGUF, FLM, and ONNX models across Windows, Linux, macOS, and Docker.

open-sourceOpen Source

DeepInfra

Cost-effective AI inference platform with 86+ models from $0.02/M tokens

DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.

api-usage-based

llm-d

Kubernetes-native distributed LLM inference stack

llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.

open-sourceOpen Source

SGLang

Fast serving framework for LLMs and vision models

SGLang is an open-source serving framework for large language and vision-language models, designed for low latency and high throughput. It features RadixAttention for automatic KV cache reuse, compressed finite state machines for fast structured output generation, continuous batching, and tensor parallelism. With over 25,000 GitHub stars, it supports models like LLaMA, Mistral, Qwen, and Gemma on NVIDIA and AMD GPUs.

open-sourceOpen Source

TensorRT-LLM

NVIDIA's LLM inference optimization and acceleration library

TensorRT-LLM is NVIDIA's open-source library for optimizing LLM inference on NVIDIA GPUs. It provides kernel fusion, quantization (FP8, INT4, INT8), KV cache optimization, and in-flight batching to maximize throughput. Supports multi-GPU and multi-node setups with tensor and pipeline parallelism, and integrates with Triton Inference Server for production deployment of models like LLaMA, GPT, Mistral, and Qwen.

open-sourceOpen Source

LM Studio

Run local LLMs with an intuitive desktop GUI and OpenAI-compatible API server.

Free desktop application by Element Labs for discovering, downloading, and running open-source LLMs locally. Features a curated Hugging Face model browser, side-by-side model comparison, parameter tuning, and an OpenAI-compatible API server on localhost:1234. Powered by llama.cpp with Metal acceleration for Apple Silicon.

free

Cerebras Code

Ultra-fast AI coding powered by Cerebras hardware

Cerebras Code is a coding subscription service from Cerebras, the AI hardware company behind the Wafer-Scale Engine that delivers the fastest AI inference available. Unlike GPU-based systems bottlenecked by memory bandwidth, Cerebras's architecture eliminates these constraints at the hardware level, achieving token speeds no GPU cluster can match. Provides API access to open-source coding models running at 2,000+ tokens per second.

paid

Replicate

Run and deploy ML models via API with simple pricing

Cloud platform that lets developers run thousands of open-source and proprietary public ML models through a simple API without managing GPUs or infrastructure. Replicate hosts models for image, text, audio, and video, supports Cog-based custom deployments and private models, and now operates as a distinct Cloudflare brand with pay-by-time or input/output pricing depending on the model.

api-usage-based

Fireworks AI

Production-grade inference with serverless and on-demand GPUs

High-performance inference platform serving open-source and custom AI models at global scale, processing 13+ trillion tokens daily at ~180K requests per second. Fireworks AI delivers 1,000+ tokens per second on large models through quantization-aware tuning and adaptive speculation, with serverless, fine-tuning, and dedicated GPU options across text, image, and audio modalities.

freemium