25 tools tagged
Showing 24 of 25 tools
Wafer-scale inference at thousands of tokens per second
Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.
Constrained generation that guarantees valid LLM outputs every time
Guidance is Microsoft's structured generation library that enforces output constraints directly within LLM decoding. It supports JSON schemas, regex patterns, grammars, and interleaved generation-and-control flow to guarantee valid outputs from any compatible model. Works with local models via llama.cpp, Transformers, and remote APIs including OpenAI and Anthropic. Eliminates retry loops and post-processing for structured data extraction.
Hot-swap between local LLM models via OpenAI-compatible API
llama-swap is an open-source tool that manages multiple local LLM models behind a single OpenAI-compatible API endpoint. It automatically loads and unloads models on demand, letting developers hot-swap between different models without restarting services. With 3.1K+ GitHub stars, it solves the common pain point of running multiple specialized models on limited hardware.
50x faster LLM gateway with MCP support, built in Go
Bifrost is a high-performance open-source AI gateway built from scratch in Go. Unifies access to 15+ providers and 1,000+ models through a single OpenAI-compatible API with only 11 microsecond overhead per request at 5K RPS — 50x faster than LiteLLM. Features automatic failover, load balancing, semantic caching, and functions as both MCP client and MCP server. Apache 2.0 licensed.
Google's production on-device LLM inference framework
LiteRT-LM is Google's official open-source framework for running large language models on-device across Android, iOS, Web, Desktop, and Raspberry Pi. Already deployed in Chrome and Pixel hardware, it provides production-grade on-device LLM inference with 1.4K+ GitHub stars. Apache 2.0 licensed.
Lightweight C++ inference for Google Gemma models
gemma.cpp is Google's standalone C++ inference engine built specifically for running Gemma language models without Python or CUDA dependencies. It provides optimized CPU inference using SIMD instructions and Highway library, supports Gemma 2 and Gemma 3 models, and runs on x86 and ARM architectures. Designed for embedded systems, edge devices, and server deployments needing minimal overhead.
Local model inference engine with OpenAI-compatible API and web UI
Xinference is a local inference engine that runs LLMs, embedding models, image generation, and audio models with an OpenAI-compatible API. It provides a web dashboard for model management, supports vLLM, llama.cpp, and transformers backends, and handles multi-GPU deployment automatically. Supports 100+ models including Qwen, Llama, Mistral, and DeepSeek with over 9,200 GitHub stars.
Intelligent model router that balances cost and quality across LLM providers
RouteLLM by LMSYS routes LLM requests to the most cost-effective model that can handle each query's complexity. It uses learned routing models to classify whether a query needs a powerful expensive model or can be handled by a cheaper alternative, reducing costs by up to 85% while maintaining quality. Supports OpenAI, Anthropic, and other providers through an OpenAI-compatible API.
Multi-LoRA inference server for serving hundreds of fine-tuned models
LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with OpenAI-compatible API. Enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.
Serverless GPU compute platform for AI inference and training
Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with simple decorators, auto-scale from zero to thousands of containers, and bill per-second of actual use. Supports LLM inference, fine-tuning, batch processing, and sandboxed environments. Used by Meta, Scale AI, and Harvey. Valued at $1.1B after $87M Series B.
Distributed AI compute engine for scaling Python and ML workloads
Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Used by OpenAI for ChatGPT training, Uber, Shopify, and Instacart. Maintained by Anyscale and part of the PyTorch Foundation.
On-device AI inference engine for mobile and wearable applications
Cactus is a YC-backed low-latency AI engine for mobile and wearable devices that runs LLMs, transcription, embedding, and TTS models locally. It achieves 16-20 tok/sec on older devices and 70+ tok/sec on flagships with ARM SIMD kernels optimized for Snapdragon, Apple, and MediaTek processors. Supports Qwen, Gemma, Llama, DeepSeek with Flutter, React Native, and Kotlin SDKs.
Microsoft's framework for running 1-bit large language models on consumer CPUs
BitNet is Microsoft's official inference framework for 1-bit quantized large language models that enables running models with up to 100 billion parameters on standard consumer CPUs without requiring a GPU. By leveraging extreme quantization where weights use only 1.58 bits on average, BitNet achieves dramatic reductions in memory footprint and computational cost while maintaining competitive output quality for many practical use cases.
Run frontier AI models across a cluster of everyday devices
exo turns a collection of everyday devices — laptops, desktops, phones — into a unified AI compute cluster capable of running large language models that no single device could handle alone. It automatically partitions models across available hardware using dynamic model sharding, supports heterogeneous device types including Apple Silicon, NVIDIA, and AMD GPUs, and communicates over standard networking without requiring specialized interconnects.
AMD's open-source local LLM server with GPU and NPU acceleration
Lemonade is AMD's open-source local AI serving platform that runs LLMs, image generation, speech recognition, and text-to-speech directly on your hardware. Built in lightweight C++, it automatically detects and configures optimal CPU, GPU, and NPU backends. Lemonade exposes an OpenAI-compatible API so existing applications work without code changes, and ships with a desktop app for model management and testing. Supports GGUF, ONNX, and SafeTensors across Windows, Linux, macOS, and Docker.
Cost-effective AI inference platform with 86+ models from $0.02/M tokens
DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.
Kubernetes-native distributed LLM inference stack
llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.
Fast serving framework for LLMs and vision models
SGLang is an open-source serving framework for large language and vision-language models, designed for low latency and high throughput. It features RadixAttention for automatic KV cache reuse, compressed finite state machines for fast structured output generation, continuous batching, and tensor parallelism. With over 25,000 GitHub stars, it supports models like LLaMA, Mistral, Qwen, and Gemma on NVIDIA and AMD GPUs.
NVIDIA's LLM inference optimization and acceleration library
TensorRT-LLM is NVIDIA's open-source library for optimizing LLM inference on NVIDIA GPUs. It provides kernel fusion, quantization (FP8, INT4, INT8), KV cache optimization, and in-flight batching to maximize throughput. Supports multi-GPU and multi-node setups with tensor and pipeline parallelism, and integrates with Triton Inference Server for production deployment of models like LLaMA, GPT, Mistral, and Qwen.
Run local LLMs with an intuitive desktop GUI and OpenAI-compatible API server.
Free desktop application by Element Labs for discovering, downloading, and running open-source LLMs locally. Features a curated Hugging Face model browser, side-by-side model comparison, parameter tuning, and an OpenAI-compatible API server on localhost:1234. Powered by llama.cpp with Metal acceleration for Apple Silicon.
Ultra-fast AI coding powered by Cerebras hardware
Cerebras Code is a coding subscription service from Cerebras, the AI hardware company behind the Wafer-Scale Engine that delivers the fastest AI inference available. Unlike GPU-based systems bottlenecked by memory bandwidth, Cerebras's architecture eliminates these constraints at the hardware level, achieving token speeds no GPU cluster can match. Provides API access to open-source coding models running at 2,000+ tokens per second.
Run and deploy ML models via API with simple pricing
Cloud platform that lets developers run 50,000+ open-source ML models through a simple API without managing GPUs or infrastructure. Replicate hosts production-ready models like FLUX, Stable Diffusion, Llama, and Whisper for image, text, audio, and video, with custom model deployment, LoRA support, automatic scaling, version history with rollback, and pay-per-use pricing.
Production-grade inference with serverless and on-demand GPUs
High-performance inference platform serving open-source and custom AI models at global scale, processing 13+ trillion tokens daily at ~180K requests per second. Fireworks AI delivers 1,000+ tokens per second on large models through quantization-aware tuning and adaptive speculation, with serverless, fine-tuning, and dedicated GPU options across text, image, and audio modalities.
Ultra-fast LPU inference with fastest token generation
AI inference company building the Language Processing Unit (LPU), purpose-built silicon that delivers the fastest LLM token generation speeds available. GroqCloud serves popular open-source models like Llama at 300+ tokens per second with sub-millisecond latency — roughly 10x faster than NVIDIA H100 GPU clusters — through a simple API without infrastructure management.