16 tools tagged
Local model inference engine with OpenAI-compatible API and web UI
Xinference is a local inference engine that runs LLMs, embedding models, image generation, and audio models behind an OpenAI-compatible API. It provides a web dashboard for model management, supports vLLM, llama.cpp, and transformers backends, and handles multi-GPU deployment automatically. It supports 100+ models including Qwen, Llama, Mistral, and DeepSeek, and has over 9,200 GitHub stars.
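Because the server speaks the OpenAI wire format, existing clients only need a new base URL. A minimal stdlib sketch, assuming a local server and a model name like `qwen2.5-instruct` (the port and model name are illustrative assumptions, not fixed defaults):

```python
import json
import urllib.request

# Assumed local endpoint and model name; adjust to your deployment.
BASE_URL = "http://localhost:9997/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> dict:
    """POST the payload to the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("qwen2.5-instruct", "Hello")
# send(payload)  # requires a running Xinference server
```

Any OpenAI-compatible client library works the same way by pointing its base URL at the local server.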
Intelligent model router that balances cost and quality across LLM providers
RouteLLM by LMSYS routes LLM requests to the most cost-effective model that can handle each query's complexity. It uses learned routing models to classify whether a query needs a powerful expensive model or can be handled by a cheaper alternative, reducing costs by up to 85% while maintaining quality. Supports OpenAI, Anthropic, and other providers through an OpenAI-compatible API.
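The routing decision itself reduces to a threshold over a learned score. A toy sketch of that thresholding step (the score, threshold, and model names are placeholders, not RouteLLM's actual classifier):

```python
# Threshold-based routing sketch: a score in [0, 1] estimates how likely
# the cheap model is to answer well; the threshold trades cost vs. quality.
CHEAP_MODEL = "mixtral-8x7b"   # assumed names for illustration
STRONG_MODEL = "gpt-4"

def route(win_rate_estimate: float, threshold: float = 0.5) -> str:
    """Pick the cheap model when its estimated win rate clears the threshold."""
    return CHEAP_MODEL if win_rate_estimate >= threshold else STRONG_MODEL

print(route(0.8))  # easy query  -> mixtral-8x7b
print(route(0.2))  # hard query  -> gpt-4
```

Raising the threshold sends more traffic to the strong model; lowering it saves more cost.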
Multi-LoRA inference server for serving hundreds of fine-tuned models
LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with OpenAI-compatible API. Enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.
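The memory math is the point: adapters are tiny relative to the base model, so one deployment can host many fine-tunes. A back-of-envelope sketch with illustrative (not measured) sizes:

```python
# Rough memory comparison for multi-LoRA serving. Both figures below are
# assumptions for illustration, not benchmarks.
BASE_GB = 16.0     # assumed fp16 footprint of a 7B-class base model
ADAPTER_GB = 0.08  # assumed footprint of one LoRA adapter

def multi_lora_gb(n_adapters: int) -> float:
    """One shared base model plus n small adapters."""
    return BASE_GB + n_adapters * ADAPTER_GB

def per_model_gb(n_models: int) -> float:
    """One full model copy per fine-tune."""
    return n_models * BASE_GB

print(round(multi_lora_gb(100), 1))  # ~24 GB for 100 fine-tunes
print(round(per_model_gb(100), 1))   # 1600 GB if each were a full copy
```

This is why per-model GPU allocation stops scaling long before adapter sharing does.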
Serverless GPU compute platform for AI inference and training
Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with simple decorators, auto-scale from zero to thousands of containers, and bill per second of actual use. Supports LLM inference, fine-tuning, batch processing, and sandboxed environments. Used by Meta, Scale AI, and Harvey. Valued at $1.1B after $87M Series B.
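Per-second billing plus scale-to-zero means cost tracks actual compute, not reserved capacity. A rough cost sketch with a hypothetical GPU rate (not Modal's published pricing):

```python
# Hypothetical per-second GPU rate, roughly $2.16/hour for some GPU tier.
RATE_PER_SECOND = 0.0006

def job_cost(containers: int, seconds_each: float) -> float:
    """Cost when containers bill only while work is actually running."""
    return containers * seconds_each * RATE_PER_SECOND

# 50 containers each run for 90 seconds, then scale back to zero:
print(round(job_cost(50, 90), 2))
```

The same burst on hourly-reserved instances would bill 50 full GPU-hours regardless of the 90 seconds used.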
Distributed AI compute engine for scaling Python and ML workloads
Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Used by OpenAI for ChatGPT training, Uber, Shopify, and Instacart. Maintained by Anyscale and part of the PyTorch Foundation.
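Ray's core primitive is the remote task: ordinary Python functions fanned out as futures across the cluster. As a single-machine analogy using only the standard library (`concurrent.futures`, not Ray's `@ray.remote` API):

```python
from concurrent.futures import ThreadPoolExecutor

# Fan-out/fan-in task parallelism on one machine; Ray generalizes this
# pattern across many nodes with remote functions and distributed futures.
def score_chunk(chunk: list[int]) -> int:
    return sum(x * x for x in chunk)

chunks = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(score_chunk, chunks))

print(partials)       # [5, 25, 61]
print(sum(partials))  # 91
```

In Ray the same shape scales to thousands of GPUs because the scheduler, not the caller, decides where each task runs.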
On-device AI inference engine for mobile and wearable applications
Cactus is a YC-backed open-source inference engine built specifically for running LLMs, vision models, and embeddings on smartphones, tablets, and wearable devices. It provides native SDKs for iOS, Android, Flutter, and React Native with optimized ARM CPU and Apple NPU execution paths. The project claims the fastest inference speeds on ARM processors and roughly 10x lower RAM usage than generic runtimes, enabling privacy-first AI applications that run entirely on-device.
Microsoft's framework for running 1-bit large language models on consumer CPUs
BitNet is Microsoft's official inference framework for 1-bit quantized large language models, enabling models with up to 100 billion parameters to run on standard consumer CPUs without a GPU. By restricting weights to ternary values (-1, 0, 1), about 1.58 bits of information each, BitNet achieves dramatic reductions in memory footprint and computational cost while maintaining competitive output quality for many practical use cases.
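The quantization idea fits in a few lines: scale weights by their mean absolute value, then round and clip to ternary values, in the spirit of the BitNet b1.58 recipe (a simplified illustration, not Microsoft's implementation):

```python
# Ternary ("1.58-bit") weight quantization sketch: each weight becomes
# -1, 0, or 1, i.e. log2(3) ~ 1.58 bits of information per weight.
def ternary_quantize(weights: list[float], eps: float = 1e-8) -> list[int]:
    """Absmean scaling, then round-and-clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]

w = [0.9, -0.05, -1.2, 0.4]
print(ternary_quantize(w))  # [1, 0, -1, 1]
```

With ternary weights, matrix multiplication collapses into additions and subtractions, which is what makes CPU-only inference practical.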
Run frontier AI models across a cluster of everyday devices
exo turns a collection of everyday devices — laptops, desktops, phones — into a unified AI compute cluster capable of running large language models that no single device could handle alone. It automatically partitions models across available hardware using dynamic model sharding, supports heterogeneous device types including Apple Silicon, NVIDIA, and AMD GPUs, and communicates over standard networking without requiring specialized interconnects.
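Sharding boils down to giving each device a slice of layers proportional to its memory. A simplified sketch of that partitioning idea (not exo's actual algorithm):

```python
# Memory-proportional layer sharding: each device gets a contiguous
# slice of transformer layers sized by its share of total memory.
def shard_layers(n_layers: int, device_mem_gb: list[float]) -> list[int]:
    total = sum(device_mem_gb)
    counts = [int(n_layers * m / total) for m in device_mem_gb]
    # Hand layers lost to rounding-down to the largest device.
    counts[device_mem_gb.index(max(device_mem_gb))] += n_layers - sum(counts)
    return counts

# 32 layers over a 24 GB desktop, a 16 GB laptop, and an 8 GB phone:
print(shard_layers(32, [24.0, 16.0, 8.0]))  # [17, 10, 5]
```

At inference time, activations flow device-to-device over the network, so each node only ever holds its own slice in memory.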
AMD's open-source local LLM server with GPU and NPU acceleration
Lemonade is AMD's open-source local AI serving platform that runs LLMs, image generation, speech recognition, and text-to-speech directly on your hardware. Built in lightweight C++, it automatically detects and configures optimal CPU, GPU, and NPU backends. Lemonade exposes an OpenAI-compatible API so existing applications work without code changes, and ships with a desktop app for model management and testing. Supports GGUF, ONNX, and SafeTensors across Windows, Linux, macOS, and Docker.
Cost-effective AI inference platform with 86+ models from $0.02/M tokens
DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.
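At that entry rate the arithmetic is simple: tokens times dollars-per-million. A quick sketch (rates vary by model; the $0.02 figure is the listed floor):

```python
# Token-cost arithmetic at the entry-level rate of $0.02 per million tokens.
def cost_usd(tokens: int, usd_per_million: float = 0.02) -> float:
    return tokens * usd_per_million / 1_000_000

# 5M tokens a day for 30 days at the entry rate:
print(round(cost_usd(5_000_000 * 30), 2))  # 3.0
```

The same 150M tokens at a $1/M-token rate would run $150, which is why per-model pricing dominates the total bill.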
Kubernetes-native distributed LLM inference stack
llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.
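Cache-aware routing means preferring the node whose KV cache already holds the longest prefix of the incoming prompt. A toy version of that matching rule (illustrative, not llm-d's scheduler):

```python
# Route a prompt to whichever node has the longest cached prefix, so its
# warm KV cache can be reused instead of recomputing the prefill.
def best_node(prompt: str, node_cached_prefixes: dict[str, list[str]]) -> str:
    def longest_hit(prefixes: list[str]) -> int:
        return max((len(p) for p in prefixes if prompt.startswith(p)), default=0)
    return max(node_cached_prefixes, key=lambda n: longest_hit(node_cached_prefixes[n]))

nodes = {
    "gpu-a": ["You are a helpful assistant."],
    "gpu-b": ["You are a helpful assistant. Answer in French."],
}
print(best_node("You are a helpful assistant. Answer in French. Bonjour!", nodes))
```

Real schedulers also weigh load and cache eviction, but prefix reuse is the core signal.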
Run local LLMs with an intuitive desktop GUI and OpenAI-compatible API server
Free desktop application by Element Labs for discovering, downloading, and running open-source LLMs locally. Features a curated Hugging Face model browser, side-by-side model comparison, parameter tuning, and an OpenAI-compatible API server on localhost:1234. Powered by llama.cpp with Metal acceleration for Apple Silicon.
Run and deploy ML models via API with simple pricing
Unified API gateway that provides access to hundreds of LLM models from OpenAI, Anthropic, Google, Meta, and open-source providers through a single OpenAI-compatible interface. Features model fallbacks, price comparison, and community-driven model rankings. A popular choice for developers who want multi-provider flexibility without managing individual API integrations.
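Model fallback is the gateway's core reliability trick: try models in preference order and fall through on provider errors. A minimal sketch with placeholder model names and a stubbed call function:

```python
def complete_with_fallback(prompt: str, models: list[str], call) -> tuple[str, str]:
    """Try each model in order; return (model_used, completion)."""
    last_err = None
    for model in models:
        try:
            return model, call(model, prompt)
        except RuntimeError as err:  # stand-in for provider/HTTP errors
            last_err = err
    raise RuntimeError(f"all models failed: {last_err}")

# Stubbed provider call: the primary is rate limited, the backup answers.
def flaky_call(model: str, prompt: str) -> str:
    if model == "provider-a/primary":
        raise RuntimeError("rate limited")
    return f"{model}: ok"

used, text = complete_with_fallback(
    "hello", ["provider-a/primary", "provider-b/backup"], flaky_call
)
print(used)  # provider-b/backup
```

Because every provider sits behind the same OpenAI-compatible schema, the fallback list is just an ordered list of model identifiers.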
Production-grade inference with serverless and on-demand GPUs
Open-source model serving platform optimized for large language models and generative AI. Supports Hugging Face models, LoRA adapters, and continuous batching for efficient multi-user serving. Built on PyTorch with OpenAI-compatible endpoints. Designed for teams who need production-grade LLM serving with lower latency and better resource utilization than generic model serving frameworks.
Ultra-fast LPU inference with fastest token generation
Enterprise AI platform for fine-tuning and deploying custom language models. Offers Command R family of models, Embed API for retrieval, and Rerank API for search relevance. Known for strong enterprise features including data privacy guarantees, custom model training, and retrieval-augmented generation capabilities that help organizations build AI applications grounded in their proprietary data.
Fast inference platform for open-source models
Meta's open-weight large language model family available for commercial use. Llama 3 models range from 8B to 405B parameters, offering competitive performance with full weight access. Hosted on Hugging Face and available through major cloud providers. One of the most impactful open-weight releases, enabling companies and researchers to build, fine-tune, and deploy custom AI solutions without API dependencies.