
LoRAX

Multi-LoRA inference server for serving hundreds of fine-tuned models

Open Source

LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It loads and unloads LoRA adapters on demand, so every adapter shares the GPU memory held by the base model. It is built on text-generation-inference and exposes an OpenAI-compatible API, which enables multi-tenant model serving without allocating a GPU per model. The project has over 3,700 GitHub stars.

LoRAX solves the economics of serving many fine-tuned models by sharing a single base model across hundreds or thousands of LoRA adapters. Traditional model serving requires dedicating GPU memory to each model variant, making it economically impractical to serve personalized models for different customers, use cases, or domains. LoRAX loads the base model once and dynamically swaps LoRA adapters per request, enabling multi-tenant fine-tuned model serving at a fraction of the GPU cost.
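The cost argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the sizes (a base model of about 14,000 MB and adapters of about 50 MB each) are assumed round numbers, not measurements of LoRAX, and the function names are hypothetical.

```python
def dedicated_mb(base_mb: int, adapter_mb: int, n_models: int) -> int:
    """GPU memory if every fine-tuned variant gets its own copy of the base model."""
    return n_models * (base_mb + adapter_mb)


def shared_mb(base_mb: int, adapter_mb: int, n_models: int) -> int:
    """GPU memory if one base model is shared and all adapters are resident at once."""
    return base_mb + n_models * adapter_mb


# Assumed sizes, chosen only for illustration.
BASE_MB = 14_000    # roughly a 7B-parameter model at 2 bytes per parameter
ADAPTER_MB = 50     # a small LoRA adapter
N_MODELS = 100

print(dedicated_mb(BASE_MB, ADAPTER_MB, N_MODELS))  # 1405000
print(shared_mb(BASE_MB, ADAPTER_MB, N_MODELS))     # 19000
```

The shared figure assumes all adapters sit in GPU memory simultaneously. A server that keeps only the recently used adapters resident needs less than that.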

The architecture builds on Hugging Face's text-generation-inference server and adds a LoRA adapter management layer. That layer loads adapters from the Hugging Face Hub or local storage, caches frequently used adapters in GPU memory, and evicts the least-recently-used adapters under memory pressure. Adapters are selected per request with little added latency, so different requests to the same server can transparently use different fine-tuned models.
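The load, cache, and evict behavior described above can be sketched as a small least-recently-used cache. This is a minimal illustration of the general technique, not LoRAX's actual implementation: the class name `AdapterCache`, the `load_fn` callback, and the adapter IDs are all hypothetical, and capacity is counted in adapters rather than bytes for simplicity.

```python
from collections import OrderedDict
from typing import Callable, List


class AdapterCache:
    """Keeps at most `capacity` adapters resident and evicts the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader, e.g. from a hub or local disk
        self._resident: "OrderedDict[str, object]" = OrderedDict()

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._resident:
            # Cache hit: mark this adapter as most recently used.
            self._resident.move_to_end(adapter_id)
            return self._resident[adapter_id]
        # Cache miss: load the adapter, then evict the oldest entry if over capacity.
        weights = self.load_fn(adapter_id)
        self._resident[adapter_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)
        return weights

    def resident_ids(self) -> List[str]:
        """Adapter IDs ordered from least to most recently used."""
        return list(self._resident)


cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights:{adapter_id}")
for request_adapter in ["tenant-a", "tenant-b", "tenant-a", "tenant-c"]:
    cache.get(request_adapter)

print(cache.resident_ids())  # ['tenant-a', 'tenant-c']
```

In this run `tenant-b` is evicted because `tenant-a` was used again before `tenant-c` arrived. A real server would also have to account for adapter size and in-flight requests.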

With over 3,700 GitHub stars, LoRAX is a popular choice for organizations that fine-tune models for multiple customers or applications and need to serve them cost-effectively. Because the API is OpenAI-compatible, existing OpenAI client code generally needs little more than a new base URL, and the adapter is chosen per request through a request parameter (adapter_id in the native API, or the model field in the OpenAI-compatible endpoints). Predibase maintains LoRAX alongside its serverless fine-tuning platform, which helps keep it compatible with recent base models and LoRA techniques.
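As a rough illustration of per-request adapter selection, the sketch below builds the JSON body for a text-generation request. It assumes the native generate endpoint accepts an `inputs` string and a `parameters` object containing `adapter_id`, which should be checked against the current LoRAX documentation. The helper name `build_generate_request` and the adapter ID are hypothetical, and nothing is sent over the network.

```python
import json
from typing import Any, Dict, Optional


def build_generate_request(
    prompt: str,
    adapter_id: Optional[str] = None,
    max_new_tokens: int = 64,
) -> Dict[str, Any]:
    """Build a request body; omitting adapter_id leaves only the base model's settings."""
    parameters: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


# Hypothetical adapter name, used only for illustration.
body = build_generate_request(
    "Summarize this support ticket.",
    adapter_id="example-org/support-lora",
)
print(json.dumps(body, indent=2))
```

Two requests that differ only in `adapter_id` would go to the same server and base model, which is what makes per-tenant routing a matter of request data rather than deployment.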

Pricing

Free and open-source under Apache 2.0

Platforms

Python, CUDA GPUs, Docker, OpenAI-compatible API

Related Tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

Open Source

Deep Lake

AI data runtime for multimodal datasets and vector search

Deep Lake is an open-source AI data runtime from Activeloop for storing, versioning, and querying multimodal data and embeddings. It fits teams building RAG, training, evaluation, or dataset-heavy agent workflows that need a bridge between vector search, structured metadata, and large image, text, audio, or video collections.

Open Source

SeekDB

AI-native state store with hybrid vector and full-text search

SeekDB is an open-source AI-native state store from the OceanBase ecosystem that combines MySQL-compatible data access with hybrid vector and full-text retrieval. It targets agent and AI application teams that need embedded or server deployment, copy-on-write style sandboxes, and searchable state without gluing together several separate storage layers.

Open Source

Marqo

Embedding-first search and discovery engine for AI-powered product experiences.

Marqo is an open-source tensor search engine that combines embedding generation and vector search in a single API, removing the need to manage separate embedding pipelines and vector databases. Built for product discovery and multi-modal search, it lets teams index text, images, and structured data together, returning ranked results based on semantic similarity rather than keyword overlap.

freemium

Freestyle

Sandboxes for coding agents — Linux VMs, Git, and deploys in one box

Freestyle is YC-backed sandbox infrastructure built for AI coding agents, shipping secure Linux VMs with nested virtualization, Git servers, and one-click web deploys. It lets agents run real workloads, branch repos, and deploy apps under short-lived identities while billing only for active compute. Used in production by vly.ai, Rork, and Vibeflow.

freemium

OpenSRE

Open-source toolkit for building AI SRE incident response agents

OpenSRE is Tracer Cloud’s open-source public-alpha Python toolkit for building AI SRE agents that investigate and respond to production incidents. It ships 60+ tools across observability, databases, incident management, communications, deployment and protocol integrations, plus simulation/evaluation workflows for benchmarking agent accuracy before live pager use.

Open Source

Comparisons