LoRAX: Serve 1000s of Fine-Tuned LLMs on a Single GPU - Travis Addair, Predibase, Inc.
LoRAX (LoRA eXchange) is a new LLM inference system that lets users pack thousands of fine-tuned "LoRA" adapters onto a single GPU, dramatically reducing the cost of serving compared with dedicated deployments per fine-tuned model. LoRAX is open source, free to use commercially, and production-ready, with pre-built Docker images and Helm charts available for immediate download and use.

In this talk, we'll introduce LoRAX and explore the key ideas that make it the most cost-effective and efficient way to serve fine-tuned LLMs in production, including:

- Dynamic Adapter Loading: each set of fine-tuned LoRA weights is loaded from storage just-in-time as requests arrive at runtime, without blocking concurrent requests.
- Heterogeneous Continuous Batching: an extension to continuous batching that packs requests for different adapters into the same batch, keeping latency and throughput nearly constant as the number of concurrent adapters grows.
- Adapter Exchange Scheduling: a fair scheduling policy that asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
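To make the interplay of these ideas concrete, here is a minimal, purely illustrative Python sketch (not LoRAX's actual implementation) of two of them: an LRU-style adapter cache that "exchanges" adapters between GPU and CPU memory, and a batcher that packs requests for different adapters into one heterogeneous batch. All class, function, and field names here are hypothetical.

```python
from collections import OrderedDict

class AdapterCache:
    """Toy stand-in for adapter exchange between GPU and CPU memory.

    GPU-resident adapters live in an LRU-ordered dict; when capacity is
    exceeded, the least-recently-used adapter is offloaded to a CPU-side
    dict instead of being reloaded from storage next time.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.gpu = OrderedDict()  # adapter_id -> weights ("on GPU")
        self.cpu = {}             # offloaded adapters ("on CPU")

    def fetch(self, adapter_id):
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)  # mark as recently used
            return self.gpu[adapter_id]
        # Restore from CPU memory if offloaded, else "load from storage".
        weights = self.cpu.pop(adapter_id, f"weights:{adapter_id}")
        if len(self.gpu) >= self.capacity:
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted  # offload LRU adapter
        self.gpu[adapter_id] = weights
        return weights

def pack_batch(requests, cache, max_batch=8):
    """Heterogeneous batching: requests targeting *different* adapters
    share one batch; each entry is tagged with its adapter so the
    forward pass could apply the right LoRA weights per row."""
    batch = []
    for req in requests[:max_batch]:
        weights = cache.fetch(req["adapter_id"])
        batch.append((req["prompt"], req["adapter_id"], weights))
    return batch

cache = AdapterCache(capacity=2)
reqs = [
    {"prompt": "hi",  "adapter_id": "a"},
    {"prompt": "yo",  "adapter_id": "b"},
    {"prompt": "hey", "adapter_id": "a"},
    {"prompt": "sup", "adapter_id": "c"},  # triggers offload of "b"
]
batch = pack_batch(reqs, cache)
```

In this toy run, all four requests land in a single batch despite spanning three adapters, and only two adapters occupy "GPU" memory at once; the real system does the prefetching and offloading asynchronously so it never blocks concurrent requests.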