[Submitted on 12 Sep 2023]
Abstract: High throughput serving of large language models (LLMs) requires batching
sufficiently many requests at a time. However, existing systems struggle
because the key-value cache (KV cache) memory for each request is huge and
grows and shrinks dynamically. When managed inefficiently, this memory can be
significantly wasted by fragmentation and redundant duplication, limiting the
batch size. To address this problem, we propose PagedAttention, an attention
algorithm inspired by the classical virtual memory and paging techniques in
operating systems. On top of it, we build vLLM, an LLM serving system that
achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV
cache within and across requests to further reduce memory usage. Our
evaluations show that vLLM improves the throughput of popular LLMs by
2-4$\times$ with the same level of latency compared to the state-of-the-art
systems, such as FasterTransformer and Orca. The improvement is more pronounced
with longer sequences, larger models, and more complex decoding algorithms.
vLLM's source code is publicly available at
this https URL
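For a rough intuition of the paging analogy, the sketch below (hypothetical names, not vLLM's actual code) splits the KV cache into fixed-size blocks, maps each request's logical blocks to physical ones through a per-sequence block table, and reference-counts blocks so common prefixes can be shared across requests.

```python
# Minimal illustrative sketch of paging-style KV cache bookkeeping,
# under assumed names and a simplified allocation policy.

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class BlockAllocator:
    """Hands out physical KV blocks and reference-counts them for sharing."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> int:
        # Sharing bumps the refcount instead of copying the block.
        self.ref_counts[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so at most one block per sequence is partially unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Share all existing blocks with a child sequence (e.g. for
        # parallel sampling); divergent writes would need copy-on-write.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=1024)
    seq = Sequence(allocator)
    for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    child = seq.fork()           # child shares the same 3 physical blocks
    print(seq.block_table, child.block_table)
```

Because blocks are allocated on demand and shared by reference, fragmentation and duplication are limited to at most one partially filled block per sequence, which is the intuition behind the near-zero-waste claim above.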
Submission history
From: Woosuk Kwon [view email]
[v1]
Tue, 12 Sep 2023 12:50:04 UTC (831 KB)