Efficient Memory Management for Large Language Model Serving with PagedAttention
1 minute read ∼ Filed in: A paper note

Throughput is important. Batching requests improves throughput: the larger the batch size, the higher the throughput. Using memory inefficiently wastes space, so fewer requests fit in memory at once, which lowers throughput.
It is challenging to manage memory efficiently. In existing systems such as Hugging Face Transformers, only 20-40% of memory is actually used to store requests:
- Internal fragmentation: memory is over-allocated (e.g., a slot of 2048 tokens) because the output length is unknown in advance, so the tail of the slot is never used.
- Reservation: memory not used at the current step but reserved for future tokens; it is wasted right now.
- External fragmentation: requests have different sequence lengths and thus reserve chunks of different sizes, leaving gaps between allocations that are too small to reuse.
Overall, only about 20% of memory is effectively used in current systems.
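A rough back-of-the-envelope illustration of where the memory goes in a pre-allocating system; the slot size matches the 2048-token example above, while the sequence lengths are hypothetical:

```python
# Hypothetical request slot in a system that pre-allocates the maximum
# length per request. All numbers below are illustrative, not measured.
max_slot = 2048     # pre-allocated KV slots for one request
final_len = 500     # eventual prompt + output length (unknown in advance)
current_len = 384   # tokens processed so far

used = current_len                    # KV entries actually stored now
reserved = final_len - current_len    # needed later, but wasted right now
internal_frag = max_slot - final_len  # never used at all

print(f"used: {used / max_slot:.0%}")                            # ~19%
print(f"reserved: {reserved / max_slot:.0%}")                    # ~6%
print(f"internal fragmentation: {internal_frag / max_slot:.0%}") # ~76%
```

External fragmentation between differently sized slots comes on top of this.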
Contributions
Analogy to OS virtual memory: request => process, block => page, token => byte.
This paper divides the KV cache into fixed-size blocks, so that the KV cache can be stored in non-contiguous memory (see the sketch after this list).
- Blocks need not be stored in contiguous space, so KV cache storage is more flexible.
- External fragmentation is avoided.
- Memory sharing can happen at the granularity of a block.
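A minimal sketch of the idea, assuming a per-request block table that maps logical block indices to physical blocks; the names here are illustrative, not vLLM's actual API:

```python
BLOCK_SIZE = 16  # tokens per KV block

class Allocator:
    """Pool of fixed-size physical blocks; any free block fits any request."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class BlockTable:
    """Per-request map from logical block index to physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []  # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # On-demand: grab a new physical block only when the last is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = Allocator(num_blocks=64)
table = BlockTable(alloc)
for _ in range(40):   # 40 tokens occupy 3 blocks: 16 + 16 + 8
    table.append_token()
```

Because physical blocks can live anywhere in the pool, a sequence grows without ever needing a contiguous region.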
The paper then proposes PagedAttention, an attention algorithm that computes attention directly over this block-based storage.
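The following sketch shows the core computation for a single query token, gathering K/V from scattered blocks via the block table; it follows the standard attention math, but the data layout and names are assumptions for illustration:

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64
kv_pool_k = np.random.randn(100, BLOCK_SIZE, HEAD_DIM)  # physical K blocks
kv_pool_v = np.random.randn(100, BLOCK_SIZE, HEAD_DIM)  # physical V blocks

def paged_attention(q, block_table, num_tokens):
    scores, values = [], []
    for i, phys in enumerate(block_table):
        n = min(BLOCK_SIZE, num_tokens - i * BLOCK_SIZE)  # last block may be partial
        scores.append(kv_pool_k[phys, :n] @ q / np.sqrt(HEAD_DIM))
        values.append(kv_pool_v[phys, :n])
    s = np.concatenate(scores)              # scores over all cached tokens
    w = np.exp(s - s.max()); w /= w.sum()   # softmax
    return w @ np.concatenate(values)       # weighted sum of V, shape (HEAD_DIM,)

q = np.random.randn(HEAD_DIM)
out = paged_attention(q, block_table=[42, 7, 13], num_tokens=40)
```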
Finally, the paper implements vLLM, an LLM serving system built on PagedAttention.
Methods
Logical & physical KV blocks
- Allocation is on-demand: a new physical block is allocated only when the previous one fills up.
- Memory sharing happens at the block level, which reduces memory usage (see the sketch after this list).
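A sketch of block-level sharing via reference counting, e.g. for several sampled continuations of one prompt: a shared block is duplicated before being written (copy-on-write). The names and the dict-based pool are illustrative assumptions:

```python
kv_pool = {}     # physical block id -> list of cached per-token KV entries
ref_count = {}   # physical block id -> number of block tables mapping it
free_blocks = list(range(100))

def fork(block_table):
    # A new sequence shares all existing blocks instead of copying them.
    for b in block_table:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(block_table)

def write_token(block_table, logical_idx, token_kv):
    b = block_table[logical_idx]
    if ref_count.get(b, 1) > 1:
        # Copy-on-write: only the writer gets a private copy of the block.
        ref_count[b] -= 1
        new_b = free_blocks.pop()
        kv_pool[new_b] = list(kv_pool.get(b, []))
        block_table[logical_idx] = b = new_b
    kv_pool.setdefault(b, []).append(token_kv)

parent = [5, 6]                        # a prompt occupying two blocks
child = fork(parent)                   # child shares blocks 5 and 6
write_token(child, 1, token_kv="kv")   # triggers copy-on-write of block 6
```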
Analysis:
- Minimal internal fragmentation: allocation is on-demand, so only the last block of a sequence can be partially empty. With 16 or 32 tokens per block, this waste is small.
- No external fragmentation: all blocks have the same size, so any free block can serve any request.
- Reduced memory usage means more requests fit in memory, which raises throughput.
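A quick worked comparison, reusing the hypothetical numbers from the waste example above:

```python
block_size = 16
# Paged allocation: only the last block can be partially empty, so waste
# per sequence is bounded by block_size - 1 tokens.
paged_waste = block_size - 1        # at most 15 tokens
# Pre-allocation to the maximum length: the whole unused tail is wasted.
prealloc_waste = 2048 - 500         # 1548 tokens for a 500-token sequence
print(paged_waste, prealloc_waste)  # 15 vs. 1548
```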