Efficient Memory Management for Large Language Model Serving with PagedAttention
1 minute read ∼ Filed in: A paper note

Throughput is important. Batching requests improves throughput: the larger the batch size, the higher the throughput. Using memory inefficiently wastes space, so fewer requests fit in memory at once, which lowers throughput.
It is challenging to manage memory efficiently. In existing systems such as Hugging Face Transformers, only 20-40% of memory is actually used to store requests:
- Internal fragmentation: memory is over-allocated (e.g., a slot of 2048 tokens) because the output length is unknown in advance, so the tail of the slot is never used.
- Reservation: memory not used at the current step but reserved for future tokens; it is wasted right now.
- External fragmentation: requests have different sequence lengths and thus reserve chunks of different sizes, leaving gaps between allocations that are too small to reuse.
Overall, only about 20% of memory is effectively used in current systems.
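A rough back-of-the-envelope illustration of where the memory goes in a pre-allocating system; the slot size matches the 2048-token example above, while the sequence lengths are hypothetical:

```python
# Hypothetical request slot in a system that pre-allocates the maximum
# length per request. All numbers below are illustrative, not measured.
max_slot = 2048     # pre-allocated KV slots for one request
final_len = 500     # eventual prompt + output length (unknown in advance)
current_len = 384   # tokens processed so far

used = current_len                    # KV entries actually stored now
reserved = final_len - current_len    # needed later, but wasted right now
internal_frag = max_slot - final_len  # never used at all

print(f"used: {used / max_slot:.0%}")                            # ~19%
print(f"reserved: {reserved / max_slot:.0%}")                    # ~6%
print(f"internal fragmentation: {internal_frag / max_slot:.0%}") # ~76%
```

External fragmentation between differently sized slots comes on top of this.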
Contributions
Analogy to OS virtual memory: request => process, block => page, token => byte.
This paper divides the KV cache into fixed-size blocks, so that the KV cache can be stored in non-contiguous memory (see the sketch after this list).
- Blocks need not be stored in contiguous space, so KV cache storage is more flexible.
- External fragmentation is avoided.
- Memory sharing can happen at the granularity of a block.
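A minimal sketch of the idea, assuming a per-request block table that maps logical block indices to physical blocks; the names here are illustrative, not vLLM's actual API:

```python
BLOCK_SIZE = 16  # tokens per KV block

class Allocator:
    """Pool of fixed-size physical blocks; any free block fits any request."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class BlockTable:
    """Per-request map from logical block index to physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []  # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # On-demand: grab a new physical block only when the last is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = Allocator(num_blocks=64)
table = BlockTable(alloc)
for _ in range(40):   # 40 tokens occupy 3 blocks: 16 + 16 + 8
    table.append_token()
```

Because physical blocks can live anywhere in the pool, a sequence grows without ever needing a contiguous region.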
The paper then proposes PagedAttention, an attention algorithm that computes attention directly over this block-based storage.
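The following sketch shows the core computation for a single query token, gathering K/V from scattered blocks via the block table; it follows the standard attention math, but the data layout and names are assumptions for illustration:

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64
kv_pool_k = np.random.randn(100, BLOCK_SIZE, HEAD_DIM)  # physical K blocks
kv_pool_v = np.random.randn(100, BLOCK_SIZE, HEAD_DIM)  # physical V blocks

def paged_attention(q, block_table, num_tokens):
    scores, values = [], []
    for i, phys in enumerate(block_table):
        n = min(BLOCK_SIZE, num_tokens - i * BLOCK_SIZE)  # last block may be partial
        scores.append(kv_pool_k[phys, :n] @ q / np.sqrt(HEAD_DIM))
        values.append(kv_pool_v[phys, :n])
    s = np.concatenate(scores)              # scores over all cached tokens
    w = np.exp(s - s.max()); w /= w.sum()   # softmax
    return w @ np.concatenate(values)       # weighted sum of V, shape (HEAD_DIM,)

q = np.random.randn(HEAD_DIM)
out = paged_attention(q, block_table=[42, 7, 13], num_tokens=40)
```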
Finally, the paper implements vLLM, an LLM serving system built on PagedAttention.
Methods
Logical & physical KV blocks
- Allocation is on-demand: a new physical block is allocated only when the previous one fills up.
- Memory sharing happens at the block level, which reduces memory usage (see the sketch after this list).
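A sketch of block-level sharing via reference counting, e.g. for several sampled continuations of one prompt: a shared block is duplicated before being written (copy-on-write). The names and the dict-based pool are illustrative assumptions:

```python
kv_pool = {}     # physical block id -> list of cached per-token KV entries
ref_count = {}   # physical block id -> number of block tables mapping it
free_blocks = list(range(100))

def fork(block_table):
    # A new sequence shares all existing blocks instead of copying them.
    for b in block_table:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(block_table)

def write_token(block_table, logical_idx, token_kv):
    b = block_table[logical_idx]
    if ref_count.get(b, 1) > 1:
        # Copy-on-write: only the writer gets a private copy of the block.
        ref_count[b] -= 1
        new_b = free_blocks.pop()
        kv_pool[new_b] = list(kv_pool.get(b, []))
        block_table[logical_idx] = b = new_b
    kv_pool.setdefault(b, []).append(token_kv)

parent = [5, 6]                        # a prompt occupying two blocks
child = fork(parent)                   # child shares blocks 5 and 6
write_token(child, 1, token_kv="kv")   # triggers copy-on-write of block 6
```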
Analysis:
- Minimal internal fragmentation: allocation is on-demand, so only the last block of a sequence can be partially empty. With 16 or 32 tokens per block, this waste is small.
- No external fragmentation: all blocks have the same size, so any free block can serve any request.
- Reduced memory usage means more requests fit in memory, which raises throughput.
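A quick worked comparison, reusing the hypothetical numbers from the waste example above:

```python
block_size = 16
# Paged allocation: only the last block can be partially empty, so waste
# per sequence is bounded by block_size - 1 tokens.
paged_waste = block_size - 1        # at most 15 tokens
# Pre-allocation to the maximum length: the whole unused tail is wasted.
prealloc_waste = 2048 - 500         # 1548 tokens for a 500-token sequence
print(paged_waste, prealloc_waste)  # 15 vs. 1548
```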