Introduction to vLLM

Overview

vLLM project: GitHub repository
PagedAttention: paper

About the vLLM developers

Woosuk Kwon

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.
SkyPilot: A framework for easily and cost effectively running machine learning workloads on any cloud.

Zhuohan Li

vLLM: A high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention.
Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.
AlpaServe: Use model parallelism to accelerate deep learning serving, even when models fit a single GPU.
Alpa: Automate model parallel training with just a few lines of code.

Features

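State-of-the-art serving throughput
Efficient GPU memory management: PagedAttention manages attention key/value (KV) memory efficiently, with support for multi-query attention
Continuous batching of incoming requests
Optimized CUDA kernels, for example an attention kernel ported from FasterTransformer release 5.3, plus dedicated layernorm and position-encoding kernels
Multi-GPU inference; currently only tensor parallelism is supported, not pipeline parallelism
Support for the latest open-source models, updated very quickly: LLaMA, Llama 2, Baichuan, Qwen (Tongyi Qianwen), InternLM (Shusheng), and others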

The main problem it addresses

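Because LLMs generate their output iteratively, one token at a time, LLM serving performance is limited by memory (it is memory-IO bound) rather than by compute. In other words, it currently takes longer to load 1 MB of data into the GPU's compute cores than it takes those cores to run the LLM computation on that 1 MB. This means LLM inference throughput depends largely on how big a batch you can fit into high-bandwidth GPU memory. (See the processor's ops:byte ratio.)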
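The memory-bound argument above can be made concrete with a rough back-of-the-envelope sketch. The hardware figures below are illustrative assumptions (roughly an NVIDIA A100 40GB class GPU), and the cost model is deliberately simplified: it counts only weight reads in FP16 and ignores KV cache traffic, so treat it as a sketch rather than a performance model.

```python
# Illustrative hardware numbers (assumptions, not taken from the article).
PEAK_FLOPS = 312e12      # dense FP16 tensor-core FLOP/s
MEM_BANDWIDTH = 1555e9   # GPU memory bandwidth in bytes/s


def ops_byte_ratio(peak_flops: float = PEAK_FLOPS,
                   bandwidth: float = MEM_BANDWIDTH) -> float:
    """FLOPs the GPU can execute in the time it takes to load one byte."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one decode step.

    Each weight is read once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch. KV cache reads are ignored.
    """
    return 2 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int) -> bool:
    """True when the step does fewer FLOPs per byte than the GPU can sustain."""
    return decode_intensity(batch_size) < ops_byte_ratio()


if __name__ == "__main__":
    print(f"ops:byte ratio ~ {ops_byte_ratio():.0f} FLOPs/byte")
    for batch in (1, 8, 64, 256):
        print(batch, "memory-bound" if is_memory_bound(batch) else "compute-bound")
```

Under these assumptions a decode step stays memory-bound until the batch reaches a couple of hundred sequences, which is why fitting a larger batch into GPU memory translates so directly into throughput.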
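During autoregressive decoding, every input token to the LLM produces attention key and value tensors, and these tensors are kept in GPU memory to generate the next token. These cached key and value tensors are commonly called the KV cache. Because of fragmentation and over-reservation, existing systems waste 60%-80% of GPU memory.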
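To get a feel for how large the KV cache becomes, here is a minimal sketch of the arithmetic. The model dimensions (40 layers, hidden size 5120, FP16, standard multi-head attention) are assumptions chosen to resemble a 13B-parameter model; they do not come from the article.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    Each layer stores one key vector and one value vector of
    hidden_size elements per token (hence the factor of 2).
    """
    return 2 * num_layers * hidden_size * bytes_per_elem


# Assumed dimensions for a 13B-class model in FP16.
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
MAX_SEQ_LEN = 2048

per_token = kv_cache_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE)  # 819200 bytes
per_sequence = per_token * MAX_SEQ_LEN                         # 1677721600 bytes

if __name__ == "__main__":
    print(f"{per_token / 2**10:.0f} KiB per token")
    print(f"{per_sequence / 2**30:.2f} GiB for a {MAX_SEQ_LEN}-token sequence")
```

If a system reserves the full maximum-length buffer up front for every request, most of that space sits unused whenever the actual output is short, which is the over-reservation the paragraph describes.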

vLLM's solution

Reducing GPU memory fragmentation and over-reservation can significantly improve inference performance. vLLM's main ideas are:

Continuous batching (see: introduction to batching)
PagedAttention (see: vLLM blog)
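The paging idea can be sketched in a few lines of Python. This is a toy illustration, not vLLM's actual implementation: the class names, the block size of 16 tokens, and the allocator interface are all hypothetical. It shows the core mechanism only: the KV cache is split into fixed-size blocks, each sequence keeps a block table mapping its logical blocks to physical ones, and a new block is allocated only when the previous one is full.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockAllocator:
    """Hands out physical KV block ids from a fixed pool (hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, blocks: list) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks which physical blocks hold this sequence's KV cache."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # Allocate a new block only when the last one is full (or none exists).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Unused token slots; always fewer than BLOCK_SIZE.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence()
    for _ in range(33):
        seq.append_token(allocator)
    print(seq.block_table, seq.wasted_slots())
    allocator.release(seq.block_table)  # blocks return to the pool when done
```

Because blocks need not be contiguous and are allocated on demand, waste is limited to the unfilled tail of each sequence's last block instead of an entire pre-reserved maximum-length buffer.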

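Below are the conclusions of Anyscale's continuous-batching-llm-inference evaluation of vLLM:
We wanted to see how well this optimization performs. We discuss the following in detail, including how we simulated a production workload, but to summarize our findings: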

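Up to 23x higher throughput using continuous batching together with the PagedAttention memory optimization (using vLLM).
8x higher throughput than naive batching using continuous batching (on both Ray Serve and Hugging Face's text-generation-inference).
4x higher throughput than naive batching using an optimized model implementation (NVIDIA's FasterTransformer; see the optimization introduction).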
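Why continuous batching beats naive batching can be illustrated with a toy simulation. This sketch only counts decode iterations for a fixed number of batch slots; it ignores prefill, memory limits, and per-step cost, and it does not attempt to reproduce the figures above. The function names and the example workload are hypothetical.

```python
def static_batching_steps(lengths: list, batch_size: int) -> int:
    """Naive batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list, batch_size: int) -> int:
    """Continuous batching: a finished request frees its slot immediately."""
    waiting = list(lengths)   # tokens still to generate, per queued request
    running = []              # tokens still to generate, per active slot
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before each iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; finished ones leave.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


if __name__ == "__main__":
    # Assumed workload: mostly short outputs mixed with a few long ones.
    output_lengths = [1, 1, 1, 8, 1, 1, 1, 8]
    print(static_batching_steps(output_lengths, batch_size=4))      # 16
    print(continuous_batching_steps(output_lengths, batch_size=4))  # 10
```

In the naive scheme the short requests hold their slots idle until the longest request in the batch finishes; with continuous batching those slots are refilled on the next iteration, so the same work completes in fewer steps.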

vLLM Walkthrough

For details, see the attached resource: vLLM First SF Meetup Slides. The slides were written by the two authors and are fairly detailed.

Performance evaluation: TBD
