AI Inference | vLLM Quick Deployment Guide

This is the first article in an AI inference series; more vLLM-related content will follow soon. It starts with deployment, covering how to install vLLM with both the GPU and CPU backends. Later articles will walk through vLLM's core features, such as PD disaggregation (prefill/decode separation), Speculative Decoding, and Prefix Caching, so stay tuned.

1 What is vLLM?

vLLM is an efficient and easy-to-use inference and serving framework for large language models (LLMs). It focuses on optimizing inference speed and throughput and is especially well suited to high-concurrency production environments. Developed by a research team at UC Berkeley, it has become one of the most popular LLM inference engines thanks to its outstanding performance.

vLLM can run on both GPUs and CPUs. This article covers how to install and run vLLM with each backend.

2 Prerequisites

2.1 Provisioning a Virtual Machine

If you do not have a local GPU environment, you can rent a GPU server from a cloud provider (such as Alibaba Cloud or Tencent Cloud).

Ubuntu 22.04 is the recommended operating system, and the GPU model can be chosen based on your actual needs. Since large language models usually take up a lot of disk space, it is advisable to allocate extra disk capacity.

2.2 Virtual Environment

uv is recommended for managing Python virtual environments. Install uv with the following commands:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

3 Installation

3.1 Using GPU as the vLLM Backend

3.1.1 System Requirements

vLLM ships with precompiled C++ and CUDA (12.6) binaries, so the following requirements apply:

  • Operating system: Linux
  • Python version: 3.9 ~ 3.12
  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx, A10, A100, L4, H100)

Note: Compute capability defines the hardware features and supported instructions of each NVIDIA GPU architecture. It determines whether certain CUDA or Tensor Core features (such as Unified Memory, Tensor Cores, or dynamic parallelism) are available; it does not directly measure a GPU's computational performance.
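
If you are not sure which compute capability your card has, the quick check below can help. It is a small sketch that assumes a CUDA-enabled PyTorch is already installed in the environment; you can also look the value up in NVIDIA's documentation or via nvidia-smi.

# check_compute_capability.py -- assumes a CUDA-enabled PyTorch install
import torch

if torch.cuda.is_available():
    # get_device_capability returns (major, minor), e.g. (8, 6) for an A10
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability: {major}.{minor}")
else:
    print("No CUDA device detected")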

3.1.2 Installing and Configuring GPU Dependencies

The following one-liner installs the required dependencies. The script installs the NVIDIA GPU Driver and the NVIDIA Container Toolkit, and configures the NVIDIA Container Runtime (needed later when running vLLM via Docker).

curl -sS https://raw.githubusercontent.com/cr7258/hands-on-lab/refs/heads/main/ai/gpu/setup/docker-only-install.sh | bash
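
After the script finishes, it is worth confirming that the driver and the container runtime are in place before moving on. The commands below are only a minimal sanity check; the exact output depends on your driver and Docker versions.

# Verify the GPU driver is loaded and the card is visible
nvidia-smi

# Verify Docker knows about the NVIDIA runtime
docker info | grep -i nvidia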

3.1.3 Installing vLLM

Create a Python virtual environment:

# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate

Install vLLM:

uv pip install vllm
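
To confirm the installation succeeded, you can import the package and print its version (a quick check, not part of the official instructions):

python -c "import vllm; print(vllm.__version__)"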

3.2 Using CPU as the vLLM Backend

3.2.1 System Requirements

  • Operating system: Linux
  • Python version: 3.9 ~ 3.12
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)

3.2.2 Installing Build Dependencies

vLLM currently does not provide prebuilt packages or images for the CPU backend, so it must be compiled from source.

uv venv vllm-cpu --python 3.12 --seed
source vllm-cpu/bin/activate

First, install the recommended compiler. Using gcc/g++ >= 12.3.0 as the default compiler is advised to avoid potential problems. For example, on Ubuntu 22.04 you can run:

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

Next, clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Then install the Python packages required to build the vLLM CPU backend:

pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch

3.2.3 Installing vLLM

Finally, build and install the vLLM CPU backend:

VLLM_TARGET_DEVICE=cpu python setup.py install
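
The CPU backend reads a few environment variables at startup. As the server log later in this article shows, VLLM_CPU_KVCACHE_SPACE defaults to 4 GiB when unset; a sketch of launching with a larger KV cache is shown below (the value 8 is just an example, not a recommendation).

# Reserve 8 GiB for the KV cache on the CPU backend (example value)
VLLM_CPU_KVCACHE_SPACE=8 vllm serve Qwen/Qwen2.5-1.5B-Instruct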

3.3 Running vLLM with Docker

The vLLM project provides official Docker images for deployment. Run the following command to start vLLM with the GPU backend via Docker:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct

The command's parameters are explained below:

  • --runtime nvidia: Use the NVIDIA container runtime, which is required for containers that need GPU access.
  • --gpus all: Give the container access to all GPUs on the host. To use specific GPUs, pass their IDs, e.g. --gpus '"device=0,1"' selects GPUs 0 and 1 (note the extra single quotes around the device=... part; without them you will see the error: cannot set both Count and DeviceIDs on device request.). To request a fixed number of GPUs instead, use e.g. --gpus 2 for two GPUs.
  • -v ~/.cache/huggingface:/root/.cache/huggingface: Mount the host's Hugging Face cache directory into the container, so already-downloaded models can be reused after a container restart, saving time and network traffic.
  • -p 8000:8000: Map port 8000 in the container to port 8000 on the host, making the vLLM service reachable at http://localhost:8000.
  • --ipc=host: You can use the --ipc=host flag or the --shm-size flag to give the container access to the host's shared memory. vLLM uses PyTorch, which passes data between processes via shared memory under the hood, especially during tensor-parallel inference. This flag is required when running inference across multiple GPUs.
  • vllm/vllm-openai:latest: The Docker image to use; latest refers to the most recent vLLM OpenAI-compatible server image.
  • --model Qwen/Qwen2.5-1.5B-Instruct: An argument passed through to the vLLM server, specifying Qwen/Qwen2.5-1.5B-Instruct as the model to load.

To run vLLM with the CPU backend, you can use the image from the vLLM CPU release repository (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo), as shown below.

docker run --rm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5 \
    --model=Qwen/Qwen2.5-1.5B-Instruct
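
Once either container is up, you can check that the server is ready before sending real requests. The /health route appears in the server's route table later in this article; this check is just a small convenience, not an official requirement.

# Returns HTTP 200 once the engine has finished loading the model
curl -i http://localhost:8000/health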

vLLM offers two main inference modes: offline inference and online serving. They suit different application scenarios and requirements, and their differences are described below.

4 Differences Between Offline and Online Inference

The main differences between offline inference and online inference lie in their use cases, latency requirements, and resource scheduling. In brief:

4.1 Offline Inference

Definition: a batch of inputs is processed together, usually without a real-time response requirement. Its characteristics:

  • Batch processing: commonly used for large volumes of input, such as log analysis or precomputing recommendations.
  • Relaxed latency requirements: results can be returned later without hurting the user experience.
  • High resource utilization: the system can make full use of GPU/CPU resources during idle periods.
  • Example scenarios: nightly jobs that build user interest profiles, pre-generating ad copy, etc.

4.2 Online Inference

Definition: inference is performed in response to real-time user requests, and results are returned immediately. Its characteristics:

  • Real-time responses: response times are typically expected within a few hundred milliseconds.
  • Latency-sensitive: high concurrency and low latency are the core metrics.
  • Stable resource allocation: the service stays online for long periods with fixed, reserved resources.
  • Example scenarios: chatbots, search suggestions, intelligent customer service, etc.

5 Offline Inference

Once vLLM is installed, you can start generating text for a list of input prompts (i.e., offline batch inference). The following code is the example provided in the official vLLM documentation:

# basic.py
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

In the code above, SamplingParams specifies the parameters of the sampling process: the sampling temperature is set to 0.8 and the nucleus sampling probability to 0.95. Here is what these two parameters do:

Sampling temperature and nucleus sampling probability (top-p) are two common sampling parameters used when large language models generate text; they control the diversity and quality of the output.

  • Sampling temperature

    • Purpose: controls the "randomness" or "creativity" of the generated text.
    • How it works: the temperature rescales the model's output probability distribution. A lower temperature (e.g., 0.5) makes high-probability tokens even more likely to be chosen, producing more deterministic and repetitive output; a higher temperature (e.g., 1.2) gives low-probability tokens a better chance, producing more diverse but potentially less coherent text.
    • Concretely, temperature=0.8 sharpens the probability distribution slightly: compared with the default of 1.0 it leans more toward high-probability tokens while still keeping some diversity.
  • Nucleus sampling probability (top-p)

    • Purpose: controls the size of the candidate token set considered at each generation step, dynamically balancing diversity and plausibility.
    • How it works: top-p (nucleus sampling) sorts all tokens by probability from high to low and accumulates their probabilities until the total first exceeds 0.95; sampling then happens only within this "core" set of tokens.

Here is an example of how the two sampling parameters interact. Suppose the model's next token could be "cat, dog, tiger, elephant, monkey, turtle, eagle, crocodile, ant, …" (out of tens of thousands of tokens):

  • Temperature adjusts each token's probability (if "cat" has a 30% probability, you can make it larger or smaller);
  • Top-p sorts all tokens and keeps only the leading tokens whose cumulative probability reaches 95%, say the first 7, and then samples one of them.
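
To make the interaction concrete, here is a small, self-contained sketch of how temperature scaling and top-p filtering are typically applied to a toy distribution. It is purely illustrative and is not how vLLM implements sampling internally; the logits are made-up values.

# toy_sampling.py -- illustrative sketch of temperature + top-p, not vLLM's internals
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95):
    # Temperature scaling: divide logits before softmax.
    # temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

# Hypothetical logits for 5 candidate tokens ("cat", "dog", "tiger", "elephant", "monkey")
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0]))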

The outputs are produced by the llm.generate method. It adds the input prompts to the vLLM engine's waiting queue and invokes the engine to generate outputs with high throughput. The results are returned as a list of RequestOutput objects, each containing the complete output tokens.

Run the program:

python basic.py

# Output
INFO 05-09 22:19:49 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:20:08 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
INFO 05-09 22:20:13 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 6.99MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.90MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 13.1MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 5.21MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.70MB/s]
INFO 05-09 22:20:18 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:20:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78d0197039e0>
INFO 05-09 22:20:21 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:20:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:20:21 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:20:21 [gpu_model_runner.py:1329] Starting to load model facebook/opt-125m...
INFO 05-09 22:20:22 [weight_utils.py:265] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:27<00:00, 8.96MB/s]
INFO 05-09 22:20:51 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 28.962338 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.84it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.83it/s]
INFO 05-09 22:20:51 [loader.py:458] Loading weights took 0.17 seconds
INFO 05-09 22:20:51 [gpu_model_runner.py:1347] Model loading took 0.2389 GiB and 29.934842 seconds
INFO 05-09 22:20:53 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/4f59edd943/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:20:53 [backends.py:430] Dynamo bytecode transform time: 2.05 s
INFO 05-09 22:20:55 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:20:59 [backends.py:148] Compiling a graph for general shape takes 6.06 s
INFO 05-09 22:21:02 [monitor.py:33] torch.compile takes 8.11 s in total
INFO 05-09 22:21:02 [kv_cache_utils.py:634] GPU KV cache size: 549,216 tokens
INFO 05-09 22:21:02 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 268.17x
INFO 05-09 22:21:17 [gpu_model_runner.py:1686] Graph capturing finished in 15 secs, took 0.20 GiB
INFO 05-09 22:21:17 [core.py:159] init engine (profile, create kv cache, warmup model) took 26.22 seconds
INFO 05-09 22:21:17 [core_client.py:439] Core engine process 0 ready.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32.86it/s, est. speed input: 213.68 toks/s, output: 525.96 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Paul D. I'm 28 years old. I'm a girl, but I"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' facing the same problems as a former CIA director, who has been accused of repeatedly'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' exploding with touristy people.\nWhat do you mean?  Did you mean'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' set for a big shift in the near term. The world is in the hands'
------------------------------------------------------------

By default, vLLM downloads models from Hugging Face. If you want to use models from ModelScope instead, set the VLLM_USE_MODELSCOPE environment variable.
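
For example (setting the variable to True reflects the commonly documented usage; double-check against the vLLM docs for your version):

VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen2.5-1.5B-Instruct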

6 Online Serving

vLLM can be deployed as a server that implements the OpenAI API protocol. By default, the server starts at http://localhost:8000; the address can be changed with the --host and --port arguments. The server currently hosts a single model at a time and implements endpoints such as list models and create chat completion.

Run the following command to start the vLLM server with the Qwen/Qwen2.5-1.5B-Instruct model:

vllm serve Qwen/Qwen2.5-1.5B-Instruct
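
If you need the server on a different address or port, pass the flags mentioned above (the values here are arbitrary examples):

vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080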

The startup output looks like this (GPU backend):

INFO 05-09 22:51:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:18 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-09 22:51:18 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7e5c1f36b740>)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 660/660 [00:00<00:00, 6.78MB/s]
INFO 05-09 22:51:33 [config.py:717] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 05-09 22:51:38 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30k/7.30k [00:00<00:00, 6.34MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 3.32MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 2.64MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 8.10MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242/242 [00:00<00:00, 2.76MB/s]
INFO 05-09 22:51:51 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:56 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:51:57 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78bea674d970>
INFO 05-09 22:51:58 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:51:58 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:51:58 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:51:58 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
INFO 05-09 22:51:59 [weight_utils.py:265] Using model weights format ['*.safetensors']
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [03:44<00:00, 13.8MB/s]
INFO 05-09 22:55:44 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-1.5B-Instruct: 225.167288 seconds
INFO 05-09 22:55:45 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
INFO 05-09 22:55:45 [loader.py:458] Loading weights took 0.50 seconds
INFO 05-09 22:55:45 [gpu_model_runner.py:1347] Model loading took 2.8871 GiB and 226.791553 seconds
INFO 05-09 22:55:52 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/c822c41d5a/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:55:52 [backends.py:430] Dynamo bytecode transform time: 6.61 s
INFO 05-09 22:55:54 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:56:17 [backends.py:148] Compiling a graph for general shape takes 24.18 s
INFO 05-09 22:56:26 [monitor.py:33] torch.compile takes 30.79 s in total
INFO 05-09 22:56:27 [kv_cache_utils.py:634] GPU KV cache size: 567,984 tokens
INFO 05-09 22:56:27 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 17.33x
INFO 05-09 22:56:47 [gpu_model_runner.py:1686] Graph capturing finished in 21 secs, took 0.45 GiB
INFO 05-09 22:56:47 [core.py:159] init engine (profile, create kv cache, warmup model) took 62.02 seconds
INFO 05-09 22:56:47 [core_client.py:439] Core engine process 0 ready.
WARNING 05-09 22:56:48 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-09 22:56:48 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-09 22:56:48 [launcher.py:28] Available routes are:
INFO 05-09 22:56:48 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /health, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /load, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /version, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [49305]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The startup output looks like this (CPU backend):

INFO 05-10 11:48:14 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:27 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:27 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:27 [config.py:2019] max_model_len was is not set. Defaulting to arbitrary value of 8192.
WARNING 05-10 11:48:27 [config.py:2025] max_num_seqs was is not set. Defaulting to arbitrary value of 128.
INFO 05-10 11:48:29 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:29 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:30 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:30 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:31 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [api_server.py:1044] vLLM API server version 0.8.5.dev572+g246e3e0a3
INFO 05-10 11:48:32 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:32 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:32 [cli_args.py:297] non-default args: {'model': 'Qwen/Qwen2.5-1.5B-Instruct'}
INFO 05-10 11:48:41 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 05-10 11:48:41 [arg_utils.py:1533] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 05-10 11:48:41 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:41 [cpu.py:118] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 05-10 11:48:41 [cpu.py:131] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 05-10 11:48:41 [api_server.py:247] Started engine process with PID 23528
INFO 05-10 11:48:45 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:47 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev572+g246e3e0a3) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True, 
INFO 05-10 11:48:49 [cpu.py:57] Using Torch SDPA backend.
INFO 05-10 11:48:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-10 11:48:49 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 05-10 11:48:50 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.56it/s]
INFO 05-10 11:48:50 [default_loader.py:278] Loading weights took 0.19 seconds
INFO 05-10 11:48:50 [executor_base.py:112] # cpu blocks: 9362, # CPU blocks: 0
INFO 05-10 11:48:50 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 05-10 11:48:50 [llm_engine.py:435] init engine (profile, create kv cache, warmup model) took 0.15 seconds
WARNING 05-10 11:48:51 [config.py:1283] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-10 11:48:51 [serving_chat.py:116] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [api_server.py:1091] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-10 11:48:51 [launcher.py:28] Available routes are:
INFO 05-10 11:48:51 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /health, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /load, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /version, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [23290]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

This server accepts requests in the same format as the OpenAI API. For example, to list the models:

curl -sS http://localhost:8000/v1/models | jq

# Output
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-1.5B-Instruct",
      "object": "model",
      "created": 1746802639,
      "owned_by": "vllm",
      "root": "Qwen/Qwen2.5-1.5B-Instruct",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-63355d89a63946e58de0095fa60cd62b",
          "object": "model_permission",
          "created": 1746802639,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

6.1 Using vLLM's OpenAI Completions API

You can query the model with an input prompt:

curl -sS http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a"}' | jq

# Output
{
  "id": "cmpl-8bca6ac3579a41178ae70c6c5ba1d6c6",
  "object": "text_completion",
  "created": 1746802658,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " city in the state of California, United States. It is the county seat of",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 20,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  }
}

Since the server is OpenAI API compatible, you can also send requests directly with the openai Python SDK.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Running the code above produces the following output:

Completion result: Completion(id='cmpl-4b2b13ce66c24de3afc73ee67dc94607', 
choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, 
text=' city of contrasts. It has its share of luxury and wealth, but it also', 
stop_reason=None, prompt_logprobs=None)], created=1746802708, 
model='Qwen/Qwen2.5-1.5B-Instruct', object='text_completion', system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=16, prompt_tokens=4, total_tokens=20, 
completion_tokens_details=None, prompt_tokens_details=None))

6.2 Using vLLM's OpenAI Chat Completions API

vLLM also supports OpenAI's Chat Completions API. This API offers a more dynamic, interactive way to talk to the model: it supports back-and-forth conversation and keeps context in the chat history. It is especially useful for tasks that need coherent context or more detailed explanations, because the model can take earlier turns into account and respond more coherently and personally.

curl -sS http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq

# Output
{
  "id": "chatcmpl-ff58ca762837455e8388a033245dc254",
  "object": "chat.completion",
  "created": 1746802744,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "The New York Yankees won the World Series in 2020, defeating the Tampa Bay Rays in seven games.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 56,
    "completion_tokens": 25,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

You can likewise send requests with the OpenAI Python SDK.

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)

The response is as follows:

Chat response: ChatCompletion(id='chatcmpl-3ba198d6ca6649ea8e621d6e51abc30a', 
choices=[Choice(finish_reason='stop', index=0, logprobs=None, 
message=ChatCompletionMessage(content="Sure, here's one for you:\n\nWhy couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!", 
refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], 
reasoning_content=None), stop_reason=None)], created=1746802826, model='Qwen/Qwen2.5-1.5B-Instruct', 
object='chat.completion', service_tier=None, system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=26, prompt_tokens=24, total_tokens=50, 
completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
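
Since the Chat Completions API keeps context through the messages list, a follow-up turn simply appends the assistant's previous reply and the new user message. The sketch below continues the joke conversation above and reuses the client and chat_response objects from the previous snippet.

# Continue the conversation: append the assistant's reply and ask a follow-up.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": chat_response.choices[0].message.content},
    {"role": "user", "content": "Explain why that joke is funny."},
]
follow_up = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=messages,
)
print("Follow-up response:", follow_up.choices[0].message.content)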

7 Performance Comparison: vLLM on GPU vs. CPU Backends

There is a significant performance gap between the CPU and GPU backends, because vLLM's architecture was originally designed and optimized for GPUs; the CPU backend needs a series of optimizations to run efficiently. In addition, GPUs offer far more parallel compute, which gives them a clear advantage over CPUs for LLM inference. Below is a brief comparison of vLLM's inference performance on the two backends.

The test server used here has the following specifications:

  • CPU: 32 vCPUs
  • Memory: 188 GiB
  • GPU: NVIDIA A10

The client sends requests to the vLLM server in a loop with the following command:

for i in {1..100}; do
    echo "Request: $i"
    curl -sS http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Tell me a joke"}]}'
    echo ""
done

The vLLM startup command for the CPU backend is:

vllm serve Qwen/Qwen2.5-1.5B-Instruct 

The vLLM CPU backend generates roughly 12–15 tokens per second, as shown in the log below.

INFO 05-10 11:54:24 [metrics.py:486] Avg prompt throughput: 18.1 tokens/s, Avg generation throughput: 12.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 05-10 11:54:29 [metrics.py:486] Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 15.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

On the GPU backend, vLLM enables Prefix Caching by default (enabled by default in the V1 engine, disabled by default in V0) to improve inference performance, while the CPU backend has it disabled by default. To keep the comparison fair, add the --no-enable-prefix-caching flag to disable it manually:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --no-enable-prefix-caching

The vLLM GPU backend reaches a generation throughput of around 110 tokens per second (with prompt throughput around 135 tokens/s), roughly an order of magnitude faster than the CPU backend.

INFO 05-11 11:31:38 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 05-11 11:31:48 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
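
If you want to inspect caching and throughput yourself, the /metrics endpoint listed in the route table above exposes Prometheus-style counters. A simple way to look at cache-related metrics is shown below; the exact metric names can differ between vLLM versions.

curl -sS http://localhost:8000/metrics | grep -i cache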

8 Summary

This article walked through deploying the high-performance LLM inference framework vLLM, covering environment preparation, GPU/CPU backend configuration, and both offline inference and online serving. Finally, a hands-on test compared the two backends' inference throughput and response speed.
