AI Inference | vLLM Quick Deployment Guide

This is the first article in an AI inference series; more vLLM-related content will follow soon. It starts with deployment, covering how to install vLLM with both the GPU and CPU backends. Later articles will walk through vLLM's core features, such as PD disaggregation (prefill/decode separation), Speculative Decoding, and Prefix Caching, so stay tuned.

1 What is vLLM?

vLLM is an efficient and easy-to-use inference and serving framework for large language models (LLMs). It focuses on optimizing inference speed and throughput and is especially well suited to high-concurrency production environments. Developed by a research team at UC Berkeley, it has become one of the most popular LLM inference engines thanks to its outstanding performance.

vLLM can run on both GPUs and CPUs. This article covers how to install and run vLLM with each backend.

2 Prerequisites

2.1 Provisioning a Virtual Machine

If you do not have a local GPU environment, you can rent a GPU server from a cloud provider (such as Alibaba Cloud or Tencent Cloud).

Ubuntu 22.04 is the recommended operating system, and the GPU model can be chosen based on your actual needs. Since large language models usually take up a lot of disk space, it is advisable to allocate extra disk capacity.

2.2 Virtual Environment

uv is recommended for managing Python virtual environments. Install uv with the following commands:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

3 Installation

3.1 Using GPU as the vLLM Backend

3.1.1 System Requirements

vLLM ships with precompiled C++ and CUDA (12.6) binaries, so the following requirements apply:

  • Operating system: Linux
  • Python version: 3.9 ~ 3.12
  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx, A10, A100, L4, H100)

Note: Compute capability defines the hardware features and supported instructions of each NVIDIA GPU architecture. It determines whether certain CUDA or Tensor Core features (such as Unified Memory, Tensor Cores, or dynamic parallelism) are available; it does not directly measure a GPU's computational performance.
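
If you are not sure which compute capability your card has, the quick check below can help. It is a small sketch that assumes a CUDA-enabled PyTorch is already installed in the environment; you can also look the value up in NVIDIA's documentation or via nvidia-smi.

# check_compute_capability.py -- assumes a CUDA-enabled PyTorch install
import torch

if torch.cuda.is_available():
    # get_device_capability returns (major, minor), e.g. (8, 6) for an A10
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability: {major}.{minor}")
else:
    print("No CUDA device detected")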

3.1.2 Installing and Configuring GPU Dependencies

The following one-liner installs the required dependencies. The script installs the NVIDIA GPU Driver and the NVIDIA Container Toolkit, and configures the NVIDIA Container Runtime (needed later when running vLLM via Docker).

curl -sS https://raw.githubusercontent.com/cr7258/hands-on-lab/refs/heads/main/ai/gpu/setup/docker-only-install.sh | bash
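
After the script finishes, it is worth confirming that the driver and the container runtime are in place before moving on. The commands below are only a minimal sanity check; the exact output depends on your driver and Docker versions.

# Verify the GPU driver is loaded and the card is visible
nvidia-smi

# Verify Docker knows about the NVIDIA runtime
docker info | grep -i nvidia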

3.1.3 Installing vLLM

Create a Python virtual environment:

# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate

Install vLLM:

uv pip install vllm
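
To confirm the installation succeeded, you can import the package and print its version (a quick check, not part of the official instructions):

python -c "import vllm; print(vllm.__version__)"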

3.2 Using CPU as the vLLM Backend

3.2.1 System Requirements

  • Operating system: Linux
  • Python version: 3.9 ~ 3.12
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)

3.2.2 Installing Build Dependencies

vLLM currently does not provide prebuilt packages or images for the CPU backend, so it must be compiled from source.

uv venv vllm-cpu --python 3.12 --seed
source vllm-cpu/bin/activate

First, install the recommended compiler. Using gcc/g++ >= 12.3.0 as the default compiler is advised to avoid potential problems. For example, on Ubuntu 22.04 you can run:

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

Next, clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Then install the Python packages required to build the vLLM CPU backend:

pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch

3.2.3 Installing vLLM

Finally, build and install the vLLM CPU backend:

VLLM_TARGET_DEVICE=cpu python setup.py install
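
The CPU backend reads a few environment variables at startup. As the server log later in this article shows, VLLM_CPU_KVCACHE_SPACE defaults to 4 GiB when unset; a sketch of launching with a larger KV cache is shown below (the value 8 is just an example, not a recommendation).

# Reserve 8 GiB for the KV cache on the CPU backend (example value)
VLLM_CPU_KVCACHE_SPACE=8 vllm serve Qwen/Qwen2.5-1.5B-Instruct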

3.3 Running vLLM with Docker

The vLLM project provides official Docker images for deployment. Run the following command to start vLLM with the GPU backend via Docker:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct

The command's parameters are explained below:

  • --runtime nvidia: Use the NVIDIA container runtime, which is required for containers that need GPU access.
  • --gpus all: Give the container access to all GPUs on the host. To use specific GPUs, pass their IDs, e.g. --gpus '"device=0,1"' selects GPUs 0 and 1 (note the extra single quotes around the device=... part; without them you will see the error: cannot set both Count and DeviceIDs on device request.). To request a fixed number of GPUs instead, use e.g. --gpus 2 for two GPUs.
  • -v ~/.cache/huggingface:/root/.cache/huggingface: Mount the host's Hugging Face cache directory into the container, so already-downloaded models can be reused after a container restart, saving time and network traffic.
  • -p 8000:8000: Map port 8000 in the container to port 8000 on the host, making the vLLM service reachable at http://localhost:8000.
  • --ipc=host: You can use the --ipc=host flag or the --shm-size flag to give the container access to the host's shared memory. vLLM uses PyTorch, which passes data between processes via shared memory under the hood, especially during tensor-parallel inference. This flag is required when running inference across multiple GPUs.
  • vllm/vllm-openai:latest: The Docker image to use; latest refers to the most recent vLLM OpenAI-compatible server image.
  • --model Qwen/Qwen2.5-1.5B-Instruct: An argument passed through to the vLLM server, specifying Qwen/Qwen2.5-1.5B-Instruct as the model to load.

To run vLLM with the CPU backend, you can use the image from the vLLM CPU release repository (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo), as shown below.

docker run --rm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5 \
    --model=Qwen/Qwen2.5-1.5B-Instruct
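
Once either container is up, you can check that the server is ready before sending real requests. The /health route appears in the server's route table later in this article; this check is just a small convenience, not an official requirement.

# Returns HTTP 200 once the engine has finished loading the model
curl -i http://localhost:8000/health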

vLLM offers two main inference modes: offline inference and online serving. They suit different application scenarios and requirements, and their differences are described below.

4 Differences Between Offline and Online Inference

The main differences between offline inference and online inference lie in their use cases, latency requirements, and resource scheduling. In brief:

4.1 Offline Inference

Definition: a batch of inputs is processed together, usually without a real-time response requirement. Its characteristics:

  • Batch processing: commonly used for large volumes of input, such as log analysis or precomputing recommendations.
  • Relaxed latency requirements: results can be returned later without hurting the user experience.
  • High resource utilization: the system can make full use of GPU/CPU resources during idle periods.
  • Example scenarios: nightly jobs that build user interest profiles, pre-generating ad copy, etc.

4.2 Online Inference

Definition: inference is performed in response to real-time user requests, and results are returned immediately. Its characteristics:

  • Real-time responses: response times are typically expected within a few hundred milliseconds.
  • Latency-sensitive: high concurrency and low latency are the core metrics.
  • Stable resource allocation: the service stays online for long periods with fixed, reserved resources.
  • Example scenarios: chatbots, search suggestions, intelligent customer service, etc.

5 Offline Inference

Once vLLM is installed, you can start generating text for a list of input prompts (i.e., offline batch inference). The following code is the example provided in the official vLLM documentation:

# basic.py
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

In the code above, SamplingParams specifies the parameters of the sampling process: the sampling temperature is set to 0.8 and the nucleus sampling probability to 0.95. Here is what these two parameters do:

Sampling temperature and nucleus sampling probability (top-p) are two common sampling parameters used when large language models generate text; they control the diversity and quality of the output.

  • Sampling temperature

    • Purpose: controls the "randomness" or "creativity" of the generated text.
    • How it works: the temperature rescales the model's output probability distribution. A lower temperature (e.g., 0.5) makes high-probability tokens even more likely to be chosen, producing more deterministic and repetitive output; a higher temperature (e.g., 1.2) gives low-probability tokens a better chance, producing more diverse but potentially less coherent text.
    • Concretely, temperature=0.8 sharpens the probability distribution slightly: compared with the default of 1.0 it leans more toward high-probability tokens while still keeping some diversity.
  • Nucleus sampling probability (top-p)

    • Purpose: controls the size of the candidate token set considered at each generation step, dynamically balancing diversity and plausibility.
    • How it works: top-p (nucleus sampling) sorts all tokens by probability from high to low and accumulates their probabilities until the total first exceeds 0.95; sampling then happens only within this "core" set of tokens.

Here is an example of how the two sampling parameters interact. Suppose the model's next token could be "cat, dog, tiger, elephant, monkey, turtle, eagle, crocodile, ant, …" (out of tens of thousands of tokens):

  • Temperature adjusts each token's probability (if "cat" has a 30% probability, you can make it larger or smaller);
  • Top-p sorts all tokens and keeps only the leading tokens whose cumulative probability reaches 95%, say the first 7, and then samples one of them.
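
To make the interaction concrete, here is a small, self-contained sketch of how temperature scaling and top-p filtering are typically applied to a toy distribution. It is purely illustrative and is not how vLLM implements sampling internally; the logits are made-up values.

# toy_sampling.py -- illustrative sketch of temperature + top-p, not vLLM's internals
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95):
    # Temperature scaling: divide logits before softmax.
    # temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

# Hypothetical logits for 5 candidate tokens ("cat", "dog", "tiger", "elephant", "monkey")
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0]))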

The outputs are produced by the llm.generate method. It adds the input prompts to the vLLM engine's waiting queue and invokes the engine to generate outputs with high throughput. The results are returned as a list of RequestOutput objects, each containing the complete output tokens.

Run the program:

python basic.py

# Output
INFO 05-09 22:19:49 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:20:08 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
INFO 05-09 22:20:13 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 6.99MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.90MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 13.1MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 5.21MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.70MB/s]
INFO 05-09 22:20:18 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:20:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78d0197039e0>
INFO 05-09 22:20:21 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:20:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:20:21 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:20:21 [gpu_model_runner.py:1329] Starting to load model facebook/opt-125m...
INFO 05-09 22:20:22 [weight_utils.py:265] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:27<00:00, 8.96MB/s]
INFO 05-09 22:20:51 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 28.962338 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.84it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.83it/s]
INFO 05-09 22:20:51 [loader.py:458] Loading weights took 0.17 seconds
INFO 05-09 22:20:51 [gpu_model_runner.py:1347] Model loading took 0.2389 GiB and 29.934842 seconds
INFO 05-09 22:20:53 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/4f59edd943/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:20:53 [backends.py:430] Dynamo bytecode transform time: 2.05 s
INFO 05-09 22:20:55 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:20:59 [backends.py:148] Compiling a graph for general shape takes 6.06 s
INFO 05-09 22:21:02 [monitor.py:33] torch.compile takes 8.11 s in total
INFO 05-09 22:21:02 [kv_cache_utils.py:634] GPU KV cache size: 549,216 tokens
INFO 05-09 22:21:02 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 268.17x
INFO 05-09 22:21:17 [gpu_model_runner.py:1686] Graph capturing finished in 15 secs, took 0.20 GiB
INFO 05-09 22:21:17 [core.py:159] init engine (profile, create kv cache, warmup model) took 26.22 seconds
INFO 05-09 22:21:17 [core_client.py:439] Core engine process 0 ready.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32.86it/s, est. speed input: 213.68 toks/s, output: 525.96 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Paul D. I'm 28 years old. I'm a girl, but I"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' facing the same problems as a former CIA director, who has been accused of repeatedly'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' exploding with touristy people.\nWhat do you mean?  Did you mean'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' set for a big shift in the near term. The world is in the hands'
------------------------------------------------------------

By default, vLLM downloads models from Hugging Face. If you want to use models from ModelScope instead, set the VLLM_USE_MODELSCOPE environment variable.
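
For example (setting the variable to True reflects the commonly documented usage; double-check against the vLLM docs for your version):

VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen2.5-1.5B-Instruct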

6 Online Serving

vLLM can be deployed as a server that implements the OpenAI API protocol. By default, the server starts at http://localhost:8000; the address can be changed with the --host and --port arguments. The server currently hosts a single model at a time and implements endpoints such as list models and create chat completion.

Run the following command to start the vLLM server with the Qwen/Qwen2.5-1.5B-Instruct model:

vllm serve Qwen/Qwen2.5-1.5B-Instruct
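
If you need the server on a different address or port, pass the flags mentioned above (the values here are arbitrary examples):

vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080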

The startup output looks like this (GPU backend):

INFO 05-09 22:51:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:18 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-09 22:51:18 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7e5c1f36b740>)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 660/660 [00:00<00:00, 6.78MB/s]
INFO 05-09 22:51:33 [config.py:717] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 05-09 22:51:38 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30k/7.30k [00:00<00:00, 6.34MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 3.32MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 2.64MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 8.10MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242/242 [00:00<00:00, 2.76MB/s]
INFO 05-09 22:51:51 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:56 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:51:57 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78bea674d970>
INFO 05-09 22:51:58 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:51:58 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:51:58 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:51:58 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
INFO 05-09 22:51:59 [weight_utils.py:265] Using model weights format ['*.safetensors']
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [03:44<00:00, 13.8MB/s]
INFO 05-09 22:55:44 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-1.5B-Instruct: 225.167288 seconds
INFO 05-09 22:55:45 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
INFO 05-09 22:55:45 [loader.py:458] Loading weights took 0.50 seconds
INFO 05-09 22:55:45 [gpu_model_runner.py:1347] Model loading took 2.8871 GiB and 226.791553 seconds
INFO 05-09 22:55:52 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/c822c41d5a/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:55:52 [backends.py:430] Dynamo bytecode transform time: 6.61 s
INFO 05-09 22:55:54 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:56:17 [backends.py:148] Compiling a graph for general shape takes 24.18 s
INFO 05-09 22:56:26 [monitor.py:33] torch.compile takes 30.79 s in total
INFO 05-09 22:56:27 [kv_cache_utils.py:634] GPU KV cache size: 567,984 tokens
INFO 05-09 22:56:27 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 17.33x
INFO 05-09 22:56:47 [gpu_model_runner.py:1686] Graph capturing finished in 21 secs, took 0.45 GiB
INFO 05-09 22:56:47 [core.py:159] init engine (profile, create kv cache, warmup model) took 62.02 seconds
INFO 05-09 22:56:47 [core_client.py:439] Core engine process 0 ready.
WARNING 05-09 22:56:48 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-09 22:56:48 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-09 22:56:48 [launcher.py:28] Available routes are:
INFO 05-09 22:56:48 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /health, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /load, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /version, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [49305]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The startup output looks like this (CPU backend):

INFO 05-10 11:48:14 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:27 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:27 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:27 [config.py:2019] max_model_len was is not set. Defaulting to arbitrary value of 8192.
WARNING 05-10 11:48:27 [config.py:2025] max_num_seqs was is not set. Defaulting to arbitrary value of 128.
INFO 05-10 11:48:29 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:29 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:30 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:30 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:31 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [api_server.py:1044] vLLM API server version 0.8.5.dev572+g246e3e0a3
INFO 05-10 11:48:32 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:32 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:32 [cli_args.py:297] non-default args: {'model': 'Qwen/Qwen2.5-1.5B-Instruct'}
INFO 05-10 11:48:41 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 05-10 11:48:41 [arg_utils.py:1533] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 05-10 11:48:41 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:41 [cpu.py:118] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 05-10 11:48:41 [cpu.py:131] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 05-10 11:48:41 [api_server.py:247] Started engine process with PID 23528
INFO 05-10 11:48:45 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:47 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev572+g246e3e0a3) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True, 
INFO 05-10 11:48:49 [cpu.py:57] Using Torch SDPA backend.
INFO 05-10 11:48:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-10 11:48:49 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 05-10 11:48:50 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.56it/s]
INFO 05-10 11:48:50 [default_loader.py:278] Loading weights took 0.19 seconds
INFO 05-10 11:48:50 [executor_base.py:112] # cpu blocks: 9362, # CPU blocks: 0
INFO 05-10 11:48:50 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 05-10 11:48:50 [llm_engine.py:435] init engine (profile, create kv cache, warmup model) took 0.15 seconds
WARNING 05-10 11:48:51 [config.py:1283] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-10 11:48:51 [serving_chat.py:116] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [api_server.py:1091] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-10 11:48:51 [launcher.py:28] Available routes are:
INFO 05-10 11:48:51 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /health, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /load, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /version, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [23290]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

This server accepts requests in the same format as the OpenAI API. For example, to list the models:

curl -sS http://localhost:8000/v1/models | jq

# Output
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-1.5B-Instruct",
      "object": "model",
      "created": 1746802639,
      "owned_by": "vllm",
      "root": "Qwen/Qwen2.5-1.5B-Instruct",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-63355d89a63946e58de0095fa60cd62b",
          "object": "model_permission",
          "created": 1746802639,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

6.1 Using vLLM's OpenAI Completions API

You can query the model with an input prompt:

curl -sS http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a"}' | jq

# Output
{
  "id": "cmpl-8bca6ac3579a41178ae70c6c5ba1d6c6",
  "object": "text_completion",
  "created": 1746802658,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " city in the state of California, United States. It is the county seat of",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 20,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  }
}

Since the server is OpenAI API compatible, you can also send requests directly with the openai Python SDK.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Running the code above produces the following output:

Completion result: Completion(id='cmpl-4b2b13ce66c24de3afc73ee67dc94607', 
choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, 
text=' city of contrasts. It has its share of luxury and wealth, but it also', 
stop_reason=None, prompt_logprobs=None)], created=1746802708, 
model='Qwen/Qwen2.5-1.5B-Instruct', object='text_completion', system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=16, prompt_tokens=4, total_tokens=20, 
completion_tokens_details=None, prompt_tokens_details=None))

6.2 Using vLLM's OpenAI Chat Completions API

vLLM also supports OpenAI's Chat Completions API. This API offers a more dynamic, interactive way to talk to the model: it supports back-and-forth conversation and keeps context in the chat history. It is especially useful for tasks that need coherent context or more detailed explanations, because the model can take earlier turns into account and respond more coherently and personally.

curl -sS http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq

# Output
{
  "id": "chatcmpl-ff58ca762837455e8388a033245dc254",
  "object": "chat.completion",
  "created": 1746802744,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "The New York Yankees won the World Series in 2020, defeating the Tampa Bay Rays in seven games.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 56,
    "completion_tokens": 25,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

You can likewise send requests with the OpenAI Python SDK.

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)

The response is as follows:

Chat response: ChatCompletion(id='chatcmpl-3ba198d6ca6649ea8e621d6e51abc30a', 
choices=[Choice(finish_reason='stop', index=0, logprobs=None, 
message=ChatCompletionMessage(content="Sure, here's one for you:\n\nWhy couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!", 
refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], 
reasoning_content=None), stop_reason=None)], created=1746802826, model='Qwen/Qwen2.5-1.5B-Instruct', 
object='chat.completion', service_tier=None, system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=26, prompt_tokens=24, total_tokens=50, 
completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
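
Since the Chat Completions API keeps context through the messages list, a follow-up turn simply appends the assistant's previous reply and the new user message. The sketch below continues the joke conversation above and reuses the client and chat_response objects from the previous snippet.

# Continue the conversation: append the assistant's reply and ask a follow-up.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": chat_response.choices[0].message.content},
    {"role": "user", "content": "Explain why that joke is funny."},
]
follow_up = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=messages,
)
print("Follow-up response:", follow_up.choices[0].message.content)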

7 Performance Comparison: vLLM on GPU vs. CPU Backends

There is a significant performance gap between the CPU and GPU backends, because vLLM's architecture was originally designed and optimized for GPUs; the CPU backend needs a series of optimizations to run efficiently. In addition, GPUs offer far more parallel compute, which gives them a clear advantage over CPUs for LLM inference. Below is a brief comparison of vLLM's inference performance on the two backends.

The test server used here has the following specifications:

  • CPU: 32 vCPUs
  • Memory: 188 GiB
  • GPU: NVIDIA A10

The client sends requests to the vLLM server in a loop with the following command:

for i in {1..100}; do
    echo "Request: $i"
    curl -sS http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Tell me a joke"}]}'
    echo ""
done

The vLLM startup command for the CPU backend is:

vllm serve Qwen/Qwen2.5-1.5B-Instruct 

The vLLM CPU backend generates roughly 12–15 tokens per second, as shown in the log below.

INFO 05-10 11:54:24 [metrics.py:486] Avg prompt throughput: 18.1 tokens/s, Avg generation throughput: 12.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 05-10 11:54:29 [metrics.py:486] Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 15.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

On the GPU backend, vLLM enables Prefix Caching by default (enabled by default in the V1 engine, disabled by default in V0) to improve inference performance, while the CPU backend has it disabled by default. To keep the comparison fair, add the --no-enable-prefix-caching flag to disable it manually:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --no-enable-prefix-caching

The vLLM GPU backend reaches a generation throughput of around 110 tokens per second (with prompt throughput around 135 tokens/s), roughly an order of magnitude faster than the CPU backend.

INFO 05-11 11:31:38 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 05-11 11:31:48 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
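
If you want to inspect caching and throughput yourself, the /metrics endpoint listed in the route table above exposes Prometheus-style counters. A simple way to look at cache-related metrics is shown below; the exact metric names can differ between vLLM versions.

curl -sS http://localhost:8000/metrics | grep -i cache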

8 Summary

This article walked through deploying the high-performance LLM inference framework vLLM, covering environment preparation, GPU/CPU backend configuration, and both offline inference and online serving. Finally, a hands-on test compared the two backends' inference throughput and response speed.
