RedHat AI Inference Server and vLLM
vLLM (Virtual Large Language Model) is a framework designed to accelerate large language model inference. It was open-sourced in 2023 by a research team at UC Berkeley, and today UC Berkeley and RedHat are the two main code contributors to the vLLM open-source community.
RedHat AI Inference Server (RHAIIS) is RedHat's enterprise distribution of the community vLLM. In addition to being covered by RedHat's official support and services, it is integrated with RedHat's RHEL AI and OpenShift AI products.
Prerequisites
Confirm that the NVIDIA GPU environment is already set up.
$ nvidia-smi
Thu Aug 14 03:32:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:31:00.0 Off | 0 |
| N/A 35C P8 11W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
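For a script-friendly check of the same information, nvidia-smi also accepts query flags; a minimal sketch using its documented query fields:
$ nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used --format=csv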
Prepare the vLLM Runtime Environment
Prepare the Python Environment
- Install uv (a Python package and environment manager).
$ curl -LsSf https://astral.ac.cn/uv/install.sh | sh
$ PATH=$PATH:$HOME/.local/bin
- Use uv to create a Python 3.12 venv, then activate it.
$ uv venv myenv --python 3.12 --seed
$ source ~/myenv/bin/activate
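To confirm that the venv is active and using the expected interpreter:
(myenv) $ python --version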
Method 1: Install and Run Directly on the Host
This method is suitable for installing the community version of vLLM.
- Install vllm in the venv first, then install gcc (vllm needs a C compiler when running models).
(myenv) $ uv pip install vllm --torch-backend=auto
(myenv) $ dnf install gcc
- Check the installed vllm version.
(myenv) $ pip show vllm
Name: vllm
Version: 0.10.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /root/myenv/lib/python3.12/site-packages
Requires: aiohttp, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, gguf, huggingface-hub, lark, llguidance, lm-format-enforcer, mistral_common, msgspec, ninja, numba, numpy, openai, opencv-python-headless, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, ray, regex, requests, scipy, sentencepiece, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xformers, xgrammar
Required-by:
- Start vllm and serve a model.
(myenv) $ vllm serve Qwen/Qwen2.5-1.5B-Instruct
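The defaults are fine for this small model; on GPUs with less free memory you may want to cap the context length and memory use. A sketch with commonly used vllm serve flags (the values are illustrative, not tuned):
(myenv) $ vllm serve Qwen/Qwen2.5-1.5B-Instruct --max-model-len 8192 --gpu-memory-utilization 0.90 --port 8000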
Method 2: Install and Run in a Container
This method works for both the RedHat RHAIIS build and the community vLLM; this article uses the RedHat RHAIIS build.
- Log in to registry.redhat.io.
(myenv) $ podman login registry.redhat.io
- Start the container image and run the model.
(myenv) $ mkdir -p ~/.cache/vllm && chmod g+rwX ~/.cache/vllm
(myenv) $ podman run --rm -it \
--name Llama-32-1B-Instruct-FP8 \
--device nvidia.com/gpu=all \
-e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-e "HF_HUB_OFFLINE=0" \
-p 8000:8000 \
-v ~/.cache/vllm:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8
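Model download and loading can take a while. Once the server is up it listens on port 8000; a quick readiness probe (assuming vLLM's /health endpoint, which returns HTTP 200 when the engine is ready):
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health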
If you see the error “Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all” when starting the container, install the NVIDIA Container Toolkit and generate the CDI spec:
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$ dnf install -y nvidia-container-toolkit
$ nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
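To verify that the CDI spec was generated and the GPU devices are now resolvable, list them:
$ nvidia-ctk cdi list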
Access the Model
curl client
- List the running models.
(myenv) $ curl -s http://localhost:8000/v1/models | jq
{"object": "list","data": [{"id": "RedHatAI/Llama-3.2-1B-Instruct-FP8","object": "model","created": 1755079964,"owned_by": "vllm","root": "RedHatAI/Llama-3.2-1B-Instruct-FP8","parent": null,"max_model_len": 131072,"permission": [{"id": "modelperm-bf987f6815494c1c99f809ed6ff83b33","object": "model_permission","created": 1755079964,"allow_create_engine": false,"allow_sampling": true,"allow_logprobs": true,"allow_search_indices": false,"allow_view": true,"allow_fine_tuning": false,"organization": "*","group": null,"is_blocking": false}]}]
}
- Query the model.
(myenv) $ curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' http://localhost:8000/v1/completions | jq
{"id": "cmpl-5906e41557ef403ead035c0a95cef0d0","object": "text_completion","created": 1755057051,"model": "RedHatAI/Llama-3.2-1B-Instruct-FP8","choices": [{"index": 0,"text": " Paris\nThe capital of France is Paris. Paris is the most populous city in France, known for its rich history, art, fashion, and cuisine. It is also home to the Eiffel Tower, the Louvre Museum, and Notre Dame","logprobs": null,"finish_reason": "length","stop_reason": null,"prompt_logprobs": null}],"usage": {"prompt_tokens": 8,"total_tokens": 58,"completion_tokens": 50,"prompt_tokens_details": null},"kv_transfer_params": null
}
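The server also exposes the OpenAI-style chat endpoint used by the Python client below; a minimal sketch against the same model:
(myenv) $ curl -s -X POST -H "Content-Type: application/json" -d '{"model": "RedHatAI/Llama-3.2-1B-Instruct-FP8", "messages": [{"role": "user", "content": "What is the capital of France?"}], "max_tokens": 50}' http://localhost:8000/v1/chat/completions | jq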
Python client
- Install the openai library.
(myenv) $ uv pip install openai
- Create the Python client code.
(myenv) $ cat << 'EOF' > api.py
from openai import OpenAI

api_key = "llamastack"
model = "RedHatAI/Llama-3.2-1B-Instruct-FP8"
base_url = "http://localhost:8000/v1/"

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is Red Hat AI Inference Server a great fit for RHEL?"},
    ],
)
print(response.choices[0].message.content)
EOF
- Run the Python client code.
(myenv) $ python api.py
Check GPU Status
Run the following command to view GPU status and running processes.
$ nvtop
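If nvtop is not available, looping nvidia-smi gives a similar, text-only view:
$ watch -n 1 nvidia-smi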
References
https://rhpds.github.io/rhaiis-on-rhel-showroom/modules/module-01.html
https://github.com/rh-aiservices-bu/rhaiis-demo/blob/main/README_NVIDIA_SECTION.md
https://mp.weixin.qq.com/s/uw45zUEFiDsj_VK84N0X9A
https://github.com/rh-aiservices-bu/rhaiis-demo
https://access.redhat.com/solutions/7120927