Contents
0 Introduction
1 What is Triton Inference Server
2 Deploying Triton Inference Server
2.1 Download the server
2.2 Download the models
2.3 Hands-on test
3 Building Triton Inference Server
3.1 Which repos are needed for the build
3.2 What the build process does
3.3 Trying the build
4 An overall look at the backend mechanism via the README
4.1 What is a backend
4.2 Backend Shared Library
4.3 How to add your backend to the released Triton Docker image
4.3.1 Manual mode
4.3.2 Automatic mode
0 Introduction
I have recently been looking at Triton, so I spent some time studying the source code, the official README documentation, and related community material. By downloading, deploying, and building Triton Server, and reading the source alongside, I worked through its backend plugin mechanism, model loading flow, and lifecycle management, and gained an initial understanding of its overall architecture and workflow.
This article is a write-up based on reading the official Triton repositories on GitHub, studying the official documentation, and using ChatGPT to help analyze the source code and technical details. It covers installing and using Triton Server, the implementation details of the backend plugin mechanism, the model loading process, and the practice of developing a custom C++ backend, and is meant as a reference for engineers and developers who want to understand Triton's internals.
Since my own understanding of Triton is still a work in progress, there are bound to be shortcomings or omissions; corrections and discussion are very welcome.
1 What is Triton Inference Server
Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real time, batched, ensembles and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
The GitHub readme contains the paragraph above introducing Triton Inference Server. In my own, more colloquial words:
My plain understanding: Triton Inference Server is a service whose underlying model inference is carried out by different backends such as TensorRT, PyTorch, or ONNX Runtime. A client sends requests (for example inference requests) to the Triton server over HTTP; the server dispatches them through the backend API to the concrete inference backend, and the inference results are returned to the client.
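To make that request/response flow concrete, here is a minimal client-side sketch in Python. It assumes the tritonclient package is installed (pip install tritonclient[http]) and that a Triton server is already listening on localhost:8000; it only checks server health and lists the loaded models.

# Minimal sketch, assuming `pip install tritonclient[http]` and a server on localhost:8000
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print(client.is_server_live())               # is the server process up?
print(client.is_server_ready())              # is it ready to accept inference requests?
print(client.get_model_repository_index())   # which models does it know about?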
2 Deploying Triton Inference Server
The readme at https://github.com/triton-inference-server/server describes a Docker-image-based deployment, so I will try that first.
2.1 Download the server
First, clone the server repo. Strictly speaking this step is not needed for deployment, since the deployment itself uses a Docker image.
git clone https://github.com/triton-inference-server/server
2.2 Download the models
cd docs/examples
./fetch_models.sh
The contents of fetch_models.sh are shown below (I commented out the first wget). It simply downloads the models and creates a Python virtual environment. If the wget download fails when you run the script, comment out that wget line, paste the URL into a browser, download the model manually, and upload it to the server you are experimenting on.
#!/bin/bash
# Copyright (c) 2018-2025, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

set -ex

# Convert Tensorflow inception V3 module to ONNX
# Pre-requisite: Python3, venv, and Pip3 are installed on the system
mkdir -p model_repository/inception_onnx/1
#wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
# https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz
(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
python3 -m venv tf2onnx
source ./tf2onnx/bin/activate
pip3 install "numpy<2" tensorflow tf2onnx
python3 -m tf2onnx.convert --graphdef /tmp/inception_v3_2016_08_28_frozen.pb --output inception_v3_onnx.model.onnx --inputs input:0 --outputs InceptionV3/Predictions/Softmax:0
deactivate
mv inception_v3_onnx.model.onnx model_repository/inception_onnx/1/model.onnx

# ONNX densenet
mkdir -p model_repository/densenet_onnx/1
wget -O model_repository/densenet_onnx/1/model.onnx \
    https://github.com/onnx/models/raw/main/validated/vision/classification/densenet-121/model/densenet-7.onnx
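After the script finishes, the model repository should look roughly like the sketch below (only the pieces created or downloaded by the script, plus the densenet files referenced later in this post, are shown; the config.pbtxt is already checked into docs/examples/model_repository).

model_repository/
├── densenet_onnx/
│   ├── config.pbtxt          # model configuration (shown in Section 4)
│   ├── densenet_labels.txt   # label file referenced by config.pbtxt
│   └── 1/
│       └── model.onnx        # downloaded by the wget above
└── inception_onnx/
    └── 1/
        └── model.onnx        # produced by the tf2onnx conversion above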
2.3 Hands-on test
Here I run on CPU only, so I follow these steps: https://github.com/triton-inference-server/server/blob/main/docs/getting_started/quickstart.md#run-on-cpu-only-system
The steps there include commands like these:
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
But do not copy these verbatim; doing so will definitely fail. Replace /full/path/to/docs/examples/ and <xx.yy> with your own values. Also, the first line starts the server and the remaining lines run the client, so you need two terminals connected to the machine. In the first terminal, run the following command from the examples directory:
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:25.05-py3 tritonserver --model-repository=/models --model-control-mode explicit --load-model densenet_onnx
The server then prints its startup log (screenshot omitted here) and waits for requests.
In the second terminal, run the following commands:
docker pull nvcr.io/nvidia/tritonserver:25.05-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.05-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
The following result then appears:
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
    15.349563 (504) = COFFEE MUG
    13.227461 (968) = CUP
    10.424893 (505) = COFFEEPOT
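The image_client binary hides the HTTP details, so here is a rough Python counterpart as a sketch: it skips real image preprocessing and just sends a random tensor, assumes tritonclient[http] is installed, and takes the input/output names and shape from the densenet_onnx config.pbtxt shown later in Section 4.

# Rough Python counterpart of image_client (random input, no real preprocessing),
# assuming `pip install tritonclient[http]` and the server started as above.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input/output names and shape come from densenet_onnx's config.pbtxt
infer_input = httpclient.InferInput("data_0", [3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.random.rand(3, 224, 224).astype(np.float32))

result = client.infer(
    model_name="densenet_onnx",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("fc6_1", class_count=3)],
)
print(result.as_numpy("fc6_1"))  # top-3 classes, analogous to image_client -c 3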
3 Building Triton Inference Server
The hands-on part above used a prebuilt Docker image to try Triton out. Next, let's look at how Triton is built. I will not actually run the full build here; the goal is simply to learn as much as possible about Triton from its build process.
3.1 Which repos are needed for the build
At first I assumed the server repo alone was enough, but that is not the case: the build needs the whole series of repos below. You can clone them manually, or let build.py download them automatically when it runs.
| Module category | Example repo | Purpose | Pulled automatically during build |
|---|---|---|---|
| Server main body | server | Triton server main program and framework | No, you clone it manually |
| Core component | core | Triton core logic and scheduling | Yes |
| Common module | common | Shared utilities and infrastructure | Yes |
| Backend interface layer | backend | Unified backend interface definition | Yes |
| Inference backends | onnxruntime_backend, pytorch_backend, tensorrt_backend, python_backend, etc. | Concrete inference-engine backend implementations | Yes |
| Third-party dependencies | thirdparty | Source of the (patched) third-party libraries | Yes |
The inference backends in turn include the following; these repos are needed as well.
| Backend | Description |
|---|---|
| onnxruntime_backend | Runs ONNX models with ONNX Runtime |
| pytorch_backend | Runs PyTorch models with LibTorch |
| tensorflow_backend | Handles TensorFlow models (TF1/TF2) |
| tensorrt_backend | Runs models with NVIDIA TensorRT, for high-performance scenarios |
| openvino_backend | Runs inference with Intel OpenVINO on x86/edge devices |
| python_backend | Lets you write custom backend logic in Python |
| dali_backend | Image pre-processing backend, integrates NVIDIA DALI |
| fil_backend | Tree-model inference (e.g. XGBoost) with RAPIDS FIL |
| identity_backend / repeat_backend / square_backend | Example/test backends |
| ensemble_backend | Built into the core; chains multiple models into one pipeline |
3.2 What the build process does
At first I thought build.py only compiled an executable or produced .so libraries, but that is not the whole story; running build.py mainly does the following three things.
1. Automatically download the required repos
   - Based on the configuration and version tags, it pulls core, common, backend, the individual backends (onnxruntime_backend, pytorch_backend, etc.), and the third-party dependencies from GitHub.
2. Compile the executables and shared libraries
   - Build the Triton Server main program
   - Build each backend plugin (.so file)
   - Build the common libraries and core modules, producing the binaries Triton needs at runtime
3. Package everything into a Docker image
   - Bundle the compiled programs and shared libraries, together with the runtime environment (dependency libraries, configuration files, etc.), into a complete Docker image for easy deployment and distribution.
3.3 Trying the build
Here I use the --dryrun option to get a feel for the build:
python3 build.py --dryrun --enable-all
Building Triton Inference Server
platform rhel
machine x86_64
version 2.58.0dev
build dir ./triton_20250611/server/build
install dir None
cmake dir None
default repo-tag: r25.05
container version 25.05dev
upstream container version 25.05
endpoint "http"
endpoint "grpc"
endpoint "sagemaker"
endpoint "vertex-ai"
filesystem "gcs"
filesystem "s3"
filesystem "azure_storage"
backend "ensemble" at tag/branch "r25.05"
backend "identity" at tag/branch "r25.05"
backend "square" at tag/branch "r25.05"
backend "repeat" at tag/branch "r25.05"
backend "onnxruntime" at tag/branch "r25.05"
backend "python" at tag/branch "r25.05"
backend "dali" at tag/branch "r25.05"
backend "pytorch" at tag/branch "r25.05"
backend "openvino" at tag/branch "r25.05"
backend "fil" at tag/branch "r25.05"
backend "tensorrt" at tag/branch "r25.05"
repoagent "checksum" at tag/branch "r25.05"
cache "local" at tag/branch "r25.05"
cache "redis" at tag/branch "r25.05"
component "common" at tag/branch "r25.05"
component "core" at tag/branch "r25.05"
component "backend" at tag/branch "r25.05"
component "thirdparty" at tag/branch "r25.05"
Traceback (most recent call last):
  File "./triton_20250611/server/build.py", line 3162, in <module>
    create_build_dockerfiles(
  File "./triton_20250611/server/build.py", line 1696, in create_build_dockerfiles
    raise KeyError("A base image must be specified when targeting RHEL")
KeyError: 'A base image must be specified when targeting RHEL'
The output above shows that build.py does automatically download the corresponding repos. There is also an error at the end, so I modified build.py:
def create_build_dockerfiles(
    container_build_dir, images, backends, repoagents, caches, endpoints
):
    if "base" in images:
        base_image = images["base"]
        if target_platform() == "rhel":
            print(
                "warning: RHEL is not an officially supported target and you will probably experience errors attempting to build this container."
            )
    elif target_platform() == "windows":
        base_image = "mcr.microsoft.com/dotnet/framework/sdk:4.8"
    elif target_platform() == "rhel":
        # modified branch: set the RHEL base image directly instead of raising
        base_image = "registry.access.redhat.com/ubi8/ubi:latest"
        print("Using manually set RHEL base image:", base_image)
        # raise KeyError("A base image must be specified when targeting RHEL")
    elif FLAGS.enable_gpu:
        base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format(
            FLAGS.upstream_container_version
        )
    else:
        base_image = "ubuntu:24.04"
After running build.py, a build folder is created in the current directory with five new files in it:
- cmake_build: the script that runs the cmake configuration and compilation, producing the Triton Inference Server executables and shared libraries.
- docker_build: calls cmake_build to compile, then combines the resulting artifacts with the Dockerfiles to build the Docker image; it is the entry point of the whole container build.
- Dockerfile: builds the final Triton Inference Server runtime image, containing the compiled programs and their dependencies; this is the actual runtime environment.
- Dockerfile.buildbase: defines the build base image with the compiler toolchain and required dependencies, used by the subsequent compilation and build steps.
- Dockerfile.cibase: defines the continuous-integration (CI) image, typically used in automated build and test pipelines.
The process actually starts from the docker_build script: it calls cmake_build along the way to compile the executables and shared libraries, and then uses the three Dockerfiles above to produce the final image.
That is as far as I will take the build for now: get the overall picture first, and dig deeper when I actually need to build it.
4 An overall look at the backend mechanism via the README
4.1 What is a backend
A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT, ONNX Runtime or OpenVINO. A backend can also implement any functionality you want as long as it adheres to the backend API. Triton uses this API to send requests to the backend for execution and the backend uses the API to communicate with Triton.
Every model must be associated with a backend. A model's backend is specified in the model's configuration using the backend setting. For using TensorRT backend, the value of this setting should be tensorrt. Similarly, for using PyTorch, ONNX and TensorFlow backends, the backend field should be set to pytorch, onnxruntime or tensorflow respectively. For all other backends, backend must be set to the name of the backend. Some backends may also check the platform setting for categorizing the model, for example, in TensorFlow backend, platform should be set to tensorflow_savedmodel or tensorflow_graphdef according to the model format. Please refer to the specific backend repository on whether platform is used.
The readme at https://github.com/triton-inference-server/backend contains the two paragraphs above. To summarize them in plain language:
A backend is the module in Triton that actually performs model inference. A backend can be a wrapper around an inference framework such as ONNX Runtime or TensorRT, or you can write one yourself as long as it follows the backend API (a small example follows).
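For instance, the python_backend listed earlier lets you implement that custom logic in Python by dropping a model.py with a TritonPythonModel class next to the model. The following is only a minimal sketch based on the python_backend API; the tensor names INPUT0/OUTPUT0 are made up and would have to match the model's config.pbtxt.

# model.py -- minimal sketch of a python_backend model
# (tensor names INPUT0/OUTPUT0 are hypothetical and must match config.pbtxt)
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model config, instance kind, etc.; nothing needed here
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out = np.square(in0.as_numpy())  # the "inference" logic of this toy backend
            out_tensor = pb_utils.Tensor("OUTPUT0", out)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass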
Every model must be associated with a backend; the association is made through the backend setting in the model configuration, or for some backends through the platform setting.
In fact, from the deployment experiment earlier, you can see this in the model directory:
cat ./docs/examples/model_repository/densenet_onnx/config.pbtxt
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 1, 3, 224, 224 ] }
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    reshape { shape: [ 1, 1000, 1, 1 ] }
    label_filename: "densenet_labels.txt"
  }
]
It contains platform: "onnxruntime_onnx"; my guess is that this is how Triton knows the model should be served by the onnxruntime backend, but I am not certain and will confirm it later when reading the code.
4.2 Backend Shared Library
At https://github.com/triton-inference-server/backend/tree/main#backends there is the following passage:
Can I add (or remove) a backend to an existing Triton installation?
Yes. See Backend Shared Library for general information about how the shared library implementing a backend is managed by Triton, and Triton with Unsupported and Custom Backends for documentation on how to add your backend to the released Triton Docker image. For a standard install the globally available backends are in /opt/tritonserver/backends.
So let me look at what this Backend Shared Library is about.
Backend Shared Library
Each backend must be implemented as a shared library and the name of the shared library must be libtriton_<backend-name>.so. For example, if the name of the backend is "mybackend", a model indicates that it uses the backend by setting the model configuration 'backend' setting to "mybackend", and Triton looks for libtriton_mybackend.so as the shared library that implements the backend. The tutorial shows examples of how to build your backend logic into the appropriate shared library.
For a model, M, that specifies backend B, Triton searches for the backend shared library in the following places, in this order:
<model_repository>/M/<version_directory>/libtriton_B.so
<model_repository>/M/libtriton_B.so
<global_backend_directory>/B/libtriton_B.so
Where <global_backend_directory> is by default /opt/tritonserver/backends. The --backend-directory flag can be used to override the default.
Typically you will install your backend into the global backend directory. For example, if using Triton Docker images you can follow the instructions in Triton with Unsupported and Custom Backends. Continuing the example of a backend named "mybackend", you would install into the Triton image as:
/opt/tritonserver/backends/mybackend/libtriton_mybackend.so
   ... # other files needed by mybackend
Starting from 24.01, the default backend shared library name can be changed by providing the runtime setting in the model configuration. For example:
runtime: "my_backend_shared_library_name.so"
A model may choose a specific runtime implementation provided by the backend.
To summarize the passage above in plain terms: every backend is a .so shared library whose file name must follow the libtriton_<backend-name>.so convention. If the model configuration says backend: "mybackend", Triton looks for libtriton_mybackend.so. The library can live in three places, but it is usually installed in the global backend directory, i.e. <global_backend_directory>/B/libtriton_B.so; in the official image, for example, the ONNX Runtime backend sits at /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so.
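Just to make that search order concrete, here is my own small illustration (not Triton source code) that spells out the three candidate paths for a model that declares backend B:

# My own illustration of the lookup order quoted above (not Triton code)
def backend_library_candidates(model_repo, model, version, backend,
                               global_dir="/opt/tritonserver/backends"):
    lib = f"libtriton_{backend}.so"
    return [
        f"{model_repo}/{model}/{version}/{lib}",  # 1. inside the model version directory
        f"{model_repo}/{model}/{lib}",            # 2. inside the model directory
        f"{global_dir}/{backend}/{lib}",          # 3. global backend directory (default)
    ]

print(backend_library_candidates("/models", "densenet_onnx", "1", "onnxruntime"))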
4.3 How to add your backend to the released Triton Docker image
The instructions are here: https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/compose.md#triton-with-unsupported-and-custom-backends
Build it yourself
If you would like to do what compose.py is doing under the hood yourself, you can run compose.py with the --dry-run option and then modify the Dockerfile.compose file to satisfy your needs.
Triton with Unsupported and Custom Backends
You can create and build your own Triton backend. The result of that build should be a directory containing your backend shared library and any additional files required by the backend. Assuming your backend is called "mybackend" and that the directory is "./mybackend", adding the following to the Dockerfile compose.py created will create a Triton image that contains all the supported Triton backends plus your custom backend.
COPY ./mybackend /opt/tritonserver/backends/mybackend
You also need to install any additional dependencies required by your backend as part of the Dockerfile. Then use Docker to create the image.
$ docker build -t tritonserver_custom -f Dockerfile.compose .
This section explains how to add a backend we define ourselves to an existing Docker image. There are two approaches: the first is to run compose.py once with the required arguments and be done with it; the second is to run compose.py with --dry-run to generate Dockerfile.compose, edit that file by hand, and then run the docker build command to produce the image.
4.3.1 Manual mode
python3 compose.py --dry-run --container-version=25.05 # this generates server/Dockerfile.compose
COPY ./mybackend /opt/tritonserver/backends/mybackend # add this line manually to ./server/Dockerfile.compose
apt update && apt install -y libmydep-dev # dependencies your custom backend may need
docker build -t tritonserver_custom -f Dockerfile.compose . # build the image
I tried it and got roughly the following output.
python3 compose.py --dry-run --container-version=25.05
using container version 25.05
pulling container:nvcr.io/nvidia/tritonserver:25.05-py3
25.05-py3: Pulling from nvidia/tritonserver
Digest: sha256:3189f95bb663618601e46628af7afb154ba2997e152a29113c02f97b618d119f
Status: Image is up to date for nvcr.io/nvidia/tritonserver:25.05-py3
nvcr.io/nvidia/tritonserver:25.05-py3
25.05-py3-min: Pulling from nvidia/tritonserver
f03f49e66a78: Already exists
4f4fb700ef54: Already exists
bd0ed3dadbe9: Already exists
7b57f70af223: Already exists
3f0b11d337e6: Already exists
2104594958ce: Already exists
ba15f2616882: Already exists
4e46d4ab7302: Already exists
50f087002df9: Already exists
f94296dbf484: Already exists
03a8530f6876: Already exists
cce238fffcb6: Already exists
64a55035aee6: Already exists
1394c771d714: Already exists
2312e005f291: Already exists
95f88b748512: Already exists
b34261a35067: Already exists
b29ea1b3ef7d: Already exists
8accdaa104b8: Pull complete
046521000f43: Pull complete
Digest: sha256:3a1c84e22d2df22d00862eb651c000445d9314a12fd7dd005f4906f5615c7f6a
Status: Downloaded newer image for nvcr.io/nvidia/tritonserver:25.05-py3-min
nvcr.io/nvidia/tritonserver:25.05-py3-min
4.3.2 Automatic mode
python3 compose.py \
    --container-version=24.01 \
    --backend-dir=./mybackend \
    --output-name=tritonserver_custom
References:
https://github.com/triton-inference-server/server
https://github.com/triton-inference-server/backend
Triton中文社區
https://github.com/triton-inference-server/backend/tree/main#backends
tritonserver學習之五:backend實現機制_triton backend-CSDN博客
tritonserver學習之三:tritonserver運行流程_trition-server 使用教程-CSDN博客
Triton Server 快速入門_tritonserver-CSDN博客
深度學習部署神器-triton inference server第一篇 - Oldpan的個人博客
tritonserver學習之六:自定義c++、python custom backend實踐_triton c++-CSDN博客