Accelerating YOLOv9 inference with Tencent ncnn, compared against OpenCV DNN

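An earlier post, [OpenCV DNN module examples (25): object detection (object_detection) with yolov9], covered how to use yolov9 in detail: reparameterization, exporting an end-to-end model, and tests with torch, OpenCV, TensorRT, and Paddle.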

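Because we need to deploy inference on mobile devices, where it has to be accelerated, this post builds on that yolov9 work and tests inference with Tencent's ncnn library.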

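1. Model conversion

1.1. Preparing the conversion tool

The ncnn project provides conversion tools, which can be obtained directly from the prebuilt release packages at https://github.com/Tencent/ncnn/releases , for example the builds provided for Windows.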

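We use the ncnn-20241226-windows-vs2015-shared package as an example. Once downloaded and extracted, it is split into two architectures; in each, include and lib are for developers, and bin holds the dynamic libraries and the tools. Here we use the onnx2ncnn.exe tool.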

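1.2. Converting the ONNX model

Taking yolov9s as an example: start from a pretrained or self-trained yolov9s.pt model, reparameterize it to produce the slimmed-down yolov9s-converted.pt model file, then export yolov9s-converted.onnx, passing the --simplify option so the exported graph is simplified as well.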

python reparameterization_yolov9-s.py yolov9s.pt
python export.py --weights yolov9-s-converted.pt --include onnx --simplify

Here we convert best-s-c.onnx, a model we trained and exported ourselves; the later tests use it as well.
Go to ncnn's bin directory and run the command:

$ onnx2ncnn.exe best-s-c.onnx best-s-c.onnx.param best-s-c.onnx.bin
onnx2ncnn may not fully meet your needs. For more accurate and elegant
conversion results, please use PNNX. PyTorch Neural Network eXchange (PNNX) is
an open standard for PyTorch model interoperability. PNNX provides an open model
format for PyTorch. It defines computation graph as well as high level operators
strictly matches PyTorch. You can obtain pnnx through the following ways:
1. Install via python
   pip3 install pnnx
2. Get the executable from https://github.com/pnnx/pnnx
For more information, please refer to https://github.com/pnnx/pnnx

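Running the command prints a long warning that recommends PNNX; you can follow its suggestion if you need to. In this case the conversion succeeded and produced two files, best-s-c.onnx.param and best-s-c.onnx.bin.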
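A quick way to check the converted model without netron is to read the text .param file. The sketch below assumes the usual ncnn text layout as I understand it: a magic number 7767517, then a line with the layer count and blob count, then one line per layer of the form "type name input_count output_count input_blobs... output_blobs... params...". The helper name listInputBlobs is hypothetical and is not part of ncnn.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the output blob names of all "Input" layers
// found in an ncnn text .param stream. Returns an empty list on a bad header.
std::vector<std::string> listInputBlobs(std::istream& param) {
    std::vector<std::string> names;
    int magic = 0, layerCount = 0, blobCount = 0;
    if (!(param >> magic >> layerCount >> blobCount) || magic != 7767517)
        return names;
    std::string line;
    std::getline(param, line);  // consume the rest of the header line
    for (int i = 0; i < layerCount && std::getline(param, line); ++i) {
        std::istringstream ls(line);
        std::string type, name;
        int inputs = 0, outputs = 0;
        if (!(ls >> type >> name >> inputs >> outputs)) continue;
        std::string blob;
        for (int k = 0; k < inputs; ++k) ls >> blob;  // skip input blob names
        for (int k = 0; k < outputs && (ls >> blob); ++k)
            if (type == "Input") names.push_back(blob);
    }
    return names;
}
```

Opening best-s-c.onnx.param with std::ifstream and passing it to this function should list the blob name that is later given to the extractor's input call.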

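My model is custom-trained with 6 classes. Use netron to inspect the network structure, inputs, and outputs of the ONNX and ncnn models.
Both have the same input, images, and the same output, output0, but in the ncnn model the input and output dimensions are dynamic, whereas in the ONNX model they are fixed static values.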

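2. Testing

2.1. Test code

We start from the OpenCV DNN test code of the earlier post and modify it.

2.1.1. Preprocessing

First pad the image to a square, then resize it to (640, 640).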
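The geometry of this step can be sketched in plain C++ without any image library. The sketch assumes the padding scheme used by formatToSquare in the main code further down: the source image is copied into the top-left corner of a square whose side is the longer image edge, so mapping a predicted coordinate back to the source needs only a scale factor and no offset. The names SquareResize, squareResize, and toSourcePixel are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Geometry of "pad to square, then resize" for one image.
struct SquareResize {
    int side;      // side length of the padded square, in source pixels
    float factor;  // multiply model-space coordinates by this to get source pixels
};

// modelSize is the network input side (640 in this post).
SquareResize squareResize(int srcWidth, int srcHeight, int modelSize) {
    assert(srcWidth > 0 && srcHeight > 0 && modelSize > 0);
    SquareResize r;
    r.side = std::max(srcWidth, srcHeight);
    r.factor = static_cast<float>(r.side) / static_cast<float>(modelSize);
    return r;
}

// Map a coordinate predicted in model space back to the source image.
// No offset is needed because the image sits in the top-left corner of the square.
int toSourcePixel(float modelCoord, const SquareResize& r) {
    return static_cast<int>(modelCoord * r.factor);
}
```

For an 810x1080 image and a 640 input, the square side is 1080 and the factor is 1.6875, so a box center at x = 320 in model space maps back to x = 540 in the source image.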

The OpenCV code is:

// Create a 4D blob from a frame.
cv::Mat modelInput = frame;
if(letterBoxForSquare && inpWidth == inpHeight)
    modelInput = formatToSquare(modelInput);
// preprocess
cv::dnn::blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);

The ncnn code is:

cv::Mat modelInput = frame;
if(letterBoxForSquare && inpWidth == inpHeight)
    modelInput = formatToSquare(modelInput);
// preprocess
ncnn::Mat in = ncnn::Mat::from_pixels_resize((unsigned char*)modelInput.data, ncnn::Mat::PIXEL_BGR2RGB, modelInput.cols, modelInput.rows, (int)inpWidth, (int)inpHeight);
float norm_ncnn[] = {1/255.f, 1/255.f, 1/255.f};
in.substract_mean_normalize(0, norm_ncnn);

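Comparing the two: both first pad the frame to a letterbox-style square, then resize it, convert it to the network input (a 4-D blob for OpenCV, a 3-D ncnn::Mat for ncnn), and normalize to [0, 1]. The ncnn version takes one extra call because resizing and normalization are separate steps.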
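To make the shared data transformation concrete, here is a minimal sketch of what both code paths do to the pixel values once the image has been resized: packed BGR bytes become planar RGB floats scaled by 1/255. It is a plain C++ illustration, not the OpenCV or ncnn implementation; resizing is left out, any per-channel padding a library may add is ignored, and the function name bgrToPlanarRgb is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert packed BGR bytes (H x W x 3) into planar RGB floats (3 x H x W),
// scaling every value by `norm` (1/255 maps 0..255 to 0..1).
std::vector<float> bgrToPlanarRgb(const std::vector<std::uint8_t>& bgr,
                                  int width, int height, float norm) {
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    assert(bgr.size() == plane * 3);
    std::vector<float> out(plane * 3);
    for (std::size_t i = 0; i < plane; ++i) {
        out[0 * plane + i] = bgr[i * 3 + 2] * norm;  // R plane
        out[1 * plane + i] = bgr[i * 3 + 1] * norm;  // G plane
        out[2 * plane + i] = bgr[i * 3 + 0] * norm;  // B plane
    }
    return out;
}
```

In the snippets above, the channel swap corresponds to swapRB = true on the OpenCV side and PIXEL_BGR2RGB on the ncnn side, and the scaling corresponds to scale = 1/255 and norm_ncnn respectively.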

To compare preprocessing speed, three implementations are timed:

// preprocess
cv::TickMeter tk;

// Variant 1: blobFromImage, then fill an ncnn::Mat header by hand (no copy).
tk.reset();
for(int i = 0; i < 100; i++) {
    tk.start();
    cv::dnn::blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);
    ncnn::Mat in2;
    in2.w = inpWidth;
    in2.h = inpHeight;
    in2.d = 1;
    in2.c = 3;
    in2.data = blob.data;
    in2.elemsize = 4;
    in2.elempack = 1;
    in2.dims = 3;
    in2.cstep = inpWidth*inpHeight;
    tk.stop();
}
std::cout << tk.getTimeMilli() << "  " << tk.getAvgTimeMilli() << std::endl;

// Variant 2: blobFromImage, then wrap the blob with the ncnn::Mat constructor (no copy).
tk.reset();
for(int i = 0; i < 100; i++) {
    tk.start();
    cv::dnn::blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);
    ncnn::Mat in2(inpWidth, inpHeight, 3, blob.data, 4, 1);
    tk.stop();
}
std::cout << tk.getTimeMilli() << "  " << tk.getAvgTimeMilli() << std::endl;

// Variant 3: ncnn only, from_pixels_resize plus normalization.
tk.reset();
for(int i = 0; i < 100; i++) {
    tk.start();
    ncnn::Mat in = ncnn::Mat::from_pixels_resize((unsigned char*)modelInput.data, ncnn::Mat::PIXEL_BGR2RGB, modelInput.cols, modelInput.rows, (int)inpWidth, (int)inpHeight);
    float norm_ncnn[] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
    in.substract_mean_normalize(0, norm_ncnn);
    tk.stop();
}
std::cout << tk.getTimeMilli() << "  " << tk.getAvgTimeMilli() << std::endl;

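Each variant runs 100 times, and the total and average times (in ms) are printed in the order of the three implementations. The ncnn-only path takes about 13% less time than the OpenCV DNN path (3.27 ms versus 3.75 ms on average).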

374.684  3.74684
373.745  3.73745
327.09  3.2709
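The first two variants above reuse the OpenCV blob memory inside an ncnn::Mat without copying, which only works when both libraries agree on the distance between channels. To my understanding, ncnn pads each channel of a 3-D Mat so that its byte size is a multiple of 16, and computes cstep from that. The sketch below expresses that rule under this assumption; alignUp, channelStep, and isTightlyPackedCompatible are hypothetical helpers, not ncnn functions.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `alignment` (a power of two).
std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Elements between the starts of two consecutive channels, assuming each
// channel is padded to a 16-byte boundary (an assumption about ncnn's layout).
std::size_t channelStep(int width, int height, std::size_t elemSize) {
    assert(width > 0 && height > 0 && elemSize > 0);
    return alignUp(static_cast<std::size_t>(width) * height * elemSize, 16) / elemSize;
}

// True when a tightly packed planar buffer (such as an OpenCV blob)
// has the same channel stride as the padded layout.
bool isTightlyPackedCompatible(int width, int height, std::size_t elemSize) {
    return channelStep(width, height, elemSize) ==
           static_cast<std::size_t>(width) * height;
}
```

For a 640x640 float input the channel size is already a multiple of 16 bytes, so the step is 409600 elements and the zero-copy wrapping is consistent. For a size such as 3x3 the padded step would be 12 rather than 9, and a tightly packed blob would be read incorrectly.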

2.1.2. Inference

OpenCV inference:

// Run a model.
net.setInput(blob);
// output
std::vector<Mat> outs;
net.forward(outs, outNames);   // a single output also works: Mat out=net.forward(outNames);
postprocess(frame, modelInput.size(), outs, net);

The post-processing function decodes the network output of shape [1, class_num+4, 8400], then applies NMS and draws the results.
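The decoding step can be illustrated independently of OpenCV. The sketch below assumes the layout described above with the batch dimension dropped: a channel-major buffer of shape [4 + numClasses, numAnchors], where rows 0 to 3 hold cx, cy, w, h and the remaining rows hold one score per class. It indexes the buffer directly instead of transposing it first. Detection and decode are hypothetical names, and NMS is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Detection {
    int classId;
    float score;
    float cx, cy, w, h;  // box center and size in model-input pixels
};

// Decode a channel-major output of shape [4 + numClasses, numAnchors]:
// keep, for each anchor, the best class if its score exceeds the threshold.
std::vector<Detection> decode(const std::vector<float>& out, int numClasses,
                              int numAnchors, float scoreThreshold) {
    assert(numClasses > 0 && numAnchors > 0);
    assert(out.size() == static_cast<std::size_t>(4 + numClasses) * numAnchors);
    std::vector<Detection> dets;
    auto at = [&](int row, int anchor) {
        return out[static_cast<std::size_t>(row) * numAnchors + anchor];
    };
    for (int a = 0; a < numAnchors; ++a) {
        int best = 0;
        for (int c = 1; c < numClasses; ++c)
            if (at(4 + c, a) > at(4 + best, a)) best = c;
        const float score = at(4 + best, a);
        if (score > scoreThreshold)
            dets.push_back({best, score, at(0, a), at(1, a), at(2, a), at(3, a)});
    }
    return dets;
}
```

For the 6-class model in this post, numClasses would be 6 and numAnchors 8400, so each anchor contributes a column of 10 values.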

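ncnn inference:

Because the output format differs, the output is converted so that the post-processing function can be reused:

ncnn::Extractor ex = net.create_extractor();
ex.input("images", in);
ncnn::Mat output;
ex.extract("output0", output);
// Reuse the OpenCV DNN post-processing.
// The cv::Mat wraps ncnn's buffer without copying, so `output` must outlive `outs`.
std::vector<Mat> outs;
outs.push_back(cv::Mat({1,output.h,output.w}, CV_32F, output.data));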

2.2. Performance comparison

Same image, using the trained yolov9-s model, measuring inference time only.

opencv dnn (CPU): 300 ms
opencv dnn (GPU): 15 ms

ncnn CPU: 170 ms
ncnn GPU (Vulkan): fails with an error.

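Looking at CPU only for now, ncnn cuts inference time by about 43% (170 ms versus 300 ms, roughly 1.8x faster), which is a considerable gain for mobile deployment.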
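For reference, the two ways of expressing the gain from the CPU timings above can be computed as follows. This is just the arithmetic, with hypothetical helper names; the inputs are the measurements reported in this post.

```cpp
#include <cassert>

// Fraction of time saved when going from `baselineMs` to `newMs`.
double timeSaved(double baselineMs, double newMs) {
    assert(baselineMs > 0.0 && newMs > 0.0);
    return (baselineMs - newMs) / baselineMs;
}

// How many times faster the new measurement is than the baseline.
double speedup(double baselineMs, double newMs) {
    assert(baselineMs > 0.0 && newMs > 0.0);
    return baselineMs / newMs;
}
```

With baselineMs = 300 and newMs = 170, timeSaved gives about 0.43 and speedup about 1.76. The same functions applied to the preprocessing averages (3.74684 ms and 3.2709 ms) give a saving of about 0.13.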

2.3. Main code

Note: if #include "ncnn/net.h" produces strange undefined-symbol errors, move that include earlier, ahead of the other headers.

#include "ncnn/net.h"   // kept first, see the note above

#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

using namespace cv;
using namespace dnn;

float inpWidth;
float inpHeight;
float confThreshold, scoreThreshold, nmsThreshold;
std::vector<std::string> classes;
std::vector<cv::Scalar> colors;

bool letterBoxForSquare = true;

cv::Mat formatToSquare(const cv::Mat &source);
void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& out, Net& net);
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(100, 255);

int test_ncnn()
{
    // Configure according to the selected detection model file.
    confThreshold = 0.25;
    scoreThreshold = 0.45;
    nmsThreshold = 0.5;
    float scale = 1 / 255.0;  //0.00392
    Scalar mean = {0,0,0};
    bool swapRB = true;
    inpWidth = 640;
    inpHeight = 640;

    //String modelPath = R"(E:\DeepLearning\yolov9\custom-data\traffic_accident_vehicle_test_0218\best-s-c.onnx)";
    String classesFile = R"(E:\DeepLearning\yolov9\custom-data\traffic_accident_vehicle_test_0218\cls.txt)";

    std::string param_path = R"(E:\1、交通事故\Traffic Accident Processes For IOS\models\20250221\ncnn\best-s-c.onnx.param)";
    std::string bin_path =   R"(E:\1、交通事故\Traffic Accident Processes For IOS\models\20250221\ncnn\best-s-c.onnx.bin)";

    ncnn::Net net;
    // ncnn reads net.opt while loading, so options are set before load_param()/load_model().
    // Set this to false for CPU-only inference.
    net.opt.use_vulkan_compute = true;
    net.load_param(param_path.c_str());
    net.load_model(bin_path.c_str());

    // Open file with classes names.
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
            colors.push_back(cv::Scalar(dis(gen), dis(gen), dis(gen)));
        }
    }

    // Create a window
    static const std::string kWinName = "Deep learning object detection in OpenCV";
    cv::namedWindow(kWinName, 0);

    // Open a video file or an image file or a camera stream.
    VideoCapture cap;
    cap.open(R"(E:\DeepLearning\yolov9\bus.jpg)");

    cv::TickMeter tk;

    // Process frames.
    Mat frame, blob;
    while(waitKey(1) < 0) {
        cap >> frame;
        if(frame.empty()) {
            waitKey();
            break;
        }

        // Create a 4D blob from a frame.
        cv::Mat modelInput = frame;
        if(letterBoxForSquare && inpWidth == inpHeight)
            modelInput = formatToSquare(modelInput);

        // preprocess
        //cv::dnn::blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);
        ncnn::Mat in = ncnn::Mat::from_pixels_resize((unsigned char*)modelInput.data, ncnn::Mat::PIXEL_BGR2RGB, modelInput.cols, modelInput.rows, (int)inpWidth, (int)inpHeight);
        float norm_ncnn[] = {1/255.f, 1/255.f, 1/255.f};
        in.substract_mean_normalize(0, norm_ncnn);

        // Run a model.
        ncnn::Extractor ex = net.create_extractor();
        ex.input("images", in);
        ncnn::Mat output;
        auto tt1 = cv::getTickCount();
        ex.extract("output0", output);
        auto tt2 = cv::getTickCount();

        //for(int i = 0; i < 20; i++) {
        //    auto tt1 = cv::getTickCount();
        //    ex.input("images", in);
        //    ex.extract("output0", output);
        //    auto tt2 = cv::getTickCount();
        //    std::cout << "infer time: " << (tt2 - tt1) / cv::getTickFrequency() * 1000 << std::endl;
        //}

        // Wrap the ncnn output (no copy) so the OpenCV DNN post-processing can be reused.
        std::vector<Mat> outs;
        outs.push_back(cv::Mat({1,output.h,output.w}, CV_32F, output.data));

        cv::dnn::Net nullNet;
        postprocess(frame, modelInput.size(), outs, nullNet);

        //tk.stop();
        std::string label = format("Inference time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
        cv::putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
        cv::imshow(kWinName, frame);
    }
    return 0;
}

// Pad the image with black pixels to a square; the source stays in the top-left corner.
cv::Mat formatToSquare(const cv::Mat &source)
{
    int col = source.cols;
    int row = source.rows;
    int _max = MAX(col, row);
    cv::Mat result = cv::Mat::zeros(_max, _max, CV_8UC3);
    source.copyTo(result(cv::Rect(0, 0, col, row)));
    return result;
}

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
    // yolov8 has an output of shape (batchSize, 84, 8400) (Num classes + box[x,y,w,h] + confidence[c])
    auto tt1 = cv::getTickCount();

    float x_factor = inputSz.width / inpWidth;
    float y_factor = inputSz.height / inpHeight;

    std::vector<int> class_ids;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;

    //int rows = outs[0].size[1];
    //int dimensions = outs[0].size[2];

    // [1, 84, 8400] -> [8400,84]
    int rows = outs[0].size[2];
    int dimensions = outs[0].size[1];
    auto tmp = outs[0].reshape(1, dimensions);
    cv::transpose(tmp, tmp);

    float *data = (float *)tmp.data;
    for(int i = 0; i < rows; ++i) {
        //float confidence = data[4];
        //if(confidence >= confThreshold) {
        float *classes_scores = data + 4;
        cv::Mat scores(1, classes.size(), CV_32FC1, classes_scores);
        cv::Point class_id;
        double max_class_score;
        minMaxLoc(scores, 0, &max_class_score, 0, &class_id);
        if(max_class_score > scoreThreshold) {
            confidences.push_back(max_class_score);
            class_ids.push_back(class_id.x);

            float x = data[0];
            float y = data[1];
            float w = data[2];
            float h = data[3];

            int left = int((x - 0.5 * w) * x_factor);
            int top = int((y - 0.5 * h) * y_factor);
            int width = int(w * x_factor);
            int height = int(h * y_factor);
            boxes.push_back(cv::Rect(left, top, width, height));
        }
        //}
        data += dimensions;
    }

    std::vector<int> indices;
    NMSBoxes(boxes, confidences, scoreThreshold, nmsThreshold, indices);

    auto tt2 = cv::getTickCount();
    std::string label = format("NMS time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
    cv::putText(frame, label, Point(0, 30), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

    for(size_t i = 0; i < indices.size(); ++i) {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(class_ids[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
        //printf("cls = %d, prob = %.2f\n", class_ids[idx], confidences[idx]);
        std::cout << "cls " << class_ids[idx] << ", prob = " << confidences[idx] << ", " << box << "\n";
    }
}

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 255, 0));

    std::string label = format("%.2f", conf);
    Scalar color = Scalar::all(255);
    if(!classes.empty()) {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ": " + label;
        color = colors[classId];
    }

    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);

    top = max(top, labelSize.height);
    rectangle(frame, Point(left, top - labelSize.height),
              Point(left + labelSize.width, top + baseLine), color, FILLED);
    cv::putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar());
}