Performance comparison of CPU and GPU compute and OpenCV resize on the RK3588

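  • I. Background
  • II. Summary
  • III. Related links
  • IV. Procedure
    • 1. Environment setup
      • A. Install dependencies
      • B. Set the GPU to performance mode
      • C. Get GPU information
      • D. Get CPU information
    • 2. Query GPU information through the OpenCL SDK
    • 3. Matrix multiplication with the OpenCL API
    • 4. Measure GPU performance with clpeak
    • 5. Measure CPU compute with OpenBLAS
    • 6. Measure OpenCV resize performance on the CPU and with OpenCL
      • A. Build OpenCV with OpenCL support
      • B. Run the OpenCV test program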

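I. Background

  • The goal is to compare the performance of the CPU and the Mali GPU on the RK3588.
  • Mali GPU compute is measured with clpeak.
  • CPU FP32 performance is measured with OpenBLAS (built with NEON optimizations).
  • OpenCV resize is timed on the CPU and through OpenCL for each interpolation method: a 32x32 image is enlarged to 8192x8192 and then shrunk back to 32x32, repeated 100 times.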

II. Summary

  • GPU model: Mali-LODX r0p0 (Mali-G610, 4 cores, r0p0, 0xA867)
  • GPU FP32 (clpeak): 441.95 GFLOPS
  • CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
  • Interpolation: INTER_NEAREST, CPU time (s): 3.01526, GPU time (s): 0.0672681
  • Interpolation: INTER_LINEAR, CPU time (s): 5.3227, GPU time (s): 0.0189366
  • Interpolation: INTER_CUBIC, CPU time (s): 8.22734, GPU time (s): 11.6337
  • Interpolation: INTER_AREA, CPU time (s): 20.4999, GPU time (s): 27.3197
  • Interpolation: INTER_LANCZOS4, CPU time (s): 29.3602, GPU time (s): 43.9484
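
To make the resize rows easier to compare, the ratio of CPU time to GPU time can be computed per interpolation method. The following is a minimal sketch: the `ResizeTiming` struct and the `summary_timings` and `speedup` helpers are illustrative names that do not appear in the original test program, and the numbers are simply copied from the list above. A ratio above 1 means the OpenCL path was faster in this run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record for one row of the summary (not part of the original program).
struct ResizeTiming {
    std::string method;
    double cpu_seconds;
    double gpu_seconds;
};

// Timings copied from the summary list.
inline std::vector<ResizeTiming> summary_timings() {
    return {
        {"INTER_NEAREST", 3.01526, 0.0672681},
        {"INTER_LINEAR", 5.3227, 0.0189366},
        {"INTER_CUBIC", 8.22734, 11.6337},
        {"INTER_AREA", 20.4999, 27.3197},
        {"INTER_LANCZOS4", 29.3602, 43.9484},
    };
}

// CPU time divided by GPU time: above 1 means the OpenCL path was faster.
inline double speedup(const ResizeTiming& t) {
    assert(t.gpu_seconds > 0.0);
    return t.cpu_seconds / t.gpu_seconds;
}
```

With these numbers the ratio is roughly 45 for INTER_NEAREST and roughly 281 for INTER_LINEAR, and below 1 for INTER_CUBIC, INTER_AREA and INTER_LANCZOS4.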

III. Related links

  • Building OpenCV

IV. Procedure

1. Environment setup

A. Install dependencies

mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1
sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo

B. Set the GPU to performance mode

echo performance > /sys/class/devfreq/fb000000.gpu/governor
echo performance > /sys/class/devfreq/fdab0000.npu/governor
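
Before benchmarking it can be useful to confirm that the governor change took effect. The following is a small sketch that reads a governor node back; `read_first_line` and `is_performance_governor` are hypothetical helper names, and the sketch assumes the node is a plain text file whose first line is the governor name, as the `echo` commands above suggest.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads the first line of a text file such as a devfreq "governor" node.
// Returns an empty string if the file cannot be opened.
inline std::string read_first_line(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path);
    std::string line;
    if (in) {
        std::getline(in, line);
    }
    return line;
}

// True when the node at `path` currently reads "performance".
inline bool is_performance_governor(const std::string& path) {
    return read_first_line(path) == "performance";
}
```

On the board this would be called as `is_performance_governor("/sys/class/devfreq/fb000000.gpu/governor")`, using the same path as the command above.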

C. Get GPU information

cat /sys/class/misc/mali0/device/gpuinfo
clinfo

Output

Mali-G610 4 cores r0p0 0xA867
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device Name                                     Mali-LODX r0p0
  Device Vendor                                   ARM
  Device Vendor ID                                0xa8670000
  Device Version                                  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Driver Version                                  2.1
  Device OpenCL C Version                         OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple              16
  Max sub-groups per work group                   64
  Preferred / native vector sizes
    char                                          16 / 4
    short                                          8 / 2
    int                                            4 / 1
    long                                           2 / 1
    half                                           8 / 2        (cl_khr_fp16)
    float                                          4 / 1
    double                                         0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              16643870720 (15.5GiB)
  Error Correction support                        No
  Max memory allocation                           16643870720 (15.5GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1048576 (1024KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                64
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    1
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     128
  Max constant buffer size                        16643870720 (15.5GiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                2097152 (2MiB)
    Max size                                      16777216 (16MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Sub-group independent forward progress        Yes
    IL version                                    SPIR-V_1.0
    SPIR versions                                 <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0

D. Get CPU information

lscpu

Output

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       3
Vendor ID:                       ARM
Model:                           0
Model name:                      Cortex-A55
Stepping:                        r2p0
CPU max MHz:                     2208.0000
CPU min MHz:                     408.0000
BogoMIPS:                        48.00
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        1 MiB
L3 cache:                        3 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp

2. Query GPU information through the OpenCL SDK

cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;

    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);

    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);
    printf("Number of OpenCL platforms: %u\n", num_platforms);

    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i+1);

        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);

        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);

        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);

        // Number of devices on this platform
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j+1);

            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);

            // Device type
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: Accelerator\n");

            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);

            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF
gcc -o cl_query cl_query.c -lOpenCL
./cl_query

Output

Number of OpenCL platforms: 1

Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB

3. Matrix multiplication with the OpenCL API

cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Current wall-clock time in seconds, used to measure elapsed time
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

#define xstr(s) str(s)
#define str(s) #s

const char *kernelSource = "                                  \n" \
"__kernel void mat_mul_optimized(const int N,                 \n" \
"                                __global float* A,           \n" \
"                                __global float* B,           \n" \
"                                __global float* C) {         \n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) ";               \n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE];                \n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE];                \n" \
"    int global_row = get_global_id(1);                       \n" \
"    int global_col = get_global_id(0);                       \n" \
"    int local_row = get_local_id(1);                         \n" \
"    int local_col = get_local_id(0);                         \n" \
"    float sum = 0.0f;                                        \n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;          \n" \
"    for (int t = 0; t < numTiles; ++t) {                     \n" \
"        int tiled_row = global_row;                          \n" \
"        int tiled_col = t * TILE_SIZE + local_col;           \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Asub[local_row][local_col] = 0.0f;               \n" \
"        tiled_row = t * TILE_SIZE + local_row;               \n" \
"        tiled_col = global_col;                              \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Bsub[local_row][local_col] = 0.0f;               \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"        for (int k = 0; k < TILE_SIZE; ++k) {                \n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col];  \n" \
"        }                                                    \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"    }                                                        \n" \
"    if (global_row < N && global_col < N)                    \n" \
"        C[global_row * N + global_col] = sum;                \n" \
"}                                                            \n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize the matrices
    for(int i = 0; i < N*N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Get the platform and device
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);

    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Create the device buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);

    // Copy the input data into the buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);

    // Record the build start time
    double compile_start = get_current_time();

    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);

    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Check for build errors
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }

    // Record the build end time
    double compile_end = get_current_time();
    double compile_time = compile_end - compile_start;

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);

    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);

    // Define the global and local work sizes
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};

    // Record the start time of the first kernel execution
    double launch_start = get_current_time();

    // Run the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n",ret);

    // Wait for the command queue to finish
    clFinish(commandQueue);

    // Record the end time of the first kernel execution
    double launch_end = get_current_time();
    double launch_time = launch_end - launch_start;

    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);

    // Compute GFLOPS (2 * N^3 floating-point operations)
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;

    // Print the results
    printf("Build time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Compute throughput: %f GFLOPS\n", gflops);

    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
EOF
gcc -o matmul matmul.c -lOpenCL
./matmul

Output

Build time: 0.031085 s
First kernel execution time: 62.258528 s
Compute throughput: 17.660418 GFLOPS
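
The kernel in this step multiplies the matrices tile by tile: each work-group loads one TILE_SIZE x TILE_SIZE block of A and one of B into local memory, writes zeros for elements that fall outside the matrix, and accumulates partial sums between two barriers. The following CPU-only sketch mirrors that loop structure in plain C++ so the indexing can be checked on small inputs without a GPU. `naive_matmul` and `tiled_matmul` are illustrative names that are not part of the original program, and the sketch is sequential, so it says nothing about GPU performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference result: plain triple loop over row-major n x n matrices.
inline std::vector<float> naive_matmul(const std::vector<float>& a,
                                       const std::vector<float>& b, int n) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n);
    assert(b.size() == a.size());
    std::vector<float> c(a.size(), 0.0f);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// CPU emulation of the tiled kernel: the two outer loops play the role of
// work-groups, and asub/bsub stand in for the __local tiles.
inline std::vector<float> tiled_matmul(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       int n, int tile) {
    assert(n > 0 && tile > 0);
    assert(a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<float> c(a.size(), 0.0f);
    std::vector<float> asub(static_cast<std::size_t>(tile) * tile);
    std::vector<float> bsub(asub.size());
    const int num_tiles = (n + tile - 1) / tile;
    for (int group_row = 0; group_row < num_tiles; ++group_row) {
        for (int group_col = 0; group_col < num_tiles; ++group_col) {
            // One accumulator per emulated work-item in this work-group.
            std::vector<float> sum(asub.size(), 0.0f);
            for (int t = 0; t < num_tiles; ++t) {
                // Load phase (everything before the first barrier), zero-padded.
                for (int lr = 0; lr < tile; ++lr) {
                    for (int lc = 0; lc < tile; ++lc) {
                        const int row = group_row * tile + lr;
                        const int col = group_col * tile + lc;
                        const int a_col = t * tile + lc;
                        const int b_row = t * tile + lr;
                        asub[lr * tile + lc] =
                            (row < n && a_col < n) ? a[row * n + a_col] : 0.0f;
                        bsub[lr * tile + lc] =
                            (b_row < n && col < n) ? b[b_row * n + col] : 0.0f;
                    }
                }
                // Compute phase (between the two barriers).
                for (int lr = 0; lr < tile; ++lr)
                    for (int lc = 0; lc < tile; ++lc)
                        for (int k = 0; k < tile; ++k)
                            sum[lr * tile + lc] +=
                                asub[lr * tile + k] * bsub[k * tile + lc];
            }
            // Write-back with the same bounds check as the kernel.
            for (int lr = 0; lr < tile; ++lr) {
                for (int lc = 0; lc < tile; ++lc) {
                    const int row = group_row * tile + lr;
                    const int col = group_col * tile + lc;
                    if (row < n && col < n) c[row * n + col] = sum[lr * tile + lc];
                }
            }
        }
    }
    return c;
}
```

Because matmul.c fills both inputs with 1.0f, every element of its result should equal N (8192 here), which gives a cheap correctness check on h_C after the read-back.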

4. Measure GPU performance with clpeak

git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak

Output

Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 25.71
      float2  : 24.45
      float4  : 23.70
      float8  : 12.05
      float16 : 12.01

    Single-precision compute (GFLOPS)
      float   : 441.77
      float2  : 470.27
      float4  : 466.52
      float8  : 435.65
      float16 : 411.38

    Half-precision compute (GFLOPS)
      half   : 441.96
      half2  : 878.25
      half4  : 911.51
      half8  : 886.19
      half16 : 846.44

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 124.96
      int2  : 125.71
      int4  : 125.16
      int8  : 123.82
      int16 : 124.24

    Integer compute Fast 24bit (GIOPS)
      int   : 125.16
      int2  : 125.63
      int4  : 125.20
      int8  : 123.73
      int16 : 124.33

    Integer char (8bit) compute (GIOPS)
      char   : 126.47
      char2  : 251.55
      char4  : 498.03
      char8  : 497.37
      char16 : 491.94

    Integer short (16bit) compute (GIOPS)
      short   : 126.31
      short2  : 250.90
      short4  : 249.47
      short8  : 248.51
      short16 : 245.30

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.54
      enqueueReadBuffer               : 9.97
      enqueueWriteBuffer non-blocking : 8.55
      enqueueReadBuffer non-blocking  : 9.99
      enqueueMapBuffer(for read)      : 61.66
        memcpy from mapped ptr        : 11.95
      enqueueUnmap(after write)       : 62.02
        memcpy to mapped ptr          : 11.89

    Kernel launch latency : 26.81 us

5. Measure CPU compute with OpenBLAS

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm

Output

From :   1  To : 200 Step=1 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=8192, N=8192, K=8192 :    53485.68 MFlops 205.571220 sec
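
Both GEMM figures in this article follow from the same operation count: an N x N x N matrix product takes about 2·N³ floating-point operations. The sketch below recomputes the reported rates from the logged times. It assumes that the time printed by the OpenBLAS benchmark covers all OPENBLAS_LOOPS iterations, which is consistent with the numbers in the log but is an inference rather than something the log states. `gemm_gflops` is an illustrative helper name.

```cpp
#include <cassert>

// GFLOPS for `loops` square n x n x n GEMMs that took `total_seconds` in total.
// Uses the usual 2 * n^3 operation count per GEMM.
inline double gemm_gflops(int n, int loops, double total_seconds) {
    assert(n > 0 && loops > 0 && total_seconds > 0.0);
    const double ops_per_gemm = 2.0 * n * static_cast<double>(n) * n;
    return ops_per_gemm * loops / total_seconds / 1e9;
}
```

With N = 8192, 10 loops and 205.571220 s this gives about 53.49 GFLOPS, which matches the 53485.68 MFlops in the log. With a single run and 62.258528 s it gives about 17.66 GFLOPS, which matches the OpenCL matrix multiplication in step 3.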

6. Measure OpenCV resize performance on the CPU and with OpenCL

A. Build OpenCV with OpenCL support

  • OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
   set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
   set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
 else()
-  set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
-  set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
-  ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+  set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+  set(OPENCL_INCLUDE_DIR "/usr/include")
 endif()
 mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
  • Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=`pwd`/_install \
      -DWITH_OPENCL=ON -DWITH_NEON=ON \
      -DBUILD_SHARED_LIBS=ON \
      -D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
      -DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
      -DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install

B. Run the OpenCV test program

cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <map>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a random 32x32 image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));

    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);
    cv::Mat enlarged_cpu, resized_back_cpu;

    // Record the start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();
    // Elapsed time of the CPU loop in seconds
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();

    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);
    cv::UMat src_umat;
    src.copyTo(src_umat);
    cv::UMat enlarged_gpu, resized_back_gpu;

    // Record the start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the GPU loop
    int64 end_time_gpu = cv::getTickCount();
    // Elapsed time of the GPU loop in seconds
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();

    std::cout << "CPU time (s): " << time_resize_cpu << " " << "GPU time (s): " << time_resize_gpu << std::endl;
}

int main() {
    // Check whether the system supports OpenCL
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not supported on this system." << std::endl;
        return -1;
    }

    // Print the OpenCL device in use
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found, running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }

    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };

    // Names of the interpolation methods, used when printing the results
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };

    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
    _install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize

Output

arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST  CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR   CPU time (s): 5.3227  GPU time (s): 0.0189366
Interpolation: INTER_CUBIC    CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA     CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
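
One point worth checking when reading the GPU column: OpenCV's OpenCL path may enqueue work asynchronously, so stopping the clock right after the last cv::resize call could under-report GPU time unless the queue is drained first, for example with cv::ocl::finish(). Whether this affects the numbers above has not been verified; treat it as an assumption to test on the board. The sketch below shows the timing pattern in plain C++ without any OpenCV dependency. `time_loop` is a hypothetical helper, and the `work` and `sync` callables are placeholders for the resize pair and the queue flush.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `iterations` times, then calls `sync` once before stopping the
// clock, so that any still-pending asynchronous work is included in the result.
// Returns the elapsed wall-clock time in seconds.
inline double time_loop(int iterations,
                        const std::function<void()>& work,
                        const std::function<void()>& sync) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    sync();  // placeholder: a call that blocks until the device queue is empty
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In the test program this would correspond to passing the two cv::resize calls as `work` and a lambda that calls cv::ocl::finish() as `sync`; for the CPU loop, `sync` can be an empty lambda.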
