Building PyTorch on the Sophgo RISC-V general-purpose cloud development space @openKylin (notes for the record)

I finally get to try RISC-V! The operating system is openKylin, running in Sophgo's cloud space.

The goal: compile and install PyTorch.

First install git:

apt install git

Then download PyTorch and Sophgo's cpuinfo library:

git clone https://github.com/sophgo/cpuinfo.git

git clone https://github.com/pytorch/pytorch

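Note: sync and fetch the submodules before building:

cd pytorch
# Make sure each submodule's remote URL matches the configuration in the parent repository
git submodule sync
# Fetch and update all submodules, initializing any that are not yet initialized and recursing into nested submodules
git submodule update --init --recursive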

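Delete the cpuinfo directory under pytorch/third_party and replace it with Sophgo's cpuinfo library:

# still inside the pytorch directory from the previous step
cd third_party

rm -rf cpuinfo

# the Sophgo cpuinfo clone sits next to the pytorch directory
cp -rf ../../cpuinfo .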

Install the required libraries

apt install libopenblas-dev fails with an error; it can be skipped.

apt install libblas-dev m4 cmake cython3 ccache

Build and install OpenBLAS by hand

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make -j8
make PREFIX=/usr/local/OpenBLAS install

The build prints a pile of warnings.

Append the following as the last line of /etc/profile:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/OpenBLAS/lib/

Then run: source /etc/profile

Modify the source

In the pytorch directory, run:

vi aten/src/ATen/CMakeLists.txt

Replace the statement: if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
with: if(FALSE)

vi caffe2/CMakeLists.txt

Replace the statement: target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 sleef gtest_main)
with: target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 gtest_main)

vi test/cpp/api/CMakeLists.txt

Below the statement: add_executable(test_api ${TORCH_API_TEST_SOURCES})
add: target_compile_options(test_api PUBLIC -Wno-nonnull)
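
The three edits above could also be scripted instead of made by hand in vi. The following is only a sketch: it assumes the quoted statements appear verbatim in the checked-out files, and the helper names replace_once and insert_after are made up for this illustration. It is demonstrated on in-memory strings, not on the real CMakeLists.txt files.

```python
def replace_once(text: str, old: str, new: str) -> str:
    """Replace the first occurrence of old with new; fail loudly if old is absent."""
    if old not in text:
        raise ValueError(f"pattern not found: {old!r}")
    return text.replace(old, new, 1)


def insert_after(text: str, anchor: str, addition: str) -> str:
    """Insert addition as a new line after the first line containing anchor."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            lines.insert(i + 1, addition + "\n")
            return "".join(lines)
    raise ValueError(f"anchor not found: {anchor!r}")


# Demonstration on small in-memory snippets (hypothetical file contents).
aten = "if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)\n"
print(replace_once(
    aten,
    "if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)",
    "if(FALSE)",
))

api = "add_executable(test_api ${TORCH_API_TEST_SOURCES})\n"
print(insert_after(
    api,
    "add_executable(test_api",
    "target_compile_options(test_api PUBLIC -Wno-nonnull)",
))
```

Raising an error when a pattern is missing is deliberate: if a newer PyTorch checkout words these lines differently, a silent no-op would be harder to notice than a failed script.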

Environment variables

# Type these directly in the terminal; they must be entered again after a restart
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=16
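
Since the exports are lost on restart, one alternative is to hand the same switches to the build from a small Python launcher. This is a sketch under assumptions: build_env is a made-up helper name, and the actual build call is left commented out so nothing runs by accident.

```python
import os

# Build switches copied from the export lines above.
BUILD_FLAGS = {
    "USE_CUDA": "0",
    "USE_DISTRIBUTED": "0",
    "USE_MKLDNN": "0",
    "MAX_JOBS": "16",
}


def build_env(base=None):
    """Return a copy of the environment with the build switches applied."""
    env = dict(os.environ if base is None else base)
    env.update(BUILD_FLAGS)
    return env


if __name__ == "__main__":
    env = build_env()
    for name in BUILD_FLAGS:
        print(f"{name}={env[name]}")
    # To launch the build with these settings (run from the pytorch directory):
    # import subprocess
    # subprocess.run(["python3", "setup.py", "develop", "--cmake"], env=env, check=True)
```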

Source for this configuration: https://blog.csdn.net/m0_49267873/article/details/135670989

Build and install

Run:

python3 setup.py develop --cmake

or: python3.10 setup.py install

Reportedly gcc 13 or later is required. The gcc that ships with the system is:

gcc version 9.3.0 (Openkylin 9.3.0-ok12)
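
To compare the installed compiler against that reported minimum, the major version can be pulled out of the version line. A minimal sketch: gcc_major is a made-up helper, and the threshold of 13 is only the hearsay figure quoted above, not a confirmed requirement.

```python
import re


def gcc_major(version_line: str):
    """Extract the major version from a 'gcc version X.Y.Z ...' line, or None."""
    match = re.search(r"gcc version (\d+)\.", version_line)
    return int(match.group(1)) if match else None


line = "gcc version 9.3.0 (Openkylin 9.3.0-ok12)"  # the line shown above
major = gcc_major(line)
if major is not None and major >= 13:
    print(major, "meets the reported minimum of 13")
else:
    print(major, "is below the reported minimum of 13")
```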

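A patchelf fix is needed (it resolves the undefined reference to __atomic_exchange_1 described in the debugging notes below):

# If the patchelf command is missing, install it first
apt install patchelf

# /path is the directory containing libtorch_cpu.so
patchelf --add-needed libatomic.so.1 /path/libtorch_cpu.so

On the Sophgo cloud system the command is: patchelf --add-needed libatomic.so.1 /root/pytorch/build/lib/libtorch_cpu.so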

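Preparation before building

These two libraries must also be installed before building:

pip3 install pyyaml typing_extensions

setuptools also needs to be upgraded:

pip3 install setuptools -U

Final build

In the pytorch directory, run:

python3 setup.py develop --cmake

The whole build takes roughly 3-4 hours.

The build finally completes: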

Installed /usr/lib/python3.8/site-packages/mpmath-1.3.0-py3.8.egg
Searching for typing-extensions==4.9.0
Best match: typing-extensions 4.9.0
Adding typing-extensions 4.9.0 to easy-install.pth file
detected new path './mpmath-1.3.0-py3.8.egg'

Using /usr/local/lib/python3.8/dist-packages
Finished processing dependencies for torch==2.3.0a0+git5c5b71b

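Testing

Start python3 and run import pytorch: it fails because there is no module named pytorch. Run import torch instead.

No error appeared, so I assumed the test had passed. Then I suspected it only worked because I was in the pytorch directory, which contains a torch subdirectory, and that the pass was an illusion.

I was too hasty: with develop mode, this is exactly how it is meant to be used.

In other words, you must be in the pytorch directory for it to be picked up as the develop-mode torch. In ~/pytorch, start python3 and paste the code below into the interactive prompt; the test passes.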

import torch
import torch.nn as nn
import torch.optim as optim
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
N, D_in, H, D_out = 64, 1000, 100, 10  # N: batch size, D_in: input size, H: hidden size, D_out: output size
x = torch.randn(N,D_in) # x = np.random.randn(N,D_in)
y = torch.randn(N,D_out) # y = np.random.randn(N,D_out)
w1 = torch.randn(D_in,H) # w1 = np.random.randn(D_in,H)
w2 = torch.randn(H,D_out) # w2 = np.random.randn(H,D_out)
learning_rate = 1e-6
for it in range(200):
    # forward pass
    h = x.mm(w1)  # N * H      h = x.dot(w1)
    h_relu = h.clamp(min=0)  # N * H     np.maximum(h,0)
    y_pred = h_relu.mm(w2)  # N * D_out     h_relu.dot(w2)
    # compute loss
    loss = (y_pred - y).pow(2).sum()  # np.square(y_pred-y).sum()
    print(it, loss.item())  # print(it,loss)
    # BP - compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)  # h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())  # grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.clone()  # grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)  # x.T.dot(grad_h)
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 29870438.0
1 26166322.0
2 25949932.0
3 25343224.0
4 22287072.0
5 16840522.0
6 11024538.0
7 6543464.5
8 3774165.25
9 2248810.5
10 1440020.25
11 1001724.5
12 749632.625
13 592216.6875
14 485451.34375
15 407586.65625
16 347618.4375
17 299686.625
18 260381.9375
19 227590.734375

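How can torch be made usable from anywhere in the environment?

It feels like an environment-variable issue; stay tuned.

Debugging

Error installing libopenblas-dev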
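
Until that question is settled, one way to see what Python actually finds is to ask where a module would be loaded from. This is a diagnostic sketch only, not a fix: module_origin is a made-up helper, and the idea is to run it once inside ~/pytorch and once from another directory and compare the results.

```python
import importlib.util


def module_origin(name: str):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Prints a path if torch is importable from the current directory, otherwise None.
print("torch ->", module_origin("torch"))
```

If the path points into the source checkout in one directory and the result is None elsewhere, the import is resolving through the current working directory rather than through an installed location.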

root@863c89a419ec:~/pytorch/third_party# apt install libopenblas-dev
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libopenblas-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

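It turns out someone had already hit this pitfall: the package can be skipped, and OpenBLAS built and installed from source instead.

Error while building PyTorch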

python3 setup.py develop --cmake

Building wheel torch-2.3.0a0+git5c5b71b
-- Building version 2.3.0a0+git5c5b71b
Could not find any of CMakeLists.txt, Makefile, setup.py, LICENSE, LICENSE.md, LICENSE.txt in /root/pytorch/third_party/pybind11
Did you run 'git submodule update --init --recursive'?

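Go into the third_party directory and run the following to fix it:

rm -rf pthreadpool
# Go back to the pytorch directory before running the next command
git submodule update --init --recursive

After that, there is still an error: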

root@863c89a419ec:~/pytorch# python3 setup.py develop --cmake
Building wheel torch-2.3.0a0+git5c5b71b
-- Building version 2.3.0a0+git5c5b71b
Could not find any of CMakeLists.txt, Makefile, setup.py, LICENSE, LICENSE.md, LICENSE.txt in /root/pytorch/third_party/QNNPACK
Did you run 'git submodule update --init --recursive'?

Running git submodule update --init --recursive again made no difference.

After deleting the QNNPACK directory and running git submodule update --init --recursive once more, it passed.

Error: RuntimeError: Missing build dependency: Unable to `import yaml`.

pip3 install pyyaml

Error: ModuleNotFoundError: No module named 'typing_extensions'

pip3 install typing_extensions fixed it.

Error at 78% of the build
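
Rather than discovering incomplete submodules one build failure at a time, the third_party directory can be scanned up front. A rough sketch: the marker file names are taken from the error message above, incomplete_checkouts is a made-up helper, and a directory it reports is only a candidate, since some entries may legitimately lack all of these files.

```python
from pathlib import Path

# File names the setup.py error message looks for in each submodule.
MARKERS = ("CMakeLists.txt", "Makefile", "setup.py", "LICENSE", "LICENSE.md", "LICENSE.txt")


def incomplete_checkouts(third_party: Path):
    """List subdirectories that contain none of the marker files."""
    missing = []
    for child in sorted(third_party.iterdir()):
        if child.is_dir() and not any((child / m).exists() for m in MARKERS):
            missing.append(child.name)
    return missing


if __name__ == "__main__":
    root = Path("third_party")  # assumes the current directory is the pytorch checkout
    if root.is_dir():
        for name in incomplete_checkouts(root):
            print("looks incomplete:", name)
```

Any directory it lists can then be deleted and refetched with git submodule update --init --recursive, as was done for QNNPACK above.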

/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/NamedTensor_test.dir/build.make:101: bin/NamedTensor_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3288: caffe2/CMakeFiles/NamedTensor_test.dir/all] Error 2
/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/cpu_profiling_allocator_test.dir/build.make:101: bin/cpu_profiling_allocator_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3505: caffe2/CMakeFiles/cpu_profiling_allocator_test.dir/all] Error 2
[ 78%] Linking CXX executable ../bin/cpu_rng_test
/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/cpu_rng_test.dir/build.make:101: bin/cpu_rng_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3536: caffe2/CMakeFiles/cpu_rng_test.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

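My first suspicion was the cpuinfo library. I checked it, and it was fine.

Try this approach:

Problem analysis: undefined reference to __atomic_exchange_1

Fix: use patchelf to add the required shared library

# If the patchelf command is missing, install it first
apt install patchelf

# /path is the directory containing libtorch_cpu.so
patchelf --add-needed libatomic.so.1 /path/libtorch_cpu.so

Path to libtorch_cpu.so: /root/pytorch/build/lib/libtorch_cpu.so

So the command is: patchelf --add-needed libatomic.so.1 /root/pytorch/build/lib/libtorch_cpu.so

Sure enough, after running this command the build was able to continue.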
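
To confirm the patch took effect, the library's dependency list can be inspected with patchelf --print-needed. The sketch below wraps that call; has_needed and needed_libraries are made-up helper names, and the sample string is invented for illustration, not real output from this build.

```python
import subprocess


def has_needed(print_needed_output: str, library: str) -> bool:
    """Check whether library appears in the output of `patchelf --print-needed`."""
    return library in print_needed_output.split()


def needed_libraries(path: str) -> str:
    """Run `patchelf --print-needed` on path and return its raw output."""
    result = subprocess.run(
        ["patchelf", "--print-needed", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Invented sample output, one library per line, for illustration only.
sample = "libc10.so\nlibatomic.so.1\nlibc.so.6\n"
print(has_needed(sample, "libatomic.so.1"))
```

On the build machine the check would be has_needed(needed_libraries("/root/pytorch/build/lib/libtorch_cpu.so"), "libatomic.so.1").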

Error at 100% of the build

running develop
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:146: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
Traceback (most recent call last):
  File "setup.py", line 1401, in <module>
    main()
  File "setup.py", line 1346, in main
    setup(
  File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3/dist-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/usr/lib/python3/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/usr/lib/python3/dist-packages/setuptools/_distutils/dist.py", line 973, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/usr/lib/python3/dist-packages/setuptools/_distutils/dist.py", line 991, in run_command
    cmd_obj.ensure_finalized()
  File "/usr/lib/python3/dist-packages/setuptools/_distutils/cmd.py", line 109, in ensure_finalized
    self.finalize_options()
  File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 52, in finalize_options
    easy_install.finalize_options(self)
  File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 231, in finalize_options
    self.config_vars = dict(sysconfig.get_config_vars())
UnboundLocalError: local variable 'sysconfig' referenced before assignment

Try upgrading setuptools:

root@863c89a419ec:~# pip3 install setuptools -U
Collecting setuptools
  Using cached setuptools-69.1.0-py3-none-any.whl (819 kB)
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 65.3.0
    Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'setuptools'. No files were found to uninstall.
Successfully installed setuptools-69.1.0

Then I built again, and it passed!
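
Because pip left the old copy in /usr/lib/python3/dist-packages in place, it can be worth checking which setuptools version Python now resolves. A small sketch using the standard importlib.metadata module (Python 3.8+); installed_version is a made-up helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the version string of the first matching distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("setuptools:", installed_version("setuptools"))
```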