In the previous article we built the basic environment from scratch; this one continues with calibration, teleoperation, data recording, uploading, and training.
Environment: Gigabyte RTX 5060 GPU, i5-13490F CPU, Gigabyte B760M Gaming motherboard, dual-booting Ubuntu 22.04 and Windows 10 Pro.
Main reference tutorial: https://huggingface.co/docs/lerobot/so101?example=Linux
With the environment ready, we need to connect the hardware. If you don't have hardware yet, you can search "soarm101" on Taobao, or print the parts yourself; the design files are open-sourced on GitHub. Wiring and assembly are out of scope here; if you bought from Taobao, the seller's support can walk you through it. From here on I assume the arms are assembled and connected to the host.
Configure the motors
1. Find the USB ports associated with each arm
python lerobot/find_port.py
The repo provides a port-finding script, but I haven't used it. On Linux the ports are basically ttyACM0, ttyACM1, ttyACM2; the primitive method is to disconnect one arm (power off first, then the data cable) and see which port disappears — that was its port. Important: never hot-plug. Cut the power first, then unplug the data cable, and keep this in mind throughout all later debugging, or you may damage the motors.
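The unplug-and-diff trick can be sketched in a few lines (a sketch of the idea, not the official find_port.py):

```python
# Sketch of port finding: snapshot the serial ports, unplug one arm
# (power off first!), snapshot again; the missing entry is that arm's port.
import glob

def list_acm_ports():
    # On Linux the controller boards show up as /dev/ttyACM*
    return set(glob.glob("/dev/ttyACM*"))

def find_removed_port(before, after):
    """Return the port(s) present before unplugging but gone after."""
    return sorted(set(before) - set(after))

# Example with hypothetical snapshots:
before = {"/dev/ttyACM0", "/dev/ttyACM1"}
after = {"/dev/ttyACM0"}
print(find_removed_port(before, after))  # ['/dev/ttyACM1']
```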
Grant permissions:
sudo chmod 666 /dev/ttyACM0
sudo chmod 666 /dev/ttyACM1
2. Set the motors ids and baudrates
As in the official video, you connect the motors one at a time and set each ID. There is also a simpler, hands-off alternative: if you have a Windows machine, download Feetech's motor debugging software; the firmware-update section of this article has the download link: https://blog.csdn.net/Jzzzzzzzzzzzzzz/article/details/148081567?spm=1001.2014.3001.5501
(Summary of that article, on the calibration error "ConnectionError: Read failed due to communication error on port /dev/ttyACM0": possible cause one is loose wiring between the control board and the motors; possible cause two is that the Feetech servo firmware needs updating — in the debug tool, set the baudrate to maximum, search, and upgrade every servo, then rerun the lerobot calibration script. That author's firmware version was 1.9.8.3.)
The official way is to set the IDs one by one in software. Here is a Bilibili video you can follow step by step:
LeRobot SO-ARM101 embodied-AI robot arm - assembly and configuration tutorial, by WowRobo: https://www.bilibili.com/video/BV13bLyzKES8?t=3244.6
You can of course also follow the official tutorial, but that route may tempt some people into hot-plugging, and it requires unplugging each motor cable one by one, which I find cumbersome.
Follower
Connect the usb cable from your computer and the power supply to the follower arm's controller board. Then, run the following command or run the API example with the port you got from the previous step. You'll also need to give your follower arm a name with the id parameter.
Remember to substitute your own port.
python -m lerobot.setup_motors \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841  # <- paste here the port found at previous step
You should see the following instruction
Connect the controller board to the 'gripper' motor only and press enter.
As instructed, plug in the gripper's motor. Make sure it's the only motor connected to the board, and that the motor itself is not yet daisy-chained to any other motor. As you press [Enter], the script will automatically set the id and baudrate for that motor.
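Conceptually the setup script walks the chain one motor at a time, assigning each joint an ID and the bus baudrate. A toy sketch of that plan (the ID order and 1 Mbps rate are assumptions based on the common SO-ARM convention, not lerobot's actual code):

```python
# Toy sketch of sequential motor-ID setup: each motor is connected alone,
# assigned its ID and the bus baudrate, then the next motor is daisy-chained.
# The script starts from the gripper, which conventionally gets the highest ID.
JOINTS = ["gripper", "wrist_roll", "wrist_flex",
          "elbow_flex", "shoulder_lift", "shoulder_pan"]
TARGET_BAUD = 1_000_000  # typical Feetech bus rate (assumption)

def plan_motor_setup(joints, baudrate):
    """Return (joint, id, baudrate) in the order the script applies them."""
    n = len(joints)
    return [(name, n - i, baudrate) for i, name in enumerate(joints)]

for name, motor_id, baud in plan_motor_setup(JOINTS, TARGET_BAUD):
    print(f"connect only '{name}' -> set id={motor_id}, baud={baud}")
```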
Leader
Do the same steps for the leader arm.
python -m lerobot.setup_motors \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem575E0031751  # <- paste here the port found at previous step
Calibrate
python -m lerobot.calibrate \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem58760431551 \  # <- The port of your robot
    --robot.id=my_awesome_follower_arm  # <- Give the robot a unique name
Again, substitute your own port; renaming is optional.
First move the arm to its neutral (middle) position, then sweep every joint motor through its full range, reaching both the maximum and minimum angles. The script continuously samples the current angle and keeps the running max and min, which is how the calibration is produced.
Reference video: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/calibrate_so101_2.mp4
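The min/max tracking described above amounts to a running extremum per joint; a minimal sketch of the idea:

```python
# Minimal sketch of range calibration: stream joint readings while
# you move each joint end to end; keep the running min and max.
def calibrate(readings):
    """readings: iterable of {joint_name: position} samples.
    Returns {joint_name: (min_pos, max_pos)}."""
    ranges = {}
    for sample in readings:
        for joint, pos in sample.items():
            lo, hi = ranges.get(joint, (pos, pos))
            ranges[joint] = (min(lo, pos), max(hi, pos))
    return ranges

samples = [{"elbow_flex": 10}, {"elbow_flex": -95}, {"elbow_flex": 98}]
print(calibrate(samples))  # {'elbow_flex': (-95, 98)}
```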
Leader
Do the same steps to calibrate the leader arm, run the following command or API example:
Do the same for the leader arm.
python -m lerobot.calibrate \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \  # <- The port of your robot
    --teleop.id=my_awesome_leader_arm  # <- Give the robot a unique name
Teleoperate
With calibration done we can try teleoperation, i.e. the leader arm driving the follower. The training later requires recording data through teleoperation. As before, change the ports to your own.
python -m lerobot.teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem58760431541 \
    --robot.id=my_awesome_follower_arm \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm
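Conceptually, what this command runs is a simple loop: read the leader's joint positions and write them to the follower at a fixed rate. A stub sketch with mock classes (stand-ins for illustration, not lerobot's real API):

```python
# Stub sketch of the teleop loop. MockArm stands in for a real arm;
# the real code talks to the serial bus on /dev/ttyACM*.
class MockArm:
    def __init__(self):
        self.positions = {"shoulder_pan": 0.0, "gripper": 0.0}

    def read_positions(self):
        return dict(self.positions)

    def write_positions(self, targets):
        self.positions.update(targets)

def teleop_step(leader, follower):
    """One tick: mirror the leader's pose onto the follower."""
    action = leader.read_positions()
    follower.write_positions(action)
    return action

leader, follower = MockArm(), MockArm()
leader.positions["gripper"] = 42.0  # pretend the operator squeezed the leader's gripper
teleop_step(leader, follower)
print(follower.positions["gripper"])  # 42.0
```

The real loop just repeats this step at the configured fps (15 Hz in the logs below) while also reading cameras and displaying data.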
Teleoperate with cameras
It's the same command plus the camera options. Note that the official tutorial has a small issue here: its example uses the Koch arm, so we need to change it to so101.
python -m lerobot.teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem58760431541 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true
Add your token to the CLI by running this command. Replace the token with your own API key: register on the Hugging Face website, create your dataset repository, and generate an access token.
huggingface-cli login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
Check that the output is your Hugging Face username.
HF_USER=$(huggingface-cli whoami | head -n 1)
echo $HF_USER
Once everything is ready, we can start recording data.
python -m lerobot.record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=2 \
    --dataset.single_task="Grab the black cube"
Remember to change the ports. Voice prompts announce each episode (0, 1, ...) as it is recorded, or you can watch the rerun window; when teleoperation stops responding, recording has finished.
Next comes a problem you may well run into. The error output:
(lerobot) dora@dora-B760M-GAMING:~/dora_ws/lerobot_v2$ python -m lerobot.teleoperate --robot.type=so101_follower --robot.port=/dev/ttyACM1 --robot.id=my_awesome_follower_arm --teleop.type=so101_leader --teleop.port=/dev/ttyACM2 --teleop.id=my_awesome_leader_arm --display_data=true --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
> [2025-06-23T08:26:10Z INFO re_grpc_server] Listening for gRPC connections on 0.0.0.0:9876. Connect by running `rerun --connect rerun+http://127.0.0.1:9876/proxy`
[2025-06-23T08:26:10Z WARN wgpu_hal::gles::egl] Re-initializing Gles context due to Wayland window
[2025-06-23T08:26:10Z INFO egui_wgpu] There were 2 available wgpu adapters: {backend: Vulkan, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver: "llvmpipe", driver_info: "Mesa 23.2.1-1ubuntu3.1~22.04.3 (LLVM 15.0.7)", vendor: Mesa (0x10005)}, {backend: Gl, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver_info: "4.5 (Core Profile) Mesa 23.2.1-1ubuntu3.1~22.04.3", vendor: Mesa (0x10005)}
[2025-06-23T08:26:10Z WARN re_renderer::context] Software rasterizer detected - expect poor performance. See: https://www.rerun.io/docs/getting-started/troubleshooting#graphics-issues
[2025-06-23T08:26:10Z INFO re_renderer::context] wgpu adapter backend: Vulkan, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver: "llvmpipe", driver_info: "Mesa 23.2.1-1ubuntu3.1~22.04.3 (LLVM 15.0.7)"
/home/dora/dora_ws/lerobot_v2/lerobot/teleoperate.py:87: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"observation_{obs}", rr.Scalar(val))
/home/dora/dora_ws/lerobot_v2/lerobot/teleoperate.py:92: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"action_{act}", rr.Scalar(val))
---------------------------
NAME | NORM
shoulder_pan.pos | 13.38
shoulder_lift.pos | -98.73
elbow_flex.pos | 99.46
wrist_flex.pos | 46.98
wrist_roll.pos | 0.88
gripper.pos | 5.88
time: 67.44ms (15 Hz)
Traceback (most recent call last):
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/teleoperate.py", line 137, in <module>
    teleoperate()
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/draccus/argparsing.py", line 225, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/teleoperate.py", line 126, in teleoperate
    teleop_loop(teleop, robot, cfg.fps, display_data=cfg.display_data, duration=cfg.teleop_time_s)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/teleoperate.py", line 84, in teleop_loop
    observation = robot.get_observation()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/robots/so101_follower/so101_follower.py", line 167, in get_observation
    obs_dict[cam_key] = cam.async_read()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/cameras/opencv/camera_opencv.py", line 448, in async_read
    raise TimeoutError(
TimeoutError: Timed out waiting for frame from camera OpenCVCamera(0) after 200 ms. Read thread alive: True.
It launched successfully at first, but terminated almost immediately. The key line:
TimeoutError: Timed out waiting for frame from camera OpenCVCamera(0) after 200 ms. Read thread alive: True.
The problem is that timeout. The log had already printed a relevant warning:
[2025-06-23T08:26:10Z WARN re_renderer::context] Software rasterizer detected - expect poor performance. See: https://www.rerun.io/docs/getting-started/troubleshooting#graphics-issues
[2025-06-23T08:26:10Z INFO re_renderer::context] wgpu adapter backend: Vulkan, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)"
This shows that my system was not using the hardware GPU for rendering (device_type: Cpu, name: "llvmpipe"); it was falling back to llvmpipe, a CPU-based software renderer, so performance was terrible and the camera read timed out. The fix is to install the NVIDIA driver. As of my testing, the 570.144 driver has already landed in Ubuntu 22.04 and later (and the PPA for 20.04 has it too), and this release supports the 50-series GPUs, so you only need to append "-open" to nvidia-driver-570:
sudo apt install nvidia-driver-570-open
NVIDIA Proprietary: the NVIDIA proprietary kernel module.
GPL/MIT: the open-source kernel module released under the GPL/MIT licenses.
NVIDIA published a dedicated post about this on its technical blog: "NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules". One passage reads:
Supported GPUs
Not every GPU is compatible with the open-source GPU kernel modules.
For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules, because these platforms do not support the proprietary driver.
For newer GPUs from the Turing, Ampere, Ada Lovelace, or Hopper architectures, NVIDIA recommends switching to the open-source GPU kernel modules.
For older GPUs from the Maxwell, Pascal, or Volta architectures, the open-source GPU kernel modules are not compatible; continue to use the NVIDIA proprietary driver.
For mixed deployments of older and newer GPUs in the same system, continue to use the proprietary driver.
Since the 50 series is the Blackwell architecture, we have to pick the open-source (GPL) kernel module.
Reference article: https://minetest.top/archives/zai-linuxxia-wei-nvidia-50xi-an-zhuang-xian-qia-qu-dong
After installation, run the command below to check the driver, and note your CUDA version; later we will install a PyTorch build that has to match it. Mine is CUDA 12.8.
nvidia-smi
OK, let's continue and run the command again. (Writing this up, I notice that the failing command above was actually the teleoperation-with-camera one; no matter, the GPU driver had to be installed anyway.) What runs below is the recording command, and it fails with the same timeout. Let's examine the error output.
(lerobot) dora@dora-B760M-GAMING:~/dora_ws$ python -m lerobot.record \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM1 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM2 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=2 \
    --dataset.single_task="Grab the blue cube"
[2025-06-23T09:01:57Z INFO re_grpc_server] Listening for gRPC connections on 0.0.0.0:9876. Connect by running `rerun --connect rerun+http://127.0.0.1:9876/proxy`
[2025-06-23T09:01:57Z INFO winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-06-23T09:01:57Z WARN wgpu_hal::gles::egl] No config found!
[2025-06-23T09:01:57Z WARN wgpu_hal::gles::egl] EGL says it can present to the window but not natively
[2025-06-23T09:01:57Z WARN wgpu_hal::gles::adapter] Max vertex attribute stride unknown. Assuming it is 2048
[2025-06-23T09:01:57Z WARN wgpu_hal::gles::adapter] Max vertex attribute stride unknown. Assuming it is 2048
[2025-06-23T09:01:57Z INFO egui_wgpu] There were 3 available wgpu adapters: {backend: Vulkan, device_type: DiscreteGpu, name: "NVIDIA Graphics Device", driver: "NVIDIA", driver_info: "570.133.07", vendor: NVIDIA (0x10DE), device: 0x2D04}, {backend: Vulkan, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver: "llvmpipe", driver_info: "Mesa 23.2.1-1ubuntu3.1~22.04.3 (LLVM 15.0.7)", vendor: Mesa (0x10005)}, {backend: Gl, device_type: Other, name: "NVIDIA Graphics Device/PCIe/SSE2", driver_info: "3.3.0 NVIDIA 570.133.07", vendor: NVIDIA (0x10DE)}
/home/dora/dora_ws/lerobot_v2/lerobot/record.py:226: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"observation.{obs}", rr.Scalar(val))
/home/dora/dora_ws/lerobot_v2/lerobot/record.py:231: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"action.{act}", rr.Scalar(val))
/home/dora/dora_ws/lerobot_v2/lerobot/record.py:226: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"observation.{obs}", rr.Scalar(val))
/home/dora/dora_ws/lerobot_v2/lerobot/record.py:231: DeprecationWarning: since 0.23.0: Use `Scalars` instead.
  rr.log(f"action.{act}", rr.Scalar(val))
Traceback (most recent call last):
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/record.py", line 346, in <module>
    record()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/record.py", line 308, in record
    record_loop(
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/datasets/image_writer.py", line 36, in wrapper
    raise e
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/datasets/image_writer.py", line 29, in wrapper
    return func(*args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/record.py", line 189, in record_loop
    observation = robot.get_observation()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/robots/so101_follower/so101_follower.py", line 167, in get_observation
    obs_dict[cam_key] = cam.async_read()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/cameras/opencv/camera_opencv.py", line 448, in async_read
    raise TimeoutError(
TimeoutError: Timed out waiting for frame from camera OpenCVCamera(0) after 200 ms. Read thread alive: True.
FATAL: exception not rethrown
Aborted (core dumped)
From this output we can see that the GPU driver is now installed:
INFO egui_wgpu] There were 3 available wgpu adapters: {backend: Vulkan, device_type: DiscreteGpu, name: "NVIDIA Graphics Device", driver: "NVIDIA", driver_info: "570.133.07", vendor: NVIDIA (0x10DE), device: 0x2D04} ...
Let's analyze the difference between lerobot.teleoperate and lerobot.record:
teleoperate: its main job is to read the leader arm -> send commands to the robot -> read the camera -> display on screen. The performance demand is mainly real-time display; it was choppy, but it displayed.
record: on top of teleoperate, it adds an extremely resource-hungry task: processing, encoding, and writing every frame to disk.
So the problem is that processing capacity has hit its limit. The system must simultaneously: maintain low-latency USB serial communication with both arms; capture a high-bandwidth 1080p 30 fps video stream from the camera; and encode every 1080p frame and write it to disk. The critical load is the image encoding and disk I/O, which consume a lot of CPU and system resources. While the main loop is busy processing and saving the previous frame, it has no time to request the next frame from the camera within 200 ms, hence the timeout. Although the GPU now helps with display, the standard OpenCV capture and lerobot's data-saving logic run mostly on the CPU (this analysis comes from Gemini 2.5 Pro). So even with the GPU driver fixed, the CPU and disk I/O remain the bottleneck.
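Some back-of-envelope arithmetic supports this (a rough sketch, assuming 3 bytes per pixel of raw data before encoding):

```python
# Rough raw-data-rate estimate for an uncompressed camera stream.
def raw_rate_mb_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed camera data rate in MB/s (1 MB = 1e6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

hi = raw_rate_mb_s(1920, 1080, 30)  # 1080p30: ~186.6 MB/s before encoding
lo = raw_rate_mb_s(640, 480, 30)    # 480p30:  ~27.6 MB/s
print(f"{hi:.1f} MB/s vs {lo:.1f} MB/s, ratio {hi / lo:.2f}x")
```

Every one of those megabytes has to pass through the CPU for encoding and then through disk I/O, which is why dropping the resolution (next section) helps so much.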
So we've established that the bottleneck is recording (writing to disk), not capture (reading the camera), and we need to reduce the load. The direct fix is to lower the camera resolution in the record command:
python -m lerobot.record \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM1 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM2 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=2 \
    --dataset.single_task="Grab the blue cube"
This drops the resolution from 1920x1080 down to 640x480, shrinking each frame's data by roughly a factor of 7. With that change, not only is the live display smooth, the recording problem is solved as well.
With recording done, we can go take a look at the data on Hugging Face.
How do you manage datasets on Hugging Face?
The earlier flag --dataset.repo_id=${HF_USER}/record-test1 actually created a brand-new repository and uploaded the data to it. Assuming your Hugging Face username ($HF_USER) is dora, you can now view the freshly uploaded dataset at https://huggingface.co/dora/record-test1.
Give every recording a meaningful name. For example:
Recording blue-cube grasps today: --dataset.repo_id=${HF_USER}/so101_grab_blue_cube
Recording red-ball pushes tomorrow: --dataset.repo_id=${HF_USER}/so101_push_red_ball
This keeps every experiment's data separate and easy to manage. Otherwise you end up like I did at first, with a repository full of datasets whose names differ by a single letter and no idea which is which.
Train a policy
To train a policy to control your robot, use the python lerobot/scripts/train.py script. A few arguments are required. Here is an example command.
A note on the flags:
--dataset.repo_id=${HF_USER}/record-test1 is the dataset you uploaded earlier.
--output_dir=outputs/train/act_so101_test is the local directory where checkpoints and logs are written.
--job_name=act_so101_test is the run name shown in wandb. Keep these names consistent with each other, or managing runs becomes confusing; with good naming, wandb will also show clean comparisons across multiple training runs.
python lerobot/scripts/train.py \
    --dataset.repo_id=${HF_USER}/record-test1 \
    --policy.type=act \
    --output_dir=outputs/train/act_so101_test \
    --job_name=act_so101_test \
    --policy.device=cuda \
    --wandb.enable=true
Let's explain the command:
- We provided the dataset as argument with --dataset.repo_id=${HF_USER}/so101_test.
- We provided the policy with policy.type=act. This loads configurations from configuration_act.py. Importantly, this policy will automatically adapt to the number of motor states, motor actions and cameras of your robot (e.g. laptop and phone) which have been saved in your dataset.
- We provided policy.device=cuda since we are training on a Nvidia GPU, but you could use policy.device=mps to train on Apple silicon.
- We provided wandb.enable=true to use Weights and Biases for visualizing training plots. This is optional but if you use it, make sure you are logged in by running wandb login.
That is the official explanation of the command.
A new problem. The error output:
(lerobot) dora@dora-B760M-GAMING:~/dora_ws/lerobot_v2$ python lerobot/scripts/train.py --dataset.repo_id=${HF_USER}/record-test1 --policy.type=act --output_dir=outputs/train/act_so101_test --job_name=act_so101_test --policy.device=cuda --wandb.enable=true
INFO 2025-06-23 17:37:48 ts/train.py:111 {'batch_size': 8, 'dataset': {'episodes': None, ...
Logs will be synced with wandb.
INFO 2025-06-23 17:38:04 db_utils.py:103 Track this run --> https://wandb.ai/yiming_jz-nankai-university/lerobot/runs/nsyt0nci
INFO 2025-06-23 17:38:04 ts/train.py:127 Creating dataset
Generating train split: 3593 examples [00:00, 24028.44 examples/s]
INFO 2025-06-23 17:38:07 ts/train.py:138 Creating policy
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/dora/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 44.6M/44.7M [00:06<00:00, 6.89MB/s]
Traceback (most recent call last):
  File "/home/dora/dora_ws/lerobot_v2/lerobot/scripts/train.py", line 288, in <module>
    train()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/scripts/train.py", line 139, in train
    policy = make_policy(
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/policies/factory.py", line 171, in make_policy
    policy = policy_cls(**kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/policies/act/modeling_act.py", line 74, in __init__
    self.model = ACT(config)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/policies/act/modeling_act.py", line 335, in __init__
    backbone_model = getattr(torchvision.models, config.vision_backbone)(
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torchvision/models/_utils.py", line 142, in wrapper
    return fn(*args, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torchvision/models/_utils.py", line 228, in inner_wrapper
    return builder(*args, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torchvision/models/resnet.py", line 705, in resnet18
    return _resnet(BasicBlock, [2, 2, 2, 2], weights, progress, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torchvision/models/resnet.py", line 301, in _resnet
    model.load_state_dict(weights.get_state_dict(progress=progress, check_hash=True))
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torchvision/models/_api.py", line 90, in get_state_dict
    return load_state_dict_from_url(self.url, *args, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/hub.py", line 871, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/hub.py", line 760, in download_url_to_file
    raise RuntimeError(
RuntimeError: invalid hash value (expected "f37072fd", got "ab125a875db449ac29ade8b1523b786e6c2c91fe710e1ff41441f256095dd741")
(lerobot) dora@dora-B760M-GAMING:~/dora_ws/lerobot_v2$
Cause: the file failed its checksum, so PyTorch considers it corrupted or incompletely downloaded and raises RuntimeError to prevent training on a bad file. Most likely a network hiccup corrupted the download, so deleting the cache and re-downloading fixes it.
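For context, the check that tripped works roughly like this: the expected hash prefix is embedded in the filename (resnet18-f37072fd.pth -> "f37072fd"), and torch compares it against the SHA-256 of the downloaded bytes. A simplified sketch, not PyTorch's actual code:

```python
import hashlib

def check_hash_prefix(data: bytes, expected_prefix: str) -> bool:
    """Compare the SHA-256 of `data` against the short hex prefix
    embedded in the checkpoint filename."""
    digest = hashlib.sha256(data).hexdigest()
    return digest.startswith(expected_prefix)

blob = b"pretend this is resnet18 weights"
prefix = hashlib.sha256(blob).hexdigest()[:8]
print(check_hash_prefix(blob, prefix))         # True: intact download
print(check_hash_prefix(blob + b"x", prefix))  # False: corrupted file
```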
rm -rf /home/dora/.cache/torch/hub/checkpoints
Then rerun the previous training command.
It errors yet again:
INFO 2025-06-23 17:42:56 ts/train.py:202 Start offline training on a fixed dataset
Traceback (most recent call last):
  File "/home/dora/dora_ws/lerobot_v2/lerobot/scripts/train.py", line 288, in <module>
    train()
  File "/home/dora/dora_ws/lerobot_v2/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/scripts/train.py", line 212, in train
    train_tracker, output_dict = update_policy(
  File "/home/dora/dora_ws/lerobot_v2/lerobot/scripts/train.py", line 71, in update_policy
    loss, output_dict = policy.forward(batch)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/policies/act/modeling_act.py", line 147, in forward
    batch = self.normalize_inputs(batch)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dora/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/dora/dora_ws/lerobot_v2/lerobot/common/policies/normalize.py", line 170, in forward
    assert not torch.isinf(mean).any(), _no_stats_error_str("mean")
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I won't go into detail here (ask an AI if you're curious). In short, the prebuilt PyTorch package installed in this environment was not built for the current CUDA, so we need to reinstall one matching our CUDA version.
First, uninstall:
pip uninstall torch torchvision torchaudio
Then check the version number:
nvidia-smi
Then go to the official PyTorch Get Started page and select:
PyTorch Build: Stable
Your OS: Linux
Package: Pip (since we are using pip)
Language: Python
CUDA: the version you just found.
Copy the generated command and run it inside your lerobot conda environment.
After installing, verify it: start the python interpreter in your conda environment and run the following:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
If everything is fine, CUDA available should print True, and you will see the CUDA version PyTorch was built with along with your device name.
If everything is in place, run the training again; the terminal will start printing training logs, and you can also watch the run's status and metrics on wandb.
That said, since the goal this time is just to get the pipeline working, our recorded training set has only two episodes. Such a small dataset will overfit, so training for a long time is pointless; but to exercise the model-evaluation step later, we still need a complete (if terrible) model. So we set up checkpoints ourselves. The official tutorial doesn't configure one, and I don't know how long the default run lasts, so let's trigger a checkpoint every 2000 steps:
python lerobot/scripts/train.py \
    --dataset.repo_id=${HF_USER}/record-test1 \
    --policy.type=act \
    --output_dir=outputs/train/act_so101_test \
    --job_name=act_soarm_2 \
    --policy.device=cuda \
    --wandb.enable=true \
    --save_freq=2000
When training reaches step 2000, the script automatically creates a checkpoints folder under outputs/train/act_so101_test/ and saves the first checkpoint there, probably named step_000002000.ckpt or similar.
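The effect of --save_freq amounts to a modulo check in the training loop; a toy sketch (the zero-padded filename pattern is illustrative, whatever your lerobot version actually writes may differ):

```python
# Toy sketch of checkpointing with save_freq: save whenever the step
# count is a multiple of save_freq, plus a final save at the end.
def checkpoint_steps(total_steps, save_freq):
    """Steps at which a checkpoint would be written, plus the final step."""
    steps = [s for s in range(1, total_steps + 1) if s % save_freq == 0]
    if total_steps not in steps:
        steps.append(total_steps)
    return steps

print(checkpoint_steps(5000, 2000))  # [2000, 4000, 5000]
# A zero-padded name like f"step_{step:09d}" yields e.g. 'step_000002000'
print(f"step_{2000:09d}")
```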
At this point the basic workflow is fully working end to end. One last piece remains, model evaluation, which will come in a follow-up post.
If you enjoy these robot-arm notes, or want to tinker with these open-source arms together, we will be publishing more guides like this; likes and follows are welcome, stay tuned.
And a teaser for the next post: the LeKiwi open-source mobile base!