Hands-On Humanoid Robots (RL)

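1 PPO Explained

The core steps: defining the policy network and the value network, estimating advantages, and updating the policy and the value function, together with implementations of these basic building blocks.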
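The advantage-estimation and update steps listed above can be sketched in plain Python. This is only a minimal illustration under assumptions that are not in the original notes: it uses Generalized Advantage Estimation (GAE) as the advantage estimator, the function names are hypothetical, and the defaults `gamma=0.99`, `lam=0.95`, and `clip=0.2` are common choices rather than values from the source.

```python
import math

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout stored as plain lists."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Value-function targets: advantage plus the old value estimate.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def ppo_clip_loss(new_logp, old_logp, advantages, clip=0.2):
    """Clipped surrogate policy loss (to be minimized), averaged over samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

def value_loss(predicted, returns):
    """Mean squared error between value predictions and return targets."""
    return sum((p - r) ** 2 for p, r in zip(predicted, returns)) / len(returns)
```

In a real training loop the same quantities would be computed on tensors so that the losses can be backpropagated through the policy and value networks; the list-based version here only shows the arithmetic.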

2 Code Structure

May cover:

Initialization, behavior cloning

3 Hands-On Reinforcement Learning

import torch
class ActorCritic(torch.nn.Module):
    pass  # to be filled in
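One possible way to fill in the placeholder above is sketched here. It is an assumption-laden illustration, not the author's implementation and not the `ActorCritic` class from rsl_rl: the layer sizes, the ELU activations, the state-independent `log_std` parameter, and the method names `act` and `evaluate` are all choices made for this example.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ActorCritic(nn.Module):
    """Small Gaussian policy (actor) plus state-value network (critic)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, 1)
        )
        # One learnable log standard deviation per action dimension.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        return Normal(self.actor(obs), self.log_std.exp())

    def act(self, obs):
        """Sample an action; return it with its log-probability and the value."""
        dist = self.distribution(obs)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), self.critic(obs).squeeze(-1)

    def evaluate(self, obs, action):
        """Log-probability, entropy, and value for stored (obs, action) pairs."""
        dist = self.distribution(obs)
        log_prob = dist.log_prob(action).sum(-1)
        entropy = dist.entropy().sum(-1)
        return log_prob, entropy, self.critic(obs).squeeze(-1)
```

During rollout collection `act` would be called under `torch.no_grad()`, while `evaluate` would be called inside the PPO update so that gradients flow through the new log-probabilities and values.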

4 PD Gains

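In robotics, PD gains (proportional-derivative gains) are the gain parameters of proportional control and derivative control, called the P gain (proportional gain) and the D gain (derivative gain). They are the core of the PD control algorithm and play a key role in a robot's motion-control performance. In detail: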

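1. P gain (proportional gain)

Role: The output is proportional to the robot's current error (position error, angle error, and so on), which gives a fast response to that error. For example, when a robot arm must move to a target position and its actual position differs from the target, the proportional term outputs a control signal scaled by the error and drives the arm in the direction that reduces it.
Effect: The larger the proportional gain, the faster the system responds to error. Too large a gain, however, can cause overshoot (motion past the target position) or even oscillation, making the robot's motion unstable.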

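2. D gain (derivative gain)

Role: The output is proportional to the rate of change of the error, so it anticipates where the error is heading. It adjusts the control signal according to how fast the error is changing, which suppresses overshoot and improves stability. For example, while the arm is still closing on the target quickly, the derivative term opposes that motion and reduces the control signal early, so the arm stops smoothly instead of shooting past the target.
Effect: A well-chosen derivative gain improves the system's dynamic response and shortens the settling time. Too large a gain makes the system overly sensitive to noise (sensor noise is amplified and fed into the control signal); too small a gain fails to suppress overshoot effectively.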
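Written as a formula, the two terms combine into the usual PD control law. The symbols are notation chosen for this note rather than taken from the original: $q$ is the measured joint position, $q_{\text{target}}$ the desired position, $e$ the error, $u$ the control output, and $K_p$, $K_d$ the P and D gains.

```latex
e(t) = q_{\text{target}}(t) - q(t), \qquad
u(t) = K_p \, e(t) + K_d \, \dot{e}(t)
```

For a fixed target, $\dot{e} = -\dot{q}$, so the law reduces to $u = K_p\,(q_{\text{target}} - q) - K_d\,\dot{q}$: a spring pulling toward the target plus a damper opposing the joint velocity. When $K_p$ is given in N*m/rad and $K_d$ in N*m*s/rad, $u$ is a torque.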

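Application example in robotics

In robot joint control, PD control is commonly used to regulate motor output. Suppose a joint must rotate from its current angle to a target angle:

While the angle error is large, the proportional term dominates and drives the joint quickly toward the target angle;
As the angle error shrinks, the derivative term adjusts the output according to the rate of change of the error, so the joint settles smoothly at the target angle without swinging back and forth.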
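The behaviour described above can be reproduced with a tiny simulation. The sketch below assumes an idealized single joint (unit inertia, no friction, no gravity) integrated with semi-implicit Euler; the function names and all numeric values are illustrative assumptions, not parameters of any real robot.

```python
def simulate_pd(kp, kd, target=1.0, inertia=1.0, dt=0.001, steps=5000):
    """Simulate one frictionless joint driven by a PD torque.

    Returns the list of joint angles, one per time step.
    """
    q, qd = 0.0, 0.0  # joint angle [rad] and angular velocity [rad/s]
    history = []
    for _ in range(steps):
        torque = kp * (target - q) - kd * qd  # PD law for a fixed target
        qdd = torque / inertia                # angular acceleration
        qd += qdd * dt                        # semi-implicit Euler: velocity first
        q += qd * dt                          # then position
        history.append(q)
    return history

def overshoot(history, target=1.0):
    """How far the joint went past the target (0.0 if it never did)."""
    return max(0.0, max(history) - target)

if __name__ == "__main__":
    for kd in (1.0, 20.0):
        angles = simulate_pd(kp=100.0, kd=kd)
        print(f"kd={kd:5.1f}  overshoot={overshoot(angles):.3f} rad")
```

With `kp=100` and a small `kd`, the joint swings well past the target and oscillates; raising `kd` to 20 (critical damping for unit inertia, since `kd = 2 * sqrt(kp * inertia)`) removes the overshoot.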

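## Robot joint motor control modes and parameters
class control:
    ## Control type: position, velocity, or torque control
    control_type = 'P'  # P: position, V: velocity, T: torques
    ## PD drive parameters
    ## stiffness is the stiffness coefficient k_p; damping is the damping coefficient k_d
    stiffness = {'joint_a': 10.0, 'joint_b': 15.}  # [N*m/rad]
    damping = {'joint_a': 1.0, 'joint_b': 1.5}     # [N*m*s/rad]
    ## The formula is given below; I do not yet understand why converting the action needs this scale factor
    # action scale: target angle = actionScale * action + defaultAngle
    action_scale = 0.5
    ## decimation: Number of control action updates @ sim DT per policy DT
    ## simulation control frequency / decimation = control frequency in the real environment
    decimation = 4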
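How `action_scale` and `decimation` interact with the PD gains can be sketched for a single joint. This is a simplified illustration, not legged_gym's actual code: the `CONTROL` dictionary, `policy_step`, the unit inertia, `default_angle=0.0`, and `sim_dt=0.005` are all assumptions made for the example.

```python
# Hypothetical single-joint values mirroring the config above.
CONTROL = {
    "stiffness": 10.0,    # k_p [N*m/rad]
    "damping": 1.0,       # k_d [N*m*s/rad]
    "action_scale": 0.5,  # maps the policy output to a joint-angle offset
    "decimation": 4,      # simulation steps per policy step
}

def policy_step(action, q, qd, default_angle=0.0, sim_dt=0.005,
                inertia=1.0, cfg=CONTROL):
    """Hold one policy action for `decimation` simulation steps."""
    # target angle = actionScale * action + defaultAngle
    target = cfg["action_scale"] * action + default_angle
    torques = []
    for _ in range(cfg["decimation"]):
        # The PD torque is recomputed every simulation step,
        # while the target stays fixed until the next policy step.
        torque = cfg["stiffness"] * (target - q) - cfg["damping"] * qd
        qd += torque / inertia * sim_dt
        q += qd * sim_dt
        torques.append(torque)
    return q, qd, target, torques

if __name__ == "__main__":
    q, qd, target, torques = policy_step(action=1.0, q=0.0, qd=0.0)
    print(f"target={target} rad, torques={torques}")
    print(f"policy period = {CONTROL['decimation'] * 0.005} s")
```

Under these assumed numbers the PD loop runs at 200 Hz and the policy at 50 Hz. As for the scale factor, one common explanation (offered here as a hedged guess, not something the source states) is that policy outputs are roughly unit-scale, so multiplying by `action_scale` keeps the commanded targets within a modest range around the default joint pose.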

5 Related Research

1 H2O from CMU

Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation: https://human2humanoid.com/

2 legged_gym

A library developed at ETH Zurich.

https://github.com/leggedrobotics/legged_gym

How to use it:

Train:
python legged_gym/scripts/train.py --task=anymal_c_flat
To run on CPU add following arguments: --sim_device=cpu, --rl_device=cpu (sim on CPU and rl on GPU is possible).
To run headless (no rendering) add --headless.
Important: To improve performance, once the training starts press v to stop the rendering. You can then enable it later to check the progress.
The trained policy is saved in issacgym_anymal/logs/<experiment_name>/<date_time>_<run_name>/model_<iteration>.pt, where <experiment_name> and <run_name> are defined in the train config.
The following command line arguments override the values set in the config files:
--task TASK: Task name.
--resume: Resume training from a checkpoint
--experiment_name EXPERIMENT_NAME: Name of the experiment to run or load.
--run_name RUN_NAME: Name of the run.
--load_run LOAD_RUN: Name of the run to load when resume=True. If -1: will load the last run.
--checkpoint CHECKPOINT: Saved model checkpoint number. If -1: will load the last checkpoint.
--num_envs NUM_ENVS: Number of environments to create.
--seed SEED: Random seed.
--max_iterations MAX_ITERATIONS: Maximum number of training iterations.

Play a trained policy:
python legged_gym/scripts/play.py --task=anymal_c_flat
By default, the loaded policy is the last model of the last run of the experiment folder.
Other runs or model iterations can be selected by setting load_run and checkpoint in the train config.

3 rsl_rl

https://github.com/leggedrobotics/rsl_rl

A fast and simple implementation of RL algorithms, designed to run fully on the GPU. The code is an evolution of rl-pytorch, which was released with NVIDIA's Isaac Gym.

Environment repositories that use the framework:

Isaac Lab (built on top of NVIDIA Isaac Sim): https://github.com/isaac-sim/IsaacLab
Legged-Gym (built on top of NVIDIA Isaac Gym): https://leggedrobotics.github.io/legged_gym/

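The main branch supports PPO and student-teacher distillation, along with additional features from the authors' research. These include:

Random Network Distillation (RND) https://proceedings.mlr.press/v229/schwarke23a.html - encourages exploration by adding a curiosity-driven intrinsic reward.
Symmetry-based augmentation https://arxiv.org/abs/2403.04359 - makes the learned behaviors more symmetrical.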
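To make the idea of symmetry-based augmentation concrete, here is a toy sketch of mirroring a batch of transitions left to right. It does not reproduce rsl_rl's implementation or API: the functions `mirror` and `augment_batch`, the index-based specification, and the three-element observation layout are all hypothetical.

```python
def mirror(vec, left_idx, right_idx, negate_idx=()):
    """Return a left-right mirrored copy of a flat vector.

    Entries at left_idx and right_idx are swapped pairwise; entries at
    negate_idx (for example lateral velocity) change sign.
    """
    out = list(vec)
    for l, r in zip(left_idx, right_idx):
        out[l], out[r] = vec[r], vec[l]
    for i in negate_idx:
        out[i] = -out[i]
    return out

def augment_batch(observations, actions, obs_spec, act_spec):
    """Append the mirrored copy of every (observation, action) pair."""
    aug_obs = list(observations)
    aug_act = list(actions)
    for obs, act in zip(observations, actions):
        aug_obs.append(mirror(obs, **obs_spec))
        aug_act.append(mirror(act, **act_spec))
    return aug_obs, aug_act

if __name__ == "__main__":
    # Assumed layout: [lateral velocity, left knee angle, right knee angle]
    obs_spec = {"left_idx": [1], "right_idx": [2], "negate_idx": [0]}
    # Assumed layout: [left knee target, right knee target]
    act_spec = {"left_idx": [0], "right_idx": [1]}
    obs, act = augment_batch([[0.3, 1.0, 2.0]], [[0.1, -0.2]], obs_spec, act_spec)
    print(obs)
    print(act)
```

Training on both the original and the mirrored samples pushes the policy toward producing mirrored actions in mirrored situations, which is the intuition behind the more symmetric behaviors mentioned above.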
