
Inverse Reinforcement Learning (IRL) Explained
What is Inverse Reinforcement Learning?
In traditional Reinforcement Learning (RL), the reward function is known, and the agent's task is to learn a policy that maximizes that reward.
In Inverse Reinforcement Learning (IRL), the situation is reversed:
- We do not know the reward function; it is missing
- But we have expert demonstration trajectories (for example, how an expert drives or walks): $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$
- The goal is to infer a reward function under which the expert's behavior is optimal
In short, IRL means "inferring the motivation behind expert behavior".
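To make the trajectory notation concrete, here is a minimal sketch of collecting one expert demonstration from a Gym environment; the helper `collect_trajectory` and the `expert_policy` callable are illustrative placeholders, not part of any library, and the classic Gym API (matching the code later in this article) is assumed:

```python
import gym

def collect_trajectory(env, expert_policy, max_steps=500):
    """Roll out expert_policy once and record tau = (s_0, a_0, s_1, a_1, ..., s_T)."""
    trajectory = []                       # list of (state, action) pairs
    obs = env.reset()                     # classic Gym reset API
    for _ in range(max_steps):
        action = expert_policy(obs)       # the expert picks an action
        trajectory.append((obs, action))  # record (s_t, a_t)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return trajectory
```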
The general IRL procedure works as follows (a schematic Python sketch follows this list):
- Initialize an actor
- In each iteration:
  - The actor interacts with the environment to obtain some trajectories
  - Define a reward function which makes the trajectories of the teacher better than those of the actor
  - The actor learns to maximize the reward based on the new reward function
- Output the reward function and the actor learned from the reward function
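The loop above can be written schematically as follows; the callables `collect_trajectories`, `update_reward`, and `update_actor` are hypothetical placeholders for whatever concrete rollout, reward-learning, and RL routines are used:

```python
def inverse_rl(env, expert_trajs, actor, reward_fn,
               collect_trajectories, update_reward, update_actor,
               n_iterations=100):
    for _ in range(n_iterations):
        # The actor interacts with the environment to obtain trajectories.
        actor_trajs = collect_trajectories(env, actor)
        # Update the reward function so that the teacher's trajectories
        # score higher than the actor's.
        reward_fn = update_reward(reward_fn, expert_trajs, actor_trajs)
        # The actor learns to maximize the new reward function.
        actor = update_actor(env, actor, reward_fn)
    # Output the reward function and the actor learned from it.
    return reward_fn, actor
```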
IRL Algorithms: GAIL in Detail
GAIL (Generative Adversarial Imitation Learning) combines a Generative Adversarial Network (GAN: a Generator playing against a Discriminator) with reinforcement learning via Policy Gradient.
- It lets the agent learn to produce expert-like trajectories, but it does not learn a reward function directly; the policy is trained purely by imitating the expert's behavior.
Discriminator: tries to distinguish "expert trajectories" from "generator trajectories".
The discriminator's objective is to maximize the log-likelihood: it should output values close to 1 for expert data and close to 0 for policy-generated data:
$$\max_D \; \mathbb{E}_{\text{expert}} [\log D(s, a)] + \mathbb{E}_{\text{policy}} [\log (1 - D(s, a))]$$
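As a concrete illustration, below is a minimal PyTorch sketch of one discriminator update; the network architecture, the dimensions (a CartPole-style 4-dimensional state and a one-hot action of size 2), and the `discriminator_step` helper are all assumptions made for this sketch, not taken from the article:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # e.g. CartPole: 4-dim state, one-hot action of size 2

# D(s, a) -> probability that the (state, action) pair came from the expert
D = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)
d_optimizer = torch.optim.Adam(D.parameters(), lr=3e-4)

def discriminator_step(expert_sa, policy_sa):
    """One update maximizing E_expert[log D] + E_policy[log(1 - D)]."""
    d_expert = D(expert_sa)    # want these close to 1
    d_policy = D(policy_sa)    # want these close to 0
    loss = -(torch.log(d_expert + 1e-8).mean()
             + torch.log(1.0 - d_policy + 1e-8).mean())
    d_optimizer.zero_grad()
    loss.backward()
    d_optimizer.step()
    return loss.item()
```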
Generator (the policy network): tries to "fool" the discriminator into believing that the trajectories it generates came from the expert.
The generator's objective is to minimize:
$$\min_{\pi} \; \mathbb{E}_{\tau \sim \pi} [\log (1 - D(s, a))]$$
This is in fact equivalent to a reinforcement learning problem in which the reward signal becomes:
$$r(s, a) = - \log (1 - D(s, a))$$
- In this way, training is very similar to standard policy gradient, except that the reward comes from the discriminator.
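Continuing the PyTorch sketch above, the discriminator output can be turned into this reward signal and fed to any standard policy-gradient update (for example PPO) in place of the environment reward; `gail_reward` is again an illustrative helper that reuses the discriminator `D` from the previous sketch:

```python
def gail_reward(sa_batch):
    """r(s, a) = -log(1 - D(s, a)): higher when D believes the pair is expert-like."""
    with torch.no_grad():
        d = D(sa_batch)
        return -torch.log(1.0 - d + 1e-8)
```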
A Simple GAIL Code Example
```python
import gym
from stable_baselines3 import PPO
from imitation.algorithms.adversarial import GAIL
from imitation.data import rollout

# 1. Create the environment
env = gym.make("CartPole-v1")

# 2. Load or create an expert model
expert = PPO("MlpPolicy", env, verbose=0)
expert.learn(10000)

# 3. Collect expert trajectory data
trajectories = rollout.rollout(
    expert,
    env,
    rollout.make_sample_until(min_timesteps=None, min_episodes=20),
)

# 4. Create a new model as the actor (the generator policy)
learner = PPO("MlpPolicy", env, verbose=1)

# 5. Train with GAIL (adversarial inverse reinforcement learning)
gail_trainer = GAIL(
    venv=env,
    demonstrations=trajectories,
    gen_algo=learner,
)
gail_trainer.train(10000)

# 6. Test the trained model
obs = env.reset()
for _ in range(1000):
    action, _states = learner.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
env.close()
```