Paper link: ICM: Curiosity-driven Exploration by Self-supervised Prediction
GitHub link: official implementation: noreward-rl
1- Main Contributions
- Defines and models curiosity
  - Definition of curiosity: the prediction error on the next state is used as that state's novelty
    - If the agent truly "understands" a state, then given that state and the action taken, it should be able to predict the next state accurately. In other words: "What I cannot predict well is novel."
- Proposes the Intrinsic Curiosity Module (ICM) to estimate how novel a state is and to assign a corresponding intrinsic reward
- Studies three settings to verify what curiosity contributes in practice:
  - Sparse extrinsic reward: curiosity lets the agent reach the goal with far fewer interactions with the environment;
  - Exploration with no extrinsic reward at all: curiosity drives the agent to explore more effectively;
  - Generalization to unseen scenarios (e.g. new levels of the same game), where knowledge gained from earlier experience helps the agent explore new places faster than starting from scratch.
2- Intrinsic Curiosity Module (ICM) Framework
2.1 The three components of ICM:
- Encoder ($\theta_E$):
  - Maps the current state to a feature $\phi(s_t)$
  - Maps the next state to a feature $\phi(s_{t+1})$
- Forward Model ($\theta_F$): given $\phi(s_t)$ and the action $a_t$, predicts the next state feature $\hat{\phi}(s_{t+1})$
  - Trained with $\min_{\theta_F} L(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))$
  - Intrinsic reward: $r^i_t = \frac{\eta}{2} \|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2_2$ (see the sketch after this list)
- Inverse Model ($\theta_I$): given $\phi(s_t)$ and $\phi(s_{t+1})$, predicts the action $\hat{a}_t$
  - Trained with $\min_{\theta_I, \theta_E} L(a_t, \hat{a}_t)$
  - Constrains the Encoder's representation to the part of the environment that the agent can influence (or that influences the agent)
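A minimal sketch of how the two heads above fit together, using toy feature vectors in place of real Encoder outputs (the sizes `phi_dim`, `n_actions` and the scale `eta` are illustrative assumptions, not values from the official code):

```python
import torch
from torch import nn

phi_dim, n_actions, eta = 512, 4, 1.0   # illustrative sizes; eta scales the intrinsic reward

forward_model = nn.Sequential(nn.Linear(phi_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, phi_dim))
inverse_model = nn.Sequential(nn.Linear(2 * phi_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

phi_s, phi_s_next = torch.randn(1, phi_dim), torch.randn(1, phi_dim)  # stand-ins for phi(s_t), phi(s_{t+1})
a = torch.tensor([2])                                                 # discrete action a_t
a_onehot = nn.functional.one_hot(a, n_actions).float()

# Forward model: predict phi(s_{t+1}); the prediction error is the curiosity reward r^i_t.
phi_s_next_hat = forward_model(torch.cat([phi_s, a_onehot], dim=1))
r_i = 0.5 * eta * (phi_s_next_hat - phi_s_next).pow(2).sum(dim=1)

# Inverse model: recover a_t from the two features; its cross-entropy loss is what trains the Encoder.
a_logits = inverse_model(torch.cat([phi_s, phi_s_next], dim=1))
inv_loss = nn.functional.cross_entropy(a_logits, a)
print(r_i.item(), inv_loss.item())
```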
2.2 What each component does:
- The Encoder simply encodes the state into a feature vector
- The Forward Model takes the encoded state and the action and predicts the next state's feature; the gap between the true next state feature and the prediction is the curiosity intrinsic reward
  - For rarely seen state-action combinations the prediction is inaccurate, so the curiosity reward is high
- The Inverse Model constrains the Encoder's representation to the part of the environment that the agent can influence (or that influences it)
  - Because Encoder + Forward Model alone run into the Noisy-TV problem (the "TV static" problem), illustrated in the sketch after this list:
    - When the screen is pure noise, the observer cannot predict the next state
    - The predicted and true next state features then stay far apart, so the intrinsic reward stays very large and the observer gets addicted to staring at the noisy TV. For example, if muzzle flashes are rendered randomly, the agent may just stand in place firing forever, admiring the sparks.
  - So the Inverse Model infers the action $\hat{a}_t$ the agent chose from two consecutive states, and the inverse prediction error $L(a_t, \hat{a}_t)$ is used to train the Encoder.
    - Eventually the Encoder no longer cares about the difference between two noise frames, so noise stops generating novelty reward
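A small illustrative experiment (not from the paper or its code) of why a raw prediction-error reward gets stuck on noise: a forward model can drive its error toward zero on a learnable transition, but on a "noisy TV" the error, and hence the reward, stays high no matter how long it trains:

```python
import torch
from torch import nn

torch.manual_seed(0)
phi_dim = 8

def final_prediction_error(next_fn, steps=500):
    # Train a fresh forward model against the given transition and report its final error.
    model = nn.Linear(phi_dim, phi_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        phi_s = torch.randn(64, phi_dim)
        loss = nn.functional.mse_loss(model(phi_s), next_fn(phi_s))
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

predictable = lambda phi: phi.roll(1, dims=1)      # deterministic, learnable transition
noisy_tv    = lambda phi: torch.randn_like(phi)    # pure noise: unpredictable by design

print("learnable transition error:", final_prediction_error(predictable))  # approaches 0
print("noisy-TV transition error: ", final_prediction_error(noisy_tv))     # stays around 1
```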
Illustration of the backpropagation updates
- The Forward Model's prediction error is used only to train the Forward Model, not the Encoder
  - i.e. $\hat{\phi}(s_{t+1}) = f_F(\phi(s_t), a_t; \theta_F)$, trained with $\min_{\theta_F} L(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))$
  - Both $\phi(s_t)$ and $\phi(s_{t+1})$ are detached
- The Inverse Model's inverse prediction error is used to train both the Inverse Model and the Encoder
  - i.e. $\min_{\theta_I, \theta_E} L(a_t, \hat{a}_t)$
  - $\hat{a}_t = f_I(\phi(s_t), \phi(s_{t+1}); \theta_I) = f_I(\phi(s_t), f_E(s_{t+1}; \theta_E); \theta_I)$
  - Only $\phi(s_t)$ is detached, so the Encoder receives gradients through $\phi(s_{t+1})$; how these losses combine with the policy objective is sketched below
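In the paper all the pieces are trained jointly under $\min_{\theta_P,\theta_I,\theta_F}\big[-\lambda\,\mathbb{E}_\pi[\sum_t r_t] + (1-\beta)L_I + \beta L_F\big]$. A minimal sketch of that weighting, with hypothetical placeholder tensors standing in for the policy-gradient loss and the ICM outputs:

```python
import torch

# Hypothetical placeholders for quantities computed elsewhere: the policy-gradient loss
# from the RL algorithm (A3C in the paper) and the ICM outputs r_i / inv_loss / forward_loss.
policy_loss  = torch.tensor(0.30, requires_grad=True)
inv_loss     = torch.tensor(1.20, requires_grad=True)   # L_I (inverse model)
forward_loss = torch.tensor(0.80, requires_grad=True)   # L_F (forward model)
r_i          = torch.tensor([0.05, 0.40])               # per-step intrinsic rewards
extrinsic_r  = torch.tensor([0.00, 1.00])               # per-step environment rewards

beta, lambda_ = 0.2, 0.1          # weights reported in the paper
reward = extrinsic_r + r_i        # reward actually fed to the policy (r_i kept out of the graph)
loss = lambda_ * policy_loss + (1 - beta) * inv_loss + beta * forward_loss
loss.backward()                   # in the real setup this single backward pass updates
                                  # the policy, Encoder, Forward Model and Inverse Model
```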
3- Python code
ICM code
```python
from torch import nn
import torch
import torch.nn.functional as F


class cnnICM(nn.Module):
    def __init__(self, channel_dim, state_dim, action_dim):
        super(cnnICM, self).__init__()
        self.state_dim = state_dim
        self.channel_dim = channel_dim
        self.action_dim = action_dim
        # Encoder: CNN feature extractor + linear head -> 512-d feature phi(s)
        self.cnn_encoder_feature = nn.Sequential(
            nn.Conv2d(channel_dim, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )
        cnn_out_dim = self._get_cnn_out_dim()
        self.cnn_encoder_header = nn.Sequential(
            nn.Linear(cnn_out_dim, 512),
            nn.ReLU()
        )
        # Discrete actions
        self.action_emb = nn.Embedding(self.action_dim, self.action_dim)
        # Forward model: (phi(s_t), a_t) -> hat{phi}(s_{t+1})
        self.forward_model = nn.Sequential(
            nn.Linear(512 + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
        )
        # Inverse model: (phi(s_t), phi(s_{t+1})) -> action logits
        # (no Softmax here: nn.CrossEntropyLoss below expects raw logits)
        self.inverse_model = nn.Sequential(
            nn.Linear(512 + 512, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    @torch.no_grad()
    def _get_cnn_out_dim(self):
        pic = torch.randn((1, self.channel_dim, self.state_dim, self.state_dim))
        return self.cnn_encoder_feature(pic).shape[1]

    def encode_pred(self, state):
        return self.cnn_encoder_header(self.cnn_encoder_feature(state))

    def forward_pred(self, phi_s, action):
        return self.forward_model(torch.concat([phi_s, self.action_emb(action)], dim=1))

    def inverse_pred(self, phi_s, phi_s_next):
        return self.inverse_model(torch.concat([phi_s, phi_s_next], dim=1))

    def forward(self, state, n_state, action, mask):
        # Discrete action indices
        action = action.type(torch.LongTensor).reshape(-1).to(state.device)
        # Encode both states
        phi_s = self.encode_pred(state)
        phi_s_next = self.encode_pred(n_state)
        # Forward model: inputs detached, so its error does not train the Encoder
        hat_phi_s_next = self.forward_pred(phi_s.detach(), action)
        # Intrinsic reward & forward loss
        r_i = 0.5 * nn.MSELoss(reduction='none')(hat_phi_s_next, phi_s_next.detach())
        r_i = r_i.mean(dim=1) * mask
        forward_loss = r_i.mean()
        # Inverse model: phi(s_{t+1}) is not detached, so this loss also trains the Encoder
        hat_a = self.inverse_pred(phi_s.detach(), phi_s_next)
        # Inverse loss
        inv_loss = (nn.CrossEntropyLoss(reduction='none')(hat_a, action) * mask).mean()
        return r_i, inv_loss, forward_loss
```
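A hedged usage sketch for the module above (the 84x84 single-channel input, batch size, and action count are illustrative assumptions, not requirements of the original code):

```python
# Toy batch of 4 transitions: 84x84 grayscale frames, 6 discrete actions.
icm = cnnICM(channel_dim=1, state_dim=84, action_dim=6)

state   = torch.randn(4, 1, 84, 84)
n_state = torch.randn(4, 1, 84, 84)
action  = torch.randint(0, 6, (4,))
mask    = torch.ones(4)              # 0 for terminal transitions

r_i, inv_loss, forward_loss = icm(state, n_state, action, mask)
# r_i would be added (detached) to the extrinsic reward for the policy;
# (1 - beta) * inv_loss + beta * forward_loss gives the ICM training loss.
print(r_i.shape, inv_loss.item(), forward_loss.item())
```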