Paper link: ICM: Curiosity-driven Exploration by Self-supervised Prediction
GitHub link: official implementation: noreward-rl
1- Main Contributions
- Defines and models curiosity
  - Definition of curiosity: the prediction error on the next state is used as that state's novelty
    - If the agent truly "understands" a state, then given that state and the action taken, it should be able to predict the next state accurately. In other words: "What I cannot predict well is novel."
- Proposes the Intrinsic Curiosity Module (ICM) to estimate how novel a state is and to assign a corresponding intrinsic reward
- Studies three settings to verify what curiosity contributes in practice:
  - Sparse extrinsic reward: curiosity lets the agent reach the goal with far fewer interactions with the environment;
  - Exploration with no extrinsic reward at all: curiosity drives the agent to explore more effectively;
  - Generalization to unseen scenarios (e.g. new levels of the same game), where knowledge gained from earlier experience helps the agent explore new places faster than starting from scratch.
2- Intrinsic Curiosity Module (ICM) Framework
2.1 The three components of ICM:
- Encoder ($\theta_E$):
  - Maps the current state to a feature $\phi(s_t)$
  - Maps the next state to a feature $\phi(s_{t+1})$
- Forward Model ($\theta_F$): given $\phi(s_t)$ and the action $a_t$, predicts the next state feature $\hat{\phi}(s_{t+1})$
  - Trained with $\min_{\theta_F} L(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))$
  - Intrinsic reward: $r^i_t = \frac{\eta}{2} \|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2_2$ (see the sketch after this list)
- Inverse Model ($\theta_I$): given $\phi(s_t)$ and $\phi(s_{t+1})$, predicts the action $\hat{a}_t$
  - Trained with $\min_{\theta_I, \theta_E} L(a_t, \hat{a}_t)$
  - Constrains the Encoder's representation to the part of the environment that the agent can influence (or that influences the agent)
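A minimal sketch of how the two heads above fit together, using toy feature vectors in place of real Encoder outputs (the sizes `phi_dim`, `n_actions` and the scale `eta` are illustrative assumptions, not values from the official code):

```python
import torch
from torch import nn

phi_dim, n_actions, eta = 512, 4, 1.0   # illustrative sizes; eta scales the intrinsic reward

forward_model = nn.Sequential(nn.Linear(phi_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, phi_dim))
inverse_model = nn.Sequential(nn.Linear(2 * phi_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

phi_s, phi_s_next = torch.randn(1, phi_dim), torch.randn(1, phi_dim)  # stand-ins for phi(s_t), phi(s_{t+1})
a = torch.tensor([2])                                                 # discrete action a_t
a_onehot = nn.functional.one_hot(a, n_actions).float()

# Forward model: predict phi(s_{t+1}); the prediction error is the curiosity reward r^i_t.
phi_s_next_hat = forward_model(torch.cat([phi_s, a_onehot], dim=1))
r_i = 0.5 * eta * (phi_s_next_hat - phi_s_next).pow(2).sum(dim=1)

# Inverse model: recover a_t from the two features; its cross-entropy loss is what trains the Encoder.
a_logits = inverse_model(torch.cat([phi_s, phi_s_next], dim=1))
inv_loss = nn.functional.cross_entropy(a_logits, a)
print(r_i.item(), inv_loss.item())
```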
2.2 What each component does:
- The Encoder simply encodes the state into a feature vector
- The Forward Model takes the encoded state and the action and predicts the next state's feature; the gap between the true next state feature and the prediction is the curiosity intrinsic reward
  - For rarely seen state-action combinations the prediction is inaccurate, so the curiosity reward is high
- The Inverse Model constrains the Encoder's representation to the part of the environment that the agent can influence (or that influences it)
  - Because Encoder + Forward Model alone run into the Noisy-TV problem (the "TV static" problem), illustrated in the sketch after this list:
    - When the screen is pure noise, the observer cannot predict the next state
    - The predicted and true next state features then stay far apart, so the intrinsic reward stays very large and the observer gets addicted to staring at the noisy TV. For example, if muzzle flashes are rendered randomly, the agent may just stand in place firing forever, admiring the sparks.
  - So the Inverse Model infers the action $\hat{a}_t$ the agent chose from two consecutive states, and the inverse prediction error $L(a_t, \hat{a}_t)$ is used to train the Encoder.
    - Eventually the Encoder no longer cares about the difference between two noise frames, so noise stops generating novelty reward
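A small illustrative experiment (not from the paper or its code) of why a raw prediction-error reward gets stuck on noise: a forward model can drive its error toward zero on a learnable transition, but on a "noisy TV" the error, and hence the reward, stays high no matter how long it trains:

```python
import torch
from torch import nn

torch.manual_seed(0)
phi_dim = 8

def final_prediction_error(next_fn, steps=500):
    # Train a fresh forward model against the given transition and report its final error.
    model = nn.Linear(phi_dim, phi_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        phi_s = torch.randn(64, phi_dim)
        loss = nn.functional.mse_loss(model(phi_s), next_fn(phi_s))
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

predictable = lambda phi: phi.roll(1, dims=1)      # deterministic, learnable transition
noisy_tv    = lambda phi: torch.randn_like(phi)    # pure noise: unpredictable by design

print("learnable transition error:", final_prediction_error(predictable))  # approaches 0
print("noisy-TV transition error: ", final_prediction_error(noisy_tv))     # stays around 1
```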
Illustration of the backpropagation updates
- The Forward Model's prediction error is used only to train the Forward Model, not the Encoder
  - i.e. $\hat{\phi}(s_{t+1}) = f_F(\phi(s_t), a_t; \theta_F)$, trained with $\min_{\theta_F} L(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))$
  - Both $\phi(s_t)$ and $\phi(s_{t+1})$ are detached
- The Inverse Model's inverse prediction error is used to train both the Inverse Model and the Encoder
  - i.e. $\min_{\theta_I, \theta_E} L(a_t, \hat{a}_t)$
  - $\hat{a}_t = f_I(\phi(s_t), \phi(s_{t+1}); \theta_I) = f_I(\phi(s_t), f_E(s_{t+1}; \theta_E); \theta_I)$
  - Only $\phi(s_t)$ is detached, so the Encoder receives gradients through $\phi(s_{t+1})$; how these losses combine with the policy objective is sketched below
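In the paper all the pieces are trained jointly under $\min_{\theta_P,\theta_I,\theta_F}\big[-\lambda\,\mathbb{E}_\pi[\sum_t r_t] + (1-\beta)L_I + \beta L_F\big]$. A minimal sketch of that weighting, with hypothetical placeholder tensors standing in for the policy-gradient loss and the ICM outputs:

```python
import torch

# Hypothetical placeholders for quantities computed elsewhere: the policy-gradient loss
# from the RL algorithm (A3C in the paper) and the ICM outputs r_i / inv_loss / forward_loss.
policy_loss  = torch.tensor(0.30, requires_grad=True)
inv_loss     = torch.tensor(1.20, requires_grad=True)   # L_I (inverse model)
forward_loss = torch.tensor(0.80, requires_grad=True)   # L_F (forward model)
r_i          = torch.tensor([0.05, 0.40])               # per-step intrinsic rewards
extrinsic_r  = torch.tensor([0.00, 1.00])               # per-step environment rewards

beta, lambda_ = 0.2, 0.1          # weights reported in the paper
reward = extrinsic_r + r_i        # reward actually fed to the policy (r_i kept out of the graph)
loss = lambda_ * policy_loss + (1 - beta) * inv_loss + beta * forward_loss
loss.backward()                   # in the real setup this single backward pass updates
                                  # the policy, Encoder, Forward Model and Inverse Model
```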
3- Python code
ICM code
```python
from torch import nn
import torch
import torch.nn.functional as F


class cnnICM(nn.Module):
    def __init__(self, channel_dim, state_dim, action_dim):
        super(cnnICM, self).__init__()
        self.state_dim = state_dim
        self.channel_dim = channel_dim
        self.action_dim = action_dim
        # Encoder: CNN feature extractor + linear head -> 512-d feature phi(s)
        self.cnn_encoder_feature = nn.Sequential(
            nn.Conv2d(channel_dim, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )
        cnn_out_dim = self._get_cnn_out_dim()
        self.cnn_encoder_header = nn.Sequential(
            nn.Linear(cnn_out_dim, 512),
            nn.ReLU()
        )
        # Discrete actions
        self.action_emb = nn.Embedding(self.action_dim, self.action_dim)
        # Forward model: (phi(s_t), a_t) -> hat{phi}(s_{t+1})
        self.forward_model = nn.Sequential(
            nn.Linear(512 + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
        )
        # Inverse model: (phi(s_t), phi(s_{t+1})) -> action logits
        # (no Softmax here: nn.CrossEntropyLoss below expects raw logits)
        self.inverse_model = nn.Sequential(
            nn.Linear(512 + 512, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    @torch.no_grad()
    def _get_cnn_out_dim(self):
        pic = torch.randn((1, self.channel_dim, self.state_dim, self.state_dim))
        return self.cnn_encoder_feature(pic).shape[1]

    def encode_pred(self, state):
        return self.cnn_encoder_header(self.cnn_encoder_feature(state))

    def forward_pred(self, phi_s, action):
        return self.forward_model(torch.concat([phi_s, self.action_emb(action)], dim=1))

    def inverse_pred(self, phi_s, phi_s_next):
        return self.inverse_model(torch.concat([phi_s, phi_s_next], dim=1))

    def forward(self, state, n_state, action, mask):
        # Discrete action indices
        action = action.type(torch.LongTensor).reshape(-1).to(state.device)
        # Encode both states
        phi_s = self.encode_pred(state)
        phi_s_next = self.encode_pred(n_state)
        # Forward model: inputs detached, so its error does not train the Encoder
        hat_phi_s_next = self.forward_pred(phi_s.detach(), action)
        # Intrinsic reward & forward loss
        r_i = 0.5 * nn.MSELoss(reduction='none')(hat_phi_s_next, phi_s_next.detach())
        r_i = r_i.mean(dim=1) * mask
        forward_loss = r_i.mean()
        # Inverse model: phi(s_{t+1}) is not detached, so this loss also trains the Encoder
        hat_a = self.inverse_pred(phi_s.detach(), phi_s_next)
        # Inverse loss
        inv_loss = (nn.CrossEntropyLoss(reduction='none')(hat_a, action) * mask).mean()
        return r_i, inv_loss, forward_loss
```
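A hedged usage sketch for the module above (the 84x84 single-channel input, batch size, and action count are illustrative assumptions, not requirements of the original code):

```python
# Toy batch of 4 transitions: 84x84 grayscale frames, 6 discrete actions.
icm = cnnICM(channel_dim=1, state_dim=84, action_dim=6)

state   = torch.randn(4, 1, 84, 84)
n_state = torch.randn(4, 1, 84, 84)
action  = torch.randint(0, 6, (4,))
mask    = torch.ones(4)              # 0 for terminal transitions

r_i, inv_loss, forward_loss = icm(state, n_state, action, mask)
# r_i would be added (detached) to the extrinsic reward for the policy;
# (1 - beta) * inv_loss + beta * forward_loss gives the ICM training loss.
print(r_i.shape, inv_loss.item(), forward_loss.item())
```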