A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
Training discrete actor networks with TensorFlow 2.0 is easy once you know how to do it, but also rather different from implementations in TensorFlow 1.0. As the 2.0 version was only released in September 2019, most examples that circulate on the web are still designed for TensorFlow 1.0. In a related article — in which we also discuss the mathematics in more detail — we already treated the continuous case. Here, we use a simple multi-armed bandit problem to show how we can implement and update an actor network in the discrete setting [1].
A bit of mathematics
We use the classical policy gradient algorithm REINFORCE, in which the actor is represented by a neural network known as the actor network. In the discrete case, the network output is simply the probability of selecting each of the actions. So, if the set of actions is defined by A and an action by a ∈ A, then the network outputs are the probabilities p(a), ∀a ∈ A. The input layer contains the state s or a feature array φ(s), followed by one or more hidden layers that transform the input, with the output being the probabilities for each action that might be selected.
The policy π is parameterized by θ, which in deep reinforcement learning represents the neural network weights. After each action we take, we observe a reward v. Computing the gradients for θ and using learning rate α, the update rule typically encountered in textbooks looks as follows [2,3]:
θ ← θ + α · v · ∇_θ log π_θ(a|s)
When applying backpropagation updates to neural networks we must slightly modify this update rule, but the procedure follows the same lines. Although we might update the network weights manually, we typically prefer to let TensorFlow (or whatever library you use) handle the update. We only need to provide a loss function; the computer handles the calculation of gradients and other fancy tricks such as customized learning rates. In fact, the sole thing we have to do is add a minus sign, as we perform gradient descent rather than ascent. Thus, the loss function — which is known as the log loss function or cross-entropy loss function [4] — looks like this:
L(θ) = -v · log π_θ(a|s)
TensorFlow 2.0 implementation
Now let’s move on to the actual implementation. If you have some experience with TensorFlow, you likely first compile your network with model.compile and then perform model.fit or model.train_on_batch to fit the network to your data. As TensorFlow 2.0 requires a loss function to have exactly two arguments (y_true and y_predicted), we cannot use these methods, since we need the action, state and reward as input arguments. The GradientTape functionality — which did not exist in TensorFlow 1.0 [5] — conveniently solves this problem. After storing a forward pass through the actor network on a 'tape', it is able to perform automatic differentiation in a backward pass later on.
We start by defining our cross entropy loss function:
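A minimal sketch of such a loss function, with illustrative argument names rather than the exact code from the accompanying repository, might look as follows:

```python
import tensorflow as tf

def cross_entropy_loss(action_probabilities, action, reward):
    # Probability the network assigned to the action that was actually played
    probability_action = action_probabilities[0, action]
    # Negative log probability weighted by the observed reward (the pseudo-loss)
    log_probability = tf.math.log(probability_action + 1e-8)  # small constant avoids log(0)
    return -reward * log_probability
```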
In the next step, we use the function .trainable_variables to retrieve the network weights. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. With optimizer.apply_gradients we update the network weights using a selected optimizer. As mentioned earlier, it is crucial that the forward pass (in which we obtain the action probabilities from the network) is included in the GradientTape. The code to update the weights is as follows:
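A sketch of this update step, assuming the cross_entropy_loss above, a Keras model named actor_network, and a tf.keras Adam optimizer, could look like this:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

def update_actor(actor_network, state, action, reward):
    with tf.GradientTape() as tape:
        # The forward pass must be recorded on the tape
        action_probabilities = actor_network(state)
        loss = cross_entropy_loss(action_probabilities, action, reward)
    # Backward pass: gradients of the loss w.r.t. the trainable network weights
    gradients = tape.gradient(loss, actor_network.trainable_variables)
    # Apply the gradients with the chosen optimizer to update the weights
    optimizer.apply_gradients(zip(gradients, actor_network.trainable_variables))
```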
Multi-armed bandit
In the multi-armed bandit problem, we are able to play several slot machines with unique pay-off properties [6]. Each machine i has a mean payoff μ_i and a standard deviation σ_i, which are unknown to the player. At every decision moment you play one of the machines and observe the reward. After sufficient iterations and exploration, you should be able to fairly accurately estimate the mean reward of each machine. Naturally, the optimal policy is to always play the slot machine with the highest expected payoff.
Using Keras, we define a dense actor network. It takes a fixed state (a tensor with value 1) as input. We have two hidden layers that use five ReLUs per layer as activation functions. The network outputs the probabilities of playing each slot machine. The bias weights are initialized in such a way that each machine has equal probability at the beginning. Finally, the chosen optimizer is Adam with its default learning rate of 0.001.
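A sketch of such a network is shown below; it assumes four machines and zero-initializes the output layer as one way to obtain equal starting probabilities:

```python
import tensorflow as tf
from tensorflow.keras import layers, initializers

num_machines = 4  # number of slot machines (actions); illustrative

# Dense actor network: a fixed dummy state as input, action probabilities as output
actor_network = tf.keras.Sequential([
    layers.Dense(5, activation="relu", input_shape=(1,)),
    layers.Dense(5, activation="relu"),
    layers.Dense(num_machines, activation="softmax",
                 kernel_initializer=initializers.Zeros(),
                 bias_initializer=initializers.Zeros()),  # equal logits give uniform initial probabilities
])
```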
We test four settings with differing mean payoffs. For simplicity we set all standard deviations equal. The figures below show the learned probabilities for each slot machine, testing with four machines. As expected, the policy learns to play the machine(s) with the highest expected payoff. Some exploration naturally persists, especially when payoffs are close together. A bit of fine-tuning and you surely will do a lot better during your next Vegas trip.
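As a rough illustration of how such an experiment could be run, the sketch below reuses the hypothetical actor_network, update_actor and num_machines from the snippets above; the payoff values are made up for illustration:

```python
import numpy as np
import tensorflow as tf

# Illustrative payoff setting: four machines with different means, equal standard deviation
mean_payoffs = [1.0, 1.5, 2.0, 2.5]
standard_deviation = 1.0

state = tf.constant([[1.0]])  # fixed dummy state fed to the actor network

for episode in range(10000):
    # Sample a machine according to the current policy
    probabilities = actor_network(state).numpy()[0]
    probabilities = probabilities / probabilities.sum()  # guard against float32 rounding
    action = np.random.choice(num_machines, p=probabilities)
    # Observe a stochastic reward from the chosen machine
    reward = np.random.normal(mean_payoffs[action], standard_deviation)
    # Update the actor network with the observed action and reward
    update_actor(actor_network, state, action, reward)
```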
[Figures: learned action probabilities for each slot machine under the four payoff settings]
Key points
- We define a pseudo-loss to update actor networks. For discrete control, the pseudo-loss function is simply the negative log probability multiplied by the reward signal, also known as the log loss or cross-entropy loss function.
- Common TensorFlow 2.0 functions only accept loss functions with exactly two arguments. The GradientTape does not have this restriction.
- Actor networks are updated in three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables, and (iii) apply the gradients to update the weights of the actor network.
This article is partially based on my method paper: ‘Implementing Actor Networks for Discrete Control in TensorFlow 2.0’ [1]
The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_discrete_control
Originally published at: https://towardsdatascience.com/a-minimal-working-example-for-discrete-policy-gradients-in-tensorflow-2-0-d6a0d6b1a6d7