A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0
At the root of all the sophisticated actor-critic algorithms that are designed and applied these days is the vanilla policy gradient algorithm, which essentially is an actor-only algorithm. Nowadays, the actor that learns the decision-making policy is often represented by a neural network. In continuous control problems, this network outputs the relevant distribution parameters to sample appropriate actions.
With so many deep reinforcement learning algorithms in circulation, you’d expect it to be easy to find abundant plug-and-play TensorFlow implementations for a basic actor network in continuous control, but this is hardly the case. Various reasons may exist for this. First, TensorFlow 2.0 was released only in September 2019, differing quite substantially from its predecessor. Second, most implementations focus on discrete action spaces rather than continuous ones. Third, there are many different implementations in circulation, yet some are tailored such that they only work in specific problem settings. It can be a tad frustrating to plow through several hundred lines of code riddled with placeholders and class members, only to find out the approach is not suitable to your problem after all. This article — based on our ResearchGate note [1] — provides a minimal working example that functions in TensorFlow 2.0. We will show that the real magic happens in only three lines of code!
Some mathematical background
In this article, we present a simple and generic implementation for an actor network in the context of the vanilla policy gradient algorithm REINFORCE [2]. In the continuous variant, we usually draw actions from a Gaussian distribution; the goal is to learn an appropriate mean μ and a standard deviation σ. The actor network learns and outputs these parameters.
Let’s formalize this actor network a bit more. Here, the input is the state s or a feature array φ(s), followed by one or more hidden layers that transform the input, with the output being μ and σ. Once we obtain this output, an action a is randomly drawn from the corresponding Gaussian distribution. Thus, we have a = μ(s) + σ(s)ξ, where ξ ~ 𝒩(0,1).
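As a small illustration of this sampling step, the sketch below assumes an actor_network and a state tensor are already defined (both are placeholders here) and draws one action via a = μ(s) + σ(s)ξ:

"""Sampling an action from the Gaussian policy (illustrative sketch)"""
import tensorflow as tf

# actor_network and state are assumed to exist; the network outputs mu(s) and sigma(s)
mu, sigma = actor_network(state)
xi = tf.random.normal(shape=mu.shape)     # xi ~ N(0, 1)
action = mu + sigma * xi                  # a = mu(s) + sigma(s) * xi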
After taking our action a, we observe a corresponding reward signal v. Together with some learning rate α, we may update the weights in a direction that improves the expected reward of our policy. The corresponding update rule [2] — based on gradient ascent — is given by:
θ_μ ← θ_μ + α · v · (a − μ_θ(s)) / σ_θ(s)² · ∇_θ μ_θ(s)

θ_σ ← θ_σ + α · v · ((a − μ_θ(s))² − σ_θ(s)²) / σ_θ(s)³ · ∇_θ σ_θ(s)
If we use a linear approximation scheme μ_θ(s) = θ^⊤ φ(s), we may directly apply these update rules on each feature weight. For neural networks, however, it is not as straightforward how we should perform this update.
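For the linear case, the update can be written out directly. The sketch below is a plain NumPy illustration of one such update; the features, learning rate, and reward are placeholder values, and it is separate from the TensorFlow implementation that follows:

"""One gradient-ascent update for a linear Gaussian policy (illustrative sketch)"""
import numpy as np

alpha = 0.01                                  # learning rate (assumed value)
phi = np.array([1.0, 0.5])                    # feature array phi(s) (assumed values)
theta_mu = np.array([0.0, 0.0])               # weights for the mean
theta_sigma = np.array([1.0, 0.0])            # weights for the standard deviation (sigma starts at 1)

mu = theta_mu @ phi                           # mu(s) = theta_mu^T phi(s)
sigma = theta_sigma @ phi                     # sigma(s) = theta_sigma^T phi(s), assumed positive
a = np.random.normal(mu, sigma)               # sample action a = mu(s) + sigma(s) * xi
v = 1.0                                       # observed reward (placeholder value)

# Apply the update rules above; for the linear scheme the gradients equal phi(s)
theta_mu = theta_mu + alpha * v * (a - mu) / sigma**2 * phi
theta_sigma = theta_sigma + alpha * v * ((a - mu)**2 - sigma**2) / sigma**3 * phi
# In practice sigma must be kept positive, e.g., by clipping or a different parameterization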
Neural networks are trained by minimizing a loss function. We often compute the loss as the mean-squared error (squaring the difference between the predicted and observed values). For instance, in a critic network the loss could be defined as (r? + Q??? − Q?)², with Q? being the predicted value and r? + Q??? the observed value. After computing the loss, we backpropagate it through the network, computing the partial losses and gradients required to update the network weights.
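As a small worked example (the numbers are made up), the squared error for such a critic would be computed as:

"""Squared TD error for a critic (made-up numbers)"""
r_t, q_t, q_next = 1.0, 2.5, 3.0               # reward r?, prediction Q?, prediction Q???
critic_loss = (r_t + q_next - q_t) ** 2        # (1.0 + 3.0 - 2.5)^2 = 2.25, for a single sample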
At first glance, the update equations have little in common with such a loss function. We simply try to improve our policy by moving in a certain direction, but do not have an explicit ‘target’ or ‘true value’ in mind. Indeed, we will need to define a ‘pseudo loss function’ that helps us update the network [3]. The link between the traditional update rule and this loss function becomes clearer when expressing the update rule in its generic form:
θ ← θ + α · v · ∇_θ log π_θ(a|s)
Transformation into a loss function is fairly straightforward. As the loss is only the input for the backpropagation procedure, we first drop the learning rate α and the gradient ∇_θ. Furthermore, neural networks are updated using gradient descent instead of gradient ascent, so we must add a minus sign. These steps yield the following loss function:
L(θ) = −v · log π_θ(a|s)
Quite similar to the update rule, right? To provide some intuition: recall that the log transformation yields a negative number for all values smaller than 1. If we have an action with a low probability and a high reward, we’d want to observe a large loss, i.e., a strong signal to update our policy in the direction of that high reward. The loss function does precisely that.
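As a quick numerical check of this intuition (with made-up numbers), compare the loss for a low-probability and a high-probability action that receive the same reward:

"""Pseudo-loss for a low- versus high-probability action (made-up numbers)"""
import numpy as np

v = 10.0                                       # reward signal
print(-v * np.log(0.1))                        # low-probability action:  ~23.0 (strong update signal)
print(-v * np.log(0.9))                        # high-probability action: ~1.1  (weak update signal)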
To apply the update for a Gaussian policy, we can simply substitute π_θ with the Gaussian probability density function (pdf) — note that in the continuous domain we work with pdf values rather than actual probabilities — to obtain the so-called weighted Gaussian log likelihood loss function:
L(θ) = −v · log( 1 / (σ_θ(s)·√(2π)) · exp(−(a − μ_θ(s))² / (2σ_θ(s)²)) )
TensorFlow 2.0 implementation
Enough mathematics for now; it’s time for the implementation.
We just defined the loss function, but unfortunately we cannot directly apply it in TensorFlow 2.0. When training a neural network, you may be used to something like model.compile(loss='mse', optimizer=opt), followed by model.fit or model.train_on_batch, but this doesn’t work here. First of all, the Gaussian log likelihood loss function is not a default one in TensorFlow 2.0 — it is in the Theano library, for example [4] — meaning we have to create a custom loss function. More restrictive though: TensorFlow 2.0 requires a loss function to have exactly two arguments, y_true and y_predicted. As we just saw, we have three arguments due to multiplying with the reward. Let’s worry about that later though and first present our custom Gaussian loss function:
"""Weighted Gaussian log likelihood loss function for RL"""
def custom_loss_gaussian(state, action, reward):
    # Predict mu and sigma with actor network
    mu, sigma = actor_network(state)

    # Compute Gaussian pdf value
    pdf_value = tf.exp(-0.5 * ((action - mu) / (sigma)) ** 2) * \
                1 / (sigma * tf.sqrt(2 * np.pi))

    # Convert pdf value to log probability
    log_probability = tf.math.log(pdf_value + 1e-5)

    # Compute weighted loss
    loss_actor = - reward * log_probability

    return loss_actor
So we have the correct loss function now, but we cannot apply it!? Of course we can — otherwise all of this would have been fairly pointless — it’s just slightly different than you might be used to.
This is where the GradientTape functionality comes in, which is a novel addition to TensorFlow 2.0 [5]. It essentially records your forward steps on a ‘tape’ such that it can apply automatic differentiation. The updating approach consists of three steps [6]. First, in our custom loss function we make a forward pass through the actor network — which is memorized — and calculate the loss. Second, with the function .trainable_variables, we recall the weights found during our forward pass. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. Third, with optimizer.apply_gradients we update the network weights, where the optimizer is one of your choosing (e.g., SGD, Adam, RMSprop). In Python, the update steps look as follows:
"""Compute and apply gradients to update network weights"""
with tf.GradientTape() as tape:
    # Compute Gaussian loss with custom loss function
    loss_value = custom_loss_gaussian(state, action, reward)

# Compute gradients for actor network
grads = tape.gradient(loss_value, actor_network.trainable_variables)

# Apply gradients to update network weights
optimizer.apply_gradients(zip(grads, actor_network.trainable_variables))
So in the end, we only need a few lines of code to perform the update!
Numerical example
We present a minimal working example for a continuous control problem; the full code can be found on my GitHub. We consider an extremely simple problem, namely a one-shot game with only one state and a trivial optimal policy. The closer we are to the (fixed but unknown) target, the higher our reward. The reward function is formally denoted as R = ζβ / max(ζ, |τ − a|), with β as the maximum reward, τ as the target and ζ as the target range.
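A direct Python translation of this reward function might look as follows; the default parameter values are placeholders, not necessarily the ones used in the experiment:

"""Reward function R = zeta * beta / max(zeta, |tau - a|) (placeholder parameter values)"""
def compute_reward(action, tau=5.0, beta=10.0, zeta=0.1):
    # tau: target, beta: maximum reward, zeta: target range (assumed values); assumes a scalar action
    return zeta * beta / max(zeta, abs(tau - action))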
To represent the actor we define a dense neural network (using Keras) that takes the fixed state (a tensor with value 1) as input, performs transformations in two hidden layers with ReLUs as activation functions (five per layer) and returns μ and σ as output. We initialize bias weights such that we start with μ=0 and σ=1. For our optimizer, we use Adam with its default learning rate of 0.001.
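The sketch below shows one way to set this up with Keras and to combine it with the loss function and gradient steps from the previous section in a simple training loop. The architecture follows the description above (two hidden layers of five ReLU units, outputs μ and σ, Adam with learning rate 0.001), but the softplus activation on σ, the zero kernel and bias initializations, the number of episodes, and the use of compute_reward from the earlier sketch are assumptions of this sketch rather than the exact GitHub implementation:

"""Gaussian actor network and training loop (sketch; details may differ from the GitHub code)"""
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Actor network: fixed state in, mu and sigma out
inputs = layers.Input(shape=(1,))
hidden = layers.Dense(5, activation="relu")(inputs)
hidden = layers.Dense(5, activation="relu")(hidden)
mu = layers.Dense(1, activation="linear",
                  kernel_initializer="zeros")(hidden)        # start with mu = 0
sigma = layers.Dense(1, activation="softplus",
                     kernel_initializer="zeros",
                     bias_initializer=tf.keras.initializers.Constant(0.5413))(hidden)  # softplus(0.5413) ~ 1
actor_network = tf.keras.Model(inputs=inputs, outputs=[mu, sigma])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
state = tf.constant([[1.0]])                                 # the single fixed state

for episode in range(1000):                                  # number of episodes is an assumption
    # Draw action from the current Gaussian policy
    mu_value, sigma_value = actor_network(state)
    action = float(np.random.normal(float(mu_value[0, 0]), float(sigma_value[0, 0])))

    # Observe the reward (compute_reward as sketched earlier; replace with your own reward signal)
    reward = compute_reward(action)

    # Update the actor network with the custom loss and GradientTape
    with tf.GradientTape() as tape:
        loss_value = custom_loss_gaussian(state, action, reward)
    grads = tape.gradient(loss_value, actor_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor_network.trainable_variables))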
Some sample runs are shown in the figure below. Note that the convergence pattern is in line with our expectations. At first the losses are relatively high, causing μ to move in the direction of higher rewards and σ to increase to allow for more exploration. Once the target is hit, the observed losses decrease, resulting in μ stabilizing and σ dropping to nearly 0.
[Figure: sample runs showing μ converging to the target and σ dropping to nearly 0 as training progresses]
Key points
- The policy gradient method does not work with traditional loss functions; we must define a pseudo-loss to update actor networks. For continuous control, the pseudo-loss function is simply the negative log of the pdf value multiplied with the reward signal.
- Several TensorFlow 2.0 update functions only accept custom loss functions with exactly two arguments. The GradientTape functionality does not have this restriction.
- Actor networks are updated using three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables and (iii) apply the gradients to update the weights of the actor network.
This article is partially based on my ResearchGate paper: ‘Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0’, available at https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20
The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_continuous_control
[1] Van Heeswijk, W.J.A. (2020) Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0. https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20
[2] Williams, R. J. (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229-256.
[3] Levine, S. (2019) CS 285 at UC Berkeley Deep Reinforcement Learning: Policy Gradients. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf
[4] Theanets 0.7.3 documentation. Gaussian Log Likelihood Function. https://theanets.readthedocs.io/en/stable/api/generated/theanets.losses.GaussianLogLikelihood.html#theanets.losses.GaussianLogLikelihood
[5] Rosebrock, A. (2020) Using TensorFlow and GradientTape to train a Keras model. https://www.tensorflow.org/api_docs/python/tf/GradientTape
[6] Nandan, A. (2020) Actor Critic Method. https://keras.io/examples/rl/actor_critic_cartpole/
Original article: https://towardsdatascience.com/a-minimal-working-example-for-continuous-policy-gradients-in-tensorflow-2-0-d3413ec38c6b