An introduction to Policy Gradients with Cartpole and Doom

by Thomas Simonini

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In the last two articles about Q-learning and Deep Q learning, we worked with value-based reinforcement learning algorithms. To choose which action to take given a state, we take the action with the highest Q-value (maximum expected future reward I will get at each state). As a consequence, in value-based learning, a policy exists only because of these action-value estimates.

Today, we’ll learn a policy-based reinforcement learning technique called Policy Gradients.

We’ll implement two agents. The first will learn to keep the bar in balance.

The second will be an agent that learns to survive in a hostile Doom environment by collecting health.

In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the policy function that maps states to actions (selecting actions without using a value function).

It means that we directly try to optimize our policy function π without worrying about a value function. We’ll directly parameterize π (select an action without a value function).

Sure, we can use a value function to optimize the policy parameters. But the value function will not be used to select an action.

In this article you’ll learn:

What Policy Gradient is, and its advantages and disadvantages
How to implement it in Tensorflow.

Why use Policy-Based methods?

Two types of policy

A policy can be either deterministic or stochastic.

A deterministic policy is a policy that maps states to actions. You give it a state and the function returns an action to take.

Deterministic policies are used in deterministic environments. These are environments where the actions taken determine the outcome. There is no uncertainty. For instance, when you play chess and you move your pawn from A2 to A3, you’re sure that your pawn will move to A3.

On the other hand, a stochastic policy outputs a probability distribution over actions.

This means that instead of always taking one action a (for instance, left), there is a probability that we’ll take a different one (in this case, a 30% chance that we take south).
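
To make the distinction concrete, here is a minimal Python sketch of the two kinds of policy. The action set, the toy rule inside `deterministic_policy`, and the 70/10/10/10 split are assumptions made up for illustration; they are not taken from the course notebooks.

```python
import numpy as np

# Hypothetical action set, chosen only for this illustration.
ACTIONS = ["left", "right", "north", "south"]

def deterministic_policy(state):
    """Map a state index to exactly one action (toy rule: state modulo 4)."""
    return ACTIONS[state % len(ACTIONS)]

def stochastic_policy(state):
    """Return a probability distribution over ACTIONS for the given state."""
    # Toy distribution: 70% on the action the deterministic rule would pick,
    # 10% on each of the other three actions.
    probs = np.full(len(ACTIONS), 0.1)
    probs[state % len(ACTIONS)] = 0.7
    return probs

rng = np.random.default_rng(0)
state = 0
probs = stochastic_policy(state)
sampled = rng.choice(ACTIONS, p=probs)  # the action is drawn, not fixed

print(deterministic_policy(state))  # always "left" for state 0
print(probs, sampled)
```

Calling `deterministic_policy` twice with the same state always gives the same action, while repeated draws from `stochastic_policy` can differ.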

A stochastic policy is used when the environment is uncertain. When the agent also cannot observe the full state, the process is called a Partially Observable Markov Decision Process (POMDP).

Most of the time we’ll use this second type of policy.

Advantages

But Deep Q Learning is really great! Why use policy-based reinforcement learning methods?

There are three main advantages in using Policy Gradients.

Convergence

For one, policy-based methods have better convergence properties.

The problem with value-based methods is that they can oscillate a lot during training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

On the other hand, with policy gradient, we just follow the gradient to find the best parameters. We see a smooth update of our policy at each step.

Because we follow the gradient to find the best parameters, policy gradient methods typically converge on a local maximum (worst case) or the global maximum (best case), provided the learning rate is small enough.
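
The contrast between an abrupt greedy choice and a smooth policy update can be sketched numerically. In the snippet below the two-element arrays are made-up numbers: they are read as Q-value estimates for the greedy rule and as action preferences (logits) for a softmax policy, which is an assumption for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Two almost identical sets of estimates: they differ by only 0.001.
values_before = np.array([1.000, 1.001])
values_after = np.array([1.001, 1.000])

# Greedy (value-based) selection flips completely.
greedy_before = int(np.argmax(values_before))
greedy_after = int(np.argmax(values_after))

# A softmax policy over the same numbers barely moves.
p_before = softmax(values_before)
p_after = softmax(values_after)
largest_shift = float(np.abs(p_after - p_before).max())

print(greedy_before, greedy_after)
print(p_before, p_after, largest_shift)
```

The greedy action switches from one action to the other, while the action probabilities under the softmax policy change by well under one percentage point.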

Policy gradients are more effective in high dimensional action spaces

The second advantage is that policy gradients are more effective in high dimensional action spaces, or when using continuous actions.

The problem with Deep Q-learning is that its predictions assign a score (maximum expected future reward) to each possible action, at each time step, given the current state.

But what if we have an infinite possibility of actions?

For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking…). We’d need to output a Q-value for each possible action!

On the other hand, in policy-based methods, you just adjust the policy parameters directly. Thanks to that, you gradually learn where the maximum is, rather than computing (estimating) the maximum over all actions at every step.
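
One common way to handle a continuous action such as a steering angle is a Gaussian policy: the parameters produce a mean, and the action is sampled around it. The sketch below assumes a linear mean `theta . state` and a fixed standard deviation `sigma`; the feature values and parameter values are hypothetical and only serve the illustration.

```python
import numpy as np

def gaussian_policy_sample(state, theta, sigma, rng):
    """Sample a continuous action from N(mean, sigma^2), with mean = theta . state."""
    mean = float(np.dot(theta, state))
    action = float(rng.normal(mean, sigma))
    return action, mean

def grad_log_prob(action, state, theta, sigma):
    """Gradient of log N(action; theta . state, sigma^2) with respect to theta."""
    mean = np.dot(theta, state)
    return (action - mean) / sigma**2 * state

rng = np.random.default_rng(0)
state = np.array([0.5, -1.0])   # hypothetical state features
theta = np.array([10.0, 2.0])   # hypothetical policy parameters
sigma = 5.0                     # spread of the steering angle, in degrees

angle, mean_angle = gaussian_policy_sample(state, theta, sigma, rng)
print(mean_angle, angle)
print(grad_log_prob(angle, state, theta, sigma))
```

No per-action output is needed here: the policy has two parameters regardless of how many distinct angles exist.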

Policy gradients can learn stochastic policies

A third advantage is that policy gradient can learn a stochastic policy, while value functions can’t. This has two consequences.

One of these is that we don’t need to implement an exploration/exploitation trade-off. A stochastic policy allows our agent to explore the state space without always taking the same action. This is because it outputs a probability distribution over actions. As a consequence, it handles the exploration/exploitation trade-off without hard coding it.

We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when we have two states that seem to be (or actually are) the same, but need different actions.

Let’s take an example. We have an intelligent vacuum cleaner, and its goal is to suck up the dust and avoid killing the hamsters.

Our vacuum cleaner can only perceive where the walls are.

The problem: the two red cases are aliased states, because the agent perceives an upper and a lower wall in each of them.

Under a deterministic policy, the agent will either always move right or always move left when in a red state. Either case can cause our agent to get stuck and never suck up the dust.

Under a value-based RL algorithm, we learn a quasi-deterministic policy (“epsilon greedy strategy”). As a consequence, our agent can spend a lot of time before finding the dust.

On the other hand, an optimal stochastic policy will randomly move left or right in the aliased states. As a consequence, it will not get stuck and will reach the goal state with high probability.
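
The vacuum cleaner situation can be simulated with a small sketch. It assumes a corridor of five cells numbered 0 to 4, with the dust reachable from the middle cell and cells 1 and 3 playing the role of the aliased states. The layout, the function names, and the step limit are all assumptions for illustration.

```python
import numpy as np

GOAL_CELL = 2           # assumed layout: the dust is reached from the middle cell
LEFT, RIGHT = -1, +1

def always_left(rng):
    """Deterministic rule for the aliased cells."""
    return LEFT

def random_move(rng):
    """Stochastic rule for the aliased cells: left or right with equal probability."""
    return int(rng.choice([LEFT, RIGHT]))

def steps_to_goal(aliased_rule, start, rng, max_steps=50):
    """Return the number of moves needed to reach GOAL_CELL, or None if stuck."""
    cell = start
    for t in range(max_steps):
        if cell == GOAL_CELL:
            return t
        if cell == 0:
            move = RIGHT                # left wall: the only sensible move
        elif cell == 4:
            move = LEFT                 # right wall: the only sensible move
        else:
            move = aliased_rule(rng)    # cells 1 and 3 look identical to the agent
        cell = cell + move
    return None

rng = np.random.default_rng(0)
print(steps_to_goal(always_left, 4, rng))   # 2: "left" happens to be right here
print(steps_to_goal(always_left, 0, rng))   # None: bounces between cells 0 and 1
print(steps_to_goal(random_move, 0, rng))
```

The same deterministic rule works from one side and fails from the other, because a single action has to serve both aliased cells. The random rule reaches the goal from either side.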

Disadvantages

Naturally, policy gradients have one big disadvantage. A lot of the time, they converge on a local maximum rather than on the global optimum.

Unlike Deep Q-Learning, which always tries to reach the maximum, policy gradients converge more slowly, step by step. They can take longer to train.

However, we’ll see there are solutions to this problem.

Policy Search

We have our policy π that has a parameter θ. This π outputs a probability distribution of actions.

Awesome! But how do we know if our policy is good?

Remember that finding a policy can be seen as an optimization problem. We must find the best parameters (θ) to maximize a score function, J(θ).

There are two steps:

Measure the quality of a π (policy) with a policy score function J(θ)
Use policy gradient ascent to find the best parameter θ that improves our π.

The main idea here is that J(θ) will tell us how good our π is. Policy gradient ascent will help us find the policy parameters that make good actions as likely as possible.

First Step: the Policy Score function J(θ)

To measure how good our policy is, we use a function called the objective function (or Policy Score Function) that calculates the expected reward of the policy.

Three methods work equally well for optimizing policies. The choice depends only on the environment and the objectives you have.

First, in an episodic environment, we can use the start value. Calculate the mean of the return from the first time step (G1). This is the cumulative discounted reward for the entire episode.

The idea is simple. If I always start in some state s1, what’s the total reward I’ll get from that start state until the end?

We want to find the policy that maximizes G1, because it will be the optimal policy. This is due to the reward hypothesis explained in the first article.

For instance, in Breakout, I play a new game, but I lose the ball after destroying 20 bricks (end of the game). New episodes always begin at the same state.

I calculate the score using J1(θ). Hitting 20 bricks is good, but I want to improve the score. To do that, I’ll need to improve the probability distributions of my actions by tuning the parameters. This happens in step 2.

Second, in a continuous environment, we can use the average value, because we can’t rely on a specific start state.

Each state value is now weighted by the probability of occurrence of the respective state (because some states occur more often than others).

Third, we can use the average reward per time step. The idea here is that we want to get the most reward per time step.
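
The three objectives described in this section can be written out as follows. This is a sketch in the standard textbook notation, which is an assumption here: γ is the discount factor, V(s) is the value of state s under the policy, d(s) is the fraction of time spent in state s (with N(s) the number of visits to s), and R_s^a is the immediate reward for taking action a in state s.

```latex
% Start value (episodic environments)
J_1(\theta) = \mathbb{E}_{\pi}\left[G_1\right]
            = \mathbb{E}_{\pi}\left[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots\right]
            = \mathbb{E}_{\pi}\left[V(s_1)\right]

% Average value (continuous environments)
J_{avgV}(\theta) = \sum_{s} d(s)\, V(s),
\qquad d(s) = \frac{N(s)}{\sum_{s'} N(s')}

% Average reward per time step
J_{avgR}(\theta) = \sum_{s} d(s) \sum_{a} \pi_\theta(s, a)\, R_s^a
```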

Second step: Policy gradient ascent

We have a Policy score function that tells us how good our policy is. Now, we want to find a parameter θ that maximizes this score function. Maximizing the score function means finding the optimal policy.

To maximize the score function J(θ), we need to do gradient ascent on policy parameters.

Gradient ascent is the inverse of gradient descent. Remember that the gradient always points in the direction of the steepest increase.

In gradient descent, we take the direction of the steepest decrease in the function. In gradient ascent we take the direction of the steepest increase of the function.

Why gradient ascent and not gradient descent? Because we use gradient descent when we have an error function that we want to minimize.

But, the score function is not an error function! It’s a score function, and because we want to maximize the score, we need gradient ascent.

The idea is to compute the gradient of the score with respect to the parameters of the current policy π, update the parameters in the direction of the greatest increase, and iterate.
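
Gradient ascent itself fits in a few lines. The sketch below uses a made-up one-dimensional score function with a known maximum at θ = 3 (an assumption for illustration, not a real policy score) and repeatedly steps in the direction of the gradient.

```python
def score(theta):
    """Toy concave score function with its maximum at theta = 3."""
    return -(theta - 3.0) ** 2 + 5.0

def score_gradient(theta):
    """Derivative of the toy score function with respect to theta."""
    return -2.0 * (theta - 3.0)

theta = 0.0    # initial parameter
alpha = 0.1    # learning rate

for _ in range(100):
    # Ascent: move WITH the gradient (descent would subtract it).
    theta = theta + alpha * score_gradient(theta)

print(theta, score(theta))
```

The only difference from gradient descent is the sign of the update: we add the gradient because we want the score to go up.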

Okay, now let’s implement that mathematically. This part is a bit hard, but it’s fundamental to understand how we arrive at our gradient formula.

We want to find the best parameters θ* that maximize the score:

Our score function can be defined as:

This is the total expected reward under the given policy.
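
Written out, the objective and the score function described above could take the following form. The notation is an assumption chosen to match the symbols used later in this article: τ is a trajectory (s0, a0, r0, …) generated by the policy, and R(τ) is the total reward collected along it.

```latex
\theta^{*} = \arg\max_{\theta} J(\theta),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi(\cdot\,;\theta)}\left[R(\tau)\right],
\qquad
R(\tau) = \sum_{t} r_t
```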

Now, because we want to do gradient ascent, we need to differentiate our score function J(θ).

Our score function J(θ) can also be defined as:
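
One form consistent with the discussion that follows is shown below. It is a sketch under assumed notation: d(s) is the distribution of states visited under the policy, π_θ(s, a) is the probability of taking action a in state s, and R_s^a is the immediate reward.

```latex
J(\theta) = \sum_{s} d(s) \sum_{a} \pi_\theta(s, a)\, R_s^a
```

Both factors depend on θ: the action probabilities π_θ(s, a) directly, and the state distribution d(s) indirectly through the environment.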

We wrote the function in this way to show the problem we face here.

We know that policy parameters change how actions are chosen, and as a consequence, what rewards we get and which states we will see and how often.

So, it can be challenging to find the changes of policy in a way that ensures improvement. This is because the performance depends on action selections and the distribution of states in which those selections are made.

Both of these are affected by policy parameters. The effect of policy parameters on the actions is simple to find, but how do we find the effect of policy on the state distribution? The function of the environment is unknown.

As a consequence, we face a problem: how do we estimate the ∇ (gradient) with respect to policy θ, when the gradient depends on the unknown effect of policy changes on the state distribution?

The solution will be to use the Policy Gradient Theorem. This provides an analytic expression for the gradient ∇ of J(θ) (performance) with respect to policy θ that does not involve the differentiation of the state distribution.

So let’s calculate:

Remember, we’re in a situation of stochastic policy. This means that our policy outputs a probability distribution π(τ ; θ). It outputs the probability of taking this series of steps (s0, a0, r0…), given our current parameters θ.

But, differentiating a probability function is hard, unless we can transform it into a logarithm. This makes it much simpler to differentiate.

Here we’ll use the likelihood ratio trick, which replaces the resulting fraction with a log probability.

Now let’s convert the summation back to an expectation:

As you can see, we only need to compute the derivative of the log policy function.
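
The steps described above can be sketched as the following derivation. It assumes the score is written as a sum over trajectories, J(θ) = Σ_τ π(τ; θ) R(τ), and uses the identity ∇ log f = ∇f / f for the likelihood ratio step.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{\tau} \pi(\tau;\theta)\, R(\tau) \\
  &= \sum_{\tau} \nabla_\theta \pi(\tau;\theta)\, R(\tau) \\
  &= \sum_{\tau} \pi(\tau;\theta)\,
     \frac{\nabla_\theta \pi(\tau;\theta)}{\pi(\tau;\theta)}\, R(\tau) \\
  &= \sum_{\tau} \pi(\tau;\theta)\,
     \nabla_\theta \log \pi(\tau;\theta)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim \pi(\cdot\,;\theta)}
     \left[\nabla_\theta \log \pi(\tau;\theta)\, R(\tau)\right]
\end{aligned}
```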

Now that we’ve done that (and it was a lot), we can state the policy gradient:
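
In the per-step notation used by the algorithm later in this article, the policy gradient and the corresponding update rule can be sketched as follows, where α is the learning rate (an assumed symbol, matching the pseudocode in the Monte Carlo section):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi}\left[\nabla_\theta \log \pi(s, a, \theta)\, R(\tau)\right],
\qquad
\Delta\theta = \alpha\, \nabla_\theta \log \pi(s, a, \theta)\, R(\tau)
```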

This policy gradient tells us how we should shift the policy distribution by changing the parameters θ if we want to achieve a higher score.

R(τ) is like a scalar value score:

If R(τ) is high, it means that on average we took actions that led to high rewards. We want to push up the probabilities of the actions seen (increase the probability of taking these actions).
On the other hand, if R(τ) is low, we want to push down the probabilities of the actions seen.

This policy gradient causes the parameters to move most in the direction that favors the actions that have the highest return.
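
The effect of the score on the update can be checked numerically with a small sketch. It assumes a softmax policy over three actions with one logit per action, and it models a "low" score as a negative number (for example a return after subtracting a baseline). Both choices are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def prob_after_update(logits, action, score, alpha=0.1):
    """Apply one update logits += alpha * score * grad(log pi(action))."""
    # For a softmax policy, grad of log pi(action) w.r.t. the logits
    # is one_hot(action) - softmax(logits).
    grad_log_pi = -softmax(logits)
    grad_log_pi[action] += 1.0
    new_logits = logits + alpha * score * grad_log_pi
    return float(softmax(new_logits)[action])

logits = np.zeros(3)                      # uniform policy to start with
p_old = float(softmax(logits)[0])
p_high = prob_after_update(logits, action=0, score=+2.0)
p_low = prob_after_update(logits, action=0, score=-2.0)

print(p_old, p_high, p_low)
```

With a positive score the probability of the action that was taken goes up, and with a negative score it goes down.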

Monte Carlo Policy Gradients

In our notebook, we’ll use this approach to design the policy gradient algorithm. We use Monte Carlo because our tasks can be divided into episodes.

Initialize θ
for each episode τ = S0, A0, R1, S1, …, ST:
    for t <-- 1 to T-1:
        Δθ = α ∇θ(log π(St, At, θ)) Gt
        θ = θ + Δθ

For each episode:
    At each time step within that episode:
        Compute the log probabilities produced by our policy function. Multiply it by the score function.
        Update the weights
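
The pseudocode above can be sketched in plain NumPy on a toy problem. Everything specific here is an assumption made for illustration and is not the course notebook: a four-cell corridor where the agent starts in cell 0 and must reach cell 3, a reward of -1 per step, a tabular softmax policy with one pair of logits per cell, and the chosen hyperparameters.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 20   # toy corridor, assumed for this sketch

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def run_episode(theta, rng):
    """Roll out one episode and return the visited states, actions and rewards."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        a = int(rng.choice(2, p=softmax(theta[s])))   # 0 = left, 1 = right
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)                          # every step costs 1
        s = min(GOAL, max(0, s + (1 if a == 1 else -1)))
        if s == GOAL:
            break
    return states, actions, rewards

def discounted_returns(rewards, gamma):
    """Compute Gt = r_t + gamma * r_{t+1} + ... for every time step."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce(n_episodes=500, alpha=0.05, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))                   # Initialize theta
    for _ in range(n_episodes):                       # for each episode
        states, actions, rewards = run_episode(theta, rng)
        returns = discounted_returns(rewards, gamma)
        for s, a, g in zip(states, actions, returns): # for each time step
            grad_log_pi = -softmax(theta[s])          # grad of log pi(a | s)
            grad_log_pi[a] += 1.0
            theta[s] += alpha * g * grad_log_pi       # theta = theta + delta
    return theta

theta = reinforce()
p_right = [float(softmax(theta[s])[1]) for s in range(GOAL)]
print(p_right)   # probability of moving right in cells 0, 1 and 2
```

After training, moving right should be the preferred action in every cell before the goal, since it ends the episode sooner and therefore collects less penalty.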

But we face a problem with this algorithm. Because we only calculate R at the end of the episode, we average all actions. Even if some of the actions taken were very bad, if our score is quite high, we will average all the actions as good.

So to have a correct policy, we need a lot of samples… which results in slow learning.

How to improve our Model?

We’ll see in the next articles some improvements:

Actor Critic: a hybrid between value-based algorithms and policy-based algorithms.
Proximal Policy Gradients: ensures that the deviation from the previous policy stays relatively small.

Let’s implement it with Cartpole and Doom

We made a video where we implement a Policy Gradient agent with Tensorflow that learns to play Doom in a Deathmatch environment.

You can directly access the notebooks in the Deep Reinforcement Learning Course repo.

Cartpole:

Doom:

That’s all! You’ve just created an agent that learns to survive in a Doom environment. Awesome!

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, change the learning rate, use a harder environment …and so on. Have fun!

In the next article, I will discuss the latest improvements in Deep Q-learning:

Fixed Q-values
Prioritized Experience Replay
Double DQN
Dueling Networks

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.
