Deep reinforcement learning: where to start
by Jannes Klaas
Last year, DeepMind’s AlphaGo beat Go world champion Lee Sedol 4–1. More than 200 million people watched as reinforcement learning (RL) took to the world stage. A few years earlier, DeepMind had made waves with a bot that could play Atari games. The company was soon acquired by Google.
Many researchers believe that RL is our best shot at creating artificial general intelligence. It is an exciting field, with many unsolved challenges and huge potential.
Although it can appear challenging at first, getting started in RL is actually not so difficult. In this article, we will create a simple bot with Keras that can play a game of Catch.
The game
Catch is a very simple arcade game, which you might have played as a child. Fruits fall from the top of the screen, and the player has to catch them with a basket. For every fruit caught, the player scores a point. For every fruit lost, the player loses a point.
The goal here is to let the computer play Catch by itself. But we will not use a pretty, full-graphics version of the game. Instead, we will use a simplified version to make the task easier:
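To make that concrete, here is a minimal sketch of what such a simplified Catch environment could look like. The 10x10 grid, the three-cell basket, and the class and method names are assumptions made for illustration, not the exact code from the accompanying notebook:

```python
import numpy as np

class SimpleCatch:
    """A toy Catch environment on a small square grid (sizes are assumptions)."""

    def __init__(self, grid_size=10):
        self.grid_size = grid_size
        self.reset()

    def reset(self):
        # The fruit starts in a random column at the top; the basket starts in the middle.
        self.fruit_row = 0
        self.fruit_col = np.random.randint(0, self.grid_size)
        self.basket_col = self.grid_size // 2
        return self._observe()

    def _observe(self):
        # The state is the game screen, flattened into a vector.
        screen = np.zeros((self.grid_size, self.grid_size))
        screen[self.fruit_row, self.fruit_col] = 1
        screen[-1, self.basket_col - 1:self.basket_col + 2] = 1  # three-cell basket
        return screen.reshape(1, -1)

    def step(self, action):
        # action: 0 = move left, 1 = stay, 2 = move right
        self.basket_col = min(max(self.basket_col + (action - 1), 1), self.grid_size - 2)
        self.fruit_row += 1
        done = self.fruit_row == self.grid_size - 1
        # +1 if the fruit lands in the basket, -1 if it hits the floor, 0 otherwise.
        reward = 0
        if done:
            reward = 1 if abs(self.fruit_col - self.basket_col) <= 1 else -1
        return self._observe(), reward, done
```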
While playing Catch, the player decides between three possible actions. They can move the basket to the left, to the right, or stay put.
The basis for this decision is the current state of the game. In other words: the positions of the falling fruit and of the basket.
Our goal is to create a model that, given the content of the game screen, chooses the action leading to the highest possible score.
This task can be seen as a simple classification problem. We could ask expert human players to play the game many times and record their actions. Then, we could train a model to choose the ‘correct’ action that mirrors the expert players.
But this is not how humans learn. Humans can learn a game like Catch by themselves, without guidance. This is very useful. Imagine if you had to hire a bunch of experts to perform a task thousands of times every time you wanted to learn something as simple as Catch! It would be expensive and slow.
In reinforcement learning, the model trains from experience, rather than labeled data.
Deep reinforcement learning
Reinforcement learning is inspired by behavioral psychology.
Instead of providing the model with ‘correct’ actions, we provide it with rewards and punishments. The model receives information about the current state of the environment (e.g. the computer game screen). It then outputs an action, like a joystick movement. The environment reacts to this action and provides the next state, along with any rewards.
The model then learns to find actions that lead to maximum rewards.
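As a rough sketch of this loop, reusing the hypothetical SimpleCatch environment from above and a purely random stand-in policy:

```python
import numpy as np

env = SimpleCatch()                 # the toy environment sketched earlier
state = env.reset()                 # the environment reports its current state
done = False
while not done:
    action = np.random.randint(0, 3)              # a real agent would pick an action here; we act randomly
    next_state, reward, done = env.step(action)   # the environment reacts with the next state and a reward
    state = next_state
print("Episode finished with reward", reward)
```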
There are many ways this can work in practice. Here, we are going to look at Q-Learning. Q-Learning made a splash when it was used to train a computer to play Atari games. Today, it is still a relevant concept. Most modern RL algorithms are some adaptation of Q-Learning.
Q-learning intuition
A good way to understand Q-learning is to compare playing Catch with playing chess.
In both games you are given a state, S. With chess, this is the positions of the pieces on the board. In Catch, it is the location of the fruit and the basket.
The player then has to take an action, A. In chess, this is moving a piece. In Catch, it is moving the basket left or right, or staying in the current position.
As a result, there will be some reward R, and a new state S’.
The problem with both Catch and chess is that the rewards do not appear immediately after the action.
In Catch, you only receive a reward when the fruit hits the basket or falls on the floor, and in chess you only receive a reward when you win or lose the game. This means that rewards are sparsely distributed. Most of the time, R will be zero.
When there is a reward, it is not always a result of the action taken immediately before. Some action taken long before might have caused the victory. Figuring out which action is responsible for the reward is often referred to as the credit assignment problem.
Because rewards are delayed, good chess players do not choose their moves based only on the immediate reward. Instead, they choose based on the expected future reward.
For example, they do not only think about whether they can capture an opponent’s piece with the next move. They also consider how taking a certain action now will help them in the long run.
In Q-learning, we choose our action based on the highest expected future reward. We use a “Q-function” to calculate this. This is a math function that takes two arguments: the current state of the game, and a given action.
We can write this as: Q(state, action)
While in state S, we estimate the future reward for each possible action A. We assume that after we have taken action A and moved to the next state S’, everything works out perfectly.
The expected future reward Q(S,A) for a given state S and action A is calculated as the immediate reward R, plus the expected future reward thereafter, Q(S',A'). We assume the next action A' is optimal.
Because there is uncertainty about the future, we discount Q(S’,A’) by the factor gamma, γ.
Q(S,A) = R + γ * max Q(S’,A’)
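For example, with made-up numbers: if an action yields an immediate reward of R = 1, the best follow-up value the model expects is max Q(S’,A’) = 0.5, and γ = 0.9, then Q(S,A) = 1 + 0.9 * 0.5 = 1.45.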
Good chess players are very good at estimating future rewards in their head. In other words, their Q-function Q(S,A) is very precise.
Most chess practice revolves around developing a better Q-function. Players peruse many old games to learn how specific moves played out in the past, and how likely a given action is to lead to victory.
But how can a machine estimate a good Q-function? This is where neural networks come into play.
Regression after all
When playing a game, we generate lots of “experiences”. These experiences consist of:
the initial state, S
the action taken, A
the reward earned, R
and the state that followed, S’
These experiences are our training data. We can frame the problem of estimating Q(S,A) as a regression problem. To solve this, we can use a neural network.
Given an input vector consisting of S and A, the neural net is supposed to predict the value of Q(S,A) equal to the target: R + γ * max Q(S’,A’).
If we are good at predicting Q(S,A) for different states S and actions A, we have a good approximation of the Q-function. Note that we estimate Q(S’,A’) through the same neural net as Q(S,A).
The training process
Given a batch of experiences < S, A, R, S’ >, the training process then looks as follows:
1. For each possible action A’ (left, right, stay), predict the expected future reward Q(S’,A’) using the neural net.
2. Choose the highest value of the three predictions as max Q(S’,A’).
3. Calculate r + γ * max Q(S’,A’). This is the target value for the neural net.
4. Train the neural net using a loss function. This is a function that calculates how near or far the predicted value is from the target value. Here, we will use 0.5 * (predicted_Q(S,A) - target)² as the loss function.
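A minimal sketch of steps 1 to 3 for a single experience might look like the following. The function name is made up, the model is assumed to map a state to one Q-value per action (as in the Keras model defined further below), and the terminal-state check is an addition not spelled out in the steps above:

```python
import numpy as np

def q_target(model, reward, next_state, done, gamma=0.9):
    """Compute the Q-learning target R + gamma * max Q(S', A') for a single experience."""
    if done:
        return reward                               # no future reward after a terminal state
    q_next = model.predict(next_state)[0]           # step 1: Q(S', A') for every possible action A'
    return reward + gamma * float(np.max(q_next))   # steps 2 and 3: take the maximum and discount it
```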
During gameplay, all the experiences are stored in a replay memory. This acts like a simple buffer in which we store < S, A, R, S’ > tuples. The experience replay class also handles preparing the data for training. Check out the code below:
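The full class lives in the notebook linked below; here is a minimal sketch of what such an experience replay might look like. The buffer size, batch size, and the way targets are assembled are assumptions made for illustration, not a verbatim copy of the original code:

```python
import numpy as np

class ExperienceReplay:
    """Stores < S, A, R, S' > tuples and builds training batches from them (a sketch, not the original code)."""

    def __init__(self, max_memory=500, discount=0.9):
        self.max_memory = max_memory
        self.discount = discount
        self.memory = []

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        if len(self.memory) > self.max_memory:
            self.memory.pop(0)  # drop the oldest experience when the buffer is full

    def get_batch(self, model, batch_size=32):
        num_actions = model.output_shape[-1]
        state_size = self.memory[0][0].shape[1]
        batch_size = min(batch_size, len(self.memory))
        inputs = np.zeros((batch_size, state_size))
        targets = np.zeros((batch_size, num_actions))
        # Sample random experiences and compute their Q-learning targets.
        for i, idx in enumerate(np.random.randint(0, len(self.memory), size=batch_size)):
            state, action, reward, next_state, done = self.memory[idx]
            inputs[i] = state[0]
            # Start from the model's own predictions so only the taken action's Q-value is changed.
            targets[i] = model.predict(state)[0]
            q_next = np.max(model.predict(next_state)[0])
            targets[i, action] = reward if done else reward + self.discount * q_next
        return inputs, targets
```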
Defining the model
Now it is time to define the model that will learn a Q-function for Catch.
We are using Keras as a front end to TensorFlow. Our baseline model is a simple three-layer dense network.
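A sketch of what such a baseline model could look like is below. The layer sizes, optimizer, and the assumption of a flattened 10x10 game screen are illustrative choices, not necessarily those of the original notebook. Note that instead of feeding S and A in together, the net takes the state and outputs one Q-value per action, which amounts to the same thing; depending on your setup, the imports may need to come from tensorflow.keras rather than keras.

```python
from keras.models import Sequential
from keras.layers import Dense

grid_size = 10    # assumed size of the simplified Catch screen
num_actions = 3   # left, stay, right

model = Sequential()
model.add(Dense(100, input_shape=(grid_size ** 2,), activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_actions))                 # linear outputs: one Q-value per action
model.compile(optimizer='sgd', loss='mse')    # mean squared error stands in for the 0.5 * (prediction - target)² loss
```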
Already, this model performs quite well on this simple version of Catch. Head over to GitHub for the full implementation. You can experiment with more complex models to see if you can get better performance.
Exploration
A final ingredient to Q-Learning is exploration.
Everyday life shows that sometimes you have to do something weird and/or random to find out whether there is something better than your daily trot.
The same goes for Q-Learning. Always choosing the best option means you might miss out on some unexplored paths. To avoid this, the learner will sometimes choose a random option, and not necessarily the best.
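In practice this is usually an epsilon-greedy policy: with a small probability epsilon the agent picks a random action, otherwise it picks the action with the highest predicted Q-value. A minimal sketch (the epsilon value and the function name are assumptions):

```python
import numpy as np

def choose_action(model, state, num_actions=3, epsilon=0.1):
    """Epsilon-greedy action selection: mostly exploit, occasionally explore."""
    if np.random.rand() < epsilon:
        return np.random.randint(0, num_actions)    # explore: try a random action
    return int(np.argmax(model.predict(state)[0]))  # exploit: best action under the current Q-estimates
```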
Now we can define the training method:
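The actual training routine is in the notebook; a stripped-down sketch that ties together the hypothetical pieces from the earlier snippets (SimpleCatch, ExperienceReplay, choose_action, and the Keras model) could look like this:

```python
epochs = 5000
env = SimpleCatch()
replay = ExperienceReplay(max_memory=500)
wins = 0

for epoch in range(epochs):
    state = env.reset()
    done = False
    while not done:
        action = choose_action(model, state, epsilon=0.1)        # epsilon-greedy action
        next_state, reward, done = env.step(action)              # environment reacts
        replay.remember(state, action, reward, next_state, done)
        inputs, targets = replay.get_batch(model, batch_size=32)
        model.train_on_batch(inputs, targets)                    # one gradient step toward the Q-targets
        state = next_state
    wins += int(reward == 1)
    if (epoch + 1) % 500 == 0:
        print("Epoch {} | win rate so far: {:.2f}".format(epoch + 1, wins / (epoch + 1)))
```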
I let the game train for 5,000 epochs, and it does quite well now!
As you can see in the animation in the original post, the computer catches the apples falling from the sky.
To visualize how the model learned, I plotted the moving average of victories over the epochs:
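If you collect a 1-or-0 win flag per epoch (as in the training sketch above), the moving average takes only a few lines of numpy and matplotlib; the window size and the placeholder data are arbitrary choices for the sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

# win_history would be collected during training; random placeholder data is used here.
win_history = np.random.randint(0, 2, size=5000)

window = 100
moving_avg = np.convolve(win_history, np.ones(window) / window, mode='valid')

plt.plot(moving_avg)
plt.xlabel('epoch')
plt.ylabel('moving average of wins')
plt.show()
```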
Where to go from here
You now have gained a first overview and an intuition of RL. I recommend taking a look at the full code for this tutorial. You can experiment with it.
You might also want to check out Arthur Juliani’s series. If you’d like a more formal introduction, have a look at Stanford’s CS 234, Berkeley’s CS 294 or David Silver’s lectures from UCL.
A great way to practice your RL skills is OpenAI’s Gym, which offers a set of training environments with a standardized API.
Acknowledgements
This article builds upon Eder Santana’s simple RL example from 2016. I refactored his code and added explanations in a notebook I wrote earlier in 2017. For readability on Medium, I only show the most relevant code here. Head over to the notebook or Eder’s original post for more.
About Jannes Klaas
This text is part of the Machine Learning in Financial Context course material, which helps economics and business students understand machine learning.
I spent a decade building software and am now on a journey to bring ML to the financial world. I study at the Rotterdam School of Management and have done research with the Institute for Housing and Urban Development Studies.
You can follow me on Twitter. If you have any questions or suggestions please leave a comment or ping me on Medium.
Originally published at: https://www.freecodecamp.org/news/deep-reinforcement-learning-where-to-start-291fb0058c01/