by ADL
Reinforcement learning is an area of machine learning in which an agent learns to behave in an environment by performing actions and observing the rewards/results it gets from those actions.

With the advancements in robotic arm manipulation, Google DeepMind's AlphaGo beating a professional Go player, and more recently the OpenAI team beating professional Dota 2 players, the field of reinforcement learning has really exploded in recent years.
In this article, we'll discuss:

- What reinforcement learning is and its nitty-gritty, like rewards, tasks, and so on
- 3 categorizations of reinforcement learning
What is Reinforcement Learning?

Let's start the explanation with an example: say there is a small baby who starts learning how to walk.

Let's divide this example into two parts:
1. The baby starts walking and successfully reaches the couch

Since the couch is the end goal, the baby and the parents are happy.

So, the baby is happy and receives appreciation from her parents. It's positive: the baby feels good (positive reward +n).
2. The baby starts walking, falls due to some obstacle in between, and gets bruised

Ouch! The baby gets hurt and is in pain. It's negative: the baby cries (negative reward -n).

That's how we humans learn: by trial and error. Reinforcement learning is conceptually the same, but it is a computational approach to learning through actions.
Reinforcement Learning

Let's suppose that our reinforcement learning agent is learning to play Mario. The reinforcement learning process can be modeled as an iterative loop that works as follows:
- The RL agent receives state S₀ from the environment (i.e. Mario)
- Based on that state S₀, the RL agent takes an action A₀ (say, our RL agent moves right). Initially, this is random.
- Now the environment is in a new state S₁ (a new frame from Mario or the game engine)
- The environment gives some reward R₁ to the RL agent. It probably gives a +1 because the agent is not dead yet.
- This RL loop continues until we are dead or we reach our destination, and it continuously outputs a sequence of states, actions and rewards.
The basic aim of our RL agent is to maximize the reward.
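To make the loop concrete, here is a minimal Python sketch. The `ToyEnv` class and its interface are hypothetical stand-ins (not Mario or any real game engine); they just mirror the state, action, reward cycle described above.

```python
import random

class ToyEnv:
    """A stand-in environment (not Mario): the agent walks along a line
    and receives +1 per step until it reaches the goal at position 10
    or falls off at position -1."""
    def reset(self):
        self.pos = 0
        return self.pos                       # initial state S0

    def step(self, action):
        self.pos += 1 if action == "right" else -1
        reward = 1                            # +1 because the agent is not "dead" yet
        done = self.pos >= 10 or self.pos < 0
        return self.pos, reward, done         # next state, reward, episode finished?

env = ToyEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    action = random.choice(["left", "right"])   # initially the action is random
    state, reward, done = env.step(action)      # environment returns the next state and reward
    total_reward += reward
print("episode return:", total_reward)
```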
Reward Maximization

The RL agent basically works on the hypothesis of reward maximization. That's why the agent must take the best possible actions in order to maximize the reward.
The cumulative reward at each time step, with the respective actions, is written as:
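The formula at this point was an image in the original article; in standard notation, the cumulative reward (the return) from time step t is simply the sum of the rewards that follow:

```latex
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots = \sum_{k=0}^{\infty} R_{t+k+1}
```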
However, things don't work this way when simply summing up all the rewards.

Let us understand this in detail:
Let us say our RL agent (a robotic mouse) is in a maze which contains cheese, electric shocks, and cats. The goal is to eat the maximum amount of cheese before being eaten by the cat or getting an electric shock.

It seems obvious to eat the cheese near us rather than the cheese close to the cat or the electric shock, because the closer we are to the electric shock or the cat, the greater the danger of ending up dead. As a result, the reward near the cat or the electric shock, even if it is bigger (more cheese), will be discounted. This is done because of the uncertainty factor.

It makes sense, right?
Discounting of rewards works like this:

We define a discount rate called gamma. It should be between 0 and 1. The larger the gamma, the smaller the discount, and vice versa.
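A quick illustration of the effect of gamma (a small Python sketch, not part of the original article): the reward k steps in the future is weighted by gamma raised to the power k.

```python
# Weight of the reward k steps ahead is gamma**k.
for gamma in (0.9, 0.5):
    weights = [round(gamma**k, 3) for k in range(5)]
    print(f"gamma={gamma}: discount weights for the next 5 steps -> {weights}")
# gamma=0.9: [1.0, 0.9, 0.81, 0.729, 0.656]   (larger gamma, smaller discount)
# gamma=0.5: [1.0, 0.5, 0.25, 0.125, 0.062]   (smaller gamma, future rewards fade quickly)
```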
So, our cumulative expected (discounted) reward is:
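This formula was also shown as an image in the original; in standard notation, the discounted return is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1
```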
Tasks and their types in reinforcement learning

A task is a single instance of a reinforcement learning problem. We basically have two types of tasks: continuous and episodic.
Continuous tasks

These are the types of tasks that continue forever. For instance, an RL agent that does automated forex/stock trading.

In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment. There is no starting point or terminal state.

The RL agent has to keep running until we decide to manually stop it.
Episodic tasks

In this case, we have a starting point and an ending point called the terminal state. This creates an episode: a list of states (S), actions (A), and rewards (R).

For example, consider playing a game of Counter-Strike, where we either shoot our opponents or get killed by them. We shoot all of them and complete the episode, or we are killed. So, there are only two ways an episode can end.
Exploration and exploitation trade-off

There is an important concept in reinforcement learning: the exploration and exploitation trade-off. Exploration is all about finding more information about the environment, whereas exploitation is using already known information to maximize the reward.

Real-life example: say you go to the same restaurant every day. You are basically exploiting. But if, on the other hand, you search for a new restaurant every time before going out, then that's exploration. Exploration is very important for the search of future rewards, which might be higher than the nearby rewards.
In the above game, our robotic mouse can eat a good amount of small cheese (+0.5 each). But at the top of the maze there is a big sum of cheese (+100). So, if we only focus on the nearest reward, our robotic mouse will never reach the big sum of cheese; it will just exploit.

But if the robotic mouse does a little bit of exploration, it can find the big reward, i.e. the big cheese.

This is the basic concept of the exploration and exploitation trade-off.
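One common way to balance the two (not covered in detail in this article) is an epsilon-greedy strategy: with a small probability epsilon the agent explores a random action, otherwise it exploits the best-known action. A minimal sketch, with hypothetical action values for the robotic mouse:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))
    return max(action_values, key=action_values.get)

# Hypothetical value estimates for the robotic mouse's actions
action_values = {"left": 0.5, "right": 0.2, "up": 1.0, "down": 0.0}
print(epsilon_greedy(action_values))   # usually "up", occasionally a random move
```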
Approaches to Reinforcement Learning

Let us now understand the approaches to solving reinforcement learning problems. There are basically 3 approaches, but we will only cover the 2 major ones in this article:
1. Policy-based approach

In policy-based reinforcement learning, we have a policy which we need to optimize. The policy basically defines how the agent behaves:

We learn a policy function which helps us map each state to the best action.

Getting deeper into policies, we further divide them into two types:
- Deterministic: a policy at a given state (s) will always return the same action (a). It means it is pre-mapped: a = π(s).
- Stochastic: it gives a distribution of probabilities over the different actions, i.e. a stochastic policy π(a|s) = P(A = a | S = s).
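A tiny Python sketch of the difference (the state and action names are hypothetical, purely for illustration):

```python
import random

# Deterministic policy: a fixed mapping from state to action
deterministic_policy = {"s0": "right", "s1": "jump"}
action = deterministic_policy["s0"]                 # always "right" in state s0

# Stochastic policy: a probability distribution over actions for each state
stochastic_policy = {"s0": {"right": 0.8, "jump": 0.2}}
probs = stochastic_policy["s0"]
action = random.choices(list(probs), weights=probs.values())[0]  # "right" about 80% of the time
```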
2. Value-based approach

In value-based RL, the goal of the agent is to optimize the value function V(s), which is defined as a function that tells us the maximum expected future reward the agent shall get at each state.

The value of each state is the total amount of reward an RL agent can expect to collect over the future, starting from that particular state.
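In standard notation, the value of a state s under a policy π is the expected (discounted) return when starting in s:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
           = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
```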
The agent will use this value function to select which state to choose at each step. The agent will always take the state with the biggest value.

In the example below, we see that at each step we take the biggest value to achieve our goal: 1 → 3 → 4 → 6, and so on…
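As a sketch of that greedy selection (the state numbers, values, and transitions below are hypothetical, just mirroring the 1 → 3 → 4 → 6 path):

```python
# The agent greedily moves to the reachable next state with the largest value.
state_values = {1: 0.1, 2: 0.3, 3: 0.5, 4: 0.7, 5: 0.6, 6: 1.0}
neighbours   = {1: [2, 3], 3: [4, 5], 4: [6], 6: []}

state, path = 1, [1]
while neighbours.get(state):
    state = max(neighbours[state], key=state_values.get)
    path.append(state)
print(path)   # [1, 3, 4, 6]
```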
The game of Pong: an intuitive case study

Let us take a real-life example of playing Pong. This case study will just introduce you to the intuition of how reinforcement learning works. We will not get into details in this example, but in the next article we will certainly dig deeper.

Suppose we teach our RL agent to play the game of Pong.
Basically, we feed the game frames (new states) to the RL algorithm and let the algorithm decide whether to go up or down. This network is called a policy network, which we will discuss in our next article.

The method used to train this algorithm is called the policy gradient. We feed random frames from the game engine, the algorithm produces a random output which earns a reward, and this is fed back to the algorithm/network. This is an iterative process.
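As a rough sketch of what such a policy network might look like, here is a hypothetical single-hidden-layer network in numpy that maps a preprocessed frame to the probability of moving up. The shapes and layer sizes are assumptions for illustration, not the implementation discussed in the next article.

```python
import numpy as np

# Hypothetical shapes: a flattened, preprocessed 80x80 game frame in, P(move UP) out.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((200, 80 * 80)) * 0.01   # input layer -> 200 hidden units
W2 = rng.standard_normal(200) * 0.01              # hidden units -> single logit

def policy_forward(frame):
    """Return the probability of moving UP for one preprocessed frame."""
    h = np.maximum(0, W1 @ frame)                  # ReLU hidden layer
    logit = W2 @ h
    return 1.0 / (1.0 + np.exp(-logit))            # sigmoid -> probability of UP

frame = rng.standard_normal(80 * 80)               # stand-in for a real game frame
p_up = policy_forward(frame)
action = "UP" if rng.random() < p_up else "DOWN"   # sample the action from the policy
```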
We will discuss policy gradients in the next article in greater detail.

In the context of the game, the scoreboard acts as a reward or feedback to the agent. Whenever the agent scores +1, it understands that the action it took in that state was good enough.
Now we will train the agent to play Pong. To start, we feed a bunch of game frames (states) to the network/algorithm and let the algorithm decide the actions. The initial actions of the agent will obviously be bad, but our agent can sometimes be lucky enough to score a point, which might be a random event. But due to this lucky random event, it receives a reward, and this helps the agent understand that the series of actions was good enough to fetch a reward.

So, in the future, the agent is likely to take the actions which fetch a reward over those which do not. Intuitively, the RL agent is learning to play the game.
Limitations

During training, when the agent loses an episode, the algorithm will discard or lower the likelihood of taking the whole series of actions that existed in that episode.

But if the agent was performing well from the start of the episode and lost the game only because of the last 2 actions, it does not make sense to discard all the actions. Rather, it makes sense to only penalize the last 2 actions which resulted in the loss.
This is called the credit assignment problem. It arises because of the sparse reward setting: instead of getting a reward at every step, we get the reward only at the end of the episode. So it's up to the agent to learn which actions were correct and which actions actually led to losing the game.

Due to this sparse reward setting, the algorithm is very sample-inefficient. This means that a huge number of training examples has to be fed in to train the agent. And the fact is that sparse reward settings fail in many circumstances due to the complexity of the environment.
So, there is something called reward shaping which is used to solve this. But reward shaping also suffers from a limitation, as we need to design a custom reward function for every game.
Closing Note

Today, reinforcement learning is an exciting field of study. Major developments have been made in the field, of which deep reinforcement learning is one.

We will cover deep reinforcement learning in our upcoming articles. This article covers a lot of concepts. Please take your own time to understand the basic concepts of reinforcement learning.

But I would like to mention that reinforcement learning is not a secret black box. Whatever advancements we are seeing today in the field of reinforcement learning are a result of bright minds working day and night on specific applications.
Next time we'll work on a Q-learning agent and also cover some more basic stuff in reinforcement learning.

Until then, enjoy AI…
Important: this article is the 1st part of the Deep Reinforcement Learning series. The complete series shall be available both in text-readable form on Medium and in video explanatory form on my channel on YouTube.

For a deeper and more intuitive understanding of reinforcement learning, I would recommend that you watch the video below:
Subscribe to my YouTube channel for more AI videos: ADL.

If you liked my article, please click the 👏, as it keeps me motivated to write, and please follow me on Medium.

If you have any questions, please let me know in a comment below or on Twitter. Subscribe to my YouTube channel for more tech videos: ADL.
Translated from: https://www.freecodecamp.org/news/a-brief-introduction-to-reinforcement-learning-7799af5840db/