by Thomas Simonini
Diving deeper into Reinforcement Learning with Q-Learning
This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.
Today we'll learn about Q-Learning. Q-Learning is a value-based Reinforcement Learning algorithm.
This article is the second part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course. See the first article here.
In this article you'll learn:
- What Q-Learning is
- How to implement it with Numpy
The big picture: the Knight and the Princess
Let's say you're a knight and you need to save the princess trapped in the castle shown on the map above.
You can move one tile at a time. The enemy can't move, but if you land on the same tile as the enemy, you will die. Your goal is to reach the castle by the fastest route possible. This can be evaluated using a "points scoring" system.
- You lose 1 point at each step (losing points at each step helps our agent to be fast).
- If you touch an enemy, you lose 100 points, and the episode ends.
- If you reach the castle, you win and get +100 points.
The question is: how do you create an agent that will be able to do that?
Here's a first strategy. Let's say our agent tries to go to each tile, and then colors each tile: green for "safe," and red if not.
Then, we can tell our agent to take only green tiles.
But the problem is that it's not really helpful. We don't know the best tile to take when green tiles are adjacent to each other. So our agent can fall into an infinite loop while trying to find the castle!
Introducing the Q-table
Here's a second strategy: create a table where we'll calculate the maximum expected future reward, for each action at each state.
Thanks to that, we'll know what's the best action to take for each state.
Each state (tile) allows four possible actions. These are moving left, right, up, or down.
In terms of computation, we can transform this grid into a table.
This is called a Q-table ("Q" for "quality" of the action). The columns will be the four actions (left, right, up, down). The rows will be the states. The value of each cell will be the maximum expected future reward for that given state and action.
Each Q-table score will be the maximum expected future reward that I'll get if I take that action at that state with the best policy given.
Why do we say "with the policy given?" It's because we don't implement a policy. Instead, we just improve our Q-table to always choose the best action.
Think of this Q-table as a game "cheat sheet." Thanks to that, we know for each state (each line in the Q-table) what's the best action to take, by finding the highest score in that line.
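As a concrete illustration, reading this "cheat sheet" for one state is just a row lookup followed by an argmax. Here is a minimal Numpy sketch; the grid size, action ordering, and values are illustrative assumptions, not taken from the article:

```python
import numpy as np

n_states, n_actions = 6, 4                 # assumption: 6 tiles, 4 moves (left, right, up, down)
q_table = np.zeros((n_states, n_actions))  # one row per state, one column per action

# Pretend we have already learned some values for state 2 (made-up numbers)
q_table[2] = [0.1, 0.5, -0.2, 0.0]

state = 2
best_action = np.argmax(q_table[state])    # index of the highest score in that row
print(best_action)                         # -> 1, i.e. "right" in this assumed ordering
```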
Yeah! We solved the castle problem! But wait… how do we calculate the values for each element of the Q-table?
To learn each value of this Q-table, we'll use the Q-learning algorithm.
Q-learning algorithm: learning the Action Value Function
The Action Value Function (or "Q-function") takes two inputs: "state" and "action." It returns the expected future reward of that action at that state.
We can see this Q function as a reader that scrolls through the Q-table to find the line associated with our state, and the column associated with our action. It returns the Q value from the matching cell. This is the "expected future reward."
But before we explore the environment, the Q-table gives the same arbitrary fixed value (most of the time 0). As we explore the environment, the Q-table will give us a better and better approximation by iteratively updating Q(s,a) using the Bellman Equation (see below!).
The Q-learning algorithm process
Step 1: Initialize Q-values. We build a Q-table, with m columns (m = number of actions) and n rows (n = number of states). We initialize the values at 0.
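Since the article's environments (Taxi-v2, Frozen Lake) expose discrete state and action spaces through OpenAI Gym, step 1 amounts to a single Numpy call. A minimal sketch, assuming a Gym-style env object:

```python
import numpy as np
import gym

env = gym.make("Taxi-v2")  # Taxi-v3 in newer gym releases

# n rows = number of states, m columns = number of actions, all initialized to 0
q_table = np.zeros((env.observation_space.n, env.action_space.n))
print(q_table.shape)       # (500, 6) for Taxi: 500 states, 6 actions
```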
Step 2: For life (or until learning is stopped). Steps 3 to 5 will be repeated until we reach a maximum number of episodes (specified by the user) or until we manually stop the training.
Step 3: Choose an action. Choose an action a in the current state s based on the current Q-value estimates.
But… what action can we take in the beginning, if every Q-value equals zero?
That's where the exploration/exploitation trade-off that we spoke about in the last article will be important.
The idea is that in the beginning, we'll use the epsilon-greedy strategy:
- We specify an exploration rate "epsilon," which we set to 1 in the beginning. This is the rate of steps that we'll do randomly. In the beginning, this rate must be at its highest value, because we don't know anything about the values in the Q-table. This means we need to do a lot of exploration, by randomly choosing our actions.
- We generate a random number. If this number > epsilon, then we will do "exploitation" (this means we use what we already know to select the best action at each step). Else, we'll do exploration (see the sketch after this list).
- The idea is that we must have a big epsilon at the beginning of the training of the Q-function. Then, we reduce it progressively as the agent becomes more confident at estimating Q-values.
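A minimal sketch of that epsilon-greedy choice (the function name and the decay schedule below are illustrative assumptions):

```python
import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the Q-table."""
    if random.uniform(0, 1) > epsilon:
        return int(np.argmax(q_table[state]))  # exploitation: best known action
    return random.randrange(n_actions)         # exploration: random action

# One common schedule: start at 1.0 and decay toward a small floor after each episode:
# epsilon = min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```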
Steps 4–5: Evaluate! Take the action a and observe the outcome state s' and reward r. Now update the function Q(s,a).
We take the action a that we chose in step 3, and then performing this action returns us a new state s' and a reward r (as we saw in the Reinforcement Learning process in the first article).
Then, to update Q(s,a) we use the Bellman equation:
The idea here is to update our Q(state, action) like this:
New Q value = Current Q value + lr * [Reward + discount_rate * (highest Q value between possible actions from the new state s') - Current Q value]
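Written as code, the update is a single assignment. A sketch; lr (learning rate) and gamma (discount rate) are hyperparameters you pick yourself, and the values used as defaults here are assumptions:

```python
import numpy as np

def q_update(q_table, state, action, reward, new_state, lr=0.1, gamma=0.99):
    """Apply the Bellman update for one transition; the default lr and gamma are assumed values."""
    best_next = np.max(q_table[new_state])  # highest Q value from the new state s'
    td_target = reward + gamma * best_next  # Reward + discount_rate * best_next
    q_table[state, action] += lr * (td_target - q_table[state, action])
    return q_table
```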
Let's take an example:
- One piece of cheese = +1
- Two pieces of cheese = +2
- Big pile of cheese = +10 (end of the episode)
- If you eat rat poison = -10 (end of the episode)
Step 1: We init our Q-table
Step 2: Choose an action. From the starting position, you can choose between going right or down. Because we have a big epsilon rate (since we don't know anything about the environment yet), we choose randomly. For example… move right.
We found a piece of cheese (+1), and we can now update the Q-value of being at start and going right. We do this by using the Bellman equation.
Steps 4–5: Update the Q-function
- First, we calculate the change in Q value ΔQ(start, right)
- Then we add the initial Q value to the ΔQ(start, right) multiplied by a learning rate.
Think of the learning rate as a measure of how quickly a network abandons the former value for the new one. If the learning rate is 1, the new estimate will be the new Q-value.
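Plugging numbers in makes the update concrete. Suppose, purely for illustration (these values are assumptions, not taken from the article), the learning rate is 0.1, the discount rate is 0.99, Q(start, right) is currently 0, and every Q value reachable from the new state is still 0:

```python
lr, gamma = 0.1, 0.99   # assumed hyperparameters
reward = 1              # the piece of cheese we just found
current_q = 0.0         # Q(start, right) before the update
max_next_q = 0.0        # highest Q value among actions from the new state

delta_q = reward + gamma * max_next_q - current_q  # ΔQ(start, right) = 1.0
new_q = current_q + lr * delta_q                   # 0 + 0.1 * 1.0 = 0.1
print(new_q)                                       # -> 0.1
```

So after this single update, Q(start, right) moves from 0 to 0.1, and further visits to that cell will keep refining it.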
Good! We've just updated our first Q value. Now we need to do that again and again until the learning is stopped.
Implement a Q-learning algorithm
We made a video where we implement a Q-learning agent that learns to play Taxi-v2 with Numpy.
Now that we know how it works, we'll implement the Q-learning algorithm step by step. Each part of the code is explained directly in the Jupyter notebook below.
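For orientation before you open the notebook, here is a compressed sketch of how the pieces fit together in one training loop. It assumes an OpenAI Gym-style environment and illustrative hyperparameter values; it is not the notebook's exact code:

```python
import random
import numpy as np
import gym

env = gym.make("Taxi-v2")  # Taxi-v3 in newer gym releases
q_table = np.zeros((env.observation_space.n, env.action_space.n))

# Illustrative hyperparameters (tune them yourself)
total_episodes, max_steps = 50000, 99
lr, gamma = 0.7, 0.618
epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.01

for episode in range(total_episodes):
    state = env.reset()
    for _ in range(max_steps):
        # Step 3: epsilon-greedy action choice
        if random.uniform(0, 1) > epsilon:
            action = int(np.argmax(q_table[state]))
        else:
            action = env.action_space.sample()

        # Steps 4-5: act, observe, and apply the Bellman update
        new_state, reward, done, _ = env.step(action)
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if done:
            break

    # Reduce exploration as the agent becomes more confident
    epsilon = min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```

Once training is done, always taking np.argmax(q_table[state]) at each step is the learned policy.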
You can access it in the Deep Reinforcement Learning Course repo.
Or you can access it directly on Google Colaboratory:
Q* Learning with Frozen Lake (colab.research.google.com)
A recap…
- Q-learning is a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a Q function.
- It evaluates which action to take based on an action-value function that determines the value of being in a certain state and taking a certain action at that state.
- Goal: maximize the value function Q (expected future reward given a state and action).
- The Q-table helps us find the best action for each state.
- The aim is to maximize the expected reward by selecting the best of all possible actions.
The Q comes from the quality of a certain action in a certain state.
- Function Q(state, action) → returns the expected future reward of that action at that state.
- This function can be estimated using Q-learning, which iteratively updates Q(s,a) using the Bellman Equation.
- Before we explore the environment, the Q-table gives the same arbitrary fixed value → but as we explore the environment → Q gives us a better and better approximation.
That's all! Don't forget to implement each part of the code by yourself. It's really important to try to modify the code I gave you.
Try to add epochs, change the learning rate, and use a harder environment (such as Frozen-lake with 8x8 tiles). Have fun!
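If you want to try the harder map, the 8x8 version of Frozen Lake ships with Gym under its own environment id. A tiny sketch, assuming the classic gym release:

```python
import gym

# "FrozenLake8x8-v0" is the 8x8 map in classic gym releases; the 4x4 one is "FrozenLake-v0"
env = gym.make("FrozenLake8x8-v0")
print(env.observation_space.n, env.action_space.n)  # 64 states, 4 actions
```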
Next time we'll work on Deep Q-learning, one of the biggest breakthroughs in Deep Reinforcement Learning in 2015. And we'll train an agent that plays Doom and kills enemies!
If you liked my article, please clap below as many times as you liked the article so other people will see this here on Medium. And don't forget to follow me!
If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.
Keep learning, stay awesome!
Deep Reinforcement Learning Course with Tensorflow
Syllabus
Video version
Part 1: An introduction to Reinforcement Learning
Part 2: Diving deeper into Reinforcement Learning with Q-Learning
Part 3: An introduction to Deep Q-Learning: let's play Doom
Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets
Part 4: An introduction to Policy Gradients with Doom and Cartpole
Part 5: An intro to Advantage Actor Critic methods: let's play Sonic the Hedgehog!
Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3
Part 7: Curiosity-Driven Learning made easy Part I
Translated from: https://www.freecodecamp.org/news/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe/