How to apply Reinforcement Learning to real life planning problems
by Sterling Osborne, PhD Researcher
Recently, I have published some examples where I have created Reinforcement Learning models for real-life problems; for example, using Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences.
Reinforcement Learning can be used in this way for a variety of planning problems including travel plans, budget planning and business strategy. The two advantages of using RL are that it takes into account the probability of outcomes and allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.
What is Reinforcement Learning?
Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment by, essentially, trial and error. The model introduces a random policy to start, and each time an action is taken, an amount (known as a reward) is fed back to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.
As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore it finds the best actions in any given state, known as the optimal policy.
Many of the RL applications online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes success and failure of trying different moves.
In real life, it is likely we do not have access to train our model in this way. For example, a recommendation system in online shopping needs a person’s feedback to tell us whether it has succeeded or not, and this is limited in its availability based on how many users interact with the shopping site.
Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observed Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.
Partially Observed Markov Decision Processes (POMDPs)
Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.
POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment but is instead given a set probability distribution based on what we have observed. More info can be found here. We could use value iteration methods on our POMDP, but instead I have decided to use Monte Carlo Learning in this example.
Example Environment
Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste and requires that any pieces of scrap paper must be passed to him at the front of the classroom, where he will place the waste into the bin (trash can).
However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this angers the teacher and those that do this are punished.
This introduces a very basic action-reward concept, and we have an example classroom environment as shown in the following diagram.
Our aim is to find the best instructions for each person so that the paper reaches the teacher and is placed into the bin, rather than being thrown into the bin by a student.
States and Actions
In our environment, each person can be considered a state, and they have a variety of actions they can take with the scrap paper. They may choose to pass it to an adjacent classmate, hold onto it, or some may choose to throw it into the bin. We can therefore map our environment to a more standard grid layout as shown below.
This is purposefully designed so that each person, or state, has four actions: up, down, left or right, and each will have a varied 'real life' outcome based on who took the action. An action directed into a wall (including the black block in the middle) means that the person holds onto the paper. In some cases, this action is duplicated, but that is not an issue in our example.
For example, person A's actions result in the following (see the sketch after this list):
- Up = Throw into bin
- Down = Hold onto paper
- Left = Pass to person B
- Right = Hold onto paper
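To make this mapping concrete, here is a minimal sketch in Python. The state names and outcome strings are illustrative assumptions; only person A's mapping, as listed above, is shown, since the full classroom layout comes from the diagram.

```python
# Illustrative sketch only: the state name "A" and the outcome labels are
# assumptions for clarity; only person A's mapping from the list above is shown.

ACTIONS = ["up", "down", "left", "right"]

# For person A, each of the four grid actions maps to a 'real life' outcome.
person_a_outcomes = {
    "up": "throw into bin",     # nets the negative terminal reward
    "down": "hold onto paper",  # an action into a wall means the person keeps the paper
    "left": "pass to person B",
    "right": "hold onto paper",
}

if __name__ == "__main__":
    for action in ACTIONS:
        print(f"A + {action:>5} -> {person_a_outcomes[action]}")
```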
Probabilistic Environment
For now, the decision maker that partly controls the environment is us. We will tell each person which action they should take. This is known as the policy.
The first challenge I face in my learning is understanding that the environment is likely probabilistic and what this means. In a probabilistic environment, when we instruct a state to take an action under our policy, there is only a probability that the instruction is successfully followed. In other words, if we tell person A to pass the paper to person B, they can decide not to follow the instructed action in our policy and instead throw the scrap paper into the bin.
Another example: if we are recommending online shopping products, there is no guarantee that the person will view each one.
Observed Transitional Probabilities
To find the observed transitional probabilities, we need to collect some sample data about how the environment acts. Before we collect information, we first introduce an initial policy. To start the process, I have randomly chosen one that looks as though it would lead to a positive outcome.
Now we observe the actions each person takes given this policy. In other words, say we sat at the back of the classroom, simply watched the class, and observed the following results for person A:
We see that a paper passed through this person 20 times: 6 times they kept hold of it, 8 times they passed it to person B and another 6 times they threw it in the trash. This means that, under our initial policy, the probability of this person keeping hold of the paper or throwing it in the trash is 6/20 = 0.3 each, and likewise 8/20 = 0.4 of passing it to person B. We can observe the rest of the class to collect the following sample data:
Likewise, we then calculate the probabilities to be the following matrix and we could use this to simulate experience. The accuracy of this model will depend greatly on whether the probabilities are true representations of the whole environment. In other words, we need to make sure we have a sample that is large and rich enough in data.
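As a rough sketch of how such a matrix could be estimated, the snippet below turns person A's observed counts from above into probabilities. The outcome labels and dictionary layout are assumptions for illustration, not the article's notation.

```python
# A minimal sketch of turning observed counts into transition probabilities.
# The counts for person A come from the text (20 observations in total); the
# dictionary keys are illustrative outcome labels.

observed_counts = {
    "hold onto paper": 6,
    "pass to person B": 8,
    "throw into bin": 6,
}

total = sum(observed_counts.values())  # 20 observations in total
transition_probs = {outcome: count / total for outcome, count in observed_counts.items()}

print(transition_probs)
# {'hold onto paper': 0.3, 'pass to person B': 0.4, 'throw into bin': 0.3}
```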
Multi-Armed Bandits, Episodes, Rewards, Return and Discount Rate
So we have our transition probabilities estimated from the sample data under a POMDP. The next step, before we introduce any models, is to introduce rewards. So far, we have only discussed the outcome of the final step: either the paper gets placed in the bin by the teacher and nets a positive reward, or it gets thrown by A or M and nets a negative reward. This final reward that ends the episode is known as the Terminal Reward.
But there is also a third outcome that is less than ideal: the paper continually gets passed around and never reaches the bin (or takes far longer than we would like). Therefore, in summary, we have three final outcomes:
- Paper gets placed in bin by teacher and nets a positive terminal reward
- Paper gets thrown in bin by a student and nets a negative terminal reward
- Paper gets continually passed around the room, or gets stuck on students, for a longer period of time than we would like
To avoid the paper being thrown in the bin we provide this with a large, negative reward, say -1, and because the teacher is pleased with it being placed in the bin this nets a large positive reward, +1. To avoid the outcome where it continually gets passed around the room, we set the reward for all other actions to be a small, negative value, say -0.04.
If we set this to a positive or zero value, then the model may let the paper go round and round, as it would be better to gain small positives than to risk getting close to the negative outcome. This number is also very small because an episode will only collect a single terminal reward but could take many steps to end, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out.
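A minimal sketch of this reward scheme is below; the three event labels are assumptions for illustration, and only the numbers +1, -1 and -0.04 come from the text.

```python
# Sketch of the reward scheme described above. The event labels are assumed;
# the reward values (+1, -1, -0.04) are the ones chosen in the text.

def reward(event: str) -> float:
    if event == "teacher places paper in bin":
        return 1.0    # positive terminal reward
    if event == "student throws paper in bin":
        return -1.0   # negative terminal reward
    return -0.04      # small negative reward for every other step
```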
Please note: the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.
Although we have inadvertently discussed episodes in the example, we have yet to formally define them. An episode is simply the sequence of actions the paper takes through the classroom until it reaches the bin, which is the terminal state and ends the episode. In other examples, such as playing tic-tac-toe, this would be the end of a game where you win or lose.
The paper could in theory start at any state, and this is why we need enough episodes to ensure that every state and action is tested enough that our outcome is not being driven by invalid results. However, on the flip side, the more episodes we introduce the longer the computation time will be and, depending on the scale of the environment, we may not have an unlimited amount of resources to do this.
This is known as the Multi-Armed Bandit problem: with finite time (or other resources), we need to ensure that we test each state-action pair enough that the actions selected in our policy are, in fact, the optimal ones. In other words, we need to validate that actions that have led us to good outcomes in the past are not down to sheer luck but are in fact the correct choice, and likewise for the actions that appear poor. In our example this may seem simple given how few states we have, but imagine if we increased the scale and how this becomes more and more of an issue.
The overall goal of our RL model is to select the actions that maximises the expected cumulative rewards, known as the return. In other words, the Return is simply the total reward obtained for the episode. A simple way to calculate this would be to add up all the rewards, including the terminal reward, in each episode.
A more rigorous approach is to consider the first steps to be more important than later ones in the episode by applying a discount factor, gamma, in the following formula:
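In standard notation (assuming the usual definition of the discounted return, where R is the reward received at each step and gamma is the discount factor), this is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$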
In other words, we sum all the rewards but weigh down later steps by a factor of gamma to the power of how many steps it took to reach them.
If we think about our example, using a discounted return becomes even clearer to imagine as the teacher will reward (or punish accordingly) anyone who was involved in the episode but would scale this based on how far they are from the final outcome.
For example, if the paper passed from A to B to M, who threw it in the bin, M should be punished most, then B for passing it to him, and lastly person A, who is still involved in the final outcome but less so than M or B. This also emphasises that the longer it takes (based on the number of steps) to start in a state and reach the bin, the less it will either be rewarded or punished, but it will accumulate negative rewards for taking more steps.
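For instance, taking the step reward of -0.04, the terminal reward of -1 and, purely for illustration, a discount rate of gamma = 0.5 (the value used in the hand calculations later), the discounted returns for this episode would be:

$$G_M = -1, \qquad G_B = -0.04 + 0.5(-1) = -0.54, \qquad G_A = -0.04 + 0.5(-0.04) + 0.5^2(-1) = -0.31,$$

so M is punished most and A least, exactly as described.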
Applying a Model to our Example
As our example environment is small, we can apply each model, show some of the calculations performed by hand, and illustrate the impact of changing parameters.
For any algorithm, we first need to initialise the state value function, V(s), and we have decided to set each of these to 0, as shown below.
Next, we let the model simulate experience of the environment based on our observed probability distribution. The model starts a piece of paper in a random state, and the outcome of each action under our policy is based on our observed probabilities. So, for example, say the first three simulated episodes are the following:
With these episodes we can calculate our first few updates to our state value function using each of the three models given. For now, we pick arbitrary alpha and gamma values of 0.5 to make our hand calculations simpler. We will show later the impact these parameters have on results.
First, we apply temporal difference 0, or TD(0), the simplest of our models, and the first three value updates are as follows:
So how have these been calculated? Well, because our example is small, we can show the calculations by hand.
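As a minimal sketch of the TD(0) update itself, the snippet below applies it to one hypothetical episode with alpha = gamma = 0.5. The state list and the episode are assumptions for illustration only, since the actual simulated episodes are the ones shown in the figures.

```python
# Sketch of the TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
# The state set and the episode below are illustrative assumptions only.

alpha, gamma = 0.5, 0.5

# state value function initialised to 0 for every state (terminal states included)
V = {s: 0.0 for s in ["A", "B", "C", "D", "G", "M", "Teacher", "Bin"]}

# one hypothetical episode: (state, reward received on leaving it, next state)
episode = [
    ("A", -0.04, "B"),
    ("B", -0.04, "M"),
    ("M", -1.0, "Bin"),  # terminal step: M throws the paper in the bin
]

for state, reward, next_state in episode:
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

print({s: round(v, 2) for s, v in V.items() if s in ("A", "B", "M")})
# With everything initialised to 0 this gives A: -0.02, B: -0.02, M: -0.5
```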
So what can we observe at this early stage? Firstly, using TD(0) appears unfair to some states, for example person D, who, at this stage, has gained nothing from the paper reaching the bin two out of three times. Their update has only been affected by the value of the next state, but this emphasises how the positive and negative rewards propagate outwards from the corner towards the other states.
As we take more episodes, the positive and negative terminal rewards will spread out further and further across all states. This is shown roughly in the diagram below, where we can see that the two episodes that resulted in a positive outcome impact the values of the Teacher and G states, whereas the single negative episode has punished person M.
To show this, we can try more episodes. If we repeat the same three paths already given we produce the following state value function:
(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)
The diagram above shows the terminal rewards propagating outwards from the top right corner to the states. From this, we may decide to update our policy, as it is clear that the negative terminal reward passes through person M and therefore B and C are impacted negatively. Therefore, based on V27, for each state we may decide to update our policy by selecting the next best state value, as shown in the figure below.
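A sketch of that policy update in code might look like the following: for each state, point the policy at the reachable state with the highest current value. The adjacency map and the values used in the usage example are hypothetical placeholders, since the real ones come from the classroom grid and the learned V(s).

```python
# Greedy policy improvement with respect to the current state values V.
# The neighbours mapping and values below are hypothetical placeholders.

def greedy_policy(V: dict, neighbours: dict) -> dict:
    """For each state, choose the reachable state with the largest value."""
    return {state: max(options, key=lambda s: V[s]) for state, options in neighbours.items()}

V = {"A": -0.3, "B": -0.2, "C": 0.1, "Teacher": 0.6}
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "Teacher"]}
print(greedy_policy(V, neighbours))  # {'A': 'B', 'B': 'C', 'C': 'Teacher'}
```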
There are two causes for concern in this example: the first is that person A's best action is to throw the paper into the bin and net a negative reward. This is because none of the episodes have visited this person, which emphasises the multi-armed bandit problem. In this small example there are very few states, so it would require many episodes to visit them all, but we need to ensure this is done.
The reason this action appears better for this person is that neither of the terminal states has a value; rather, the positive and negative outcomes are captured in the terminal rewards. We could then, if our situation required it, initialise V0 with values for the terminal states based on the outcomes.
Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approx.) after the episodes and we need to address why this is happening. This is caused by our learning rate, alpha. For now, we have only introduced our parameters (the learning rate alpha and discount rate gamma) but have not explained in detail how they will impact results.
A large learning rate may cause the results to oscillate, but conversely it should not be so small that it takes forever to converge. This is shown further in the figure below, which plots the total V(s) for every episode: we can clearly see how, although there is a general increasing trend, it is oscillating back and forth between episodes. Another good explanation of the learning rate is as follows:
“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.
So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole.”
(Quote source: "Learning rate of a Q learning agent", stackoverflow.com)
There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached. This is also known as stochastic gradient descent. In a recent RL project, I demonstrated the impact of reducing alpha using an animated visual, and this is shown below. It demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.
Likewise, we must also have our discount rate to be a number between 0 and 1, oftentimes this is taken to be close to 0.9. The discount factor tells us how important rewards in the future are; a large number indicates that they will be considered important whereas moving this towards 0 will make the model consider future steps less and less.
With both of these in mind, we can change both alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9 and we achieve the following results:
Because our learning rate is now much smaller, the model takes longer to learn and the values are generally smaller. Most noticeable is the teacher, which is clearly the best state. However, this trade-off of increased computation time means our value for M is no longer oscillating to the degree it was before. We can now see this in the diagram below for the sum of V(s) following our updated parameters. Although it is not perfectly smooth, the total V(s) slowly increases at a much smoother rate than before and appears to converge as we would like, but requires approximately 75 episodes to do so.
Changing the Goal Outcome
Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment. Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.
However, say the teacher changed and the new one didn’t mind the students throwing the paper in the bin so long as it reached it. Then we can change our negative reward around this and the optimal policy will change.
This is particularly useful for business solutions. For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.
Conclusion
We have now created a simple Reinforcement Learning model from observed data. There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those who wish to try applying it to their own real-life problems.
I hope you enjoyed reading this article, if you have any questions please feel free to comment below.
Thanks
Sterling
Original article: https://www.freecodecamp.org/news/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5/