Explained Simply: How an AI program mastered the ancient game of Go

by Aman Agarwal

This is about AlphaGo, Google DeepMind’s Go playing AI that shook the technology world in 2016 by defeating one of the best players in the world, Lee Sedol.

Go is an ancient board game which has so many possible moves at each step that future positions are hard to predict — and therefore it requires strong intuition and abstract thinking to play. Because of this reason, it was believed that only humans could be good at playing Go. Most researchers thought that it would still take decades to build an AI which could think like that. In fact, I’m releasing this essay today because this week (March 8–15) marks the two-year anniversary of the AlphaGo vs Sedol match!

But AlphaGo didn’t stop there. 8 months later, it played 60 professional games on a Go website disguised as a player named “Master”, and won every single game against dozens of world champions, of course without resting between games.

Naturally this was a HUGE achievement in the field of AI and sparked worldwide discussions about whether we should be excited or worried about artificial intelligence.

Today we are going to take the original research paper published by DeepMind in the Nature journal, and break it down paragraph-by-paragraph using simple English.

After this essay, you’ll know very clearly what AlphaGo is, and how it works. I also hope that after reading this you will not believe all the news headlines made by journalists to scare you about AI, and instead feel excited about it.

Worrying about the growing achievements of AI is like worrying about the growing abilities of Microsoft Powerpoint. Yes, it will get better with time with new features being added to it, but it can’t just uncontrollably grow into some kind of Hollywood monster.

You DON’T need to know how to play Go to understand this paper. In fact, I myself have only read the first 3–4 lines in Wikipedia’s opening paragraph about it. Instead, surprisingly, I use some examples from basic Chess to explain the algorithms. You just have to know what a 2-player board game is, in which each player takes turns and there is one winner at the end. Beyond that you don’t need to know any physics or advanced math or anything.

This will make it more approachable for people who only just now started learning about machine learning or neural networks. And especially for those who don’t use English as their first language (which can make it very difficult to read such papers).

If you have NO prior knowledge of AI and neural networks, you can read the “Deep Learning” section of one of my previous essays here. After reading that, you’ll be able to get through this essay.

If you want to get a shallow understanding of Reinforcement Learning too (optional reading), you can find it here.

Here’s the original paper if you want to try reading it:

As for me: Hi I’m Aman, an AI and autonomous robots engineer. I hope that my work will save you a lot of time and effort if you were to study this on your own.

Do you speak Japanese? Ryohji Ikebe has kindly written a brief memo about this essay in Japanese, in a series of Tweets.

Let’s get started!

Abstract

As you know, the goal of this research was to train an AI program to play Go at the level of world-class professional human players.

To understand this challenge, let me first talk about something similar done for Chess. In the 1990s, IBM came out with the Deep Blue computer, which defeated the great champion Garry Kasparov at Chess in 1997. (He’s also a very cool guy, make sure to read more about him later!) How did Deep Blue play?

Well, it used a very brute force method. At each step of the game, it took a look at all the possible legal moves that could be played, and went ahead to explore each and every move to see what would happen. And it would keep exploring move after move for a while, forming a kind of HUGE decision tree of thousands of moves. And then it would come back along that tree, observing which moves seemed most likely to bring a good result. But, what do we mean by “good result”? Well, Deep Blue had many carefully designed chess strategies built into it by expert chess players to help it make better decisions — for example, how to decide whether to protect the king or get advantage somewhere else? They made a specific “evaluation algorithm” for this purpose, to compare how advantageous or disadvantageous different board positions are (IBM hard-coded expert chess strategies into this evaluation function). And finally it chooses a carefully calculated move. On the next turn, it basically goes through the whole thing again.
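
To make the brute-force idea concrete, here is a minimal depth-limited minimax sketch in Python. It is only an illustration of the general approach, not Deep Blue’s actual method (the real system relied on heavy pruning and custom hardware), and `legal_moves`, `apply_move` and `evaluate` are hypothetical stand-ins for a real move generator and for IBM’s hand-crafted evaluation function.

```python
# A toy depth-limited minimax: explore every legal move, score the leaf positions
# with a hand-crafted evaluation function, and back the scores up the tree.
# legal_moves(pos), apply_move(pos, m) and evaluate(pos) are hypothetical helpers,
# not a real chess engine's API.

def minimax(pos, depth, maximizing, legal_moves, apply_move, evaluate):
    moves = legal_moves(pos)
    if depth == 0 or not moves:
        return evaluate(pos)                      # "how good does this board look?"
    children = (minimax(apply_move(pos, m), depth - 1, not maximizing,
                        legal_moves, apply_move, evaluate) for m in moves)
    return max(children) if maximizing else min(children)

def best_move(pos, depth, legal_moves, apply_move, evaluate):
    # Pick the move whose subtree backs up the best score for us.
    return max(legal_moves(pos),
               key=lambda m: minimax(apply_move(pos, m), depth - 1, False,
                                     legal_moves, apply_move, evaluate))
```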

As you can see, this means Deep Blue thought about millions of theoretical positions before playing each move. This was not so impressive in terms of the AI software of Deep Blue, but rather in the hardware — IBM claimed it to be one of the most powerful computers available in the market at that time. It could look at 200 million board positions per second.

Now we come to Go. Just believe me that this game is much more open-ended, and if you tried the Deep Blue strategy on Go, you wouldn’t be able to play well. There would be SO MANY positions to look at at each step that it would simply be impractical for a computer to go through that hell. For example, at the opening move in Chess there are 20 possible moves. In Go the first player has 361 possible moves, and this scope of choices stays wide throughout the game.

This is what they mean by “enormous search space.” Moreover, in Go, it’s not so easy to judge how advantageous or disadvantageous a particular board position is at any specific point in the game — you kinda have to play the whole game for a while before you can determine who is winning. But let’s say you magically had a way to do both of these. And that’s where deep learning comes in!

So in this research, DeepMind used neural networks to do both of these tasks (if you have never read about neural networks yet, here’s the link again). They trained a “policy neural network” to decide which are the most sensible moves in a particular board position (so it’s like following an intuitive strategy to pick moves from any position). And they trained a “value neural network” to estimate how advantageous a particular board arrangement is for the player (or in other words, how likely you are to win the game from this position). They trained these neural networks first with human game examples (your good old ordinary supervised learning). After this the AI was able to mimic human playing to a certain degree, so it acted like a weak human player. And then to train the networks even further, they made the AI play against itself millions of times (this is the “reinforcement learning” part). With this, the AI got better because it had more practice.

With these two networks alone, DeepMind’s AI was able to play well against state-of-the-art Go playing programs that other researchers had built before. These other programs had used an already popular pre-existing game playing algorithm, called the “Monte Carlo Tree Search” (MCTS). More about this later.

But guess what, we still haven’t talked about the real deal. DeepMind’s AI isn’t just about the policy and value networks. It doesn’t use these two networks as a replacement of the Monte Carlo Tree Search. Instead, it uses the neural networks to make the MCTS algorithm work better… and it got so much better that it reached superhuman levels. THIS improved variation of MCTS is “AlphaGo”, the AI that beat Lee Sedol and went down in AI history as one of the greatest breakthroughs ever. So essentially, AlphaGo is simply an improved implementation of a very ordinary computer science algorithm. Do you understand now why AI in its current form is absolutely nothing to be scared of?

Wow, we’ve spent a lot of time on the Abstract alone.

Alright — to understand the paper from this point on, first we’ll talk about a gaming strategy called the Monte Carlo Tree Search algorithm. For now, I’ll just explain this algorithm at enough depth to make sense of this essay. But if you want to learn about it in depth, some smart people have also made excellent videos and blog posts on this:

1. A short video series from Udacity
2. Jeff Bradberry’s explanation of MCTS
3. An MCTS tutorial by Fullstack Academy

The following section is long, but easy to understand (I’ll try my best) and VERY important, so stay with me! The rest of the essay will go much quicker.

Let’s talk about the first paragraph of the essay above. Remember what I said about Deep Blue making a huge tree of millions of board positions and moves at each step of the game? You had to do simulations and look at and compare each and every possible move. As I said before, that was a simple and very straightforward approach — if the average software engineer had to design a game playing AI, and had all the strongest computers of the world, he or she would probably design a similar solution.

But let’s think about how humans themselves play chess. Let’s say you’re at a particular board position in the middle of the game. By game rules, you can do a dozen different things — move this pawn here, move the queen two squares here or three squares there, and so on. But do you really make a list of all the possible moves you can make with all your pieces, and then select one move from this long list? No — you “intuitively” narrow down to a few key moves (let’s say you come up with 3 sensible moves) that you think make sense, and then you wonder what will happen in the game if you chose one of these 3 moves. You might spend 15–20 seconds considering each of these 3 moves and their future — and note that during these 15 seconds you don’t have to carefully plan out the future of each move; you can just “roll out” a few mental moves guided by your intuition without TOO much careful thought (well, a good player would think farther and more deeply than an average player). This is because you have limited time, and you can’t accurately predict what your opponent will do at each step in that lovely future you’re cooking up in your brain. So you’ll just have to let your gut feeling guide you. I’ll refer to this part of the thinking process as “rollout”, so take note of it! So after “rolling out” your few sensible moves, you finally say screw it and just play the move you find best.

Then the opponent makes a move. It might be a move you had already well anticipated, which means you are now pretty confident about what you need to do next. You don’t have to spend too much time on the rollouts again. OR, it could be that your opponent hits you with a pretty cool move that you had not expected, so you have to be even more careful with your next move. This is how the game carries on, and as it gets closer and closer to the finishing point, it gets easier for you to predict the outcome of your moves — so your rollouts don’t take as much time.

The purpose of this long story is to describe what the MCTS algorithm does on a superficial level — it mimics the above thinking process by building a “search tree” of moves and positions every time. Again, for more details you should check out the links I mentioned earlier. The innovation here is that instead of going through all the possible moves at each position (which Deep Blue did), it intelligently selects a small set of sensible moves and explores those instead. To explore them, it “rolls out” the future of each of these moves and compares them based on their imagined outcomes. (Seriously — this is all I think you need to understand this essay.)

Now — coming back to the screenshot from the paper. Go is a “perfect information game” (please read the definition in the link, don’t worry it’s not scary). And theoretically, for such games, no matter which particular position you are at in the game (even if you have just played 1–2 moves), it is possible that you can correctly guess who will win or lose (assuming that both players play “perfectly” from that point on). I have no idea who came up with this theory, but it is a fundamental assumption in this research project and it works.

So that means, given a state of the game s, there is a function v*(s) which can predict the outcome, let’s say the probability of you winning this game, from 0 to 1. They call it the “optimal value function”. Because some board positions are more likely to result in you winning than other board positions, they can be considered more “valuable” than the others. Let me say it again: Value = probability between 0 and 1 of you winning the game.

But wait — say there was a girl named Foma sitting next to you while you play Chess, and she keeps telling you at each step if you’re winning or losing. “You’re winning… You’re losing… Nope, still losing…” I think it wouldn’t help you much in choosing which move you need to make. She would also be quite annoying. What would instead help you is if you drew the whole tree of all the possible moves you can make, and the states that those moves would lead to — and then Foma would tell you for the entire tree which states are winning states and which states are losing states. Then you can choose moves which will keep leading you to winning states. All of a sudden Foma is your partner in crime, not an annoying friend. Here, Foma behaves as your optimal value function v*(s). Earlier, it was believed that it’s not possible to have an accurate value function like Foma for the game of Go, because the games had so much uncertainty.

BUT — even if you had the wonderful Foma, this wonderland strategy of drawing out all the possible positions for Foma to evaluate will not work very well in the real world. In a game like Chess or Go, as we said before, if you try to imagine even 7–8 moves into the future, there can be so many possible positions that you don’t have enough time to check all of them with Foma.

So Foma is not enough. You need to narrow down the list of moves to a few sensible moves that you can roll out into the future. How will your program do that? Enter Lusha. Lusha is a skilled Chess player and enthusiast who has spent decades watching grand masters play Chess against each other. She can look at your board position, look quickly at all the available moves you can make, and tell you how likely it would be that a Chess expert would make any of those moves if they were sitting at your table. So if you have 50 possible moves at a point, Lusha will tell you the probability that each move would be picked by an expert. Of course, a few sensible moves will have a much higher probability and other pointless moves will have very little probability. For example: in Chess, if your Queen is in danger in one corner of the board, you might still have the option to move a little pawn in another corner, but Lusha would give that pawn move a very low probability. She is your policy function, p(a|s). For a given state s, she can give you probabilities for all the possible moves that an expert would make.

Wow — you can take Lusha’s help to guide you in how to select a few sensible moves, and Foma will tell you the likelihood of winning from each of those moves. You can choose the move that both Foma and Lusha approve. Or, if you want to be extra careful, you can roll out the moves selected by Lusha, have Foma evaluate them, pick a few of them to roll out further into the future, and keep letting Foma and Lusha help you predict VERY far into the game’s future — much quicker and more efficient than to go through all the moves at each step into the future. THIS is what they mean by “reducing the search space”. Use a value function (Foma) to predict outcomes, and use a policy function (Lusha) to give you grand-master probabilities to help narrow down the moves you roll out. These are called “Monte Carlo rollouts”. Then while you backtrack from future to present, you can take average values of all the different moves you rolled out, and pick the most suitable action. So far, this has only worked on a weak amateur level in Go, because the policy functions and value functions that they used to guide these rollouts weren’t that great.
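
Here is a rough Python sketch of the “narrow down with Lusha, then roll out and average” idea. Every function name in it (`policy`, `fast_policy`, `play`, `is_over`, `winner`) is a hypothetical placeholder for the pieces described above, and the numbers of candidate moves and rollouts are arbitrary choices of mine.

```python
import random

# Hypothetical stand-ins for the pieces described above (none of these names come
# from the paper): policy(pos) -> {move: prob} ("Lusha"), fast_policy(pos) ->
# {move: prob} (a quicker, cheaper version), play(pos, move) -> new position,
# is_over(pos) -> bool, winner(pos) -> 1.0 if we won the finished game, else 0.0.

def rollout(pos, fast_policy, play, is_over, winner):
    """Play the position out to the end with the fast policy and report the result."""
    while not is_over(pos):
        moves = fast_policy(pos)
        move = random.choices(list(moves), weights=list(moves.values()))[0]
        pos = play(pos, move)
    return winner(pos)

def choose_move(pos, policy, fast_policy, play, is_over, winner,
                top_k=3, n_rollouts=20):
    priors = policy(pos)
    # "Narrow down the search": keep only the few moves an expert would consider.
    candidates = sorted(priors, key=priors.get, reverse=True)[:top_k]
    averages = {}
    for move in candidates:
        results = [rollout(play(pos, move), fast_policy, play, is_over, winner)
                   for _ in range(n_rollouts)]
        averages[move] = sum(results) / len(results)   # average outcome of those futures
    return max(averages, key=averages.get)
```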

Phew.

The first line is self explanatory. In MCTS, you can start with an unskilled Foma and unskilled Lusha. The more you play, the better they get at predicting solid outcomes and moves. “Narrowing the search to a beam of high probability actions” is just a sophisticated way of saying, “Lusha helps you narrow down the moves you need to roll out by assigning them probabilities that an expert would play them”. Prior work has used this technique to achieve strong amateur level AI players, even with simple (or “shallow” as they call it) policy functions.

Yeah, convolutional neural networks are great for image processing. And since a neural network takes a particular input and gives an output, it is essentially a function, right? So a neural network can serve as a very complex function. So you can just pass in an image of the board position and let the neural network figure out by itself what’s going on. This means it’s possible to create neural networks which will behave like VERY accurate policy and value functions. The rest is pretty self explanatory.

Here we discuss how Foma and Lusha were trained. To train the policy network (predicting for a given position which moves experts would pick), you simply use examples of human games and use them as data for good old supervised learning.

And you want to train another slightly different version of this policy network to use for rollouts; this one will be smaller and faster. Let’s just say that since Lusha is so experienced, she takes some time to process each position. She’s good to start the narrowing-down process with, but if you try to make her repeat the process, she’ll still take a little too much time. So you train a *faster policy network* for the rollout process (I’ll call it… Lusha’s younger brother Jerry? I know I know, enough with these names). After that, once you’ve trained both of the slow and fast policy networks enough using human player data, you can try letting Lusha play against herself on a Go board for a few days, and get more practice. This is the reinforcement learning part — making a better version of the policy network.

Then, you train Foma for value prediction: determining the probability of you winning. You let the AI practice through playing itself again and again in a simulated environment, observe the end result each time, and learn from its mistakes to get better and better.

I won’t go into details of how these networks are trained. You can read more technical details in the later section of the paper (‘Methods’) which I haven’t covered here. In fact, the real purpose of this particular paper is not to show how they used reinforcement learning on these neural networks. One of DeepMind’s previous papers, in which they taught AI to play ATARI games, has already discussed some reinforcement learning techniques in depth (and I’ve already written an explanation of that paper here). For this paper, as I lightly mentioned in the Abstract and also underlined in the screenshot above, the biggest innovation was the fact that they used RL with neural networks for improving an already popular game-playing algorithm, MCTS. RL is a cool tool in a toolbox that they used to fine-tune the policy and value function neural networks after the regular supervised training. This research paper is about proving how versatile and excellent this tool is, not about teaching you how to use it. In television lingo, the Atari paper was an RL infomercial and this AlphaGo paper is a commercial.

Alright, we’re finally done with the “introduction” parts. By now you already have a very good feel for what AlphaGo was all about.

Next, we’ll go slightly deeper into each thing we discussed above. You might see some ugly and dangerous looking mathematical equations and expressions, but they’re simple (I explain them all). Relax.

A quick note before you move on. Would you like to help me write more such essays explaining cool research papers? If you’re serious, I’d be glad to work with you. Please leave a comment and I’ll get in touch with you.

So, the first step is in training our policy NN (Lusha), to predict which moves are likely to be played by an expert. This NN’s goal is to allow the AI to play similar to an expert human. This is a convolutional neural network (as I mentioned before, it’s a special kind of NN that is very useful in image processing) that takes in a simplified image of a board arrangement. “Rectifier nonlinearities” are layers that can be added to the network’s architecture. They give it the ability to learn more complex things. If you’ve ever trained NNs before, you might have used the “ReLU” layer. That’s what these are.
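
As a rough picture of what such a network might look like, here is a toy-sized convolutional policy network in PyTorch. The real SL policy network is much bigger (13 layers, with many more input feature planes and filters); the sizes below are my own placeholder choices, not the paper’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPolicyNet(nn.Module):
    def __init__(self, planes=4, filters=32, hidden_layers=3):
        super().__init__()
        self.stem = nn.Conv2d(planes, filters, kernel_size=5, padding=2)
        self.body = nn.ModuleList(
            [nn.Conv2d(filters, filters, kernel_size=3, padding=1)
             for _ in range(hidden_layers)]
        )
        self.head = nn.Conv2d(filters, 1, kernel_size=1)   # one score per board point

    def forward(self, board):                    # board: (batch, planes, 19, 19)
        x = F.relu(self.stem(board))             # "rectifier nonlinearity" = ReLU
        for conv in self.body:
            x = F.relu(conv(x))
        logits = self.head(x).flatten(1)         # (batch, 361)
        return F.softmax(logits, dim=1)          # a probability for every move

probs = ToyPolicyNet()(torch.zeros(1, 4, 19, 19))
print(probs.shape, float(probs.sum()))           # torch.Size([1, 361]) 1.0
```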

The training data here was in the form of randomly sampled board positions, and the labels were the moves chosen by human players in those positions. Just regular supervised learning.

Here they use “stochastic gradient ASCENT”. This is the same gradient-based optimisation you already know from backpropagation, except that instead of descending a loss you climb a reward. Here, you’re trying to maximise a reward function, and the reward function is just the probability of the action predicted by a human expert; you want to increase this probability. But hey — you don’t really need to think too much about this. Normally you train the network so that it minimises a loss function, which is essentially the error/difference between the predicted outcome and the actual label. That is called gradient DESCENT. In the actual implementation of this research paper, they have indeed used regular gradient descent. You can easily find a loss function that behaves opposite to the reward function, such that minimising this loss will maximise the reward.
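
For instance, here is one supervised training step sketched in PyTorch with a stand-in linear “policy” (an assumption, not the paper’s architecture). Descending the cross-entropy loss on the expert’s move is the same update as ascending the reward log p(expert move | board):

```python
import torch
import torch.nn as nn

# A stand-in "policy": flatten the board planes and map them to 361 move logits.
policy = nn.Sequential(nn.Flatten(), nn.Linear(4 * 19 * 19, 361))
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

boards = torch.zeros(8, 4, 19, 19)           # a fake mini-batch of positions (assumption)
expert_moves = torch.randint(0, 361, (8,))   # the moves human players actually chose

logits = policy(boards)
# Cross-entropy is the negative log-likelihood of the expert move, so descending this
# loss is exactly the update that ascends the "reward" log p(expert move | board).
loss = nn.functional.cross_entropy(logits, expert_moves)
opt.zero_grad()
loss.backward()
opt.step()
```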

The policy network has 13 layers, and is called “SL policy” network (SL = supervised learning). The data came from a… I’ll just say it’s a popular website on which millions of people play Go. How good did this SL policy network perform?

It was more accurate than what other researchers had done earlier. The rest of the paragraph is quite self-explanatory. As for the “rollout policy”, you do remember from a few paragraphs ago how Lusha, the SL policy network, is slow, so it can’t integrate well with the MCTS algorithm? And how we trained another faster version of Lusha called Jerry, her younger brother? Well, this refers to Jerry right here. As you can see, Jerry is just half as accurate as Lusha, BUT he’s thousands of times faster! He will really help us get through the rolled-out simulations of the future more quickly when we apply MCTS.

For this next section, you don’t *have* to know about Reinforcement Learning already, but then you’ll have to assume that whatever I say works. If you really want to dig into details and make sure of everything, you might want to read a little about RL first.

Once you have the SL network, trained in a supervised manner on human move data, as I said before, you have to let her practice by herself and get better. That’s what we’re doing here. So you just take the SL policy network, save it in a file, and make another copy of it.

Then you use reinforcement learning to fine-tune it. Here, you make the network play against itself and learn from the outcomes.

But there’s a problem in this training style.

If you only forever practice against ONE opponent, and that opponent is also only practicing with you exclusively, there’s not much of new learning you can do. You’ll just be training to practice how to beat THAT ONE player. This is, you guessed it, overfitting: your techniques play well against one opponent, but don’t generalize well to other opponents. So how do you fix this?

Well, every time you fine-tune a neural network, it becomes a slightly different kind of player. So you can save this version of the neural network in a list of “players”, who all behave slightly differently right? Great — now while training the neural network, you can randomly make it play against many different older and newer versions of the opponent, chosen from that list. They are versions of the same player, but they all play slightly differently. And the more you train, the MORE players you get to train even more with! Bingo!
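
A minimal sketch of this “pool of past selves” idea, assuming hypothetical helpers `play_one_game(current, opponent)` that returns the game outcome and `update_policy(current, outcome)` that applies the reinforcement-learning update; the snapshot interval is an arbitrary placeholder.

```python
import copy
import random

def self_play_training(current, play_one_game, update_policy,
                       n_games=1000, snapshot_every=50):
    pool = [copy.deepcopy(current)]                  # seed the pool with the starting player
    for game in range(n_games):
        opponent = random.choice(pool)               # a random older or newer version of itself
        outcome = play_one_game(current, opponent)   # e.g. +1 for a win, -1 for a loss
        update_policy(current, outcome)              # the reinforcement-learning step
        if (game + 1) % snapshot_every == 0:
            pool.append(copy.deepcopy(current))      # freeze a copy, growing the pool
    return current
```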

In this training, the only thing guiding the training process is the ultimate goal, i.e. winning or losing. You don’t need to specially train the network to do things like capture more area on the board etc. You just give it all the possible legal moves it can choose from, and say, “you have to win”. And this is why RL is so versatile; it can be used to train policy or value networks for any game, not just Go.
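
To show how a bare win/lose signal can drive learning, here is a minimal REINFORCE-style update in PyTorch. The stand-in linear policy, the batch of 30 positions and the ±1 outcome are all my own placeholders; the paper’s actual training procedure is more involved.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Flatten(), nn.Linear(4 * 19 * 19, 361))   # stand-in policy
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

boards = torch.zeros(30, 4, 19, 19)          # the positions we faced in one self-play game
moves_played = torch.randint(0, 361, (30,))  # the moves we chose in those positions
z = 1.0                                      # final outcome: +1.0 for a win, -1.0 for a loss

log_probs = torch.log_softmax(policy(boards), dim=1)[torch.arange(30), moves_played]
loss = -(z * log_probs).mean()   # a win reinforces these moves, a loss discourages them
opt.zero_grad()
loss.backward()
opt.step()
```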

Here, they tested how accurate this RL policy network was, just by itself without any MCTS algorithm. As you would remember, this network can directly take a board position and decide how an expert would play it — so you can use it to single-handedly play games. Well, the result was that the RL fine-tuned network won against the SL network that was only trained on human moves. It also won against other strong Go playing programs.

Must note here that even before training this RL policy network, the SL policy network was already better than the state of the art — and now, it has further improved! And we haven’t even come to the other parts of the process like the value network.

Did you know that baby penguins can sneeze louder than a dog can bark? Actually that’s not true, but I thought you’d like a little joke here to distract from the scary-looking equations above. Coming to the essay again: we’re done training Lusha here. Now back to Foma — remember the “optimal value function” v*(s)? It only tells you how likely you are to win in your current board position if both players play perfectly from that point on. So obviously, to train an NN to become our value function, we would need a perfect player… which we don’t have. So we just use our strongest player, which happens to be our RL policy network.

It takes the current board state s, and outputs the probability that you will win the game. You play a game and get to know the outcome (win or loss). Each of the game states acts as a data sample, and the outcome of that game acts as the label. So by playing a 50-move game, you have 50 data samples for value prediction.

Lol, no. This approach is naive. You can’t use all 50 moves from the game and add them to the dataset.

The training data set had to be chosen carefully to avoid overfitting. Each move in the game is very similar to the next one, because you only move once and that gives you a new position, right? If you take the states at all 50 of those moves and add them to the training data with the same label, you basically have lots of “kinda duplicate” data, and that causes overfitting. To prevent this, you choose only very distinct-looking game states. So for example, instead of all 50 moves of a game, you only choose 5 of them and add them to the training set. DeepMind took 30 million positions from 30 million different games, to reduce any chances of there being duplicate data. And it worked!
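
A tiny sketch of that de-duplication idea, assuming `games` is a list of `(positions, outcome)` pairs collected from self-play (my own data layout, not the paper’s):

```python
import random

def build_value_dataset(games, positions_per_game=1):
    """games: list of (positions, outcome) pairs; outcome is the final result of that game."""
    dataset = []
    for positions, outcome in games:
        # Keep only a couple of distinct positions per game, all labelled with its outcome,
        # so the network never sees 50 near-identical samples from the same game.
        for pos in random.sample(positions, k=min(positions_per_game, len(positions))):
            dataset.append((pos, outcome))
    return dataset
```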

Now, something conceptual here: there are two ways to evaluate the value of a board position. One option is a magical optimal value function (like the one you trained above). The other option is to simply roll out into the future using your current policy (Lusha) and look at the final outcome in this roll out. Obviously, the real game would rarely go by your plans. But DeepMind compared how both of these options do. You can also do a mixture of both these options. We will learn about this “mixing parameter” a little bit later, so make a mental note of this concept!

Well, your single neural network trying to approximate the optimal value function is EVEN BETTER than doing thousands of mental simulations using a rollout policy! Foma really kicked ass here. When they replaced the fast rollout policy with the twice-as-accurate (but slow) RL policy Lusha, and did thousands of simulations with that, it did better than Foma. But only slightly better, and too slowly. So Foma is the winner of this competition, she has proved that she can’t be replaced.

Now that we have trained the policy and value functions, we can combine them with MCTS and give birth to our former world champion, destroyer of grand masters, the breakthrough of a generation, weighing two hundred and sixty eight pounds, one and only Alphaaaaa GO!

In this section, ideally you should have a slightly deeper understanding of the inner workings of the MCTS algorithm, but what you have learned so far should be enough to give you a good feel for what’s going on here. The only thing you should note is how we’re using the policy probabilities and value estimations. We combine them during roll outs, to narrow down the number of moves we want to roll out at each step. Q(s,a) represents the running value estimate of playing move a from position s, and u(s,a) is a bonus based on the stored prior probability of that move. I’ll explain.

Remember that the policy network uses supervised learning to predict expert moves? And it doesn’t just give you the most likely move, but rather gives you probabilities for each possible move that tell how likely it is to be an expert move. This probability can be stored for each of those actions. Here they call it “prior probability”, and they obviously use it while selecting which actions to explore. So basically, to decide whether or not to explore a particular move, you consider two things: First, by playing this move, how likely are you to win? Yes, we already have our “value network” to answer this first question. And the second question is, how likely is it that an expert would choose this move? (If a move is super unlikely to be chosen by an expert, why even waste time considering it? This we get from the policy network.)
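
Putting those two questions together, the selection step could be sketched like this. The exact constant and normalisation in AlphaGo’s formula differ a little, so treat this as the shape of the rule rather than the paper’s precise equation:

```python
def select_action(actions, Q, N, P, c_puct=5.0):
    """actions: legal moves at this node; Q, N, P: dicts keyed by move."""
    def score(a):
        u = c_puct * P[a] / (1 + N[a])   # high prior and few visits -> worth exploring
        return Q[a] + u                  # "is it good?" plus "would an expert try it?"
    return max(actions, key=score)
```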

Then let’s talk about the “mixing parameter” (see, we came back to it!). As discussed earlier, to evaluate positions, you have two options: one, simply use the value network you have been using to evaluate states all along. And two, you can try to quickly play a rollout game with your current strategy (assuming the other player will play similarly), and see if you win or lose. We saw how the value function was better than doing rollouts in general. Here they combine both. You try giving each prediction 50–50 importance, or 40–60, or 0–100, and so on. If you attach a weight of X% to the first, you’ll have to attach (100−X)% to the second. That’s what this mixing parameter means. You’ll see these hit-and-trial results later in the paper.
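
In code, the mixing parameter is just a weighted average of the two evaluations. The default `lam = 0.5` mirrors the “equal importance” setting mentioned later in the paper, and the variable names are mine:

```python
def leaf_evaluation(value_estimate, rollout_result, lam=0.5):
    # lam = 0.0 trusts only the value network (Foma);
    # lam = 1.0 trusts only the fast-rollout result (Jerry playing the game out).
    return (1 - lam) * value_estimate + lam * rollout_result
```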

After each roll out, you update your search tree with whatever information you gained during the simulation, so that your next simulation is more intelligent. And at the end of all simulations, you just pick the best move.
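
The bookkeeping after each simulation can be sketched as below. The dictionaries `N`, `W` and `Q` keyed by (state, action) are my own simplification, and picking the most-visited move at the root is how I read the final choice; the paper’s exact criterion may differ in detail.

```python
def backup(path, leaf_value, N, W, Q):
    """path: the (state, action) pairs visited during this simulation."""
    for s, a in path:
        N[(s, a)] += 1
        W[(s, a)] += leaf_value              # accumulate every evaluation seen so far
        Q[(s, a)] = W[(s, a)] / N[(s, a)]    # running average value of this move

def final_move(root_state, root_actions, N):
    # After all the simulations, play the move that was explored the most.
    return max(root_actions, key=lambda a: N[(root_state, a)])
```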

Interesting insight here!

Remember how the RL fine-tuned policy NN was better than just the SL human-trained policy NN? But when you put them within the MCTS algorithm of AlphaGo, using the human trained NN proved to be a better choice than the fine-tuned NN. But in the case of the value function (which you would remember uses a strong player to approximate a perfect player), training Foma using the RL policy works better than training her with the SL policy.

“Doing all this evaluation takes a lot of computing power. We really had to bring out the big guns to be able to run these damn programs.”

Self explanatory.

“LOL, our program literally blew the pants off of every other program that came before us”

This goes back to that “mixing parameter” again. While evaluating positions, giving equal importance to both the value function and the rollouts performed better than just using one of them. The rest is self explanatory, and reveals an interesting insight!

Self explanatory.

Self explanatory. But read that red underlined sentence again. I hope you can see clearly now that this line right here is pretty much the summary of what this whole research project was all about.

Concluding paragraph. “Let us brag a little more here because we deserve it!” :)

Oh and if you’re a scientist or tech company, and need some help in explaining your science to non-technical people for marketing, PR or training etc, I can help you. Drop me a message on Twitter: @mngrwl

Translated from: https://www.freecodecamp.org/news/explained-simply-how-an-ai-program-mastered-the-ancient-game-of-go-62b8940a9080/
