zt (repost): Cache Coherency Primer

http://www.infoq.com/cn/articles/cache-coherency-primer

http://www.cnblogs.com/xybaby/p/6641928.html

English original:

http://www.tuicool.com/articles/BVRNZbV

=========================================

Notes by yxr:

1) Having studied IBM's CPU acceleration (CAPI), which talks about memory coherency, I needed to understand what cache coherency is and what it buys you, and how that differs from acceleration through an accelerator card, in order to understand the architectural advantages IBM claims.

2) The article is easy to follow, concise, and builds up step by step, and the Chinese translation is faithful, which makes it a very good primer. The Chinese translation and the English original are both referenced here for study.

3) Summary:

The article first covers where caches come from, what a cache line is (a cache line really refers to an aligned block of memory, which is not necessarily resident in any cache), and the invariants that cache accesses obey. It then points out that the coherency problem, which coherency protocols address, arises because multiple cores bring multiple caches that access memory in write-back mode. As the most common solution it presents snooping protocols, whose physical basis is that all caches can observe accesses on the memory interface at the same time. To achieve cache coherency, the MESI states are introduced (four letters for four states), and the cores communicate with one another to carry out the state transitions that keep the caches consistent.

===========================================

The text below is "Cache coherency primer" by Fabian "ryg" Giesen, a programmer at RAD Game Tools, originally published on his blog; the Chinese translation linked above was shared on the InfoQ China site with the author's permission. The series has two parts, and this is the first.

Original article: http://fgiesen.wordpress.com/2014/07/07/cache-coherency/

I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to a L2 cache – which in turn talks to memory. Or maybe to a L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
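For concreteness, here is a tiny sketch of the address arithmetic implied by that definition, assuming a 64-byte line size purely for illustration (as noted above, 32- and 128-byte lines also exist):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed line size for this sketch; as noted above, 32- and 128-byte lines also exist.
constexpr std::uintptr_t kLineSize = 64;

// Base address of the cache line containing 'addr' (round down to a line boundary).
constexpr std::uintptr_t line_base(std::uintptr_t addr) { return addr & ~(kLineSize - 1); }

// Offset of 'addr' within its cache line.
constexpr std::uintptr_t line_offset(std::uintptr_t addr) { return addr & (kLineSize - 1); }

int main() {
    int x = 0;
    auto a = reinterpret_cast<std::uintptr_t>(&x);
    std::printf("addr=%#zx  line=%#zx  offset=%zu\n",
                static_cast<std::size_t>(a),
                static_cast<std::size_t>(line_base(a)),
                static_cast<std::size_t>(line_offset(a)));
}
```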

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.
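To get a feel for the granularity, the following sketch (again assuming 64-byte lines) counts how many distinct cache lines a sequential walk over an array touches; the point is simply that thousands of accesses map onto only a few hundred line fills:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    constexpr std::uintptr_t kLineSize = 64;            // assumed line size
    std::vector<float> data(4096);                      // 16 KiB worth of floats

    std::unordered_set<std::uintptr_t> lines_touched;   // distinct line base addresses
    for (const float& f : data) {
        auto addr = reinterpret_cast<std::uintptr_t>(&f);
        lines_touched.insert(addr & ~(kLineSize - 1));  // round down to the line boundary
    }
    // 16384 bytes / 64 bytes per line: roughly 256 line fills for 4096 element accesses.
    std::printf("%zu accesses touched %zu cache lines\n", data.size(), lines_touched.size());
}
```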

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things get a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which point their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.

Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).
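To make the two policies concrete, here is a minimal sketch of the bookkeeping just described (not a real cache design, just the dirty-flag logic, with stand-in functions for the next level down):

```cpp
#include <array>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumed line size for this sketch

struct CacheLine {
    std::uintptr_t base = 0;   // address of the memory this line mirrors
    bool valid = false;
    bool dirty = false;        // only meaningful for the write-back policy
    std::array<std::uint8_t, kLineSize> data{};
};

// Stand-ins for the next level (L2 or memory); a real cache would issue a bus transaction.
void write_to_next_level(std::uintptr_t addr, std::uint8_t value) { (void)addr; (void)value; }
void write_back_line(const CacheLine& line) { (void)line; }

// Write-through: the store goes straight down; the local copy (if any) is just kept in sync.
void store_write_through(CacheLine& line, std::size_t offset, std::uint8_t value) {
    write_to_next_level(line.base + offset, value);
    if (line.valid) line.data[offset] = value;
}

// Write-back: only the local copy changes now; memory is updated later, on write-back.
void store_write_back(CacheLine& line, std::size_t offset, std::uint8_t value) {
    line.data[offset] = value;
    line.dirty = true;
}

// Evicting a dirty line forces the deferred write-back first.
void evict(CacheLine& line) {
    if (line.valid && line.dirty) write_back_line(line);
    line.valid = line.dirty = false;
}
```

The write-through path keeps memory up to date on every store; the write-back path defers that cost until eviction.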

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.
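As a toy model of the snooping idea (deliberately reduced to invalidation only, nothing like a real bus protocol): every cache registered on the shared bus observes every write transaction, and drops its own copy of the affected line whenever somebody else is the writer.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy model: each "cache" only remembers which line addresses it currently holds.
struct ToyCache {
    std::unordered_set<std::uintptr_t> lines;

    // Called for every write transaction this cache observes on the shared bus.
    void snoop_write(std::uintptr_t line, const ToyCache* writer) {
        if (writer != this) lines.erase(line);  // someone else wrote: our copy is now stale
    }
};

// The shared bus: every registered cache sees every transaction.
struct ToyBus {
    std::vector<ToyCache*> caches;

    void broadcast_write(std::uintptr_t line, ToyCache* writer) {
        for (ToyCache* c : caches) c->snoop_write(line, writer);
        writer->lines.insert(line);  // the writer keeps (or acquires) its own copy
    }
};
```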

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
  • Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
  • Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
  • Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.
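A compact sketch of the two rules just described, namely that snooped requests downgrade or invalidate our copy and that local writes first need exclusivity, might look like this (a toy model, nowhere near a full protocol specification):

```cpp
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// How *our* copy of a line reacts when we snoop another core's bus request for it.
Mesi on_snooped_request(Mesi state, bool other_wants_to_write) {
    if (state == Mesi::Invalid) return state;        // we don't hold it: nothing to do
    if (other_wants_to_write) return Mesi::Invalid;  // they need exclusivity: drop our copy
    // Another core merely reads: E and M fall back to Shared
    // (an M line also writes its data back to memory at this point).
    return Mesi::Shared;
}

// What has to happen before our own core may write the line.
Mesi on_local_write(Mesi state, bool exclusivity_granted) {
    switch (state) {
        case Mesi::Modified:
        case Mesi::Exclusive:
            return Mesi::Modified;                   // already exclusive: just dirty it
        case Mesi::Shared:
        case Mesi::Invalid:
            // Must first put a "read for ownership" / invalidate request on the bus and
            // wait until every other copy has been invalidated before the write proceeds.
            return exclusivity_granted ? Mesi::Modified : state;
    }
    return state;  // unreachable, keeps compilers happy
}
```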

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions. Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other cores’ caches.

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.
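A practical aside of my own, not part of the original article: because ownership is acquired for a whole cache line, two cores that repeatedly write to different variables that happen to share a line will keep pulling that line away from each other, which is known as false sharing. A common mitigation is to align or pad hot per-thread data to the (assumed) line size:

```cpp
#include <atomic>
#include <thread>

// Two per-thread counters. Without the alignment they would typically land in the same
// cache line, and every increment would force the line's ownership to move between cores.
struct alignas(64) PaddedCounter {   // 64 is an assumed line size, as elsewhere in this post
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

void worker(int idx) {
    for (long i = 0; i < 10000000; ++i)
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}
```

With the alignas removed, the two counters typically end up in one line and the same program usually runs noticeably slower, even though the threads never touch the same variable.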

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and waits for it to complete before sending the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.
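The last point above is what the classic store-buffer litmus test makes visible. Here is a sketch using C++11 atomics with relaxed ordering, so that what you observe is (mostly) the hardware's own ordering; whether the interesting outcome actually shows up depends on the CPU and is in no way guaranteed by this sketch:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() {
    x.store(1, std::memory_order_relaxed);   // may sit in t1's store buffer for a while
    r1 = y.load(std::memory_order_relaxed);  // can still read 0 at this point
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // Under sequential consistency at least one of r1, r2 must be 1. With store buffering
    // (and relaxed ordering also permits compiler reordering) r1 == r2 == 0 is allowed.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```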

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.
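In C++ terms, those "appropriate memory barriers" correspond to explicit fences or to stronger orderings on the atomic operations themselves. A sketch of how the first thread of the litmus test above could be repaired (the second thread needs the same treatment):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

int t1_fixed() {
    x.store(1, std::memory_order_relaxed);
    // Full barrier: the buffered store must become globally visible before the
    // following load is allowed to execute (both threads need the same fence).
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_relaxed);
}

// Equivalently, making both operations seq_cst provides the same guarantee:
//   x.store(1, std::memory_order_seq_cst);
//   return y.load(std::memory_order_seq_cst);
```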

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking”. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!

Reposted from: https://www.cnblogs.com/e-shannon/p/6677627.html

