引子

下面給出兩個極其相似的代碼，運行出的時間卻是有很大差別：
代碼一

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>const uint32_t MAX_THREADS = 16;void* ThreadFunc(void* pArg)
{for (int i = 0; i < 1000000000; ++i) // 10億次累加操作{++*(uint64_t*)pArg;}return NULL;
}int main() {static uint64_t aulArr[MAX_THREADS * 8];pthread_t aulThreadID[MAX_THREADS];auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i]));}for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_join(aulThreadID[i], nullptr));}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

耗時： 26396ms

代碼二

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>const uint32_t MAX_THREADS = 16;void* ThreadFunc(void* pArg)
{for (int i = 0; i < 1000000000; ++i) // 10億次累加操作{++*(uint64_t*)pArg;}return NULL;
}int main() {static uint64_t aulArr[MAX_THREADS * 8];pthread_t aulThreadID[MAX_THREADS];auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i * 8]));}for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_join(aulThreadID[i], nullptr));}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

耗時： 6762ms

這兩者的主要差別就在于pthread_create傳入的一個是aulArr[i]一個是aulArr[i * 8]

CPU Cache對于并發的影響

cpu cache在做數據同步的時候，有個最小的單位：cache line，當前主流CPU為64字節。
多個CPU讀寫相同的Cache line的時候需要做一致性同步，多CPU訪問相同的Cache Line地址，數據會被反復寫臟，頻繁進行一致性同步。當多CPU訪問不同的Cache Line地址時，無需一致性同步。
在上面的程序中：
static uint64_t aulArr[MAX_THREADS * 8];
占用的數據長度為：8byte * 8 * 16；
8byte * 8=64byte
程序一，每個線程在當前CPU讀取數據時,訪問的是同一塊cache line
程序二，每個線程在當前CPU讀取數據時，訪問的是不同塊的cache line,避免了對一個流水線的反復擦寫，效率直線提升。
在這里插入圖片描述

讀寫順序對性能的影響

CPU會有一個預讀，順帶著將需要的塊兒旁邊的塊兒一起讀出來放到cache中。所以當我們順序讀的時候就不需要從內存里面讀了，可以直接在緩存里面讀。
順序讀

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字節地址對齊，保證每一塊正好是一個CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){for (int j = 0; j < 64; ++j){result ^= memory[i][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

亂序讀

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字節地址對齊，保證每一塊正好是一個CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){int k = i * 5183 % BLOCK_SIZE;  // 人為打亂順序for (int j = 0; j < 64; ++j){result ^= memory[k][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

順序讀耗時13547ms，隨機亂序讀耗時21395ms。
如果一定要隨機讀的話該怎么優化呢？
如果我們知道我們下一輪讀取的數據，并且不是要立即訪問這個地址的話，使用_mm_prefetch指令優化，告訴CPU提前預讀下一輪循環的cacheline
有關該指令可以參考官方文檔：https://docs.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/84szxsww(v=vs.100)
使用該命令后，再看看運行時間：

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"
#include "xmmintrin.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字節地址對齊，保證每一塊正好是一個CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){int next_k = (i + 1) * 5183 % BLOCK_SIZE;_mm_prefetch(&memory[next_k][0], _MM_HINT_T0);int k = i * 5183 % BLOCK_SIZE;  // 人為打亂順序for (int j = 0; j < 64; ++j){result ^= memory[k][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

從原來的21395ms優化到15291ms

字節對齊對Cache的影響

在2GB內存，int64為單元進行26億次異或。分別測試地址對齊與非對齊在順序訪問和隨機訪問下的耗時

	非地址對齊	地址對齊	耗時比
順序訪問	7.8s	7.7s	1.01:1
隨機訪問	90s	80s	1.125:1

在順序訪問時，Cache命中率高，且CPU預讀，此時差別不大。
在隨機訪問的情況下，Cache命中率幾乎為0，有1/8概率橫跨2個cacheline，此時需讀兩次內存，此時耗時比大概為：7 / 8 * 1 + 1 / 8 * 2 = 1.125
結論就是：
1、cacheline 內部訪問非字節對齊變量差別不大
2、跨cacheline訪問代價主要為額外的內存讀取開銷
所以除了網絡協議以外，避免出現1字節對齊的情況。可以通過調整成員順序，減少內存開銷。

小結

1、多線程盡量避免讀寫相同的cache line內存
2、線程訪問對象盡可能與cacheline地址對齊
3、盡可能對內存做順序讀寫，否則可使用CPU預讀指令
4、變量保持地址對齊

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/news/376670.shtml
繁體地址，請注明出處：http://hk.pswp.cn/news/376670.shtml
英文地址，請注明出處：http://en.pswp.cn/news/376670.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！