perf 中的 cpu-cycles event 介紹
cycles簡介
cycles事件記錄處理器核心執行的時鐘周期數。每個時鐘周期代表處理器內部時鐘振蕩器的一個周期。這個事件通常用于衡量處理器的執行速度,因為它直接反映了指令執行所需的時間。一個較高的cycles計數可能意味著代碼執行較慢,因為需要更多的時鐘周期來完成任務。
instructions 和 cycles的關系
instructions事件則記錄處理器執行的指令數。這通常用來評估指令級別的效率,因為它顯示了程序執行了多少條指令。如果一條指令需要多個時鐘周期來執行,那么instructions與cycles之間的比率可以用來估算指令級的效率。一個較低的instructions/cycle比率表示更高的指令級并行性或更有效的代碼。
在性能分析中,通常會關注這兩個指標的比值,即instructions per cycle (IPC),來評估代碼的執行效率。IPC越高,表示每時鐘周期執行的指令越多,程序的執行效率也就越高。如果一個程序的IPC下降,可能是因為出現了分支預測錯誤、內存訪問延遲或其他性能瓶頸。
perf cycles 分析
看環境是否支持 cycles 采集
[root@localhost ~]# perf list | grep cycles
...cpu-cycles OR cycles [Hardware event]
...
查看當前環境cpu頻率: 2.6GHz
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
2600000
寫一個簡單的程序,讓cpu利用率跑100%:
// test.cpp
int main() {while(1);return 0;
}
[root@localhost ~]# g++ test.cpp -o test
[root@localhost ~]# ./test
[root@localhost ~]# perf stat -e cycles -p `pidof test` sleep 1Performance counter stats for process id '515011':2,601,831,429 cycles1.000756985 seconds time elapsed
可以看出這個值近似等于2.6G
進一步測試
寫一個程序,讓控制cpu利用率在20%左右
#include <iostream>
#include <chrono>
#include <unistd.h>int main() {int ratio = 20;int base_time = 1000;int sleeptime = base_time * (100-ratio);int runtime = base_time * ratio;while(true) {auto start = std::chrono::high_resolution_clock::now();while(std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count() < runtime) {}usleep(sleeptime);}return 0;
}
[root@localhost ~]# perf stat -e cycles -p `pidof test` sleep 1Performance counter stats for process id '515142':520,289,676 cycles1.000767149 seconds time elapsed
520,289,676/2,601,831,429=0.199
可以看出,結合cycles 和 系統頻率可以換算出cpu利用率,利用perf采集各個線程的cycles,可以計算這個線程的負載。
采用perf record 的 方式
控制采集頻率為 50Hz
[root@localhost ~]# perf record -h-a, --all-cpus system-wide collection from all CPUs-c, --count <n> event period to sample-e, --event <event> event selector. use 'perf list' to list available events-F, --freq <freq or 'max'>profile at this frequency
[root@localhost ~]# perf record -e cycles -F 1 -p `pidof test` sleep 50
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (12 samples) ]
[root@localhost ~]# perf script -i perf.datatest 515142 4507231.759994: 1 cycles: ffff50d8eeff379c finish_task_switch+0x74 ([kernel.kallsyms])test 515142 4507231.759996: 1 cycles: ffff50d8eeff379c finish_task_switch+0x74 ([kernel.kallsyms])test 515142 4507231.759997: 1 cycles: ffff50d8eeff379c finish_task_switch+0x74 ([kernel.kallsyms])test 515142 4507231.760015: 45456 cycles: ffff904e03cc __kernel_clock_gettime+0xcc ([vdso])test 515142 4507243.582136: 6176126731 cycles: 400a40 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x58 (/home/test)test 515142 4507248.582382: 2593218597 cycles: 4009d4 std::chrono::duration<long, std::ratio<1l, 1000000l> >::count+0x4 (/home/test)test 515142 4507253.582381: 2592326656 cycles: 400a28 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x40 (/home/test)test 515142 4507258.581960: 2591492485 cycles: ffff904e03f4 __kernel_clock_gettime+0xf4 ([vdso])test 515142 4507263.581284: 2591325615 cycles: 400a28 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x40 (/home/test)test 515142 4507268.580381: 2590221715 cycles: ffff904e03e4 __kernel_clock_gettime+0xe4 ([vdso])test 515142 4507273.501923: 2596489317 cycles: 400964 std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch+0x4 (>test 515142 4507278.502390: 2594059695 cycles: ffff904e040c __kernel_clock_gettime+0x10c ([vdso])
以1Hz的頻率采樣,可以看出,當perf穩定下來后,cycles穩定在 2.59e6,相鄰的數據事件間隔5s,換算過后,也是相當于20%的cpu占用率。但是這似乎與我的預期不符,我的程序1s中實際會在運行狀態下多次,理論上每秒都會采到,采樣率1Hz,cpu利用率,采樣時間50s, 實際的樣本個數才有12個,似乎是 樣本個數約等于采樣間隔 * 采樣頻率 * 線程cpu利用率。
通過此方法計算線程的利用率,必須考慮時間戳,或者計算的周期要比采樣周期大很多,如果采樣1s,每1s計算下占用率,那么就會出現每5s的計算的線程占用率100%,其余是0%;
將程序綁在 core 10 上運行,觀察現象。現象基本一致。
[root@localhost ~]# perf record -e cycles -F 1 -C 10 sleep 50
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.417 MB perf.data (11 samples) ]
[root@localhost ~]# perf script -i perf.datatest 515307 [010] 4509711.079092: 1 cycles: 400a10 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x28 (/home/test)test 515307 [010] 4509711.079094: 1 cycles: 400a10 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x28 (/home/test)test 515307 [010] 4509711.079095: 1 cycles: 400a10 std::chrono::operator-<long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1000000000l> >+0x28 (/home/test)test 515307 [010] 4509711.079121: 67206 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 [010] 4509726.498642: 8000773091 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 [010] 4509731.213642: 2447049195 cycles: fffd5d4503e0 __kernel_clock_gettime+0xe0 ([vdso])test 515307 [010] 4509736.170329: 2574579536 cycles: fffd5d4503e4 __kernel_clock_gettime+0xe4 ([vdso])test 515307 [010] 4509741.066237: 2540232188 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 [010] 4509746.031759: 2575910682 cycles: fffd5d2f9ba0 clock_gettime@plt+0x0 (/usr/lib64/libstdc++.so.6.0.24)test 515307 [010] 4509750.937614: 2546808087 cycles: fffd5d45040c __kernel_clock_gettime+0x10c ([vdso])test 515307 [010] 4509755.894792: 2574924584 cycles: 400878 main+0x74 (/home/test)
以固定周期數采樣
[root@localhost ~]# perf record -e cycles -c 520000000 -p `pidof test` sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (9 samples) ]
[root@localhost ~]# perf script -i perf.datatest 515307 4510198.258890: 520000000 cycles: 400a94 std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000000l> >, std::ratio<1l, 1000l>, long, true, false>::__cast<long, std::ratio<1l, 1000000000l> >>test 515307 4510199.263843: 520000000 cycles: fffd5d4503e4 __kernel_clock_gettime+0xe4 ([vdso])test 515307 4510200.268764: 520000000 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 4510201.273860: 520000000 cycles: fffd5d325304 std::chrono::_V2::system_clock::now+0x4c (/usr/lib64/libstdc++.so.6.0.24)test 515307 4510202.278955: 520000000 cycles: 400a8c std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000000l> >, std::ratio<1l, 1000l>, long, true, false>::__cast<long, std::ratio<1l, 1000000000l> >>test 515307 4510203.275996: 520000000 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 4510204.281150: 520000000 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 4510205.286123: 520000000 cycles: fffd5d4503f4 __kernel_clock_gettime+0xf4 ([vdso])test 515307 4510206.291034: 520000000 cycles: 400ab0 std::chrono::duration<long, std::ratio<1l, 1000000l> >::duration<long, void>+0x14 (/home/test)
這樣每秒都可以采到。