Using oprofile to Analyze Performance Bottlenecks

1. Overview

oprofile is a powerful performance analysis tool for Linux, similar to Intel VTune.

It supports two sampling modes: event-based sampling and time-based sampling.

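In event-based sampling, oprofile counts occurrences of a specific event (for example, L2 cache misses)
and records one sample each time the count reaches a user-defined threshold. This mode requires hardware
performance counters inside the CPU. Most modern CPUs have them; the Loongson 2E has two built in.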

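In time-based sampling, oprofile relies on the OS timer interrupt and records one sample at every timer
interrupt. This mode exists to support CPUs without performance counters. Its precision is lower than
that of event-based sampling, and because it depends on the timer interrupt, oprofile cannot profile
code that runs with interrupts disabled.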

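On Linux, oprofile consists of two parts: a kernel module (oprofile.ko) and a user-space daemon (oprofiled).
The kernel module accesses the performance counters or registers the time-based sampling function
(via register_timer_hook, so that the timer interrupt handler can call it when it finally runs
profile_tick), and stores the samples in a kernel buffer. The daemon runs in the background, collects
the data from kernel space, and writes it to files.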

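2. Installing oprofile

Taking the Loongson 2E platform as an example: to use oprofile, first boot a kernel with oprofile support
enabled, then install these three packages: oprofile, oprofile-common and oprofile-gui. The core package
is oprofile-common, which includes the following tools: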

        /usr/bin/oprofiled      the daemon
        /usr/bin/opcontrol      control front end that handles user interaction; the most frequently used tool
        /usr/bin/opannotate     annotates source or assembly code with the collected data
        /usr/bin/opreport       produces an overview by binary image or by symbol
        /usr/bin/ophelp         lists the events oprofile supports
        /usr/bin/opgprof        produces profile data in gprof format
        ...

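oprofile has already been ported to the Loongson 2E, including the user-space tool packages, and is ready to use.

A test kernel with oprofile enabled is available at http://people.openrays.org/~comcat/godson/vmlinux-2.6.18-oprofile

The user-space tool deb packages are at: http://people.openrays.org/~comcat/godson/oprofile-0.9.2/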

3. oprofile quick start

a. Initialize

        opcontrol --init

        This command loads the oprofile.ko module and mounts oprofilefs. On success it exports
        files and directories under /dev/oprofile/, such as: cpu_type, dump, enable, pointer_size, stats/

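b. Configure

        Mainly set the event to count, the sample count, and the CPU modes to count in (user mode, kernel mode).

        opcontrol --setup --event=CYCLES:1000::0:1

        This sets the counted event to CYCLES, i.e. processor clock cycles are counted.
        The sample count is 1000, i.e. oprofile takes one sample every 1000 clock cycles.
        Nothing is counted while the processor runs in kernel mode;
        counting happens while it runs in user mode.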

        --event=name:count:unitmask:kernel:user

      name:     event name, e.g. CYCLES or ICACHE_MISSES
      count:    reset counter value e.g. 100000
      unitmask: hardware unit mask e.g. 0x0f
      kernel:   whether to profile kernel: 0 or 1
      user:     whether to profile userspace: 0 or 1

c. Start

        opcontrol --start

d. Run the program to be profiled

        ./ffmpeg -c cif -vcodec mpeg4 -i /root/paris.yuv paris.avi

e. Retrieve the data

        opcontrol --dump
        opcontrol --stop

f. Analyze the results

        opreport -l ./ffmpeg

This produces output like the following:

CPU: GODSON2E, speed 0 MHz (estimated)
Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 10000
samples  %        symbol name
11739    27.0148  pix_abs16_c
6052     13.9274  pix_abs16_xy2_c
4439     10.2154  ff_jpeg_fdct_islow
2574      5.9235  pix_abs16_y2_c
2555      5.8798  dct_quantize_c
2514      5.7854  pix_abs8_c
2358      5.4264  pix_abs16_x2_c
1388      3.1942  diff_pixels_c
964       2.2184  ff_estimate_p_frame_motion
852       1.9607  simple_idct_add
768       1.7674  sse16_c
751       1.7283  ff_epzs_motion_search
735       1.6914  pix_norm1_c
619       1.4245  pix_sum_c
561       1.2910  mpeg4_encode_blocks
558       1.2841  encode_thread
269       0.6190  put_no_rnd_pixels16_c
255       0.5868  dct_unquantize_h263_inter_c

......

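4. Example

oprofile can profile processor cycles, TLB misses, branch mispredictions, cache misses, interrupt handlers, and more.
You can list the events that can be monitored on the current processor with opcontrol --list-events.

The following analyzes a poorly written example:

[cache.c, code with a cache problem]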
+++++++++++++++++++++++++++++++++++++++++++++++

int matrix[2047][7];

void bad_access()
{
    int k, j, sum = 0;

    for(k = 0; k < 7; k++)
        for(j = 0; j < 2047; j++)
            sum += matrix[j][k] * 1024;

}

int main()
{
    int i;

    for(i = 0; i< 100000; i++)
        bad_access();

    return 0;

}

+++++++++++++++++++++++++++++++++++++++++++++++

Compile it: gcc -g cache.c -o cache

Profile it with oprofile:

opcontrol --init

opcontrol --setup --event=DCACHE_MISSES:500::0:1

opcontrol --start && ./cache && opcontrol --dump && opcontrol --stop

The opannotate output is:

/*
* Command line: opannotate --source ./cachee
*
* Interpretation of command line:
* Output annotated source file with samples
* Output all files
*
* CPU: GODSON2E, speed 0 MHz (estimated)
* Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
*/
/*
* Total samples for file : "/comcat/test/pmc.test/cachee.c"
*
*     34 100.000
*/

               :int matrix[2047][7];
               :
               :void bad_access()
               :{ /* bad_access total:     33 97.0588 */
               :    int k, j, sum = 0;
               :
               :    for(k = 0; k < 7; k++)
    33 97.0588 :        for(j = 0; j < 2047; j++)
               :            sum += matrix[j][k] * 1024;
               :
               :}
               :
               :int main()
               :{ /* main total:      1  2.9412 */
               :    int i;
               :
     1  2.9412 :    for(i = 0; i< 10000; i++)
               :                bad_access();
               :
               :    return 0;
               :
               :}
               :

The result reported by opreport is:

GodSonSmall:/comcat/test/pmc.test# opreport -l ./cache
CPU: GODSON2E, speed 0 MHz (estimated)
Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
samples  %        symbol name
33       97.0588  bad_access
1         2.9412  main

As shown, bad_access() accounts for 33 cache miss samples, 97% of the total.

After rewriting bad_access() as good_access():

void good_access()
{
    int k, j, sum = 0;

    for(k = 0; k < 2047; k++)
        for(j = 0; j < 7; j++)
            sum += matrix[k][j] * 1024;

}

CPU: GODSON2E, speed 0 MHz (estimated)
Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
samples  %        symbol name
22       95.6522  good_access
1         4.3478  main

After the change, the number of cache miss samples drops to 22, 95% of the total.
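
Alternatively, you can use gprof: compile your program with -pg -g.

Running the program then produces gmon.out in the current directory.

gprof ./your_program_name

then shows the profile.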

----------------------------------------

oprofile gives more precise results:

opcontrol --reset
opcontrol --init
opcontrol --setup --event=CYCLES:1000
opcontrol --start && ./your_program_name && opcontrol --dump && opcontrol --stop

opreport -l ./your_program_name

then shows the results. With oprofile you only need to add -g when compiling.
