Linux 詳細介紹strace命令

system call(系統調用)是程序向內核請求服務的一種編程方式，strace是一個功能強大的工具，可以跟蹤用戶進程和 Linux 內核之間的交互。

要了解操作系統如何工作，首先需要了解系統調用如何工作。操作系統的主要功能之一是為用戶程序提供了一個抽象。

操作系統大致可以分為兩種模式：

內核模式(Kernel mode:）：操作系統內核使用的特權且強大的模式
用戶模式(User mode)：大多數用戶應用程序運行的地方
用戶主要使用命令行程序和圖形用戶界面 (GUI) 來完成日常任務。系統調用在后臺默默工作，與內核交互以完成工作。

system call(系統調用)與function call(函數調用)非常相似，都接受并處理參數和返回值。唯一的區別是system call進入內核，而function call則不進入內核。從用戶空間切換到內核空間是使用特殊的trap機制完成的。

下面將通過一些通用命令來使用 strace 分析每個命令進行的系統調用，并探索一些實際示例。范例將使用 Red Hat Enterprise Linux，這些命令在其他 Linux 發行版上的工作方式應該也相同：

[root@sandbox ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]# 
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#

首先，確保系統上安裝了所需的工具。使用下面的 RPM 命令驗證是否安裝了 strace；并使用 -V 選項檢查 strace 實用程序版本號：

[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]# 
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#

如果沒有安裝，使用如下命令安裝

yum install strace

基于演示的目的，在 /tmp 中創建一個測試目錄，并使用 touch 命令創建兩個文件：

[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]# 
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]# 
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#

我使用 /tmp 目錄是因為每個人都可以訪問它，其實也可以選擇其他目錄。）

驗證文件是否是在 testdir 目錄上使用 ls 命令創建的：

[root@sandbox tmp]# ls testdir/
file1  file2
[root@sandbox tmp]#

可能每天都使用 ls 命令，但其實他的工作是基于到系統調用之上。它的工作原理如下：
Command-line utility -> Invokes functions from system libraries (glibc) -> Invokes system calls
ls 命令在 Linux 上內部調用系統庫（又名 glibc）中的函數。這些庫調用完成大部分工作的系統調用。
如果想知道從 glibc 庫調用了哪些函數，可以使用 ltrace 命令，后跟常規 ls testdir/ 命令：

ltrace ls testdir/

如果沒有安裝，可以使用如下命令安裝

yum install ltrace

一堆輸出將被轉儲到屏幕上；不用擔心——只要跟著做就可以了。 ltrace 命令輸出中與本示例相關的一些重要庫函數包括：

opendir("testdir/")                                  = { 3 }
readdir({ 3 })                                       = { 101879119, "." }
readdir({ 3 })                                       = { 134, ".." }
readdir({ 3 })                                       = { 101879120, "file1" }
strlen("file1")                                      = 5
memcpy(0x1665be0, "file1\0", 6)                      = 0x1665be0
readdir({ 3 })                                       = { 101879122, "file2" }
strlen("file2")                                      = 5
memcpy(0x166dcb0, "file2\0", 6)                      = 0x166dcb0
readdir({ 3 })                                       = nil
closedir({ 3 })

通過查看上面的輸出，可能可以理解發生了什么。 opendir 庫函數正在打開一個名為 testdir 的目錄，然后調用 readdir 函數來讀取該目錄的內容。最后，調用 closedir 函數，該函數關閉之前打開的目錄。暫時忽略其他 strlen 和 memcpy 函數。

可以看到正在調用哪些庫函數，但本文將重點討論由系統庫函數調用的系統調用。

與上面類似，要了解調用了哪些系統調用，只需將 strace 放在 ls testdir 命令之前即可，如下所示。再次，一堆亂碼將被轉儲到您的屏幕上

[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL)                               = 0x1f12000
<<< truncated strace output >>>
write(1, "file1  file2\n", 13file1  file2
)          = 13
close(1)                                = 0
munmap(0x7fd002c8d000, 4096)            = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
[root@sandbox tmp]#

運行 strace 命令后屏幕上的輸出只是為運行 ls 命令而進行的系統調用。每個系統調用都有特定的操作系統用途，它們可以大致分為以下幾部分：
1.Process management system calls
2.File management system calls
3.Directory and filesystem management system calls
4.Other system calls
分析轉儲到屏幕上的信息的一種更簡單的方法是使用 strace 的 -o 選項將輸出記錄到文件中。在 -o 標志后添加合適的文件名并再次運行命令：

[root@sandbox tmp]# strace -o trace.log ls testdir/
file1  file2
[root@sandbox tmp]#

這次，沒有輸出轉儲到屏幕上 - ls 命令按預期工作，顯示文件名并將所有輸出記錄到文件 trace.log 中。僅一個簡單的 ls 命令，該文件就有近 100 行內容：

[root@sandbox tmp]# ls -l trace.log 
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]# 
[root@sandbox tmp]# wc -l trace.log 
114 trace.log
[root@sandbox tmp]#

看一下示例的trace.log 中的第一行：

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0

1.該行的 execve 是正在執行的系統調用的名稱。
2.括號內的文本是提供給系統調用的參數。
3.= 符號后面的數字（在本例中為 0）是 execve 系統調用返回的值。

這只是范本解釋，可以應用相同的邏輯來理解其他行。

現在，將注意力集中到調用的單個命令，即 ls testdir。知道命令 ls 使用的目錄名稱，grep testdir trace.log 詳細查看每一行結果：

[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#

回顧一下上面對execve的分析，能看出這個系統調用是做什么的嗎？

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0

不需要記住所有的系統調用或它們的作用，因為您可以在需要時參考文檔。手冊頁來救援！在運行 man 命令之前確保安裝了以下軟件包：

[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#

請記住，需要在 man 命令和需要查詢的系統調用名稱之間添加 2。如果使用 man man 閱讀 man 的手冊頁，可以看到第 2 部分是為系統調用保留的。同樣，如果需要庫函數的信息，則需要在man和庫函數名之間添加3。

[root@sandbox tmp]# man man1   Executable programs or shell commands2   System calls (functions provided by the kernel)3   Library calls (functions within program libraries)4   Special files (usually found in /dev)5   File formats and conventions eg /etc/passwd6   Games7   Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)8   System administration commands (usually only for root)9   Kernel routines [Non standard]

以下 man 命令查看上面范例中涉及的execve系統調用的文檔：

man 2 execve

輸出如下：

EXECVE(2)                  Linux Programmer’s Manual                 EXECVE(2)DESCRIPTIONexecve()  executes  the  program  pointed to by filename.  filename must be either a binary executable, or a script starting with a line of the form "#! interpreter [arg]".  In the latter case, the interpreter must be avalid pathname for an executable which is not itself a script, which will be invoked as interpreter [arg] filename

根據 execve 手冊頁說名，這個execve系統調用會執行一個作為參數傳入的程序（在本例中為 ls）。還可以向 ls 提供其他參數，例如本示例中的 testdir。因此，該系統調用僅以 testdir 作為參數運行 ls：

下一個名為 stat 的系統調用使用 testdir 參數：

stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0

使用 man 2 stat 訪問文檔。 stat 是獲取文件狀態的系統調用 - 請記住，Linux 中的所有內容都是文件，包括目錄。

接下來，openat 系統調用打開 testdir。留意返回的 3。這是文件描述，后面的系統調用會用到：

openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3

到目前為止，一切都很好。現在，打開trace.log 文件并轉到openat 系統調用后面的行。將看到 getdents 系統調用被調用，它執行執行 ls testdir 命令所需的大部分操作。現在，從trace.log 文件中grep getdents：

[root@sandbox tmp]# grep getdents trace.log 
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
[root@sandbox tmp]#

man getdents,我們將得知getdents = get directory entries，這里注意，getdents 的參數是 3，它是上面 openat 系統調用中的文件描述符。

現在您已經有了目錄列表，需要一種在終端中顯示它的方法。因此，grep 另一個系統調用 write，用于在將輸出的內容寫入終端：

[root@sandbox tmp]# grep write trace.log
write(1, "file1  file2\n", 13)          = 13
[root@sandbox tmp]#

在這些參數中，您可以看到將顯示的文件名：file1 和 file2。關于第一個參數（1），請記住在 Linux 中，當任何進程運行時，默認情況下會為其打開三個文件描述符。以下是默認的文件描述符：
0 - Standard input
1 - Standard out
2 - Standard error

因此，write 系統調用正在"1 - Standard out"上顯示 file1 和 file2，默認將輸出到顯示屏幕上，由 1 標識。

現在知道哪些系統調用完成了 ls testdir/ 命令的大部分工作。但是trace.log 文件中的其他100 多個系統調用又如何呢？操作系統必須執行大量內部工作才能運行進程，因此在日志文件中看到的很多內容都是initialization 和cleanup。閱讀整個trace.log 文件并嘗試了解發生了什么使ls 命令正常工作。

現在知道如何分析給定命令的系統調用，可以將此知識用于其他命令來了解正在執行哪些系統調用。 strace 提供了許多有用的命令行選項，使strace過程更輕松，下面介紹了其中一些選項。

默認情況下，strace 不包含所有系統調用信息。但是，它有一個方便的 -v verbose 選項，可以提供有關每個系統調用的附加信息：

strace -v ls testdir

運行 strace 命令時始終使用 -f 選項是一個很好的做法。它允許 strace 跟蹤當前正在跟蹤的進程創建的任何子進程：

strace -f ls testdir

假設只需要系統調用的名稱、它們運行的次數以及每個系統調用所花費的時間百分比。您可以使用 -c 選項來獲取這些統計信息：

strace -c ls testdir/

假設想專注于特定的系統調用，例如專注于open系統調用而忽略其余的。您可以使用 -e 選項后跟系統調用名稱：

[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1  file2
+++ exited with 0 +++
[root@sandbox tmp]#

如果您想專注于多個系統調用，可以使用相同的 -e 命令行選項，并在兩個系統調用之間使用逗號。例如，要查看 write 和 getdents 系統調用：

[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
write(1, "file1  file2\n", 13file1  file2
)          = 13
+++ exited with 0 +++
[root@sandbox tmp]#

到目前為止的示例都是顯式地跟蹤運行的命令。但是那些已經運行并且正在執行的命令呢？例如，如果想跟蹤只是長時間運行的進程的daemon(守護進程），該怎么辦？為此，strace 提供了一個特殊的 -p 選項，可以向其提供進程 ID。

替代在一個守護進程上運行strace命令，這里我們以cat 命令為例進行演示，如果提供文件名給cat命令作為參數，這個命令通常會顯示文件的內容。如果沒有給出參數，cat 命令只是在終端等待用戶輸入文本。輸入文本后，它會重復給定的文本，直到用戶按 Ctrl+C 退出。

從一個終端運行 cat 命令；它會顯示一個提示，然后只需等待（記住 cat 仍在運行并且尚未退出）：

[root@sandbox tmp]# cat

從另一個終端，使用 ps 命令查找進程標識符 (PID)：

[root@sandbox ~]# ps -ef | grep cat
root      22443  20164  0 14:19 pts/0    00:00:00 cat
root      22482  20300  0 14:20 pts/1    00:00:00 grep --color=auto cat
[root@sandbox ~]#

現在，使用 -p 選項和 PID（上面使用 ps 找到的）對正在運行的進程運行 strace。運行 strace 后，輸出會顯示進程所附加的內容以及 PID 號。現在，strace 正在跟蹤 cat 命令發出的系統調用。您看到的第一個系統調用是 read，它正在等待來自 0 或標準輸入的輸入，這是運行 cat 命令的終端：

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,

現在，返回到運行 cat 命令的終端并輸入一些文本。我輸入 x0x0 是出于演示目的。請注意 cat 是如何簡單地重復我輸入的內容的；因此，x0x0 出現兩次。我輸入第一個，第二個是 cat 命令重復的輸出：

[root@sandbox tmp]# cat
x0x0
x0x0

返回到 strace 連接到 cat 進程的終端。現在會看到兩個額外的系統調用：之前的 read 系統調用，現在在終端中讀取 x0x0，另一個用于 write，它將 x0x0 寫回終端，還有一個新的 read，它正在等待從終端讀取。請注意，標準輸入 (0) 和標準輸出 (1) 均位于同一終端中：

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0\n", 65536)                = 5
write(1, "x0x0\n", 5)                   = 5
read(0,

想象一下，當對守護進程運行 strace 以查看它在后臺執行的所有操作時，這有多么有用。按 Ctrl+C 退出cat 命令；這也將會終止strace 會話，因為該進程不再運行。
如果想查看所有系統調用的時間戳，只需將 -t 選項與 strace 結合使用即可：

14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL)                      = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

如果想知道系統調用之間所花費的時間， strace 有一個方便的 -r 命令，可以顯示執行每個系統調用所花費的時間。非常有用

[root@sandbox ~]#strace -r ls testdir/
0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL)                 = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3