Introduction to Linux Seccomp

Table of Contents

1. Introduction
2. Architecture
3. Original/Strict Mode
4. Seccomp-bpf
5. The seccomp System Call
6. Linux Capabilities and Seccomp
- 6.1 Linux Capabilities
- 6.2 Linux Seccomp
References

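1. Introduction

Seccomp (secure computing) is a computer security facility in the Linux kernel, first introduced in kernel version 2.6.12 as a simple sandboxing mechanism. It allows a process to make a one-way transition into a "secure" state in which it can make no system calls except exit(), sigreturn(), and read() and write() on already-open file descriptors. If the process attempts any other system call, the kernel either just logs the event or terminates the process with SIGKILL or SIGSYS.
Over time seccomp was extended: instead of a fixed and very limited set of system calls, it became a filtering mechanism that lets a process specify an arbitrary system call filter, expressed as a Berkeley Packet Filter program, to forbid particular system calls. This can be used to implement different kinds of security mechanisms; for example, the Linux version of the Chromium web browser uses it to run plugins in a sandbox.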

// kernel/seccomp.c
/*
 * Secure computing mode 1 allows only read/write/exit/sigreturn.
 * To be fully secure this must be combined with rlimit
 * to limit the stack allocations too.
 */
static const int mode1_syscalls[] = {
    __NR_seccomp_read, __NR_seccomp_write, __NR_seccomp_exit, __NR_seccomp_sigreturn,
    -1, /* negative terminated */
};

static void __secure_computing_strict(int this_syscall)
{
    const int *allowed_syscalls = mode1_syscalls;

    do {
        if (*allowed_syscalls == this_syscall)
            return;
    } while (*++allowed_syscalls != -1);

    seccomp_log(this_syscall, SIGKILL, SECCOMP_RET_KILL_THREAD, true);
    do_exit(SIGKILL);
}

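In this sense, seccomp does not virtualize system resources; it isolates the process from them entirely. It achieves this isolation by restricting the system calls the process can make, which reduces the attack surface the process exposes.

Seccomp is commonly used in environments with strict security requirements, such as application sandboxes and container runtimes. By enforcing limits on system call usage, it can significantly strengthen the security of a system.

Seccomp mode is enabled through the prctl(2) system call with the PR_SET_SECCOMP argument or, since Linux 3.17, through the seccomp(2) system call. In older kernels (Linux 2.6.12 to 2.6.22), seccomp mode was enabled by writing to the file /proc/self/seccomp; Linux 2.6.23 replaced this method with prctl(). In some kernel versions, seccomp disables the RDTSC x86 instruction, which returns the number of processor cycles elapsed since power-on and is used for high-precision timing.

seccomp-bpf is an extension to seccomp that allows system calls to be filtered with a configurable policy implemented as Berkeley Packet Filter rules. It is used by OpenSSH and vsftpd, as well as by the Google Chrome/Chromium web browsers on ChromeOS and Linux. In this respect seccomp-bpf achieves functionality similar to the older systrace, which appears to be no longer supported on Linux, but with more flexibility and higher performance.

The policy is defined by BPF (Berkeley Packet Filter) rules, which allow fine-grained filtering and control of system calls. The rules can be written to match the needs and security policy of each process, so developers can define custom system call policies.

Kernel support depends on build-time configuration. The Seccomp field of /proc/pid/status (available since Linux 3.8) reports the seccomp mode of a process: 0 means SECCOMP_MODE_DISABLED, 1 means SECCOMP_MODE_STRICT, and 2 means SECCOMP_MODE_FILTER. The field is provided only if the kernel was built with the CONFIG_SECCOMP option enabled; filter mode additionally requires CONFIG_SECCOMP_FILTER.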

CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
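
Whether filter mode is available can also be probed at run time, without reading the kernel configuration. The following is a minimal sketch, assuming the widely used trick of passing a NULL filter pointer: nothing can be installed from a NULL pointer, so a kernel with filter support is expected to fail with EFAULT, while a kernel without it fails with EINVAL. The function name seccomp_filter_supported is hypothetical and not part of any library.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: returns 1 if the running kernel appears to support
 * seccomp filter mode, 0 otherwise. The NULL filter pointer guarantees that
 * no filter is actually installed by this probe.
 */
int seccomp_filter_supported(void)
{
    errno = 0;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, NULL, 0, 0) == -1 &&
        errno == EFAULT)
        return 1;   /* the kernel tried to read the filter: mode is known */
    return 0;       /* EINVAL or anything else: treat as unsupported */
}
```

A caller might use the result to decide whether to install a filter or fall back to another hardening mechanism.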

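In a process's /proc/pid/status file, the Seccomp field shows the seccomp mode currently in effect for that process. Its values are:

(1) 0: seccomp is disabled for the process; no seccomp mode has been enabled.
(2) 1: the process is in strict mode (SECCOMP_MODE_STRICT). In this mode it can make only a small fixed set of system calls.
(3) 2: the process is in filter mode (SECCOMP_MODE_FILTER). In this mode a BPF filter defines which system calls it may make.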

# cat /proc/2/status | grep -i seccomp
Seccomp:        0
Seccomp_filters:        0
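
The same field can be read programmatically. Below is a small sketch, not taken from the original article: parse_seccomp_mode and current_seccomp_mode are hypothetical helper names, and the 8192-byte buffer is an assumption that the status file fits in it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan status-file text line by line and return the
 * value of the "Seccomp:" field (0, 1 or 2), or -1 if the field is absent.
 * The "Seccomp_filters:" line does not match because of the ':' literal.
 */
int parse_seccomp_mode(const char *status_text)
{
    const char *p = status_text;

    while (p != NULL && *p != '\0') {
        int mode;

        if (sscanf(p, "Seccomp: %d", &mode) == 1)
            return mode;
        p = strchr(p, '\n');    /* advance to the next line, if any */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Hypothetical helper: report the seccomp mode of the calling process. */
int current_seccomp_mode(void)
{
    char buf[8192];             /* assumed large enough for the file */
    FILE *f = fopen("/proc/self/status", "r");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

Calling current_seccomp_mode() in a process that has installed a filter would be expected to return 2, provided the filter still allows the system calls needed to read the file.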

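Linux 2.6.12: seccomp first introduced.
Linux 3.5: seccomp BPF support added for x86-64 (ARM since Linux 3.8, ARM64 since Linux 3.19).
Linux 3.17: the seccomp() system call added.

2. Architecture

The basic idea of seccomp is simple. First, the process sets its seccomp policy to strict mode or filter mode. This causes the kernel to record the seccomp state in the process's task_struct. If the process chose filter mode, the kernel also adds the supplied program to the list of seccomp filters in task_struct. From then on, the kernel checks every system call the process makes against its seccomp state.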

// include/linux/seccomp.h
/**
 * struct seccomp - the state of a seccomp'ed process
 *
 * @mode:  indicates one of the valid values above for controlled
 *         system calls available to a process.
 * @filter: must always point to a valid seccomp-filter or NULL as it is
 *          accessed without locking during system call entry.
 *
 *          @filter must only be accessed from the context of current as there
 *          is no read locking.
 */
struct seccomp {
    int mode;
    atomic_t filter_count;
    struct seccomp_filter *filter;
};

// include/linux/sched.h
struct task_struct {
    ......
    struct seccomp seccomp;
    ......
}

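(1) The process sets its seccomp policy: the process chooses whether it will run in strict mode or in filter mode. This is normally done by calling the seccomp() system call with the appropriate arguments.

(2) The kernel records the seccomp state: when the process sets its policy, the kernel updates the task_struct to indicate that seccomp is enabled for the process. The kernel uses this state to decide whether system calls made by the process must be filtered.

(3) The kernel adds the program to the seccomp filter list (filter mode): if the process chose filter mode, it supplies a BPF program that defines which system calls are allowed and how they are filtered. The kernel adds this program to the seccomp filter list associated with the task_struct.

(4) The kernel checks each system call against the seccomp state: when the process makes a system call, the kernel consults the seccomp state associated with the process. In strict mode, the kernel enforces the predefined allow-list and rejects every other system call. In filter mode, the kernel runs the BPF program supplied by the process and allows or rejects the system call according to the program's result.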


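3. Original/Strict Mode

In this mode, seccomp permits only the exit(), sigreturn(), read() and write() system calls, the latter two on already-open file descriptors. If the process makes any other system call, it is terminated with SIGKILL.

Seccomp mode is enabled through the prctl(2) system call with the PR_SET_SECCOMP argument or, since Linux 3.17, through the seccomp(2) system call.

Taking the prctl(2) system call as an example: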

NAME
       prctl - operations on a process

SYNOPSIS
       #include <sys/prctl.h>

       int prctl(int option, unsigned long arg2, unsigned long arg3,
                 unsigned long arg4, unsigned long arg5);

DESCRIPTION
       prctl() is called with a first argument describing what to do (with
       values defined in <linux/prctl.h>), and further arguments with a
       significance depending on the first one.  The first argument can be:

       PR_SET_SECCOMP (since Linux 2.6.23)
              Set the secure computing (seccomp) mode for the calling thread,
              to limit the available system calls.  The more recent seccomp(2)
              system call provides a superset of the functionality of
              PR_SET_SECCOMP.

              The seccomp mode is selected via arg2.  (The seccomp constants
              are defined in <linux/seccomp.h>.)

              With arg2 set to SECCOMP_MODE_STRICT, the only system calls that
              the thread is permitted to make are read(2), write(2), _exit(2)
              (but not exit_group(2)), and sigreturn(2).  Other system calls
              result in the delivery of a SIGKILL signal.  Strict secure
              computing mode is useful for number-crunching applications that
              may need to execute untrusted byte code, perhaps obtained by
              reading from a pipe or socket.  This operation is available only
              if the kernel is configured with CONFIG_SECCOMP enabled.

              With arg2 set to SECCOMP_MODE_FILTER (since Linux 3.5), the
              system calls allowed are defined by a pointer to a Berkeley
              Packet Filter passed in arg3.  This argument is a pointer to
              struct sock_fprog; it can be designed to filter arbitrary system
              calls and system call arguments.  This mode is available only if
              the kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              If SECCOMP_MODE_FILTER filters permit fork(2), then the seccomp
              mode is inherited by children created by fork(2); if execve(2)
              is permitted, then the seccomp mode is preserved across
              execve(2).  If the filters permit prctl() calls, then additional
              filters can be added; they are run in order until the first
              non-allow result is seen.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

int main(int argc, char **argv)
{
    /* Open the file before entering strict mode; O_CREAT makes sure it exists. */
    int output = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char *val = "test";

    /* Enable strict seccomp mode. */
    printf("Calling prctl() to set seccomp strict mode...\n");
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

    /* Allowed: write(2) on a file descriptor that was already open. */
    printf("Writing to an already open file...\n");
    write(output, val, strlen(val) + 1);

    /* Not allowed: opening a file is outside the strict-mode allow-list. */
    printf("Trying to open file for reading...\n");
    int input = open("output.txt", O_RDONLY);

    printf("You will not see this message--the process will be killed first\n");
    return 0;
}

# gcc seccomp.c
# ./a.out
Calling prctl() to set seccomp strict mode...
Writing to an already open file...
Trying to open file for reading...
Killed
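
The same behaviour can be observed from a parent process instead of from the shell. The sketch below is an illustration rather than part of the original example: run_strict_child is a hypothetical helper that forks a child, puts it in strict mode, and reports how the child ended. It also shows that a strict-mode process has to leave through the raw exit system call, because the exit_group(2) call used by a normal return from main is not on the allow-list.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper. Runs a child in strict mode.
 * misbehave != 0: the child makes a forbidden system call (getpid).
 * misbehave == 0: the child leaves through the raw exit system call.
 * Returns the terminating signal number if the child was killed,
 * 0 if it exited with status 0, and -1 on any other outcome.
 */
int run_strict_child(int misbehave)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
            _exit(2);               /* could not enter strict mode */
        if (misbehave)
            syscall(SYS_getpid);    /* not on the allow-list: SIGKILL */
        syscall(SYS_exit, 0);       /* exit is allowed, exit_group is not */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

With this helper, run_strict_child(1) is expected to return SIGKILL, matching the "Killed" line printed by the shell above.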

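4. Seccomp-bpf

A large number of system calls are exposed to every userland process, and many of them go unused for the entire lifetime of the process. As system calls change and mature, bugs are found and eradicated. A certain subset of userland applications benefits from having a reduced set of available system calls, because the reduced set shrinks the total kernel surface exposed to the application. System call filtering is meant for use with those applications.

Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: the system call number and the system call arguments. This allows expressive filtering of system calls using a filter program language that has a long history of exposure to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey to time-of-check-time-of-use (TOCTOU) attacks, which are common in system call interposition frameworks. BPF programs may not dereference pointers, which constrains all filters to evaluating the system call arguments directly.

System call filtering is not a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface, and it is meant to be a tool for sandbox developers to use. Beyond that, policy for logical behavior and information flow should be managed with a combination of other system hardening techniques and, potentially, an LSM (Linux Security Module) of your choosing. Expressive, dynamic filters provide further options down this path, which could be construed, incorrectly, as a more complete sandboxing solution.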

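This mode filters system calls according to a configurable policy implemented as Berkeley Packet Filter rules. BPF is a flexible filtering mechanism in which the access rules for system calls can be defined as needed.

With suitable BPF rules, system calls can be controlled at a fine granularity: particular system calls can be allowed or forbidden, access to sensitive resources can be restricted, and the permitted set can be chosen to match a custom security policy. This configurability lets seccomp adapt to a wide range of security requirements.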
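
As an illustration of fine-grained control, the sketch below filters on a system call argument rather than only on the system call number: write(2) to file descriptor 1 is allowed, and write(2) to any other descriptor fails with EACCES. It is a sketch under stated assumptions, not a production filter: the helper names are hypothetical, the architecture check that a real filter should perform is omitted for brevity, and the filter compares the raw argument value only, since BPF cannot follow pointers.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Offsets of the low and high 32-bit halves of 64-bit argument n. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI(n) (ARG_LO(n) + 4)
#else
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_LO(n) (ARG_HI(n) + 4)
#endif

/* Hypothetical helper: write(2) to any fd other than 1 fails with EACCES. */
int install_write_fd_filter(void)
{
    struct sock_filter filter[] = {
        /* [0] Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [1] Not write(2): jump to [7] (allow). */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 0, 5),
        /* [2] Load the low half of args[0], the file descriptor. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
        /* [3] Low half is not 1: jump to [6] (deny). */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 2),
        /* [4] Load the high half of args[0]. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
        /* [5] High half is 0: jump to [7] (allow); else fall into [6]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 0),
        /* [6] Deny: fail the call with EACCES. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
        /* [7] Allow. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0);
}

/*
 * Hypothetical helper: in a filtered child, attempt a zero-length write to
 * fd and return the resulting errno (0 on success, -1 on internal failure).
 */
int write_errno_on_fd(int fd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        long r;

        if (install_write_fd_filter() != 0)
            _exit(255);
        errno = 0;
        r = syscall(SYS_write, (long)fd, "", 0L);
        _exit(r == -1 ? errno : 0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The filter is installed in a forked child so that the calling process itself stays unfiltered; seccomp filters cannot be removed once loaded.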

(1) Using the libseccomp library. Install the development package first:

apt-get install libseccomp-dev

#include <seccomp.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv)
{
    /* initialize the libseccomp context; the default action is to kill */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    /* allow exiting */
    printf("Adding rule : Allow exit_group\n");
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* allow getting the current pid */
    //printf("Adding rule : Allow getpid\n");
    //seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);

    /* deny getpid: the call fails with EBADF instead of killing the process */
    printf("Adding rule : Deny getpid\n");
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADF), SCMP_SYS(getpid), 0);

    /* allow changing data segment size, as required by glibc */
    printf("Adding rule : Allow brk\n");
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);

    /* allow writing up to 512 bytes to fd 1 */
    printf("Adding rule : Allow write upto 512 bytes to FD 1\n");
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 2,
                     SCMP_A0(SCMP_CMP_EQ, 1),
                     SCMP_A2(SCMP_CMP_LE, 512));

    /* if writing to any other fd, return -EBADF */
    printf("Adding rule : Deny write to any FD except 1 \n");
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADF), SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_NE, 1));

    /* load and enforce the filters */
    printf("Load rules and enforce \n");
    seccomp_load(ctx);
    seccomp_release(ctx);

    /* getpid is denied, so instead of a real PID the call returns an
       error value; in the run below it prints -9, i.e. -EBADF */
    printf("this process is %d\n", getpid());
    return 0;
}

# gcc seccomp1.c -lseccomp
# ./a.out
Adding rule : Allow exit_group
Adding rule : Deny getpid
Adding rule : Allow brk
Adding rule : Allow write upto 512 bytes to FD 1
Adding rule : Deny write to any FD except 1
Load rules and enforce
this process is -9

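(2) Using prctl(2) directly.
An additional seccomp mode is added, and it is enabled with the same prctl(2) call as strict mode. If the architecture has CONFIG_HAVE_ARCH_SECCOMP_FILTER, filters can be added as described below: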

CONFIG_HAVE_ARCH_SECCOMP_FILTER=y

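PR_SET_SECCOMP:
Now takes an additional argument that specifies a new filter using a BPF program. The BPF program is executed over a struct seccomp_data that reflects the system call number, arguments, and other metadata. The BPF program must then return one of the acceptable values to inform the kernel which action should be taken.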
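
To make the data a filter operates on more concrete, the sketch below prints the layout of struct seccomp_data as defined in <linux/seccomp.h> and builds one of the return values. The three function names are hypothetical helpers introduced only for this illustration.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: byte offset of argument n (0..5) in seccomp_data. */
size_t seccomp_arg_offset(unsigned int n)
{
    return offsetof(struct seccomp_data, args) + n * sizeof(__u64);
}

/* Hypothetical helper: return value that fails the call with errno err. */
unsigned int seccomp_errno_action(unsigned int err)
{
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}

/* Print where each field that a BPF filter can load is located. */
void print_seccomp_data_layout(void)
{
    unsigned int n;

    printf("nr                  at offset %zu\n",
           offsetof(struct seccomp_data, nr));
    printf("arch                at offset %zu\n",
           offsetof(struct seccomp_data, arch));
    printf("instruction_pointer at offset %zu\n",
           offsetof(struct seccomp_data, instruction_pointer));
    for (n = 0; n < 6; n++)
        printf("args[%u]             at offset %zu\n",
               n, seccomp_arg_offset(n));
    printf("total size          %zu bytes\n", sizeof(struct seccomp_data));
}
```

A filter loads these fields with BPF_LD | BPF_W | BPF_ABS at the offsets shown and ends with a BPF_RET instruction carrying a value such as SECCOMP_RET_ALLOW or the result of seccomp_errno_action().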

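Usage:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

The 'prog' argument is a pointer to a struct sock_fprog that contains the filter program. If the program is invalid, the call returns -1 and sets errno to EINVAL.

If fork/clone and execve are allowed by @prog, any child processes are constrained to the same filters and system call ABI as the parent.

Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or run with CAP_SYS_ADMIN privileges in its namespace. If neither holds, -EACCES is returned. This requirement ensures that filter programs cannot be applied to child processes with greater privileges than the task that installed them.

Additionally, if prctl(2) is allowed by the attached filter, further filters may be layered on. This increases evaluation time but allows the attack surface to be reduced further while the process runs.

The call above returns 0 on success and non-zero on error.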
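
The usage described above can be sketched as follows. This is a minimal illustration, not a recommended policy: it makes getppid(2) fail with EPERM and allows everything else, it omits the architecture check that a real filter should start with, and the helper names are hypothetical. The filter is installed in a forked child so that the caller stays unfiltered.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: install a deny-getppid filter through prctl(2). */
int install_getppid_filter(void)
{
    struct sock_filter filter[] = {
        /* [0] Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [1] getppid: fall into [2]; anything else: jump to [3]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        /* [2] Fail the call with EPERM without executing it. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [3] Allow every other system call. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required unless the caller has CAP_SYS_ADMIN in its namespace. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0);
}

/*
 * Hypothetical helper: returns the errno seen by getppid(2) in a filtered
 * child (0 if the call succeeded, -1 on internal failure).
 */
int getppid_errno_under_filter(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        long r;

        if (install_getppid_filter() != 0)
            _exit(255);
        errno = 0;
        r = syscall(SYS_getppid);
        _exit(r == -1 ? errno : 0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Returning an errno instead of killing the process is a design choice that keeps the example easy to observe; a stricter policy could return SECCOMP_RET_KILL_PROCESS instead.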

5. The seccomp System Call

NAME
       seccomp - operate on Secure Computing state of the process

SYNOPSIS
       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the
       use of syscall(2).

DESCRIPTION
       The seccomp() system call operates on the Secure Computing (seccomp)
       state of the calling process.

// kernel/seccomp.c
/* Common entry point for both prctl and syscall. */
static long do_seccomp(unsigned int op, unsigned int flags,
                       void __user *uargs)
{
    switch (op) {
    case SECCOMP_SET_MODE_STRICT:
        if (flags != 0 || uargs != NULL)
            return -EINVAL;
        return seccomp_set_mode_strict();
    case SECCOMP_SET_MODE_FILTER:
        return seccomp_set_mode_filter(flags, uargs);
    ......
    }
}

SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
                void __user *, uargs)
{
    return do_seccomp(op, flags, uargs);
}

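The seccomp() system call operates on the secure computing (seccomp) state of the calling process. It supports the following operation values:
(1) SECCOMP_SET_MODE_STRICT: puts the calling thread into strict secure computing mode. In this mode the only system calls the thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Any other system call results in the delivery of a SIGKILL signal. Strict mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket.

Note that although the calling thread can no longer call sigprocmask(2), it can use sigreturn(2) to block all signals apart from SIGKILL and SIGSTOP. This means that alarm(2), for example, is not sufficient for restricting the process's execution time. To terminate the process reliably, SIGKILL must be used. This can be done by using timer_create(2) with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setrlimit(2) to set the hard limit for RLIMIT_CPU.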
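
The setrlimit(2) approach mentioned above can be sketched as follows. This is an illustration under stated assumptions: the helper name is hypothetical, the one-second limit is arbitrary, and the soft and hard limits are set to the same value so that reaching the limit results in SIGKILL. The limit has to be set before strict mode is entered, because setrlimit(2) itself is not on the strict-mode allow-list.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run a strict-mode child that spins forever under a
 * one-second hard CPU limit. Returns the signal that terminated the child,
 * or -1 if the child was not killed by a signal.
 */
int strict_child_cpu_limit_signal(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit lim = { .rlim_cur = 1, .rlim_max = 1 };
        volatile unsigned long spin = 0;

        /* Set the limit first: setrlimit is forbidden in strict mode. */
        if (setrlimit(RLIMIT_CPU, &lim) != 0)
            _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
            _exit(3);
        for (;;)
            spin++;     /* pure computation, no system calls */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The call blocks for roughly one second of child CPU time before returning.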

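(2) SECCOMP_SET_MODE_FILTER: the system calls allowed are defined by a pointer to a Berkeley Packet Filter (BPF) program passed via args. This argument is a pointer to a struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. If the filter is invalid, seccomp() fails with the error EINVAL.

If fork(2) or clone(2) is allowed by the filter, any child processes are constrained to the same system call filters as the parent. If execve(2) is allowed, the existing filters are preserved across a call to execve(2).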
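
Inheritance across fork(2) can be observed with a short experiment. In the sketch below, which uses hypothetical helper names and again omits the architecture check for brevity, a child installs a filter through the seccomp() system call and then forks a grandchild; the grandchild never installs anything itself, yet its getppid(2) call is expected to fail with the errno chosen by the filter.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: deny getppid(2) with EPERM, allow the rest. */
int install_filter_via_seccomp(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* glibc has no seccomp() wrapper, so syscall(2) is used. */
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}

/*
 * Hypothetical helper: returns the errno seen by getppid(2) in a grandchild
 * of the process that installed the filter (-1 on internal failure).
 */
int grandchild_getppid_errno(void)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        int gstatus;
        pid_t grandchild;

        if (install_filter_via_seccomp() != 0)
            _exit(255);
        grandchild = fork();        /* fork/clone is allowed by the filter */
        if (grandchild < 0)
            _exit(254);
        if (grandchild == 0) {
            long r;

            errno = 0;
            r = syscall(SYS_getppid);   /* filtered by the inherited filter */
            _exit(r == -1 ? errno : 0);
        }
        if (waitpid(grandchild, &gstatus, 0) != grandchild ||
            !WIFEXITED(gstatus))
            _exit(253);
        _exit(WEXITSTATUS(gstatus));    /* relay the grandchild's result */
    }
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```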

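To use the SECCOMP_SET_MODE_FILTER operation, either the calling thread must have the CAP_SYS_ADMIN capability in its user namespace, or the thread must already have the no_new_privs bit set. If that bit was not already set by an ancestor of this thread, the thread must make the following call: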

prctl(PR_SET_NO_NEW_PRIVS, 1);

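Otherwise, the SECCOMP_SET_MODE_FILTER operation fails with the error EACCES. This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller's user IDs to nonzero values to instead return 0 without actually making the system call. The program could then be tricked into retaining superuser privileges in circumstances where it can be influenced to do dangerous things, because it did not actually drop privileges.)

If prctl(2) or seccomp() is allowed by the attached filter, further filters may be added. This increases evaluation time but allows the attack surface to be reduced further while the thread runs.

This operation is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled.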

#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define X32_SYSCALL_BIT 0x40000000
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

static int
install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
{
    unsigned int upper_nr_limit = 0xffffffff;

    /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
       (in the x32 ABI, all system calls have bit 30 set in the
       'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
    if (t_arch == AUDIT_ARCH_X86_64)
        upper_nr_limit = X32_SYSCALL_BIT - 1;

    struct sock_filter filter[] = {
        /* [0] Load architecture from 'seccomp_data' buffer into
               accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, arch))),

        /* [1] Jump forward 5 instructions if architecture does not
               match 't_arch'. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

        /* [2] Load system call number from 'seccomp_data' buffer into
               accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, nr))),

        /* [3] Check ABI - only needed for x86-64 in deny-list use
               cases.  Use BPF_JGT instead of checking against the bit
               mask to avoid having to reload the syscall number. */
        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

        /* [4] Jump forward 1 instruction if system call number
               does not match 'syscall_nr'. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

        /* [5] Matching architecture and system call: don't execute
               the system call, and return 'f_errno' in 'errno'. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

        /* [6] Destination of system call number mismatch: allow other
               system calls. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        /* [7] Destination of architecture mismatch: kill process. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };

    struct sock_fprog prog = {
        .len = ARRAY_SIZE(filter),
        .filter = filter,
    };

    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp");
        return 1;
    }

    return 0;
}

int
main(int argc, char *argv[])
{
    if (argc < 5) {
        fprintf(stderr, "Usage: "
                "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                "                 AUDIT_ARCH_X86_64: 0x%X\n"
                "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
        exit(EXIT_FAILURE);
    }

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl");
        exit(EXIT_FAILURE);
    }

    if (install_filter(strtol(argv[1], NULL, 0),
                       strtoul(argv[2], NULL, 0),
                       strtol(argv[3], NULL, 0)))
        exit(EXIT_FAILURE);

    execv(argv[4], &argv[4]);
    perror("execv");
    exit(EXIT_FAILURE);
}

# ./a.out
Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                 AUDIT_ARCH_X86_64: 0xC000003E

# ./a.out 59 0xC000003E 99 /bin/whoami
execv: Cannot assign requested address
# ./a.out 1 0xC000003E 99 /bin/whoami
# ./a.out 295 0xC000003E 99 /bin/whoami
root

6. Linux Capabilities and Seccomp

6.1 Linux Capabilities

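Linux capabilities split the privileges of a process running as root into smaller groups of privileges. A process with root privileges can thereby be restricted to the minimum set of privileges it needs to do its work.

In traditional UNIX systems the root user has full system privileges, which is a potential security risk. Linux introduced capabilities to reduce this exposure.

With capabilities, root's privileges are divided into distinct units, each covering a specific set of operations, so that root privilege can be restricted to the minimal set required for a particular task.

Common capabilities include:

CAP_NET_ADMIN: allows network administration operations, such as configuring network interfaces and setting firewall rules.
CAP_SYS_ADMIN: allows system administration operations, such as mounting filesystems and changing the hostname.
CAP_DAC_OVERRIDE: allows bypassing file permission checks, giving access to any file.
CAP_SETUID: allows changing the process's user IDs (UIDs), including the effective UID.

When only the required capabilities are granted to a process, it can perform only the privileged operations it actually needs, even when it runs as root, rather than holding full root privileges.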
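
Which capabilities a process actually holds can be checked from its status file, whose CapEff line lists the effective set as a hexadecimal bit mask. The sketch below is an illustration with hypothetical helper names; it assumes the capability numbers from <linux/capability.h> index bits of that mask and that the status file fits in an 8192-byte buffer.

```c
#include <linux/capability.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given status-file text, return 1 if capability
 * 'cap' is in the effective set, 0 if it is not, and -1 if the CapEff
 * line is missing or 'cap' is out of range.
 */
int cap_in_effective_set(const char *status_text, int cap)
{
    const char *p = status_text;

    if (cap < 0 || cap > 63)
        return -1;
    while (p != NULL && *p != '\0') {
        unsigned long long mask;

        if (sscanf(p, "CapEff: %llx", &mask) == 1)
            return (int)((mask >> cap) & 1ULL);
        p = strchr(p, '\n');    /* advance to the next line, if any */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Hypothetical helper: check a capability of the calling process. */
int self_has_cap(int cap)
{
    char buf[8192];             /* assumed large enough for the file */
    FILE *f = fopen("/proc/self/status", "r");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cap_in_effective_set(buf, cap);
}
```

For example, self_has_cap(CAP_SYS_ADMIN) would typically return 1 for an unrestricted root process and 0 for an ordinary user process.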

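This fine-grained control helps reduce potential vulnerabilities and improves system security. It lets administrators control process privileges more precisely and keep them to a minimum, which lowers the risk of an attacker abusing root privileges.

For details, see: Linux Security - The Capabilities Mechanism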

6.2 Linux Seccomp

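Secure computing mode (seccomp) is a kernel feature that lets you filter the system calls made from inside a container. The combination of restricted and allowed calls is arranged in profiles, and different profiles can be passed to different containers. seccomp offers finer-grained control than Capabilities, leaving an attacker with only a limited number of system calls available from inside the container.

seccomp works through profiles that specify which system calls a process is allowed or denied. Profiles can be customized to the requirements of each container, so different levels of system call filtering can be applied to different containers.

A seccomp profile can allow or deny a system call based on its number, its arguments, or other attributes. This fine-grained control lets administrators define precisely what system call behavior is permitted for processes in a container.

Compared with Capabilities, seccomp operates at a lower level by filtering the actual system calls. Even if a process running in a container has root privileges or elevated capabilities, seccomp can restrict it to a limited set of allowed system calls, which effectively reduces the potential attack surface.

Combining seccomp with other security mechanisms, such as Capabilities, namespace isolation and mandatory access control, makes it possible to build more robust and secure container environments, limit the impact of vulnerabilities, and restrict what a potential attacker can do.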


References

Linux Security Modules - seccomp Explained
https://www.man7.org/linux/man-pages/man2/seccomp.2.html
https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
https://book.hacktricks.xyz/linux-hardening/privilege-escalation/docker-security/seccomp
https://opensource.com/article/21/8/container-linux-technology