Symptom
Ops reported that a machine had accumulated a lot of D-state processes that could not be killed, so the machine was preserved as-is for investigation.
Investigation
They are all kernel threads, so the first step is to see where their kernel stacks are blocked: they turn out to be stuck in the userfaultfd page-fault path.
Background: uffd
Digging into uffd
In the firecracker + e2b architecture, uffd is used as follows (a minimal sketch follows the list):
1. firecracker registers a uffd, shares it with the orchestrator, and hands the guest memory address space over to the orchestrator.
2. The guest triggers a page fault and vm-exits; a kernel thread ends up taking the fault, the orchestrator is notified that there is a pf request to handle, and the thread waits for the fault to be resolved.
3. The orchestrator calls ioctl on the uffd to write the data from the memory snapshot file into the corresponding memory page.
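To make the flow above concrete, here is a minimal, self-contained sketch of the uffd handshake in C. It is not the e2b/firecracker code; the single-process structure, sizes and names are illustrative assumptions (in the real setup the faulting side is the guest, and the copied data comes from the memory snapshot file).

/* Minimal uffd sketch: register a range, wait for a fault message,
 * resolve it with UFFDIO_COPY. Illustrative only. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* Step 1: create the uffd and register a memory range with it
	 * (firecracker does this for the guest address space and then
	 * passes the fd to the orchestrator over a unix socket). */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0) { perror("userfaultfd"); return 1; }

	struct uffdio_api api = { .api = UFFD_API };
	if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

	char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) { perror("mmap"); return 1; }

	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)area, .len = page },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

	/* Steps 2/3: the handler side waits for a fault notification and
	 * resolves it by copying data into the faulting page. */
	char *src = aligned_alloc(page, page);
	memset(src, 0x42, page);

	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	/* A fault on 'area' from another thread/process would make poll() return. */
	if (poll(&pfd, 1, 0) > 0) {
		struct uffd_msg msg;
		if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
		    msg.event == UFFD_EVENT_PAGEFAULT) {
			struct uffdio_copy copy = {
				.dst = msg.arg.pagefault.address & ~(page - 1),
				.src = (uintptr_t)src,
				.len = page,
			};
			/* UFFDIO_COPY fills the page and wakes the blocked faulter. */
			ioctl(uffd, UFFDIO_COPY, &copy);
		}
	}
	return 0;
}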
Let's look at the handle_userfault() code.
vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
{
	struct vm_area_struct *vma = vmf->vma;
	struct mm_struct *mm = vma->vm_mm;
	struct userfaultfd_ctx *ctx;
	struct userfaultfd_wait_queue uwq;
	vm_fault_t ret = VM_FAULT_SIGBUS;
	bool must_wait;
	unsigned int blocking_state;

	/*
	 * We don't do userfault handling for the final child pid update.
	 *
	 * We also don't do userfault handling during
	 * coredumping. hugetlbfs has the special
	 * hugetlb_follow_page_mask() to skip missing pages in the
	 * FOLL_DUMP case, anon memory also checks for FOLL_DUMP with
	 * the no_page_table() helper in follow_page_mask(), but the
	 * shmem_vm_ops->fault method is invoked even during
	 * coredumping and it ends up here.
	 */
	if (current->flags & (PF_EXITING|PF_DUMPCORE))
		goto out;

	assert_fault_locked(vmf);

	ctx = vma->vm_userfaultfd_ctx.ctx;
	if (!ctx)
		goto out;

	BUG_ON(ctx->mm != mm);

	/* Any unrecognized flag is a bug. */
	VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
	/* 0 or > 1 flags set is a bug; we expect exactly 1. */
	VM_BUG_ON(!reason || (reason & (reason - 1)));

	if (ctx->features & UFFD_FEATURE_SIGBUS)
		goto out;
	if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
		goto out;

	/*
	 * If it's already released don't get it. This avoids to loop
	 * in __get_user_pages if userfaultfd_release waits on the
	 * caller of handle_userfault to release the mmap_lock.
	 */
	if (unlikely(READ_ONCE(ctx->released))) {
		/*
		 * Don't return VM_FAULT_SIGBUS in this case, so a non
		 * cooperative manager can close the uffd after the
		 * last UFFDIO_COPY, without risking to trigger an
		 * involuntary SIGBUS if the process was starting the
		 * userfaultfd while the userfaultfd was still armed
		 * (but after the last UFFDIO_COPY). If the uffd
		 * wasn't already closed when the userfault reached
		 * this point, that would normally be solved by
		 * userfaultfd_must_wait returning 'false'.
		 *
		 * If we were to return VM_FAULT_SIGBUS here, the non
		 * cooperative manager would be instead forced to
		 * always call UFFDIO_UNREGISTER before it can safely
		 * close the uffd.
		 */
		ret = VM_FAULT_NOPAGE;
		goto out;
	}

	/*
	 * Check that we can return VM_FAULT_RETRY.
	 *
	 * NOTE: it should become possible to return VM_FAULT_RETRY
	 * even if FAULT_FLAG_TRIED is set without leading to gup()
	 * -EBUSY failures, if the userfaultfd is to be extended for
	 * VM_UFFD_WP tracking and we intend to arm the userfault
	 * without first stopping userland access to the memory. For
	 * VM_UFFD_MISSING userfaults this is enough for now.
	 */
	if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
		/*
		 * Validate the invariant that nowait must allow retry
		 * to be sure not to return SIGBUS erroneously on
		 * nowait invocations.
		 */
		BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
		if (printk_ratelimit()) {
			printk(KERN_WARNING
			       "FAULT_FLAG_ALLOW_RETRY missing %x\n",
			       vmf->flags);
			dump_stack();
		}
#endif
		goto out;
	}

	/*
	 * Handle nowait, not much to do other than tell it to retry
	 * and wait.
	 */
	ret = VM_FAULT_RETRY;
	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		goto out;

	/* take the reference before dropping the mmap_lock */
	userfaultfd_ctx_get(ctx);

	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
	uwq.wq.private = current;
	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
				reason, ctx->features);
	uwq.ctx = ctx;
	uwq.waken = false;

	blocking_state = userfaultfd_get_blocking_state(vmf->flags);

	/*
	 * Take the vma lock now, in order to safely call
	 * userfaultfd_huge_must_wait() later. Since acquiring the
	 * (sleepable) vma lock can modify the current task state, that
	 * must be before explicitly calling set_current_state().
	 */
	if (is_vm_hugetlb_page(vma))
		hugetlb_vma_lock_read(vma);

	spin_lock_irq(&ctx->fault_pending_wqh.lock);
	/*
	 * After the __add_wait_queue the uwq is visible to userland
	 * through poll/read().
	 */
	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
	/*
	 * The smp_mb() after __set_current_state prevents the reads
	 * following the spin_unlock to happen before the list_add in
	 * __add_wait_queue.
	 */
	set_current_state(blocking_state);
	spin_unlock_irq(&ctx->fault_pending_wqh.lock);

	if (!is_vm_hugetlb_page(vma))
		must_wait = userfaultfd_must_wait(ctx, vmf, reason);
	else
		must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
	if (is_vm_hugetlb_page(vma))
		hugetlb_vma_unlock_read(vma);
	release_fault_lock(vmf);

	if (likely(must_wait && !READ_ONCE(ctx->released))) {
		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
		schedule();
	}

	__set_current_state(TASK_RUNNING);

	/*
	 * Here we race with the list_del; list_add in
	 * userfaultfd_ctx_read(), however because we don't ever run
	 * list_del_init() to refile across the two lists, the prev
	 * and next pointers will never point to self. list_add also
	 * would never let any of the two pointers to point to
	 * self. So list_empty_careful won't risk to see both pointers
	 * pointing to self at any time during the list refile. The
	 * only case where list_del_init() is called is the full
	 * removal in the wake function and there we don't re-list_add
	 * and it's fine not to block on the spinlock. The uwq on this
	 * kernel stack can be released after the list_del_init.
	 */
	if (!list_empty_careful(&uwq.wq.entry)) {
		spin_lock_irq(&ctx->fault_pending_wqh.lock);
		/*
		 * No need of list_del_init(), the uwq on the stack
		 * will be freed shortly anyway.
		 */
		list_del(&uwq.wq.entry);
		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
	}

	/*
	 * ctx may go away after this if the userfault pseudo fd is
	 * already released.
	 */
	userfaultfd_ctx_put(ctx);
out:
	return ret;
}
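One detail in this path explains the "unkillable D state" symptom. The sleep state is picked from the fault flags by userfaultfd_get_blocking_state(), reproduced below from recent kernels (it may differ slightly by version): both TASK_KILLABLE and TASK_UNINTERRUPTIBLE are reported as D by ps, and since the blocked threads here are kernel workers there is no userspace signal handling to break them out, so they stay in D until something on the uffd side wakes them.

/* From fs/userfaultfd.c in recent kernels; shown for reference only. */
static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
{
	if (flags & FAULT_FLAG_INTERRUPTIBLE)
		return TASK_INTERRUPTIBLE;
	if (flags & FAULT_FLAG_KILLABLE)
		return TASK_KILLABLE;
	return TASK_UNINTERRUPTIBLE;
}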
It looks like the thread went to sleep in schedule() and, for some unknown reason, was never woken up again.
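For context on what a wakeup would normally look like: a waiter parked here is woken either when the handler resolves the fault (UFFDIO_COPY / UFFDIO_ZEROPAGE wake the range as a side effect), when the handler explicitly issues UFFDIO_WAKE, or when the last reference to the uffd is dropped and userfaultfd_release() wakes everyone with ctx->released set. A small sketch of the explicit wake; the helper name is mine, not from the e2b code:

/* Illustrative helper: wake any threads blocked in handle_userfault()
 * on the page range [addr, addr + len). */
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

static int uffd_wake_range(int uffd, uint64_t addr, uint64_t len)
{
	struct uffdio_range range = {
		.start = addr,	/* must be page aligned */
		.len   = len,	/* multiple of the page size */
	};

	if (ioctl(uffd, UFFDIO_WAKE, &range)) {
		perror("UFFDIO_WAKE");
		return -1;
	}
	return 0;
}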
Use crash to verify this further. Picking an arbitrary bt shows that the thread indeed scheduled out via schedule() and was never woken again.
First, look at the state of the wait queues inside the uffd ctx.
We need to find the ctx address first.
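For orientation, the pointer chain and the ctx fields we will be looking at are roughly the following (paraphrased from recent kernel sources; the exact field set varies a bit between versions):

/* Paraphrased from recent kernel sources; for orientation only. */
struct vm_userfaultfd_ctx {
	struct userfaultfd_ctx *ctx;	/* embedded in vm_area_struct as vma->vm_userfaultfd_ctx */
};

struct userfaultfd_ctx {
	wait_queue_head_t fault_pending_wqh;	/* faults queued, not yet read by the handler */
	wait_queue_head_t fault_wqh;		/* faults already read, awaiting resolution */
	wait_queue_head_t fd_wqh;		/* poll()/read() waiters on the uffd itself */
	wait_queue_head_t event_wqh;		/* non-pagefault events (fork, remap, ...) */
	seqcount_spinlock_t refile_seq;
	refcount_t refcount;			/* the reference count inspected below */
	unsigned int flags;
	unsigned int features;
	bool released;
	atomic_t mmap_changing;			/* bool in older kernels */
	struct mm_struct *mm;
};

/* The ctx for a faulting address is reached as: vmf->vma->vm_userfaultfd_ctx.ctx */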
The chain is vm_fault -> vm_area_struct -> vm_userfaultfd_ctx, and the vm_fault struct is passed in as the first argument.
The disassembly of handle_userfault itself is hard to follow, so walk up the call chain instead: the caller defines the vmf variable on its own stack.
With bt -f we can dump the stack frames; the first few values in the hugetlb_handle_userfault frame look like the values used to fill in that struct, so try decoding them.
vm_ops decodes to <hugetlb_vm_ops>, which looks right, so decode ctx next.
In fault_pending_wqh, next != prev, which means there are page-fault requests that were never handled; that is why the threads are stuck in D. The other queues have next == prev (pointing back at the head), i.e. they are empty.
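The emptiness check leans on the classic struct list_head layout: an empty wait-queue head has next and prev both pointing back at the head itself, so anything else means there are queued entries. For reference:

/* From include/linux/list.h (stable for many years). */
struct list_head {
	struct list_head *next, *prev;
};

/* A list head is empty when it points back at itself. */
static inline int list_empty(const struct list_head *head)
{
	return READ_ONCE(head->next) == head;
}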
Use waitq to look at the pending queue (since the first member of a struct shares the struct's address, the ctx address can be passed in directly). The five kworkers listed are exactly the page-fault request threads; their requests were never handled, which is why they are stuck in D. The next question is why the pf requests were never handled.
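How those wait-queue entries map back to tasks: each entry on fault_pending_wqh is the wq member of a userfaultfd_wait_queue that handle_userfault() placed on the faulting thread's kernel stack, and wq.private was set to current, i.e. the blocked kworker. Paraphrased from fs/userfaultfd.c:

/* Paraphrased from fs/userfaultfd.c; one of these lives on the kernel stack
 * of every thread blocked in handle_userfault(). */
struct userfaultfd_wait_queue {
	struct uffd_msg msg;		/* the message delivered to the handler via read() */
	wait_queue_entry_t wq;		/* linked into ctx->fault_pending_wqh; wq.private == faulting task */
	struct userfaultfd_ctx *ctx;
	bool waken;
};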
Another observation: the refcount is 6, but 5 kworkers + firecracker + orchestrator should add up to 7. So check the state of firecracker and the orchestrator; since it is the pf requests that are going unhandled, focus on the orchestrator first.
Find the corresponding firecracker process through the mm.
The pid in the matching task_struct is the firecracker process id.
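For completeness, the relationship used here is simply that a user task points at its address space via task_struct->mm. A kernel-side sketch of the lookup that was done by hand in crash might look like this (illustrative only, not code from the investigation):

/* Illustrative kernel-side sketch: find the process whose address space
 * is the mm recorded in the userfaultfd ctx. */
#include <linux/mm_types.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>

static struct task_struct *find_task_by_mm(struct mm_struct *mm)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		if (READ_ONCE(p->mm) == mm) {
			get_task_struct(p);	/* pin it before leaving RCU */
			rcu_read_unlock();
			return p;		/* p->pid is the firecracker pid */
		}
	}
	rcu_read_unlock();
	return NULL;
}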
The firecracker process is still alive, so the missing reference must be the orchestrator's: either it closed the uffd, or the orchestrator process itself went away.
ps shows that the orchestrator service was started after firecracker, and the e2b code shows that the orchestrator kills the firecracker process before it closes the uffd. So the likely story is that the orchestrator was restarted without going through the normal sandbox-teardown path, leaving these firecracker processes orphaned with nobody left to serve their page-fault requests, which drove the kernel threads into D state. (And since firecracker presumably still holds the uffd open, userfaultfd_release() never runs either, so the waiters are never woken even with ctx->released set.)
Confirmed with ops: we had just rolled out a new release and the orchestrator service had indeed been restarted, so the root cause is clear. The immediate fix is to tighten the release process: drain the node before upgrading and restarting.