【K8S源碼之Pod漂移】整體概況分析 controller-manager 中的 nodelifecycle controller（Pod的驅逐）

參考

k8s 污點驅逐詳解-源碼分析 - 掘金
k8s驅逐篇(5)-kube-controller-manager驅逐 - 良凱爾 - 博客園
k8s驅逐篇(6)-kube-controller-manager驅逐-NodeLifecycleController源碼分析 - 良凱爾 - 博客園
k8s驅逐篇(7)-kube-controller-manager驅逐-taintManager源碼分析 - 良凱爾 - 博客園

整體概況分析

基于 k8s 1.19 版本分析

[外鏈圖片轉存失敗,源站可能有防盜鏈機制,建議將圖片保存下來直接上傳(img-h6S3bs1J-1692352728103)(img/nodelifecycle筆記/image-20230818164014721.png)]

TaintManager 與非TaintManager

TaintManager 模式
- 發現 Node Unhealthy 后（也就是 Node Ready Condition = False 或 Unknown），會更新 Pod Ready Condition 為 False（表示 Pod 不健康），也會給 Node 打上 NoExecute Effect 的 Taint
- 之后 TaintManager 根據 Pod 的 Toleration 判斷，是否有設置容忍 NoExecute Effect Taint 的 Toleration
  - 沒有 Toleration 的話，就立即驅逐
  - 有 Toleration ，會根據 Toleration 設置的時長，定時刪除該 Pod
  - 默認情況下，會設置個 5min 的Toleration，也就是 5min 后會刪除此 Pod
非 TaintManager 模式（默認模式）
- 發現 Node Unhealthy 后，會更新 Pod Ready Condition 為 False（表示 Pod 不健康）
- 之后會記錄該 Node，等待 PodTimeout（5min） - nodegracePeriod（40s) 時間后，驅逐該 Node 上所有 Pod（Node級別驅逐），之后標記該 Node 為 evicted 狀態（此處是代碼中標記，資源上沒有此狀態）
- 之后便只考慮單 Pod 的驅逐（可能考慮部分 Pod 失敗等）
  - 若 Node 已經被標記為 evicted 狀態，那么可以進行單 Pod 的驅逐
  - 若 Node 沒有被標記為 evicted 狀態，那將 Node 標記為 tobeevicted 狀態，等待上面 Node 級別的驅逐

代碼中的幾個存儲結構


nodeEvictionMap *nodeEvictionMap	// nodeEvictionMap stores evictionStatus data for each node. type nodeEvictionMap struct { lock sync.Mutex nodeEvictions map[string]evictionStatus }	記錄所有 node 的狀態 1. 健康 unmarked 2. 等待驅逐 tobeevicted 3. 驅逐完成 evicted
zoneStates map[string]ZoneState	type ZoneState string	記錄 zone 的健康狀態 1. 新zone Initial 2. 健康的zone Normal 3. 部分健康zone PartialDisruption 4. 完全不健康 FullDisruption 這個是用于設置該zone 的驅逐速率
zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue		失聯（不健康）的 Node 會放入此結構中，等待被驅逐，之后nodeEvictionMap 對應的狀態記錄會被設置為 evicted 1. 該結構，key 為zone，value 為限速隊列處理（也就是上面驅逐效率起作用的地方） 2. 當一個 node 不健康，首先會計算出該 node 對應的zone 3. 然后放入該結構中
nodeHealthMap *nodeHealthMap	type nodeHealthMap struct { lock sync.RWMutex nodeHealths map[string]*nodeHealthData }
	type nodeHealthData struct { probeTimestamp metav1.Time readyTransitionTimestamp metav1.Time status v1.NodeStatus lease coordv1.Lease }	記錄每個node的健康狀態，主要在 monitorHealth 函數中使用 1. 其中 probeTimestamp 最關鍵，該參數記錄該 Node 最后一次健康的時間，也就是失聯前最后一個 lease 的時間 2. 之后根據 probeTimestamp 和寬限時間 gracePeriod，判斷該 node 是否真正失聯，并設置為 unknown 狀態

整體代碼流程分析

// Run starts an asynchronous loop that monitors the status of cluster nodes.
func (nc *Controller) Run(stopCh <-chan struct{}) {defer utilruntime.HandleCrash()
?klog.Infof("Starting node controller")defer klog.Infof("Shutting down node controller")// 1.等待leaseInformer、nodeInformer、podInformerSynced、daemonSetInformerSynced同步完成。if !cache.WaitForNamedCacheSync("taint", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {return}// 2.如果enable-taint-manager=true,開啟nc.taintManager.Runif nc.runTaintManager {go nc.taintManager.Run(stopCh)}// Close node update queue to cleanup go routine.defer nc.nodeUpdateQueue.ShutDown()defer nc.podUpdateQueue.ShutDown()// 3.執行doNodeProcessingPassWorker，這個是處理nodeUpdateQueue隊列的node// Start workers to reconcile labels and/or update NoSchedule taint for nodes.for i := 0; i < scheduler.UpdateWorkerSize; i++ {// Thanks to "workqueue", each worker just need to get item from queue, because// the item is flagged when got from queue: if new event come, the new item will// be re-queued until "Done", so no more than one worker handle the same item and// no event missed.go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)}// 4.doPodProcessingWorker，這個是處理podUpdateQueue隊列的podfor i := 0; i < podUpdateWorkerSize; i++ {go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)}// 5. 如果開啟了feature-gates=TaintBasedEvictions=true，執行doNoExecuteTaintingPass函數。否則執行doEvictionPass函數if nc.useTaintBasedEvictions {// Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated// taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)} else {// Managing eviction of nodes:// When we delete pods off a node, if the node was not empty at the time we then// queue an eviction watcher. If we hit an error, retry deletion.go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)}// 6.一直監聽node狀態是否健康// Incorporate the results of node health signal pushed from kubelet to master.go wait.Until(func() {if err := nc.monitorNodeHealth(); err != nil {klog.Errorf("Error monitoring node health: %v", err)}}, nc.nodeMonitorPeriod, stopCh)<-stopCh
}

MonitorNodeHealth

在這里插入圖片描述

此部分有如下幾個作用

讀取 Node 的 Label，用于確定 Node 屬于哪個 zone；若該 zone 是新增的，就注冊到 zonePodEvictor 或 zoneNoExecuteTainter (TaintManager 模式)

zonePodEvictor 后續用于該 zone 中失聯的 Node，用于 Node 級別驅逐（就是驅逐 Node 上所有 Pod，并設置為 evicted 狀態，此部分參見）

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
// addPodEvictorForNewZone checks if new zone appeared, and if so add new evictor.
// dfy: 若出現新的 zone ，初始化 zonePodEvictor 或 zoneNoExecuteTainter
func (nc *Controller) addPodEvictorForNewZone(node *v1.Node) {nc.evictorLock.Lock()defer nc.evictorLock.Unlock()zone := utilnode.GetZoneKey(node)// dfy: 若出現新的 zone ，初始化 zonePodEvictor 或 zoneNoExecuteTainterif _, found := nc.zoneStates[zone]; !found {// dfy: 沒有找到 zone value，設置為 Initialnc.zoneStates[zone] = stateInitial// dfy: 沒有 TaintManager，創建一個 限速隊列，不太清楚有什么作用？？？if !nc.runTaintManager {// dfy: zonePodEvictor 負責將 pod 從無響應的節點驅逐出去nc.zonePodEvictor[zone] =scheduler.NewRateLimitedTimedQueue(flowcontrol.NewTokenBucketRateLimiter(nc.evictionLimiterQPS, scheduler.EvictionRateLimiterBurst))} else {// dfy: zoneNoExecuteTainter 負責為 node 打上污點 taintnc.zoneNoExecuteTainter[zone] =scheduler.NewRateLimitedTimedQueue(flowcontrol.NewTokenBucketRateLimiter(nc.evictionLimiterQPS, scheduler.EvictionRateLimiterBurst))}// Init the metric for the new zone.klog.Infof("Initializing eviction metric for zone: %v", zone)evictionsNumber.WithLabelValues(zone).Add(0)}
}func (nc *Controller) doEvictionPass() {nc.evictorLock.Lock()defer nc.evictorLock.Unlock()for k := range nc.zonePodEvictor {// Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {// dfy: 此處 value.Value 存儲的是 Cluster Namenode, err := nc.nodeLister.Get(value.Value)if apierrors.IsNotFound(err) {klog.Warningf("Node %v no longer present in nodeLister!", value.Value)} else if err != nil {klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)}nodeUID, _ := value.UID.(string)// dfy: 獲得分配到該節點上的 Podpods, err := nc.getPodsAssignedToNode(value.Value)if err != nil {utilruntime.HandleError(fmt.Errorf("unable to list pods from node %q: %v", value.Value, err))return false, 0}// dfy: 刪除 Podremaining, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)if err != nil {// We are not setting eviction status here.// New pods will be handled by zonePodEvictor retry// instead of immediate pod eviction.utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))return false, 0}// dfy: 在nodeEvictionMap設置node的狀態為evictedif !nc.nodeEvictionMap.setStatus(value.Value, evicted) {klog.V(2).Infof("node %v was unregistered in the meantime - skipping setting status", value.Value)}if remaining {klog.Infof("Pods awaiting deletion due to Controller eviction")}if node != nil {zone := utilnode.GetZoneKey(node)evictionsNumber.WithLabelValues(zone).Inc()}return true, 0})}
}

監聽 Node 健康狀態（通過監聽 Node Lease 進行判別）

若 Lease 不更新，且超過了容忍時間 gracePeriod，認為該 Node 失聯（更新 Status Ready Condition 為 Unknown）

// tryUpdateNodeHealth checks a given node's conditions and tries to update it. Returns grace period to
// which given node is entitled, state of current and last observed Ready Condition, and an error if it occurred.
func (nc *Controller) tryUpdateNodeHealth(node *v1.Node) (time.Duration, v1.NodeCondition, *v1.NodeCondition, error) {// 省略一大部分 probeTimestamp 更新邏輯// dfy: 通過 lease 更新，來更新 probeTimestampobservedLease, _ := nc.leaseLister.Leases(v1.NamespaceNodeLease).Get(node.Name)if observedLease != nil && (savedLease == nil || savedLease.Spec.RenewTime.Before(observedLease.Spec.RenewTime)) {nodeHealth.lease = observedLeasenodeHealth.probeTimestamp = nc.now()}// dfy: 注意此處， Lease 沒更新，導致 probeTimestamp 沒變動，因此 現在時間超過了容忍時間，將此 Node 視作失聯 Nodeif nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {// NodeReady condition or lease was last set longer ago than gracePeriod, so// update it to Unknown (regardless of its current value) in the master.nodeConditionTypes := []v1.NodeConditionType{v1.NodeReady,v1.NodeMemoryPressure,v1.NodeDiskPressure,v1.NodePIDPressure,// We don't change 'NodeNetworkUnavailable' condition, as it's managed on a control plane level.// v1.NodeNetworkUnavailable,}nowTimestamp := nc.now()// dfy: 尋找 node 是否有上面幾個異常狀態for _, nodeConditionType := range nodeConditionTypes {// dfy: 具有異常狀態，就進行記錄_, currentCondition := nodeutil.GetNodeCondition(&node.Status, nodeConditionType)if currentCondition == nil {klog.V(2).Infof("Condition %v of node %v was never updated by kubelet", nodeConditionType, node.Name)node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{Type:               nodeConditionType,Status:             v1.ConditionUnknown,Reason:             "NodeStatusNeverUpdated",Message:            "Kubelet never posted node status.",LastHeartbeatTime:  node.CreationTimestamp,LastTransitionTime: nowTimestamp,})} else {klog.V(2).Infof("node %v hasn't been updated for %+v. Last %v is: %+v",node.Name, nc.now().Time.Sub(nodeHealth.probeTimestamp.Time), nodeConditionType, currentCondition)if currentCondition.Status != v1.ConditionUnknown {currentCondition.Status = v1.ConditionUnknowncurrentCondition.Reason = "NodeStatusUnknown"currentCondition.Message = "Kubelet stopped posting node status."currentCondition.LastTransitionTime = nowTimestamp}}}// We need to update currentReadyCondition due to its value potentially changed._, currentReadyCondition = nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)if !apiequality.Semantic.DeepEqual(currentReadyCondition, &observedReadyCondition) {if _, err := nc.kubeClient.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{}); err != nil {klog.Errorf("Error updating node %s: %v", node.Name, err)return gracePeriod, observedReadyCondition, currentReadyCondition, err}nodeHealth = &nodeHealthData{status:                   &node.Status,probeTimestamp:           nodeHealth.probeTimestamp,readyTransitionTimestamp: nc.now(),lease:                    observedLease,}return gracePeriod, observedReadyCondition, currentReadyCondition, nil}}return gracePeriod, observedReadyCondition, currentReadyCondition, nil
}

根據 zone 設置驅逐速率

每個 zone 有不同數量的 Node，根據該 zone 中 Node 失聯數量的占比，設置不同的驅逐速率

// dfy： 1. 計算 zone 不健康程度； 2. 根據 zone 不健康程度設置不同的驅逐速率
func (nc *Controller) handleDisruption(zoneToNodeConditions map[string][]*v1.NodeCondition, nodes []*v1.Node) {newZoneStates := map[string]ZoneState{}allAreFullyDisrupted := truefor k, v := range zoneToNodeConditions {zoneSize.WithLabelValues(k).Set(float64(len(v)))// dfy: 計算該 zone 的不健康程度（就是失聯 node 的占比）// nc.computeZoneStateFunc = nc.ComputeZoneStateunhealthy, newState := nc.computeZoneStateFunc(v)zoneHealth.WithLabelValues(k).Set(float64(100*(len(v)-unhealthy)) / float64(len(v)))unhealthyNodes.WithLabelValues(k).Set(float64(unhealthy))if newState != stateFullDisruption {allAreFullyDisrupted = false}newZoneStates[k] = newStateif _, had := nc.zoneStates[k]; !had {klog.Errorf("Setting initial state for unseen zone: %v", k)nc.zoneStates[k] = stateInitial}}allWasFullyDisrupted := truefor k, v := range nc.zoneStates {if _, have := zoneToNodeConditions[k]; !have {zoneSize.WithLabelValues(k).Set(0)zoneHealth.WithLabelValues(k).Set(100)unhealthyNodes.WithLabelValues(k).Set(0)delete(nc.zoneStates, k)continue}if v != stateFullDisruption {allWasFullyDisrupted = falsebreak}}// At least one node was responding in previous pass or in the current pass. Semantics is as follows:// - if the new state is "partialDisruption" we call a user defined function that returns a new limiter to use,// - if the new state is "normal" we resume normal operation (go back to default limiter settings),// - if new state is "fullDisruption" we restore normal eviction rate,//   - unless all zones in the cluster are in "fullDisruption" - in that case we stop all evictions.if !allAreFullyDisrupted || !allWasFullyDisrupted {// We're switching to full disruption modeif allAreFullyDisrupted {klog.V(0).Info("Controller detected that all Nodes are not-Ready. Entering master disruption mode.")for i := range nodes {if nc.runTaintManager {_, err := nc.markNodeAsReachable(nodes[i])if err != nil {klog.Errorf("Failed to remove taints from Node %v", nodes[i].Name)}} else {nc.cancelPodEviction(nodes[i])}}// We stop all evictions.for k := range nc.zoneStates {if nc.runTaintManager {nc.zoneNoExecuteTainter[k].SwapLimiter(0)} else {nc.zonePodEvictor[k].SwapLimiter(0)}}for k := range nc.zoneStates {nc.zoneStates[k] = stateFullDisruption}// All rate limiters are updated, so we can return early here.return}// We're exiting full disruption modeif allWasFullyDisrupted {klog.V(0).Info("Controller detected that some Nodes are Ready. Exiting master disruption mode.")// When exiting disruption mode update probe timestamps on all Nodes.now := nc.now()for i := range nodes {v := nc.nodeHealthMap.getDeepCopy(nodes[i].Name)v.probeTimestamp = nowv.readyTransitionTimestamp = nownc.nodeHealthMap.set(nodes[i].Name, v)}// We reset all rate limiters to settings appropriate for the given state.for k := range nc.zoneStates {// dfy: 設置 zone 的驅逐速率nc.setLimiterInZone(k, len(zoneToNodeConditions[k]), newZoneStates[k])nc.zoneStates[k] = newZoneStates[k]}return}// We know that there's at least one not-fully disrupted so,// we can use default behavior for rate limitersfor k, v := range nc.zoneStates {newState := newZoneStates[k]if v == newState {continue}klog.V(0).Infof("Controller detected that zone %v is now in state %v.", k, newState// dfy: 設置 zone 的驅逐速率nc.setLimiterInZone(k, len(zoneToNodeConditions[k]), newState)nc.zoneStates[k] = newState}}
}// ComputeZoneState returns a slice of NodeReadyConditions for all Nodes in a given zone.
// The zone is considered:
// - fullyDisrupted if there're no Ready Nodes,
// - partiallyDisrupted if at least than nc.unhealthyZoneThreshold percent of Nodes are not Ready,
// - normal otherwise
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {readyNodes := 0notReadyNodes := 0for i := range nodeReadyConditions {if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {readyNodes++} else {notReadyNodes++}}switch {case readyNodes == 0 && notReadyNodes > 0:return notReadyNodes, stateFullDisruptioncase notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:return notReadyNodes, statePartialDisruptiondefault:return notReadyNodes, stateNormal}
}// dfy: 根據該 zone 健康狀態（也就是健康比例），設置驅逐效率(頻率）
func (nc *Controller) setLimiterInZone(zone string, zoneSize int, state ZoneState) {switch state {case stateNormal:if nc.runTaintManager {nc.zoneNoExecuteTainter[zone].SwapLimiter(nc.evictionLimiterQPS)} else {nc.zonePodEvictor[zone].SwapLimiter(nc.evictionLimiterQPS)}case statePartialDisruption:if nc.runTaintManager {nc.zoneNoExecuteTainter[zone].SwapLimiter(nc.enterPartialDisruptionFunc(zoneSize))} else {nc.zonePodEvictor[zone].SwapLimiter(nc.enterPartialDisruptionFunc(zoneSize))}case stateFullDisruption:if nc.runTaintManager {nc.zoneNoExecuteTainter[zone].SwapLimiter(nc.enterFullDisruptionFunc(zoneSize))} else {nc.zonePodEvictor[zone].SwapLimiter(nc.enterFullDisruptionFunc(zoneSize))}}
}

進行 Pod 驅逐的處理 proceeNoTaintBaseEviction

TaintManger.Run

TainManager 的驅逐邏輯，看代碼不難理解，大概說明
1. 若開啟 TaintManager 模式，所有 Pod、Node 的改變都會被放入，nc.tc.podUpdateQueue 和 nc.tc.nodeUpdateQueue 中
2. 當 Node 失聯時，會被打上 NoExecute Effect Taint（不在此處，在 main Controller.Run 函數中）
3. 此處會先處理 nc.tc.nodeUpdateQueue 的驅逐
  - 首先會檢查 Node 是否有 NoExecute Effect Taint；沒有就取消驅逐
  - 有的話，進行 Pod 的逐個驅逐，檢查 Pod 是否有該 Taint 的 toleration，有的話，就根據 toleration 設置 pod 的定時刪除；沒有 Toleration，就立即刪除
4. 接下來處理 nc.tc.podUpdateQueue 的驅逐
  - 進行 Pod 的逐個驅逐，檢查 Pod 是否有該 Taint 的 toleration，有的話，就根據 toleration 設置 pod 的定時刪除；沒有 Toleration，就立即刪除