k8s graceful restart

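In theory, once a pod is in the Terminating state, Kubernetes removes it from the Service, so configuring a graceful shutdown period should be all you need. You can verify this with kubectl get endpoints.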

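So the real problem of a graceful restart is how to close idle keep-alive connections and then wait for in-flight requests to finish.
Some underlying HTTP servers (such as uvicorn) shut the process down gracefully when they receive SIGTERM. This includes cleaning up all active connections, idle HTTP keep-alive connections among them. You can verify this as follows: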

telnet <ip> <port>
# Type the following lines, then press Enter twice
GET /health HTTP/1.1
Host: <ip>
Connection: keep-alive

You will see a normal HTTP response, and the connection stays open:

date: Fri, 24 Jan 2025 02:05:43 GMT
server: uvicorn
content-length: 4
content-type: application/json

"ok"

Now put the pod into the Terminating state (for example by deleting it), and you will see the connection get closed: Connection closed by foreign host.
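The telnet check can also be scripted. The sketch below is a minimal stdlib version, assuming the server frames its responses with Content-Length (chunked bodies are not handled); the helper names read_response, send_keepalive_get and wait_for_close are made up for this example. It sends one keep-alive request and then blocks until the server closes the socket or the timeout expires.

```python
import socket


def read_response(sock: socket.socket) -> tuple[str, dict[str, str], bytes]:
    """Read one HTTP/1.1 response framed by Content-Length (no chunked bodies)."""
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("connection closed before the headers were complete")
        buf += chunk
    head, _, body = buf.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", "0"))
    while len(body) < length:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("connection closed in the middle of the body")
        body += chunk
    # (status line, lower-cased headers, body)
    return lines[0], headers, body[:length]


def send_keepalive_get(sock: socket.socket, host: str, path: str = "/health"):
    """Send the same request as the telnet session and return the parsed response."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))
    return read_response(sock)


def wait_for_close(sock: socket.socket, timeout: float) -> bool:
    """Return True if the server closes the connection within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        # recv() returns b"" once the peer has closed its end of the connection.
        return sock.recv(1) == b""
    except ConnectionResetError:
        # An abortive close (RST) also means the server dropped the connection.
        return True
    except socket.timeout:
        return False
```

In practice you would open the socket with socket.create_connection((ip, port)), call send_keepalive_get once, trigger the termination (for example with kubectl delete pod), and then call wait_for_close with a timeout a little longer than terminationGracePeriodSeconds.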

Introduction

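When running containers on Kubernetes, you usually configure probes to keep pods healthy, and use terminationGracePeriodSeconds to control the maximum time a pod may spend cleaning up after it receives a termination signal.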

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: my-app-container
          image: my-app:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 10

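With readiness and liveness probes, traffic is only routed to a container once it is ready, and a failed container is restarted automatically.
For applications with strict request success-rate requirements, however, this setup has a serious problem:
during a rolling update, terminationGracePeriodSeconds delays the container's exit and gives in-flight requests some time to finish, but new requests keep arriving while the pod is terminating, so some requests are still affected.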

How graceful restart works

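At its core, a graceful restart means that a pod being torn down must stop receiving new requests. When a pod switches to Terminating, it is sent a SIGTERM signal. The application needs to catch that signal and make the readiness probe's health-check endpoint return a status code of 400 or above (503 signals "not ready"). After failureThreshold consecutive failures, Kubernetes stops routing new requests to the pod, and the remaining time lets in-flight requests finish.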
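To make the threshold logic concrete, here is a simplified Python model. It is not kubelet's code: it ignores initialDelaySeconds, timeouts and connection errors, and the names probe_succeeded and ReadinessModel are invented for illustration. Only the 200 to 399 success range for HTTP probes and the "consecutive results" meaning of failureThreshold/successThreshold follow the Kubernetes documentation.

```python
def probe_succeeded(status_code: int) -> bool:
    # Kubernetes counts an HTTP probe as successful when 200 <= code < 400.
    return 200 <= status_code < 400


class ReadinessModel:
    """Simplified model of how consecutive probe results flip readiness."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = False  # a pod starts out not ready until a probe succeeds
        self._failures = 0
        self._successes = 0

    def observe(self, status_code: int) -> bool:
        """Record one probe result and return the resulting readiness."""
        if probe_succeeded(status_code):
            self._failures = 0
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._successes = 0
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready
```

Feeding it 200, 200, 503, 503, 503 shows the point of the design: the first two 503s leave the pod ready, and only the third consecutive failure takes it out of rotation.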

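Take the YAML from the introduction: after receiving SIGTERM, the pod marks its health-check endpoint as unavailable. The readiness probe runs every 10 seconds, and traffic stops after 3 consecutive failures. Because the first failing probe fires within 10 seconds of SIGTERM and the next two follow at 10-second intervals, that happens roughly 20 to 30 seconds in, plus a short delay for the endpoint change to propagate. With terminationGracePeriodSeconds set to 60 seconds, in-flight requests are left with roughly 30 to 40 seconds (minus that delay) to finish. If that is not enough, consider increasing terminationGracePeriodSeconds.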
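The timing arithmetic fits in a small helper. This is a sketch under stated assumptions: probes fire exactly every periodSeconds, the failing /health answers well within timeoutSeconds, and propagation_s is a hypothetical allowance for the endpoint change to reach kube-proxy or the load balancer. The function names are made up for this example.

```python
def unready_window(period_s: float, failure_threshold: int,
                   propagation_s: float = 0.0) -> tuple[float, float]:
    """Seconds after SIGTERM until traffic stops, as (earliest, latest).

    The first failing probe runs somewhere in (0, period_s] after SIGTERM,
    and each further failure needed to reach the threshold adds one period.
    """
    earliest = (failure_threshold - 1) * period_s + propagation_s
    latest = failure_threshold * period_s + propagation_s
    return earliest, latest


def drain_budget(grace_s: float, period_s: float, failure_threshold: int,
                 propagation_s: float = 0.0) -> tuple[float, float]:
    """Time left for in-flight requests inside the grace period, as (min, max)."""
    earliest, latest = unready_window(period_s, failure_threshold, propagation_s)
    return grace_s - latest, grace_s - earliest
```

Plug in the Deployment's values (periodSeconds, failureThreshold, terminationGracePeriodSeconds) to see how much of the grace period is left for draining; if the lower bound gets close to zero, raise terminationGracePeriodSeconds or lower periodSeconds.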

Graceful restart examples

Python

In Python, the built-in signal module can be used to listen for signals.

import logging
import signal
import threading
from functools import partial

from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse

app = FastAPI()
stop_event = threading.Event()  # set once a termination signal arrives


def _handler_termination_signal(signum, frame, app: FastAPI) -> None:
    match signum:
        case signal.SIGINT:
            logging.info("Received SIGINT signal, mark service to unhealthy.")
        case signal.SIGTERM:
            logging.info("Received SIGTERM signal, mark service to unhealthy.")
        case _:
            logging.warning(f"Received unexpected signal: {signum}")
            return
    # Flip the flag so that /health starts returning 503
    stop_event.set()


signal.signal(signal.SIGTERM, partial(_handler_termination_signal, app=app))
signal.signal(signal.SIGINT, partial(_handler_termination_signal, app=app))  # Ctrl+C


@app.get("/health")
async def health_check(request: Request):
    if stop_event.is_set():
        return PlainTextResponse("stopped", status_code=503)
    return "ok"

gunicorn

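gunicorn manages its own master process and worker processes, so a SIGTERM handler registered with signal in the application code does not receive the signal. Instead, the handler has to be installed through gunicorn's own server hooks.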

Create a gunicorn_config.py file:

import logging
import signal


# Handler for SIGTERM
def handle_sigterm(signum, frame):
    from main import stop_event

    logging.info("Worker received SIGTERM, setting health to unhealthy...")
    stop_event.set()


# Install the signal handler when each worker initializes
def post_worker_init(worker):
    signal.signal(signal.SIGTERM, handle_sigterm)
    logging.info("Signal handler for SIGTERM set in worker")

Point gunicorn at the config file when starting it:

gunicorn -c gunicorn_config.py main:app

The health-check endpoint in main.py reads stop_event:

import json
import os
import threading

from flask import Flask, Response

app = Flask(__name__)
stop_event = threading.Event()  # set by handle_sigterm in gunicorn_config.py


@app.route("/health")
def health():
    if stop_event.is_set():
        return Response(
            json.dumps({"pid": os.getpid(), "status": "unhealthy"}),
            status=503,
            content_type="application/json",
        )
    else:
        return Response(
            json.dumps({"pid": os.getpid(), "status": "ok"}),
            status=200,
            content_type="application/json",
        )
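To watch the endpoint flip from 200 to 503 during a rollout (for example from another pod, or through kubectl port-forward), a small stdlib poller is enough. This is only a sketch: health_status and watch_health are hypothetical helpers, proxies are bypassed on purpose so the request goes straight to the pod, and connection errors (such as the pod already being gone) are left to propagate.

```python
import time
import urllib.error
import urllib.request

# Ignore any http_proxy environment settings so requests hit the target directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def health_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of GET url; HTTP error codes such as 503 are returned, not raised."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def watch_health(url: str, interval: float = 1.0, max_checks: int = 60) -> list[int]:
    """Poll url until it stops answering 2xx/3xx (or max_checks is hit); return observed codes."""
    codes = []
    for _ in range(max_checks):
        code = health_status(url)
        codes.append(code)
        if not 200 <= code < 400:
            break
        time.sleep(interval)
    return codes
```

Run watch_health against the pod's /health while deleting the pod; the last code in the returned list should be the 503 produced once stop_event is set.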
