Kubernetes' default health check is based on the container's main process: if the process exits with a non-zero return code, the container is considered to have failed.
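This exit-code convention is the same one any shell uses; a minimal Python sketch of it (plain subprocess calls, not Kubernetes itself):

```python
import subprocess

# The convention the kubelet relies on (via the container runtime):
# a process that exits with code 0 is healthy, non-zero means failure.
healthy = subprocess.run(["/bin/sh", "-c", "exit 0"]).returncode
failed = subprocess.run(["/bin/sh", "-c", "exit 3"]).returncode

print(healthy, failed)  # 0 3
```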
Liveness probe
The liveness probe checks whether the pod is still running; when the probe fails, the restart policy decides whether the container is restarted. It suits workloads that must be restarted immediately when they fail.
A configurable handler (exec, tcpSocket, httpGet, etc.) checks whether the containers in the pod are running normally.
The YAML file is as follows:
[root@master yam_files]# cat live-http.yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
  namespace: default
  labels:
    app: nginx
spec:
  containers:
  - name: liveness
    image: nginx
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /index.html
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /index.html
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
  restartPolicy: Always
The new pod runs nginx and is probed over HTTP: both the liveness and readiness probes start 5s after the container starts (initialDelaySeconds) and then run every 10s (periodSeconds).
After the pod starts, break it by deleting the probed path, index.html:
kubectl exec -it liveness-http -- /bin/bash
root@liveness-http:/# cd /usr/share/nginx/html
root@liveness-http:/usr/share/nginx/html# ls
50x.html index.html
root@liveness-http:/usr/share/nginx/html# rm index.html
Once the probe detects the failure, the pod is restarted:
kubectl get pods -l app=nginx -w
NAME READY STATUS RESTARTS AGE
liveness-http 1/1 Running 0 5m37s
nginx-test-6cf9d87fbf-26h6m 1/1 Running 0 127m
nginx-test-6cf9d87fbf-wn94b 1/1 Running 0 7d17h
liveness-http 0/1 Running 0 5m50s
liveness-http 0/1 Running 1 (2s ago) 5m52s
liveness-http 1/1 Running 1 (10s ago) 6m
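The gap between deleting index.html and the restart follows from the probe parameters: with periodSeconds of 10 and the default failureThreshold of 3, detection can take up to 30s. A sketch of that arithmetic (the helper function is illustrative, not a Kubernetes API):

```python
def max_detection_seconds(period_seconds: int, failure_threshold: int = 3) -> int:
    """Worst-case time for a liveness probe to declare failure:
    failureThreshold consecutive probes must fail, one per period."""
    return period_seconds * failure_threshold

print(max_detection_seconds(10))  # 30
```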
Next, the same experiment with a TCP probe; the YAML file is as follows:
cat live-tcp.yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-tcp
spec:
  containers:
  - name: liveness
    image: nginx
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 2
      periodSeconds: 3
Then start the pod and stop the nginx service inside it:
kubectl exec -it liveness-tcp -- /bin/bash
root@liveness-tcp:/# nginx -s stop
2024/06/24 05:44:54 [notice] 45#45: signal process started
The pod is then restarted:
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
first 1/1 Running 0 158m
liveness-http 1/1 Running 1 (25m ago) 31m
liveness-tcp 1/1 Running 0 57s
nginx-test-6cf9d87fbf-26h6m 1/1 Running 0 153m
nginx-test-6cf9d87fbf-wn94b 1/1 Running 0 7d17h
liveness-tcp 0/1 Completed 0 2m
liveness-tcp 1/1 Running 1 (1s ago) 2m1s
Readiness probe
The readiness probe checks whether the container can accept requests. If it fails, Kubernetes immediately stops forwarding new traffic to the container by removing it from the Service's endpoints.
First create a Service, then create an nginx pod, and forward traffic to the pod through the Service:
[root@master yam_files]# cat readiness-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: readiness
  namespace: default
spec:
  selector:
    app: my-pod
  ports:
  - port: 80
    targetPort: 80
[root@master yam_files]# cat ready-http.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /index.html
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 2
      successThreshold: 1
The readiness probe is configured to begin 30s after the container starts and to run every 10s.
Initially the Service has no connection to the pod (Endpoints is empty):
kubectl describe svc readiness
Name: readiness
Namespace: default
Labels: <none>
Annotations: <none>
Selector: app=my-pod
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.107.242.111
IPs: 10.107.242.111
Port: <unset> 80/TCP
TargetPort: 80/TCP
Endpoints:
Session Affinity: None
Events: <none>
Once the readiness probe succeeds, the Service connects to the pod:
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
nginx-test-6cf9d87fbf-26h6m 1/1 Running 0 3h28m
nginx-test-6cf9d87fbf-wn94b 1/1 Running 0 7d18h
my-pod 0/1 Pending 0 0s
my-pod 0/1 Pending 0 0s
my-pod 0/1 ContainerCreating 0 0s
my-pod 0/1 ContainerCreating 0 2s
my-pod 0/1 Running 0 3s
my-pod 1/1 Running 0 40s
kubectl describe svc readiness
Name: readiness
Namespace: default
Labels: <none>
Annotations: <none>
Selector: app=my-pod
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.107.242.111
IPs: 10.107.242.111
Port: <unset> 80/TCP
TargetPort: 80/TCP
Endpoints: 10.244.166.145:80
Session Affinity: None
Events: <none>
Checking my-pod's IP, 10.244.166.145, it matches the Service endpoint above. In production you would normally run multiple replicas, so all pod IPs appear in the endpoints, and those IPs are also written into the node's firewall (iptables) rules.
kubectl get pods -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-pod 1/1 Running 0 14m 10.244.166.145 node1 <none> <none>
Startup probe
The startup probe checks whether the container has started successfully and is ready to receive traffic. It runs only during the startup phase: once it succeeds it stops, and the other probes take over.
initialDelaySeconds: how long after container start to begin probing, so the probe does not fire before initialization finishes
periodSeconds: probe interval, 10s by default
timeoutSeconds: probe timeout, 1s by default
If no startup probe is configured, its state defaults to Success.
apiVersion: v1
kind: Pod
metadata:
  name: startupprobe
spec:
  containers:
  - name: startup
    image: xianchao/tomcat-8.5-jre8:v1
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 8080
    startupProbe:
      exec:
        command:
        - "/bin/bash"
        - "-c"
        - "ps aux | grep tomcat"
      initialDelaySeconds: 20   # how long after container start to begin probing
      periodSeconds: 20         # interval between probes
      timeoutSeconds: 10        # how long to wait for a probe response
      successThreshold: 1       # consecutive successes required to count as success
      failureThreshold: 3       # consecutive failures required to count as failure
Using an exec probe, the pod becomes Ready 40s after starting: time to readiness = initialDelaySeconds + periodSeconds, because the first probe, run at initialDelaySeconds, did not succeed.
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
nginx-test-6cf9d87fbf-26h6m 1/1 Running 0 11h
nginx-test-6cf9d87fbf-wn94b 1/1 Running 0 8d
startupprobe 0/1 Running 0 23s
startupprobe 0/1 Running 0 40s
startupprobe 1/1 Running 0 40s
Modify the YAML so that the probe fails, and the pod starts restarting; change the probe command to:
"aa ps aux | grep tomcat"
Time to the first restart: initialDelaySeconds + periodSeconds × failureThreshold = 20 + 20 × 3 = 80s, assuming each probe attempt fails immediately (timeoutSeconds only adds time when a probe hangs):
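Assuming each failed exec probe returns immediately (so timeoutSeconds does not add to the schedule), the time to the first restart can be sketched as (illustrative helper, not a Kubernetes API):

```python
def first_restart_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time from container start to the first restart when
    every startup probe attempt fails quickly: initialDelaySeconds of
    delay plus failureThreshold probe periods."""
    return initial_delay + period * failure_threshold

print(first_restart_seconds(20, 20, 3))  # 80, matching the watch output
```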
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
startupprobe 0/1 ContainerCreating 0 0s
startupprobe 0/1 ContainerCreating 0 1s
startupprobe 0/1 Running 0 2s
startupprobe 0/1 Running 1 (0s ago) 80s
startupprobe 0/1 Running 2 (1s ago) 2m21s
Probing with tcpSocket:
apiVersion: v1
kind: Pod
metadata:
  name: startupprobe
spec:
  containers:
  - name: startup
    image: xianchao/tomcat-8.5-jre8:v1
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 8080
    startupProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 20   # how long after container start to begin probing
      periodSeconds: 20         # interval between probes
      timeoutSeconds: 10        # how long to wait for a probe response
      successThreshold: 1       # consecutive successes required to count as success
      failureThreshold: 3       # consecutive failures required to count as failure
Full-lifecycle health monitoring
Currently livenessProbe, readinessProbe, and startupProbe all support the following three probe handlers:
exec: the probe succeeds if the command runs successfully and exits with code 0.
TCPSocketAction: performs a TCP check against the container's IP address and port; if a TCP connection can be established, the container is considered healthy.
HTTPGetAction: performs an HTTP GET against the container's IP address and port; if the response status code is at least 200 and less than 400, the container is considered healthy.
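The HTTP success criterion can be written as a one-line predicate (sketch):

```python
def http_probe_ok(status_code: int) -> bool:
    """An httpGet probe succeeds when 200 <= status < 400,
    so 2xx responses and 3xx redirects both pass."""
    return 200 <= status_code < 400

print(http_probe_ok(200), http_probe_ok(302), http_probe_ok(404))  # True True False
```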
apiVersion: v1
kind: Pod
metadata:
  name: life-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: docker.io/xianchao/nginx:v1
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command: ["/bin/bash", "-c", "echo 'lifecycle hooks handler' > /usr/share/nginx/html/test.html"]
      preStop:
        exec:
          command:
          - "/bin/sh"
          - "-c"
          - "nginx -s stop"
postStart and preStop are container lifecycle hooks. The PostStart hook is called immediately after the container is created: after the container's main process has started, but before the container is considered ready. If the PostStart hook fails, the container is treated as having failed to start, and Kubernetes handles it according to the container's restart policy. The PreStop hook is designed to run cleanup tasks before the container terminates (closing connections, releasing resources, and so on).
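One point worth noting: preStop runs inside the pod's termination grace period (30s by default), so a slow hook needs a larger terminationGracePeriodSeconds. A hedged fragment (the 60s value and the sleep are illustrative, not taken from the example above):

```yaml
spec:
  terminationGracePeriodSeconds: 60   # illustrative; the default is 30
  containers:
  - name: nginx
    image: nginx
    lifecycle:
      preStop:
        exec:
          # drain connections before SIGTERM; the sleep length is illustrative
          command: ["/bin/sh", "-c", "nginx -s quit; sleep 5"]
```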
apiVersion: v1
kind: Pod
metadata:
  name: check
  namespace: default
  labels:
    app: check
spec:
  containers:
  - name: check
    image: busybox:1.28
    imagePullPolicy: IfNotPresent
    command:
    - /bin/sh
    - -c
    - sleep 10; exit
After starting this pod, the container exits after 10s but is restarted over and over (eventually backing off into CrashLoopBackOff):
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
nginx-test-6cf9d87fbf-26h6m 1/1 Running 0 11h
nginx-test-6cf9d87fbf-wn94b 1/1 Running 0 8d
check 0/1 Pending 0 0s
check 0/1 Pending 0 0s
check 0/1 ContainerCreating 0 0s
check 0/1 ContainerCreating 0 1s
check 1/1 Running 0 2s
check 0/1 Completed 0 12s
check 1/1 Running 1 (2s ago) 13s
check 0/1 Completed 1 (12s ago) 23s
check 0/1 CrashLoopBackOff 1 (14s ago) 36s
check 1/1 Running 2 (14s ago) 36s
check 0/1 Completed 2 (24s ago) 46s
check 0/1 CrashLoopBackOff 2 (12s ago) 58s
check 1/1 Running 3 (24s ago) 70s
check 0/1 Completed 3 (34s ago) 80s
check 0/1 CrashLoopBackOff 3 (16s ago) 95s
check 1/1 Running 4 (43s ago) 2m2s
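The growing gaps between restarts above come from the kubelet's crash-loop back-off: the delay starts at 10s and roughly doubles, capped at five minutes. A sketch of that schedule (illustrative helper; the exact timing also depends on how long the container ran before crashing):

```python
def crashloop_backoff_seconds(restart_index: int, base: int = 10, cap: int = 300) -> int:
    """Approximate back-off before restart number restart_index (0-based):
    10s, 20s, 40s, ... doubling, capped at 300s."""
    return min(base * 2 ** restart_index, cap)

print([crashloop_backoff_seconds(i) for i in range(6)])  # [10, 20, 40, 80, 160, 300]
```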
A pod manifest combining all three probes:
apiVersion: v1
kind: Service
metadata:
  name: springboot-live
  labels:
    app: springboot
spec:
  type: NodePort
  ports:
  - name: server
    port: 8080
    targetPort: 8080
    nodePort: 31180
  - name: management
    port: 8081
    targetPort: 8081
    nodePort: 31181
  selector:
    app: springboot
---
apiVersion: v1
kind: Pod
metadata:
  name: springboot-live
  labels:
    app: springboot
spec:
  containers:
  - name: springboot
    image: mydlqclub/springboot-helloworld:0.0.1
    imagePullPolicy: IfNotPresent
    ports:
    - name: server
      containerPort: 8080
    - name: management
      containerPort: 8081
    readinessProbe:
      initialDelaySeconds: 20
      periodSeconds: 5
      timeoutSeconds: 10
      httpGet:
        scheme: HTTP
        port: 8081
        path: /actuator/health
    livenessProbe:
      initialDelaySeconds: 20
      periodSeconds: 5
      timeoutSeconds: 10
      httpGet:
        scheme: HTTP
        port: 8081
        path: /actuator/health
    startupProbe:
      initialDelaySeconds: 20
      periodSeconds: 5
      timeoutSeconds: 10
      httpGet:
        scheme: HTTP
        port: 8081
        path: /actuator/health
After it starts, the Service works normally:
kubectl describe svc springboot-live
Name: springboot-live
Namespace: default
Labels: app=springboot
Annotations: <none>
Selector: app=springboot
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.98.119.36
IPs: 10.98.119.36
Port: server 8080/TCP
TargetPort: 8080/TCP
NodePort: server 31180/TCP
Endpoints: 10.244.166.156:8080
Port: management 8081/TCP
TargetPort: 8081/TCP
NodePort: management 31181/TCP
Endpoints: 10.244.166.156:8081
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Then exec into the pod and kill the running process:
kubectl exec -it springboot-live -- /bin/sh
/ # ls
app.jar dev home media opt root sbin sys usr
bin etc lib mnt proc run srv tmp var
/ # ps -ef | grep springboot
   63 root      0:00 grep springboot
/ # ps -ef | grep hello
   65 root      0:00 grep hello
/ # kill 1
/ # command terminated with exit code 137
Checking the pod status, the pod restarts, and about 20s later it is Ready again:
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
springboot-live 1/1 Running 0 2m36s
springboot-live 0/1 Error 0 17m
springboot-live 0/1 Running 1 (1s ago) 17m
springboot-live 0/1 Running 1 (22s ago) 17m
springboot-live 1/1 Running 1 (22s ago) 17m