Problem description
A large number of Connection refused errors suddenly appeared in the nginx error log, as shown below:
2020/03/19 09:52:53 [error] 20117#20117: *7403411764 connect() failed (111: Connection refused) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: , request: "POST /post/result/lol?type=Bet HTTP/1.1", upstream: "http://xxx.xxx.xxx.xxx/post/result/lol?type=Bet", host: "xxx.xxx.xxx.xxx"
2020/03/19 09:52:53 [error] 20117#20117: *7403411774 connect() failed (111: Connection refused) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: , request: "POST /post/result/csgo?type=RollingBet HTTP/1.1", upstream: "http://xxx.xxx.xxx.xxx/post/result/csgo?type=RollingBet", host: "xxx.xxx.xxx.xxx"
2020/03/19 09:52:54 [error] 20116#20116: *7403411815 connect() failed (111: Connection refused) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: , request: "POST /post/result/lol?type=Bet HTTP/1.1", upstream: "http://xxx.xxx.xxx.xxx/post/result/lol?type=Bet", host: "xxx.xxx.xxx.xxx"
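At first I assumed an upstream server node had gone down, but the servers were running normally. The errors erupted suddenly and in large numbers, and the nginx and server monitoring showed a sharp spike in the number of connections. This suggests that under high load the system responded more slowly, requests timed out or failed, and TIME_WAIT connections piled up.
Problem diagnosis
I checked the TCP connection states with the following command: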
# netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
TIME_WAIT 35423
CLOSE_WAIT 23602
SYN_SENT 62
FIN_WAIT1 61
FIN_WAIT2 259
ESTABLISHED 7543
SYN_RECV 3
CLOSING 35
LAST_ACK 507
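The number of connections in the *_WAIT states is far too high. After a TCP connection is closed, the side that closed it keeps the socket in TIME_WAIT for a while before the port is released. Under heavy concurrency, large numbers of TIME_WAIT connections build up; if they are not released promptly they tie up ports and server resources, and many new connections end up being refused.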
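How close TIME_WAIT sockets come to using up local ports depends on the size of the ephemeral port range. As a rough sketch (the function name `ephemeral_port_count` is made up for illustration, and the range in the comment is only a commonly seen Linux default, not a value taken from the machine above), the number of usable local ports can be computed like this:

```shell
#!/bin/sh
# Hypothetical helper: print how many ports lie in an inclusive "low high" range,
# the format used by /proc/sys/net/ipv4/ip_local_port_range.
ephemeral_port_count() {
    printf '%s\n' "$1" | awk '{ print $2 - $1 + 1 }'
}

# Usage on a live Linux system:
#   ephemeral_port_count "$(cat /proc/sys/net/ipv4/ip_local_port_range)"
# With an assumed range of "32768 60999" the helper prints 28232.
```

The port limit applies per destination address and port, so comparing this figure with the TIME_WAIT count is only a rough indicator.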
Modify system parameters
# vim /etc/sysctl.conf
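# Time an orphaned connection stays in FIN_WAIT2; default 60, in seconds.
net.ipv4.tcp_fin_timeout = 30
# TCP timestamps protect against wrapped sequence numbers; enabled (1) by default on Linux, and tcp_tw_reuse depends on them.
net.ipv4.tcp_timestamps = 1
# Allow TIME-WAIT sockets to be reused for new outgoing connections; 0 disables it (the default on older kernels).
net.ipv4.tcp_tw_reuse = 1
# Fast recycling of TIME-WAIT sockets; default 0 (off). It is known to break clients behind NAT and was removed in Linux 4.12, where this key no longer exists.
net.ipv4.tcp_tw_recycle = 1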
Apply the configuration
/sbin/sysctl -p
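To confirm which values the running kernel actually uses, the settings can be read back from /proc/sys, where each sysctl key maps to a path with the dots replaced by slashes. A minimal sketch, with the helper names `sysctl_path` and `show_sysctl` invented for illustration; a key the kernel does not expose (as may be the case for tcp_tw_recycle on newer kernels) is reported instead of read:

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the live value of a sysctl key.

# Map a key such as net.ipv4.tcp_fin_timeout to its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print "key = value", or note that the running kernel does not expose the key.
show_sysctl() {
    path=$(sysctl_path "$1")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$1" "$(cat "$path")"
    else
        printf '%s: not available on this kernel\n' "$1"
    fi
}

# Read back the four keys edited in /etc/sysctl.conf above.
for key in net.ipv4.tcp_fin_timeout net.ipv4.tcp_timestamps \
           net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle; do
    show_sysctl "$key"
done
```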
After the change, the number of connections in the *_WAIT states dropped and nginx no longer reported Connection refused.
# netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
TIME_WAIT 2521
CLOSE_WAIT 13602
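Where netstat is not installed, the same per-state tally can usually be taken from `ss -tan`, whose first column is the connection state. A minimal sketch, assuming the iproute2 `ss` output layout with a single header line (the function name `count_tcp_states` is illustrative). Note that ss abbreviates state names, for example ESTAB and TIME-WAIT:

```shell
#!/bin/sh
# Illustrative helper: read `ss -tan` style output on stdin, skip the header line,
# and print each TCP state followed by the number of sockets in that state.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'
}

# Usage on a live system (needs ss from iproute2):
#   ss -tan | count_tcp_states
```

The order of the output lines is not fixed, so pipe the result through `sort` when comparing runs taken before and after a change.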