Learning Prometheus: the Pushgateway and Alertmanager Components

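Introduction

As we dig deeper into Prometheus, we usually get to know its core components well, but several auxiliary components are just as indispensable: they extend and round out what the monitoring system can do. This post focuses on two of them, Pushgateway and Alertmanager. We look at how they work, where they are used, and how they cooperate with the rest of the Prometheus monitoring stack.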

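I. Background

As the internet industry has grown, complex applications and microservice architectures have multiplied, and keeping them running properly is critical to business operations. To keep these systems in good health and to find and fix problems promptly, a monitoring system has become essential infrastructure.

Prometheus is an open-source monitoring and alerting system that stands out for its data model, its flexible query language (PromQL), and its efficient time-series database. It collects metrics from monitored targets in pull mode. This works well in most scenarios, but it falls short in a few special ones, such as short-lived batch jobs, ad hoc tasks, or environments restricted by a firewall. Pushgateway was created for these cases: clients push their metrics to Pushgateway, and Prometheus then scrapes them from it, which covers the gap left by pull mode.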

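On the alerting side, when a metric becomes abnormal, sending a timely and accurate notification is essential for a fast response. Prometheus can evaluate alerting rules on its own, but complex notification policies, alert grouping, inhibition, and integration with multiple notification channels are beyond what it handles by itself. That is the job of Alertmanager. As the alert-management component of Prometheus, it receives alerts from Prometheus Server, groups, deduplicates, and inhibits them, and then sends them to the right receivers according to preset rules through channels such as email, Slack, or PagerDuty.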
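To make the idea of grouping concrete, here is a minimal Python sketch. It is only an illustration of the concept, not Alertmanager's implementation; the function name `group_alerts` and the sample label sets are made up for this example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts that share the same values for every group_by label
        # end up in the same group and can be sent as one notification.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"alertname": "NodeDown", "instance": "10.0.0.41:9100"},
        {"alertname": "NodeDown", "instance": "10.0.0.42:9100"},
        {"alertname": "DiskFull", "instance": "10.0.0.43:9100"},
    ]
    for key, members in group_alerts(sample, ["alertname"]).items():
        print(key, len(members))
```

Grouping by `alertname` puts the two hypothetical `NodeDown` alerts into one notification instead of two.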

With these two components in place, a Prometheus monitoring system can cover many more scenarios and give stronger guarantees that systems keep running. The rest of this post covers each component in turn: installation and deployment, configuration and use, and worked examples.

II. Deploying Pushgateway

1. Download the package

Download Pushgateway:
[root@node-exporter41 ~]# wget https://github.com/prometheus/pushgateway/releases/download/v1.11.0/pushgateway-1.11.0.linux-amd64.tar.gz

Extract the binary:
[root@node-exporter41 ~]# tar xf pushgateway-1.11.0.linux-amd64.tar.gz -C /usr/local/bin/ pushgateway-1.11.0.linux-amd64/pushgateway --strip-components=1
[root@node-exporter41 ~]# 
[root@node-exporter41 ~]# ll /usr/local/bin/pushgateway 
-rwxr-xr-x 1 1001 1002 20656129 Jan  9 22:36 /usr/local/bin/pushgateway*
[root@node-exporter41 ~]#

2. Run Pushgateway

[root@node-exporter41 ~]# pushgateway --web.telemetry-path="/metrics" --web.listen-address=:9091 --persistence.file=/data/pushgateway.data

Open the Pushgateway web UI:
http://10.0.0.41:9091/#

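3. Use Pushgateway to monitor TCP connection states

[root@elk93 ~]# cat /usr/local/bin/tcp_status2.sh
#!/bin/bash
# Count the TCP connections in each state and push the result to Pushgateway.
pushgateway_url="http://10.0.0.41:9091/metrics/job/tcp_status"
# Note: this timestamp is not used further down.
time=$(date +%Y-%m-%d+%H:%M:%S)
state="SYN-SENT SYN-RECV FIN-WAIT-1 FIN-WAIT-2 TIME-WAIT CLOSE CLOSE-WAIT LAST-ACK LISTEN CLOSING ESTAB"
for i in $state
do
    # Compare the state column exactly so that CLOSE does not also count CLOSE-WAIT and CLOSING.
    t=$(ss -tan | awk -v s="$i" '$1 == s' | wc -l)
    echo "tcp_connections{state=\"$i\"} $t" >> /tmp/tcp.txt
done
cat /tmp/tcp.txt | curl --data-binary @- "$pushgateway_url"
rm -f /tmp/tcp.txt
[root@elk93 ~]# 

Run the script:
[root@elk93 ~]# bash /usr/local/bin/tcp_status2.sh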
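The same idea can be sketched in Python, which makes the two steps (counting and rendering the payload) easier to test separately. This is an illustrative sketch, not a replacement for the script above: the function names are made up, the `# TYPE` line is an optional addition, and the Pushgateway URL is the one used in this post.

```python
import urllib.request

STATES = ["SYN-SENT", "SYN-RECV", "FIN-WAIT-1", "FIN-WAIT-2", "TIME-WAIT",
          "CLOSE", "CLOSE-WAIT", "LAST-ACK", "LISTEN", "CLOSING", "ESTAB"]


def count_states(ss_output, states=STATES):
    """Count connections per state from the text printed by `ss -tan`."""
    counts = {state: 0 for state in states}
    for line in ss_output.splitlines():
        fields = line.split()
        # The first column is the state; compare it exactly.
        if fields and fields[0] in counts:
            counts[fields[0]] += 1
    return counts


def render_payload(counts):
    """Render the counts in the Prometheus text exposition format."""
    lines = ["# TYPE tcp_connections gauge"]
    for state, value in counts.items():
        lines.append(f'tcp_connections{{state="{state}"}} {value}')
    return "\n".join(lines) + "\n"


def push(payload, url="http://10.0.0.41:9091/metrics/job/tcp_status"):
    """POST the payload to Pushgateway, like `curl --data-binary @-` does."""
    request = urllib.request.Request(url, data=payload.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

A caller would feed the output of `ss -tan` (for example via `subprocess.run`) into `count_states`, then pass the rendered payload to `push`.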

III. Deploying a single-node Alertmanager

1. Download the package

Download Alertmanager:
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz

Extract the archive:
[root@node-exporter43 ~]# tar xf alertmanager-0.28.1.linux-amd64.tar.gz  -C /usr/local/
[root@node-exporter43 ~]#

2. Edit the Alertmanager configuration file

[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# pwd
/usr/local/alertmanager-0.28.1.linux-amd64
[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# cat alertmanager.yml 
# Global settings
global:
  resolve_timeout: 5m
  smtp_from: '914@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '914@qq.com'
  smtp_auth_password: 'ahjplbbja'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
# Routing
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'sre_system'
  # Child routes
  routes:
  - receiver: 'sre_ops'
    match_re:
      job: linux96_ops_exporter
    # Setting continue to true is recommended: an alert that matches this route
    # is still evaluated against the routes that follow it.
    # The goal is for the alert to also reach the final system group (sre_system).
    continue: true
  - receiver: 'sre_k8s'
    match_re:
      job: linux96_k8s_exporter
    continue: true
  - receiver: 'sre_system'
    match_re:
      job: .*
    continue: true
# Receivers
receivers:
- name: 'sre_ops'
  email_configs:
  - to: '914@qq.com'
    send_resolved: true
  - to: '914@qq.com'
    send_resolved: true
- name: 'sre_k8s'
  email_configs:
  - to: '5@qq.com'
    send_resolved: true
  - to: '5689@qq.com'
    send_resolved: true
- name: 'sre_system'
  email_configs:
  - to: '914@qq.com'
    send_resolved: true
  - to: '56@qq.com'
    send_resolved: true
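The effect of `continue: true` is easiest to see by walking the routing tree by hand. The Python sketch below mirrors the route block of this configuration and follows my understanding of how Alertmanager resolves routes (children are tried in order, and evaluation stops at the first matching child unless `continue` is set). It is a simplified model for illustration, not Alertmanager's code; `ROUTE_TREE`, `matches`, and `resolve` are names invented here.

```python
import re

# Simplified copy of the route block from alertmanager.yml.
ROUTE_TREE = {
    "receiver": "sre_system",
    "routes": [
        {"receiver": "sre_ops", "match_re": {"job": "linux96_ops_exporter"}, "continue": True},
        {"receiver": "sre_k8s", "match_re": {"job": "linux96_k8s_exporter"}, "continue": True},
        {"receiver": "sre_system", "match_re": {"job": ".*"}, "continue": True},
    ],
}


def matches(route, labels):
    """match_re patterns are anchored; a missing label counts as an empty string."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in route.get("match_re", {}).items()
    )


def resolve(route, labels):
    """Return the receivers that an alert with these labels is routed to."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        child_receivers = resolve(child, labels)
        receivers.extend(child_receivers)
        # Without continue, the first matching child ends the search.
        if child_receivers and not child.get("continue", False):
            break
    # If no child matched, the route's own receiver handles the alert.
    if not receivers:
        receivers.append(route["receiver"])
    return receivers


if __name__ == "__main__":
    print(resolve(ROUTE_TREE, {"job": "linux96_ops_exporter"}))
    print(resolve(ROUTE_TREE, {"job": "whatever_job_exporter"}))
```

In this model an alert from `linux96_ops_exporter` reaches both `sre_ops` and `sre_system`, while an alert from `whatever_job_exporter` reaches only `sre_system`. If `continue` were false on the first route, the ops alert would stop at `sre_ops`.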

3. Validate the configuration

[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# ./amtool check-config alertmanager.yml 
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 3 receivers
 - 0 templates

4. Start Alertmanager and test it

[root@node-exporter43 alertmanager-0.28.1.linux-amd64]# ./alertmanager 

Open the web UI to test:
http://10.0.0.41:9093/#/status

IV. Integrating Prometheus server with Alertmanager for alerting

1. Edit the Prometheus configuration file to enable alerting

[root@prometheus-server31 /softwares/prometheus-2.53.4.linux-amd64]# cat prometheus.yml
# my global config
global:
  scrape_interval: 3s # Set the scrape interval to every 3 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration (enable this block to turn on alerting)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.0.0.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# The alerting rule files listed here can be named as you like.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "linux96-rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  # Static target configuration
  - job_name: "linux96_ops_exporter"
    static_configs:
      - targets:
          - 10.0.0.41:9100
  - job_name: "linux96_k8s_exporter"
    static_configs:
      - targets:
          - 10.0.0.42:9100
  - job_name: "whatever_job_exporter"
    static_configs:
      - targets:
          - 10.0.0.43:9100

2. Write the alerting rules

[root@prometheus-server31 /softwares/prometheus-2.53.4.linux-amd64]# cat linux96-rules.yml 
groups:
- name: linux96-rules-alert
  rules:
  - alert: linux96-rules-ops-alert
    expr: up{job="linux96_ops_exporter"} == 0
    for: 3s
    labels:
      address: Shanghai
      class: linux96
      apps: ops
    annotations:
      summary: "{{ $labels.instance }} server has been down for more than 3s!!!!!"
  - alert: linux96-rules-k8s-alter
    expr: up{job="linux96_k8s_exporter"} == 0
    for: 3s
    labels:
      school: Beijing
      class: linux96
      apps: k8s
    annotations:
      summary: "{{ $labels.instance }} K8S server has been down for more than 3s!"
  - alert: othersServer-rules-system-alter
    expr: up{job="whatever_job_exporter"} == 0
    for: 5s
    labels:
      school: Shenzhen
      class: linux96
      apps: bigdata
    annotations:
      summary: "{{ $labels.instance }} big data server has been down for more than 5s!"
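The `for` clause interacts with `evaluation_interval`: a rule is only checked at evaluation time, so an alert first becomes pending and can turn to firing no earlier than a later evaluation. The Python sketch below models that behaviour as I understand it. It is a simplified illustration (it ignores features such as `keep_firing_for`), and `evaluate` is a name invented for this example.

```python
def evaluate(samples, hold_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true) pairs, one per evaluation.
    hold_seconds: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for timestamp, condition in samples:
        if not condition:
            # The expression no longer returns a result: the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= hold_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


if __name__ == "__main__":
    # Evaluations every 15 seconds (evaluation_interval: 15s) with `for: 3s`.
    print(evaluate([(0, True), (15, True), (30, True)], hold_seconds=3))
```

With the 15-second evaluation interval used in this post, a `for: 3s` rule is pending at the first evaluation that sees the target down and firing at the next one, so the effective delay is closer to 15 seconds than to 3.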

3. Check the configuration syntax and reload the configuration

[root@prometheus-server31 /softwares/prometheus-2.53.4.linux-amd64]# ./promtool check config prometheus.yml
Checking prometheus.yml
 SUCCESS: 1 rule files found
 SUCCESS: prometheus.yml is valid prometheus config file syntax

Checking linux96-rules.yml
 SUCCESS: 3 rules found

[root@prometheus-server31 /softwares/prometheus-2.53.4.linux-amd64]# 

Reload the Prometheus configuration (this endpoint is available only when Prometheus was started with --web.enable-lifecycle):
curl -X POST http://10.0.0.31:9090/-/reload

4. Trigger the alerts

[root@node-exporter41 ~]# systemctl stop node-exporter.service 
[root@node-exporter41 ~]# ss -ntl | grep 9100
[root@node-exporter41 ~]# 

[root@node-exporter42 ~]# systemctl stop  node-exporter.service 
[root@node-exporter42 ~]# 
[root@node-exporter42 ~]# ss -ntl | grep 9100
[root@node-exporter42 ~]# 

[root@node-exporter43 ~]# systemctl stop node-exporter.service 
[root@node-exporter43 ~]# 
[root@node-exporter43 ~]# ss -ntl | grep 9100

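5. Check the Alertmanager web UI and the email recipients

The web UI shows that the ops group has gone down.

At this point an alert email arrives in the mailbox.

Now we restore the service: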

[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# systemctl start node-exporter.service 
[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# ss -ntl |grep 9100
LISTEN 0      4096               *:9100            *:*

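A follow-up email then tells us that the problem has been resolved, because send_resolved is set to true for the receivers.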

V. Integrating Alertmanager with the DingTalk plugin for alerting

Reference:
https://github.com/timonwong/prometheus-webhook-dingtalk/

1. Deploy the DingTalk plugin

1.1 Download the DingTalk plugin
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

1.2 Extract the archive
[root@node-exporter43 ~]# tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz  -C /usr/local/
[root@node-exporter43 ~]# 
[root@node-exporter43 ~]# cd /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64/
[root@node-exporter43 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# 
[root@node-exporter43 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# ll
total 18752
drwxr-xr-x  3 3434 3434     4096 Apr 21  2022 ./
drwxr-xr-x 12 root root     4096 Mar 30 17:47 ../
-rw-r--r--  1 3434 3434     1299 Apr 21  2022 config.example.yml
drwxr-xr-x  4 3434 3434     4096 Apr 21  2022 contrib/
-rw-r--r--  1 3434 3434    11358 Apr 21  2022 LICENSE
-rwxr-xr-x  1 3434 3434 19172733 Apr 21  2022 prometheus-webhook-dingtalk*
[root@node-exporter43 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# 

[root@node-exporter43 /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64]# cp config{.example,}.yml 
[root@node-exporter43 /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64]# ll
total 18756
drwxr-xr-x  3 3434 3434     4096 Mar 30 13:47 ./
drwxr-xr-x 11 root root     4096 Mar 30 13:46 ../
-rw-r--r--  1 3434 3434     1299 Apr 21  2022 config.example.yml
-rw-r--r--  1 root root     1299 Mar 30 13:47 config.yml
drwxr-xr-x  4 3434 3434     4096 Apr 21  2022 contrib/
-rw-r--r--  1 3434 3434    11358 Apr 21  2022 LICENSE
-rwxr-xr-x  1 3434 3434 19172733 Apr 21  2022 prometheus-webhook-dingtalk*
[root@node-exporter43 /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64]#

2. Log in to DingTalk and add a custom robot

Save the following details; they are needed in the configuration file.

Webhook URL:
https://oapi.dingtalk.com/robot/send?access_token=4b1f23e7286ebce1f626474534050d1cb3868f5055914d77d9217

Signing secret:
SEC6a9472f35ce08bd855c5e5b8cf3b39fd57fdb99d650aa35786acd884b9e

3. Configure the plugin's config.yml

[root@node-exporter43 /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64]# vim config.yml 
## Request timeout
.....
### DingTalk settings
## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=4b1f23e7286ebce1f624dff7c4eb236474534050d1cb3868f5055914d77d9217
    # secret for signature
    secret: SEC6a9472f35ce08bd855c5e5b8eb2a4cf3b39fd57fdb99d650aa35786acd884b9e
.....
Only one webhook target is defined here; if you need more, add further entries under targets.
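For context on what the `secret` is for: the plugin uses it to sign each request, so you do not have to do this yourself. The Python sketch below shows the signing scheme as I understand it from DingTalk's custom-robot documentation (HMAC-SHA256 over the millisecond timestamp, a newline, and the secret, then Base64 and URL encoding). Treat it as an assumption to verify against the official documentation; the function names and the placeholder token and secret are made up.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign(secret, timestamp_ms):
    """Compute the URL-encoded signature for a given millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append the timestamp and sign query parameters to the webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = round(time.time() * 1000)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign(secret, timestamp_ms)}"


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    print(signed_url("https://oapi.dingtalk.com/robot/send?access_token=TOKEN", "SECexample"))
```

Because the signature depends on the current time, every request carries a fresh `timestamp` and `sign` pair.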

4. Start the DingTalk plugin

[root@node-exporter43 /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64]# ./prometheus-webhook-dingtalk --web.listen-address="10.0.0.43:8060"
..........
The startup log shows the webhook URL for Alertmanager along with a few other messages:
ts=2025-03-30T14:07:59.409Z caller=main.go:97 level=info component=configuration msg="Loading templates" templates=
ts=2025-03-30T14:07:59.410Z caller=main.go:113 component=configuration msg="Webhook urls for prometheus alertmanager" urls=http://10.0.0.43:8060/dingtalk/webhook1/send
ts=2025-03-30T14:07:59.411Z caller=web.go:208 level=info component=web msg="Start listening for connections" address=10.0.0.43:8060

5. Point Alertmanager at the DingTalk plugin

[root@node-exporter41 /usr/local/alertmanager-0.28.1.linux-amd64]# vim alertmanager.yml 
.....
### Add this configuration
- name: 'sre_system'
  webhook_configs:
  # Points to the address of the DingTalk plugin
  - url: 'http://10.0.0.43:8060/dingtalk/webhook1/send'
    http_config: {}
    max_alerts: 0
    send_resolved: true
#- name: 'sre_system'
#  email_configs:
#  - to: '94@qq.com'
#    send_resolved: true
#    headers: { Subject: "[WARN] LINUX96 alert email" }
"alertmanager.yml" 76L, 2274B written

6. Test

[root@node-exporter43 ~]# systemctl stop node-exporter.service
[root@node-exporter43 ~]#
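When the alert fires, Alertmanager sends an HTTP POST with a JSON body to the webhook URL, and the DingTalk plugin turns that body into a chat message. The Python sketch below parses a payload of that general shape to show what a webhook receiver has to work with. The field names follow the Alertmanager webhook format as I know it; the sample payload and the `summarize` helper are invented for illustration and are not the plugin's code.

```python
import json


def summarize(payload_text):
    """Turn an Alertmanager webhook payload into one line per alert."""
    payload = json.loads(payload_text)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "unknown")
        lines.append(f"[{status}] {labels.get('alertname', '?')}: {summary}")
    return lines


if __name__ == "__main__":
    # Hand-written sample, modelled on the rules in this post.
    sample = json.dumps({
        "version": "4",
        "status": "firing",
        "receiver": "sre_system",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "othersServer-rules-system-alter",
                       "instance": "10.0.0.43:9100"},
            "annotations": {"summary": "10.0.0.43:9100 is down"},
        }],
    })
    for line in summarize(sample):
        print(line)
```

With `send_resolved: true`, a second payload whose alerts have the status `resolved` is sent once the node is back.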

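VI. Alertmanager silences

1. What a silence is
A silence is typically used during planned system maintenance, when the resulting alerts are expected and there is no need to send them.

For example, if a system upgrade takes 8 hours, alerts can be muted for those 8 hours.

Once a silence is created for the alerts carrying the k8s label, the matching nodes are muted: even if they fail, no notification is sent, while all other nodes keep alerting as usual.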
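Conceptually, a silence is a set of label matchers plus a time window. The Python sketch below models only that idea with equality matchers (real silences also support regex and negative matchers); the `is_silenced` helper and the sample values are invented for this illustration.

```python
from datetime import datetime, timedelta, timezone


def is_silenced(labels, silence, now):
    """An alert is muted if the silence is active and every matcher equals the alert's label."""
    in_window = silence["starts_at"] <= now < silence["ends_at"]
    all_match = all(
        labels.get(name) == value
        for name, value in silence["matchers"].items()
    )
    return in_window and all_match


if __name__ == "__main__":
    start = datetime(2025, 3, 30, 12, 0, tzinfo=timezone.utc)
    # An 8-hour maintenance window for everything labelled apps=k8s.
    silence = {
        "matchers": {"apps": "k8s"},
        "starts_at": start,
        "ends_at": start + timedelta(hours=8),
    }
    now = start + timedelta(hours=1)
    print(is_silenced({"apps": "k8s"}, silence, now))
    print(is_silenced({"apps": "bigdata"}, silence, now))
```

The alert is still evaluated and still shows up in Alertmanager; the silence only stops its notifications during the window.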

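Test

[root@node-exporter42 ~]# systemctl stop node-exporter.service 
[root@node-exporter43 ~]# systemctl stop node-exporter.service 

The k8s alert from node 42 is silenced, while node 43 still alerts normally.

The Silences page lists the silenced k8s alert.

The DingTalk robot behaves the same way: no alert carrying the k8s label is delivered.

Restore the nodes: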
[root@node-exporter42 ~]# systemctl start node-exporter.service 
[root@node-exporter43 ~]# systemctl start node-exporter.service

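VII. Alertmanager alert inhibition

1. What alert inhibition is
Put simply, inhibition suppresses alerts. Unlike a silence, it is typically used to suppress alerts that meet certain conditions while a related, more important alert is firing.

For example:
A data center has 800 servers and each server has 50 monitored items, which means 40,000 (800 x 50) potential alerts.
If the data center loses power, in theory 40,000 alerts would be sent to your phone, far more than you can handle. It is enough to send out only the alert for the data center power outage.

1. Write the Prometheus rules

The main change is the two added lines, severity and dc (for example severity: critical and dc: beijing). They mark the severity of the alert and what can be thought of as its data center label.

[root@prometheus-server31 /softwares/prometheus-2.53.4.linux-amd64]# vim linux96-rules.yml 
groups:
- name: linux96-rules-alert
  rules:
  - alert: linux96-rules-ops-alert
    expr: up{job="linux96_ops_exporter"} == 0
    for: 3s
    labels:
      address: Shanghai
      class: linux96
      apps: ops
      severity: critical
      dc: beijing
    annotations:
      summary: "{{ $labels.instance }} server has been down for more than 3s!!!!!"
  - alert: linux96-rules-k8s-alter
    expr: up{job="linux96_k8s_exporter"} == 0
    for: 3s
    labels:
      school: Beijing
      class: linux96
      apps: k8s
      severity: warning
      dc: beijing
    annotations:
      summary: "{{ $labels.instance }} K8S server has been down for more than 3s!"
  - alert: othersServer-rules-system-alter
    expr: up{job="whatever_job_exporter"} == 0
    for: 5s
    labels:
      school: Shenzhen
      class: linux96
      apps: bigdata
      severity: warning
      dc: Shenzhen
    annotations:
      summary: "{{ $labels.instance }} big data server has been down for more than 5s!"
"linux96-rules.yml" 36L, 981B written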

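2. Configure the inhibition rules in Alertmanager

......
### Append at the end of alertmanager.yml
## Inhibition rules
inhibit_rules:
# Provided that the value of "dc" is the same on both alerts,
# a firing "severity: critical" alert suppresses "severity: warning" alerts.
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal:
  - dc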
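The rule can be read as a small predicate over the set of firing alerts. The Python sketch below applies it to label sets modelled on the three rules from this post. It is a simplified model of the matching logic as I understand it, not Alertmanager's implementation; `RULE`, `has_labels`, and `is_inhibited` are names invented here.

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["dc"],
}


def has_labels(labels, wanted):
    """True if every wanted label is present with exactly that value."""
    return all(labels.get(name) == value for name, value in wanted.items())


def is_inhibited(target, firing_alerts, rule=RULE):
    """True if some firing source alert suppresses the target alert."""
    if not has_labels(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target:
            continue
        if not has_labels(source, rule["source_match"]):
            continue
        # All labels listed under `equal` must have the same value on both alerts.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    ops = {"apps": "ops", "severity": "critical", "dc": "beijing"}
    k8s = {"apps": "k8s", "severity": "warning", "dc": "beijing"}
    bigdata = {"apps": "bigdata", "severity": "warning", "dc": "Shenzhen"}
    firing = [ops, k8s, bigdata]
    for alert in firing:
        print(alert["apps"], is_inhibited(alert, firing))
```

In this model only the k8s warning is suppressed: it shares `dc: beijing` with the critical ops alert, whereas the bigdata warning carries `dc: Shenzhen` and is still delivered.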

3. Start Alertmanager and the DingTalk plugin

[root@node-exporter43 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# ./prometheus-webhook-dingtalk --web.listen-address="10.0.0.43:8060" 
[root@node-exporter43 alertmanager-0.28.1.linux-amd64]# ./alertmanager

4. Stop the services and verify in DingTalk

[root@node-exporter41 ~]# systemctl stop node-exporter.service 
[root@node-exporter42 ~]# systemctl stop node-exporter.service 
### These two nodes deliberately share the same value for the equal label (dc) but have different severities, so only the alert with the critical severity is reported.

[root@node-exporter43 ~]# systemctl stop node-exporter.service
Then stop node 43, whose dc value is different; its alert is delivered normally.

5. Recovery test

Once the services are restored, the resolved notifications arrive as usual.
[root@node-exporter41 ~]# systemctl start node-exporter.service
[root@node-exporter42 ~]# systemctl start node-exporter.service
[root@node-exporter43 ~]# systemctl start node-exporter.service

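VIII. Summary

This post looked at two important components of the Prometheus ecosystem, Pushgateway and Alertmanager. Pushgateway fills the gap left by Prometheus's pull mode in special scenarios such as short-lived jobs and firewall restrictions: it accepts metrics pushed by clients and holds them temporarily so that Prometheus Server can scrape them. Alertmanager handles the alerts sent by Prometheus Server and provides grouping, deduplication, inhibition, and flexible notification channels, which makes the alerting system more efficient.

The example script also showed how to use Pushgateway to push custom TCP connection state metrics, extending what Prometheus can monitor. Understanding and using these two components well is an important step toward a complete, efficient, and reliable monitoring and alerting setup that keeps systems stable and problems quickly addressed.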