Continuing from the previous post: configuring DingTalk alerts with Alertmanager

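Alertmanager is an open-source tool for handling and managing Prometheus alerts. It receives alerts from the Prometheus server, deduplicates, groups, silences, and inhibits them, and sends notifications through channels such as email, PagerDuty, and Slack.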

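Main features

Deduplication: merges identical or similar alerts to avoid repeated notifications.
Grouping: combines related alerts into a single notification to reduce information overload.
Silencing: temporarily mutes specific alerts to avoid noise.
Inhibition: suppresses certain alerts while certain other alerts are already firing.
Routing: dispatches alerts to different receivers or channels based on their labels.
Notification: sends alert notifications through a variety of channels.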

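Core concepts

Alert: an alert generated by Prometheus, carrying labels, annotations, and a state.
Receiver: the destination of an alert, such as an email address or a Slack channel.
Route: defines how alerts are routed to receivers.
Silence: the mechanism for temporarily muting specific alerts.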

Download the package:
URL: https://prometheus.io/download/#alertmanager

Upload the package alertmanager-0.24.0.linux-amd64.tar.gz to the server

tar zxf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/alertmanager-0.24.0.linux-amd64/ /usr/local/alertmanager

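Next, install one more plugin: prometheus-webhook-dingtalk

Alertmanager has no built-in DingTalk support, so alerts have to be sent to DingTalk through a webhook. prometheus-webhook-dingtalk is such a tool; it acts as a bridge between Alertmanager and DingTalk:

Alertmanager sends the alert to prometheus-webhook-dingtalk through a webhook.
prometheus-webhook-dingtalk converts the alert into a format DingTalk supports (such as Markdown) and pushes it to the chosen group chat through DingTalk's webhook API.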
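To make the second step more concrete, here is a minimal sketch of the kind of JSON body a DingTalk custom robot accepts for a Markdown message. The function name and the sample title and text are made up for illustration, and the sketch does no JSON escaping; the plugin builds the real body itself.

```shell
#!/bin/sh
# Sketch: print the JSON body of a DingTalk "markdown" robot message.
# Assumes title and text contain no double quotes, backslashes or newlines,
# because no JSON escaping is done here.
dingtalk_markdown_payload() {
    title=$1
    text=$2
    printf '{"msgtype":"markdown","markdown":{"title":"%s","text":"%s"}}\n' "$title" "$text"
}

# Made-up sample values.
dingtalk_markdown_payload "Prometheus alert" "**InstanceDown** on node-1"
```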

Download the package:
URL: https://github.com/timonwong/prometheus-webhook-dingtalk/releases/

Upload it to the server, then extract and install it

tar zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64/ /usr/local/prometheus-webhook-dingtalk

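Create a DingTalk robot:

[DingTalk desktop client] - [Group chat] - [Group settings] - [Smart group assistant] - [Add more] - [Add robot] - [Custom] - [Add]. Edit the robot name, choose the group to add it to, tick the signing option (加簽), and copy the generated secret.

Edit the prometheus-webhook-dingtalk configuration and fill in the information obtained above:

Create /usr/local/prometheus-webhook-dingtalk/config.yml with the following content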

targets:
  ding_webhook:
    # DingTalk webhook URL; fill in your own
    url: https://oapXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    # Signing secret obtained when creating the robot; fill in your own
    secret: SECXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
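The plugin signs requests on its own once the secret is configured, so nothing below is needed for the setup. Purely to show what the secret is used for, this is a sketch of DingTalk's signing scheme for custom robots as I understand it: an HMAC-SHA256 over the millisecond timestamp, a newline, and the secret, then Base64 and URL encoding. The function name and the sample secret are hypothetical, and openssl is assumed to be installed.

```shell
#!/bin/sh
# Sketch of the DingTalk custom-robot signature:
# sign = urlencode(base64(HMAC-SHA256(key=secret, msg="<timestamp>\n<secret>")))
dingtalk_sign() {
    timestamp=$1   # milliseconds since the epoch
    secret=$2      # the SEC... value copied when creating the robot
    printf '%s\n%s' "$timestamp" "$secret" \
        | openssl dgst -sha256 -hmac "$secret" -binary \
        | openssl base64 \
        | sed -e 's/+/%2B/g' -e 's|/|%2F|g' -e 's/=/%3D/g'
}

# Made-up secret for illustration; the timestamp is the current time in ms.
ts=$(( $(date +%s) * 1000 ))
echo "timestamp=$ts&sign=$(dingtalk_sign "$ts" "SECexample")"
```

The printed query string is what gets appended to the robot's webhook URL when a request is signed by hand.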

Start the prometheus-webhook-dingtalk service

nohup /usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml &

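Look up the webhook URL that the plugin provides and note it down; it will be needed shortly.

Edit the alertmanager.yml configuration file and add the route and receiver configuration. Note that the url must be the webhook URL provided by the DingTalk plugin (the one circled in the screenshot above, adjusted to your own setup), not the webhook URL issued directly by DingTalk.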

vim /usr/local/alertmanager/alertmanager.yml

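route:
  # Default receiver
  receiver: 'webhook'
  # How long to wait before sending the first notification for a new group,
  # so that alerts of the same group arriving within 5s go out together
  group_wait: 5s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified
  group_interval: 10s
  # Interval between repeated notifications for the same alert; raise it to
  # reduce how often the same DingTalk alert is sent
  repeat_interval: 30s
  # Label(s) used as the grouping key
  group_by: [alertname]
  routes:
  - receiver: webhook
# Receiver definitions; common notification channels are email, wechat, webhook, etc.
receivers:
- name: 'webhook'
  webhook_configs:
  # Webhook URL provided by the DingTalk plugin
  - url: http://localhost:8060/dingtalk/ding_webhook/send
    # Whether to notify when an alert is resolved
    send_resolved: true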
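Once Alertmanager and the plugin are running, the route can be exercised without waiting for a real failure by posting a hand-made alert to Alertmanager's v2 API. The sketch below assumes Alertmanager listens on localhost:9093 and that curl is installed; the function names, the alert name, and the instance value are placeholders.

```shell
#!/bin/sh
# Sketch: build a one-alert payload for Alertmanager's v2 API.
# Label and annotation values are made-up test data.
test_alert_json() {
    alertname=$1
    instance=$2
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"description":"manual test alert"}}]\n' \
        "$alertname" "$instance"
}

# POST the payload to Alertmanager (address assumed to be localhost:9093).
send_test_alert() {
    test_alert_json "$1" "$2" \
        | curl -sS -X POST -H 'Content-Type: application/json' \
               --data-binary @- "http://localhost:9093/api/v2/alerts"
}
```

Calling send_test_alert ManualTest demo-host:9100 should, after group_wait has elapsed, produce a message in the DingTalk group if the whole chain is wired up correctly.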

Next, edit the Prometheus configuration file:

Add or adjust the alerting section of prometheus.yml so that Prometheus can communicate with Alertmanager.

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Change to the IP and port of the Alertmanager server
            - 192.168.158.183:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Path of the alerting rule files
rule_files:
  - "/usr/local/prometheus/rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.158.183:9090"]
  - job_name: 'linux'
    file_sd_configs:
      - files:
          - /usr/local/prometheus/node_exporter_targets.json
  # Scrape Alertmanager's own metrics
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['192.168.158.183:9093']

Create a rules directory under /usr/local/prometheus/

In the rules directory, create node_rules.yml to hold the alerting rules for host nodes

[root@prometheus prometheus]# cat  rules/node_rules.yml
groups:
  - name: node_alerts
    rules:
      # Rule 1: CPU usage too high
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
      # Rule 2: memory usage too high
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
      # Rule 3: disk usage too high
      - alert: HighDiskUsage
        expr: 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
      # Rule 4: node down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."
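As a quick sanity check on the 80% threshold, the arithmetic of the HighMemoryUsage expression can be reproduced by hand. The sketch below applies the same formula to a hypothetical node with 16 GiB of memory, of which 2 GiB is available; the function name and the numbers are illustrative only.

```shell
#!/bin/sh
# Same formula as the HighMemoryUsage expression:
# (MemTotal - MemAvailable) / MemTotal * 100
mem_usage_percent() {
    awk -v total="$1" -v avail="$2" \
        'BEGIN { printf "%.1f\n", (total - avail) / total * 100 }'
}

# Hypothetical node: 16 GiB total, 2 GiB available (values in bytes).
mem_usage_percent 17179869184 2147483648   # prints 87.5
```

A value of 87.5 is above 80, so this node would trigger the rule once the condition has held for 5 minutes.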

Add the test node to the /usr/local/prometheus/node_exporter_targets.json file
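For reference, a file_sd target file is a JSON list of objects with a targets array and optional labels. The sketch below writes such a file to a temporary path so the real file is not overwritten by accident; the address 192.168.158.182:9900 is assumed to be the test node, and the env label and the function name are made up.

```shell
#!/bin/sh
# Sketch: write a file_sd target list to the given path.
write_targets() {
    out=$1
    cat > "$out" <<'EOF'
[
  {
    "targets": ["192.168.158.182:9900"],
    "labels": {"env": "test"}
  }
]
EOF
}

# Defaults to a temp path; set TARGETS_FILE to write somewhere else.
write_targets "${TARGETS_FILE:-/tmp/node_exporter_targets.example.json}"
```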

Restart Prometheus

ps -ef |grep prometheus |grep -v grep |awk '{print $2}' |xargs kill -9
nohup /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml &
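A restart with a broken configuration would leave Prometheus down, so it may be worth validating the files before running commands like the ones above. The sketch assumes promtool and amtool sit next to the binaries in the install directories used in this post, as they do in the official tarballs; run_check is a hypothetical helper that skips a tool if it is missing.

```shell
#!/bin/sh
# Run a config checker if it is installed, otherwise report that it was skipped.
run_check() {
    tool=$1
    shift
    if [ -x "$tool" ]; then
        "$tool" "$@"
    else
        echo "skip: $tool not found"
    fi
}

# Paths follow the install locations used in this post.
run_check /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
run_check /usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/node_rules.yml
run_check /usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml
```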

Start Alertmanager

nohup /usr/local/alertmanager/alertmanager --config.file /usr/local/alertmanager/alertmanager.yml &

Start the DingTalk plugin prometheus-webhook-dingtalk

nohup /usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml &

Check that the Grafana, Alertmanager, and Prometheus ports are all listening
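One way to do this check is to filter the output of ss -lnt for the expected ports. The helper below is a sketch: check_ports is a made-up name, 9090, 9093, and 8060 are the ports used in this post, and 3000 is assumed to be Grafana's port.

```shell
#!/bin/sh
# Read `ss -lnt` output on stdin and report whether each given port is listening.
check_ports() {
    awk -v ports="$*" '
        NR > 1 {                       # skip the header line
            n = split($4, a, ":")      # $4 is the Local Address:Port column
            listening[a[n]] = 1        # the last piece is the port number
        }
        END {
            m = split(ports, want, " ")
            for (i = 1; i <= m; i++)
                print want[i], ((want[i] in listening) ? "listening" : "not listening")
        }'
}

# Usage on the server:
#   ss -lnt | check_ports 9090 9093 8060 3000
```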

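Next, shut down the test machine that was just added

After a short wait, the alert shows up in DingTalk

Next, improve the alert messages:

1. Send alert messages in Chinese

Edit prometheus-webhook-dingtalk/config.yml and add the following fields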

targets:
  ding_webhook:
    url: https://oapXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    secret: SEC6c2bf6d8XXXXXXXXXXXXXXXXXXXXXXX
    message:
      title: 'Prometheus - {{ if eq .Status "resolved" }}恢復通知{{ else }}告警通知{{ end }}'
      text: |
        - **告警名稱**: {{ .CommonLabels.alertname }}
        - **當前狀態**: {{ .Status }}
        {{ if eq .Status "resolved" }}
        - **描述**: 實例 {{ .CommonLabels.instance }} 已恢復正常。
        - **可能影響的服務**: 沒有影響的服務
        {{ else }}
        - **描述**: {{ .CommonAnnotations.description }}
        - **可能影響的服務**: {{ .CommonAnnotations.impact }}
        {{ end }}

2. Report the impact scope of a failure

Edit the /usr/local/prometheus/rules/node_rules.yml configuration file and add the following

.............
      # Rule 4: node down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "實例 {{ $labels.instance }} 已宕機"
          description: "實例 {{ $labels.instance }} 已宕機。"
          impact: |
            {{- if eq $labels.instance "192.168.158.182:9900" }}
            K8S中pod調度，導致服務無法正常使用。
            {{- else if eq $labels.instance "192.168.158.183:9900" }}
            無法訪問監控系統。
            {{- else }}
            可能影響的服務：未知。
            {{- end }}

Restart the prometheus-webhook-dingtalk service:

ps -ef |grep prometheus-we |grep -v grep |awk -F " " '{print $2}' |xargs kill -9
# To keep each service's nohup log separate, run the start command from the service's own directory
cd /usr/local/prometheus-webhook-dingtalk/
nohup /usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml &

Restart the Prometheus service:

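# Match the full binary path so that prometheus-webhook-dingtalk, restarted above, is not killed as well
# (assumes Prometheus was started via this path, as shown earlier)
ps -ef |grep '/usr/local/prometheus/prometheus' |grep -v grep |awk '{print $2}' |xargs kill -9
cd /usr/local/prometheus
nohup /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml &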

Check that all the ports are listening