Ops Without the Rat Race series - Cloud-Native Monitoring Platform, Part 05: Prometheus Alertmanager in Practice

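Table of contents

Introduction to Alertmanager
Core concepts implemented by Alertmanager
Grouping
Inhibition
Silences
Client behavior
High Availability

Alertmanager configuration file
global
templates
route
inhibit_rules
receivers

Deploying Alertmanager
Create the ConfigMap
Create the Service
Create the StatefulSet
Configuring alerting in Prometheus
Add the Alertmanager configuration to the Prometheus configuration file
Add alerting rules to Prometheus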

Introduction to Alertmanager

ALERTMANAGER

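Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibiting alerts.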

Core concepts implemented by Alertmanager

Grouping

Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
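As a rough illustration of grouping, the sketch below buckets alerts by the values of a chosen set of labels, so that each bucket would produce one notification. It is a simplified model rather than Alertmanager's implementation; the function name `group_alerts` and the sample alerts are assumptions made for this example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (label dicts) by the values of the labels in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        # A label that is missing from an alert is treated as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: two from cluster A, one from cluster B.
alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node2"},
    {"alertname": "LatencyHigh", "cluster": "B", "instance": "node3"},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

With these inputs the three alerts collapse into two groups, one per cluster, which is the effect that keeps a large outage from producing one notification per failing instance.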

Inhibition

Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.

Silences

Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert.
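The matching rule for silences can be sketched as follows: an alert is silenced only if it satisfies every matcher of the silence. This is a simplified model under stated assumptions: matchers are represented as `(name, operator, value)` tuples, regular expressions are fully anchored, and Python's `re` module stands in for the Go regular expression engine that Alertmanager uses. The names `matches` and `is_silenced` are hypothetical.

```python
import re


def matches(matcher, labels):
    """Evaluate one (name, op, value) matcher against an alert's labels."""
    name, op, value = matcher
    actual = labels.get(name, "")  # a missing label counts as empty
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def is_silenced(silence_matchers, labels):
    """An alert is silenced only if it matches ALL matchers of the silence."""
    return all(matches(m, labels) for m in silence_matchers)


# Hypothetical silence: mute NodeMemoryUsage on instances named node-<number>.
silence = [
    ("alertname", "=", "NodeMemoryUsage"),
    ("instance", "=~", "node-[0-9]+"),
]

print(is_silenced(silence, {"alertname": "NodeMemoryUsage", "instance": "node-12"}))
print(is_silenced(silence, {"alertname": "NodeMemoryUsage", "instance": "db-1"}))
```

A real silence also has a start and end time; that part is left out here to keep the focus on matcher evaluation.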

Client behavior

The Alertmanager has special requirements for behavior of its client. Those are only relevant for advanced use cases where Prometheus is not used to send alerts.

High Availability

Alertmanager supports configuration to create a cluster for high availability. This can be configured using the --cluster-* flags.


It’s important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers.


Alertmanager configuration file

CONFIGURATION

Like Prometheus, Alertmanager supports reloading its configuration file through a POST request, and the endpoint is also /-/reload.
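For readers who prefer to trigger the reload from a script, here is a minimal sketch using only the Python standard library. The helper names `build_reload_request` and `reload_config` are hypothetical, and the base URL is an assumption that reuses the Service address from the Prometheus configuration in this article; replace it with your own address.

```python
import urllib.request


def build_reload_request(base_url):
    """Build (but do not send) the POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Assumed address; building the request does not touch the network.
req = build_reload_request("http://alertmanager-svc.monitor.svc.cluster.local:9093")
print(req.get_method(), req.full_url)
```

Calling `reload_config(...)` against a reachable Alertmanager sends the request; an HTTP error status is raised by `urlopen` as `urllib.error.HTTPError`.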

Example configuration file on GitHub

global

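Global configuration

global:
  # SMTP server (host:port) used to send email
  smtp_smarthost: 'localhost:25'
  # Sender address of the alert emails
  smtp_from: 'alertmanager@example.org'
  # SMTP authentication username (the exact value depends on the mail server)
  smtp_auth_username: 'alertmanager'
  # SMTP authentication password. This is not the usual login password;
  # it has to be requested from the mailbox provider.
  smtp_auth_password: 'password'
  # Whether SMTP requires TLS
  smtp_require_tls: false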

templates

Content templates for email alerts

templates:
- '/etc/alertmanager/template/*.tmpl'

route

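Route-related settings control how alerts are routed, aggregated, throttled, and muted based on time.

route:
  # Labels by which incoming alerts are grouped. For example, multiple alerts
  # with cluster=A and alertname=LatencyHigh are batched into a single group.
  #
  # Using '...' as the sole label name groups by all possible labels, which
  # effectively disables aggregation entirely and passes all alerts through
  # as-is. This is unlikely to be what you want unless your alert volume is
  # very low or your upstream notification system performs its own grouping.
  group_by: ['alertname', 'cluster', 'service']
  # When an incoming alert creates a new alert group, wait at least
  # "group_wait" before sending the initial notification. This ensures that
  # alerts for the same group that start firing shortly after the first one
  # are batched together into the first notification.
  group_wait: 30s
  # Once the first notification has been sent, wait "group_interval" before
  # sending a batch of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has already been sent successfully, wait "repeat_interval"
  # before sending it again.
  repeat_interval: 3h
  # Default receiver
  receiver: team-X-mails
  # All of the attributes above are inherited by all child routes and can be
  # overridden on each of them.
  # Child routes
  routes:
  # This route performs a regular expression match on alert labels to catch
  # alerts related to a list of services.
  - matchers:
    - service=~"foo1|foo2|baz"
    receiver: team-X-mails
    # The service has a sub-route for critical alerts. Any alert that does
    # not match, i.e. severity != critical, falls back to the parent node and
    # is sent to "team-X-mails".
    routes:
    - matchers:
      - severity="critical"
      receiver: team-X-pager
  - matchers:
    - service="files"
    receiver: team-Y-mails
    routes:
    - matchers:
      - severity="critical"
      receiver: team-Y-pager
  # This route handles all alerts coming from the database service. If no
  # team is assigned to handle an alert, it defaults to the DB team.
  - matchers:
    - service="database"
    receiver: team-DB-pager
    # Group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - matchers:
      - owner="team-X"
      receiver: team-X-pager
      continue: true
    - matchers:
      - owner="team-Y"
      receiver: team-Y-pager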
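To make the routing behaviour easier to follow, the sketch below walks a trimmed copy of this tree in Python. It is a simplified model, not Alertmanager's own code: an alert descends into the first child whose matchers all match, a child with `continue: true` lets evaluation proceed to its siblings, and a node handles the alert itself when none of its children match. The dictionary layout, the `(name, op, value)` matcher tuples, and the function names are assumptions made for this example.

```python
import re


def matcher_ok(matcher, labels):
    """Evaluate a (name, op, value) matcher; only "=" and "=~" are modelled."""
    name, op, value = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    return re.fullmatch(value, actual) is not None  # "=~", fully anchored


def match_route(route, labels):
    """Return the routes that end up handling an alert with these labels."""
    if not all(matcher_ok(m, labels) for m in route.get("matchers", [])):
        return []
    matched = []
    for child in route.get("routes", []):
        hits = match_route(child, labels)
        matched.extend(hits)
        # Stop at the first matching child unless it sets continue: true.
        if hits and not child.get("continue", False):
            break
    # If no child matched, the current node handles the alert itself.
    return matched or [route]


# Trimmed copy of the routing tree from the configuration above.
tree = {
    "receiver": "team-X-mails",
    "routes": [
        {"matchers": [("service", "=~", "foo1|foo2|baz")],
         "receiver": "team-X-mails",
         "routes": [{"matchers": [("severity", "=", "critical")],
                     "receiver": "team-X-pager"}]},
        {"matchers": [("service", "=", "database")],
         "receiver": "team-DB-pager",
         "routes": [
             {"matchers": [("owner", "=", "team-X")],
              "receiver": "team-X-pager", "continue": True},
             {"matchers": [("owner", "=", "team-Y")],
              "receiver": "team-Y-pager"},
         ]},
    ],
}


def receivers_for(labels):
    return [route["receiver"] for route in match_route(tree, labels)]


print(receivers_for({"service": "foo1", "severity": "critical"}))
print(receivers_for({"service": "foo1", "severity": "warning"}))
print(receivers_for({"service": "database"}))
```

The second call shows the fallback described in the comments: a non-critical alert for foo1 matches the service route but not its critical sub-route, so the service route's own receiver is used.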

inhibit_rules

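An inhibition rule mutes an alert (the target) that matches one set of matchers when an alert (the source) exists that matches another set of matchers. The target and source alerts must have the same label values for the label names in the equal list.

# Inhibition rules allow a set of alerts to be muted while another alert is firing.
# We use this to mute any warning-level notifications if the same alert is already critical.
inhibit_rules:
- source_matchers: [severity="critical"]
  target_matchers: [severity="warning"]
  # Caution: if all label names listed in "equal" are missing from both the
  # source and the target alert, the inhibition rule still applies!
  equal: [alertname, cluster, service]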
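The interplay of source matchers, target matchers, and the equal list can be sketched in Python as follows. This is a simplified model under stated assumptions: matchers are reduced to plain equality on label values, the guard that stops an alert from inhibiting itself is omitted, and the names `is_inhibited` and `rule` are hypothetical.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if some firing source alert inhibits the target alert."""
    if not all(target.get(k) == v for k, v in rule["target"].items()):
        return False  # the rule does not apply to this alert at all
    for source in firing_alerts:
        if not all(source.get(k) == v for k, v in rule["source"].items()):
            continue
        # Missing labels compare as empty strings, so two alerts that both
        # lack every label in "equal" are still considered equal.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Equality-only version of the inhibition rule shown above.
rule = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}

critical = {"alertname": "DiskFull", "cluster": "A", "service": "db", "severity": "critical"}
warning_same = {"alertname": "DiskFull", "cluster": "A", "service": "db", "severity": "warning"}
warning_other = {"alertname": "DiskFull", "cluster": "B", "service": "db", "severity": "warning"}

print(is_inhibited(warning_same, [critical], rule))
print(is_inhibited(warning_other, [critical], rule))
```

The warning for cluster A is muted because a critical alert with the same alertname, cluster, and service is firing, while the warning for cluster B still goes out.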

receivers

A named configuration of one or more notification integrations.

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org'
- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - service_key: <team-X-key>
- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'
- name: 'team-Y-pager'
  pagerduty_configs:
  - service_key: <team-Y-key>
- name: 'team-DB-pager'
  pagerduty_configs:
  - service_key: <team-DB-key>

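Deploying Alertmanager

As before, Alertmanager is deployed on Kubernetes here. The deployed version is v0.27.0.

Create the ConfigMap

Change the SMTP-related settings to match your own environment.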

---
apiVersion: v1
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'alertmanager'
      smtp_require_tls: false
    templates:
    - '/app/config/email.tmpl'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "imcxsen@163.com"
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
        send_resolved: true
    route:
      group_interval: 15m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 15m
      routes:
      - match:
          severity: warning
        receiver: default-receiver
        continue: true
      - match:
          severity: error
        receiver: default-receiver
        continue: true
  email.tmpl: |-
    {{ define "email.to.html" }}
    {{ range .Alerts }}
    ========= {{ .StartsAt.Format "2006-01-02T15:04:05" }} ==========<br>
    Alert source: prometheus_alert <br>
    Alert name: {{ .Labels.alertname }} <br>
    Affected host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Details: {{ .Annotations.description }} <br>
    {{ end }}
    {{ end }}
kind: ConfigMap
metadata:
  labels: {}
  name: alertmanager-cm
  namespace: monitor

Create the Service

---
apiVersion: v1
kind: Service
metadata:
  annotations: {}
  labels:
    app: alertmanager
  name: alertmanager-svc
  namespace: monitor
spec:
  ports:
  - name: http
    protocol: TCP
    port: 9093
  selector:
    app: alertmanager
  type: ClusterIP

Create the StatefulSet

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: alertmanager
  name: alertmanager
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  serviceName: alertmanager-svc
  template:
    metadata:
      annotations: {}
      labels:
        app: alertmanager
    spec:
      containers:
      - args:
        - "--config.file=/app/config/alertmanager.yml"
        - "--storage.path=/alertmanager/data"
        image: prom/alertmanager:v0.27.0
        livenessProbe:
          failureThreshold: 60
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: service
          timeoutSeconds: 1
        name: alertmanager
        ports:
        - containerPort: 9093
          name: service
          protocol: TCP
        - containerPort: 8002
          name: cluster
          protocol: TCP
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 1000m
            memory: 1024Mi
        volumeMounts:
        - mountPath: /app/config
          name: config-volume
      volumes:
      - configMap:
          name: alertmanager-cm
        name: config-volume

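Configuring alerting in Prometheus

Add the Alertmanager configuration to the Prometheus configuration file

The main additions are shown below. They define the path of the Prometheus alerting rule files and the address of Alertmanager. Once configured, run curl -X POST http://ip:port/-/reload to make Prometheus reload its configuration file (this endpoint is only available when Prometheus is started with --web.enable-lifecycle).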

rule_files:
- /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager-svc.monitor.svc.cluster.local:9093"]

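Add alerting rules to Prometheus

To make verification easy, the alert here fires when memory usage exceeds 15%, because some machines in my current environment are above 15%. To choose a threshold for your own environment, run the PromQL from expr in Prometheus and pick a typical value. As above, after adding the rule file you also need to reload the Prometheus configuration.

The rule file defines an alert named NodeMemoryUsage.
The for clause makes Prometheus wait until the expression has been continuously true for the given duration before the alert fires. Until then, the alert stays in the pending state.
The labels clause allows specifying a set of additional labels to attach to the alert. I have not added any here for now.
The annotations clause specifies another set of labels that are not treated as part of the alert's identity. They are often used to store additional information, for example for display in alert notifications.

An alert is in one of the following 3 states during its lifecycle:
inactive: the alert is neither firing nor pending
pending: the alert condition is true but has not yet held for the duration set by for
firing: the alert condition has held for at least the duration set by for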

groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 15
    for: 2m
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 15% (current value is: {{ $value }})"
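The relationship between the for clause and the three alert states can be sketched in a few lines of Python. This is a simplified model, not the Prometheus implementation; the function name `alert_state` and its parameters are assumptions made for this example, with `active_since` standing for the time at which the expression first became true (or `None` while it is false).

```python
from datetime import datetime, timedelta


def alert_state(active_since, now, for_duration):
    """Classify an alert as inactive, pending, or firing."""
    if active_since is None:
        return "inactive"  # the expression is currently false
    if now - active_since < for_duration:
        return "pending"   # true, but not yet for the whole "for" duration
    return "firing"        # true for at least the "for" duration


hold = timedelta(minutes=2)            # corresponds to "for: 2m" in the rule
start = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical time the condition became true

print(alert_state(None, start, hold))
print(alert_state(start, start + timedelta(minutes=1), hold))
print(alert_state(start, start + timedelta(minutes=2), hold))
```

With `for: 2m`, the alert is pending one minute after the condition becomes true and firing once two minutes have passed. If the condition turns false in between, the alert goes back to inactive and the wait starts over.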