Prometheus alerting rule examples: contributions welcome
This site collects many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/
# Calculating available memory on CentOS 6 and 7
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
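The expression above can be wrapped in a recording rule so that alerts do not have to repeat the fallback logic. The following is only a sketch: the group name, the recorded metric names, the alert name, the 90% threshold, and the `for` duration are all assumptions chosen for illustration, not values from the original post.

```yaml
groups:
  - name: node-memory            # hypothetical group name
    rules:
      # Available memory: use MemAvailable where the kernel exports it,
      # otherwise fall back to the sum of the individual components.
      - record: instance:node_memory_available:bytes
        expr: |
          node_memory_MemAvailable_bytes
          or
          (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
      # Fraction of total memory in use (0 to 1).
      - record: instance:node_memory_utilization:ratio
        expr: 1 - instance:node_memory_available:bytes / node_memory_MemTotal_bytes
      # Assumed threshold of 90%; tune to your environment.
      - alert: MemoryUsageAbove90Percent
        expr: instance:node_memory_utilization:ratio * 100 > 90
        for: 5m
        labels:
          severity: critical
          level: 3
        annotations:
          summary: "Memory usage above 90%"
          description: "Memory usage on {{ $labels.instance }} is {{ $value | humanize }}%"
```

The `or` operator returns the left-hand series when it exists and only falls back to the right-hand sum for instances where `node_memory_MemAvailable_bytes` is missing, so the same rule works across both kinds of host.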
An example Prometheus rules file. The level label selects the notification method; level and kind are used for alert inhibition.
    groups:
      - name: node-cpu
        rules:
          # Number of CPU cores
          - record: instance:node_cpus:count
            expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
          # Per-CPU utilization
          - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
            expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
          # Overall CPU utilization
          - record: instance:node_cpu_utilization:ratio
            expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)
          - alert: CPU usage above 88%
            expr: instance:node_cpu_utilization:ratio * 100 > 88
            for: 5m
            labels:
              severity: critical
              level: 3
            annotations:
              summary: "CPU usage above 88%"
              description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
          - alert: CPU usage above 93%
            expr: instance:node_cpu_utilization:ratio * 100 > 93
            for: 2m
            labels:
              severity: emergency
              level: 4
            annotations:
              summary: "CPU usage above 93%"
              description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
              wxurl: "webhook1, webhook2"
              mobile: "13xxx, 15xxx"
          - alert: CPU load above core count
            expr: node_load5 > instance:node_cpus:count
            for: 5m
            labels:
              severity: warning
              level: 2
            annotations:
              summary: "CPU load above core count"
              description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
          - alert: CPU load above 2x core count
            expr: node_load1 > (instance:node_cpus:count * 2)
            for: 4m
            labels:
              severity: critical
              level: 3
            annotations:
              summary: "CPU load above 2x core count"
              description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
              alertgroup: ops
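The `level` and `kind` labels only take effect once Alertmanager is configured to use them. The post does not show that configuration, so the following is a hedged sketch of what it might look like. The receiver names are hypothetical, the `kind` label is assumed to be set on the alerts (for example `kind: cpu`), and the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
route:
  receiver: default               # hypothetical receiver names throughout
  group_by: [alertname, instance]
  routes:
    # level selects the notification method
    - matchers:
        - 'level = "4"'
      receiver: phone-and-wechat
    - matchers:
        - 'level = "3"'
      receiver: wechat

receivers:
  - name: default
  - name: wechat
  - name: phone-and-wechat

inhibit_rules:
  # While a level 4 alert is firing, mute level 3 alerts
  # for the same instance and the same kind.
  - source_matchers:
      - 'level = "4"'
    target_matchers:
      - 'level = "3"'
    equal: [instance, kind]
```

With a rule like this, the 93% CPU alert (level 4) would suppress the 88% CPU alert (level 3) on the same instance. Alertmanager treats a missing label and an empty label as equal, so if `kind` is not set on either alert the rule still matches on `instance` alone. Setting `kind` explicitly keeps, say, a CPU alert from muting a disk alert.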
To fire or suppress alerts during specific hours, see: https://www.robustperception.io/combining-alert-conditions
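    groups:
      - name: specific-time-range
        rules:
          - alert: No alerts between 00:00 and 06:00 local time
            # Note: Prometheus evaluates hour() in UTC. 00:00 to 06:00 at UTC+8 is 16:00 to 22:00 UTC,
            # so the alert is only allowed to fire outside that UTC window.
            expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)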
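As a concrete illustration of the time-window pattern, the sketch below applies it to a CPU load condition. The group and alert names are made up for this example. It assumes the `instance:node_cpus:count` recording rule exists and that the quiet window is 16:00 to 22:00 UTC.

```yaml
groups:
  - name: time-window-example     # hypothetical group name
    rules:
      - alert: CpuLoadAboveCoresOutsideQuietHours
        # The left side is the normal alert condition. The right side is a
        # label-less vector that is non-empty only outside 16:00-22:00 UTC.
        # "and on()" keeps the left side only while the right side is non-empty.
        expr: >-
          (node_load5 > instance:node_cpus:count)
          and on() (hour() < 16 or hour() >= 22)
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "CPU load above core count (outside quiet hours)"
```

Note that a chained comparison such as `hour() < 16 > 22` is evaluated as a sequence of filters, so it keeps only values that are both below 16 and above 22, which is never true. A window that wraps around the excluded range therefore needs `or`, while a contiguous window can use the chained form, for example `hour() >= 9 < 17`.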