The Philosophy of Alerting - V2EX

This topic was created 3869 days ago; the information in it may have evolved or changed since then.

My Philosophy on Alerting, based on my observations while I was a Site Reliability Engineer at Google

Author: Rob Ewaschuk

Link: Google Docs

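This is the recommended reading on alerting practices for Prometheus, an open-source monitoring system that has been getting a lot of attention lately; see http://prometheus.io/docs/practices/alerting/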

Main idea:

Keep alerting simple, alert on symptoms, have good consoles to allow pinpointing causes, and avoid having pages where there is nothing to do.
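
To make "alert on symptoms" a bit more concrete, here is a minimal Python sketch (not taken from Rob's document). It pages on a user-visible symptom, the share of failed requests, rather than on a presumed cause such as CPU load. The names `Window` and `should_page` and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window (hypothetical type)."""
    requests: int
    errors: int


def should_page(window: Window,
                max_error_ratio: float = 0.01,
                min_requests: int = 100) -> bool:
    """Return True when the user-visible error ratio warrants a page.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    # Too little traffic to judge: avoid paging on noise.
    if window.requests < min_requests:
        return False
    return window.errors / window.requests > max_error_ratio
```

Causes (a full disk, a slow backend) would then go on a console for diagnosis instead of each getting its own page.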

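My takeaway:

All learning moves from knowledge to skill and finally to methodology, and ops skills are no exception. Fellow ops folks, get your alerting in order and have a happy new year.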

alerting

alerts

2 replies • 2015-02-20 02:37:09 +08:00

dcoder

2015-02-20 02:14:10 +08:00

I took a look at Prometheus.

The visualization layer is Rails + SQL; it doesn't feel as capable as the popular Elasticsearch + Kibana stack.
http://prometheus.io/docs/visualization/promdash/

By the way, its storage is based on LevelDB. Is it easy to scale out horizontally?
http://prometheus.io/docs/operating/storage/

9hills

2015-02-20 02:37:09 +08:00 via iPhone

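@dcoder SQL is certainly more expressive than ES semantically, but performance may become a bottleneck once the data volume grows.

As for storage, only the index is kept in LevelDB; the monitoring data itself is stored in files, chunk by chunk. They say InfluxDB did not exist yet when development started, so they did not use it. Also, the storage only holds sampled numeric time series, unlike InfluxDB, which stores everything.

Also, this is a single-node system rather than a distributed one, so it cannot scale horizontally. The officially recommended way to scale is to have different machines monitored by different masters...

From using it, my impression is that this system is meant for monitoring setups of fewer than 500 machines.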
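
The "different machines go to different masters" approach can be sketched roughly as below. This is only an illustration of manual sharding in Python, not anything Prometheus provides: `assign_master`, `shard`, and the host names are hypothetical, and hashing is just one possible way to split the machines (splitting by service or datacenter would work too).

```python
import hashlib


def assign_master(machine: str, masters: list[str]) -> str:
    """Pick one master for a machine using a stable hash of its name."""
    # hashlib is used instead of built-in hash(), which is randomized
    # per process for strings and would reshuffle machines on restart.
    digest = hashlib.sha256(machine.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(masters)
    return masters[index]


def shard(machines: list[str], masters: list[str]) -> dict[str, list[str]]:
    """Build a master -> machines plan; each machine lands on exactly one master."""
    plan: dict[str, list[str]] = {master: [] for master in masters}
    for machine in machines:
        plan[assign_master(machine, masters)].append(machine)
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = [f"web-{i:03d}" for i in range(12)]
    for master, assigned in shard(hosts, ["master-a", "master-b"]).items():
        print(master, assigned)
```

Note that with plain modulo hashing, adding or removing a master moves most machines to a different master, so each master's history ends up split across servers.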