
To get alerting right, start with these

2024-08-28 14:58:27

Every kind of monitoring system generates alert events, which is why products such as FlashDuty, PagerDuty, and Opsgenie exist: they handle alert aggregation and noise reduction, on-call scheduling, claiming, escalation, and so on. If you want to improve your company's alert-handling capabilities, you can simply reference (chao xi, that is, copy) the features of these products 😎.

  • Alert integration: The goal is to handle all alerts on a single on-call platform. Most common monitoring tools can call webhooks, so the on-call platform can adapt to each tool's payload format and expose a matching webhook endpoint, which gives users the lowest configuration cost. Some less open monitoring tools may only offer email notification; if the on-call platform can receive these emails and parse their content, that serves as a fallback form of alert integration.

  • Label enrichment: The richer the labels on an alert, the more efficiently engineers can handle it when it arrives. In practice, many monitoring tools send alerts with only a few bare fields, such as machine name, monitored item, and threshold. If you can integrate external metadata (such as a CMDB) to extend the alert's fields, you can use the extended fields to route alerts more automatically, and during an incident engineers can quickly determine the alert's impact and severity.

  • Aggregation and noise reduction: Aggregating similar alerts and collapsing frequently repeated alerts can significantly reduce both the number of alerts and the pointless interruptions to engineers. Rule-based aggregation and semantic-similarity-based aggregation are both feasible approaches. Alerts can also be aggregated across monitoring data sources: alerts from Zabbix and alerts from Prometheus can be merged if they are "similar".

  • Alert suppression: Suppression generally introduces "some kind of dependency": either a high-severity alert suppresses a low-severity one, or an alert from the underlying infrastructure suppresses alerts from the modules built on top of it. These dependencies are expensive to maintain and hard to explain, so heavy use of them is not recommended in large-scale environments.

  • On-call scheduling: The aim is to avoid constantly interrupting the entire team. Daily shifts, holiday shifts, temporary shift swaps, and fair rotation all need to be considered when building the schedule, and there should be a clear notification mechanism at each shift handover. On-call engineers should also have roles, such as primary and backup.

  • Claiming (acknowledgement): In theory, every alert needs to be claimed. If an alert is sent, nobody claims it, and nothing bad happens as a result, then the alert is meaningless and should not have been sent. The efficiency of alert claiming is usually quantified with MTTA (mean time to acknowledge).

  • Escalation/reassignment: Defining clear escalation paths in advance for each alert severity reduces the psychological pressure on on-call engineers and helps problems get resolved quickly and accurately. Escalation can be manual or automatic. For example, when an alert has gone unhandled for more than 30 minutes and has not recovered, it is automatically escalated to a supervisor or backup engineer so that the problem is still handled in time.

  • Collaboration: While handling an alert, you should be able to pull in the relevant people at any time (usually, once the right people are gathered, the problem is half solved; automatically creating a war room is even better). Newly added collaborators must be notified accurately and promptly, and the handling process and timeline should be clearly recorded so that collaborators can quickly grasp the full picture.

  • Notifications: Outside China, Slack connects to a huge surrounding ecosystem and a great deal of collaborative work happens inside it; it is no exaggeration to call it the operating system of collaboration. In China, the big three are WeCom, Feishu, and DingTalk. These IM tools all support custom applications, and receiving, claiming, closing, reassigning, and handling alerts inside such built-in applications is a key way to improve the on-call experience. Once you have handled alerts this way from your phone, you will know how good it feels.

  • Statistics, analysis, and operations: Alert compression rate, MTTA, MTTR, alert claim rate, and alert volume are the key indicators of on-call efficiency. Breaking these indicators down by business line, team, and individual effectively drives alert optimization and governance, and makes on-call more efficient.
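Several of the items above (label enrichment, rule-based aggregation, automatic escalation, and the statistics) can be sketched in a few lines of Python. This is only an illustrative sketch, not how FlashDuty or any other product implements them: the `CMDB` table, the `Alert` fields, the choice of `service` and `check` as aggregation keys, and the definition of compression rate as `1 - incidents / alerts` are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical CMDB table: host -> extra labels (made up for this example).
CMDB = {
    "web-01": {"service": "checkout", "team": "payments"},
    "web-02": {"service": "checkout", "team": "payments"},
    "db-01": {"service": "orders-db", "team": "storage"},
}


@dataclass
class Alert:
    source: str                      # e.g. "zabbix" or "prometheus"
    host: str
    check: str                       # monitored item, e.g. "cpu_usage"
    fired_at: datetime
    labels: dict = field(default_factory=dict)
    acked_at: Optional[datetime] = None
    recovered: bool = False


def enrich(alert: Alert) -> Alert:
    """Label enrichment: copy the bare fields and CMDB metadata into labels."""
    alert.labels.update({"host": alert.host, "check": alert.check})
    alert.labels.update(CMDB.get(alert.host, {}))
    return alert


def aggregate(alerts, keys=("service", "check")):
    """Rule-based aggregation: alerts with equal key labels form one incident,
    no matter which monitoring source they came from."""
    incidents = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(alert.labels.get(k) for k in keys)
        incidents[fingerprint].append(alert)
    return dict(incidents)


def compression_rate(alerts, incidents) -> float:
    """Assumed definition: share of alerts absorbed by aggregation."""
    return 1 - len(incidents) / len(alerts)


def mtta(alerts) -> Optional[timedelta]:
    """Mean time to acknowledge, over the alerts that were claimed."""
    delays = [a.acked_at - a.fired_at for a in alerts if a.acked_at]
    return sum(delays, timedelta()) / len(delays) if delays else None


def needs_escalation(alert, now, timeout=timedelta(minutes=30)) -> bool:
    """Escalate when an alert is unclaimed, not recovered, and past the timeout."""
    return (alert.acked_at is None and not alert.recovered
            and now - alert.fired_at > timeout)
```

To try it, build a few `Alert` objects from different sources, pass each through `enrich`, and call `aggregate` on the list; alerts from Zabbix and Prometheus for hosts that map to the same service and check end up in the same incident. Semantic-similarity aggregation would replace the exact-match fingerprint with a similarity measure, which this sketch does not attempt.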

There are few open source projects in this category, probably because more and more open source authors struggle to support their families, so nobody wants to work purely for love. If you have the budget, I recommend FlashDuty, which I think is the best on-call product in the Eastern Hemisphere.