
When a Prometheus alert recovers, how do I get the value at recovery time?


In a Prometheus alert event, $value holds the value at the time the alert was triggered. When the alert recovers, however, $value still holds the value from the most recent firing, not the value at recovery time. Why is that, and is there a way to fix it?

Without further ado, let's start with the principle.

Principle

Alert rules are configured in Prometheus, and Prometheus is responsible for evaluating them. The evaluation logic is simple: Prometheus periodically runs the rule's promql, and if data is returned and the condition holds for the duration specified by the for clause, the alert event fires; if no data is returned, the metric is considered healthy. For example:

cpu_usage_idle < 5

In this example the promql carries the threshold (< 5), so whenever data is returned the current value must be below 5, and whenever nothing is returned the current value must be at least 5, i.e. the metric is healthy. Note what that implies: when the metric is healthy, the time-series database returns nothing, so the upper layer has no way to obtain the value in the healthy state, and therefore has nothing to show as the latest value when the alert recovers.
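
For reference, such a rule would typically be written as a Prometheus alerting rule along these lines (a minimal sketch; the rule name, duration, and labels are illustrative):

groups:
  - name: cpu-alerts
    rules:
      - alert: CpuIdleTooLow              # illustrative rule name
        expr: cpu_usage_idle < 5          # the threshold is baked into the promql
        for: 5m                           # must keep returning data for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU idle is {{ $value }}"   # $value is the value at evaluation time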

In fact, the recovery event is generated by Alertmanager based on resolve_timeout, not by Prometheus. When Alertmanager generates the recovery event, it carries over the labels and annotations of the last firing alert. As for the value, Alertmanager does not query Prometheus to fetch the latest one.
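
For context, resolve_timeout lives in the global section of the Alertmanager configuration (a minimal sketch; 5m is the Alertmanager default):

# alertmanager.yml (sketch)
global:
  resolve_timeout: 5m   # alerts that stop being re-sent and carry no explicit end time are treated as resolved after this window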

Can Alertmanager get the value at the time of recovery?

Frankly, it's hard. To do so, Alertmanager would have to query Prometheus for the value at recovery time based on the labels and annotations of the last firing alert, and Alertmanager won't do that, essentially for two reasons:

  • Architecturally, having Alertmanager query Prometheus would be a reverse dependency. Alertmanager is a distribution hub for alerts: it receives events pushed by Prometheus as well as by other alert sources, and having it query Prometheus in return would couple the two far too tightly.
  • Prometheus can attach extra labels to its alert rules, which are merged with the labels of the monitored metric and sent to Alertmanager as the event's label set. To fetch the raw data, Alertmanager would have to query Prometheus based on these labels, which is not feasible in some scenarios: first, Alertmanager has no way to strip the extra rule labels and keep only the data labels; second, some promql results carry no labels at all, so there is nothing to query by; and third, Alertmanager would have to parse the promql to remove the threshold, and some promql conditions are not numeric thresholds at all.

If you want to get what you want by modifying Alertmanager, give up.

Is there a way out of this?

There is. There are two usual approaches:

  • In the alert rule, additionally configure a promql to run on recovery
  • Take the threshold out of the promql, use the promql only to query the raw data, and do the threshold check at a higher layer, so the value is always available whether the metric is currently healthy or not

I'll use a couple of monitoring products as examples to illustrate each approach.

Configure the promql for recovery in the alert rule

Among products that use this approach, let's take Nightingale monitoring as an example. Two things need to be configured: the recovery promql in the alert rule, and the rendering of the recovery value in the alert template.

For example, I have an alert rule for detecting HTTP address probing failures:

(Figure: the alert-time promql in the Nightingale alert rule)

You then add a recovery_promql entry in the custom fields at the bottom of the alert rule, like this:

(Figure: the recovery_promql entry in the rule's custom fields)

To understand the logic of how this works, let's first look at what the data for the metric http_response_result_code looks like:

(Figure: instant query of http_response_result_code in Nightingale)

From the figure we can see that this metric has two series: their agent_hostname and method labels are identical, and the target label is what distinguishes them. When http_response_result_code != 0 fires, the alert event therefore carries a target label, so at recovery time we can query with the target label from that alert and pinpoint the recovery value exactly. That is why the recovery_promql references the target label through a variable, which is filled in with the value of the target label from the alert event.
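
Put together, the rule looks roughly like the sketch below. The exact variable syntax for referencing the event's target label in recovery_promql is an assumption here; check the Nightingale documentation for the precise form:

# Nightingale alert rule (sketch, not exact UI fields)
promql: http_response_result_code != 0                          # fires when the HTTP probe fails
# custom field added at the bottom of the rule:
recovery_promql: http_response_result_code{target="$target"}   # "$target" stands for the target label of the alert event (syntax assumed)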

Then, in the alert template, we add rendering for the recovery value. Using the DingTalk template as an example:

#### {{if .IsRecovered}}<font color="#008800">💚{{.RuleName}}</font>{{else}}<font color="#FF0000">💔{{.RuleName}}</font>{{end}}

---
{{$time_duration := sub now.Unix .FirstTriggerTime }}{{if .IsRecovered}}{{$time_duration = sub .LastEvalTime .FirstTriggerTime }}{{end}}
- **Alarm level**: {{.Severity}}
{{- if .RuleNote}}
- **Remarks on the rules**: {{.RuleNote}}
{{- end}}
{{- if not .IsRecovered}}
- **Value at trigger time**: {{.TriggerValue}}
- **Trigger time**: {{timeformat .TriggerTime}}
- **Alarm duration**: {{humanizeDurationInterface $time_duration}}
{{- else}}
{{- if .AnnotationsJSON.recovery_value}}
- **Value at recovery time**: {{formatDecimal .AnnotationsJSON.recovery_value 4}}
{{- end}}
- **Recovery time**: {{timeformat .LastEvalTime}}
- **Alarm duration**: {{humanizeDurationInterface $time_duration}}
{{- end}}
- **Alarm Event Tags**:
{{- range $key, $val := .TagsMap}}
{{- if ne $key "rulename" }}
  - `{{$key}}`: `{{$val}}`
{{- end}}
{{- end}}

The most critical part here is the check on AnnotationsJSON.recovery_value:

{{- if .AnnotationsJSON.recovery_value}}
- **Value at recovery time**: {{formatDecimal .AnnotationsJSON.recovery_value 4}}
{{- end}}

If AnnotationsJSON contains a recovery_value, it is displayed, formatted to 4 decimal places. .AnnotationsJSON corresponds to the custom fields section of the Nightingale alert rule; if the alert event carries a recovery value, it shows up in this field.
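
For illustration, the annotations attached to a recovered event would then carry something like this (the value shown is hypothetical):

# annotations of a recovered alert event (illustrative)
recovery_value: "0"   # raw value returned by recovery_promql at recovery time; the template formats it to 4 decimals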

The final result is as follows:

(Figure: the final rendering result)

Take the threshold out of the promql and use it only to query raw data

Among products that use this approach, take FlashDuty as an example. FlashDuty supports not only configuring a recovery_promql like Nightingale, but also writing the promql without a threshold. Here we focus on the threshold-free promql method.

As an example, an alert rule for Memcached leaves the threshold out of the query and puts it in the decision rule instead, as shown in the following figure:

(Figure: a FlashDuty rule with the threshold in the decision condition rather than in the query)
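
Expressed as pseudo-configuration, the split looks roughly like this (FlashDuty configures this through its UI, and the field names below are purely illustrative):

# threshold-free rule (pseudo-config, field names illustrative)
query: memcached_current_connections      # promql with no threshold: always returns the current value
conditions:
  - severity: warning
    when: $value > 1000                   # threshold applied by the alerting engine to the queried value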

With this approach, the current value is always queried first and the threshold check is applied to it afterwards, so the current value is available both when the alert fires and when it recovers. It is very intuitive and fits most scenarios. However, if the query matches a large number of series, it pulls back a particularly large amount of data and puts pressure on the alerting engine, which is something to keep in mind.

If you want your monitoring system to show the value at recovery time as well, you can refer to the two approaches above.