Alerting downtimes in Slack using Heartbeat and Elasticsearch Watchers
This is a two-part series; check out the first part, Elastic Heartbeat uptime and latency monitoring.
To set up alerting, we can use Elasticsearch Watchers. Never heard of them?
You add watches to automatically perform an action when certain conditions are met. The conditions are generally based on data you’ve loaded into the watch, also known as the Watch Payload. This payload can be loaded from different sources — from Elasticsearch, an external HTTP service, or even a combination of the two.
For example, you could configure a watch to send an email to the sysadmin when a search in the logs data indicates that there are too many 503 errors in the last 5 minutes.
I will not go into too much depth about watchers in general; check out the great documentation on How Watcher works.
To create a new watcher, open Kibana and go to Management > Elasticsearch > Watcher. You have to create an advanced watch (JSON) to achieve this, since threshold alerts only cover very simple threshold scenarios.
Our alert will do the following:
Check every minute whether at least two failed pings occurred for any of the services we are monitoring. If two pings failed for one service, we want to get an alert via Slack.
First, the complete example, which we'll break down afterwards:
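The following is a minimal sketch of what such a watch can look like, assuming the default Heartbeat field names (monitor.status, monitor.name, @timestamp), an aggregation named by_monitors, and a Slack account that is already configured for Watcher; the message texts and the min_doc_count setting are illustrative and should be adapted to your setup.

```json
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["heartbeat-*"],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                { "term": { "monitor.status": "down" } },
                { "range": { "@timestamp": { "gte": "now-1m" } } }
              ]
            }
          },
          "aggs": {
            "by_monitors": {
              "terms": {
                "field": "monitor.name",
                "min_doc_count": 2
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 1
      }
    }
  },
  "actions": {
    "notify_slack": {
      "throttle_period_in_millis": 300000,
      "slack": {
        "message": {
          "to": ["#ops"],
          "text": "Heartbeat detected failed pings",
          "dynamic_attachments": {
            "list_path": "ctx.payload.aggregations.by_monitors.buckets",
            "attachment_template": {
              "color": "danger",
              "title": "{{key}}",
              "text": "{{doc_count}} failed pings within the last minute"
            }
          }
        }
      }
    }
  }
}
```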
Let’s break this down.
The trigger defines when the watch should be triggered. In this case, the watcher is triggered every minute. Check out the schedule trigger documentation for more information and possibilities.
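In the sketch above, the trigger is a simple schedule with a one-minute interval:

```json
"trigger": {
  "schedule": {
    "interval": "1m"
  }
}
```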
We define which indices should be searched; with the wildcard, any Heartbeat index is picked up.
The query is a regular Elasticsearch query. We search for entries where the field monitor.status equals down and the timestamp of the event is within the last minute. Depending on how often you ping a target, the timestamp filter should be adjusted: if you ping a target only once per minute, searching for multiple failed pings within the last minute would not make sense.
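The input of the sketch above covers exactly this: it searches the heartbeat-* indices for down events with a timestamp within the last minute. The term and range clauses are one way to express it; a match query works just as well.

```json
"input": {
  "search": {
    "request": {
      "indices": ["heartbeat-*"],
      "body": {
        "size": 0,
        "query": {
          "bool": {
            "filter": [
              { "term": { "monitor.status": "down" } },
              { "range": { "@timestamp": { "gte": "now-1m" } } }
            ]
          }
        }
      }
    }
  }
}
```

The aggregation from the next step sits in the same request body, next to the query.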
Heartbeat monitors should have a descriptive name. This way we get a readable identifier in Kibana and can output the name in the alert.
We aggregate by that name, so down events for different systems are grouped per system. If we have one failed ping from Service-A and another failed ping from Service-B, that will not trigger an alert, because we aggregate by name and each service only has a single failed ping.
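In the sketch, this is a terms aggregation on monitor.name inside the request body; the min_doc_count of 2 is an assumption that keeps services with only a single failed ping out of the bucket list used for the Slack message:

```json
"aggs": {
  "by_monitors": {
    "terms": {
      "field": "monitor.name",
      "min_doc_count": 2
    }
  }
}
```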
The condition makes sure that we have at least two failed pings (the number of hits is greater than 1).
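In the sketch, the condition is a simple compare on the total number of hits in the payload:

```json
"condition": {
  "compare": {
    "ctx.payload.hits.total": {
      "gt": 1
    }
  }
}
```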
We use a Slack action and send a message to the #ops channel. By defining a list_path on our aggregation buckets, the Slack message will contain a list of the systems that are down, by their names. The throttle_period_in_millis setting defines a notification throttle in milliseconds; in this case we will only get a notification once every five minutes.
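The action part of the sketch looks like this; the action name notify_slack, the message text, and the attachment template are illustrative, and the Slack account itself has to be configured for Watcher separately. The 300000 milliseconds correspond to the five minutes mentioned above:

```json
"actions": {
  "notify_slack": {
    "throttle_period_in_millis": 300000,
    "slack": {
      "message": {
        "to": ["#ops"],
        "text": "Heartbeat detected failed pings",
        "dynamic_attachments": {
          "list_path": "ctx.payload.aggregations.by_monitors.buckets",
          "attachment_template": {
            "color": "danger",
            "title": "{{key}}",
            "text": "{{doc_count}} failed pings within the last minute"
          }
        }
      }
    }
  }
}
```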
If you like this post, feel free to follow me on Twitter or leave a comment.