Alerting downtimes in Slack using Heartbeat and Elasticsearch Watchers

Kevin Grüneberg
3 min read · Feb 4, 2020

This is a two-part series; check out the first part, Elastic Heartbeat uptime and latency monitoring.

To set up alerting, we can use Elasticsearch Watcher. Never heard of it?

You add watches to automatically perform an action when certain conditions are met. The conditions are generally based on data you’ve loaded into the watch, also known as the Watch Payload. This payload can be loaded from different sources — from Elasticsearch, an external HTTP service, or even a combination of the two.

For example, you could configure a watch to send an email to the sysadmin when a search in the logs data indicates that there are too many 503 errors in the last 5 minutes.

I will not go into too much depth about watchers in general; check out the great documentation on How Watcher works.

To create a new watcher, open up Kibana and go to Management > Elasticsearch > Watcher. You have to create an advanced watch (JSON) to achieve this, since threshold alerts only cover very simple threshold scenarios.

Our alert will do the following:

Check every minute whether at least two failed pings occurred for any of the services we are monitoring. If two pings failed for one service, we want to get an alert via Slack.

First, here is the complete example, which we'll break down afterwards.
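
The watch below is a minimal sketch of what this could look like, assuming Heartbeat's default heartbeat-* index pattern and a Slack account that is already configured for Watcher. The aggregation name by_monitor_name, the action name notify_slack, the #ops channel and the message texts are placeholders; adjust them to your setup.

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "heartbeat-*"
        ],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "term": { "monitor.status": "down" } },
                { "range": { "@timestamp": { "gte": "now-1m" } } }
              ]
            }
          },
          "aggs": {
            "by_monitor_name": {
              "terms": {
                "field": "monitor.name"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.by_monitor_name.buckets": {
        "path": "doc_count",
        "gt": {
          "value": 1
        }
      }
    }
  },
  "actions": {
    "notify_slack": {
      "throttle_period_in_millis": 300000,
      "slack": {
        "message": {
          "to": [
            "#ops"
          ],
          "text": "Heartbeat detected downtimes",
          "dynamic_attachments": {
            "list_path": "ctx.payload.aggregations.by_monitor_name.buckets",
            "attachment_template": {
              "title": "{{key}}",
              "text": "{{doc_count}} failed pings in the last minute"
            }
          }
        }
      }
    }
  }
}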

Let’s break this down.

Trigger

The trigger defines when the watcher should be triggered. In this case, the watcher is triggered every minute. Check out the schedule trigger documentation for more information and possibilities.
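
In the sketch above, this is a schedule trigger with a one-minute interval:

"trigger": {
  "schedule": {
    "interval": "1m"
  }
}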

Indices

We define which indices should be searched; with the wildcard, any Heartbeat index is picked up.
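
In the sketch above, this is the indices array of the search input, assuming Heartbeat's default heartbeat-* index pattern:

"indices": [
  "heartbeat-*"
]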

Query

The query itself is a regular Elasticsearch query.

We search for entries where the field monitor.status equals down and the event timestamp is within the last minute. Depending on how often you ping a target, the timestamp filter should be adjusted: if you ping a target only once every minute, searching for multiple failed pings within the last minute would not make sense.
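
In the sketch above, this is a bool query filtering on Heartbeat's monitor.status field and the @timestamp of the event:

"query": {
  "bool": {
    "filter": [
      { "term": { "monitor.status": "down" } },
      { "range": { "@timestamp": { "gte": "now-1m" } } }
    ]
  }
}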

Aggregations

Heartbeat monitors should have a human-readable name. This way, we get an understandable identifier in Kibana and can output the name in the alert.

We aggregate by that name: if we have down events for multiple systems, we want them grouped by system name.

So if we have one failed ping from Service-A and another failed ping from Service-B, that will not trigger an alert, because we aggregate by name and each service only has a single failed ping.
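
In the sketch above, this is a terms aggregation on Heartbeat's monitor.name field; the aggregation name by_monitor_name is just a placeholder:

"aggs": {
  "by_monitor_name": {
    "terms": {
      "field": "monitor.name"
    }
  }
}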

Action condition

The condition makes sure that we have at least two failed pings for a single service (its number of hits is greater than 1).
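
In the sketch above, this is modeled as an array_compare condition over the aggregation buckets: it matches as soon as some bucket, i.e. some service, has more than one down event in the window. A plain compare on ctx.payload.hits.total would also detect two failures, but could not tell whether they belong to the same service:

"condition": {
  "array_compare": {
    "ctx.payload.aggregations.by_monitor_name.buckets": {
      "path": "doc_count",
      "gt": {
        "value": 1
      }
    }
  }
}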

Slack Action

We use a Slack action and send a message to the #ops channel. By pointing list_path at our aggregation, the Slack message will contain the names of the systems that are down.

The throttle_period_in_millis setting defines a notification throttle in milliseconds. In this case, we will get a notification at most once every five minutes.
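
In the sketch above, the action looks as follows; the Slack account is assumed to be configured in Elasticsearch, and the action name, channel and message texts are placeholders:

"actions": {
  "notify_slack": {
    "throttle_period_in_millis": 300000,
    "slack": {
      "message": {
        "to": ["#ops"],
        "text": "Heartbeat detected downtimes",
        "dynamic_attachments": {
          "list_path": "ctx.payload.aggregations.by_monitor_name.buckets",
          "attachment_template": {
            "title": "{{key}}",
            "text": "{{doc_count}} failed pings in the last minute"
          }
        }
      }
    }
  }
}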

Slack sample alert

If you like this post, feel free to follow me on Twitter or leave a comment.
