Zabbix stopped sending out alerts, HALP!

Zabbix was this reliable friend, always sending you an email, an SMS or both when something went down. It sometimes sent you a lot of emails, but you never got angry at Zabbix about that – it was just eager to help you and make sure you did not miss the weekly disaster. But then… last week… Zabbix did not send you an SMS. It did not send you an email. It did not telepathically inform you. But things were DOWN. The server was not RESPONDING.


Zabbix knew about this. As you review the data, sitting in a dark room, the graphs clearly show the downtime. But there was-no-alert. How is that possible? Wait, what, this is impossible. You can see on the glowing screen that the main action, a crucial piece in getting those alerts, is disabled. That just cannot be, as nobody, NOBODY would ever disable that. How, oh how. Why, oh why.

[Image: Zabbix frontend row showing a disabled action]

Zabbix can skip sending alerts for a lot of reasons, but those usually come into play when setting up new notification schemes. How could an existing, working configuration suddenly end up with a disabled action? Sure, somebody could have mis-clicked, or made some other mistake. But what if nobody but you has administrative access, and you are SURE you did not disable that action?

When actions magically get disabled

Well, here’s one possible scenario. Let’s say you were testing some new remedy, maybe an automatic restart for a picky service. For a clean test, you added a trapper item and a trigger on it, and included that trigger in the conditions of this action. You sent in a value for the trapper item, the action fired as expected, and then you removed the test item and trigger. You were happy.

Or maybe the action had a condition excluding the host group “Just Testing”. That approach was later dropped, so you removed the group.

Wait, but if your Important Action had conditions referring to these entities, what happens to the action once they are gone? Zabbix sort of tries to be helpful and disables the action. Zabbix thinks:

If some condition is lost, the action might not perform as intended anymore. To be on the safe side, I’ll just disable it.

Zabbix is not entirely wrong here. But it is not entirely right either. The action is disabled without any confirmation and even without any message. The user has no way of knowing that removing a template, trigger, host group or some other entity has just disabled one or more actions.
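To make the mechanism concrete, here is a deliberately simplified Python model of the behaviour described above. It is not Zabbix source code: the class names, the string condition types and the assumption that the dangling condition is dropped at the same time as the action is disabled are all illustrative. The point is the shape of the problem: the deleting code knows exactly which actions it touched, yet none of that reaches the user.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of the behaviour described above.
# Not Zabbix source code; only an illustration of its effect.
ENABLED, DISABLED = 0, 1  # same numeric convention as the action "status" field


@dataclass
class Condition:
    conditiontype: str  # simplified label, e.g. "hostgroup" or "trigger"
    value: str          # ID of the referenced entity


@dataclass
class Action:
    name: str
    status: int = ENABLED
    conditions: list[Condition] = field(default_factory=list)


def delete_entity(actions: list[Action], conditiontype: str, value: str) -> list[str]:
    """Delete an entity and silently disable every action that referenced it.

    Returns the names of the affected actions: exactly the list
    that is never shown to the user.
    """
    affected = []
    for action in actions:
        kept = [c for c in action.conditions
                if not (c.conditiontype == conditiontype and c.value == value)]
        if len(kept) != len(action.conditions):
            action.conditions = kept   # the dangling condition is dropped...
            action.status = DISABLED   # ...and the action is disabled, no questions asked
            affected.append(action.name)
    return affected


important = Action("Important Action", conditions=[Condition("trigger", "13570")])
print(delete_entity([important], "trigger", "13570"))  # ['Important Action']
print(important.status == DISABLED)                    # True
```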

What can you do about it

Not much. Changing the behaviour would involve some hacking and a patch to Zabbix, and such a patch would surely be a pain to maintain. A safety-net approach could be used, though – you could periodically check your most important actions and, if one of them gets disabled, alert yourself somehow (preferably not relying on that same action). This topic was discussed on the Zabbix IRC channel, and a simple script was born. The script accepts a single action ID as a parameter, for example:

$ --actionid 13

Output is a numeric code:

  • 0 – action exists and is enabled
  • 1 – action exists and is disabled
  • 2 – action does not exist
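The script from the IRC discussion is not reproduced here. As a rough idea of what such a check could look like, the following Python sketch asks the Zabbix API for the action with action.get and prints 0, 1 or 2 as listed above. It assumes Zabbix 6.4 or later, where an API token can be sent in an Authorization: Bearer header (older versions expect it in an auth field in the request body instead); the --url option and the ZABBIX_API_TOKEN environment variable are hypothetical choices made for this sketch.

```python
import argparse
import json
import os
import urllib.request

# Codes printed by the check, as listed above.
ENABLED, DISABLED, MISSING = 0, 1, 2


def classify(result: list) -> int:
    """Map an action.get result to 0 (enabled), 1 (disabled) or 2 (missing)."""
    if not result:
        return MISSING
    # The API returns "status" as a string: "0" = enabled, "1" = disabled.
    return ENABLED if str(result[0]["status"]) == "0" else DISABLED


def build_request(actionid: int) -> dict:
    """JSON-RPC body for action.get, limited to a single action ID."""
    return {
        "jsonrpc": "2.0",
        "method": "action.get",
        "params": {"actionids": [str(actionid)], "output": ["actionid", "status"]},
        "id": 1,
    }


def check_action(api_url: str, token: str, actionid: int) -> int:
    """Query the API and classify the action (assumes Bearer token auth)."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(build_request(actionid)).encode(),
        headers={"Content-Type": "application/json-rpc",
                 "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.load(resp)
    if "error" in reply:
        # An API failure is not one of the three codes; let it surface loudly.
        raise RuntimeError(reply["error"])
    return classify(reply["result"])


def main(argv=None) -> None:
    """Command-line entry point: --actionid N --url API_URL."""
    parser = argparse.ArgumentParser(description="Print 0/1/2 for a Zabbix action")
    parser.add_argument("--actionid", type=int, required=True)
    parser.add_argument("--url", required=True)  # hypothetical option
    args = parser.parse_args(argv)
    token = os.environ["ZABBIX_API_TOKEN"]       # hypothetical variable name
    print(check_action(args.url, token, args.actionid))
```

Saved as a script with `if __name__ == "__main__": main()` at the bottom, it could be called with arguments such as `--actionid 13 --url https://zabbix.example.com/api_jsonrpc.php` (hostname hypothetical). An API or network failure raises an exception instead of printing a code, so whatever runs the check should treat a crash as a problem too.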

This script could be run either from cron or as a Zabbix external check. Upon a non-zero printed value (not the exit code), it should send a notification directly, by email or by some other method that cannot be thwarted by disabled actions in Zabbix.
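For the notification side, a wrapper run from cron could read the printed value (not the exit status) and send an email straight through an SMTP server, so that nothing depends on Zabbix media types or actions. A minimal Python sketch, assuming a reachable SMTP relay (localhost by default) and hypothetical sender and recipient addresses:

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Human-readable meaning of the non-zero codes printed by the checker.
MESSAGES = {1: "is DISABLED", 2: "does NOT EXIST"}


def read_code(cmd: list[str]) -> int:
    """Run the checker and parse the code it prints on stdout.

    The printed value matters, not the exit status; check=True only
    makes a crashing checker raise instead of being silently ignored.
    """
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return int(out.strip())


def build_alert(actionid: int, code: int, sender: str, recipient: str):
    """Return an EmailMessage for a non-zero code, or None when all is well."""
    if code == 0:
        return None
    status = MESSAGES.get(code, f"returned unexpected code {code}")
    msg = EmailMessage()
    msg["Subject"] = f"Zabbix action {actionid} {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Review this action in the Zabbix frontend.")
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Send directly over SMTP, bypassing Zabbix media types and actions."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Calling `read_code()` with the checker's command line, passing the result to `build_alert()` and handing any returned message to `send_alert()` would complete the loop; scheduling that every few minutes from cron keeps the whole safety net outside Zabbix.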

What might be done about it some day

Probably the best solution would be a warning when attempting to delete an entity that is used in an action condition. Zabbix could list all the affected actions, let you mark which ones to disable and which ones to keep enabled, and let you open the action details for closer inspection. As a simple short-term improvement, the list of affected actions could be shown with a simple proceed/cancel choice. If you would like to see a solution for this problem, make sure to vote on ZBXNEXT-551.
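Until something like that exists, the "list affected actions" step can be approximated by hand before deleting an entity. The Python sketch below works on the response of action.get called with selectFilter set to "extend", which includes each action's conditions. The numeric condition types (0 for host group, 1 for host, 2 for trigger, 13 for template) are assumed from the Zabbix API action condition reference and should be verified for your version.

```python
# Condition type codes assumed from the Zabbix API action condition
# reference; verify them against the documentation for your version.
CONDITION_TYPES = {"host group": 0, "host": 1, "trigger": 2, "template": 13}

# JSON-RPC body asking for every action together with its conditions.
ACTION_GET = {
    "jsonrpc": "2.0",
    "method": "action.get",
    "params": {"output": ["actionid", "name", "status"], "selectFilter": "extend"},
    "id": 1,
}


def actions_referencing(actions: list[dict], conditiontype: int, value) -> list[str]:
    """Names of actions with a condition pointing at the given entity ID."""
    hits = []
    for action in actions:
        for cond in action.get("filter", {}).get("conditions", []):
            # The API returns numbers as strings, so normalise both sides.
            if (int(cond["conditiontype"]) == conditiontype
                    and str(cond["value"]) == str(value)):
                hits.append(action["name"])
                break  # one matching condition is enough for this action
    return hits
```

Run against the "result" list of that request, `actions_referencing(result, CONDITION_TYPES["trigger"], "13570")` would name every action that deleting trigger 13570 (an example ID) is about to disable.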

Note that things are slightly worse when using a custom action condition expression. The saved expression won’t be updated, but you would not see that in the frontend – a seemingly correct, auto-generated expression will be displayed instead. There’s a separate report for this case: ZBX-9943.
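A stale custom expression could in principle be spotted by comparing the formula IDs it uses with the IDs of the conditions that still exist. This Python sketch assumes the action's filter object as returned by action.get with selectFilter, that evaltype 3 means a custom expression, and, most importantly, that the formula field holds the stored expression rather than a regenerated one. That last point is unverified, so treat this as an idea to test rather than a working detector.

```python
import re

# Filter "evaltype" value assumed to mean "custom expression"; verify for your version.
CUSTOM_EXPRESSION = 3


def dangling_formula_ids(action_filter: dict) -> set[str]:
    """Formula IDs used in a custom expression that no remaining condition defines.

    Assumes formula IDs are uppercase letters (A, B, ...) while the
    operators (and, or, not) are lowercase, so uppercase words are IDs.
    """
    if int(action_filter.get("evaltype", 0)) != CUSTOM_EXPRESSION:
        return set()
    used = set(re.findall(r"\b[A-Z]+\b", action_filter.get("formula", "")))
    defined = {c["formulaid"] for c in action_filter.get("conditions", [])}
    return used - defined
```

An empty set means every ID in the expression still has a condition behind it; anything else points at a condition that was removed from under the expression.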
