The most important bug in Zabbix

How do you determine which bugs are important?

A bug must still be unfixed to be important. If a new version of Zabbix comes out and the server crashes for all users, that is the most important bug, at least until it is fixed, hopefully soon.

But some long-standing bugs linger just below the “fix-it” surface – they’re not terrible enough to be fixed right away, and they are usually somewhat complicated. Such bugs can be around for many years, sometimes never being fixed but going away because a feature gets dropped completely. We need a way to measure which of those known bugs is the most important. And there is one – same as with features, users can vote on bug reports. The bug report with the most votes is titled “deadlock between server and frontend”.

Jira screenshot, showing "deadlock between server and frontend" issue title

OK, that sounds scary. Is that an old one? Yep, reported on 2010-06-01. This bug will have its 7th anniversary in a few months.

A deadlock in computing happens when several processes lock resources in such a way that none of them can proceed. More specifically, in databases a deadlock occurs when a process locks tables or rows so that another process cannot proceed – and that other process in turn holds locks on resources the first process needs. Or even more specifically: if one process locks the applications table and needs to make changes to the items table before releasing that lock, while another process holds a lock on the items table and needs to make changes to the applications table, neither of them can proceed. (This is not a real situation in Zabbix.)

Schematic of two processes having locked a table each and depending on the other table before being able to proceed and release their lock
Processes ‘zapper’ and ‘fryer’ have locked tables ‘applications’ and ‘items’ and need the other process to release the lock before proceeding
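The situation in the schematic can be imitated with two threads and two ordinary locks. This is only a toy sketch: Python locks stand in for table locks, the function names are made up for illustration, and a timeout is used so the demo gives up instead of hanging. Real databases typically notice such a cycle and abort one of the transactions.

```python
import threading

TABLES = ("applications", "items")


def opposite_order_attempt(timeout=0.2):
    """Two workers lock the two 'tables' in opposite order.

    Returns {worker_name: whether it managed to get its second lock}.
    """
    locks = {table: threading.Lock() for table in TABLES}
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with locks[first]:
            both_hold_first.wait()  # make sure each worker holds its first lock
            acquired = locks[second].acquire(timeout=timeout)
            got_second[name] = acquired
            both_tried_second.wait()  # keep the first lock until both have tried
            if acquired:
                locks[second].release()

    threads = [
        threading.Thread(target=worker, args=("zapper", "applications", "items")),
        threading.Thread(target=worker, args=("fryer", "items", "applications")),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return got_second


def same_order_attempt():
    """Both workers lock the 'tables' in the same order, so no cycle can form."""
    locks = {table: threading.Lock() for table in TABLES}
    finished = []

    def worker(name):
        first, second = sorted(TABLES)  # every worker uses the same order
        with locks[first]:
            with locks[second]:
                finished.append(name)

    threads = [
        threading.Thread(target=worker, args=(name,)) for name in ("zapper", "fryer")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(finished)


if __name__ == "__main__":
    print(opposite_order_attempt())
    print(same_order_attempt())
```

In the first function neither worker can get its second lock while the other still holds it. In the second, taking the locks in one agreed order lets both workers finish, one after the other.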

The bug itself has a few manifestations, as discussed in the issue ZBX-2494:

  • Zabbix server can actually deadlock itself, slightly contradicting the issue title
  • Zabbix server can deadlock with the Zabbix frontend

Zabbix developers have posted a scenario in which the Zabbix server deadlocking with itself was reproduced. While a serious problem, it only occurs in very specific high-load conditions that are quite unlikely to happen during normal operation, even in large Zabbix instances. How could the Zabbix server deadlock with itself if a deadlock needs at least two processes holding locks? The server actually consists of many internal processes (pollers, trappers, discoverers and so on), and most of those processes have their own connection to the database. In the end we have dozens or even hundreds of such processes, so there is no lack of potential participants in the deadlock game.

The server-frontend deadlocks are more serious. These can happen when making large changes in the frontend while the server is running, usually in bigger Zabbix instances. “Large changes” could be linking a template to many hosts, or changing an item in a template that is linked to many hosts. If your hosts are numbered in hundreds, you are not that likely to see this problem in recent Zabbix versions. If your hosts are numbered in thousands or tens of thousands, it is quite likely to happen at some point. The Zabbix version matters somewhat as well: older versions hammered the database a bit more, so fewer objects had to be updated for the trouble to begin.

If one has a large Zabbix instance, one workaround could be splitting up those changes – if thousands of hosts have to be linked to a template, that could be scripted in smaller batches using the Zabbix API. At a hundred hosts at a time, you should be fine. The same applies to unlinking and some other operations, but some cases are not easy to split into batches. For example, modifying an item in a template that is linked to many hosts cannot simply be done in batches – if you modify the template, that change is propagated to the hosts right away. One solution could be using the API to unlink (without clearing) the hosts from that template one by one or in small batches, then modifying the template, then relinking the template to the hosts. Ugh.
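The batching idea could look roughly like this. It is a sketch under assumptions, not a tested script: the helper names are made up, the host and template IDs are placeholders, and it only builds the JSON-RPC request bodies for the API's `host.massadd` method. Authentication and actually sending the requests are left out, as those details depend on the Zabbix version.

```python
import json


def batches(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def link_template_payloads(host_ids, template_id, batch_size=100):
    """Build one host.massadd request body per batch of hosts.

    Nothing is sent here; each body would still need authentication
    before being posted to the Zabbix API endpoint.
    """
    payloads = []
    for request_id, chunk in enumerate(batches(host_ids, batch_size), start=1):
        payloads.append({
            "jsonrpc": "2.0",
            "method": "host.massadd",
            "params": {
                "hosts": [{"hostid": host_id} for host_id in chunk],
                "templates": [{"templateid": template_id}],
            },
            "id": request_id,
        })
    return payloads


if __name__ == "__main__":
    # Placeholder IDs, purely for illustration.
    example_hosts = [str(10000 + n) for n in range(250)]
    for payload in link_template_payloads(example_hosts, "10001"):
        print(len(payload["params"]["hosts"]), "hosts in request", payload["id"])
        print(json.dumps(payload)[:80] + "...")
```

Sending the requests one after another, and waiting for each to complete before posting the next, keeps each database transaction small.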

Another workaround has been mentioned in the bug report:

“Because of this problem, it is impossible to make changes to the database without stopping the Zabbix server.”

Stopping the server would eliminate the deadlocks, and it would also reduce the load on the database, allowing the change to be completed sooner. Of course, stopping and starting a large Zabbix instance can take quite some time, so that is not a good solution either.

So how common and serious is this problem? The bug report has 28 votes. Compare that to the most popular feature request, which had 172 votes back in January, and it is clear that, while a real problem, it is not very common. As it’s usually seen by people with bigger installations, it could be that only a small number of Zabbix users are facing this issue. On the other hand, this report has 10 duplicates – other issues that are either about the same problem or similar enough to be folded into this one.

Have you seen this problem? Make sure to vote on the bug report. Just don’t add the terrible “+1” comments 😉
