Looking back at the Zabbix Conference 2016, day 2

The second day of the Zabbix conference started with workshops. This was a completely new addition, so there was limited experience with organising them. There were four workshops in two tracks:

  • Scale with Zabbix Partitioning
  • Master Low Level Discovery
  • Hands On Trend Prediction
  • Guide to Extending Zabbix + Scripts and API

Zabbix workshop information on the agenda

Note that “Zabbix Partitioning” above is “partitioning MySQL for Zabbix” – a few people expressed confusion about the workshop name.

The topics were technical and the audience brave – they had managed to get up after the first day of the conference and be there at 08:30. One thing everybody agreed upon – there was not enough time.
With 40 minutes for each workshop, there was enough time to discuss the matter and barely start the practical part. Nevertheless, the workshops were very well received and many participants suggested devoting a whole day before the conference to them next year. I’ll write in more detail about mastering low level discovery later, as I had the honour of giving – or attempting to give – that workshop. Just 40 minutes, which turn into 30 due to various delays, are far from enough to play with the great feature that is LLD.

But let’s return to the conference hall now and see what talks are available there.

The first talk on the second day was given by Raymond Kuiper from Netco Technology. He shared the lessons learned trying to design Zabbix templates that would be easy to reuse and maintain. He even called that an art – and not without reason. Raymond stressed the well-known truth (at least in some circles) that the default templates should not be used unmodified; they should always be reviewed and used as an example of what is possible – not necessarily of what should be done. He shared his vision of template organisation around the roles of the monitored systems:

Organising nested templates by system role

He also demonstrated an interesting use of user macro context for similar items – instead of having a unique macro for each of the icmpping items that share, for example, the packet size parameter, a common user macro is used that may be overridden by the use of the macro context.
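
To make the fallback behaviour concrete, here is a minimal Python sketch of how a macro reference with context could be resolved: the context-specific value wins if it is defined, otherwise the plain macro is used. The macro names `{$ICMP_SIZE}` and `{$ICMP_SIZE:jumbo}` are made up for illustration and are not taken from Raymond's templates, and the sketch ignores quoted contexts and the host/template/global macro levels.

```python
import re

# Matches {$NAME} or {$NAME:context}; quoted contexts are not handled here.
MACRO_RE = re.compile(r"\{\$([A-Z0-9_.]+)(?::([^}]+))?\}")


def resolve_macro(reference, defined):
    """Resolve a user macro reference against a dict of defined macros.

    A reference with context falls back to the macro without context
    when no context-specific value has been defined.
    """
    match = MACRO_RE.fullmatch(reference)
    if match is None:
        raise ValueError("not a user macro reference: " + reference)
    name, context = match.groups()
    if context is not None:
        specific = "{$" + name + ":" + context + "}"
        if specific in defined:
            return defined[specific]
    return defined["{$" + name + "}"]


# Hypothetical macro definitions on a template.
macros = {
    "{$ICMP_SIZE}": "56",          # common default shared by all ping items
    "{$ICMP_SIZE:jumbo}": "8972",  # override for one specific item
}

print(resolve_macro("{$ICMP_SIZE:jumbo}", macros))  # context-specific value
print(resolve_macro("{$ICMP_SIZE:loss}", macros))   # falls back to the default
```

With this pattern, every similar item references the same macro name with its own context, and only the items that need a different value get an override.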

Another great idea was storing the trigger threshold and graphing it, so it would be obvious why a trigger fired or did not fire, even if the threshold has been changed since.

A graph showing the historical threshold

Lukáš Malý from DATASYS shared the details of a log management solution called ELISA, and mentioned that Zabbix is used by Czech Post, Czech Aeroholding and several ministries in the Czech Republic. Oh, dang. Go, Czechia 🙂

We learned that ELISA (not ELIZA) is a solution that integrates Logstash, Elasticsearch, Kibana, NXlog, JasperReports, Zabbix and a few other components. Zabbix is relied upon to provide authentication and authorisation support, as well as more “normal” monitoring features and alerting. Even more – the Zabbix interface has been modified to act as an interface for ELISA, providing additional management functionality. The Elasticsearch indices can be managed straight from the Zabbix frontend. The NXlog agents are centrally managed from Zabbix. Indeed, lots of integration going on 🙂

Oh, and they plan to release a virtual appliance in November.

ELISA virtual appliance screenshot

After a break, two Zabbix support engineers – Oleg Ivanivskyi and Ingus Vilnis – shared their stories about various Zabbix-related issues they have solved for customers. Oleg shared seven lessons he has learnt while providing Zabbix support:

  • It is important to agree on the expectations
  • Documenting the requirements is a must
  • Small changes can either help a lot, or break things a lot
  • Customers might delay reaching out to Zabbix for too long
  • Zabbix DB schema might be better than you think
  • Zabbix DB might often have useless data
  • Having good monitoring coverage is, well, good

He also expects to learn more lessons in the future.

Every day is a school day

There’s one lesson customers usually learn – the Zabbix support team is really qualified, helpful and friendly.

Ingus touted the benefits of the official Zabbix training. Having given these training sessions myself for 6 years, I’d like to highlight two great things Ingus mentioned:

  • Zabbix training sessions have a lot of practical tasks
    The contents might be slightly adapted for each session, but in general one can expect roughly half of the training time to be spent doing hands-on tasks with Zabbix – and not just clicking on graphs, but also real low level configuration and customisation.
  • Zabbix trainers are knowledgeable and experienced
    The people providing Zabbix training are not just reading from the slides, they actually know the product really well and are expected to answer nearly all questions about Zabbix. Indeed, Zabbix trainers can explain every single feature Zabbix has, and it is extremely rare that somebody would already know Zabbix so well that they wouldn’t learn a few new and useful things. One may even think of the training sessions as advanced consulting endeavours where the trainer is also this super-expert consultant.

Some of the topics covered in Zabbix training

Next up, Ryan Armstrong from Kinetic IT demonstrated their approach to automating Zabbix when monitoring more than 7000 devices. He advocated the “cattle, not pets” philosophy for treating the systems. For Zabbix access, they retrieve user data from a central directory, then update Zabbix using the API. This synchronisation is based on LDAP groups.
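
The core of such a synchronisation is a comparison between what the directory says and what Zabbix currently has. The following Python sketch shows only that planning step, under the assumption that group membership has already been fetched from LDAP and from Zabbix into plain dictionaries; the function name and data shapes are hypothetical and this is not Ryan's actual tooling. The resulting plan would then be applied through Zabbix API methods such as `user.create`, `user.update` and `user.delete`.

```python
def plan_user_sync(directory_groups, zabbix_users):
    """Work out which Zabbix users to create, update or delete.

    directory_groups: dict mapping LDAP group name -> set of logins
    zabbix_users:     dict mapping login -> set of group names in Zabbix
    """
    # Invert the directory data: login -> set of groups it should be in.
    desired = {}
    for group, members in directory_groups.items():
        for login in members:
            desired.setdefault(login, set()).add(group)

    to_create = {login: groups for login, groups in desired.items()
                 if login not in zabbix_users}
    to_update = {login: groups for login, groups in desired.items()
                 if login in zabbix_users and zabbix_users[login] != groups}
    to_delete = sorted(set(zabbix_users) - set(desired))
    return to_create, to_update, to_delete


# Made-up example data.
directory = {"noc": {"alice", "bob"}, "dba": {"bob"}}
zabbix = {"bob": {"noc"}, "carol": {"noc"}}

create, update, delete = plan_user_sync(directory, zabbix)
print(create)  # alice is missing in Zabbix
print(update)  # bob needs the extra "dba" group
print(delete)  # carol is no longer in any synchronised group
```

Keeping the planning step free of API calls makes it easy to review the intended changes before anything is written to Zabbix.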

Similarly, host configuration is retrieved from a CMDB, parsed and then Zabbix templates, hosts and trigger dependencies are updated using the API. He shared an SNMP template generator that grabs a MIB file and spits out a Zabbix template. And “shared” means that it is available on GitHub.
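
As a rough idea of what one step of such a pipeline might look like, here is a Python sketch that turns a single CMDB record into a JSON-RPC request for the Zabbix API `host.create` method. The record layout (`name`, `ip`), the IDs and the token are invented for illustration, and the `auth` field in the request body reflects the API as it was around the time of the talk; newer Zabbix versions expect the token in an HTTP header instead.

```python
import json


def host_create_request(record, group_ids, template_ids, auth_token, request_id=1):
    """Build a JSON-RPC request body for host.create from a CMDB record.

    record is a hypothetical CMDB entry with "name" and "ip" keys.
    """
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": record["name"],
            "interfaces": [{
                "type": 1,        # 1 = Zabbix agent interface
                "main": 1,        # default interface of this type
                "useip": 1,       # connect by IP rather than DNS name
                "ip": record["ip"],
                "dns": "",
                "port": "10050",  # default agent port
            }],
            "groups": [{"groupid": gid} for gid in group_ids],
            "templates": [{"templateid": tid} for tid in template_ids],
        },
        "auth": auth_token,  # assumption: pre-header authentication style
        "id": request_id,
    }


# Made-up record and IDs; the body would be POSTed to api_jsonrpc.php.
request = host_create_request(
    {"name": "web01", "ip": "192.0.2.10"},
    group_ids=["2"], template_ids=["10001"], auth_token="<token>",
)
print(json.dumps(request, indent=2))
```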

For people monitoring Windows, another tool might be useful – a PowerShell module that generates Zabbix templates from Windows performance counters. Ryan advocated the performance benefits of loadable Zabbix modules. And if you want to test how well the Zabbix agent is performing with a module, there’s an agent stress test utility for that.

Performance comparison of loadable modules vs. other solutions

Konstantin Yakovlev explained how RingCentral dealt with the issue of having less-than-perfect visibility into the various events Zabbix generated. Monitoring more than 10 thousand hosts across 4 Zabbix servers, their NOC was exposed to 2 thousand events per 24 hours on average. He highlighted 3 problematic areas:

  • a small number of triggers generating the majority of events
  • too many flapping triggers
  • a need for more automated event processing
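
The first two problem areas can be spotted with fairly simple statistics over an event export. The Python sketch below assumes events are available as `(trigger, timestamp, value)` tuples, one per state change; that layout, the function names and the flapping definition (at least `min_changes` state changes within `window` seconds) are assumptions for illustration, not a description of the RingCentral tool.

```python
from collections import Counter


def top_triggers(events, share=0.5):
    """Return the noisiest triggers that together cover `share` of all events."""
    counts = Counter(trigger for trigger, _ts, _value in events)
    total = sum(counts.values())
    selected, covered = [], 0
    for trigger, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append(trigger)
        covered += count
    return selected


def flapping_triggers(events, window=3600, min_changes=4):
    """Return triggers with at least min_changes events within any window seconds."""
    by_trigger = {}
    for trigger, ts, _value in sorted(events, key=lambda event: event[1]):
        by_trigger.setdefault(trigger, []).append(ts)
    flapping = set()
    for trigger, stamps in by_trigger.items():
        # Slide over each run of min_changes consecutive events.
        for i in range(len(stamps) - min_changes + 1):
            if stamps[i + min_changes - 1] - stamps[i] <= window:
                flapping.add(trigger)
                break
    return flapping


# Made-up events: value 1 = PROBLEM, 0 = OK.
events = [
    ("disk", 0, 1),
    ("cpu", 10, 1), ("cpu", 70, 0), ("cpu", 130, 1), ("cpu", 190, 0),
    ("net", 5000, 1),
]
print(top_triggers(events))       # the trigger responsible for most events
print(flapping_triggers(events))  # triggers changing state too often
```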

Their approach was to develop a separate tool to parse Zabbix events for tracking and reporting purposes. A very interesting feature of this tool is the ability to model a trigger on existing historical data, and even compare how multiple different triggers would react to real-world data. This way, it is easy to see how a new or changed trigger would have behaved in a real-life situation. The tool creates the desired triggers in a test Zabbix instance, pushes production data to that instance and presents the generated events to the user.
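
The idea of replaying history against a candidate trigger can be illustrated without a test Zabbix instance at all. The Python sketch below is a much simplified stand-in for what the tool does: it replays `(timestamp, value)` samples against a plain threshold, optionally with a lower recovery threshold (hysteresis), and returns the events that would have been generated. The function and parameter names are invented for this example.

```python
def replay_threshold(history, problem_above, recovery=None):
    """Replay (timestamp, value) samples against a threshold trigger.

    The trigger goes to PROBLEM when a value exceeds problem_above and
    back to OK when a value is at or below recovery. Without a separate
    recovery value, the same threshold is used for both directions.
    """
    if recovery is None:
        recovery = problem_above
    events = []
    in_problem = False
    for ts, value in history:
        if not in_problem and value > problem_above:
            in_problem = True
            events.append((ts, "PROBLEM"))
        elif in_problem and value <= recovery:
            in_problem = False
            events.append((ts, "OK"))
    return events


# Made-up CPU utilisation samples hovering around 80%.
history = [(0, 75), (60, 85), (120, 78), (180, 86), (240, 65)]

plain = replay_threshold(history, problem_above=80)
damped = replay_threshold(history, problem_above=80, recovery=70)
print(len(plain), plain)    # the plain threshold flaps
print(len(damped), damped)  # hysteresis yields a single problem period
```

Comparing the two event lists side by side is the same kind of question the tool answers: how would this trigger have behaved on data we already have?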

Trigger analyser interface screenshot

One reason why developing such a tool might not be for everybody – several people worked for a year to implement this slick solution.

Dimitri Bellini and Pietro Antonacci from Quadrata had to implement a geographically distributed Zabbix solution. Of course, they opted for Zabbix proxies, but they faced an issue – Zabbix proxies were not able to deliver data quickly enough to the Zabbix server. They learned the hard way that having synchronised clocks on all the servers can help a bit. As the next step, they attacked proxy database performance issues, which resulted in the proxies being able to push more values to the Zabbix server. On top of that, they reduced the time period that Zabbix proxies store the data locally, thus trading data availability for less queue buildup during connectivity issues.

As their users were in several timezones, Dimitri and Pietro decided to set up several Zabbix frontends, each showing data in the local time for the users. That is a feature which is not available in Zabbix natively yet.

Using multiple frontends, each with its own timezone

Rafael Martinez Guerrero took the stage to share the monitoring experience at the University of Oslo. He outlined the diverse environment, with a lot of different operating systems and devices connected in a complex IT infrastructure. As is often the case, the monitoring landscape had grown over time to include a lot of different tools, and he highlighted 3 important areas they needed a monitoring solution to cover:

  • distributed monitoring
  • limiting access
  • setting up dependencies

An overarching topic at this conference has been automation, and Rafael confirmed the importance of having integration between monitoring and other systems like CMDB. Interestingly enough, their CMDB is called Nivlheim – an afterlife for “those who did not die a heroic or notable death” – there’s surely a message in there 🙂

He mentioned the major components used in automating monitoring configuration (git, CFEngine and others), and covered the importance of limiting user permissions and integrating with LDAP. With a lot of users accessing Zabbix with different levels of permissions, they are also investing in the ability to see what was changed and to roll back the changes.

Schematic of Zabbix, CMDB and other component integration at the University of Oslo

That concluded the talks for the second day of Zabbix Conference 2016, and it’s a good moment to say thanks to the conference sponsors who helped all the participants to gain more experience with Zabbix: NTT Communications, Quadrata and Unirede.
