Visiting the Open Source Monitoring Conference 2016, Part 4

Returning to the events of the Open Source Monitoring Conference 2016, Avishai Ish-Shalom discussed an engineer’s approach to monitoring. David Hustace from OpenNMS told positive stories about this truly open-source monitoring tool.

OpenNMS mascot with a t-shirt saying "The kiwi bird is a direct descendant of the tyrannosaurus rex. rawr."

Avishai’s talk opened with a quick presentation of IaaS. What’s IaaS? Why, it is… Insult as a Service. It gives you various insults and offers an API. If you want to annoy your co-workers, you might even integrate it into sudo like this:

sudo password failures showing various insults

More on the monitoring side, he discussed various approaches to data collection and visualisation while walking through real-world problems. One of the suggested approaches was using histograms more. A histogram shows the value distribution over some period of time, sorted by value range – a “bucket”. These buckets could be static, logarithmic or dynamic, depending on the input data. Histogram visualisation is notably missing from most monitoring tools, except in some very specific cases. When using histograms, deciding on the approach is important. For a fixed thing like HTTP response codes there’s not much confusion, but for other data, bucket sizes and other decisions can impact the outcome a lot.

Static bins for variable data are especially risky – if the data significantly exceeds the expected range, you have a problem. On the other hand, if a single outlier stretches the range, the rest of the data gets stuffed into a few buckets and the variance cannot be seen. As usual, know what you are graphing to spot such issues.
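To make the bucket-type tradeoff concrete, here is a minimal Python sketch of static versus logarithmic bucketing. The function names and the sample latencies are made up for illustration – this is not how any particular monitoring tool implements histograms.

```python
from collections import Counter

def bucket_static(values, width):
    """Count values into fixed-width buckets, keyed by the bucket's lower bound."""
    return Counter((v // width) * width for v in values)

def bucket_log(values, base=2):
    """Count values into logarithmic buckets, keyed by the bucket's upper bound.

    Wide value ranges fit into a handful of buckets, so one outlier
    does not drown out the rest of the distribution.
    """
    buckets = Counter()
    for v in values:
        b = 1
        while b < v:
            b *= base
        buckets[b] += 1
    return buckets

latencies = [3, 5, 8, 12, 20, 45, 90, 800]  # ms; note the single outlier
print(bucket_static(latencies, 10))  # the outlier sits far from everything else
print(bucket_log(latencies))         # log buckets keep the distribution visible
```

With static 10 ms buckets, the 800 ms outlier forces any fixed-range chart to cover 0–800, squashing the bulk of the data; with logarithmic buckets the same data spans just a few bins.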

Graphite was mentioned as having a somewhat “classic problem” of graphing averages and losing the peaks. While it has a way to graph the max values, min/max would often be lost when storing the data with the default settings. It was suggested to configure the aggregation in various ways, but from the Zabbix perspective, it seemed as if Zabbix already does this properly by storing the hourly trend data with min/max values.
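The trend-style storage mentioned above can be sketched in a few lines of Python. This is a simplified illustration of the idea – keep min/avg/max per downsampled chunk – not how Zabbix or Graphite actually implement it.

```python
def aggregate(values, n):
    """Downsample into chunks of n samples, keeping (min, avg, max) per chunk.

    Storing only the average would erase short spikes; keeping the
    min and max alongside it preserves them through downsampling.
    """
    out = []
    for i in range(0, len(values), n):
        chunk = values[i:i + n]
        out.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return out

# a short spike survives in the max column even after averaging
print(aggregate([5, 5, 90, 5, 5, 5], 3))
```

The first chunk averages out to about 33, which on its own looks unremarkable – but the stored max of 90 shows the spike was there.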

I found myself agreeing intensely with another suggestion – to collect counter data instead of sampling in time, whenever possible. This topic is worth discussing in more detail, but the short version: if you can get counter data – like network traffic or database queries – use that instead of sampling the “current” traffic or queries per second.

A graph, showing how sampling can miss a lot of detail
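The counter approach can be illustrated with a small Python sketch. The numbers are made up, and a real implementation would also have to handle counter wraps and resets.

```python
def rate_from_counter(samples):
    """Compute per-second rates from (timestamp, counter) pairs.

    A monotonically increasing counter accounts for everything that
    happened between polls; sampling the 'current' rate instead can
    miss bursts entirely if they fall between two polls.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates

# counter polled every 60 s: a burst between polls is still captured
samples = [(0, 1000), (60, 1600), (120, 61600)]
print(rate_from_counter(samples))  # [10.0, 1000.0]
```

Had we sampled the instantaneous rate at those same three moments, the 1000-per-second burst in the second minute could have gone completely unnoticed.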

An interesting suggestion was not to always lump all values together – for example, tracking the error response timings/sizes separately was suggested, as the latency, size and other parameters would likely be wildly different from those of normal responses.
It was also advised not to put more than 3 data series on the same graph. Of course, one would often like to see data for a specific series across a cluster consisting of many machines – how to display that in a meaningful way? A quick idea to graph the average plus a configurable number of outliers was born and registered as a Zabbix feature request 🙂
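The “track errors separately” idea might look like this in Python – a toy sketch with made-up request tuples, just to show the grouping:

```python
from collections import defaultdict

def split_by_status(requests):
    """Group response times by status class (2xx, 5xx, ...) so that
    error latencies are not averaged into the normal-response series."""
    series = defaultdict(list)
    for status, latency_ms in requests:
        series[f"{status // 100}xx"].append(latency_ms)
    return dict(series)

reqs = [(200, 45), (200, 50), (500, 3), (200, 48), (502, 2)]
print(split_by_status(reqs))  # {'2xx': [45, 50, 48], '5xx': [3, 2]}
```

Here the errors fail fast at 2–3 ms; averaged together with the ~48 ms normal responses, they would drag the single combined latency series down and hide both signals.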

Technical people are used to looking at all kinds of graphs and other ways to visualise data. We don’t get too confused, as long as there’s some sensible labelling and the visualisation is not extremely weird. When presenting monitoring to “normal people”, the expectations could be different – specifically, it was mentioned that displaying hourly/daily/weekly/monthly etc graphs on the same page can be confusing for people who are not used to such a way of showing information.

Another take on the data visualisation was related to scaling. Take a used-diskspace graph in absolute units (bytes) and use auto-scaling for the y axis, like this:

Diskspace graph, showing a sharp drop at the end

Disk is full! The sky is falling! Oh wait. If we look at the y axis, there’s still more than 200GB free. On the other hand, when values are not expected to leave a small range, using an absolute y axis will hide all the variance. Again, know what you are graphing and adjust accordingly. As a sidenote, manipulating the y axis is also a very popular way to do dishonest comparisons. Use a relative axis and a small difference will look significant. Hopefully all IT people are familiar with the trap and can easily spot it.
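The effect of the y-axis choice can even be quantified with a toy Python function – entirely illustrative, with made-up disk numbers:

```python
def visual_drop(values, y_min=None, y_max=None):
    """Fraction of the plot height that the last change would occupy.

    With auto-scaled axes (y_min/y_max defaulting to the data's own
    min/max) a small change fills most of the plot; with an absolute
    axis anchored at 0..total it barely registers.
    """
    lo = min(values) if y_min is None else y_min
    hi = max(values) if y_max is None else y_max
    return abs(values[-1] - values[-2]) / (hi - lo)

used_gb = [742, 743, 744, 743, 718]   # ~25 GB freed, out of 1000 GB total
print(visual_drop(used_gb))           # auto-scaled: ~0.96 of the plot height
print(visual_drop(used_gb, 0, 1000))  # absolute axis: 0.025 of the height
```

The same 25 GB change fills nearly the whole plot with auto-scaling, but only 2.5% of it on an absolute 0–1000 GB axis – which is exactly the “sky is falling” effect above.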

Similarly, graphing the average masks out peaks and dips. Graphing the maximum makes it look like a constant disaster. Graphing the minimum makes it look like nothing is happening. Zabbix does pretty well here, showing minimum, average and maximum lines for a single series. There is a well-known problem when graphing averages (for example, when many items are placed on the same graph): while the graph shows the average line, the legend honestly reports the min and max values, which sometimes confuses users – such values are not reflected in the graph. While there is no simple solution, it was mentioned that a plain average can be misleading by completely hiding the real min/max values, so maybe the slight confusion Zabbix graphs create is not that bad after all.

Then David Hustace shared the latest news on another opensource monitoring tool, OpenNMS. OpenNMS is about as old as Zabbix, is Java-based and – same as Zabbix – completely open source.

This talk started with an overview of the OpenNMS community and advocated measuring the community – wonderful and useful advice! David mentioned commit, accepted pull request, download and contributor statistics, and the contributor count had increased from 52 to 61 during the last year. A sign of a healthy and growing community, although you will have to trust me on the numbers – the slides from this talk do not seem to be public 🙂

Very interesting data is available as well: OpenNMS has an opt-in capability to submit anonymous statistics so that the development team has a better idea of its userbase. Of course, as one would expect from a proper opensource project, the collected data is public. Collecting various usage data is something Zabbix is missing – it has been requested as ZBXNEXT-486.

Grafana dashboard, showing OpenNMS usage statistics

On the release model, OpenNMS has changed to a fairly industry-standard approach of having two main branches:

  • Horizon – faster access to new features, but might be less stable (think Fedora)
  • Meridian – more behind feature-wise, but likely more stable (think RHEL)

Release early, release often

This change was partially motivated by a lack of resources to review all the patches, so the patches sort of get field-tested by the community first. Interestingly, a similar approach was adopted by Zabbix a few years ago as well – while Zabbix does not have codenames, the more stable versions are labelled as being “Long Term Support”.

Various product improvements were also shared, including:

  • a JMX CLI configuration/monitoring tool – internal health data is important, and always handy when some scripting can make a configuration task easier
  • embedded JasperReports, providing on-demand and scheduled reports with an ability to email those – looked very neat and easy
  • a fancy dynamic, animated business service view with search and interactive navigation. Allows having a “focal point” and services displayed around it, clicking around rearranges it all. Not sure of the practical usefulness, but I’m sure it is a hit when demoing the solution 🙂
  • manually invoked scans from remote data collectors – these allow running tests from a browser to see how things work from a remote location (useful when troubleshooting, for example). The results are also saved on the main server
  • the introduction of the “minion” – a remote data collector similar to a Zabbix proxy, except that one can run several of them and they automatically take over the load if one fails (at least that’s how I understood it)

Another new thing is something called “OaaS”… ok, it is not a completely unheard-of acronym, but here it stands for “OpenNMS as a Service”. Actually using the product can do wonders for the developers. Eating your own dogfood is an incredibly helpful thing that is not done often enough, thus it would be interesting to read about the “TOP10 lessons we learned”. One can also deploy OpenNMS in any OpenStack environment, although the current approach is based on Docker – a slightly chaotic solution still 😉

An encouraging aspect is the nurturing care the OpenNMS company shows for its community, and its inviting attitude. There is a yearly developer meetup – DevJam – that has been going on for more than a decade, where OpenNMS covers the travel & accommodation expenses of the core developers, who then hack on OpenNMS and create magical things for a week. The last one reportedly had 30 participants. Having met several OpenNMS contributors, I am sure those are wonderful and productive events.

OpenNMS DevJam hall with tables, computers and people
