The Open Source Monitoring Conference (OSMC) is an event in Nuremberg, Germany. It started back in 2006 as a Nagios conference and was renamed to OSMC in 2009. As the original name implies, it started out very focused on Nagios, then slowly became more generic as various other monitoring-related topics were included. I had the pleasure of attending the conference this year, and here’s a small summary of a few of the very interesting talks at OSMC 2016.
- Visiting the Open Source Monitoring Conference 2016, Part 1
- Visiting the Open Source Monitoring Conference 2016, Part 2
- Visiting the Open Source Monitoring Conference 2016, Part 3
- Visiting the Open Source Monitoring Conference 2016, Part 4
Monitor your Infrastructure with Elastic Beats
Monica Sarbu opened the talks with information on “beats” – something that could be called a “monitoring agent”. These components can be deployed to collect and send data to Elasticsearch.
A single “beat” will usually have a data source plugin. Currently shipped beats include:
- packetbeat – network data
- filebeat – logfiles
- winlogbeat – Windows Eventlog data
- metricbeat – some OS statistics like CPU load
For me, the most interesting beat was filebeat. Zabbix covers my monitoring needs, but larger-scale logfile handling goes to Logstash/Elasticsearch. Logstash is a bit heavy to deploy on every server, so the promise of filebeat – to be a lightweight data shipper – is very appealing.
Filebeat builds on the beats platform, which started in 2015 when Elastic acquired Packetbeat – back then a framework just for network data ingestion. So what’s the difference between placing Logstash or filebeat on individual servers? Filebeat is much more lightweight, but it also offers less functionality – no fancy transforming or filtering. I couldn’t find a detailed comparison, but there’s a short page on Logstash vs Beats in the Elastic documentation.
I was curious whether one can chain filebeat -> Logstash -> Elasticsearch, and that does seem to be possible according to the beat reference manual:
Beats can send data directly to Elasticsearch or send it to Elasticsearch via Logstash, which you can use to parse and transform the data.
So it seems possible to push data from filebeat on each server to a few centralised Logstash instances for parsing, which then hand it off to Elasticsearch. That should reduce the load that log collection puts on the production servers. If Logstash is overloaded, it can tell all the filebeat instances to back off a bit, which helps when some logfiles suddenly get a huge number of lines. Data sending resumes once the congestion at the Logstash end is over.
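To make the back-off idea a bit more concrete, here is a minimal sketch in Python. It is not how filebeat or Logstash are implemented; `Receiver`, `Shipper`, `simulate` and all their parameters are hypothetical names made up for this illustration. The assumption is simply that the receiving end has a bounded queue and rejects a batch when it is full, and that the shipper doubles its waiting time after each rejection.

```python
from collections import deque


class Receiver:
    """Hypothetical stand-in for a central parser with a bounded queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.processed = []

    def accept(self, batch):
        """Queue the batch, or return False to signal congestion."""
        if len(self.queue) + len(batch) > self.capacity:
            return False
        self.queue.extend(batch)
        return True

    def process(self, count):
        """Parse up to `count` queued lines per tick."""
        for _ in range(min(count, len(self.queue))):
            self.processed.append(self.queue.popleft())


class Shipper:
    """Hypothetical stand-in for a lightweight agent tailing a logfile."""

    def __init__(self, lines, batch_size=2, max_delay=8):
        self.pending = deque(lines)
        self.batch_size = batch_size
        self.max_delay = max_delay
        self.delay = 0     # current back-off, in ticks
        self.wait = 0      # ticks left before the next attempt
        self.backoffs = 0  # how often the receiver pushed back

    def tick(self, receiver):
        if self.wait > 0:
            self.wait -= 1
            return
        if not self.pending:
            return
        size = min(self.batch_size, len(self.pending))
        batch = [self.pending[i] for i in range(size)]
        if receiver.accept(batch):
            # Only forget lines once the receiver has taken them.
            for _ in batch:
                self.pending.popleft()
            self.delay = 0
        else:
            # Double the delay on each rejection, up to max_delay.
            self.delay = min(self.max_delay, max(1, self.delay * 2))
            self.wait = self.delay
            self.backoffs += 1


def simulate(lines, capacity=4, drain_per_tick=1, max_ticks=1000):
    """Run shipper and receiver in lockstep until everything is parsed."""
    receiver = Receiver(capacity)
    shipper = Shipper(lines)
    for _ in range(max_ticks):
        shipper.tick(receiver)
        receiver.process(drain_per_tick)
        if not shipper.pending and not receiver.queue:
            break
    return shipper, receiver
```

Calling `simulate([f"line {i}" for i in range(10)])` should end with every line in `receiver.processed`, in the original order, and with `shipper.backoffs` showing that the shipper had to wait at least once. The point of the design is that the shipper keeps the unsent lines (in reality they are still in the logfile) instead of dropping them while the receiver is congested.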
Filebeat also supports load balancing, where it is told to send data to multiple Logstash instances. Well-performing instances will be able to process more data, while overloaded ones will get less. There was also something about filebeat assigning a unique ID to each log line/entry, with those lines then being deduplicated on the receiving end – especially useful in load-balanced environments. So far I’ve been unable to find detailed documentation on this, just some forum threads saying that, as of October, it was not implemented yet.
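Since I could not find documentation for the unique ID part, the following is only a sketch of the general idea in Python, not a description of what filebeat does. Everything here is an assumption made for illustration: the ID is derived from a file name and offset, the shipper picks the instance with the smallest backlog, and the receiving store simply ignores an ID it has already seen. `event_id`, `Store`, `Worker`, `ship` and `demo` are hypothetical names.

```python
import hashlib


def event_id(source, offset):
    """Stable ID for a log line, derived from file name and offset."""
    return hashlib.sha1(f"{source}:{offset}".encode("utf-8")).hexdigest()


class Store:
    """Receiving end: keeps at most one copy per event ID."""

    def __init__(self):
        self.events = {}
        self.duplicates = 0

    def index(self, eid, line):
        if eid in self.events:
            self.duplicates += 1  # a resend, dropped
        else:
            self.events[eid] = line


class Worker:
    """Hypothetical parser that handles one event every `period` ticks."""

    def __init__(self, name, period, store):
        self.name = name
        self.period = period
        self.store = store
        self.backlog = []
        self.handled = 0

    def tick(self, now):
        if self.backlog and now % self.period == 0:
            eid, line = self.backlog.pop(0)
            self.store.index(eid, line)
            self.handled += 1


def ship(events, workers):
    """Send one event per tick to the worker with the smallest backlog."""
    now = 0
    for event in events:
        now += 1
        target = min(workers, key=lambda w: len(w.backlog))
        target.backlog.append(event)
        for worker in workers:
            worker.tick(now)
    # Let the workers finish whatever is still queued.
    while any(worker.backlog for worker in workers):
        now += 1
        for worker in workers:
            worker.tick(now)


def demo():
    store = Store()
    fast = Worker("fast", period=2, store=store)
    slow = Worker("slow", period=4, store=store)
    events = [(event_id("app.log", i), f"line {i}") for i in range(12)]
    resent = events[:3]  # pretend the acknowledgement for these got lost
    ship(events + resent, [fast, slow])
    return store, fast, slow
```

In `demo()` the faster worker ends up handling more events than the slower one, and the three resent lines are counted in `store.duplicates` instead of being stored twice. Deriving the ID from where the line came from, rather than generating a random one per send, is what makes a resend recognisable on the receiving end.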
Metrics are for chumps – Understanding and overcoming the roadblocks to implementing instrumentation
Another talk I had a chance to attend was by James Fryman from Auth0 – a company that offers integrated authentication and SSO solutions.
His talk was more about communicating the need for monitoring, for instrumentation. It did not seem like a very interesting topic at first, but James did a great job at stressing communication strategies that are more likely to succeed – and he also had some very catchy slogans that I might… borrow for some t-shirts or something 😉
He stated that his mission is “to enable operations people to sleep”, and that’s fairly uncommon and brave enough to get my attention. More people seem to care about just getting something to be monitored than making sure the monitoring makes sense and alerts are meaningful.
Meaningful alerts? If you already have those, you might be very lucky. Most people would probably laugh and say “what’s that?”. James mentioned that it is important to get everybody on board to get quality instrumentation in place. If top leaders do not care, it won’t happen. If developers don’t care, it is highly unlikely to succeed either.
Availability is a team sport.
There was also a suggestion to have developers oncall every now and then. Wait, what? Developers – and oncall? In many companies that idea would be considered a joke, but many of those companies have serious communication problems as well… So what’s the point of having developers oncall? The purpose is to help them understand their product better. Sure, they developed it, and maybe they even tested it a bit. But they have no idea how it behaves in real-world settings or how easy it is to find out how well (or not) things are going. Unless the development team is already extremely open to feedback and is making the product a pleasure to work with, having them actually use it will result in improvements that will save a lot of time and money later on.
It was stressed how important it is to show the monitoring data to developers, so they can see how the changes they make to the product affect the graphs and values. If possible, give them direct access to the monitoring system. That is very different from developers having their own functionality and performance testing systems – that’s not real data, those are not real customers.
Now, being oncall might not be feasible in all teams. Developers might be more focused on a single product, while the operations people could be supporting dozens of products. That doesn’t mean developers should be excluded from the experience of using the product. Maybe a developer could be an “oncall buddy” who helps solve problems with a specific product and discusses the troubleshooting sequence afterwards. That way developers could see which areas of the product lend themselves well to problem solving, and which ones need improvement.
Just do it. Iterate.
There were also a few suggestions on getting things done, and doing them in smaller steps. This is in line with “perfect is the enemy of good” – trying to build the best tool possible right away is likely to fail. Building a “good-enough” tool gives you something useful now that you can improve in iterations later. With monitoring, it might mean “just monitor something” – even if you only collect data and do not alert on it, at least you will get some useful baseline data for later.
Another suggestion was data flow diagrams. These are really great for onboarding other people to a product or deployment. The ways various components connect and pass data around might seem obvious to experienced participants, and especially to developers, but such diagrams are a huge help when you need to get the knowledge to other people in the team. That, and they actually show that somebody understands the solution.
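As a small illustration of why collected-but-not-alerted data is still useful, here is a sketch in Python that turns a handful of samples into a baseline and a candidate alert threshold. This was not part of the talk; the function names and the "mean plus three standard deviations" rule are my own assumptions, chosen only because they are simple.

```python
import statistics


def baseline(samples):
    """Return (mean, standard deviation) of the collected values."""
    return statistics.mean(samples), statistics.pstdev(samples)


def suggest_threshold(samples, sigmas=3.0):
    """Candidate alert threshold: mean plus `sigmas` standard deviations.

    The rule itself is an assumption for this sketch, not a recommendation.
    """
    mean, spread = baseline(samples)
    return mean + sigmas * spread


def unusual(samples, value, sigmas=3.0):
    """True if `value` is above what the baseline suggests is normal."""
    return value > suggest_threshold(samples, sigmas)
```

With a week of, say, CPU load samples already in the monitoring system, a first threshold can be derived from real behaviour instead of being guessed, and then adjusted in later iterations.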
I like sleep, you like sleep.
Then James went back to the premise mentioned earlier – the importance of convincing everybody that monitoring is useful. Not the shiny graphs, but the functional benefit. For the operations people, it could be a discussion along the lines of “I like sleep, you like sleep”. For the leadership it might be as simple as “I like money, you like money” – the money saved by skipping instrumentation and monitoring will flow away much faster later, once the expenses of deploying the resulting product arrive. Somebody later suggested that “I like sleep, you like sleep, I like money, you like money” might sound a bit strange in the same sentence, so remember – those are intended for different audiences 🙂
As for getting the buy-in, the “devops hierarchy of needs” was presented. To get to the upper levels, the lower levels are needed. If you skip the lower levels, it’s a house of cards supporting the top stone, waiting to tumble down at the worst possible moment. Want continuous delivery? Get everything else in place first.
This talk resonated with me more than I expected – and that was also true for one of the closing statements:
Leave things better than you found it