Monitoring items for uneven values, how odd is that?

As someone working in IT infrastructure, every now and then you are confronted with a problem that you are not certain how to solve. I have often found myself overthinking things and ending up with a complex solution that isn’t very elegant but gets the job done.

One such occasion was my solution to monitor Link Aggregation Group (or LAG) interfaces on switches.

[Image: dice. Caption: A game of Chō-han, anyone?]

LAG interfaces explained

LAG interfaces combine multiple physical interfaces into one logical interface, which brings some advantages. People more familiar with Cisco devices might know these better as port-channel interfaces.

One of the benefits of using these interfaces is higher bandwidth and theoretical throughput, as traffic is usually load-balanced over the physical interfaces in some way. Another benefit is added redundancy: if one link in a LAG of multiple physical interfaces fails, traffic will still flow through the remaining interfaces uninterrupted.

That redundancy also hides problems. A cable can be faulty or an operator might unplug a link without realising it, and things will still keep running… until of course the last interface in the LAG also fails.

Needless to say, we wanted to monitor the state of the physical connections that are combined into a LAG interface. There is no set standard for which interfaces will be joined in a LAG, but we always deploy a LAG with an even number of interfaces (2, 4, 6, etc.). We can therefore safely assume that there is a problem with one of the links if the LAG is running with an odd number of active interfaces.

Making do

When I first set up items and triggers to monitor our LAG interfaces, I created an external script that used some snmpget commands to retrieve the number of active physical interfaces in a given LAG. However, I couldn’t figure out how to let Zabbix trigger on an odd or even item value, so I decided to monitor the bandwidth (speed) of the LAG interface instead (via native SNMP items). I then triggered on the most common bandwidths in a failed LAG, namely 100 Mbps, 1 Gbps and 10 Gbps as well as 300 Mbps, 3 Gbps and 30 Gbps. As you can imagine, the trigger expression was an eyesore. However, it did its job and we at least had some way of being notified of problems.

Recently, when going over our templates, I rediscovered this little ‘gem’ of a trigger and started wondering if there really wasn’t a more elegant solution. It turns out, there is!

Band

No, not this one, these or even this one. We are talking about band(), the ‘bitwise and’ trigger function in Zabbix. It was introduced back in Zabbix 2.2, but I hadn’t realised its potential.

So, what does band() do?

According to the Zabbix manual, the result of the function is ‘Value of “bitwise AND” of an item value and mask’. The mask is supplied as a parameter to the trigger function:

band (#num,mask,<time_shift>)

Any integer item value is basically stored as a combination of bits in the database. For simplicity’s sake, we’ll show some examples using an 8-bit integer (instead of the 64-bit integer used in the Zabbix database). 8 bits (or 1 byte) allow us to represent any number from 0 to 255, so 256 combinations in total. Each bit in the byte represents a fixed numeric value. By setting the bits to 1 (on) or 0 (off) we can store any value within that range.

See this mandatory ASCII art representation on the value of each bit in the byte:

NNNNNNNN
||||||||
|||||||--   1
||||||---   2
|||||----   4
||||-----   8
|||------  16
||-------  32
|--------  64
--------- 128

So, if we wanted to represent a value of 1, it would look like this:

00000001 = 1

And likewise, if we would like to represent a value of 6, it would look like this:

00000110 = 6

Here are some other examples for your entertainment:

00000000 =   0
00001110 =  14
00110011 =  51
10001001 = 137
11100010 = 226
11111111 = 255
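
For readers who want to check these patterns themselves, here is a minimal sketch in Python (chosen purely for illustration; it has nothing to do with Zabbix itself). The names PLACE_VALUES, to_bits and from_bits are made up for this example. It builds each pattern from the place values shown in the ASCII art above and converts it back again.

```python
# Place value of each bit in a byte, leftmost (most significant) first.
PLACE_VALUES = [128, 64, 32, 16, 8, 4, 2, 1]


def to_bits(value):
    """Return the 8-bit pattern for an integer from 0 to 255."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte (0-255)")
    bits = ""
    for place in PLACE_VALUES:
        if value >= place:
            # This place value fits, so switch the bit on and subtract it.
            bits += "1"
            value -= place
        else:
            bits += "0"
    return bits


def from_bits(bits):
    """Add up the place values of every bit that is set to 1."""
    return sum(place for place, bit in zip(PLACE_VALUES, bits) if bit == "1")


# Reproduce the examples listed above.
for number in (0, 14, 51, 137, 226, 255):
    print(to_bits(number), "=", number)
```

Running it should print the same six lines as the list above, which makes it easy to try other values.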

If you have a close look at these numbers, you’ll see that for even values the last bit in the byte is always 0 and that for odd values it is always 1.

Knowing this, we can use a ‘bitwise and’ operation to find out whether a value is odd or even. Wikipedia states that:

A bitwise AND takes two equal-length binary representations and performs the logical AND operation on each pair of the corresponding bits, by multiplying them. Thus, if both bits in the compared position are 1, the bit in the resulting binary representation is 1 (1 × 1 = 1); otherwise, the result is 0 (1 × 0 = 0 and 0 × 0 = 0).

First, let’s have a look at how this works with an even value like 20. If we perform a bitwise and operation with the value 1, the following happens:

00010100 = 20
00000001 =  1
-------- &
00000000 =  0

Now, let’s try this with an uneven value like 25:

00011001 = 25
00000001 =  1
-------- &
00000001 =  1

As you can see, the operation returns a decimal 0 for an even value and a decimal 1 for an odd value, thus solving our problem in just one operation.
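
The two worked examples can also be reproduced in a few lines of Python. This is only a sketch for illustration: bitwise_and is a hypothetical helper that follows the Wikipedia description literally (multiplying each pair of bits), and it is compared against Python’s built-in & operator. An AND with the mask 1 leaves 0 for an even value and 1 for an odd value.

```python
def bitwise_and(left, right):
    """AND two equal-length bit strings by multiplying each pair of bits."""
    if len(left) != len(right):
        raise ValueError("both patterns must have the same length")
    return "".join(str(int(a) * int(b)) for a, b in zip(left, right))


mask = format(1, "08b")  # the mask 1 as an 8-bit pattern
for value in (20, 25):
    pattern = format(value, "08b")
    result = bitwise_and(pattern, mask)
    # The built-in & operator gives the same answer on the integers.
    assert int(result, 2) == value & 1
    print(pattern, "&", mask, "=", result)
```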

Creating Zabbix triggers

To apply this logic in Zabbix we can use the band() trigger function. As we just want to look at the last value, we can ignore the #num and time_shift parameters of the function, leaving us with the following trigger syntax:

{HOST:item.key.band(,1)}=1

This will trigger if the last known value is an odd value. If you’d like to trigger on even values instead, try the following trigger expression:

{HOST:item.key.band(,1)}=0

By combining these trigger expressions with the external check script I mentioned earlier, we can now keep track of any unbalanced LAG interfaces more easily without the inefficient, ugly and complex trigger expression we used to maintain.
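
To get a feel for how the odd-value trigger would behave over time, here is a rough Python model. It is an illustration only, not Zabbix code: band and odd_trigger_fires are stand-in functions written for this example, and the polled values are invented counts of active members in a hypothetical 4-link LAG.

```python
def band(value, mask):
    """Stand-in for what band() computes: the item value ANDed with the mask."""
    return value & mask


def odd_trigger_fires(last_value):
    """Mirror the expression {HOST:item.key.band(,1)}=1 for one item value."""
    return band(last_value, 1) == 1


# Invented sequence of polled values: active members in a 4-link LAG.
polled_counts = [4, 4, 3, 3, 4]
for count in polled_counts:
    state = "PROBLEM" if odd_trigger_fires(count) else "OK"
    print(count, "active links:", state)
```

In this model the trigger goes into PROBLEM as soon as the count drops to 3 and recovers when the fourth link comes back.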
