There’s one big difference between Zabbix and Nagios which is an absolute game changer for me…
I have been working in monitoring for over 12 years now, and I have designed and built installations of all sizes for a lot of different companies. Most of these monitoring systems were based on Nagios, Icinga or self-developed software. Nagios was my absolute favorite for a long time, and I focused entirely on solutions based on it. But in the last year one of my customers insisted on building a solution with Zabbix, which I hardly knew at that point and had never used in a real-world scenario. So I had to get the documentation and learn the principles behind this piece of software. At first I was underwhelmed by the user interface and its limited repertoire of checks. Also, many parts of the documentation are not as detailed as they should be. But I had no choice and therefore I had to deal with Zabbix.
Most comparisons of Nagios and Zabbix describe the differences in setup and configuration in detail, and how to get the same results you are used to getting from Nagios. But the big difference is not in the UI or in how a check is configured; it’s the fundamental principle of how the decision to trigger an alert is made.
In Nagios a check plugin contains everything you need for monitoring a single aspect of a service. It gathers a piece of information and decides on the operational status of this service based on given thresholds. Nagios receives a numeric value for the service status and an optional set of performance data for statistics or for rendering as a graph. This enables the administrator to extend Nagios with a vast number of independent plugins, written in any programming language you like, to monitor every kind of application.
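To make the all-in-one idea concrete, here is a minimal sketch of such a plugin in Python. It follows the usual plugin convention of exit codes 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN) and a status line with optional performance data after a pipe character. The check itself (disk usage of a path) and the default thresholds are only assumptions for illustration.

```python
import shutil

# Exit codes a Nagios plugin conventionally returns.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn, crit):
    """Decide the service state inside the plugin, based on fixed thresholds."""
    if used_percent >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif used_percent >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    # Text before the pipe is the status line, text after it is performance data.
    message = (
        f"DISK {label} - {used_percent:.1f}% used"
        f" | used={used_percent:.1f}%;{warn};{crit}"
    )
    return state, message


def main(path="/", warn=80.0, crit=90.0):
    """Gather the value, evaluate it, print the result, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    state, message = evaluate(100.0 * usage.used / usage.total, warn, crit)
    print(message)
    return state

# As a real plugin the script would finish with: sys.exit(main())
```

Note that the thresholds and the decision live inside the script. Nagios itself only ever sees the exit code and the text.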
Zabbix, on the other hand, has separated data gathering, information processing and alerting into different stages within the Zabbix core. You connect to a service via a so-called item to get a piece of information and store it in the database. Zabbix brings a lot of built-in checks to conveniently connect to standard services or get operational parameters of a running server. Whenever you have to extend this standard set, you realise that this is an exceptional case. As soon as the data from an item arrives in the Zabbix database you can write a trigger to match it against a threshold. Whenever this condition is met the trigger fires and the associated action is executed. This can be the sending of an alert or the execution of an arbitrary script.
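The separation of stages can be pictured with a small toy model in Python. This is not how Zabbix is implemented, and it does not use any Zabbix API; the function names, the trigger name and the threshold are made up for illustration. It only shows that gathering, evaluating and acting are three independent steps that meet in the stored history.

```python
from collections import defaultdict

# Stage 1: items only collect values and store them (toy stand-in for the database).
history = defaultdict(list)  # item key -> list of (timestamp, value)


def store(item_key, timestamp, value):
    """Record one collected value for an item. No judgement happens here."""
    history[item_key].append((timestamp, value))


def last(item_key):
    """Most recently stored value of an item."""
    return history[item_key][-1][1]


# Stage 2: triggers are conditions evaluated against the stored data.
def evaluate_triggers(triggers):
    """Return the names of all triggers whose condition currently holds."""
    return [name for name, condition in triggers.items() if condition()]


# Stage 3: actions react to fired triggers.
def run_actions(fired, actions):
    """Run the action registered for each fired trigger and collect the results."""
    return [actions[name](name) for name in fired if name in actions]


# Illustrative wiring: the item key, threshold and action are assumptions.
store("system.cpu.load", 1000, 7.5)
triggers = {"High CPU load": lambda: last("system.cpu.load") > 5}
actions = {"High CPU load": lambda name: f"send e-mail: {name}"}
fired = evaluate_triggers(triggers)
results = run_actions(fired, actions)
```

Because the trigger reads from the history instead of from the check itself, it can be changed or extended later without touching the data collection.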
The advantage of Zabbix’s approach to deciding whether a service is OK or not is that the decision is independent of a single item and a single value. It can use the full history of past values and also every other item, instead of comparing the last gathered value with a fixed threshold. So it’s possible to compare the current value to the one from last week or even last year. You can combine independently determined values or predict how long a resource will last based on historical data. This is so much more powerful than the restricted context of an independent all-in-one script.
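In Zabbix these conditions are written in its own trigger expression syntax. The Python below is only a sketch of the logic behind two of them: comparing the current value with the one from a week ago, and combining the values of several items into one decision. The function names, the growth factor and the load limit are assumptions for illustration.

```python
WEEK = 7 * 24 * 3600  # seconds


def value_at(samples, timestamp):
    """Latest value recorded at or before timestamp, or None.

    samples is a list of (timestamp, value) pairs, oldest first.
    """
    older = [value for t, value in samples if t <= timestamp]
    return older[-1] if older else None


def grew_since_last_week(samples, now, factor=1.5):
    """True if the current value exceeds last week's value by the given factor."""
    current = value_at(samples, now)
    previous = value_at(samples, now - WEEK)
    if current is None or previous is None:
        return False  # not enough history to compare
    return current > previous * factor


def overloaded_fraction(loads, limit):
    """Combine several items: share of nodes whose load is above limit."""
    if not loads:
        return 0.0
    return sum(1 for load in loads.values() if load > limit) / len(loads)
```

For example, `overloaded_fraction({"node1": 1.0, "node2": 9.0}, limit=5.0)` gives the fraction of a two-node cluster that is above the limit, which a trigger could then map to a severity.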
For example, I use this ability to calculate the load across all working nodes of a cluster and to escalate through different severities, and therefore different notification paths, depending on which fraction of the cluster is affected. In my monitoring every hard disk partition has the same two trigger levels: send an e-mail 18 hours and a text message 4 hours before the partition is predicted to be full. 5 % left on /boot? No problem, because this partition didn’t grow over the last year. 50 % left on /data? Seems to be critical, because one hour ago it was 75 %.
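The arithmetic behind the two disk examples can be sketched in a few lines of Python. Zabbix ships predictive trigger functions for this kind of forecast; the code below does not reproduce them, it is just a plain least-squares line through the recent free-space samples, with the 18-hour and 4-hour levels from above. The function names are made up.

```python
import math


def hours_until_full(samples):
    """Estimate the hours until free space reaches 0 %.

    samples is a list of (hour, free_percent) pairs, oldest first.
    Returns math.inf if free space is not shrinking or data is insufficient.
    """
    n = len(samples)
    if n < 2:
        return math.inf
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return math.inf
    # Slope of the least-squares line, in percent of free space per hour.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return math.inf  # partition is not filling up
    intercept = mean_v - slope * mean_t
    hour_of_zero = -intercept / slope
    return max(0.0, hour_of_zero - samples[-1][0])


def notification(hours_left):
    """Map the forecast to the two trigger levels described above."""
    if hours_left <= 4:
        return "text message"
    if hours_left <= 18:
        return "e-mail"
    return None


# /data: 75 % free one hour ago, 50 % free now
data_forecast = hours_until_full([(-1.0, 75.0), (0.0, 50.0)])
# /boot: 5 % free a year ago (8760 hours), still 5 % free now
boot_forecast = hours_until_full([(-8760.0, 5.0), (0.0, 5.0)])
```

With these numbers /data is forecast to be full in two hours and lands on the text message level, while /boot never fills up and stays quiet despite having only 5 % left.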
Another asset is auto discovery, which automatically creates new configuration items based on newly detected resources. This enables me to add my hard disk checks to every new partition without a single manual step, or to create items and triggers for any new service. So I’ve got a self-extending monitoring system which needs a lot less attention than any other monitoring system I built before.
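The principle behind discovery is that a rule returns a list of found entities as JSON, each described by macros such as {#FSNAME}, and Zabbix fills these macros into item and trigger prototypes. The Python below is only a sketch of that idea with made-up function names. The exact JSON envelope a given Zabbix version expects may differ, so treat the payload shape as an assumption and check the documentation for your version.

```python
import json


def discovery_payload(mount_points):
    """Build a discovery result: one object per entity, keyed by macro name."""
    return json.dumps([{"{#FSNAME}": mount} for mount in mount_points])


def expand(prototype, entity):
    """Fill the macros of one discovered entity into a prototype string."""
    for macro, value in entity.items():
        prototype = prototype.replace(macro, value)
    return prototype


# Hypothetical run: two partitions were found on the host.
payload = discovery_payload(["/", "/data"])
item_keys = [
    expand("vfs.fs.size[{#FSNAME},pfree]", entity)
    for entity in json.loads(payload)
]
```

Every partition that shows up in the next discovery run gets its own item from the same prototype, which is why no manual step is needed for a new disk.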
I still love Nagios for its reliability and flexibility, but in a lot of scenarios I now prefer the abilities of Zabbix.