14.03.2022 | Rocco Pezzani | NetEye, Unified Monitoring

Hosts, Zones and Broken Icinga 2 Configurations

During my experience as a Würth Phoenix consultant, I’ve seen a pretty long list of broken Icinga 2 configurations. Several times, customers have begun a scheduled meeting with something like “Hey mister consultant, ever since the last deploy some objects have stopped being monitored, but I don’t see any errors!”. After some troubleshooting, everything always comes down to one or two Director Object variables that have been set with the wrong value.

Solving these issues is usually pretty easy, but this is not where I am heading right now. We all know Icinga 2 is not the easiest software to manage, and its difficulty increases sharply with two main factors:

  1. The number of people working with Director
  2. The size of the monitoring infrastructure itself

Almost all errors arising from managing your monitoring infrastructure can be avoided by sticking to a series of best practices. Some of them are “universal” (like keeping all your Host Objects’ names lower case), others come as a consequence of the current Icinga 2 monitoring configuration. And then there’s the difficult part: when people don’t have a good enough comprehension of Icinga 2’s innards and don’t understand the whole monitoring architecture, it’s really easy to make mistakes. Some of these mistakes can be spotted easily with Icinga 2 itself (e.g., a failed deploy is easy to notice), but others are much more treacherous. I’m pretty positive you’ve seen some of these cases:

  • You changed some custom variable on a Director Object, but the corresponding Monitoring Object still keeps the old values
  • You changed the Cluster Zone of a Director Host Object, but its corresponding Monitoring Object is still assigned to the old one
  • You added a new Service to a Director Host Object, but it does not appear on the corresponding Monitoring Host Object
  • Services on a Monitoring Host Object become late, or stay in the Unknown state for no apparent reason
  • A new Monitoring Service Object remains in the Pending state even several hours after the deploy that created it

Obviously, the last Deploy completed with a green check mark, which reinforces the idea that Icinga is making fun of us; but this really isn’t the case. Like it or not, the harsh reality is: “someone made a mistake while changing your monitoring configuration”. Acknowledge this, and you’ll naturally move on to the obvious next step: troubleshooting.

An Example Scenario: Broken Distributed Monitoring

Several problems can lead to the symptoms described above, but the way out of them is always the same. Since it’s most helpful to talk about an actual case, I’d like to first illustrate the most common scenario: a broken monitoring configuration in a distributed monitoring environment.

Imagine a company big enough to require multiple people taking care of the monitoring configuration, with a NetEye infrastructure deployed like this:

  • a NetEye Master instance (master.test)
  • a first child Cluster Zone (zonea), managed by a NetEye Satellite instance (satellitea.test)
  • a second child Cluster Zone (zoneb), managed by a NetEye Satellite instance (satelliteb.test)
  • Zones zonea and zoneb are actually siblings

At the beginning, the main NetEye administrator prepares a Host Template structure that can accommodate generic monitoring of a simple Linux-based server across all Cluster Zones: a “root” Host Template named linux-server, containing all the monitoring definitions, and two children, linux-server-zonea and linux-server-zoneb, each explicitly setting the intended Cluster Zone. Assume everything works as expected.
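In plain Icinga 2 DSL terms, the structure Director renders would look roughly like this (a simplified sketch; check_command stands in for the real monitoring definitions):

// Root Template with all the generic monitoring definitions; for this setup
// to work, it must be visible to every endpoint that imports it
template Host "linux-server" {
    check_command = "hostalive"   // placeholder for the real definitions
}

// Zone-specific children: their only job is to set the Cluster Zone, so
// Director renders each of them into the matching zone's configuration
template Host "linux-server-zonea" {
    import "linux-server"
}

template Host "linux-server-zoneb" {
    import "linux-server"
}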

After some time, a new kind of Linux-based server is added in zoneb: a web server. This kind of server will be deployed only to zoneb, and its creation is tasked to the NetEye administrator in charge of zoneb. He or she then creates a new Host Template as a child of linux-server-zoneb and names it linux-web-server. Now all the Linux-based Web Servers on zoneb are happily monitored.

Some time passes, then a new issue arises: a Web Server similar to the ones on zoneb must be deployed on zonea, and the NetEye administrator in charge of zonea is tasked with monitoring this new server. While looking at the existing NetEye configuration, he finds the linux-web-server Host Template and likes it right away: this Template is perfect except for the Cluster Zone. Therefore he does the obvious thing: he creates a new Host Template as a child of linux-web-server, changes its Cluster Zone to zonea, and uses it to monitor the new server. To crown the work, the new Host Template is named linux-web-server-zonea. It seems a really fine piece of work.

Now that he’s ready, the zonea NetEye administrator creates a new Host Object using his new Host Template, performs a deploy and… the new Host Object does not appear in the Monitoring View: there’s just no trace of it. But the Deploy completed without errors, Icinga 2 is still up and running everywhere, and all the other objects are still monitored without any problems. Truly a masterpiece.

The Logic behind Deploy and Configuration Distribution

Before doing any troubleshooting, it’s better to get an overview of the Icinga 2 monitoring configuration and the way it’s handed over to Icinga 2 by Director. Remember: this is only a simplification of how Icinga 2 and Director work, and its only aim is to give you enough info to understand the issue at hand. Please refer to the Icinga 2 documentation for further details.

The monitoring configuration is composed of several objects, and each of them must be assigned to one specific Cluster Zone. This means that an object is owned by exactly one Cluster Zone, neither more nor less. This ensures the whole configuration is divided into disjoint sets. Cluster Zones are roughly divided into two groups: Global Zones and Non-Global Zones. While the contents of a Non-Global Zone are available to the Icinga 2 Endpoints managing that Cluster Zone, the contents of a Global Zone are sent to all available Endpoints. Two Global Zones are defined on each Icinga 2 setup by default: director-global and global-templates. A default Non-Global Zone called master is configured by NetEye Setup routines. Any Host/Service Object that is not explicitly assigned to a Zone is implicitly assigned to the master Zone. Any other Object that is not explicitly assigned to a Zone is implicitly assigned to the global-templates Zone.
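For reference, the zone layout of our example scenario corresponds to a minimal zones.conf like this sketch:

object Endpoint "master.test" { }
object Endpoint "satellitea.test" { }
object Endpoint "satelliteb.test" { }

object Zone "master" {
    endpoints = [ "master.test" ]
}

object Zone "zonea" {
    parent = "master"                 // zonea and zoneb are siblings under master
    endpoints = [ "satellitea.test" ]
}

object Zone "zoneb" {
    parent = "master"
    endpoints = [ "satelliteb.test" ]
}

object Zone "director-global" {
    global = true                     // a Global Zone: its contents reach every endpoint
}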

When the monitoring configuration is deployed:

  1. Director generates a temporary monitoring configuration and asks Icinga 2 to check it
  2. Icinga 2 checks the temporary monitoring configuration as a whole, ensuring syntax and overall relationships between objects are proper
  3. Director sends the temporary monitoring configuration to the Icinga 2 Master Instance through a dedicated API, creating a staging configuration
  4. The Icinga 2 Master Instance reloads, loading the new monitoring configuration
  5. The Icinga 2 Master Instance reconnects to all its endpoints
  6. To each endpoint, the Icinga 2 Master Instance sends a copy of the Non-Global Zone it manages and a copy of all the Global Zones
  7. Each endpoint stores its part of the monitoring configuration as a staged configuration, then reloads, loading the new monitoring configuration; if that configuration has issues, it is discarded and the original configuration remains operative

Please remember this procedure applies to anything, Icinga 2 Agents included.
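As a side note, the dedicated API mentioned in step 3 is Icinga 2’s config package API. To peek at what has been staged on the Master, a call like this works (the credentials are placeholders for a valid API user):

# List the config packages and their stages known to the Master
curl -k -s -u apiuser:apipassword 'https://master.test:5665/v1/config/packages'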

From Director’s perspective, the deploy operation is successful if step 3 ends well. As you can guess, the Icinga 2 Master Instance has knowledge of everything, while each endpoint only knows what the Master Instance lets it know.

Troubleshooting Broken Configurations

From what I said above, it’s clear that a successful Deploy doesn’t guarantee that the new monitoring configuration is actually loaded onto every component of our monitoring infrastructure. We have to explicitly look for the problem.

First of all, we need to determine all the endpoints involved in the delivery of the configuration:

  1. Locate the missing/faulty Object inside Director
  2. If the Object is managed by an Agent, the first endpoint is the Agent itself
  3. Next, identify the zone that should be managing the Object: its NetEye Satellites are the next endpoints
  4. Repeat step 3 until you get to the master Zone (the Icinga 2 Master Endpoint is not included)

Now we should inspect the Icinga 2 log files on each endpoint, in the order of discovery above. If the log files have been purged, it’s possible to restart an Icinga 2 endpoint to reproduce the messages; but since the broken “new configuration” is still staged, Icinga 2 is most likely going to fail the restart, and thus stop working. So be prepared for that possibility.

While checking the Icinga 2 log files, we have to look for errors raised during the processing of the configuration. If we find a line like the one below, we’ve surely found the Icinga 2 instance that has issues:

[2022-03-11 10:35:18 +0100] critical/ApiListener: Config validation failed for staged cluster config sync in '/neteye/local/icinga2/data/lib/icinga2/api/zones-stage/'. Aborting. Logs: '/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//startup.log'

This line points to a specific log file that appears only when there are issues while reloading the Icinga 2 Instance Configuration. This file will report the source issue that we need to fix.
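On NetEye, a quick way to look for that line is something like this (the icinga2.log path is an assumption based on the usual NetEye filesystem layout, so adjust it to your setup):

# Search for failed config syncs, then read the referenced startup.log
grep 'critical/ApiListener' /neteye/local/icinga2/data/log/icinga2/icinga2.log
less /neteye/local/icinga2/data/lib/icinga2/api/zones-stage/startup.log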

In the current scenario, the startup.log file resides on satellitea, and reports this error:

[2022-03-11 10:35:18 +0100] information/cli: Icinga application loader (version: r2.11.9-1)
[2022-03-11 10:35:18 +0100] information/cli: Loading configuration file(s).
[2022-03-11 10:35:18 +0100] information/ConfigItem: Committing config item(s).
[2022-03-11 10:35:18 +0100] information/ApiListener: My API identity: satellitea.test
[2022-03-11 10:35:18 +0100] critical/config: Error: Import references unknown template: 'linux-web-server'
Location: in /neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf: 7:5-7:29
/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf(5):
/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf(6): template Host "linux-web-server-zonea" {
/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf(7):     import "linux-web-server"
                                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf(8):
/neteye/local/icinga2/data/lib/icinga2/api/zones-stage//zonea/director/host_templates.conf(9): }

[2022-03-11 10:35:18 +0100] critical/config: 1 error
[2022-03-11 10:35:18 +0100] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

In this case, the faulty object is “unexpectedly” Host Template linux-web-server-zonea. But why? Because it’s trying to import an unknown object named linux-web-server.

Note: this troubleshooting method must be used every time an Object is not behaving the way we intended and we can’t find the cause in the monitoring configuration contents. The findings might not be related just to the monitoring configuration, but also to endpoint settings and so on, so always read the contents of icinga2.log and startup.log carefully.

Understanding and Fixing the Error

This error is quite simple to understand. Some lines above, we said that the Icinga 2 monitoring configuration is divided into disjoint sets, and this is the key: the configuration sent to satellitea contains the contents of the Global Zones plus all the Objects explicitly assigned to zonea. Because it is derived from linux-server-zoneb, Host Template linux-web-server is assigned to zoneb. Therefore it’s not available for satellitea to use, and that’s the source of the error.
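Sketched in DSL terms, this is what each Satellite actually receives (simplified):

// Part of zoneb: sent to satelliteb only
template Host "linux-web-server" {
    import "linux-server-zoneb"
}

// Part of zonea: sent to satellitea only
template Host "linux-web-server-zonea" {
    import "linux-web-server"   // this name never reaches satellitea: validation fails
}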

This issue can be fixed in a number of ways, but the easiest one is to make the entire Host Template hierarchy aware of the two zones:

  1. Assign all Host Objects using linux-web-server-zonea to a new Host Template (a temporary measure)
  2. Delete Host Template linux-web-server-zonea
  3. Clone Host Template linux-web-server to linux-web-server-zonea and change its parent Host Template to linux-server-zonea
  4. Assign Host Template linux-web-server-zonea to the Host Objects edited at step 1
  5. Rename Host Template linux-web-server to linux-web-server-zoneb

And now everything is solved.

Where This Error Comes from

First of all: this error is not caused by a bug in Icinga 2. Maybe a stricter check during Deploy could help identify this kind of situation, but that would be a feature improvement, not a bug fix. This is a human error: it can happen only because of humans, and can be fixed only by humans. You might call it a design error.

Why did it happen? Because the NetEye administrator of zonea has little or no knowledge of what happens in zoneb. Several of you might say that Icinga Director and NetEye provide tools to prevent this kind of situation, like the Tree View. The Tree View does give you enough information to understand the hierarchy, but it’s not easy to use when the number of objects increases; in that case it’s better to perform a text search in the Table View. Then, after finding a Template, it’s always a good idea to understand the whole Tree branch containing the Template we’re about to use; but keep in mind that a given operator’s knowledge might not be deep enough, or we might simply make an error out of distraction. As you can see, with human errors everything can easily fall on the shoulders of the one making the mistake. But we are talking about human errors! Because humans are humans, these errors cannot be eliminated, so it’s part of our responsibility as Architects to prepare designs that are more usable with the current toolset, simplifying the life of the operator, and of course our lives as well.

Preventing This Error

So, how can we prevent this kind of error? Preventing human errors completely is pretty much impossible. Usually, the most effective way is to rework the design so that this behavior can no longer occur.

Creating a monitoring configuration that can’t be broken is impossible, but if an error does appear, it’s better for it to show up directly at Deploy time inside Director, rather than having to hunt for it in the logs, server by server. Moreover, the configuration broke because functionalities and settings related to the concept of monitoring are bound together with settings related to the concept of distribution; decoupling them can be an effective solution. To implement this kind of scenario, we must make use of multiple inheritance: in short, we have to relegate all settings related to Cluster Zones to a dedicated set of Host Templates, which creates a second Host Template Tree.

The Concept of Location

We can now define the concept of location as the information describing the position of a Host Object inside the monitoring infrastructure. As a consequence, we have to create one Host Template that describes Zone zonea as a possible location (location-zonea), and another for zoneb (location-zoneb). On these two Templates, only the Cluster Zone attribute is set; anything else is pretty much forbidden. And, for the sake of order, both of them inherit from an empty Host Template named generic-location. The Host Templates that existed only because of the Cluster Zone (linux-server-zonea and linux-server-zoneb) can then simply disappear. That makes for a neat layout.
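A minimal sketch of these location Templates in DSL terms (keep in mind that in Director the Cluster Zone is a property of the Template, which decides the zone the Template is rendered into, rather than an attribute you type by hand):

// An empty root, only for the sake of order
template Host "generic-location" {
}

// One Template per possible location; ONLY the Cluster Zone property is set
template Host "location-zonea" {
    import "generic-location"
}

template Host "location-zoneb" {
    import "generic-location"
}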

Creating Host Objects with Location

Now, creating Host Objects becomes easy: to create a Host Object for a Linux Web Server and assign it to Zone zonea, just assign two Templates to it: linux-web-server and location-zonea. Want to move it to Zone zoneb? Simply replace location-zonea with location-zoneb, and everything is ready. Want to move it to Zone master? Just remove the location Template entirely. Both Deploy and reload will always go well on all endpoints, making this a foolproof design.
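A sketch of such a Host Object (the host name and address are made up for the example):

object Host "webserver01.test" {
    import "linux-web-server"   // WHAT to monitor
    import "location-zonea"     // WHERE it runs: swap it with location-zoneb to move the host
    address = "192.0.2.10"
}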

Considerations about Location-based Design

We at Würth Phoenix have given this location-based design more than a little thought. It was born from the need to reduce the number of Host Templates in multi-zone monitoring infrastructures, with the added bonus of having only one Template responsible for managing settings and services for a specific kind of monitoring (in the past, as in the current scenario, you would have as many linux-web-server Template clones as you had Zones). And it effectively avoids breaking the monitoring configuration by turning a wrong behavior into a perfectly correct practice.

This design is so effective that it has been used as the foundation for the Template structure used in our NetEye Extension Pack project. So even if some of you think it’s an unnecessary burden, it’s not: your Host Template tree will undoubtedly increase in size, and sooner or later you will face management difficulties or configuration errors. It’s not really a matter of if, but when.

This design can also be applied to other concepts. The NetEye Extension Pack project defines three types of concepts, named Type Templates or Template Types: Monitoring type, Location type and Custom type. This segmentation might be a bit extreme, but if some settings force you to clone one or more Templates several times, they might be a good candidate for a new concept definition. For example, suppose you usually have a check interval of 5 minutes, but sometimes you require 1 minute, and other times 15: you’d be forced to duplicate the same Template and change only the Check Interval and Retry Interval. This leads to the concept of a Check Frequency type, enabling the same design we discussed, as sketched below.
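A hedged sketch of what such Check Frequency Templates could look like (the names and retry intervals are just examples):

template Host "check-frequency-1m" {
    check_interval = 1m
    retry_interval = 30s
}

template Host "check-frequency-5m" {
    check_interval = 5m
    retry_interval = 1m
}

template Host "check-frequency-15m" {
    check_interval = 15m
    retry_interval = 3m
}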

A Bonus for Hardy Readers

Because you’ve endured up to this point, I’ll give you a little bonus to further ease the life of your operators: you really deserve it. You can make effective use of Director’s Template Choices feature: Templates with the same purpose can be grouped together, with some conditions, and Director will then show a combo box on each Host Object’s form, allowing the right Template to be selected easily and removing the need to notify all your operators about every single addition or removal. You can also force operators to select at least one element from the combo box, to ensure everything has the expected layout.

Author

Rocco Pezzani
