04. 12. 2024 Andrea Mariani Business Service Monitoring, NetEye, Unified Monitoring

Correlate Services without a Business Process

In today’s episode of “Fantastic Checks and Where to Find Them”, I’ll share how I managed to correlate two or more services on a single host, or even across different hosts.

This story begins with a recent customer request. Initially, I considered using the Business Process module that’s already integrated in NetEye. However, after analyzing the requirements in detail, I realized that this solution would be too complex to implement and not easily scalable for future use.

The Customer Request

The customer needed to correlate three specific services, which in their case were passive traps related to the status of three protocols. The overall status of the correlated check had to work as follows:

OK if all three protocols were up
WARNING if one or two protocols were down
CRITICAL if all three protocols were down

The goal was to determine the communication status of a link between two sites.
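
The rule above can be sketched as a small Icinga DSL function. This is only an illustration, assuming the number of protocols that are down has already been counted; the function name link_state and its parameters are hypothetical and are not part of the commands shown later in this post.

```icinga2
// Hypothetical helper: maps the number of protocols that are down to a plugin exit state.
// 0 = OK, 1 = WARNING, 2 = CRITICAL
function link_state(down_count, total_count) {
  if (down_count == 0) {
    return 0   // all protocols are up
  }
  if (down_count == total_count) {
    return 2   // every protocol is down
  }
  return 1     // some, but not all, protocols are down
}
```

With three protocols, link_state(0, 3) returns 0, link_state(2, 3) returns 1 and link_state(3, 3) returns 2.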

First Approach and Initial Challenges

I began by brainstorming how to implement this control. If each host had only these three services, I could have used a command developed by my colleague Rocco Pezzani (https://www.neteye-blog.com/2024/11/the-story-of-a-strange-business-process/).

Unfortunately, the situation was more complex because each host managed multiple links, with three services to correlate for each link, spread across various hosts.

At that point, I briefly considered changing jobs, but after taking a deep breath, I made a list of the information needed to tackle the problem.

Analyzing the Available Information

Initially, the only information I had was the service name. However, this wasn’t sufficient to build a robust and reliable structure, especially since service names can change.

I also had to account for additional complexities:

Each link between sites uses one of two available Carriers
Each host is connected to two devices at the remote site, potentially with dual links depending on the Carrier used

In a typical scenario, with two devices in site A and two in site B, there could be up to eight links (two local devices × two remote devices × two Carriers), divided between the two Carriers.

To monitor the overall service status provided by a single Carrier, I needed to correlate all links managed by that Carrier to the same destination on each host.

The Solution

After several experiments, including trying to use Memcached with Python scripts (a story for another time), I decided to develop two new check commands based on Icinga DSL:

The first command correlates the services related to the individual protocols to determine the status of a single link
The second command correlates multiple links to determine the overall status of a route for each Carrier

Implementation

The first step was to create a Service Template that included variables populated with all the key information necessary to distinguish one service from another, even if they belonged to the same type.

Below is the list of variables I created and the structure of the Service Templates I implemented:

The first Service Template was designed to be associated with the existing services, adding a specific tag to identify both the protocol type and the route name on which the service operates.

After modifying the Tornado rules and enriching the existing services, each updated service will include the populated variables as follows:
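
As a rough sketch, an enriched service might look like this in Icinga DSL. The two variable names are the ones read by the correlation command shown below; the template name, service name, host name and values are placeholders, not the customer's actual configuration, and the use of the ITL "passive" check command is an assumption.

```icinga2
// Hypothetical template: tags a service with the protocol type it represents
template Service "example-correlation-tag-protocol-a" {
  vars.nx_correlation_service_tag = "protocol-a"   // placeholder protocol tag
}

// Hypothetical passive service enriched with the tag and the route it belongs to
object Service "Link status protocol A - route-1" {
  import "example-correlation-tag-protocol-a"
  host_name = "router-a1"                          // placeholder host
  check_command = "passive"                        // assumed: the real services are fed by traps
  enable_active_checks = false
  vars.nx_correlation_service_param = "route-1"    // placeholder route name
}
```

Because the correlation relies only on these two variables, the service itself can be renamed without breaking the check.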

Creating the Commands

Here’s the definition of the first command used to perform the correlation between the services associated with the protocols:

object CheckCommand "nx-c-service-host-correlation-status" {
    import "plugin-check-command"
    command = [ "/neteye/shared/monitoring/plugins/check_dummy" ]
    timeout = 1m
    arguments += {
        dummy_state = {
            order = 0
            required = true
            skip_key = true
            value = {{
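            // Resolve the host name and the correlation parameters set on the service
            var host_name = macro("$host.name$")
            var filter_tags = macro("$nx_correlation_filter_service_tag$")
            var route_destination = macro("$nx_correlation_service_param$")
            var services_list = get_services(host_name)
            // One [unacknowledged, acknowledged] counter pair per state: OK, WARNING, CRITICAL, UNKNOWN
            var states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
            for (service in services_list) {
                for (tag in filter_tags) {
                    // A service matches when it belongs to this route and carries one of the requested tags
                    if (service.vars.nx_correlation_service_param == route_destination && service.vars.nx_correlation_service_tag == tag) {
                        var state = service.state
                        var current_state = states_list[state]
                        // acknowledgement is 0 (none), 1 (normal) or 2 (sticky); clamp it to 0/1
                        // so it always indexes the two-element counter pair
                        var acknowledged = service.acknowledgement
                        if (acknowledged > 1) {
                            acknowledged = 1
                        }
                        current_state[acknowledged] += 1
                        break
                    }
                }
            }
            // Only OK and CRITICAL services are counted; WARNING and UNKNOWN ones are ignored
            var ok_count = states_list[0][0] + states_list[0][1]
            var critical_count = states_list[2][0] + states_list[2][1]
            var total_services = ok_count + critical_count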
            var state = 0
            if (total_services == critical_count) {
                state = 2
            } else if (ok_count == total_services) {
                state = 0
            } else {
                state = 1
            }
            return state
            }}
        }
        dummy_text = {
            order = 1
            required = true
            skip_key = true
            value = {{
            var host_name = macro("$host.name$")
            var filter_tags = macro("$nx_correlation_filter_service_tag$")
            var route_destination = macro("$nx_correlation_service_param$")
            
            var services_list = get_services(host_name)
            var states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]  // OK, WARNING, CRITICAL, UNKNOWN
            
            var state_names = [ "OK", "WARNING", "CRITICAL", "UNKNOWN" ]
            
            for (service in services_list) {
                var match_found = false
            
                // Check whether the service matches the criteria
                if (service.vars.nx_correlation_service_param == route_destination) {
                    for (tag in filter_tags) {
                        if (service.vars.nx_correlation_service_tag == tag.to_string()) {
                            match_found = true
                            break
                        }
                    }
                }
                
                if (match_found) {
                    var state = service.state
                    var current_state = states_list[state]
                    
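                    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky); clamp it to 0/1
                    // so it always indexes the two-element counter pair
                    var acknowledged = service.acknowledgement
                    if (acknowledged > 1) {
                        acknowledged = 1
                    }
                    current_state[acknowledged] += 1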
                }
            }
            
            var ok_count = states_list[0][0] + states_list[0][1]
            var critical_count = states_list[2][0] + states_list[2][1]
            var total_services = ok_count + critical_count
            
            var overall_status = ""
            if (total_services == critical_count) {
                overall_status = total_services + "/" + total_services + " KO => Overall Status CRITICAL"
            } else if (ok_count == total_services) {
                overall_status = total_services + "/" + total_services + " OK => Overall Status OK"
            } else {
                overall_status = ok_count + "/" + total_services + " OK => Overall Status WARNING"
            }
            
            var message = overall_status + "\nService Details:\n"
            for (service in services_list) {
                var match_found = false
            
                if (service.vars.nx_correlation_service_param == route_destination) {
                    for (tag in filter_tags) {
                        if (service.vars.nx_correlation_service_tag == tag) {
                            match_found = true
                            break
                        }
                    }
                }
            
                if (match_found) {
                    var service_state_text = state_names[service.state]
                    message += service.name + ": " + service_state_text + "\n"
                }
            }
            
            return message
            }}
        }
    }
    vars.dummy_state = 0
    vars.dummy_text = "Check was successful."
}

And here is the Service Template associated with it:

In this Service Template, you can see the pre-configured service tags that should be searched for among the services on the host in order to identify the three types of protocols to correlate. You’ll also notice an additional Master TAG, which is pre-populated. This value is auto-created and will be used to correlate the services generated by this template. The only missing value is the Service Param, which will contain the identifier, in our case the route.
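
A minimal sketch of such a template, written directly in Icinga DSL, might look like the following. The check command and variable names are the ones used in the commands in this post; the template name, service name, host name and all values are hypothetical placeholders.

```icinga2
// Hypothetical Service Template for the per-link correlation
template Service "example-link-correlation" {
  check_command = "nx-c-service-host-correlation-status"
  // Tags to search for among the services on the host (placeholder values)
  vars.nx_correlation_filter_service_tag = [ "protocol-a", "protocol-b", "protocol-c" ]
  // Pre-populated tag marking the services generated by this template (placeholder value)
  vars.nx_correlation_master_tag = "link-correlation"
}

// Hypothetical service: only the route identifier has to be filled in per link
object Service "Link correlation - route-1" {
  import "example-link-correlation"
  host_name = "router-a1"                          // placeholder host
  vars.nx_correlation_service_param = "route-1"    // placeholder route name
}
```

Keeping the tags in the template means that each new link only needs its Service Param set.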

Below you can see an example of the final result of the correlation between the services related to the protocols.

And here you can see how the variables have been populated in the service:

Here is the definition of the command for correlating the various links, along with its associated Service Template:

object CheckCommand "nx-c-service-correlation-status" {
    import "plugin-check-command"
    command = [ "/neteye/shared/monitoring/plugins/check_dummy" ]
    timeout = 1m
    arguments += {
        dummy_state = {
            order = 0
            required = true
            skip_key = true
            value = {{
            // Master tag, hosts to scan and route identifiers, read from the service's custom variables
            var correlation_tag = macro("$nx_correlation_master_tag$")
            var routers_names = macro("$nx_correlation_hostnames$")
            var route_destinations = macro("$nx_correlation_filter_service_param$")
            
            var warning_count = 0
            var critical_count = 0
            var ok_count = 0
            var total_services = 0
            
            for (router_name in routers_names) {
                var services_list = get_services(router_name)
            
                for (service in services_list) {
                    var match_found = false
            
                    if (service.vars.nx_correlation_master_tag == correlation_tag) {
                        for (route_destination in route_destinations) {
                            if (service.vars.nx_correlation_service_param == route_destination) {
                                match_found = true
                                break
                            }
                        }
                    }
            
                    if (match_found) {
                        var state = service.state
                        total_services += 1
            
                        if (state == 0) {
                            ok_count += 1
                        } else if (state == 1) {
                            warning_count += 1
                        } else if (state == 2) {
                            critical_count += 1
                        }
                    }
                }
            }
            
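            // CRITICAL if every matched service is CRITICAL (this also holds when nothing matched),
            // WARNING if at least one is WARNING or CRITICAL, otherwise OK
            var state = 0
            if (critical_count == total_services) {
                state = 2
            } else if (warning_count > 0 || critical_count > 0) {
                state = 1
            }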
            
            return state
            }}
        }
        dummy_text = {
            order = 1
            required = true
            skip_key = true
            value = {{
            var correlation_tag = macro("$nx_correlation_master_tag$")
            var routers_names = macro("$nx_correlation_hostnames$")
            var route_destinations = macro("$nx_correlation_filter_service_param$")
            var state_names = [ "OK", "WARNING", "CRITICAL", "UNKNOWN" ]
            var message = ""
            var warning_count = 0
            var critical_count = 0
            var ok_count = 0
            var total_services = 0
            
            for (router_name in routers_names) {
                var services_list = get_services(router_name)
                
                for (service in services_list) {
                    var match_found = false
                    
                    if (service.vars.nx_correlation_master_tag == correlation_tag) {
                        for (route_destination in route_destinations) {
                            if (service.vars.nx_correlation_service_param == route_destination) {
                                match_found = true
                                break
                            }
                        }
                    }
                    
                    if (match_found) {
                        var service_state_text = state_names[service.state]
                        message += service.name + ": " + service_state_text + "\n"
                        var state = service.state
                        total_services += 1
                        
                        if (state == 0) {
                            ok_count += 1
                        } else if (state == 1) {
                            warning_count += 1
                        } else if (state == 2) {
                            critical_count += 1
                        }
                    }
                }
            }
            
            var overall_status = ""
            if (critical_count == total_services) {
                overall_status = critical_count + "/" + total_services + " CRITICAL => Overall Status CRITICAL"
            } else if (warning_count > 0 || critical_count > 0) {
                overall_status = (warning_count + critical_count) + "/" + total_services + " WARNING/CRITICAL => Overall Status WARNING"
            } else {
                overall_status = ok_count + "/" + total_services + " OK => Overall Status OK"
            }
            message = overall_status + "\nService Details:\n" + message
            return message
            }}
        }
    }
    vars.dummy_state = 0
    vars.dummy_text = "Check was successful."
}
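
For illustration, a service wired to this command could be defined along these lines. The check command and the three variable names come from the command above; the service name, host names and values are hypothetical placeholders.

```icinga2
// Hypothetical service for the per-Carrier route correlation
object Service "Route correlation - carrier-1" {
  host_name = "router-a1"                          // placeholder host that owns the check
  check_command = "nx-c-service-correlation-status"
  // Must equal the master tag set on the link correlation services (placeholder value)
  vars.nx_correlation_master_tag = "link-correlation"
  // Hosts whose services are scanned (placeholder names)
  vars.nx_correlation_hostnames = [ "router-a1", "router-a2" ]
  // Route identifiers to include in the correlation (placeholder values)
  vars.nx_correlation_filter_service_param = [ "route-1", "route-2" ]
}
```

Since the host list and the route list are both arrays, the same command can cover links spread across any number of hosts.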

And now you can see the final result of the correlation service between the various routes:

As well as how the variables are defined:

To conclude, the names of the monitored services, and of those created by the correlations, remain completely independent of the checks performed by the new Commands. Furthermore, this Command and Template structure can be reused in contexts other than the one presented here. In fact, I hope to bring this new check to an NEP soon, as I’m confident it will be useful to many of you.
