If you’re familiar with the NetEye SIEM module you probably also know El Proxy, the solution integrated into NetEye to ensure the integrity and inalterability of the logs produced by the SIEM module.
Since its introduction in NetEye, the only way to understand what El Proxy was doing was to inspect its logs, which is not an ideal way to get an overview of the behavior of any piece of software. Until now, El Proxy has therefore been something of a black box for most users, who may have been wondering, for example:
Is El Proxy signing and processing all logs correctly? Or is it perhaps encountering some error?
What is the workload in El Proxy? Are El Proxy and Elasticsearch keeping up with all the logs produced by the SIEM module?
To answer these types of questions we started introducing observability into El Proxy. In particular, we started with metrics, which will allow users to easily spot anomalies in the infrastructure and analyze the behavior of El Proxy over time.
The technologies involved in the process of exposing and visualizing El Proxy metrics in NetEye are:
OpenTelemetry: used by El Proxy to generate the metrics and expose them via an HTTP endpoint using the Prometheus format
Telegraf: polls the metrics from the HTTP endpoint and writes them to InfluxDB
Grafana: visualizes the metrics via multiple dashboards installed in NetEye
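To make the first link of this chain more concrete, here is a minimal Python sketch of what reading metrics in the Prometheus text format looks like. In NetEye this job is done by Telegraf; the snippet only illustrates the format. The metric names, labels, and values in the sample payload are invented for the example and are not El Proxy's real ones, and the parser is deliberately simplified (it ignores optional timestamps and escaped characters in label values).

```python
import re

# Hypothetical payload in the Prometheus text format. The metric names,
# labels and values are invented for illustration only.
SAMPLE = """\
# HELP example_logs_received_total Logs received (hypothetical metric).
# TYPE example_logs_received_total counter
example_logs_received_total{tenant="master"} 1520
example_logs_received_total{tenant="tenant_a"} 310
# TYPE example_logs_in_flight gauge
example_logs_in_flight 12
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# A single key="value" pair inside the braces.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from a metrics payload."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # Not a sample line this simplified parser understands.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each sample line carries a metric name, an optional set of labels, and a numeric value, which is the structure a collector needs in order to write a point to a time series database.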
To design the metrics and the visualizations, we divided the metrics into two main topics. The first one is troubleshooting, for which users may ask: Did El Proxy fail to process some logs? If so, for what reason? Did it store logs in the DLQ? If so, when?
To answer these questions we created the “Troubleshooting” dashboard, based on metrics constructed from these use cases.
El Proxy Troubleshooting dashboard. In the first 2 panels, we can see that around 15:28 El Proxy had an error contacting Elasticsearch, probably due to an infrastructure incident, which led to 2 logs not being processed and being sent back to Logstash. The bottom half instead gives insight into the individual requests performed by El Proxy to Elasticsearch. In particular, the 1st panel shows all the failed requests to Elasticsearch, while the 2nd and 3rd panels show the number of “retries” that failed and that succeeded, respectively.
The second topic is the performance of El Proxy and Elasticsearch, for which NetEye also provides a dedicated dashboard:
El Proxy Performance dashboard. The 2 top-left panels give an overview of how many requests and logs El Proxy receives over time, while the third panel on the left shows how many logs are inside El Proxy at any moment, which can help you understand whether El Proxy is processing and writing logs faster than it receives them. The top-right panels instead show the timing of the requests to El Proxy and to Elasticsearch over time. This lets you understand, for example, if Elasticsearch is slowing down when it’s under heavy load, which may be a sign of a lack of resources.
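The reasoning behind the "logs inside El Proxy" panel can be sketched in a few lines of Python: if the number of in-flight logs keeps growing over time, input is outpacing output. The function name, the data points, and the use of a least-squares slope are all assumptions made for this illustration; they do not describe how the dashboard itself is built.

```python
def backlog_slope(points):
    """Least-squares slope of in-flight logs over time.

    `points` is a list of (timestamp_seconds, logs_in_flight) pairs.
    A clearly positive result means the backlog is growing, i.e. logs
    arrive faster than they are processed and written.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        raise ValueError("timestamps must not all be equal")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical readings taken every 10 seconds.
    growing = [(0, 10), (10, 20), (20, 30)]
    stable = [(0, 5), (10, 5), (20, 5)]
    print(backlog_slope(growing))  # positive: backlog is building up
    print(backlog_slope(stable))   # zero: El Proxy is keeping up
```

A slope that stays near zero or negative suggests El Proxy and Elasticsearch are keeping up, while a sustained positive slope is the pattern worth investigating.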
Finally, a third dashboard gives an overview of the number of logs generated by each Tenant present in the infrastructure:
El Proxy Tenant Performance dashboard. In this dashboard, a panel is generated for each Tenant present in your environment. Each panel displays the number of logs received from that Tenant, and also gives insight into each of the Tenant’s blockchains if it has more than one.
We hope this first improvement to the observability of El Proxy will make it easier for users to understand how El Proxy behaves. Any feedback is appreciated. Please report it through the Wuerth Phoenix channels!