When a degradation occurs within a complex system, understanding the root cause can be extremely challenging, and if the issue happens only sporadically, the challenge becomes even greater. Modern systems involve numerous components that interact in complex ways: if your application’s Web UI becomes slow, for example, the underlying cause could be anywhere in the stack.
Even if you collect observability data for all involved components, the sheer volume of information can be overwhelming, leaving you without a clear starting point for investigation.
In such cases, machines can help us: they are far better than humans at processing and analyzing large amounts of data.
The Elastic Stack integrated with NetEye offers Machine Learning (ML) jobs that assist in analyzing observability data. These jobs can automatically detect anomalous behavior in your system, helping operators focus on what matters.
Recently, we had the opportunity to explore these Elastic Stack capabilities to pinpoint the root cause of an actual degradation occurring within NetEye itself.
Some users reported that Kibana occasionally felt slower than usual, but we couldn’t determine why or when these slowdowns occurred.
To investigate, we set up an Alyvix Test Case simulating real user behavior in Kibana. This allowed us to measure latency during the actual user journey and identify when the degradation happened.
Shortly after deploying the Alyvix Test Case, we gathered clear evidence of the issue, as shown in the dashboard below:

On April 22nd at around 12 PM, the test case reported a significant increase in duration. This gave us a concrete point in time to start our investigation.
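To illustrate how such a spike can also be pinned down outside the dashboard, here is a minimal sketch that aggregates the recorded transaction durations per hour directly from Elasticsearch. The endpoint, index pattern (`alyvix-*`) and field names (`duration_ms`, `@timestamp`) are hypothetical placeholders, not the exact names used in our environment.

```python
# Minimal sketch: aggregate Alyvix transaction durations per hour to spot
# when the average duration spikes.
# Index pattern, field names, and endpoint are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {"avg_duration": {"avg": {"field": "duration_ms"}}},
        }
    },
}

resp = requests.post(f"{ES_URL}/alyvix-*/_search", json=query)
for bucket in resp.json()["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_duration"]["value"])
```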
But how could we determine what caused this degradation?
At this stage, we knew what the degradation was and when it occurred, but not why. The components involved in delivering Kibana’s UI are the Kibana service itself and its backend, Elasticsearch. While we had monitoring data for both (provided out-of-the-box by Elastic Stack), the sheer amount of data made it nearly impossible for a human operator to manually identify correlations.
This is where Machine Learning comes to the rescue. Elastic allows you to run ML jobs on data stored in Elasticsearch, automatically highlighting time-based correlations between anomalies in your infrastructure.
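As an illustration of how such a job might be defined, here is a minimal sketch using the Elasticsearch anomaly detection API: it creates a job that models the mean transaction duration, attaches a datafeed, and starts it. The job ID, index pattern, and field names are hypothetical placeholders, not the exact configuration we used.

```python
# Minimal sketch: define an anomaly detection job on transaction durations,
# attach a datafeed, open the job, and start analysing incoming data.
# Job/index/field names are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

job = {
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "mean", "field_name": "duration_ms"}],
    },
    "data_description": {"time_field": "@timestamp"},
}
requests.put(f"{ES_URL}/_ml/anomaly_detectors/alyvix-duration", json=job)

datafeed = {"job_id": "alyvix-duration", "indices": ["alyvix-*"]}
requests.put(f"{ES_URL}/_ml/datafeeds/datafeed-alyvix-duration", json=datafeed)

# Open the job and start the datafeed so new data is analysed continuously.
requests.post(f"{ES_URL}/_ml/anomaly_detectors/alyvix-duration/_open")
requests.post(f"{ES_URL}/_ml/datafeeds/datafeed-alyvix-duration/_start")
```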

In our case, Elastic ML detected anomalies in the Alyvix test case transactions (bottom row) during the morning of April 22nd. To find potential correlations, we looked for anomalies in other parameters that occurred only during this timeframe: anomalies that also recur at other times would have caused similar slowdowns then as well, so they could be ruled out.
Following this reasoning, we noticed an anomaly in Kibana jobs processed that morning, which did not appear on other days. This was a strong indicator of the actual root cause behind Kibana’s performance degradation.
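The comparison itself can be scripted: here is a minimal sketch that pulls strong anomaly records from several ML jobs within the window around the reported slowdown, so their timestamps can be compared side by side. The job names below are hypothetical placeholders for the jobs covering the Alyvix, Kibana, and Elasticsearch monitoring data.

```python
# Minimal sketch: list strong anomaly records from several ML jobs within a
# two-day window, to look for time-based correlations between them.
# Job names and endpoint are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import requests

ES_URL = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

end = datetime.now(timezone.utc)
start = end - timedelta(days=2)
params = {
    "start": int(start.timestamp() * 1000),  # epoch milliseconds
    "end": int(end.timestamp() * 1000),
    "record_score": 75,  # keep only strong anomalies
}

for job_id in ("alyvix-duration", "kibana-metrics", "elasticsearch-metrics"):
    resp = requests.get(
        f"{ES_URL}/_ml/anomaly_detectors/{job_id}/results/records",
        params=params,
    )
    for record in resp.json().get("records", []):
        print(job_id, record["timestamp"], round(record["record_score"], 1))
```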
By combining Alyvix for real user experience monitoring with Elastic’s Machine Learning anomaly detection, we were able to identify when the degradation occurred and uncover strong indicators of its root cause. However, it’s important to note that human expertise is still essential to set up meaningful ML jobs and interpret the results effectively. This synergy between automated analysis and human insight can significantly reduce root cause investigation time.