Hello to you all. It’s been a while. Don’t worry though, this won’t be a long and technical post. It’s just to let you know I’m doing (almost) well and to tell you about our latest news.
In the last year we’ve had a lot on our plate, but this hasn’t affected our capacity to design and implement new solutions with NetEye. Recently we were tasked with solving a new quest: use real-time metrics to implement near-real-time monitoring.
My first thought was “Ok, let’s do this the usual way: store metrics inside InfluxDB, then use NEP InfluxDB Query to process performance data”. But when I asked “How many devices are we talking about?”, the answer was “The whole Infrastructure”. That’s when I realized we weren’t talking about one or two hundred metrics per minute, but around ten thousand metrics per second, or perhaps even more.
At that moment I understood the meaning of “deafening silence”: everyone in the room froze for I-don’t-know-how-many seconds. Then I collected myself off the floor and said “Give me some time to think about it”.
The fact is that some years ago, while talking with our Division Head about the pros and cons of Metric-based Monitoring, I had been adamant that it was the best solution for monitoring and that we absolutely had to do it. In that room, the chance to prove my claim dropped straight into my lap, so my pride forced me to at least present a possible solution. I had no idea of the challenge ahead.
It was an unheard-of thing. I mean, up until now, everyone had avoided this topic. Using metrics to perform near-real-time monitoring with a plugin-driven system (i.e., Icinga 2) is almost suicidal: you have to query InfluxDB one or more times every minute for each Service Object you monitor.
And this cannot be delegated to Satellites, because the data is centralized on NetEye (Single Node or Cluster). This puts very heavy stress on the monitoring infrastructure, and it simply doesn’t scale: you will kill your NetEye Cluster and have nothing to show for it.
We quickly started looking for a solution. The first idea was an approach that already runs on NetEye Cloud to check the status of all Elastic Agents and their data sources: use a Poller to fetch all metrics in one sweep, then set the status of all related services via Tornado. This definitely improves scalability, but the data backend is still InfluxDB: in a NetEye Cluster, InfluxDB has High Availability support, but only for the data. InfluxDB itself remains a single-instance service, so this setup still cannot scale.
Running more InfluxDB instances could improve scalability, but the resulting infrastructure would be a nightmare to manage: load balancing would be “manual” (no dynamic relocation of data between instances), and there would still be no real High Availability. We therefore needed to switch to another architecture.
As I mentioned before, on NetEye Cloud we already use a promising strategy: run One Big Query to get all the data for the same class of services, process it, and set the status of the related Services using Tornado. This let us set several thousand services using just a handful of Elasticsearch queries.
Since Elasticsearch is designed to handle lots of data per query efficiently, the overall workload decreased; Icinga and Tornado also kept pace, updating the status of several thousand services per minute, resulting in great performance and scalability improvements.
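To make the “One Big Query” idea concrete, here is a minimal sketch of what such a query could look like: a single Elasticsearch aggregation that returns the latest sample for every host at once, instead of one query per Service Object. The index pattern and field names (`host.name`, `@timestamp`, `system.cpu.total.pct`) are illustrative assumptions, not the actual queries used in the project.

```python
# Hypothetical sketch of a "One Big Query": bucket by host, keep each
# host's newest sample. Field names are assumptions for illustration.

def build_one_big_query(metric_field: str, minutes: int = 1) -> dict:
    """Build an Elasticsearch query body that fetches the latest value
    of `metric_field` for every host seen in the last `minutes` minutes."""
    return {
        "size": 0,  # we only need the aggregation buckets, not raw hits
        "query": {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
        "aggs": {
            "per_host": {
                # one bucket per host; size must cover the whole fleet
                "terms": {"field": "host.name", "size": 60000},
                "aggs": {
                    "latest": {
                        # newest document in each host's bucket
                        "top_hits": {
                            "size": 1,
                            "sort": [{"@timestamp": {"order": "desc"}}],
                            "_source": [metric_field],
                        }
                    }
                },
            }
        },
    }

query = build_one_big_query("system.cpu.total.pct")
```

One request like this replaces tens of thousands of per-service InfluxDB queries, which is exactly where the scalability gain comes from.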
So, why not apply this approach to metrics as well? Suppose we’re receiving 10 metrics per host every minute. Then with 60K different hosts sending metrics we’re getting a flow of 10K EPS. That’s nothing for Elasticsearch.
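The back-of-envelope math from the numbers above, just to show where the 10K EPS figure comes from:

```python
# Ingest-rate estimate using the figures from the text.
hosts = 60_000
metrics_per_host_per_minute = 10
eps = hosts * metrics_per_host_per_minute / 60  # events per second
print(eps)  # → 10000.0
```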
And we would still achieve a higher resolution than a standard plugin-based monitoring approach (i.e., a poll every 3 minutes). So, why not invest in a “sophisticated” poller script to process the metrics? Each class of metrics you need to analyze requires just one dedicated query.
So we would need to develop somewhere between 30 and 50 different queries: a perfectly doable task. And an Elasticsearch Cluster can scale horizontally with ease, increasing both its EPS ingest rate and its storage size. This made me believe that Elasticsearch (and the full Elastic Stack) could be the answer to the Metrics Challenge.
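A poller like the one described above could be sketched roughly as follows. This is only an illustration of the shape of the idea, one query per metric class, then one passive status event per host; the thresholds, the class name and the event format are placeholders, not the real implementation.

```python
# Illustrative poller skeleton: turn the per-host results of one big
# query into passive status events. Thresholds and event fields are
# made-up placeholders for this sketch.

WARN, CRIT = 0.80, 0.95  # example thresholds for a ratio-style metric


def classify(value: float) -> str:
    """Map a metric value to an Icinga-style state string."""
    if value >= CRIT:
        return "CRITICAL"
    if value >= WARN:
        return "WARNING"
    return "OK"


def poll_class(metric_class: str, results: dict) -> list:
    """Build one status event per host from one query's results.

    `results` maps host name -> latest metric value, i.e. the flattened
    outcome of a single per-class Elasticsearch query.
    """
    return [
        {"host": host, "service": metric_class, "state": classify(value)}
        for host, value in results.items()
    ]


# Example: one class, two hosts -> two events ready to hand to Tornado.
events = poll_class("cpu_usage", {"web01": 0.42, "db01": 0.97})
```

In the real system these events would be forwarded to Tornado, which then sets the corresponding Icinga 2 service states; the transport is omitted here on purpose.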
While searching for more details, we found out that Elasticsearch was about to release support for Time Series Data indices, promising improved performance and reduced storage consumption. That was the final piece of the puzzle: I immediately set out to design a workable architecture, since I felt I had a promising solution at hand.
Now, I don’t want to spoil the rest of it. The fact is that, after a year of work, we can now say that NetEye can efficiently perform Metric-based Near-Real-Time Monitoring. We fused together Elasticsearch, Tornado, Icinga 2 and passive monitoring strategies to implement metric-based monitoring that’s flexible, scalable and robust: it can tell you when something is going wrong, and can even report when some data in your gigantic Flow of Metrics stops arriving.
And we built it on a fairly standard NetEye deployment (a “common” 3-way cluster). On top of that, it integrates so well with NetEye that, even though it can be considered a heavy and pervasive customization, it doesn’t impact update/upgrade procedures at all.
This has yet to be integrated into our main product, but if you want, we can build it out for you (in much less than a year, of course). And so, NetEye Rules again.