31. 12. 2021 Damiano Chini Development, NetEye

Tornado Monitoring and Statistics

When I’m running a service which processes a lot of data, sooner or later I start to wonder: what is the service doing? What data is it processing?

This also applies to our event processor, Tornado. For the Tornado Engine, the administrator may wonder, for example, how many events Tornado is receiving, how many actions it’s executing, and how many of these actions are failing.

Until NetEye 4.21, Tornado exposed this information only via logs. But when Tornado processes thousands of events per second, it’s not humanly possible to read the logs and understand the big picture of Tornado’s status.

To address this, Tornado now exposes metrics of the events and actions that it’s performing. Moreover, NetEye provides an out-of-the-box integration which reads these metrics from Tornado and stores them in a dedicated InfluxDB database.

The administrator can then create Grafana dashboards based on this data to monitor Tornado’s performance.

Tornado Metrics Exposed

When you decide to expose metrics about a service, one crucial decision is which metrics to expose. If the metrics aren’t well thought out, they may be useless to the user.

Tornado currently exposes the following metrics, among others, and we’ll add more in the future based on the feedback we get:

- How many Events the Tornado Engine received
  - differentiated by Event type
- How many Events the Tornado Engine finished processing
  - differentiated by Event type
- How many Actions the Tornado Engine executed
  - differentiated by Event type and outcome (failure or success)

How It Works

Tornado collects various metrics via OpenTelemetry and exposes them through an endpoint in the popular Prometheus format.
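To make the Prometheus format more concrete, here is a minimal Python sketch that parses a scrape of such an endpoint and computes the share of failed actions. The metric names (`events_received_total`, `actions_executed_total`), the label names, and the sample values are hypothetical and chosen only for illustration; they are not Tornado’s actual metric names. The parser covers just the simple subset of the text format shown here.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: metric names, labels, and values are
# illustrative assumptions, not what Tornado actually exposes.
SAMPLE = """\
# HELP events_received_total Events received by the engine
# TYPE events_received_total counter
events_received_total{event_type="email"} 1200
events_received_total{event_type="snmptrap"} 300
# TYPE actions_executed_total counter
actions_executed_total{event_type="email",outcome="success"} 1150
actions_executed_total{event_type="email",outcome="failure"} 50
"""

# name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a sample line this simple parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def failure_ratio(text, metric="actions_executed_total"):
    """Return failed / total executions for the given counter."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get("outcome", "unknown")] += value
    executed = sum(totals.values())
    return totals["failure"] / executed if executed else 0.0
```

Calling `failure_ratio(SAMPLE)` on the sample above divides the 50 failed executions by the 1200 total. In NetEye this kind of parsing is done by Telegraf, so the sketch is only meant to show what the endpoint’s output looks like.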

At this point, how does NetEye store these metrics in InfluxDB?

To achieve this, NetEye runs a Telegraf instance that reads from Tornado through a Prometheus input plugin and writes the data to a dedicated NATS subject. Another Telegraf instance is then in charge of reading from this NATS subject and writing the metrics to InfluxDB.

Why did we choose to pass metrics via NATS when we could directly have the Telegraf instance read from the Tornado Prometheus endpoint and write them out to InfluxDB?

The reason for this architectural design is that in the future NetEye will also integrate metrics coming from Tornado Collectors running on the NetEye master and on NetEye satellites, and these metrics will necessarily travel via NATS. Passing via NATS therefore allows all the writers to InfluxDB to share the same logic (reading from NATS and writing to InfluxDB) and to be agnostic about the source of the metrics.
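The idea of a source-agnostic writer can be sketched in a few lines of Python. This is not how NetEye implements it (Telegraf and NATS do the real work): a `queue.Queue` stands in for the NATS subject, and the function names, the `source` tag, and the measurement names are all assumptions made for illustration. The writer formats each message as InfluxDB line protocol without any branching on where the message came from.

```python
import queue


def publish(subject, source, measurement, tags, value, timestamp_ns):
    """Put one metric on the subject (an in-memory stand-in for NATS)."""
    subject.put({
        "measurement": measurement,
        # The producer is recorded as a plain tag (hypothetical name).
        "tags": {**tags, "source": source},
        "value": value,
        "timestamp_ns": timestamp_ns,
    })


def to_line_protocol(message):
    """Format a message as InfluxDB line protocol.

    Simplified: no escaping of spaces or commas, single float field.
    """
    tag_part = "".join(
        f",{key}={val}" for key, val in sorted(message["tags"].items())
    )
    return (
        f"{message['measurement']}{tag_part} "
        f"value={float(message['value'])} {message['timestamp_ns']}"
    )


def drain_to_influx_lines(subject):
    """Read everything currently on the subject; same logic for any source."""
    lines = []
    while True:
        try:
            message = subject.get_nowait()
        except queue.Empty:
            break
        lines.append(to_line_protocol(message))
    return lines


if __name__ == "__main__":
    demo_subject = queue.Queue()
    # Arbitrary timestamps and values, for illustration only.
    publish(demo_subject, "tornado_engine", "events_received",
            {"event_type": "email"}, 1200, 1640908800000000000)
    publish(demo_subject, "satellite_collector", "events_received",
            {"event_type": "snmptrap"}, 300, 1640908800000000000)
    for line in drain_to_influx_lines(demo_subject):
        print(line)
```

Adding a new producer, such as a Collector on a satellite, only requires another `publish` call; `drain_to_influx_lines` stays unchanged, which is the property the NATS-based design is after.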

Future Improvements

Everything I’ve explained in this blog post is just the first step toward making Tornado’s performance easier to inspect.

Future steps to make the Tornado performance metrics more useful will be to:

- Install out-of-the-box Grafana dashboards giving various insights into Tornado’s performance
- Collect the metrics coming from Tornado Collectors running on the NetEye Master and on NetEye Satellites