30. 09. 2021 Damiano Chini Development, NetEye

Tornado: Tracing

How can we allow a Tornado administrator to successfully track down the flow of an event through Filters, Rules and Actions of Tornado, when Tornado is processing thousands of events per second?

Tornado administrators can have a hard time reading Tornado logs to understand where for example an action error comes from. Take this log as an example:

Aug 20 11:53:49 neteye tornado[59069]: Aug 20 11:53:49.649 ERROR tornado_executor_smart_monitoring_check_result: SmartMonitoringExecutor - Director host creation action failed with error ActionExecutionError { message: "DirectorExecutor API returned an error. Response status: 500 Internal Server Error. Response body: {\n \"error\": \"Error during the creation of the object at runtime\"\n}\n", can_retry: true, code: None }.

How can I understand why this action was triggered? Which Tornado Action do I have to fix? In which Rule?

What I would need to do is to look back through the logs and try to find a related log, which tells me the Rule that was matched.

But this becomes an impossible task if Tornado is processing thousands of events each second.

This is where Tracing can help us! While “Logging” aims to give information about a single incident happening in the application, “Tracing” has awareness of the whole application flow and can then provide information on the journey that an event went through before arriving at a certain point in the application.

In the context of our Tornado application, this means that with Tracing, we cannot only for example inform the user that an Action failed, but also tell them which Event triggered the Action, and which Filters and Rules the Event went through before triggering the Action.

Beginning with NetEye 4.19, Tornado integrates the Tokio Tracing project and from NetEye 4.20 onwards, Tornado implements traces for Event flows which depict the journey of an Event through the Tornado application. To see what this actually means in practice, take a look at the following extract:

Aug 20 13:08:08 neteye tornado[8624]: Aug 20 13:08:08.879 ERROR MatcherActor{trace_id="4c72f980-bae5-4bb3-afb2-0dbbea18d084"}:dispatch_filter{name="root"}:dispatch_ruleset{name="ruleset_01"}:dispatch_rule{name="smart_monitoring_check_result"}:dispatch_action{action=0 action_id="smart_monitoring_check_result"}: tornado_executor_smart_monitoring_check_result: SmartMonitoringExecutor - Director host creation action failed with error ActionExecutionError { message: "DirectorExecutor API returned an error. Response status: 500 Internal Server Error. Response body: {\n    \"error\": \"Error during the creation of the object at runtime\"\n}\n", can_retry: true, code: None, data: {"method":"POST","url":"https://httpd.neteyelocal/neteye/director/host?live-creation=true","payload":{"object_name":"acme-host","object_type":"Object"}} }.

Here you can see that the Event identified by the given “trace_id”, went through the Filter “root”, the Ruleset “ruleset_01”, the rule “smart_monitoring_check_result”, triggered the Action “smart_monitoring_check_result”, and this action execution then resulted in an error. With this information in hand, we can now know for sure which action we need to modify to fix the error.

We’ve applied Tracing to this particular use case in Tornado, but we’ve only just started exploring its possibilities, so it’s very likely that in the near future many more improvements will be made in Tornado thanks to Tracing!