In a previous post we showed how Distributed Tracing and Elastic APM can help Tornado administrators track down events from their generation on Tornado Collectors to the Actions they trigger in Tornado.
This blog post is more technical: it gives some insight into how we implemented Distributed Tracing in Tornado and how OpenTelemetry made the whole implementation possible.
A common use case for Tornado administrators is understanding why a given Event did not correctly trigger a certain Tornado Action. For example, an administrator may want to know whether the Event was never sent by the Tornado Collector, whether the Tornado Processing Tree did not match the incoming Event, or whether the Action triggered by the Processing Tree ran into problems during its execution.
To address this use case we need to trace the Event through different services (i.e. the Tornado Collectors and the Tornado Engine). This is a common problem in microservice architectures, and the method that addresses it is called Distributed Tracing. If you would like to better understand what Distributed Tracing is and why alternatives like simple application logs are not a good solution for our use case, you can take a look at this interesting article.
Once we understood that the best solution for our use case was to implement Distributed Tracing in Tornado, we began investigating how we could achieve that.
Our requirements were:
Before we implemented Distributed Tracing, our Tornado services, written in Rust, were already tracing Events via the Tokio tracing library. The trace data was sent to Elastic APM via an unofficial community library.
The problem with this implementation was that the trace of an Event E in a service S1 was not related to the trace of the same Event E in a service S2. Elastic APM therefore reported two separate traces for Event E, and it wasn’t possible to analyze the Event's entire path as a single trace.
We soon understood that developing our own Distributed Tracing implementation was not an option. Instead, we needed to rely on existing Rust libraries for Distributed Tracing that supported integration with both Tokio tracing and Elastic APM.
After some investigation we discovered that what the OpenTelemetry project promised was exactly what we needed. OpenTelemetry is an incubating project of the Cloud Native Computing Foundation that aims to become a standard for metrics, logs and traces. Although the project is not yet fully mature, a big part of the industry is investing in it, and ready-to-use libraries are already available for several languages, including Rust.
The OpenTelemetry standard is becoming so popular that we could:
Another great advantage of OpenTelemetry is that the standard is not language-dependent: the trace data of an event that goes through a Rust service can easily be passed on to a service written in another language. So if in the future we want to trace Tornado Events through services that are not written in Rust but, for example, in Go, this integration will be very easy.
The principle behind Distributed Tracing in OpenTelemetry is simple. When service S1 sends an event to service S2, it also needs to send the so-called trace context together with the event itself.
The trace context is a piece of information that describes the state of a trace at a particular moment. The most important information stored in the trace context is the ID of the currently open span.
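To make this more concrete, here is a minimal sketch of what a trace context might look like on the sending side. It assumes the W3C Trace Context `traceparent` format, which is the default propagation format in OpenTelemetry; this post does not state which format Tornado actually uses. The `TraceContext` type and its method are hypothetical names chosen for illustration, not part of any OpenTelemetry crate.

```rust
/// Hypothetical in-memory view of a trace context (illustrative only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceContext {
    /// Identifies the whole trace; shared by all spans of the same event.
    pub trace_id: u128,
    /// ID of the span that is currently open in the sending service.
    pub span_id: u64,
    /// Whether the trace is being sampled (recorded).
    pub sampled: bool,
}

impl TraceContext {
    /// Serialize as a W3C `traceparent` value (assumed wire format):
    /// version `00`, 32 hex digits of trace ID, 16 hex digits of span ID,
    /// and 2 hex digits of flags, separated by dashes.
    pub fn to_traceparent(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id,
            self.span_id,
            u8::from(self.sampled)
        )
    }
}

fn main() {
    // Example IDs only; a real SDK generates them randomly.
    let ctx = TraceContext {
        trace_id: 0x4bf92f3577b34da6a3ce929d0e0e4736,
        span_id: 0x00f067aa0ba902b7,
        sampled: true,
    };
    // Service S1 would attach this value to the event it sends to S2.
    println!("traceparent: {}", ctx.to_traceparent());
}
```

In practice the OpenTelemetry libraries build and serialize this value themselves; the sketch only shows how little data has to travel with the event.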
When service S2 receives an event together with the trace context, it restores the received trace context and then starts its own spans within it. In practical terms, the first span created in service S2 automatically inherits its parent span from the trace context. As a result, when the trace is analyzed, for example in Elastic APM, there is a single trace for the event, in which the first span of service S2 is a child of the last span of service S1.
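The receiving side described above can be sketched as follows. As before, this assumes the W3C `traceparent` format and uses only the Rust standard library; `RemoteContext`, `Span`, `is_lower_hex`, `parse_traceparent` and `start_span` are hypothetical names, and fresh IDs are passed in as parameters to keep the example deterministic, whereas a real SDK generates them randomly.

```rust
/// Trace context received from the upstream service (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RemoteContext {
    trace_id: u128,
    parent_span_id: u64,
}

/// A locally created span, far simpler than a real SDK span.
#[derive(Debug, PartialEq, Eq)]
struct Span {
    trace_id: u128,
    span_id: u64,
    parent_span_id: Option<u64>,
    name: String,
}

/// True if `s` consists of exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parse `00-<trace-id>-<parent-id>-<flags>`; only version `00` is accepted.
/// Returns `None` on malformed input, in which case a new trace is started.
fn parse_traceparent(header: &str) -> Option<RemoteContext> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4
        || parts[0] != "00"
        || !is_lower_hex(parts[1], 32)
        || !is_lower_hex(parts[2], 16)
        || !is_lower_hex(parts[3], 2)
    {
        return None;
    }
    let trace_id = u128::from_str_radix(parts[1], 16).ok()?;
    let parent_span_id = u64::from_str_radix(parts[2], 16).ok()?;
    // All-zero IDs are invalid in the W3C format.
    if trace_id == 0 || parent_span_id == 0 {
        return None;
    }
    Some(RemoteContext { trace_id, parent_span_id })
}

/// Start a span: with a remote context it joins the existing trace as a
/// child of the remote span; without one it becomes the root of a new trace.
fn start_span(
    name: &str,
    remote: Option<RemoteContext>,
    fresh_trace_id: u128,
    fresh_span_id: u64,
) -> Span {
    match remote {
        Some(ctx) => Span {
            trace_id: ctx.trace_id,
            span_id: fresh_span_id,
            parent_span_id: Some(ctx.parent_span_id),
            name: name.to_string(),
        },
        None => Span {
            trace_id: fresh_trace_id,
            span_id: fresh_span_id,
            parent_span_id: None,
            name: name.to_string(),
        },
    }
}

fn main() {
    // Value that service S2 received together with the event from S1.
    let header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let remote = parse_traceparent(header);
    let span = start_span("process_event", remote, 0xaaaa, 0xbbbb);
    println!(
        "span '{}' ({:016x}) in trace {:032x}, parent {:?}",
        span.name, span.span_id, span.trace_id, span.parent_span_id
    );
}
```

Because the first span in S2 reuses the trace ID and records the remote span as its parent, a backend can stitch the spans of both services into one tree.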
Implementing Distributed Tracing in Tornado did not look like an easy task when we first approached the problem. In fact, the solution would have required a lot of work and workarounds if we hadn’t come across the OpenTelemetry project.
We were lucky enough to start implementing Distributed Tracing in Tornado at the right moment, when the OpenTelemetry project was mature enough to allow us to integrate all the technologies that were in use in Tornado and in NetEye. OpenTelemetry really solved many problems for us in a very clean and future-proof way.
Correlating trace data across different services worked out of the box with OpenTelemetry. Official libraries were available for Rust and Tokio tracing, and the export of traces to Elastic APM was also handled by OpenTelemetry through the OpenTelemetry Protocol (OTLP), which allowed us to get rid of the unofficial and limited Rust library we had been using to export Tokio tracing data to Elastic APM.
We were impressed by the potential that OpenTelemetry offers to developers, and we are looking forward to further improving our solution together with upcoming features that will be brought into OpenTelemetry.