31. 12. 2023 Damiano Chini Development, DevOps, NetEye

Speeding up the NetEye CI Testing Phase

Over the course of the last few years, we’ve introduced more and more features in NetEye 4. This has had a side effect that’s not directly visible to customers, namely that we keep adding more and more tests to the testing phase of the NetEye 4 Continuous Integration pipelines. While this ensures that regression bugs are not introduced in the various NetEye 4 features, it also means that the NetEye CI pipelines take longer and longer to run with each test that’s added, to the point where the mere execution of tests now takes around an hour and a half. Considering that each NetEye CI pipeline also performs an environment preparation, the total time taken by NetEye 4 CI pipelines now adds up to 2+ hours.

But is there really a problem with that? Well, think about being a developer who implements their task in ~4 hours and now wants to release the task they’ve just implemented. They’ll have to wait for 2+ hours before maybe discovering that the task breaks a test. And most likely our developer has already started another task in the meantime to avoid wasting time, and so now has to stop work on the new task and go back to the previous one to fix it. This is frustrating for developers and increases overhead due to what is called a context switch. Ideally, when I’m done implementing a task, I’d like to know whether all tests pass within 5-10 minutes, about the time it takes for a coffee break.

Moreover, long CI pipelines (caused mainly by long-running test executions) also lengthen the time-to-market of bugfixes, for example. In fact, it often happens that when a small bug is found, the fix takes only 5 minutes, but it’s only generally available after several hours (which often means the next day).

Given that this problem is only going to get worse with the addition of more tests, we decided to start addressing it now. But how?

Executing Tests in Parallel

Since until now the CI tests for NetEye have been executed sequentially, the logical way to reduce the duration of the testing phase is to run the tests in parallel.

So, can we just rewrite our sequential test runner to turn it into a parallel test runner and achieve our goal? Unfortunately not, because a subset of the tests would break. In fact, many of the existing integration tests perform operations that may break other tests if executed in parallel.
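To make the idea concrete, here is a minimal sketch (in Python, purely for illustration; the test names and the run_test helper are invented, not part of our actual runner) of what a naive sequential-to-parallel rewrite could look like:

```python
# Minimal sketch of a naive sequential-to-parallel rewrite of a test runner.
# run_test() and the test list are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor

def run_test(test_name: str) -> bool:
    """Pretend to execute a single integration test and report success."""
    print(f"running {test_name}")
    return True

TESTS = ["test_icinga2_api", "test_elastic_ingest", "test_tornado_rules"]

if __name__ == "__main__":
    # Sequential version: one test after the other.
    sequential_results = [run_test(t) for t in TESTS]

    # Parallel version: the same tests, executed concurrently.
    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel_results = list(pool.map(run_test, TESTS))
```

ProcessPoolExecutor.map() happily schedules the tests concurrently, but it does nothing to protect tests that read or modify the same shared system state.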

For example, some integration tests that verify that different configurations of a service work as expected modify the configuration of the service (e.g. Icinga 2) and restart it to apply the new configuration. Of course, if another test is meanwhile performing checks on the service assuming it’s running with the standard configuration, that test is going to break.
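In (hypothetical, pytest-style) code, the clash looks roughly like this; the two helpers are simplistic stand-ins, not real NetEye test code:

```python
# Hypothetical illustration of two tests that must not run at the same time.
import subprocess

def restart_icinga2_with(config_file: str) -> None:
    # A real test would first install config_file; this stub only restarts.
    subprocess.run(["systemctl", "restart", "icinga2"], check=True)

def icinga2_is_active() -> bool:
    return subprocess.run(["systemctl", "is-active", "icinga2"]).returncode == 0

def test_icinga2_custom_configuration():
    # Mutates shared state: replaces the configuration and restarts Icinga 2.
    restart_icinga2_with("custom_retention.conf")
    assert icinga2_is_active()

def test_icinga2_standard_behaviour():
    # Assumes Icinga 2 is up with the standard configuration; if the test
    # above restarts the daemon at this very moment, the assertion fails
    # even though the feature under test is perfectly fine.
    assert icinga2_is_active()
```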

Nonetheless, most of the tests that break when executed in parallel could be adapted so that they can safely run in parallel, but this is not trivial and requires considerable effort from developers, since there are a lot of tests to adapt. If we decided to migrate all existing tests in one go, this would probably block our overall development for a few weeks, which is not sustainable.
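Adapting such a test usually means making it self-contained: it should only touch resources it creates itself, under unique names, and avoid restarting shared services. A minimal sketch of what that could look like, with an invented in-memory “backend” and helper names standing in for whatever API the real suite uses:

```python
# Hypothetical sketch of a parallel-safe test: it only touches objects it
# creates itself, under a unique name, and never restarts the shared daemon.
import uuid

_FAKE_BACKEND = set()  # stands in for the monitoring backend

def create_host(name: str) -> None:
    _FAKE_BACKEND.add(name)

def delete_host(name: str) -> None:
    _FAKE_BACKEND.discard(name)

def host_is_monitored(name: str) -> bool:
    return name in _FAKE_BACKEND

def test_host_creation_parallel_safe():
    # A unique, test-owned name guarantees no clash with anything that
    # other tests create or inspect at the same time.
    host = f"ci-host-{uuid.uuid4().hex[:8]}"
    create_host(host)
    try:
        assert host_is_monitored(host)
    finally:
        # Clean up our own resources so later tests see an unchanged system.
        delete_host(host)
```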

Gradually Passing to Parallel Tests

When doing something requires a big effort, the best approach is often to split it into smaller, more manageable parts. We’ve tried to carry this over to the context of running the tests in parallel.

So, in order to switch gradually to parallel tests, we took the following approach. Since we cannot migrate all tests in a single step, we kept a stage in our pipelines that still executes tests sequentially. Then, a new, separate stage in our pipelines executes tests in parallel; at the beginning this stage was empty, i.e. it wasn’t executing any tests.

The idea is that newly introduced tests must always be written in such a way that they can be run in parallel with other tests, which should alleviate the problem of the testing phase getting longer and longer with every test that we add. At the same time, we will gradually migrate existing tests from the sequential test phase to the parallel test phase, which will allow us to actually reduce the current duration of the CI pipelines.
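Assuming, purely for illustration, a pytest-based suite, the split between the two stages could be expressed with a marker: migrated and newly written tests are flagged as parallel-safe, the parallel stage selects only those, and the sequential stage runs everything else. The marker name below is an invented convention, not something that exists in our suite today:

```python
# conftest.py -- hypothetical marker registration for parallel-safe tests.
import pytest

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        "parallel_safe: the test can run concurrently with any other test",
    )

# In a test module: a newly written (or freshly migrated) test
# simply carries the marker.
@pytest.mark.parallel_safe
def test_new_feature_is_isolated():
    assert 1 + 1 == 2
```

The parallel stage could then run something like `pytest -m parallel_safe -n auto` (fanning tests out over workers with pytest-xdist), while the sequential stage deselects them with `pytest -m "not parallel_safe"` and otherwise behaves as before.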

On a side note, to improve our Cluster testing pipelines, we also decided to write the new parallel tests so that they can be executed on NetEye Cluster environments as well. This lets us run them in our Cluster testing pipelines in addition to the simple NetEye Single Instance ones, which will greatly help us ensure that all features also work as expected on Cluster environments, since these always have some peculiarities compared to Single Instances.

Current Status

We’ve already started migrating tests to the parallel stage and have managed to move all the Elastic Stack tests there, but this is only a small subset of all the tests we have. Unfortunately, we don’t yet have enough metrics to tell how much time this migration has saved, but we’re working on our Continuous Integration metrics, so soon we’ll be able to give you more precise numbers, maybe even in the next blog post!

These Solutions are Engineered by Humans

Did you find this article interesting? Does it match your skill set? Programming is at the heart of how we develop customized solutions. In fact, we’re currently hiring for roles just like this and others here at Würth Phoenix.
