Optimizing Rolling Restarts in Elasticsearch Clusters
Introduction
For on-premises Elasticsearch installations, performing a rolling restart across a cluster can be a time-consuming task, especially for large clusters. Rolling restarts are typically required when changing node configurations or upgrading the cluster to a new version.
Elastic provides an official procedure to ensure service continuity during this process. However, after analyzing that approach, we identified opportunities to optimize it without compromising availability guarantees. We had two goals:
Reduce the overall duration of the restart process
Avoid potential pitfalls that can lead to stalled upgrades
In this post, we’ll first review the official procedure, then highlight its limitations, and finally present an alternative approach that addresses these challenges.
The Official Procedure: Current State
Elastic’s documentation describes the rolling restart process in detail. A key recommendation is to wait for the cluster to return to a green state after each node restart before proceeding to the next node.
This precaution ensures service continuity. When the cluster is yellow, some shards are not fully allocated. Restarting another node at this point could make certain shards completely unavailable, disrupting service.
For example:
Shard A has its primary on Node 1 and a replica on Node 2.
If Node 1 is restarted, shard A is temporarily available only on Node 2.
Restarting Node 2 before shard A is reallocated would make shard A entirely unavailable, impacting read/write operations.
This explains why the official procedure insists on waiting for the cluster to return to green before moving on.
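For reference, the wait-for-green step can be automated with the cluster health API. Below is a minimal sketch, assuming an unauthenticated local endpoint at http://localhost:9200; adjust the URL, authentication, and timeout for your environment.

import requests

ES = "http://localhost:9200"  # assumed endpoint; add auth/TLS as needed

def wait_for_green(timeout="10m"):
    """Block until the cluster reports green, as the official procedure requires."""
    # wait_for_status makes Elasticsearch hold the request open until the
    # requested status is reached or the timeout expires.
    resp = requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": "green", "timeout": timeout},
    )
    body = resp.json()
    if body.get("timed_out") or body.get("status") != "green":
        raise RuntimeError(f"cluster did not reach green: {body.get('status')}")
    return body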
Where Can We Improve?
While the official approach is safe, it can be overly conservative and occasionally problematic. We identified two areas for improvement:
Timing Optimization
Avoiding Upgrade Deadlocks
1. Timing Optimization
Waiting for the cluster to become green after every node restart may not always be necessary. If the shards that remain unallocated after restarting Node 1 have no replicas on Node 2, restarting Node 2 does not introduce any risk of data unavailability.
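To see which shards are still unallocated after a restart, the _cat/shards API can be queried in JSON format. A minimal sketch, assuming the same local endpoint as above:

import requests

ES = "http://localhost:9200"  # assumed endpoint

def unassigned_shards():
    """Return (index, shard_number) pairs for every shard copy that is UNASSIGNED."""
    rows = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    ).json()
    return {(r["index"], r["shard"]) for r in rows if r["state"] == "UNASSIGNED"}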
2. Deadlock During Upgrades
In some cases, the cluster can remain in a yellow state indefinitely after upgrading a node, requiring manual intervention. This happens because:
If Node 1 is upgraded first and receives a new shard, its replica cannot be allocated to nodes running an older version (per the Elasticsearch allocation rules described in the official documentation).
The cluster then waits for allocation that cannot happen until more nodes are upgraded, effectively blocking the process.
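When the cluster appears stuck in yellow during an upgrade, the cluster allocation explain API reports why a shard copy cannot be assigned (for example, because the remaining nodes run an older version). A minimal sketch, again assuming a local unauthenticated endpoint:

import requests

ES = "http://localhost:9200"  # assumed endpoint

def explain_first_unassigned():
    """Ask Elasticsearch why an unassigned shard cannot be allocated."""
    # Called without a request body, the API picks an arbitrary unassigned shard;
    # it returns an error if there are no unassigned shards to explain.
    resp = requests.get(f"{ES}/_cluster/allocation/explain")
    if resp.status_code == 400:
        return "no unassigned shards to explain"
    return resp.json().get("allocate_explanation", resp.json())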
Our Solution
To address these issues, we designed a simple yet effective algorithm (a minimal sketch follows the steps below):
After restarting Node X, identify all shards that are currently unallocated.
Before restarting Node X+1, check whether it hosts replicas of any unallocated shards.
If yes, wait until those shards are allocated.
If no, proceed with the restart.
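The sketch below illustrates that check, building on the earlier snippets. It assumes the same local endpoint; the node name and polling interval are illustrative, and real automation would add error handling and the actual restart step.

import time
import requests

ES = "http://localhost:9200"  # assumed endpoint

def shard_table():
    """Return the cluster's shard table as a list of rows."""
    return requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    ).json()

def safe_to_restart(node_name):
    """True if node_name hosts no copy of a shard that is currently unassigned."""
    rows = shard_table()
    unassigned = {(r["index"], r["shard"]) for r in rows if r["state"] == "UNASSIGNED"}
    hosted_here = {
        (r["index"], r["shard"])
        for r in rows
        if r.get("node") == node_name and r["state"] == "STARTED"
    }
    # If the next node holds a surviving copy of an unassigned shard,
    # restarting it now would make that shard completely unavailable.
    return not (unassigned & hosted_here)

def wait_until_safe(node_name, poll_seconds=10):
    """Block until node_name can be restarted without risking shard unavailability."""
    while not safe_to_restart(node_name):
        time.sleep(poll_seconds)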
This approach ensures:
No shard becomes completely unavailable during the process.
Faster upgrades, as we no longer wait for the entire cluster to return to green before moving on.
Deadlock prevention, since the process no longer stalls waiting for replicas that cannot be allocated until more nodes are upgraded.
We implemented and automated this logic, and the results have been very promising in our environments.
Conclusion
Rolling restarts are essential for maintaining and upgrading Elasticsearch clusters, but the official procedure can be slow and occasionally problematic. Our alternative approach optimizes timing and prevents upgrade deadlocks while preserving high availability guarantees.
By automating this algorithm, we've significantly improved the efficiency of cluster upgrades in our installations, and we believe this method could also benefit other operators facing similar challenges.