30. 12. 2025 Damiano Chini Automation, Development, Log Management, Log-SIEM, NetEye

Optimizing Rolling Restarts in Elasticsearch Clusters

Introduction

For on-premise Elasticsearch installations, performing a rolling restart across a cluster can be a time-consuming task, especially when dealing with large clusters. Rolling restarts are typically required when changing node configurations or upgrading the cluster to a new version.

Elastic provides an official procedure to ensure service continuity during this process. However, after analyzing that approach, we identified opportunities to optimize it without compromising availability guarantees. We had two goals:

  1. Reduce the overall duration of the restart process
  2. Avoid potential pitfalls that can lead to stalled upgrades

In this post, we’ll first review the official procedure, then highlight its limitations, and finally present an alternative approach that addresses these challenges.

The Official Procedure: Current State

Elastic’s documentation describes the rolling restart process in detail. A key recommendation is to wait for the cluster to return to a green state after each node restart before proceeding to the next node.

This precaution ensures service continuity. When the cluster is yellow, some shards are not fully allocated. Restarting another node at this point could make certain shards completely unavailable, disrupting service.

For example:

  • Shard A has its primary on Node 1 and a replica on Node 2.
  • If Node 1 is restarted, shard A is temporarily available only on Node 2.
  • Restarting Node 2 before shard A is reallocated would make shard A entirely unavailable, impacting read/write operations.

This explains why the official procedure insists on waiting for the cluster to return to green before moving on.
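
To make the waiting step concrete, here is a minimal Python sketch that polls the cluster health API until the cluster reports green. It is only an illustration: the base URL, the 60s timeout, the attempt limit, and the helper names (`health_url`, `is_green`, `wait_for_green`) are assumptions, and authentication and TLS are left out.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode


def health_url(base_url: str, status: str = "green", timeout: str = "60s") -> str:
    """Build a cluster health URL that waits server-side for `status`."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def is_green(health: dict) -> bool:
    """True when the response reports green and the wait did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", False)


def _http_fetch(url: str) -> dict:
    """Default fetcher (assumes no authentication and plain HTTP access)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError:
        # A timed-out wait may be answered with an error status; treat as not green.
        return {"status": "unknown", "timed_out": True}


def wait_for_green(base_url: str, fetch=_http_fetch, max_attempts: int = 30) -> bool:
    """Poll until the cluster is green; `fetch` can be swapped out for testing."""
    for _ in range(max_attempts):
        if is_green(fetch(health_url(base_url))):
            return True
    return False
```

In a restart script, a call such as `wait_for_green("http://localhost:9200")` would sit between two node restarts.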

Where Can We Improve?

While the official approach is safe, it can be overly conservative and occasionally problematic. We identified two areas for improvement:

  1. Timing Optimization
  2. Avoiding Upgrade Deadlocks

1. Timing Optimization

Waiting for the cluster to become green after every node restart may not always be necessary. If the shards that remain unallocated after restarting Node 1 have no replicas on Node 2, restarting Node 2 does not introduce any risk of data unavailability.

2. Avoiding Upgrade Deadlocks

In some cases, the cluster can remain in a yellow state indefinitely after upgrading a node, requiring manual intervention. This happens because:

  • If Node 1 is upgraded first and is then assigned a new primary shard, that shard's replica cannot be allocated to nodes still running an older version (an Elasticsearch allocation rule that is also described in the official documentation).
  • The cluster then waits for allocation that cannot happen until more nodes are upgraded, effectively blocking the process.
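
The version constraint behind this deadlock can be modelled in a few lines of Python. This is a simplified sketch of the rule as described above, not Elasticsearch's actual allocation logic; the node names, version numbers, and function names are made up for illustration.

```python
def parse_version(version: str) -> tuple:
    """Turn '8.16.1' into (8, 16, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def can_host_replica(primary_node_version: str, candidate_version: str) -> bool:
    """A replica may not go to a node older than the one holding the primary."""
    return parse_version(candidate_version) >= parse_version(primary_node_version)


def eligible_replica_nodes(primary_node: str, node_versions: dict) -> list:
    """Names of the nodes that could host a replica of a shard on `primary_node`."""
    primary_version = node_versions[primary_node]
    return [
        name
        for name, version in sorted(node_versions.items())
        if name != primary_node and can_host_replica(primary_version, version)
    ]


# Hypothetical three-node cluster in which only node-1 has been upgraded.
versions = {"node-1": "8.17.0", "node-2": "8.16.1", "node-3": "8.16.1"}
stuck = eligible_replica_nodes("node-1", versions)     # no node qualifies yet

# Once node-2 is upgraded as well, the replica has somewhere to go.
versions["node-2"] = "8.17.0"
unblocked = eligible_replica_nodes("node-1", versions)
```

As long as the list of eligible nodes is empty, the cluster cannot return to green, however long the procedure waits.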

Our Solution

To address these issues, we designed a simple yet effective algorithm:

  1. After restarting Node X, identify all shards that are currently unallocated.
  2. Before restarting Node X+1, check whether it hosts a copy (primary or replica) of any shard that still has unallocated copies.
    • If yes, wait until those shards are allocated.
    • If no, proceed with the restart.
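
The check in step 2 could be sketched in Python as follows. This is not our production implementation: it assumes the shard listing comes from `GET _cat/shards?format=json` (rows with `index`, `shard`, `prirep`, `state` and `node` fields), it treats both `UNASSIGNED` and `INITIALIZING` copies as not yet allocated, and the function names are hypothetical.

```python
# States treated as "not yet allocated" in this sketch (an assumption).
PENDING_STATES = {"UNASSIGNED", "INITIALIZING"}


def shard_key(row: dict) -> tuple:
    """Identify a shard by index name and shard number, ignoring primary/replica."""
    return (row["index"], row["shard"])


def pending_shards(rows: list) -> set:
    """Shards that have at least one copy which is not allocated yet."""
    return {shard_key(row) for row in rows if row["state"] in PENDING_STATES}


def blocking_shards(rows: list, node: str) -> set:
    """Pending shards for which `node` currently holds a started copy."""
    pending = pending_shards(rows)
    return {
        shard_key(row)
        for row in rows
        if row.get("node") == node
        and row["state"] == "STARTED"
        and shard_key(row) in pending
    }


def safe_to_restart(rows: list, node: str) -> bool:
    """Step 2: the node may be restarted only if it blocks no pending shard."""
    return not blocking_shards(rows, node)


# Hypothetical listing taken right after node-1 was restarted.
rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-3"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "node-2"},
]
```

With this listing, node-2 must wait because it holds the only started copy of `logs` shard 0, while node-3 could be restarted immediately. A full automation would re-fetch the listing and repeat the check until the blocking set for the next node is empty.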

This approach ensures:

  • No shard becomes completely unavailable during the process.
  • Faster upgrades, as we no longer wait for the entire cluster to return to green before moving on.
  • Deadlock prevention, since an unallocated replica no longer blocks the restart of nodes that hold no copy of that shard.

We implemented and automated this logic, and the results have been very promising in our environments.

Conclusion

Rolling restarts are essential for maintaining and upgrading Elasticsearch clusters, but the official procedure can be slow and occasionally problematic. Our alternative approach optimizes timing and prevents upgrade deadlocks while preserving high availability guarantees.

By automating this algorithm, we’ve significantly improved the efficiency of cluster upgrades in our installations, and we believe this method could also benefit other operators facing similar challenges.
