Optimizing Rolling Restarts in Elasticsearch Clusters
Introduction
For on-premises Elasticsearch installations, performing a rolling restart across a cluster can be a time-consuming task, especially for large clusters. Rolling restarts are typically required when changing node configurations or upgrading the cluster to a new version.
Elastic provides an official procedure to ensure service continuity during this process. However, after analyzing that approach, we identified opportunities to optimize it without compromising availability guarantees. We had two goals:
Reduce the overall duration of the restart process
Avoid potential pitfalls that can lead to stalled upgrades
In this post, we’ll first review the official procedure, then highlight its limitations, and finally present an alternative approach that addresses these challenges.
The Official Procedure: Current State
Elastic’s documentation describes the rolling restart process in detail. A key recommendation is to wait for the cluster to return to a green state after each node restart before proceeding to the next node.
This precaution ensures service continuity. When the cluster is yellow, some shards are not fully allocated. Restarting another node at this point could make certain shards completely unavailable, disrupting service.
For example:
Shard A has its primary on Node 1 and a replica on Node 2.
If Node 1 is restarted, shard A is temporarily available only on Node 2.
Restarting Node 2 before shard A is reallocated would make shard A entirely unavailable, impacting read/write operations.
This explains why the official procedure insists on waiting for the cluster to return to green before moving on.
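For reference, the wait-for-green step can be automated with the cluster health API. Below is a minimal sketch, assuming an unauthenticated local endpoint at http://localhost:9200; adjust the URL, authentication, and timeout for your environment.

import requests

ES = "http://localhost:9200"  # assumed endpoint; add auth/TLS as needed

def wait_for_green(timeout="10m"):
    """Block until the cluster reports green, as the official procedure requires."""
    # wait_for_status makes Elasticsearch hold the request open until the
    # requested status is reached or the timeout expires.
    resp = requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": "green", "timeout": timeout},
    )
    body = resp.json()
    if body.get("timed_out") or body.get("status") != "green":
        raise RuntimeError(f"cluster did not reach green: {body.get('status')}")
    return body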
Where Can We Improve?
While the official approach is safe, it can be overly conservative and occasionally problematic. We identified two areas for improvement:
Timing Optimization
Avoiding Upgrade Deadlocks
1. Timing Optimization
Waiting for the cluster to become green after every node restart may not always be necessary. If the shards that remain unallocated after restarting Node 1 have no replicas on Node 2, restarting Node 2 does not introduce any risk of data unavailability.
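To see which shards are still unallocated after a restart, the _cat/shards API can be queried in JSON format. A minimal sketch, assuming the same local endpoint as above:

import requests

ES = "http://localhost:9200"  # assumed endpoint

def unassigned_shards():
    """Return (index, shard_number) pairs for every shard copy that is UNASSIGNED."""
    rows = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    ).json()
    return {(r["index"], r["shard"]) for r in rows if r["state"] == "UNASSIGNED"}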
2. Deadlock During Upgrades
In some cases, the cluster can remain in a yellow state indefinitely after upgrading a node, requiring manual intervention. This happens because:
If Node 1 is upgraded first and receives a new shard, its replica cannot be allocated to nodes running an older version (per the Elasticsearch allocation rules described in the official documentation).
The cluster then waits for allocation that cannot happen until more nodes are upgraded, effectively blocking the process.
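When the cluster appears stuck in yellow during an upgrade, the cluster allocation explain API reports why a shard copy cannot be assigned (for example, because the remaining nodes run an older version). A minimal sketch, again assuming a local unauthenticated endpoint:

import requests

ES = "http://localhost:9200"  # assumed endpoint

def explain_first_unassigned():
    """Ask Elasticsearch why an unassigned shard cannot be allocated."""
    # Called without a request body, the API picks an arbitrary unassigned shard;
    # it returns an error if there are no unassigned shards to explain.
    resp = requests.get(f"{ES}/_cluster/allocation/explain")
    if resp.status_code == 400:
        return "no unassigned shards to explain"
    return resp.json().get("allocate_explanation", resp.json())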
Our Solution
To address these issues, we designed a simple yet effective algorithm (a minimal sketch follows the steps below):
After restarting Node X, identify all shards that are currently unallocated.
Before restarting Node X+1, check whether it hosts replicas of any unallocated shards.
If yes, wait until those shards are allocated.
If no, proceed with the restart.
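The sketch below illustrates that check, building on the earlier snippets. It assumes the same local endpoint; the node name and polling interval are illustrative, and real automation would add error handling and the actual restart step.

import time
import requests

ES = "http://localhost:9200"  # assumed endpoint

def shard_table():
    """Return the cluster's shard table as a list of rows."""
    return requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    ).json()

def safe_to_restart(node_name):
    """True if node_name hosts no copy of a shard that is currently unassigned."""
    rows = shard_table()
    unassigned = {(r["index"], r["shard"]) for r in rows if r["state"] == "UNASSIGNED"}
    hosted_here = {
        (r["index"], r["shard"])
        for r in rows
        if r.get("node") == node_name and r["state"] == "STARTED"
    }
    # If the next node holds a surviving copy of an unassigned shard,
    # restarting it now would make that shard completely unavailable.
    return not (unassigned & hosted_here)

def wait_until_safe(node_name, poll_seconds=10):
    """Block until node_name can be restarted without risking shard unavailability."""
    while not safe_to_restart(node_name):
        time.sleep(poll_seconds)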
This approach ensures:
No shard becomes completely unavailable during the process.
Faster upgrades, as we no longer wait for the entire cluster to return to green before moving on.
Deadlock prevention, since the process no longer stalls waiting for replicas that cannot be allocated until more nodes are upgraded.
We implemented and automated this logic, and the results have been very promising in our environments.
Conclusion
Rolling restarts are essential for maintaining and upgrading Elasticsearch clusters, but the official procedure can be slow and occasionally problematic. Our alternative approach optimizes timing and prevents upgrade deadlocks while preserving high availability guarantees.
By automating this algorithm, we've significantly improved the efficiency of cluster upgrades in our installations, and we believe this method could also benefit other operators facing similar challenges.