Familiar with the feeling when a critical system’s backup and recovery time is measured in hours? Operations teams rely heavily on the availability of monitoring data, so scheduling long periods of downtime is simply not an option.
We recently faced this exact challenge: we had to schedule a multi-hour downtime because our MariaDB database had grown increasingly large. This was primarily due to factors like ‘undo logs’ and long-running transactions, which store their information in the MySQL system-level file ibdata1.
We were already monitoring these activities, but the problem is a known one: despite eliminating the root causes, MariaDB cannot automatically free the space occupied by ibdata1. Unfortunately, the only way to resolve this is to export the entire contents, delete the databases and critical files, and then completely rebuild the databases from backup. This isn’t ideal, as restoring large databases from a logical backup can potentially cause significant downtime.
However, we chose a parallel approach that combined two key elements: containerization (Podman) and lightning-fast backup tools (Mariabackup and MyDumper). With these tools we were able to reduce NetEye’s real-world downtime from several hours to just 8 minutes.
The right tools and parallel work made it possible for us to drastically shorten not only the amount of downtime but also the entire process.
So let’s get to know these tools as we go through the whole process step by step.
First, of course, we made a backup of the entire database. For this, we chose the Mariabackup tool, which is an official tool from the MariaDB team.
Why Mariabackup? It’s extremely fast and reliable, creating a bit-by-bit accurate physical (file system) backup without having to shut down the database server. This physical backup can be used for very fast recovery if something goes wrong, and a copy of this backup can be used as a basis for the next step, working with containers.
The entire backup took about 10 minutes.
The biggest time saver was containerization. In this step, we started a Podman (Docker) container with the MariaDB version used by NetEye, and copied the backup made with Mariabackup into it. This created an exact copy of the NetEye database, which allowed the entire rebuild process to run in parallel, separately and in a container, without disturbing the live NetEye system, which could continue to operate while we worked.
I must emphasize here how important it was to work quickly: since we were working on a “copy” of the database in a container, the original NetEye database continued to work in the meantime. So despite the fact that we strictly monitored that no changes were made to NetEye and thus to the database during the process, the creation of (minor) differences due to other running processes was inevitable. That’s why we had to finish as quickly as possible, to minimize the differences between the two databases.
We now have a copy of our MariaDB running in the Podman container, but this instance of MariaDB still has the problem with the Ibdata file that we’re trying to resolve. The reason we made a clone of the MariaDB inside a container is that inside this instance we can perform a logical backup using the MyDumper application which will create a dump file without disturbing the live system.
Again we have the question: Why mydumper and not mysqldump? Although mysqldump is the most common tool, mydumper is drastically faster and more efficient because it takes advantage of CPU threads. Of course, in all previous tests I also tested the process with mysqldump, so I have a basis for comparison, and the difference is drastic:
This difference in speed made it possible for the entire process to be completed in less than an hour.
After the logical backup, we dropped all the server databases in the container except for the system tables, then stopped the container and thus the database server, and finally deleted the ibdata1 and ib_logfile0 files in the data directory, ensuring that a restore would start from a completely clean base.
It’s very important that the files should only be deleted after the database server has been gracefully shut down, otherwise the databases will be corrupted.
We then restarted the container with the now empty database, so the restore could begin with myloader, which is part of the MyDumper package and is specifically used for restores.
The difference in restore speed between mysql(dump) vs myloader/mydumper:
Interesting fact: this is such a big difference that I had to try it several times during the testing phase to confirm I wasn’t doing anything wrong.
After the restore was successfully completed and the database with a clean structure was created, we stopped the container.
Since all the rebuild work was already done in the container in the previous phase, this phase is limited to copying and restarting.
We unmanaged the cluster in pacemaker and stopped the MariaDB systemd service. This initiated the NetEye shutdown.
Afterwards we started the data migration, so we deleted the old database files from the production directory, then copied the clean, rebuilt data files from the container’s working directory and set the necessary permissions on the files.
Then we started the MariaDB service and checked the service status. After confirming that everything was up-and-running, we returned the cluster management to pacemaker.
The whole process, including deleting, copying and setting file permissions, took a total of 8 minutes and in fact, the NetEye downtime lasted until then.
The secret to achieving minimal downtime was parallelism. The entire rebuild process – including physical and logical backup, cleanup and rebuild – took place in the background, without disturbing the live system. The use of the right tools was essential for this:
The Mariabackup and MyDumper/MyLoader duo drastically accelerated the backup/restore phase. The containerized environment (Podman) enabled simple, fast parallelism, which significantly contributed to the short downtime. Thus, the actual downtime was only 8 minutes, which was limited to copying files and restarting the service.
Results:
I hope you liked this little summary of the project. Why don’t you try it out too?
Useful links:
Authors:
Sultan Balawal and Csaba Remenar
Did you find this article interesting? Does it match your skill set? Our customers often present us with problems that need customized solutions. In fact, we’re currently hiring for roles just like this and others here at Würth Phoenix.