19. 12. 2025 Csaba Remenar NetEye

Optimizing Galera: Safe Backups via Asynchronous Replication

If a database goes down unexpectedly, it causes a domino effect on the systems that depend on it: the web shop won’t load, CRM freezes, and business processes get disrupted. NetEye is no different – if MariaDB fails, we lose access to WebUI, logins, Grafana dashboards, Assets management, and all other related functions.

That’s why we replaced the single MariaDB server running on a PCS-DRBD pair with a Galera cluster. This setup helps avoid a full database shutdown and restart, and in many cases prevents even short-term outages of dependent services.

For the Galera cluster to perform well, we need to watch several factors, including CPU, memory, and disk I/O. It’s unwise to make backups during a busy day since physical backups can overload nodes, causing synchronization loss or temporary cluster freezing.

Quieter periods like nighttime or early morning are better for backups – if those times exist. But what if we want more frequent, fresher backups, say every four hours, without heavily taxing the active nodes?

In this post, I’ll show you how to set up a dedicated, read-only replica. You can back up this replica at any time, as often as you want, without interrupting the operational nodes.

Step 1 – Configure Galera Cluster

Proper configuration is crucial for reliable replication and high availability in a Galera cluster. Let’s begin with the setup.

​First, create a dedicated user to initiate and manage the replication process:

CREATE USER 'replicant'@'%' IDENTIFIED BY '<PASSWORD>';
GRANT REPLICATION SLAVE ADMIN ON *.* TO 'replicant'@'%';
FLUSH PRIVILEGES;
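
Since CREATE USER and GRANT are DDL statements, Galera replicates them to the other nodes, so you only need to run them once. As a quick sanity check, you can list the grants on any other node:

mysql -e "SHOW GRANTS FOR 'replicant'@'%';"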

Apply the following configuration changes to all operational nodes in the cluster. No changes are required on a dedicated arbitrator node, if you’re using one.

Create a new file, e.g. vim /neteye/local/mariadb/conf/my.cnf.d/replica.cnf and copy the following:

[mysqld] 
log_slave_updates        = ON 
# wsrep_gtid_domain_id and server-id need to have the 
# same value on all nodes in the cluster. 
server-id                = 01 
wsrep_gtid_domain_id     = 0 
wsrep_gtid_mode          = ON 
log-bin                  = /neteye/local/mariadb/log/master-bin 
log-bin-index            = /neteye/local/mariadb/log/master-bin.index 
# Should be set to a different value on all nodes in a given cluster, 
# and each of these values should be different than the configured wsrep_gtid_domain_id value. 
# E.g.: node1= 10 and node2=11 
gtid_domain_id           = 10 
expire_logs_days         = 1 

The replica-related settings won’t be activated by a rolling restart of the nodes; you have to restart the entire cluster. In practice, shut down the nodes one by one, then bootstrap from the node with the most advanced state.
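
The exact restart procedure depends on how the service is managed (NetEye has its own cluster tooling), but as a rough sketch of the principle on a plain MariaDB Galera setup – the data directory path below is an assumption – the most advanced node can be identified from the Galera state file before bootstrapping:

# On each stopped node, check the saved state: the node with the highest seqno
# (or safe_to_bootstrap: 1) is the most advanced one. The path is an assumption.
cat /neteye/local/mariadb/data/grastate.dat
# Bootstrap the cluster from that node only, then start the remaining nodes normally.
galera_new_cluster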

Here’s a brief explanation of the settings:

  • [mysqld]: This section indicates that the settings below apply to the MariaDB server (mysqld) daemon.
  • log_slave_updates = ON: This allows a replication “slave” (secondary) server to also record in its own binary log the changes it has received and applied from the “master” (primary) server.
  • server-id = 01: Gives a unique identifier to the MariaDB cluster in the replication topology. Each server/cluster must have a unique server-id, but within the Galera cluster all operational nodes must share the same identifier – here, 01.
  • wsrep_gtid_domain_id = 0: This is the GTID (Global Transaction Identifier) domain identifier associated with Galera Cluster replication. It is used independently of gtid_domain_id in the Galera environment. As with server-id, the operational nodes must share the same ID – here, 0.
  • wsrep_gtid_mode = ON: Turns on Galera Cluster GTID mode, which allows transactions to be uniquely identified and tracked across the cluster.
  • log-bin: Specifies the base name and path of the binary log files. MariaDB records all database modification operations in this file, which is essential for replication and point-in-time recovery.
  • log-bin-index: Specifies the path and name of the binary log index file. This stores the list and order of binary log files.
  • gtid_domain_id = 10: This is the GTID domain ID associated with traditional MariaDB replication. Each replication domain requires a unique ID to avoid conflicts. Here, nodes in the cluster must also be unique. For example, one operational node is 10, the other is 11, and the replica is 12.
  • expire_logs_days = 1: Specifies how many days MariaDB should keep binary log files before automatically deleting them. This helps with disk space management.

A quick note: MariaDB’s GTID implementation is specific to MariaDB (it differs from MySQL’s). Using it is not essential, but it greatly simplifies configuration and operation.
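
Once the cluster is back up, a quick way to verify that the new settings were picked up is to query the relevant variables on each operational node:

mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('server_id','gtid_domain_id','wsrep_gtid_domain_id','wsrep_gtid_mode','log_slave_updates');"
mysql -e "SELECT @@gtid_binlog_pos, @@gtid_current_pos;"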

For the replica we will use a dedicated, separate virtual machine (VM). This ensures complete isolation from the cluster, allowing us to make backups at any time without disturbing the production environment.

In order for requests from the replica running outside the cluster to reach the “internal” MariaDB address, we also have to configure the firewall appropriately. We solved this problem with the following command:

firewall-cmd --permanent --zone=public --add-rich-rule="rule family=\"ipv4\" source address=\"<SLAVE-IP>\" forward-port port=\"30306\" protocol=\"tcp\" to-port=\"3306\" destination address=\"<CLUSTER-IP>\"" 

firewall-cmd --permanent --add-port=30306/tcp && firewall-cmd --reload 

So we enabled an “external” port (30306) that only the replica can connect to. The keepalive configured on our operational nodes ensures that the slave node can always reach a Galera node via the Cluster IP to start and maintain the replication process.
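
Once the replica VM has the MariaDB client installed (see Step 2), it’s worth testing that the forwarded port actually reaches MariaDB through the cluster IP (placeholders as above):

mysql -h <CLUSTER-IP> -P 30306 -u replicant -p -e "SELECT @@hostname, @@version;"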

Step 2 – Configure the replica

For simplicity (and because I like it), I use Podman for the whole process on a Red Hat VM, so I’ll show you how to configure it.

Installing the required packages:

  • Podman: For managing containers
  • mariadb-backup: For physical backup – its installation also includes the MariaDB server, but we won’t use the server itself
  • Zstd: Needed for compression, but you can also use other methods (e.g. gzip, zip)
  • Mbstream: zstd is very efficient, but it can’t compress directories and multiple files at once, so this is also needed
dnf install zstd podman mariadb-backup mbstream

Prepare the directories for the MariaDB replica instance running in the Podman container:

mkdir -p /srv/podman/mariadb/{conf,data,log} 

Note: Since this is where the entire database and its logs will be stored, make sure the replica VM has enough disk space to hold a full copy of the database.

Create a file /etc/containers/systemd/podman-mariadb-slave.container with the following content:

[Unit]
Description=MariaDB Slave Service 
After=network-online.target 
Wants=network-online.target 

[Container] 
ContainerName=podman-mariadb-slave 
Image=localhost/mariadb:10.11.10 
PublishPort=3308:3306 

# Volume 
Volume=/srv/podman/mariadb/conf/replica_slave.cnf:/etc/mysql/conf.d/replica.cnf 
Volume=/srv/podman/mariadb/data:/srv/podman/mariadb/data 
Volume=/srv/podman/mariadb/log:/srv/podman/mariadb/log 

Exec=--datadir=/srv/podman/mariadb/data 

[Service] 
Restart=on-failure 
TimeoutStopSec=70 

[Install] 
WantedBy=default.target 

Check it with the command below, but don’t run the container yet as there are still some configuration steps to do first.

/usr/libexec/podman/quadlet -dryrun

Step 3 – Import production data and start the replica

If you don’t have a relatively recent backup, take a physical backup of your primary MariaDB server. Make sure the backup was taken after creating the “replicant” user, so that it is included.

I use the nightly backup, which is compressed, so I decompress it first:

zstd -d /tmp/galera-backup.zstd -c | mbstream -x -C /srv/podman/mariadb/data/

Then “prepare” the data:

mariabackup --prepare --target-dir=/srv/podman/mariadb/data 

Set the ownership to 999:999 – this is the mysql user inside the container:

chown -R 999:999 /srv/podman/mariadb/data 

After preparation you can get the binlog status, which you need to start the replication. For example:

cat /srv/podman/mariadb/data/xtrabackup_binlog_pos_innodb
/neteye/local/mariadb/log/master-bin.000160 858584413

Now you can configure the replica (vim /srv/podman/mariadb/conf/replica_slave.cnf, the file mounted into the container above):

[mysqld]
server-id         = <Must be unique number different from other nodes e.g.: 99> 
relay-log-index   = /srv/podman/mariadb/log/slave-relay-bin.index 
relay-log         = /srv/podman/mariadb/log/slave-relay-bin 
log-bin           = /srv/podman/mariadb/log/slave-bin 
log-bin-index     = /srv/podman/mariadb/log/slave-bin.index 
binlog_format     = mixed 
log_slave_updates = ON 
expire_logs_days  = 1 
read-only         = 1 
gtid_domain_id    = <Must be unique number different from other nodes e.g. 12>

Note that server-id and gtid_domain_id must be unique. For instance, if your master nodes use server-id 01 with gtid_domain_id 10 or 11, assign different values to the replica.

Next, start the podman service:

systemctl daemon-reload 
systemctl start podman-mariadb-slave.service 

Tip: If the service fails to start, start the podman container “manually” to debug:

podman run \ 
  --name podman-mariadb-slave \ 
  -p 3308:3306 \ 
  -v /srv/podman/mariadb/conf/replica_slave.cnf:/etc/mysql/conf.d/replica.cnf:Z \ 
  -v /srv/podman/mariadb/data:/srv/podman/mariadb/data:Z \ 
  -v /srv/podman/mariadb/log:/srv/podman/mariadb/log:Z \ 
  localhost/mariadb:10.11.10 \ 
  --datadir=/srv/podman/mariadb/data 

I use the same data directory path in the container because mariabackup, which runs outside the container, will look for this directory – so there’s no need to override it separately in the config.

Connect to MariaDB in the container and start replication, first with binlog position:

mysql -h 127.0.0.1 -P 3308 -u <admin user> -p 

CHANGE MASTER TO 
    MASTER_HOST = '<CLUSTER IP>', 
    MASTER_PORT = 30306, 
    MASTER_USER = 'replicant', 
    MASTER_PASSWORD = '<password>', 
    MASTER_LOG_FILE = 'master-bin.000160', # xtrabackup_binlog_pos_innodb 
    MASTER_LOG_POS = 858584413, # xtrabackup_binlog_pos_innodb 
    MASTER_USE_GTID = NO; 

START SLAVE;     

# wait 20-30 sec then stop 
STOP SLAVE; 

## change to GTID 
CHANGE MASTER TO MASTER_USE_GTID = slave_pos; 

# start again 
START SLAVE; 

This starts the replication, which you can verify with the query below:

SHOW SLAVE STATUS\G 

If you see “Slave_IO_Running: Yes” and “Slave_SQL_Running: Yes”, your replica is up and running.
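
The same kind of check can be wired into monitoring. A small script along these lines (credentials, port and the 300-second lag threshold are assumptions) fails when a replication thread stops or the lag grows too large:

# Hypothetical health check: fails if either replication thread is stopped or lag exceeds 300 s
mysql -h 127.0.0.1 -P 3308 -u <admin user> -p'<password>' -e "SHOW SLAVE STATUS\G" \
  | awk '/Slave_IO_Running:|Slave_SQL_Running:/ {if ($2 != "Yes") bad=1}
         /Seconds_Behind_Master:/ {if ($2 == "NULL" || $2+0 > 300) bad=1}
         END {exit bad}' \
  && echo "replica OK" || echo "replica PROBLEM"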

Step 4 – Configure Backup Script

Creating a backup strategy depends on individual preferences, but I’ll present my own approach.

In the case of the Galera cluster, I schedule the backup for nighttime hours when the load is lower, so only one full backup is made per day. On the replication server, however, I can vary it more freely, as it doesn’t interfere with the operation of the cluster; here, backups are possible every 4-6 hours, or even hourly.

I chose six backups per day, i.e. one every four hours. I store these backups on local storage for three days. I don’t need to worry about long-term retention here, as the backups are copied to the dedicated backup server every day.
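
To get that four-hour cadence, a plain cron entry is enough (the script path and log file are placeholders):

# /etc/cron.d/mariadb-replica-backup - run the backup script every four hours
0 */4 * * * root /usr/local/bin/replica-backup.sh >> /var/log/replica-backup.log 2>&1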

Here are the backup script steps in a nutshell (a simplified sketch of the script follows the list):

  • Dependencies: Checks if all required commands (e.g. mariabackup, mydumper, podman, lvcreate) are installed.
  • Locking: Uses flock to ensure that only one instance of the script can run at a time.
  • Replica Health: Queries the health of the replica. If something goes wrong, the script exits with an error. Since the backup process is monitored in NetEye, we are automatically notified of the error.
  • Snapshot: Creates a “crash-consistent” LVM snapshot of the replica volume (e.g. /srv/podman/mariadb). Creating the snapshot is almost instantaneous and does not stop the database.
  • Mount snapshot: Mounts the newly created LVM snapshot on the file system.
  • Temporary database: Starts a Podman container with the same MariaDB server version. This container gets the mounted snapshot as its data directory (--datadir) and listens on a unique port (e.g. 3309). The script waits for this temporary database server to fully start and become available.
  • Perform backups: The script now connects to the secure, isolated, temporary database instance (on port 3309) and:
      • backs up the MariaDB configuration files,
      • creates a full, compressed physical backup (a binary copy of the data files) with mariabackup, for fast, complete restores,
      • creates a logical backup with mydumper, also compressed – this is more flexible, allowing single-table or single-database restores.
  • Cleanup: After the backups are completed (whether successful or not), the script automatically runs a cleanup process (guaranteed by a trap, so it runs even if there is an error):
      • stops and removes the temporary Podman container,
      • detaches (unmounts) the LVM snapshot,
      • deletes the LVM snapshot.
  • Finalize: If everything was successful, creates a “latest” symbolic link that always points to the folder of the most recent backup.
  • Delete: Deletes backups older than three days.
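
For reference, here is a heavily simplified sketch of that flow – not the production script: the volume group and LV names, snapshot size, credentials and destination paths are assumptions you’ll need to adapt.

#!/bin/bash
# Simplified sketch of the backup flow - VG/LV names, sizes, paths and credentials are assumptions.
set -euo pipefail

VG=vg_data                                   # assumed volume group
LV=lv_mariadb                                # assumed LV backing /srv/podman/mariadb
SNAP=mariadb-backup-snap
MNT=/mnt/${SNAP}
DEST=/srv/backup/$(date +%Y%m%d_%H%M)
PORT=3309
DB_USER='<admin user>'
DB_PW='<password>'

cleanup() {
  # Runs on every exit (see the trap below), even after an error
  podman rm -f mariadb-backup-tmp >/dev/null 2>&1 || true
  umount "${MNT}" 2>/dev/null || true
  lvremove -fy "/dev/${VG}/${SNAP}" 2>/dev/null || true
}
trap cleanup EXIT

# Only one instance of the script at a time
exec 9>/var/lock/mariadb-replica-backup.lock
flock -n 9 || { echo "another backup is already running"; exit 1; }

# Crash-consistent LVM snapshot of the replica volume (near-instant, the DB keeps running)
lvcreate -s -L 20G -n "${SNAP}" "/dev/${VG}/${LV}"
mkdir -p "${MNT}" "${DEST}"
mount "/dev/${VG}/${SNAP}" "${MNT}"          # add -o nouuid for XFS snapshots

# Temporary MariaDB container on the snapshot; the datadir path is kept identical inside
# and outside the container so that mariabackup, running on the host, finds the files
podman run -d --name mariadb-backup-tmp -p ${PORT}:3306 \
  -v "${MNT}/data:${MNT}/data:Z" \
  localhost/mariadb:10.11.10 --datadir="${MNT}/data" --skip-slave-start

# Wait until the temporary instance accepts connections
until mysqladmin -h 127.0.0.1 -P ${PORT} -u "${DB_USER}" -p"${DB_PW}" ping >/dev/null 2>&1; do
  sleep 2
done

# Configuration files, physical backup (mariabackup streamed through zstd), logical backup (mydumper)
cp -a /srv/podman/mariadb/conf "${DEST}/conf"
mariabackup --backup --host=127.0.0.1 --port=${PORT} --user="${DB_USER}" --password="${DB_PW}" \
  --stream=xbstream | zstd -q > "${DEST}/physical.mbstream.zst"
mydumper --host 127.0.0.1 --port ${PORT} --user "${DB_USER}" --password "${DB_PW}" \
  --compress --outputdir "${DEST}/logical"

# Point "latest" at the newest backup and prune anything older than three days
ln -sfn "${DEST}" /srv/backup/latest
find /srv/backup -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} +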

Summary

With this solution, we’ve made backing up the MariaDB Galera cluster much safer and more flexible. By introducing a dedicated read-only replica, we can now run consistent backups as often as every four hours without putting any load on the active cluster nodes. This eliminates the risk of synchronization lag or database “freezes”.

The benefits go beyond just backups. Because the replica is a real-time mirror of production, it’s the perfect place to run heavy reports or queries, keeping production fast and responsive.

Finally, this setup is vital for business continuity. If a critical error occurs, the replica can be promoted to a live production database in seconds, drastically reducing downtime. This makes the dedicated replica an essential part of any Disaster Recovery plan.
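
How you actually fail over depends on your own procedures, but the core of promoting the replica (a hedged sketch, run against the replica container) is just stopping replication and dropping the read-only flag; applications then need to be pointed at the replica’s address and port:

mysql -h 127.0.0.1 -P 3308 -u <admin user> -p -e "
  STOP SLAVE;
  RESET SLAVE ALL;
  SET GLOBAL read_only = OFF;"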

Although setting up the firewall rules and Podman containers takes some initial effort, the payoff is a stable, scalable, and fully automated system.

Useful links:

Using MariaDB Replication with MariaDB Galera Cluster
Global Transaction ID
Podman with Quadlet
Maria-backup

Author

Csaba Remenar
Technical Consultant at Würth IT Italy