02. 07. 2019 Benjamin Gröber Clustering, NetEye

How To Recover from a DRBD9 Metadata Split Brain Situation

As soon as you manage more than a few DRBD resources distributed over a wide set of hardware, split brain situations cannot always be avoided. A standard split brain occurs when multiple nodes have differing opinions about the latest state of the data on their local disks.
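A quick way to confirm that you are actually dealing with a split brain (a minimal check, assuming DRBD 9; $RESOURCE_NAME stands for the affected resource, and the commands run on the node you suspect) is to look at the connection state and the kernel log:

[root@node]# drbdadm status $RESOURCE_NAME           # a split-brain victim typically sits in connection:StandAlone
[root@node]# journalctl -k | grep -i "split-brain"   # DRBD usually logs a "Split-Brain detected ..." message when it refuses to reconnect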


Disclaimer: If applied incorrectly, commands in this blog post may potentially cause data loss. If you are unsure about any step here, please be sure to back up your data first.

Standard problematic cluster situations are commonly resolved by disconnecting the faulty node, discarding its data, and reconnecting it to the primary node, as outlined below.

Standard Split Brain

In the following example we will assume a two-node setup, with primary.example.com being the node with good data in the Primary state, and faulty.example.com being the node that needs to be “fixed”.

Before proceeding with either procedure, please make sure that your primary node contains the copy of the data you want to keep!
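One way to double-check this before discarding anything (a short sketch, assuming the two-node setup described above) is to compare roles and disk states on both nodes:

[root@primary]# drbdadm role $RESOURCE_NAME      # should report Primary on the node whose data you want to keep
[root@primary]# drbdadm status $RESOURCE_NAME    # the copy to keep should show disk:UpToDate
[root@faulty]# drbdadm status $RESOURCE_NAME     # the side to be discarded is typically Secondary with an outdated or Inconsistent disk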

[root@faulty]# drbdadm disconnect $RESOURCE_NAME
[root@faulty]# drbdadm connect --discard-my-data $RESOURCE_NAME:primary.example.com   # drop the local changes and resync from the primary

[root@primary]# drbdadm connect $RESOURCE_NAME:faulty.example.com
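If everything goes well, the faulty node now resynchronizes from the primary. You can follow the progress with plain drbdadm (a minimal sketch; the exact status fields may vary between drbd-utils versions):

[root@faulty]# watch drbdadm status $RESOURCE_NAME   # the resync should progress until the local disk is UpToDate again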

This is the standard procedure; however, it does not resolve the split brain in all cases. Sometimes the so-called “metadata”, which DRBD uses to keep track of its own state, gets corrupted.

Metadata Split Brain

If you find yourself in the situation that, after following the aforementioned procedure, the disk is still “Inconsistent” or the connection between nodes doesn’t advance further than the “Connecting” state, you’re most probably a victim of metadata corruption.

In this case you’ll need to invalidate the DRBD resource on the “faulty” node, which marks the local data as outdated and forces it to be overwritten with data from its peers, and then recreate the metadata from scratch, so that after the procedure the metadata once again matches the state of the local data and the remote nodes.

[root@faulty]# drbdadm invalidate $RESOURCE_NAME    # mark the local data as out-of-date, forcing a full resync from the peer
[root@faulty]# drbdadm down $RESOURCE_NAME          # take the resource down so the metadata can be recreated
[root@faulty]# drbdadm create-md $RESOURCE_NAME     # recreate the on-disk metadata from scratch
[root@faulty]# drbdadm adjust $RESOURCE_NAME        # bring the resource back up according to the current configuration

[root@primary]# drbdadm adjust $RESOURCE_NAME       # re-establish the connection from the primary side

As mentioned in earlier posts, I strongly suggest that you use the drbdtop tool, available from the neteye-extras repository. You can use it to supervise and analyze the progress and state of the DRBD resources both during and after the recovery process.
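If drbdtop is installed, launching it without arguments is enough to get a live, interactive view; otherwise drbdsetup offers a non-interactive alternative (a minimal sketch, assuming a reasonably recent drbd-utils):

[root@primary]# drbdtop                                                   # interactive overview of all resources and their sync state
[root@primary]# drbdsetup status --verbose --statistics $RESOURCE_NAME   # one-shot, detailed status of a single resource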

Benjamin Gröber


R&D Software Architect at Wuerth Phoenix
Hi, my name is Benjamin and I'm a Software Architect in the System Integration Research & Development Team at Wuerth Phoenix. I discovered my passion for Computers and Technology when I got my first PC shortly after my 7th birthday in 1999. Using computers and playing with them soon got boring, and so, just a few years later, I taught myself Visual Basic and entered the world of Software Development. Since then I've loved trying to keep up with the short-lived, fast-evolving IT world and exploring new technologies, eventually putting them to good use. Lately I've been investing my free time in the relatively new languages Go and Rust.

