As soon as you manage more than a few DRBD resources distributed over a wide set of hardware, split-brain situations cannot always be avoided. A standard split brain occurs when multiple nodes disagree about the latest state of the data on their local disks.
Disclaimer: If applied incorrectly, the commands in this blog post may cause data loss. If you are unsure about any step here, be sure to back up your data first.
Standard problematic cluster situations are commonly resolved by disconnecting the faulty node, discarding its data, and then reconnecting it to the primary node, as outlined below.
In the following example we will assume a two-node setup, with primary.example.com being the node holding the good data in Primary state, and faulty.example.com being the node which needs to be “fixed”.
Before proceeding with either procedure, please make sure that your primary node contains the copy of the data you want to keep!
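If you are not sure which node currently holds that copy, a quick sanity check is to query each node's role first (here $RESOURCE_NAME stands for your actual resource name):

[root@primary]# drbdadm role $RESOURCE_NAME
[root@faulty]# drbdadm role $RESOURCE_NAME

The node whose data you intend to keep should report Primary, while the data on the faulty node will be discarded.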
[root@faulty]# drbdadm disconnect $RESOURCE_NAME
[root@faulty]# drbdadm --discard-my-data connect $RESOURCE_NAME:primary.example.com
[root@primary]# drbdadm connect $RESOURCE_NAME:faulty.example.com
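After reconnecting, it is worth verifying that the resource actually resynchronizes. Depending on your DRBD version, either of the following should show the connection and disk states (a monitoring step, not part of the recovery itself):

[root@primary]# drbdadm status $RESOURCE_NAME    # DRBD 9
[root@primary]# cat /proc/drbd                   # DRBD 8.x

You should see the faulty node syncing from the primary, with its disk state eventually returning to UpToDate.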
This is the standard procedure; however, it does not resolve the split brain in all cases. Sometimes the so-called “metadata”, which DRBD uses to keep track of its own actions, gets corrupted.
If you find yourself in the situation that, after following the aforementioned procedure, the disk is still “Inconsistent” or the connection between the nodes doesn’t advance beyond the “Connecting” state, you’re most probably a victim of metadata corruption.
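You can confirm this diagnosis from the command line, since drbdadm can print the disk and connection states directly:

[root@faulty]# drbdadm dstate $RESOURCE_NAME
[root@faulty]# drbdadm cstate $RESOURCE_NAME

If dstate keeps reporting Inconsistent, or cstate is stuck in Connecting even though the peer is reachable, the metadata itself is the likely culprit.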
In this case you’ll need to invalidate the DRBD resource on the “faulty” node, which overwrites the local data with data from its peers, and to recreate the metadata from scratch, so that afterwards the metadata once again reflects the state of the local data and the remote nodes.
[root@faulty]# drbdadm invalidate $RESOURCE_NAME
[root@faulty]# drbdadm down $RESOURCE_NAME
[root@faulty]# drbdadm create-md $RESOURCE_NAME
[root@faulty]# drbdadm adjust $RESOURCE_NAME
[root@primary]# drbdadm adjust $RESOURCE_NAME
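The full resynchronization triggered by the invalidate can take a while on large volumes. One simple way to keep an eye on its progress (assuming DRBD 9’s status output; on 8.x check /proc/drbd instead):

[root@faulty]# watch -n2 drbdadm status $RESOURCE_NAME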
As mentioned in earlier posts, I strongly suggest that you use the drbdtop tool, available from the neteye-extras repository. You can use it to supervise and analyze the progress and state of your DRBD resources both during and after the recovery process.
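Assuming the neteye-extras repository is already configured on your system and the package is simply named drbdtop (an assumption on my part), installing and launching it would look something like this:

[root@primary]# yum install drbdtop   # package name assumed
[root@primary]# drbdtop

Once started, drbdtop presents a top-like overview of all resources, which makes it easy to spot a node that is lagging behind during the resync.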