25. 06. 2020 Alessandro Valentini NetEye

Configuring Fencing on Dell Servers

As a NetEye User I want to handle node failures when they happen in my cluster. When a node becomes unresponsive, it might still be accessing your data: the only way to ensure that a node is truly offline is to shut it down. This procedure is called fencing.

NetEye 4 relies on Corosync and Pacemaker, managed through the pcs command-line tool and often referred to collectively as PCS, to implement a Red Hat High Availability cluster. While part of the configuration is vendor-specific, the PCS setup is similar for all manufacturers: a PCS stonith resource restarts the node via a fence device should the node become unresponsive.

This article explains how to configure fencing on a Dell physical server, which is the most commonly used server in NetEye 4 installations. A fencing configuration is not required for voting-only cluster nodes or for elastic-only nodes as they are not part of the PCS cluster.

Configuring iDRAC

The integrated Dell Remote Access Controller (iDRAC) is a hardware component located on the motherboard that provides both a web interface and a command line interface to perform remote management tasks.

Before beginning, you should properly configure IPMI settings (Intelligent Platform Management Interface) and create a new account.

You can access the iDRAC web interface and enable IPMI over LAN under iDRAC Settings > Connectivity > Network > IPMI Settings.

Then create a new user with the username and password of your choice, read-only privileges for the console, and administrative privileges on IPMI.

Please note that you must replicate this configuration on each physical server.

Install Fence Devices

Next you need to install ipmilan fence devices on each server in order to use fencing on Dell servers:

yum install fence-agents-ipmilan

You will now find several new fence devices, including fence_idrac, and can show its properties:

pcs stonith list
pcs stonith describe fence_idrac

Test that the iDRAC interface is reachable using the default port 623:

nmap -sU -p623 <idrac_ip>

Finally, you can safely test your configuration by printing the chassis status of each node remotely:

ipmitool -I lanplus -H <iDRAC IP> -U <your_IPMI_username> -P <your_IPMI_password> -y <your_encryption_key> -v chassis status
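
Since this check has to be repeated for every physical server, it can be wrapped in a small loop. The following is only a sketch: the function name check_all_idracs is made up for illustration, and it relies on ipmitool's -E option, which reads the password from the IPMI_PASSWORD environment variable so that it does not show up in the process list. If you configured an encryption key, you would still need to add -y as in the command above.

```shell
#!/bin/bash
# Sketch: query the chassis status of several iDRACs in one pass.
# check_all_idracs is a hypothetical helper name, not part of any tool.
# ipmitool -E takes the password from the IPMI_PASSWORD environment variable.

check_all_idracs() {
    local user="$1"; shift
    local host failed=0
    for host in "$@"; do
        if ipmitool -I lanplus -H "$host" -U "$user" -E chassis status >/dev/null; then
            echo "OK: $host"
        else
            echo "FAILED: $host" >&2
            failed=1
        fi
    done
    # Non-zero exit status if at least one iDRAC did not answer.
    return "$failed"
}

# Usage:
#   export IPMI_PASSWORD='<your_IPMI_password>'
#   check_all_idracs <your_IPMI_username> <idrac_ip_1> <idrac_ip_2>
```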

Configuring PCS

Fencing is enabled through the cluster property stonith-enabled; STONITH is an acronym for Shoot-The-Other-Node-In-The-Head. Disable it until fencing is correctly configured in order to avoid any issues during the procedure:

pcs property set stonith-enabled=false
pcs stonith cleanup

At this point you can create a stonith resource for each node. In a 2-node cluster, the nodes may lose contact with each other, and each node then tries to fence the other one. Rebooting both nodes at the same time would result in downtime and could harm cluster integrity. To avoid this, configure a different delay for each resource (e.g., one without delay, and the other one with a delay of at least 5 seconds). To ensure the safety of your cluster, you should also set the reboot method to “cycle” instead of “onoff”.

pcs stonith create fence_node1 fence_idrac ipaddr="<iDRAC ip or fqdn>" delay="0" lanplus="1" login="<IPMI_username>" passwd_script="<password_script>" method="cycle" pcmk_host_list="node1.neteyelocal"
pcs stonith create fence_node2 fence_idrac ipaddr="<iDRAC ip or fqdn>" delay="5" lanplus="1" login="<IPMI_username>" passwd_script="<password_script>" method="cycle" pcmk_host_list="node2.neteyelocal"
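
To keep the delays distinct without editing each command by hand, the commands can be generated from a list of nodes. This is a sketch under a few assumptions: the helper name print_stonith_commands, the DELAY_STEP variable, and the fence_<short hostname> naming scheme are illustrative choices, and the script only prints the commands so that you can review them before running anything.

```shell
#!/bin/bash
# Sketch: print one "pcs stonith create" command per node, each with a
# different delay. Input on stdin: one "<node fqdn> <iDRAC address>" per line.
# print_stonith_commands and DELAY_STEP are hypothetical names.

DELAY_STEP=5   # seconds added for each subsequent node

print_stonith_commands() {
    local user="$1" script="$2"
    local node idrac delay=0
    while read -r node idrac; do
        [ -z "$node" ] && continue          # skip empty lines
        printf 'pcs stonith create fence_%s fence_idrac ipaddr="%s" delay="%s" lanplus="1" login="%s" passwd_script="%s" method="cycle" pcmk_host_list="%s"\n' \
            "${node%%.*}" "$idrac" "$delay" "$user" "$script" "$node"
        delay=$((delay + DELAY_STEP))
    done
}

# Usage (192.0.2.x are documentation-only example addresses):
#   printf '%s\n' 'node1.neteyelocal 192.0.2.11' 'node2.neteyelocal 192.0.2.12' \
#       | print_stonith_commands <IPMI_username> <password_script>
```

The first node gets no delay and every following node gets 5 seconds more than the previous one, which matches the 2-node example above.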

Rather than storing your password directly in the resource, you should set up a password script, for instance a very simple bash script like the one below. The script should be readable and executable only by the root user, which prevents your iDRAC password from being extracted from the PCS resource. Place the script in /usr/local/bin/ so that you can invoke it as a regular command:

#!/bin/bash
echo "my_secret_psw"
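
One possible way to create such a script with restrictive permissions from the start is sketched below. The helper name install_password_script is made up for this example; it reads the password from standard input so that it does not end up in the shell history, and it should be run as root so that root owns the resulting file.

```shell
#!/bin/bash
# Sketch: write a password script that only its owner can read and execute.
# install_password_script is a hypothetical helper name.

install_password_script() {
    local dest="$1" password
    # Read one line from stdin; accept a last line without trailing newline.
    IFS= read -r password || [ -n "$password" ] || return 1
    (
        umask 077                      # a newly created file starts out private
        {
            echo '#!/bin/bash'
            # %q quotes the password so special characters survive unchanged.
            printf 'printf "%%s\\n" %q\n' "$password"
        } > "$dest"
    ) || return 1
    chmod 700 "$dest"                  # rwx for the owner only
}

# Usage (as root), then type the password and press Enter:
#   install_password_script /usr/local/bin/<password_script>
```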

If everything has been properly configured, then running pcs status should show the fence device with status Started.

To prevent unwanted fencing in the event of minor network outages, increase the totem token timeout to at least 5 seconds by editing /etc/corosync/corosync.conf as follows:

totem {
    version: 2
    cluster_name: neteye
    secauth: off
    transport: udpu
    token: 5000  
}

Then sync this configuration file to all other cluster nodes and reload corosync:

pcs cluster sync
pcs cluster reload corosync
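
To double-check the value on each node, the token can be read back from the configuration file. The sketch below uses hypothetical function names and assumes the usual "key: value" layout shown above; it only inspects the file and does not query the running corosync daemon.

```shell
#!/bin/bash
# Sketch: read the totem token timeout (milliseconds) from corosync.conf and
# report whether it is at least 5000. Function names are illustrative.

get_totem_token() {
    local conf="${1:-/etc/corosync/corosync.conf}"
    awk '
        /[{]/ { depth++; if (depth == 1 && $1 == "totem") in_totem = 1 }
        # Only accept "token:" directly inside totem { }, not in sub-sections.
        in_totem && depth == 1 && $1 == "token:" { print $2; exit }
        /[}]/ { depth--; if (depth == 0) in_totem = 0 }
    ' "$conf"
}

check_totem_token() {
    local token
    token=$(get_totem_token "$1")
    case "$token" in
        ''|*[!0-9]*) echo "no numeric token value found in the totem section"; return 1 ;;
    esac
    if [ "$token" -lt 5000 ]; then
        echo "token=${token}ms is below 5000ms"
        return 1
    fi
    echo "token=${token}ms"
}

# Usage:
#   check_totem_token /etc/corosync/corosync.conf
```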

Unwanted fencing might also happen when a node “commits suicide”, i.e., shuts itself down because it was unable to contact the other node of the cluster. This is an unwanted situation because all nodes of a cluster might be fenced at the same time. To avoid this, set a location constraint that prevents each node’s stonith resource from running on that node itself:

pcs constraint location fence_node1 avoids node1.neteyelocal
pcs constraint location fence_node2 avoids node2.neteyelocal

Now that fencing is configured, you only need to set the stonith-enabled property to true to enable it:

pcs property set stonith-enabled=true
pcs stonith cleanup

Always remember to temporarily disable fencing during updates/upgrades.
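
A small wrapper can make it harder to forget the second half of that step. This is only a sketch: with_fencing_disabled is a hypothetical function name, it reuses the pcs commands shown above, and whether fencing should be re-enabled automatically after a failed update is a judgment call you may want to handle differently.

```shell
#!/bin/bash
# Sketch: run a maintenance command with fencing disabled, then re-enable it.
# with_fencing_disabled is a hypothetical helper name.

with_fencing_disabled() {
    pcs property set stonith-enabled=false || return 1
    local status=0
    "$@" || status=$?
    # Re-enable fencing whether or not the command succeeded.
    pcs property set stonith-enabled=true
    pcs stonith cleanup
    # Report the exit status of the wrapped command.
    return "$status"
}

# Usage:
#   with_fencing_disabled <your_update_command>
```

Note that the wrapper does not handle interruptions such as Ctrl-C, so check the stonith-enabled property afterwards if the command was aborted.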

Alessandro Valentini

Consultant at Würth Phoenix
