Fencing allows the surviving node to be sure that its partner is definitely dead. You have to configure a fencing device that disconnects the suspicious node from the service. It can be an FC switch that cuts the node off from the data disks, an Ethernet switch that isolates the node from the network, or an intelligent PDU that powers the server off. We have HP servers with iLO on board, and iLO will shut the servers down for us.
Once a cluster node suspects a split-brain situation (we will test this by stopping the network on one node), it will try to kill its partner. One of them (the one without network in our case) will not succeed in killing its peer, while the other (the healthy one) will. In almost all cases the healthy node wins the fencing round. Even in the opposite case, the winner can be sure that the split-brain situation is resolved and data corruption cannot happen.
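If you want to reproduce this test later, a minimal sketch looks like the following; it assumes eth0 is the cluster interconnect on node1, so adjust the interface name to your setup, and run it only after fencing is configured and enabled at the end of this article:

node1: # ip link set eth0 down                # simulate loss of cluster communication
node2: # crm_mon                              # watch node1 go offline and get fenced
node2: # grep -i stonith /var/log/messages    # fencing actions are logged here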
A few words about the SBD solution proposed by SuSE. I do not recommend using it in any case. SBD just adds a communication channel between the nodes other than Ethernet. This can solve a limited set of problems, all related to the network. It will help with exactly the test case we run here (stopping the network), but it will not solve other problems such as a hung, unresponsive, or even hardware-broken server. In those cases the failed server will never read the message telling it to commit suicide, as the SBD mechanism expects.
Set a complex password for it; this password will not be easy to change once the cluster is using it. I prefer to use SSH keys instead of a user/password pair. The cluster software runs as the hacluster user on SLES, so let's prepare SSH keys usable by this user and copy them between the nodes:
node1:/etc/cluster # ssh-keygen -t rsa -b 2048 -f stonith_id_rsa
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in stonith_id_rsa.
Your public key has been saved in stonith_id_rsa.pub.
....
node1:/etc/cluster # chown hacluster stonith_id_rsa
node1:/etc/cluster # chmod 400 stonith_id_rsa
node1:/etc/cluster # ls -l stonith_id_rsa
-r--------. 1 hacluster root 1679 Nov 8 11:19 stonith_id_rsa
node1:/etc/cluster # scp -p stonith_id_rsa* node2:/etc/cluster/
node1:/etc/cluster # ssh node2 "chown hacluster /etc/cluster/stonith_id_rsa*"
node1:/etc/cluster # cat stonith_id_rsa.pub

Copy the output of the last command and paste it into the iLO SSH key authorization page.
First of all, we should check that each node can connect to its partner's iLO via SSH:
node1: # ssh -i /etc/cluster/stonith_id_rsa stonith@ilonode2
</>hpiLO-> exit

node2: # ssh -i /etc/cluster/stonith_id_rsa stonith@ilonode1
</>hpiLO-> exit

Please add ilonode1 and ilonode2 to the /etc/hosts file on both nodes. Of course you can use the real IP addresses instead, but that is less flexible if you later decide to change your infrastructure and move the cluster to another location.
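For example (the addresses below are placeholders, use your real iLO IPs):

# /etc/hosts fragment, identical on both nodes
192.168.100.11   ilonode1
192.168.100.12   ilonode2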
If direct SSH access does not work, fix that first; you will probably need to adjust firewall rules or add some routes.
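A rough troubleshooting sketch, again with placeholder addresses that you must replace with your own:

node1: # ping -c 3 ilonode2                                        # basic reachability of the iLO
node1: # ssh -v -i /etc/cluster/stonith_id_rsa stonith@ilonode2    # verbose output shows where the connection fails
node1: # ip route add 192.168.100.0/24 via 10.0.0.1                # example: static route to a separate iLO network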
As the next step, we test the fence_ilo4_ssh command before configuring it in the cluster:
node1: # fence_ilo4_ssh --ip=ilonode2 --ssh --username=stonith \
        --identity-file=/etc/cluster/stonith_id_rsa --action=status
Power: On

node2: # fence_ilo4_ssh --ip=ilonode1 --ssh --username=stonith \
        --identity-file=/etc/cluster/stonith_id_rsa --action=status
Power: On
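Note that the command-line options map to differently named parameters in the cluster configuration below: --ip becomes ipaddr, --username becomes login, --identity-file becomes identity_file, and --ssh becomes secure. If in doubt, the agent itself lists everything it accepts:

# fence_ilo4_ssh --help | less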
Now create a file fencing1.xml with the following content:

<primitive id="stonith-node1" class="stonith" type="fence_ilo4_ssh">
  <instance_attributes id="stonith-node1-params">
    <nvpair id="stonith-node1-params-secure" name="secure" value="1"/>
    <nvpair id="stonith-node1-params-ipaddr" name="ipaddr" value="ilonode1"/>
    <nvpair id="stonith-node1-params-login" name="login" value="stonith"/>
    <nvpair id="stonith-node1-params-identity_file" name="identity_file" value="/etc/cluster/stonith_id_rsa"/>
    <nvpair id="stonith-node1-params-action" name="action" value="reboot"/>
    <nvpair id="stonith-node1-params-pcmk_host_list" name="pcmk_host_list" value="node1"/>
  </instance_attributes>
  <operations>
    <op name="monitor" interval="40s" timeout="20s" id="stonith-node1-monitor-40s"/>
  </operations>
</primitive>

This file creates the fencing device that kills node1; therefore it points to ilonode1. Please double-check this. Replicate the file:
# sed -e 's/node1/node2/g' fencing1.xml > fencing2.xml
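A quick sanity check with plain diff confirms that only the node names changed between the two files:

# diff fencing1.xml fencing2.xml

Every differing line should simply have node1 replaced by node2 (including ilonode1 versus ilonode2).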
Insert our configuration using the cibadmin command:

# cibadmin -C -o resources --xml-file fencing1.xml
# cibadmin -C -o resources --xml-file fencing2.xml

If a command prints a lot of XML as its result, you probably have a problem with the XML syntax of your file. Check that every " is closed (the most common problem).
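If cibadmin complains, you can validate the files beforehand and confirm afterwards that both primitives really made it into the CIB; xmllint ships with libxml2, and cibadmin -Q queries the live configuration:

# xmllint --noout fencing1.xml
# xmllint --noout fencing2.xml
# cibadmin -Q -o resources | grep stonith-node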
Now connect to the HAWK interface and check that both stonith resources are running. The location of the resources is not important; the cluster will use them correctly when needed. Please check /var/log/messages if you have problems with the resources starting.
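If you prefer the command line to HAWK, a one-shot status dump gives the same picture:

# crm_mon -1r

Both stonith-node1 and stonith-node2 should be reported as Started.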
Once both resources are in the green state, you can turn the stonith-enabled cluster property back on using HAWK. Now you have a fully functional SuSE cluster.
I recommend turning STONITH off while adding resources or making any other changes to the cluster. This eliminates unnecessary ping-pong fencing in the cluster. Do not forget to turn it back ON when you have finished.
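For reference, the same switch is available from the crm shell; stonith-enabled is a standard Pacemaker cluster property, not something specific to this setup:

# crm configure property stonith-enabled=false    # before adding resources / making changes
# crm configure property stonith-enabled=true     # when finished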