This is the second edition of "Building a cluster with a shared FS (GFS2) on RedHat 6".
This time, the POC will be built in a KVM environment. Two nodes with two shared disks, connected via multipath, will simulate a complete SAN environment. See the KVM recipes for how to implement this; here we will assume that the SAN is well simulated.
One shared disk (1G in size) will hold the data and will be formatted as GFS2. The second disk (10M) will be used as a quorum disk to resolve split-brain situations caused by network failures.
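For reference, a minimal sketch of how such a shared disk might be attached in libvirt. This is not the full recipe (see the KVM recipes for that); the image path, target device, and controller numbers are assumptions. Each image is attached twice to every guest (once per SCSI controller) so that the guest sees two paths, and the serial number is what later shows up inside the wwid:

<!-- hypothetical fragment of a guest's domain XML; the same image file is
     attached to both nodes, and twice per node (two SCSI controllers)
     so multipathd inside the guest sees two paths to one LUN -->
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>        <!-- cache=none is required for shared disks -->
  <source file='/var/lib/libvirt/images/shared_gfs.img'/>
  <target dev='sda' bus='scsi'/>
  <serial>1010101</serial>                             <!-- becomes part of the wwid in the guest -->
  <shareable/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

The second path is the same element with target dev='sdc' and controller='1'; the quorum disk is attached the same way with serial 1010102.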
There is no DNS in this POC, so put the node names into /etc/hosts on both nodes:
root@node1:~ # cat /etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.122.198 node1
192.168.122.181 node2
Generate root SSH keys and exchange them between the cluster nodes:
root@node1:~ # ssh-keygen -t rsa -b 1024 -C "root@vorh6t0x"
.....
root@node1:~ # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@node1:~ # scp -pr .ssh node2:
Do not forget to disable the firewall (iptables) and SELinux on both nodes.
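On RHEL 6 that boils down to something like this (run on both nodes; setenforce 0 only switches to permissive mode until the next reboot, the config change makes it permanent):

# chkconfig iptables off ; /etc/init.d/iptables stop
# chkconfig ip6tables off ; /etc/init.d/ip6tables stop
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# setenforce 0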
Install these RPMs on both nodes (with all dependencies):
# yum install openssh-clients wget rsync ntp ntpdate vim-common gfs2-utils \
	device-mapper-multipath pcs pacemaker fence-agents lvm2-cluster
These are not only the cluster packages, but also other software required in this POC that is not included in a minimal installation of the OS.
RedHat 7 uses pacemaker with corosync as its default cluster stack. Before that, rgmanager and cman were used instead. The first edition of this article covered the rgmanager and cman configuration. This time we will use pacemaker with corosync, on RedHat 6.9.
It is important to understand what we actually want from the cluster when it comes to resolving a split-brain.
In addition to the usual fencing, I am going to add a quorum disk that will help resolve a split-brain caused by a network failure. This additional vote helps the cluster decide which node should be fenced.
I do not like the password-based method recommended by the vendor. Here's a workaround that avoids using a password:
root@node1:~ # /etc/init.d/pcsd start
Starting pcsd:                                             [  OK  ]
root@node1:~ # /etc/init.d/pcsd stop
Stopping pcsd:                                             [  OK  ]
root@node1:~ # cd /var/lib/pcsd
root@node1:/var/lib/pcsd # ll
total 12
-rwx------. 1 root root   60 Oct 10 18:04 pcsd.cookiesecret
-rwx------. 1 root root 1180 Oct 10 18:04 pcsd.crt
-rwx------. 1 root root 1679 Oct 10 18:04 pcsd.key
We have started and immediately stopped the pcsd daemon. As a result, some files were created in the /var/lib/pcsd directory. The next step is to create the missing authorization files:
root@node1:/var/lib/pcsd # TOKEN=$(uuidgen)
root@node1:/var/lib/pcsd # cat > pcs_users.conf << EOFcat
[
  {
    "creation_date": "$(date)",
    "username": "hacluster",
    "token": "$TOKEN"
  }
]
EOFcat
root@node1:/var/lib/pcsd # cat > tokens << EOFcat
{
  "format_version": 2,
  "data_version": 2,
  "tokens": {
    "node1": "$TOKEN",
    "node2": "$TOKEN"
  }
}
EOFcat
root@node1:/var/lib/pcsd # chmod 600 tokens
root@node1:/var/lib/pcsd # ll
total 20
-rw-r--r--. 1 root root  141 Oct 10 18:06 pcs_users.conf
-rwx------. 1 root root   60 Oct 10 18:04 pcsd.cookiesecret
-rwx------. 1 root root 1180 Oct 10 18:04 pcsd.crt
-rwx------. 1 root root 1679 Oct 10 18:04 pcsd.key
-rw-------. 1 root root  224 Oct 10 18:07 tokens
Finally, copy the entire /var/lib/pcsd directory to the other node, then enable and start the pcsd daemon on both:
root@node1:~ # rsync -a /var/lib/pcsd/ node2:/var/lib/pcsd/
root@node1:~ # for h in node{1,2} ; do
	ssh $h "chkconfig pcsd on ; /etc/init.d/pcsd start"
done
Verify that the authorization works:
root@node1:~ # pcs cluster auth node1 node2
node1: Already authorized
node2: Already authorized
root@node1:~ # pcs cluster setup --start --enable --name mycluster node1 node2 --transport udpu
Warning: Using udpu transport on a RHEL 6 cluster, cluster restart is required after node add or remove
Destroying cluster on nodes: node1, node2...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Successfully destroyed cluster
node1: Successfully destroyed cluster
Sending cluster config files to the nodes...
node1: Updated cluster.conf...
node2: Updated cluster.conf...
Starting cluster on nodes: node1, node2...
node1: Starting Cluster...
node2: Starting Cluster...
node1: Cluster Enabled
node2: Cluster Enabled
Synchronizing pcsd certificates on nodes node1, node2...
node1: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node2: Success
I am using transport="udpu" here because my network does not support multicast, and broadcast is not welcome either. Without this option my cluster behaves unpredictably.
Check the results:
root@node1:~ # pcs status
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: cman
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Tue Oct 10 18:15:54 2017
Last change: Tue Oct 10 18:14:51 2017 by root via crmd on node1

2 nodes and 0 resources configured

Online: [ node1 node2 ]

No resources

Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled
The status shows cman being used as the stack. This is good, because CLVM only knows how to work with cman. However, you can also see a running corosync process; it is started and controlled by cman, so there is no need to configure corosync separately.
We will create the LVM structure on the multipath device, rather than on the underlying SCSI disks. It is important to configure the correct filter string in /etc/lvm/lvm.conf to explicitly include the multipath devices and exclude everything else, otherwise you will get "Duplicate PV found" warnings and LVM may decide to use a single-path disk instead of the multipath device. Here is an example of my "filter" line, allowing only the "rootvg" device and the multipath devices:
filter = [ "a|^/dev/vda2$|", "a|^/dev/mapper/pv_|", "r|.*|" ]
Replicate the configuration file to the other node:
root@node1:~ # rsync -a /etc/lvm/lvm.conf node2:/etc/lvm/lvm.conf
Create the /etc/multipath.conf configuration file. As usual, I set up names (aliases) for the multipath devices to make management easier:
defaults {
	user_friendly_names	yes
	flush_on_last_del	yes
	queue_without_daemon	no
	no_path_retry		fail
}
blacklist {
	wwid	"*"
}
blacklist_exceptions {
	wwid	"0QEMU QEMU HARDDISK 1010101"
	wwid	"0QEMU QEMU HARDDISK 1010102"
}
multipaths {
	multipath {
		wwid	"0QEMU QEMU HARDDISK 1010101"
		alias	pv_gfs
	}
	multipath {
		wwid	"0QEMU QEMU HARDDISK 1010102"
		alias	quorum
	}
}
Your wwids will differ from mine. Do not forget to add them to blacklist_exceptions too, not only to the aliases list.
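If you do not know your wwids in advance, one way on RHEL 6 is to query one of the underlying SCSI disks directly; the output shown below is only an illustration of what to expect:

# /lib/udev/scsi_id --whitelisted --device=/dev/sda
0QEMU    QEMU HARDDISK   1010101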
Replicate the configuration file to the other node:
root@node1:~ # rsync -a /etc/multipath.conf node2:/etc/multipath.conf
Start multipathd, make it start at system boot, and build the multipath maps (on both nodes):
# /etc/init.d/multipathd start
# chkconfig --add multipathd
# chkconfig multipathd on
# multipath -F
# multipath
create: quorum (0QEMU QEMU HARDDISK 1010102) undef QEMU,QEMU HARDDISK
size=10M features='0' hwhandler='0' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| `- 2:0:1:0 sdb 8:16 undef ready running
`-+- policy='round-robin 0' prio=1 status=undef
  `- 3:0:1:0 sdd 8:48 undef ready running
create: pv_gfs (0QEMU QEMU HARDDISK 1010101) undef QEMU,QEMU HARDDISK
size=1.0G features='0' hwhandler='0' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| `- 2:0:0:0 sda 8:0 undef ready running
`-+- policy='round-robin 0' prio=1 status=undef
  `- 3:0:0:0 sdc 8:32 undef ready running
As you can see, my data LUN appears as /dev/mapper/pv_gfs, exactly matching LVM's filter line.
Enable the LVM cluster features on both nodes, start clvmd and make it start at system boot:
# lvmconf --enable-cluster
# /etc/init.d/clvmd start
# chkconfig --add clvmd
# chkconfig clvmd on
Create the PV, clustered VG and LV on one node:
root@node1:~ # pvcreate --dataalignment 4k /dev/mapper/pv_gfs
  Physical volume "/dev/mapper/pv_gfs" successfully created
root@node1:~ # vgcreate -c y vg_gfs /dev/mapper/pv_gfs
  Clustered volume group "vg_gfs" successfully created
root@node1:~ # lvcreate -n export -l100%FREE /dev/vg_gfs
  Logical volume "export" created.
Check on the second node, using the pvs, vgs and lvs commands, that everything is visible there too:
root@node2:~ # vgs
  VG     #PV #LV #SN Attr   VSize    VFree
  rootvg   1   2   0 wz--n-   19.80g 15.89g
  vg_gfs   1   1   0 wz--nc 1020.00m      0
root@node2:~ # lvs
  LV     VG     Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  slash  rootvg -wi-ao----    2.93g
  swap   rootvg -wi-ao---- 1000.00m
  export vg_gfs -wi-a----- 1020.00m
The c at the end of the Attr field in the vgs output indicates that this VG is clustered. If you stop clvmd, the clustered VG will no longer be shown. Start it again; we will need it in the next step.
Create the GFS2 filesystem on one node as follows:
root@node1:~ # mkfs.gfs2 -p lock_dlm -t mycluster:export -j 2 /dev/vg_gfs/export
This will destroy any data on /dev/vg_gfs/export.
It appears to contain: symbolic link to `../dm-4'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/vg_gfs/export
Blocksize:                 4096
Device Size                1.00 GB (261120 blocks)
Filesystem Size:           1.00 GB (261118 blocks)
Journals:                  2
Resource Groups:           4
Locking Protocol:          "lock_dlm"
Lock Table:                "mycluster:export"
UUID:                      729885fb-9052-77c0-ae7b-da37ac498c1c
where mycluster is the cluster name, export is the filesystem name, and -j 2 creates two journals because we have two nodes.
Then, mount it (on both nodes):
# mkdir /export
# mount -o noatime,nodiratime -t gfs2 /dev/vg_gfs/export /export
Copy some data there on node1 and read it back on node2:
root@node1:~ # rsync -av /etc /export/
..
root@node2:~ # find /export/ -ls
..
Add the GFS2 FS to /etc/fstab on both nodes, like this:
# grep gfs2 /etc/fstab
/dev/vg_gfs/export	/export	gfs2	noatime,nodiratime	0 0
# chkconfig --add gfs2 ; chkconfig gfs2 on
The /etc/init.d/gfs2 script, part of "gfs2-utils", will mount and unmount the GFS2 filesystems listed in /etc/fstab at the appropriate time: after the cluster has started and before it goes down.
First of all, stop (on both nodes) all the services that we will later define as cluster resources:
# umount /export
# /etc/init.d/clvmd stop
Our quorum disk has already been defined in the multipath part; it only remains to format it as a quorum device:
root@node1:~ # mkqdisk -c /dev/mapper/quorum -l QD1
mkqdisk v3.0.12.1

Writing new quorum disk label 'QD1' to /dev/mapper/quorum.
WARNING: About to destroy all data on /dev/mapper/quorum; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...
root@node1:~ #
Check that node2 can see the quorum device too:
root@node2:~ # mkqdisk -L
mkqdisk v3.0.12.1

/dev/block/253:3:
/dev/disk/by-id/dm-name-quorum:
/dev/disk/by-id/dm-uuid-mpath-0QEMU\x20\x20\x20\x20QEMU\x20HARDDISK\x20\x20\x201010102:
/dev/dm-3:
/dev/mapper/0QEMU QEMU HARDDISK 1010102:
/dev/mapper/quorum:
	Magic:                eb7a62c2
	Label:                QD1
	Created:              Wed Oct 11 14:07:55 2017
	Host:                 node1
	Kernel Sector Size:   512
	Recorded Sector Size: 512

root@node2:~ #
There is no tool to define the quorum disk online, so it is time to shut down the cluster:
root@node1:~ # pcs cluster stop --all
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (cman)...
node1: Stopping Cluster (cman)...
Open /etc/cluster/cluster.conf in your favorite text editor and fix it:
..
  <cman broadcast="no" expected_votes="2" transport="udpu"/>
  <quorumd interval="1" label="QD1" tko="9" votes="1">
    <heuristic program="ping -c1 -W1 -w1 192.168.122.1" interval="1" score="1" tko="3"/>
  </quorumd>
  <totem token="20000"/>
..
Find the cman definition and correct it: remove the two_node attribute and increase expected_votes. Then add the quorumd section; the label is the one you created in the mkqdisk step. The heuristic ping targets my default gateway; your gateway will be different. The totem token must be large enough to cover the quorumd timeouts.
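Before distributing the file, it does not hurt to check it against the schema; the cman package ships a validation tool for this (the output below is what a clean run is expected to print):

root@node1:~ # ccs_config_validate
Configuration validates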
Copy the configuration file to node2 and start the cluster on both nodes:
root@node1:~ # rsync -a /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf
root@node1:~ # pcs cluster start
Starting Cluster...
root@node2:~ # pcs cluster start
Starting Cluster...
It is time to add fencing to the cluster. Because I am in a KVM environment, I will use fence_xvm as described in the KVM recipes. You must use fencing that fits your environment. The cluster will not work without fencing.
To see all fencing methods available to you:
root@node1:~ # pcs stonith list
Pick the one suitable for you and see its configuration options:
root@node1:~ # pcs stonith describe fence_xvm
Simply because my guest names and node names are the same, I do not need to provide mapping information or define separate fencing per node. It is enough for me to define a very generic fencing method:
root@node1:~ # pcs stonith create kvm-kill fence_xvm
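Assuming fence_virtd is already running on the KVM host (see the KVM recipes), verify that the node can actually talk to it and that pacemaker has started the stonith resource; the listing below is only an illustration of the expected shape:

root@node1:~ # fence_xvm -o list        # should print both guests and their power state
root@node1:~ # pcs stonith
 kvm-kill       (stonith:fence_xvm):    Started node1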
It is time to do some tests. Turn the network off on one node. Cause a kernel crash on one node (HINT: echo c > /proc/sysrq-trigger).
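A sketch of how such tests might be driven and observed (the interface name eth0 is an assumption; watch the surviving node while the victim is being fenced):

root@node1:~ # ip link set eth0 down            # test 1: simulate network failure on node1
root@node2:~ # tail -f /var/log/messages        # watch the fencing decision on the survivor
root@node2:~ # pcs status                       # node1 should be fenced; node2 keeps quorum thanks to the quorum disk
root@node1:~ # echo c > /proc/sysrq-trigger     # test 2: crash the kernel on node1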
The only resource will be a gfs2-healthcheck script, which we will create and put into /etc/init.d (on both nodes):
# cat /etc/init.d/gfs2-healthcheck
#!/bin/bash
#
# chkconfig: - 24 76
# description: Check if GFS2 FS healthy
# Short-Description: Check if GFS2 FS healthy
# Description: Check if GFS2 FS healthy

rtrn=0
MOUNTS=$(awk '/ gfs2 /{print $2}' /proc/mounts)
for M in $MOUNTS ; do
	# Check for RW access:
	touch $M/.healthcheck.$$ || rtrn=1
	rm -f $M/.healthcheck.$$ || rtrn=1
done
exit $rtrn
# chmod +x /etc/init.d/gfs2-healthcheck
Then add it to cluster:
root@node1:~ # pcs resource create gfs-check lsb:gfs2-healthcheck clone
This script checks whether the GFS2 filesystem is still available for read/write operations. If it is not, the bad node will be fenced and the second node will continue its job.
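After that, the clone should be reported as running on both nodes; the output will look roughly like this:

root@node1:~ # pcs resource
 Clone Set: gfs-check-clone [gfs-check]
     Started: [ node1 node2 ]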