This is an attempt to build something similar to a metro cluster capability using Red Hat 6 only. The POC simulates two different sites, each with its own server and storage. Both servers see both LUNs from both sites. Linux LVM (or md) mirrors between the LUNs coming from the different sites. The cluster software does failover between the nodes (sites). FS reads should come from the closest disk; writes should go to both.
This POC uses two Cisco UCS blades in the same chassis and two LUNs coming from the same NetApp, so it is only a simulation of a metro setup.
During this POC the LVM way proved itself unusable: once one of the "mirrored" LUNs goes offline (which is supposed to happen by design), LVM hangs and causes everything else to hang. Therefore the "MD Way" was added.
It was later discovered that tuning the multipath parameters solves the hang problem, so the LVM way was rechecked and now seems to work well.
Two UCS servers were installed with a minimal RH6 installation, configured to boot from SAN (you can use local boot as well; I just do not have internal disks installed).
All nodes in an HA cluster should have the same SSH host keys, so that SSH clients do not get confused after a failover.
vorh6t01 # scp vorh6t02:/etc/ssh/ssh_host_\* /etc/ssh/
...
vorh6t01 # service sshd restart
Generate root SSH keys and exchange them between the cluster nodes:
vorh6t01 # ssh-keygen -t rsa -b 1024 -C "root@vorh6t"
.....
vorh6t01 # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
vorh6t01 # scp -pr .ssh vorh6t02:
Install these RPMs on both nodes (with all their dependencies):
# yum install ccs cman rgmanager
vorh6t01 and vorh6t02 are the two nodes of the HA (fail-over) cluster named vorh6t. Take care to make all names resolvable by DNS and add all names to /etc/hosts on both nodes.
Define cluster (on any one node):
# ccs_tool create -2 vorh6t
The command above creates the /etc/cluster/cluster.conf file. It can be edited by hand and then has to be redistributed to every node in the cluster. The -2 option is required for a two-node cluster; the usual configuration assumes more than two nodes, so that quorum is unambiguous.
Open the file in an editor and change the node names to the real ones. I am using transport="udpu" here because my network does not support multicast, and broadcasts are not welcome either. Without this option my cluster behaves unpredictably. The resulting file should look like:
<?xml version="1.0"?>
<cluster name="vorh6t" config_version="1">
  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
Then check:
# ccs_tool lsnode

Cluster name: vorh6t, config_version: 1

Nodename                        Votes Nodeid Fencetype
vorh6t01.domain.com                1    1
vorh6t02.domain.com                1    2

# ccs_tool lsfence
Name             Agent
Copy /etc/cluster/cluster.conf to second node:
vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
You can start the cluster services now to see them working. Start them with /etc/init.d/cman start on both nodes. Check /var/log/messages. See the clustat output:
vorh6t01 # clustat
Cluster Status for vorh6t @ Thu Sep 27 15:04:58 2012
Member Status: Quorate

 Member Name                         ID   Status
 ------ ----                         ---- ------
 vorh6t01.domain.com                     1 Online, Local
 vorh6t02.domain.com                     2 Online

vorh6t02 # clustat
Cluster Status for vorh6t @ Thu Sep 27 15:05:07 2012
Member Status: Quorate

 Member Name                         ID   Status
 ------ ----                         ---- ------
 vorh6t01.domain.com                     1 Online
 vorh6t02.domain.com                     2 Online, Local
Stop the cluster services on both nodes with /etc/init.d/cman stop.
There are two sections related to resources: <resources/> and <service/>. The first holds "global" resources shared between services (like an IP). The second holds resources grouped into a service (like an FS plus a script). Ours is a single-purpose cluster, so we only fill in the <service> section.
...
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" name="vorh6t" recovery="relocate">
      <ip address="192.168.131.12/24" />
    </service>
  </rm>
...
Add the cluster services to the init scripts. Start the cluster and the resource manager on both nodes:
# chkconfig --add cman
# chkconfig cman on
# chkconfig --add rgmanager
# chkconfig rgmanager on
# /etc/init.d/cman start
# /etc/init.d/rgmanager start
# clustat
Cluster Status for vorh6t @ Tue Oct  2 12:55:38 2012
Member Status: Quorate

 Member Name                         ID   Status
 ------ ----                         ---- ------
 vorh6t01.domain.com                     1 Online, rgmanager
 vorh6t02.domain.com                     2 Online, Local, rgmanager

 Service Name                Owner (Last)                State
 ------- ----                ----- ------                -----
 service:vorh6t              vorh6t01.domain.com         started
Switch the service to the other node:
# clusvcadm -r vorh6t -m vorh6t02
Trying to relocate service:vorh6t...Success
service:vorh6t is now running on vorh6t02.domain.com
Freeze resources (for maintenance):
# clusvcadm -Z vorh6t
Local machine freezing service:vorh6t...Success
Resume normal operation:
# clusvcadm -U vorh6t
Local machine unfreezing service:vorh6t...Success
RH cluster behaviour is essentially broken without well-configured fencing. The available fencing agents can be seen as /usr/sbin/fence*.
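To see which agents are installed on your nodes (the list depends on the fence-agents packages you have), something like this works:

# ls /usr/sbin/fence_*
# man fence_cisco_ucs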
Cisco supplies a fencing agent for UCS, so let's use it. Create a user for fencing on the UCS (or use an existing one) and give it the "power off" and "profile change" roles. Check that you can use this user:
vorh6t02:~ # fence_cisco_ucs -z --action=status --ip=UCSNAME --username=USERNAME --password=PASSWORD --suborg=SUBORG --plug=vorh6t01
Status: ON
vorh6t02:~ # fence_cisco_ucs -z --action=off --ip=UCSNAME --username=USERNAME --password=PASSWORD --suborg=SUBORG --plug=vorh6t01
Success: Powered OFF
vorh6t02:~ # fence_cisco_ucs -z --action=on --ip=UCSNAME --username=USERNAME --password=PASSWORD --suborg=SUBORG --plug=vorh6t01
Success: Powered ON
The --suborg string is usually your "Sub-Organization" name with the prefix "org-". If you called your "Sub-Organization" "Test", then the result will be --suborg=org-Test.
Once the fencing tests work, fix cluster.conf:
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="vorh6t" config_version="2">
  <logging syslog_priority="error"/>
  <fence_daemon post_fail_delay="20" post_join_delay="30" clean_start="1" />
  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="ucsfence" port="vorh6t01" action="off" />
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="ucsfence" port="vorh6t02" action="off" />
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="myfence" agent="fence_manual" />
    <fencedevice name="ucsfence" agent="fence_cisco_ucs" ipaddr="UCSNAME" login="USERNAME" passwd="PASSWORD" ssl="on" suborg="SUBORG" />
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" name="vorh6t" recovery="relocate">
      <ip address="192.168.131.12/24" />
    </service>
  </rm>
</cluster>
Do not forget to increment the config_version number and save the changes. Verify the config file:
# ccs_config_validate
Configuration validates
Distribute the file and update the cluster:
vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01 # cman_tool version -r -S
Let's stop the network (/etc/init.d/network stop) on the active node and watch the cluster kill the "bad" server!
It looks like you need to restart cman for it to reread the config file.
Let's create two volumes with LUNs and map both of them to both cluster nodes:
netapp> igroup create -f -t linux vorh6t01 $WWN1 $WWN2
netapp> igroup set vorh6t01 alua yes
netapp> igroup create -f -t linux vorh6t02 $WWN1 $WWN2
netapp> igroup set vorh6t02 alua yes
netapp> vol create vorh6t01 -s none $AGGRNAME 600g
netapp> exportfs -z /vol/vorh6t01
netapp> vol options vorh6t01 minra on
netapp> vol autosize vorh6t01 -m 14t on
netapp> lun create -s 250g -t linux -o noreserve /vol/vorh6t01/data
netapp> lun map /vol/vorh6t01/data vorh6t01
netapp> lun map /vol/vorh6t01/data vorh6t02
netapp> vol create vorh6t02 -s none $AGGRNAME 600g
netapp> exportfs -z /vol/vorh6t02
netapp> vol options vorh6t02 minra on
netapp> vol autosize vorh6t02 -m 14t on
netapp> lun create -s 250g -t linux -o noreserve /vol/vorh6t02/data
netapp> lun map /vol/vorh6t02/data vorh6t01
netapp> lun map /vol/vorh6t02/data vorh6t02
Rescan FC for changes (on both nodes):
# for FC in /sys/class/fc_host/host?/issue_lip ; do echo "1" > $FC ; sleep 5 ; done ; sleep 20
Use the NetApp LUN serial to WWID converter to calculate the LUNs' WWIDs. Set the LUN aliases in /etc/multipath.conf. I chose the names "site01" and "site02" to reflect the simulation of two storage arrays, one for the local and one for the remote site.
multipaths {
    multipath {
        wwid    360a980004176596d6a3f447356493258
        alias   site01
    }
    multipath {
        wwid    360a980004176596d6a3f44735649325a
        alias   site02
    }
}
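If you prefer to read the WWIDs directly on the host instead of converting the NetApp serial numbers, something like this should work (the sd device names below are just examples; pick any path device that belongs to the LUN):

# /lib/udev/scsi_id --whitelisted --device=/dev/sdc   # example path device of the first LUN
# /lib/udev/scsi_id --whitelisted --device=/dev/sdd   # example path device of the second LUN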
Run the multipath command on both nodes and check that both new LUNs are recognized and use multiple paths.
# multipath
# multipath -ll
The easy part is over.
Fix the filter line in /etc/lvm/lvm.conf to filter out the plain SCSI disks. My line is an example of explicitly adding the devices in use and ignoring everything else:
... filter = [ "a|/dev/mapper/mroot|", "a|/dev/mapper/site|", "r/.*/" ] ...
Create a mirrored LV on one node. Make the "site01" disk preferred for reads:
vorh6t01:~ # pvcreate --dataalignment 4k /dev/mapper/site01
  Physical volume "/dev/mapper/site01" successfully created
vorh6t01:~ # pvcreate --dataalignment 4k /dev/mapper/site02
  Physical volume "/dev/mapper/site02" successfully created
vorh6t01:~ # vgcreate orahome /dev/mapper/site0?
  Volume group "orahome" successfully created
vorh6t01:~ # pvs
  PV                   VG      Fmt  Attr PSize   PFree
  /dev/mapper/mroot0p2 rootvg  lvm2 a--   39.88g  34.78g
  /dev/mapper/site01   orahome lvm2 a--  250.00g 250.00g
  /dev/mapper/site02   orahome lvm2 a--  250.00g 250.00g
vorh6t01:~ # lvcreate -n export -L 20g --type raid1 -m1 --nosync /dev/orahome
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "export" created
vorh6t01:~ # lvchange --writemostly /dev/mapper/site02:y /dev/orahome/export
  Logical volume "export" changed.
vorh6t01:~ # mkfs.ext3 -j -m0 -b4096 /dev/orahome/export
vorh6t01:~ # mkdir /export && mount /dev/orahome/export /export
Run some IO tests on /export and verify that writes go to both LUNs while reads come only from the site01 disk.
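A rough sketch of such a test (the file name and sizes are arbitrary; use multipath -ll to see which dm-N devices back site01 and site02, then watch them in iostat from another terminal):

# direct-IO write: both the site01 and site02 legs should show write traffic
dd if=/dev/zero of=/export/testfile bs=1M count=1024 oflag=direct
# direct-IO read: only the site01 leg should show read traffic
dd if=/export/testfile of=/dev/null bs=1M iflag=direct
# in another terminal, watch the per-device statistics
iostat -xk 5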
Test that it works on the partner node:
vorh6t01:~ # umount /export/
vorh6t01:~ # vgchange -a n orahome
  0 logical volume(s) in volume group "orahome" now active
vorh6t02:~ # vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "orahome" using metadata type lvm2
  Found volume group "rootvg" using metadata type lvm2
vorh6t02:~ # vgchange -ay orahome
  1 logical volume(s) in volume group "orahome" now active
vorh6t02:~ # mount /dev/orahome/export /export
Running the IO test shows that the volume remembers its previous settings and "site02" remains the writemostly LUN. That is not the desired situation on node (site) 02. Let's fix it:
vorh6t02:~ # lvchange --writemostly /dev/mapper/site02:n /dev/orahome/export
  Logical volume "export" changed.
vorh6t02:~ # lvchange --writemostly /dev/mapper/site01:y /dev/orahome/export
  Logical volume "export" changed.
Repeat the IO tests. Now it behaves as desired.
Fix /etc/lvm/lvm.conf to explicitly name the VGs that may be activated on LVM start (this is just a list of VGs and a tag - the heartbeat NIC's hostname):
volume_list = [ "rootvg", "@vorh6t01.domain.com" ]
The initrd has to be rebuilt to include the new lvm.conf (otherwise the cluster refuses to start):
mkinitrd -f /boot/initramfs-$(uname -r).img $(uname -r)
Repeat the /etc/lvm/lvm.conf changes (and the initrd rebuild) on the other node.
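On vorh6t02 the tag must name that node, so its volume_list presumably becomes:

volume_list = [ "rootvg", "@vorh6t02.domain.com" ]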
Let's add our LV to the cluster resources. Edit /etc/cluster/cluster.conf; do not forget to increment config_version:
...
<cluster name="vorh6t" config_version="3">
...
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" name="vorh6t" recovery="relocate">
      <ip address="192.168.131.12/24">
        <lvm name="vorh6tlv" lv_name="export" vg_name="orahome">
          <fs name="vorh6tfs" device="/dev/orahome/export" mountpoint="/export" fstype="ext3" force_unmount="1" self_fence="1" />
        </lvm>
      </ip>
    </service>
  </rm>
...
Distribute the file and update the cluster:
vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01 # cman_tool version -r -S
Check that the cluster has taken over the LV and mounted it.
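A quick way to verify (a sketch):

# clustat
# lvs orahome
# df -h /export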
Now we will create a script that tunes the "writemostly" parameter. This file should be LSB-compliant (the scripts in /etc/init.d are). I copied the shortest script from /etc/init.d and this is the result:
#!/bin/sh
# /export/site-tune
# description: Adjust writemostly parameter for mirrored /dev/orahome/export
# Everything hardcoded.
case "$1" in
  start)
    if grep -q 01 /proc/sys/kernel/hostname ; then
        lvchange --writemostly /dev/mapper/site01:n /dev/orahome/export
        lvchange --writemostly /dev/mapper/site02:y /dev/orahome/export
        echo "tuned 01 read, 02 write"
    else
        lvchange --writemostly /dev/mapper/site01:y /dev/orahome/export
        lvchange --writemostly /dev/mapper/site02:n /dev/orahome/export
        echo "tuned 02 read, 01 write"
    fi
    ;;
  status|monitor)
    ;;
  stop)
    ;;
  restart|reload|force-reload|condrestart|try-restart)
    ;;
  *)
    echo "Usage: $0 start|stop|status"
    ;;
esac
exit 0
Make it executable and test its functionality.
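For example, on node 01:

vorh6t01:~ # chmod +x /export/site-tune
vorh6t01:~ # /export/site-tune start
tuned 01 read, 02 write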
Now we'll add a new cluster resource of type "script". It is nested inside the "fs" resource because the script lives on that FS.
...
<cluster name="vorh6t" config_version="4">
...
        <lvm name="vorh6tlv" lv_name="export" vg_name="orahome">
          <fs name="vorh6tfs" device="/dev/orahome/export" mountpoint="/export" fstype="ext3" force_unmount="1" self_fence="1">
            <script name="vorh6tfstune" file="/export/site-tune" />
          </fs>
        </lvm>
...
Distribute the file and update the cluster:
vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01 # cman_tool version -r -S
Fail the service over between the nodes and check that the "writemostly" bit changes as desired:
vorh6t01:~ # lvs -a
  LV                VG      Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  export            orahome Rwi-aor--- 20.00g                                 100.00
  [export_rimage_0] orahome iwi-aor--- 20.00g
  [export_rimage_1] orahome iwi-aor-w- 20.00g
  [export_rmeta_0]  orahome ewi-aor---  4.00m
  [export_rmeta_1]  orahome ewi-aor---  4.00m
...
On the second node, after failover:
vorh6t02:~ # lvs -a
  LV                VG      Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  export            orahome Rwi-aor--- 20.00g                                 100.00
  [export_rimage_0] orahome iwi-aor-w- 20.00g
  [export_rimage_1] orahome iwi-aor--- 20.00g
  [export_rmeta_0]  orahome ewi-aor---  4.00m
  [export_rmeta_1]  orahome ewi-aor---  4.00m
...
Now take one of the mirrored LUNs offline. The result is unexpected: LVM hangs, the cluster hangs, and then file system IO hangs as well. Googling shows the problem has already been raised in the Red Hat Bugzilla, but there is no hope for a quick fix; the bug is marked as lacking end-user interest. Please vote for the bug fix.
If you followed the "LVM Way", restore all files to the state they were in before the LVM part.
Create the mirror on one node:
vorh6t01:~ # mdadm -C /dev/md0 -n2 -l1 /dev/mapper/site0?
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
If the LUN size is quite big, you can use the --assume-clean flag to skip the initial resynchronization. This flag is not recommended by the mdadm documentation, but I found it useful for my thin-provisioned devices.
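With that flag the create command becomes:

vorh6t01:~ # mdadm -C /dev/md0 -n2 -l1 --assume-clean /dev/mapper/site0?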
Stop the array for now:
vorh6t01:~ # mdadm -S /dev/md0
mdadm: stopped /dev/md0
IMPORTANT: Suppose the "passive" node reboots for some reason. It would be dangerous for that node to try to assemble the MD RAID at boot time while the "active" partner is still using it. The cluster software takes care of split-brain at run time, but not at boot time.
Disable MD assembly during initrd boot: add rd_NO_MD to the kernel boot line in /boot/grub/grub.conf (see the sketch below). Other init scripts will try to start MD RAIDs too. There are a lot of places to patch, so I did the lazy thing: I renamed /sbin/mdadm to /sbin/mdadm.real and put a stub script in place of the original. OS updates will put the real binary back, but that is handled in a script you will see later.
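The kernel line in /boot/grub/grub.conf then ends up looking roughly like this (your kernel version, root device and existing options will differ; only the appended rd_NO_MD matters):

kernel /vmlinuz-<version> ro root=<your root device> <existing options...> rd_NO_MD

The mdadm stub replacement described above is done as follows: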
# /sbin/mdadm --version | grep -q Fake || mv -f /sbin/mdadm /sbin/mdadm.real && \
    echo -e '#!/bin/bash\necho "Fake stub. Use /sbin/mdadm.real"' > /sbin/mdadm && chmod +x /sbin/mdadm
Now reboot the node(s) and check that no /dev/md?* devices are created at boot time. Please do not continue until this is verified.
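A quick check after the reboot (it should show no assembled arrays):

# cat /proc/mdstat
# ls /dev/md?* 2>/dev/null || echo "OK - no md devices"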
This is my /root/bin/mdscript script. The script should be LSB-compliant in its exit codes. It starts the md0 device and sets the "writemostly" bit on the relevant member. The longest line detects the real name hidden behind dm-XX (mapping a DM device to its name).
#!/bin/sh
# Everything hardcoded.
# Take care about mdadm RPM updates:
/sbin/mdadm --version | grep -q Fake || mv -f /sbin/mdadm /sbin/mdadm.real && \
    echo -e '#!/bin/bash\necho "Fake stub. Use /sbin/mdadm.real"' > /sbin/mdadm && chmod +x /sbin/mdadm
MDADM="/sbin/mdadm.real"
case "$1" in
  start)
    $0 stop
    $MDADM -A md0 /dev/mapper/site0?
    [ ! -b /dev/md0 ] && exit 1
    # Find names of devices:
    SITE01=$(grep -H site0 /sys/block/md0/md/dev-*/block/dm/name | awk '/site01/{gsub("/block/dm/name:.*","");print $1}')
    SITE02=$(grep -H site0 /sys/block/md0/md/dev-*/block/dm/name | awk '/site02/{gsub("/block/dm/name:.*","");print $1}')
    if grep -q 01 /proc/sys/kernel/hostname ; then
        echo "writemostly" > ${SITE02}/state
    else
        echo "writemostly" > ${SITE01}/state
    fi
    ;;
  status|monitor)
    [ ! -b /dev/md0 ] && exit 3
    ;;
  stop)
    for md in $( ls /dev/md?* 2>/dev/null) ; do
        $MDADM -S $md
    done
    sleep 1
    [ -b /dev/md0 ] && exit 1
    ;;
  *)
    echo "Usage: $0 start|stop|status"
    ;;
esac
exit 0
Copy the script to the partner node and add a script resource to the cluster:
...
<cluster name="vorh6t" config_version="3">
...
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" name="vorh6t" recovery="relocate">
      <ip address="192.168.131.12/24">
        <script name="vorh6tmd" file="/root/bin/mdscript" />
      </ip>
    </service>
  </rm>
...
Spread the config file, restart the services and check cat /proc/mdstat. You should see md0 running on the active node. If not, recheck the configuration.
If it looks like it is working, let's format it on the active node. I decided not to use LVM on top of the md device, as it would be unnecessary overhead.
vorh6t01:~ # mkfs.ext3 -j -m0 -b4096 -E nodiscard /dev/md0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=16 blocks
16375808 inodes, 65503184 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
1999 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Add the FS resource to the cluster:
...
<cluster name="vorh6t" config_version="4">
...
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" name="vorh6t" recovery="relocate">
      <ip address="192.168.131.12/24">
        <script name="vorh6tmd" file="/root/bin/mdscript">
          <fs name="vorh6tfs" device="/dev/md0" mountpoint="/export" fstype="ext3" force_unmount="1" self_fence="1" />
        </script>
      </ip>
    </service>
  </rm>
...
Distribute the file and update the cluster:
vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01 # cman_tool version -r -S
Check that /export has been mounted by rgmanager.
A simple test: turn off one of the mirrored data LUNs. Again, an unexpected result: multipath complains about lost paths, cat /proc/mdstat hangs, the sync command hangs. Bringing the missing LUN back online solves the problem. The problem is exactly the same as in the "LVM Way". Maybe it is a problem with the multipath configuration?
YES! The default queue_if_no_path feature causes IO to hang on a device whose paths are all dead. There is a command to change multipath features online. Let's add it to the "status" branch of our script:
...
  status|monitor)
    /sbin/dmsetup message site01 0 "fail_if_no_path"
    /sbin/dmsetup message site02 0 "fail_if_no_path"
    [ ! -b /dev/md0 ] && exit 3
    ;;
...
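If you want this behaviour to be persistent rather than reset at runtime, it can presumably also be set per LUN in /etc/multipath.conf with no_path_retry (followed by a multipathd reload); a sketch:

multipaths {
    multipath {
        wwid          360a980004176596d6a3f447356493258
        alias         site01
        no_path_retry fail
    }
    multipath {
        wwid          360a980004176596d6a3f44735649325a
        alias         site02
        no_path_retry fail
    }
}

# service multipathd reload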
Wow! Much better now. Everything works as expected. The big minus of this solution is the full resync of the devices and the manual intervention needed to rebuild the mirror.
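The manual part, once the LUN is back, looks roughly like this (assuming the failed leg was site02; note that the real binary on these nodes is /sbin/mdadm.real because of the stub):

vorh6t01:~ # /sbin/mdadm.real /dev/md0 --re-add /dev/mapper/site02
vorh6t01:~ # cat /proc/mdstat
# if --re-add is refused, fall back to adding the leg again, which triggers a full resync:
vorh6t01:~ # /sbin/mdadm.real /dev/md0 --add /dev/mapper/site02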
Let's recheck the "LVM Way" with the multipath feature trick.
Restore everything that was done for the "LVM Way" in the previous chapter.
Let's add the multipath feature reset to the status case of our /export/site-tune script:
...
  status|monitor)
    /sbin/dmsetup message site01 0 "fail_if_no_path"
    /sbin/dmsetup message site02 0 "fail_if_no_path"
    sleep 1
    ;;
...
Check with multipath -ll that there is no queue_if_no_path in the features field of the "site0?" devices.
Take one of the mirrored LUNs offline. Huh! No more hangs. Great. Let's watch it resynchronize when the LUN returns:
# lvs -a
  LV                VG      Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  export            orahome Rwi-aor-r- 20.00g                                 100.00
  [export_rimage_0] orahome iwi-aor-w- 20.00g
  [export_rimage_1] orahome iwi-aor-r- 20.00g
  [export_rmeta_0]  orahome ewi-aor---  4.00m
  [export_rmeta_1]  orahome ewi-aor-r-  4.00m
...
According to the manual, this "r" bit means "refresh needed". The repair happens automatically; only this bit remains. Use lvchange --refresh to clear the bit.
# lvchange --refresh /dev/orahome/export
# lvs -a
  LV                VG      Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  export            orahome Rwi-aor--- 20.00g                                 100.00
  [export_rimage_0] orahome iwi-aor-w- 20.00g
  [export_rimage_1] orahome iwi-aor--- 20.00g
  [export_rmeta_0]  orahome ewi-aor---  4.00m
  [export_rmeta_1]  orahome ewi-aor---  4.00m
...
Everything is back to the OK status. Another command, lvconvert --repair orahome/export, may be used for a full resync if the LUNs were out of sync for a long time.