The preparation should be done on both nodes.
There are several points to consider before installing a cluster:
User and group IDs should be the same on all nodes. This is less important for our particular case, but it is very important for services with shared storage, either active-active or failover.
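A quick way to compare the IDs on both nodes, assuming HANA and its users are already installed (the sapsys group name is the usual SAP default, treat it as an assumption for your landscape):
node1# id <sid>adm
node2# id <sid>adm
node1# getent group sapsys
node2# getent group sapsys
The numeric UID and GID values should match between the nodes.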
Both cluster nodes must be synchronized in time. It is good to keep all your servers synchronized in time in general.
SUSE 15 SP1 uses the chronyd time synchronization tool by default. Edit the /etc/chrony.conf file to fix the server line:
.. server <YOUR NTP SERVER NAME OR IP> ..
Then enable and start the chronyd daemon:
# systemctl stop chronyd
# systemctl enable --now chronyd
# chronyc sources
The last command shows whether the daemon is actually connecting to the configured NTP servers. The source line must show a "Reach" counter that increments over time and jitter values other than 0.
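An illustrative (not literal) example of healthy output, with a placeholder server name:
# chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp.example.com               2    6   377    34   +120us[ +135us] +/-   25ms
A Reach value of 377 means the last eight polls all succeeded.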
Name resolution in SUSE 15 is managed by wicked, and it is not easy to make it relax and follow the legacy /etc/resolv.conf. Several steps are required:
# sed -e 's/^NETCONFIG_DNS_POLICY.*/NETCONFIG_DNS_POLICY=/' -i /etc/sysconfig/network/config
# netconfig update -f
The above commands should convince wicked to stop messing with /etc/resolv.conf. In practice this was not enough, since the file was still a link to a wicked-managed file. Restoring a classic file solves the problem:
# rm -f /etc/resolv.conf
# vi /etc/resolv.conf
Place the regular content in the file, such as the search and nameserver definitions.
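For example (the domain and addresses below are placeholders):
search <YOUR DNS DOMAIN>
nameserver xxx.xxx.xxx.xxx
nameserver xxx.xxx.xxx.xxx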
Even with well-functioning DNS, put the IP addresses of all hosts in /etc/hosts, including the IP addresses of the replication segment (use different names for them). Replicate the /etc/hosts to the second node.
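A minimal sketch of such a file; the addresses and the -rep names for the replication segment are illustrative:
xxx.xxx.xxx.xx1   node1
xxx.xxx.xxx.xx2   node2
yyy.yyy.yyy.yy1   node1-rep
yyy.yyy.yyy.yy2   node2-rep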
This is a general recommendation not related to SAP or even SUSE. Let's imagine a virtual IP that will follow the active cluster node. SSH to this IP will always fail after a cluster failover due to changes in the SSH keys of the active host. Good practice here is to use the same SSH host keys on all nodes of the cluster.
node1# rsync -av /etc/ssh/ssh_host_* node2:/etc/ssh/
If there are multipath devices shared between the cluster nodes (in our case this is at least the SBD device), the following changes should be applied to the multipath daemon configuration:
# cat /etc/multipath.conf
defaults {
    no_path_retry        fail
    queue_without_daemon no
    flush_on_last_del    yes
}
..
In the default configuration, the multipath daemon forgives failed paths, even the last one. It hopes that one of the paths will return soon and does not report the faulty device to the kernel. This is probably fine in a stand-alone server configuration, but it causes the cluster to ignore storage subsystem failures. The changes above correct this behavior.
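To apply and verify the new settings without a reboot, something like the following should work (the exact multipathd command syntax can vary between multipath-tools versions):
# multipathd -k"reconfigure"
# multipathd -k"show config" | grep -E 'no_path_retry|queue_without_daemon|flush_on_last_del'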
The STONITH block device (SBD) acts as a message box between cluster nodes. This adds another message channel besides the network one. The difference is that corosync does not use this channel; it carries only fencing messages.
A special service is needed to read the messages and decide on suicide. A message sent by the initiator node does not guarantee that the partner will read it and act accordingly. This leads to the idea of using a watchdog. Although Intel physical servers have a hardware watchdog timer, this is not the case in Power and virtualized environments. Therefore, you should use a software watchdog timer:
# echo softdog > /etc/modules-load.d/watchdog.conf
# systemctl restart systemd-modules-load
# lsmod | grep dog
softdog                16384  0

OPTIONAL: Create an alias in /etc/multipath.conf for a shorter device name. This may be useful later, when resuming the cluster, to reference the device by a handy name.
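A sketch of such an alias stanza; the WWID shown is a placeholder for your SBD LUN:
multipaths {
    multipath {
        wwid  36XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        alias sbd
    }
}
After reloading multipathd, the device appears as /dev/mapper/sbd, which is the name used later in this article.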
It is possible to initialize and configure SBD manually, but the cluster initialization script will do it better. However, you should enable the SBD service; the script will not do this:
# systemctl enable sbd.service
NOTE: These actions should be done on both nodes.
Use the SUSE provided ha-cluster-init script on the first node to begin configuring the cluster.
node1# ha-cluster-init -u -n CLUSTERNAME
The -u option forces the cluster to use unicast interconnect instead of multicast. Most production customer networks do not support multicast, so use -u anytime and anywhere.
The script is interactive and will ask you some questions.
/root/.ssh/id_rsa already exists - overwrite (y/n)? n
This question means that the script is trying to generate an SSH key pair and has discovered that one already exists. An existing key pair usually means that it is already being used for some purpose. Therefore, the correct answer is n.
Address for ring0 [xxx.xxx.xxx.xxx]
There is no question mark, but the script is waiting for your input. If the proposed IP is suitable for the main network (ring0), just press Enter. If the proposal is incorrect, it's time to specify the IP address of another interface.
/etc/pacemaker/authkey already exists - overwrite (y/n)? y
This question may arise if you have already set up a cluster and are now repeating the action. In my case, I want to start over, so I answer y to overwrite.
Do you wish to use SBD (y/n)? y
Path to storage device ...output cut... []/dev/mapper/sbd
WARNING: All data on /dev/mapper/sbd will be destroyed!
Are you sure you wish to use this device (y/n)? y
Answer yes to use the SBD and give its name (/dev/mapper/36XXXX.. or the short alias if one was set in multipath.conf). Agree to initialize the device.
Do you wish to configure a virtual IP address (y/n)? n
We will configure two VIPs later and assign one for the active HANA database and the other for the secondary. Since we want to use a secondary database for read-only queries, the second VIP can be useful.
node2# ha-cluster-join
This script asks fewer questions.
..
IP address or hostname of existing node ... []node1
..
/root/.ssh/id_rsa already exists - overwrite (y/n)? n
..
Address for ring0 [xxx.xxx.xxx.xxx]<Press Enter here>
The explanations are the same as for the node1 part.
The result of the script's work is a working cluster, as shown:
# crm status
Stack: corosync
Current DC: node1
2 nodes configured
1 resource configured

Online: [ node1 node2 ]

Full list of resources:
 stonith-sbd    (stonith:external/sbd):   Started node1
During cluster initialization, a hacluster user is created with a default password. This user is very powerful, especially when connected to the HAWK interface, which is also widely available after cluster initialization. Set a strong password for the user:
node1# passwd hacluster
node2# passwd hacluster
The cluster for the HANA database is slightly different from a common cluster. After a failover, the former primary database becomes secondary, registers itself to the promoted database and resumes data replication in the opposite direction. It is not possible to repeat a cluster failover until the database is fully synchronized, otherwise data loss may occur. To prevent cluster ping-pong, the SBD_STARTMODE option may be useful. Edit the SBD configuration file and set the option to clean.
node1# vi /etc/sysconfig/sbd
SBD_STARTMODE=clean
Synchronize the file between nodes. The csync2 tool can be used for this job:
node1# csync2 -xv
Marking file as dirty: /etc/sysconfig/sbd
..
Updating /etc/sysconfig/sbd on node2
..
Finished with 0 errors.
As a result, the cluster software will not start on the failed server until manual intervention. Once the HANA database administrator has verified that the data replication is working properly, the block can be removed and the cluster can be resumed using the following commands (assuming node 1 is a failed node):
node1# sbd -d /dev/mapper/sbd message LOCAL clear
node1# systemctl start pacemaker.service
Since we have a replication network between our nodes, it could be added as a redundant network in corosync. This seems like a good idea, but it causes corosync to ignore primary network failures. After a series of tests, I dropped the idea of adding a redundant network to corosync.
As mentioned earlier, SBD is not a real fencing device, it is more like a message box, and the fencing operation depends on the partner’s ability to read the message and follow the order. The real fencing device does not depend on the state of the node. A good fencing device in a POWER environment is the HMC. It is good to configure both HMCs, if possible.
There are two fencing agents available for the HMC: hmchttp, which works over HTTPS using a username and password, and ibmhmc, which works over SSH and can use a passwordless connection based on a key exchange. I chose the second method, which does not use passwords.
Find the public SSH key:
# cat .ssh/id_rsa.pub
ssh-rsa Very..very..Long..String Cluster Internal
The cluster script replicated the SSH keys between nodes, so they are the same on both.
Log in to the HMC via SSH as the user hscroot, then:
~> mkauthkeys -a "ssh-rsa Very..very..Long..String Cluster Internal"
Repeat the same on the second HMC.
Important! You should approve the SSH host fingerprints on both nodes for both HMCs! Otherwise the fencing agent will fail on the yes/no question.
node1# ssh -l hscroot <IP-ADDR-HMC1>
node1# ssh -l hscroot <IP-ADDR-HMC2>
node2# ssh -l hscroot <IP-ADDR-HMC1>
node2# ssh -l hscroot <IP-ADDR-HMC2>
Verify all four combinations to make sure a passwordless SSH connection to both HMCs is possible.
Now it is time to define the fencing itself:
# crm configure primitive fence-hmc1 stonith:ibmhmc params ipaddr="xxx.xxx.xxx.xx1"
# crm configure primitive fence-hmc2 stonith:ibmhmc params ipaddr="xxx.xxx.xxx.xx2"
Now we have three fencing devices. The SUSE cluster will use one of them and will not continue to other devices if the first action was successful. As we have already said, writing a message on an SBD device is almost always successful, although it may not really fence a node. We must make sure that at least one HMC fence is executed. This is achieved using the fencing_topology definition. The comma in the list of fencing devices acts as the AND operator, and the space as the OR operator.
# stonith_admin -l node2
 stonith-sbd
 fence-hmc1
 fence-hmc2
3 devices found
# crm configure fencing_topology stonith-sbd,fence-hmc1 fence-hmc2
A two-node cluster is a special kind of cluster. Despite a correct fencing configuration, it will never reach quorum after losing a node. Therefore, its default policy should be adapted to this fact.
# crm configure property no-quorum-policy=ignore
It's also time to turn on STONITH (or fencing):
# crm configure property stonith-enabled=true
Most of the following commands are executed as the administrator of the HANA instance. A HANA database is identified by a SID made up of three uppercase characters and a two-digit value called the instance number. Locations and file names are often a combination of hostname, SID, and instance. There are some environment variables matching the HANA values. For example, you can find the SID in $SAPSYSTEMNAME and the instance number in $TINSTANCE. The administrative user <sid>adm (here the SID is used in lower case) is created during installation. Become <sid>adm:
node1# su - <sid>adm
node1>
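As a quick sanity check of the environment variables mentioned above:
node1> echo "SID=$SAPSYSTEMNAME instance=$TINSTANCE"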
A HANA instance can be started automatically when the server boots. This is not suitable for a cluster configuration, so you should disable it. A parameter named "Autostart" appears in the instance profile and must be equal to zero. Use the handy alias "cdpro" to go directly to the profile location and check the Autostart parameter:
node1> cdpro
node1> grep Auto <SID>_HDB<instance>_<hostname>
Autostart = 0
The next step is to create a full backup. HANA 2 works in multitenant mode, and backups can be performed for any particular part of the instance. We need a backup of the entire instance, so the backup statement includes the FOR FULL SYSTEM option, and the connection is made to the system database with the help of the -d SYSTEMDB option.
node1> hdbsql -u SYSTEM -i <instance num> -d SYSTEMDB "BACKUP DATA FOR FULL SYSTEM USING FILE ('FIRSTFULL')"
The resulting files will be created in the $DIR_INSTANCE/backup/data directory. If you have free space elsewhere, indicate the full path in the previous command.
Replication must use the replication network; this is defined by the system_replication_hostname_resolution section in the global.ini file. The file can be updated online using the HANA Studio tool or similar. I did not have such a tool available, so I shut down the database, made the changes to the file directly, and then brought the database up. The file is located in the place for the custom configuration; you can go there with the handy alias "cdcoc":
node1> HDB stop
node1> cdcoc
node1> vi global.ini
While we are dealing with the global.ini file, there are other nice options to include, such as traffic compression:
..
[system_replication]
enable_log_compression = true
enable_data_compression = true
enable_log_retention = auto

[system_replication_communication]
listeninterface = .internal

[system_replication_hostname_resolution]
<Replication IP of node1 in form XXX.XXX.XXX.XXX> = node1
<Replication IP of node2 in form XXX.XXX.XXX.XXX> = node2
Save the file and start the database:
node1> HDB start
NOTE: It is possible to update the files online using the ALTER SYSTEM hdbsql command; here is an example of such a command:
node1> echo "ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'System') SET ('system_replication_communication', 'listeninterface') = '.internal' WITH RECONFIGURE" | hdbsql -u SYSTEM -i <instance num> -d SYSTEMDB
Enable replication using hdbnsutil command:
node1> hdbnsutil -sr_enable --name=PRODSITE
node1> netstat -tlnp
Check the output of the last command to verify that the replication processes are listening on the replication IP.
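For example, to narrow the output down to the replication segment (the address is a placeholder):
node1> netstat -tlnp | grep <Replication IP of node1>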
Perform almost the same actions as on the primary database:
node2# su - <sid>adm
node2> cdpro
node2> grep Auto <SID>_HDB<instance>_<hostname>
Autostart = 0
node2> HDB stop
node2> cdcoc
node2> vi global.ini
Put exactly the same update into the global.ini file.
You must copy the SSFS encryption keys for successful replication. Do this using root, as it is already configured for passwordless actions between cluster nodes:
node2# rsync -av node1:/usr/sap/<SID>/SYS/global/security/rsecssfs/ /usr/sap/<SID>/SYS/global/security/rsecssfs/
NOTE: When XSA is in use, its own SSFS keys should also be copied over to node2:
node2# rsync -av node1:/usr/sap/<SID>/SYS/global/xsa/security/ssfs/ /usr/sap/<SID>/SYS/global/xsa/security/ssfs/
node2# su - <sid>adm
node2> cdcoc
node2> cat xscontroller.ini
[communication]
default_domain = <FQDN of primary cluster VIP>
api_url = https://<FQDN of primary cluster VIP>:30030
Replicate xscontroller.ini to node1 after the cluster is set up.
Register secondary database and start it:
node2# su - <sid>adm
node2> hdbnsutil -sr_register \
    --remoteHost=node1 \
    --remoteInstance=<instance num> \
    --replicationMode=syncmem \
    --operationMode=logreplay_readaccess \
    --name=DRSITE
node2> HDB start
SUSE provides two resource agents that help manage HANA in a high availability cluster. The first is SAPHanaTopology, which tracks and understands the current state of the HANA database. The second is SAPHana, which actually handles database switching.
The agent can work with the hdbsql interface, which requires a lot of preparation of the database itself. It can also use the systemReplicationStatus.py script, which is located in /hana/shared/<SID>/HDB<instance>/exe/python_support. This option is preferred; the script is available starting from SPS9, and if it is missing, perhaps some part of the HANA software is not installed.
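A quick way to confirm the script is present and see what it reports (a sketch; run as <sid>adm, whose environment normally provides the HDBSettings.sh wrapper in the PATH):
node1> ls /hana/shared/<SID>/HDB<instance>/exe/python_support/systemReplicationStatus.py
node1> HDBSettings.sh systemReplicationStatus.py; echo RC=$?
A return code of 15 usually indicates that replication is active.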
An easy way to configure this is to use the wizard that comes with HAWK. You can connect with a browser to any node via https on port 7630. I prefer to use the MobaXterm feature that forwards graphics through the SSH tunnel, so I just do:
# firefox https://localhost:7630
Log in with the user hacluster - do you remember the password you set earlier? Then go to CONFIGURATION -> Wizards -> SAP -> SAP HANA SR Scale-Up Performance-Optimized. Fill in the form: enter the HANA SID, instance number and the virtual IP address that will follow the primary database. Click Verify, check the contents of the proposal and click Apply.
Since we are setting up replication with read-only request access, we need an additional IP address for this service.
# crm configure primitive rsc_ip_<SID>_RO IPaddr2 params ip=xxx.xxx.xxx.xxx cidr_netmask=24
# crm configure colocation col_saphana_ip_<SID>_RO 2000: rsc_ip_<SID>_RO:Started msl_SAPHana_<SID>_HDB<instance>:Slave
If you have followed the procedure with me, the master database is on node1. Start the cluster monitor on node2, which hosts the secondary HANA database (displayed as a slave in the cluster state).
node2# crm_mon
Let's stop node1's network:
node1# ifdown eth0
On the monitor screen, you will see how quickly node2 fences node1 and promotes itself to primary.
Remember that we configured SBD so as to avoid attaching the fenced node back to the cluster automatically, so it will never appear online in the crm monitor without our intervention.
Log in to the fenced node (this is node1 in the context of my article) and make sure that the cluster services are not working:
node1# crm status
ERROR: status: crm_mon (rc=102): Error: cluster is not available on this node
Good.
Let's register the HANA database for replication, start it and wait for full synchronization.
node1# su - <sid>adm
node1> hdbnsutil -sr_register \
    --remoteHost=node2 \
    --remoteInstance=<instance num> \
    --replicationMode=syncmem \
    --operationMode=logreplay_readaccess \
    --name=PRODSITE
node1> HDB start
Check the replication status using HANA tools until the database is in a synchronized state.
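For example, on the current primary (node2 after the failover) the same status script used earlier can give a quick answer; a sketch, assuming the standard <sid>adm environment:
node2# su - <sid>adm
node2> HDBSettings.sh systemReplicationStatus.py
Wait until all services report their replication status as ACTIVE before proceeding.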
Once you are ready to bring the cluster back, remove the SBD block and start the cluster services:
node1# sbd -d /dev/mapper/sbd list
0       node1   reset   node2
1       node2   clear
node1# sbd -d /dev/mapper/sbd message LOCAL clear
node1# sbd -d /dev/mapper/sbd list
0       node1   clear
1       node2   clear
node1# systemctl start pacemaker.service
The cluster will start HANA resource agents that detect normal database behavior and will not try to stop or start HANA. The cluster will only start the missing secondary IP address and begin to monitor the status of the cluster.