HA NFS server on SLE15

While setting up various clusters for SAP on SUSE, I have met customers without a redundant NFS storage solution, even though the S/4 HANA HA setup relies on a small NFS volume. To close this gap, I was asked to provide an HA solution for the NFS service.

I will be using SLE15 for this setup, since it is already used for the rest of the SAP landscape.

Preparing the nodes

Both nodes must be prepared in the same way. There are a number of important points that need to be configured and checked.

Name resolution

The HA solution cannot be implemented with DHCP; use fixed IP addresses. The names of the nodes and VIPs must be resolvable via DNS. In any case, add all the names and IP addresses used to /etc/hosts on both nodes. This reduces the dependency on third-party services and also speeds up name resolution.
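
For this lab, a minimal /etc/hosts could look like the sketch below; the addresses match the ones used later in this article, while the VIP hostname is only an illustrative choice:

192.168.120.11   nfs1
192.168.120.12   nfs2
192.168.120.10   nfs-vip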

Time settings (NTP)

Cluster services rely heavily on time and timeouts, so it is important to synchronize time with any available external time source.

The cluster setup only checks for a working chronyd, so review the contents of /etc/chrony.conf and /etc/chrony.d/*.conf and enable the chronyd service:

# systemctl enable --now chronyd
# chronyc sources
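
If no time source is configured yet, a single server or pool line is enough; the file name and hostname below are placeholders for your real time source:

# /etc/chrony.d/corp-ntp.conf  (placeholder name and server)
server ntp.example.com iburst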

Replicating SSH Host Keys

This may seem wrong, but it is a life saver when connecting to the cluster VIP over SSH: the host key stays the same no matter which node currently holds the VIP, so clients do not get host key warnings after a failover.

root@nfs1:~ # rsync -av /etc/ssh/ssh_host_*  root@nfs2:/etc/ssh/
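
Since sshd only reads its host keys at startup, restart it on the second node after copying them:

root@nfs2:~ # systemctl restart sshd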

Install base cluster

Since SLE is used here, you need to activate the HA extension. It is already included if you use SLES for SAP Applications.

# SUSEConnect -l | grep -B1 sle-ha/15.5/x86_64
            SUSE Linux Enterprise High Availability Extension 15 SP5 x86_64 (Activated)
            Deactivate with: suseconnect -d -p sle-ha/15.5/x86_64
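
If the output shows that the extension is not activated, it can be added with SUSEConnect; on plain SLES this requires a separate HA registration code (the placeholder below is not a real code):

# SUSEConnect -p sle-ha/15.5/x86_64 -r <REGCODE>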

Then install all the packages needed to run HA on both nodes:

# zypper in -y iputils
 ..
# zypper in -y -t pattern ha_sles
 ..

On one node, say nfs1, initialize the cluster:

root@nfs1:~ # crm cluster init -u -n NFS
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: SSH key for root does not exist, hence generate it now
INFO: The user 'hacluster' will have the login shell configuration changed to /bin/bash
Continue (y/n)? y
INFO: SSH key for hacluster does not exist, hence generate it now
INFO: Configuring csync2
INFO: Starting csync2.socket service on nfs1
INFO: BEGIN csync2 checking files
INFO: END csync2 checking files
INFO: Configure Corosync (unicast):
  This will configure the cluster messaging layer.  You will need
  to specify a network address over which to communicate (default
  is eth0's network, but you can use the network address of any
  active interface).

Address for ring0 [192.168.120.11]<Enter>
Port for ring0 [5405]<Enter>
INFO: Configure SBD:
  If you have shared storage, for example a SAN or iSCSI target,
  you can use it to avoid split-brain scenarios by configuring SBD.
  This requires a 1 MB partition, accessible to all nodes in the
  cluster.  The device path must be persistent and consistent
  across all nodes in the cluster, so /dev/disk/by-id/* devices
  are a good choice.  Note that all data on the partition you
  specify here will be destroyed.

Do you wish to use SBD (y/n)? n
WARNING: Not configuring SBD - STONITH will be disabled.
INFO: Hawk cluster interface is now running. To see cluster status, open:
INFO:   https://192.168.120.11:7630/
INFO: Log in with username 'hacluster', password 'linux'
WARNING: You should change the hacluster password to something more secure!
INFO: BEGIN Waiting for cluster
...........
INFO: END Waiting for cluster
INFO: Loading initial cluster configuration
INFO: Configure Administration IP Address:
  Optionally configure an administration virtual IP
  address. The purpose of this IP address is to
  provide a single IP that can be used to interact
  with the cluster, rather than using the IP address
  of any specific cluster node.

Do you wish to configure a virtual IP address (y/n)? n
INFO: Configure Qdevice/Qnetd:
  QDevice participates in quorum decisions. With the assistance of 
  a third-party arbitrator Qnetd, it provides votes so that a cluster 
  is able to sustain more node failures than standard quorum rules 
  allow. It is recommended for clusters with an even number of nodes 
  and highly recommended for 2 node clusters.

Do you want to configure QDevice (y/n)? n
INFO: Done (log saved to /var/log/crmsh/crmsh.log)
Explanation: the -u option configures corosync for unicast (udpu) communication instead of multicast, and -n NFS sets the cluster name to "NFS".

Now it's time to connect the second node to the cluster. On nfs2, run:

root@nfs2:~ # crm cluster join
INFO: Join This Node to Cluster:
  You will be asked for the IP address of an existing node, from which
  configuration will be copied.  If you have not already configured
  passwordless ssh between nodes, you will be prompted for the root
  password of the existing node.

IP address or hostname of existing node (e.g.: 192.168.1.1) []192.168.120.11
INFO: SSH key for root does not exist, hence generate it now
INFO: The user 'hacluster' will have the login shell configuration changed to /bin/bash
Continue (y/n)? y
INFO: SSH key for hacluster does not exist, hence generate it now
INFO: Configuring csync2
INFO: Starting csync2.socket service
INFO: BEGIN csync2 syncing files in cluster
INFO: END csync2 syncing files in cluster
INFO: Merging known_hosts
INFO: BEGIN Probing for new partitions
INFO: END Probing for new partitions
Address for ring0 [192.168.120.12]<Enter>
INFO: Hawk cluster interface is now running. To see cluster status, open:
INFO:   https://192.168.120.12:7630/
INFO: Log in with username 'hacluster', password 'linux'
WARNING: You should change the hacluster password to something more secure!
INFO: BEGIN Waiting for cluster
..
INFO: END Waiting for cluster
INFO: Set property "priority" in rsc_defaults to 1
INFO: BEGIN Reloading cluster configuration
INFO: END Reloading cluster configuration
INFO: Done (log saved to /var/log/crmsh/crmsh.log)

The prompts are self-explanatory, and the resulting cluster status looks like this:

# crm status
Status of pacemakerd: 'Pacemaker is running' (last updated 2024-10-03 15:19:15 +03:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: nfs1 (version 2.1.5+20221208.a3f44794f-150500.6.17.1-2.1.5+20221208.a3f44794f) - partition with quorum
  * Last updated: Thu Oct  3 15:19:15 2024
  * Last change:  Thu Oct  3 15:16:02 2024 by root via cibadmin on nfs2
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ nfs1 nfs2 ]

Full List of Resources:
  * No resources
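
In addition to crm status, the corosync layer itself can be checked on either node:

# corosync-cfgtool -s
# corosync-quorumtool -s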

Configure fencing (STONITH)

Setting up fencing is beyond the scope of this article, as the lab environment does not reflect the fencing used in production. You can refer to this example to set up the correct fencing for your environment.

Without fencing, a two-node cluster is unsupported and unsafe; do not use it in production without a working STONITH device.
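
For reference only, a minimal sketch of what SBD-based fencing could look like once a shared SBD device exists and is configured in /etc/sysconfig/sbd on both nodes; adapt the delay and the rest to your environment:

primitive stonith-sbd stonith:external/sbd \
        params pcmk_delay_max=30s
property cib-bootstrap-options: \
        stonith-enabled=true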

Configure cluster resources

We need to install the missing software on both nodes:

# zypper in -y nfs-kernel-server
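
Since the cluster will control the NFS server through the p-nfsserver resource defined below, the systemd unit should not be enabled for autostart on either node:

# systemctl disable --now nfs-server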

Using DRBD Disk Instead of Shared Storage Disk

If you do not have a shared disk from storage, which can be the case in a stretched cluster, you can use a DRBD device instead. This article describes how to configure a DRBD resource. The other resources described here should then be tied to the promoted (master) role of the DRBD resource, as sketched below.
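
A rough sketch of how that binding could look for the resource group created later in this article (g-nfs), assuming a promotable DRBD clone named ms-drbd-nfs; the clone name is illustrative, not taken from this article:

# Run the NFS group only where DRBD is promoted, and only after promotion
colocation col-nfs-with-drbd inf: g-nfs ms-drbd-nfs:Master
order o-drbd-before-nfs Mandatory: ms-drbd-nfs:promote g-nfs:start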

Preparing LVM on a shared disk

We have a shared disk attached to both nodes. I will create an LVM VG with two LVs on it. The VG will be activated by the cluster in exclusive mode, so no VG other than "rootvg" should be activated during boot. This is achieved by setting the volume_list parameter in /etc/lvm/lvm.conf. Set it as shown:

activation {
 ..
	volume_list = [ "rootvg" ]
 ..
}
where "rootvg" is a real name of your root VG.

After correcting the /etc/lvm/lvm.conf file, the initrd must be recreated: the old initrd contains its own copy of the LVM configuration, so other VGs could still be activated during the initrd stage of boot despite the new system-wide file.

# dracut --force

Apply these changes on both nodes.
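
To double-check that the rebuilt initrd picked up the new filter, you can inspect it with lsinitrd; this assumes dracut embeds lvm.conf in the image, which it normally does when the lvm module is included:

# lsinitrd -f /etc/lvm/lvm.conf | grep volume_list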

Let's create the required LVM structure on the shared disk, working on the first node:

root@nfs1:~ # pvcreate /dev/sda
  Physical volume "/dev/sda" successfully created.
root@nfs1:~ # vgcreate nfsvg /dev/sda
  Volume group "nfsvg" successfully created
root@nfs1:~ # vgs
  VG     #PV #LV #SN Attr   VSize  VFree 
  nfsvg    1   0   0 wz--n- 10.00g 10.00g
  rootvg   1   5   0 wz--n- 60.00g 48.00g
root@nfs1:~ # lvcreate -L64m -n track nfsvg
  Volume nfsvg/track is not active locally (volume_list activation filter?).
  Aborting. Failed to wipe start of new LV.

As you can see, "nfsvg" cannot be activated because of the volume_list filter we just added. Let's complete the task using a temporary lvm.conf:

root@nfs1:~ # cp /etc/lvm/lvm.conf /tmp/
root@nfs1:~ # vi /tmp/lvm.conf  # <- Remove volume_list = [ "rootvg" ] line !!
root@nfs1:~ # export LVM_SYSTEM_DIR=/tmp
root@nfs1:~ # vgchange -ay nfsvg
  0 logical volume(s) in volume group "nfsvg" now active
root@nfs1:~ # lvcreate -L64m -n track nfsvg
  Logical volume "track" created.
root@nfs1:~ # lvcreate -L2G -n data nfsvg
  Logical volume "data" created.
root@nfs1:~ # mkfs.ext4 -j -m0 /dev/nfsvg/track
 ..
root@nfs1:~ # mkfs.xfs /dev/nfsvg/data
 ..
root@nfs1:~ # vgchange -an nfsvg
  0 logical volume(s) in volume group "nfsvg" now active
root@nfs1:~ # unset LVM_SYSTEM_DIR

Configuring resources

We will create the resources by describing them in a text file. This file doubles as a backup of the configuration and can be managed by any version control system, such as Git. Here is the content of the resources.txt file.

# Activate LVM VG in exclusive mode:
primitive p-vg-activate LVM \
        params volgrpname=nfsvg exclusive=true \
        meta target-role=Started \
        op start timeout=30s interval=0s \
        op stop timeout=30s interval=0s \
        op monitor timeout=30s interval=10s
# Mount /export
primitive p-fs-data Filesystem \
        params device="/dev/nfsvg/data" directory="/export" fstype=xfs \
        meta target-role=Started \
        op start timeout=60s interval=0s \
        op stop timeout=60s interval=0s \
        op monitor timeout=40s interval=20s
# Mount /var/lib/nfs/nfsdcltrack
primitive p-fs-track Filesystem \
        params device="/dev/nfsvg/track" directory="/var/lib/nfs/nfsdcltrack" fstype=ext4 \
        meta target-role=Started \
        op start timeout=60s interval=0s \
        op stop timeout=60s interval=0s \
        op monitor timeout=40s interval=20s
# Define VIP for NFS
primitive p-ip-nfsvip IPaddr2 \
        params ip=192.168.120.10 \
        meta target-role=Started \
        op start timeout=20s interval=0s \
        op stop timeout=20s interval=0s \
        op monitor timeout=20s interval=10s
# Starting NFS service
primitive p-nfsserver systemd:nfs-server op monitor interval="30s"
# Exporting NFS
primitive p-nfsexport exportfs \
        params clientspec="192.168.120.0/24" directory="/export" options="sec=sys,no_root_squash,rw" fsid=10 \
        meta target-role=Started \
        op start timeout=40s interval=0s \
        op stop timeout=120s interval=0s \
        op monitor timeout=20s interval=10s
# Group everything together; the order matters
group g-nfs p-vg-activate p-fs-data p-fs-track p-ip-nfsvip p-nfsserver p-nfsexport \
        meta target-role=Started

Once the file is ready, apply it to the configuration.

# crm configure load update resources.txt
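
A quick way to verify the result from any client in the allowed subnet is to query and mount the export through the VIP (the mount point is arbitrary):

client:~ # showmount -e 192.168.120.10
client:~ # mount -t nfs 192.168.120.10:/export /mnt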
