This example is based on two VMs running on a KVM host. A Pacemaker/Corosync cluster on CentOS 7.3 will manage our resources, DRBD will emulate a shared disk, and KVM fencing will resolve split-brain situations.
The following steps should be performed on both nodes:
Decide on names, including the cluster node names (here node01 and node02), the cluster name (here ha) and the VIP name (here hajnk).
Important NOTE: Names have to be DNS compliant (for example, underscores are not allowed).
For now we disable SELinux and the firewall. I may investigate later how to turn them back on.
# getenforce
Disabled
If you see different output, fix /etc/sysconfig/selinux:
# cat /etc/sysconfig/selinux
SELINUX=disabled
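If it is not already disabled, a one-liner such as this should do it (on CentOS 7, /etc/sysconfig/selinux is normally a symlink to /etc/selinux/config, so we edit the real file; the change takes effect after the reboot done later in this chapter):

# sed -i -e 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config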
Uninstall the firewall packages:
# for rpm in iptables-services firewalld ; do rpm -e $rpm ; done
Disable IPv6: edit /etc/default/grub and fix the GRUB_CMDLINE_LINUX line, replacing rhgb quiet with net.ifnames=0 biosdevname=0 vga=0x314 ipv6.disable=1 rd.shell. Here is the result:
# grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="rd.lvm.lv=rootvg/root rd.lvm.lv=rootvg/swap vga=0x314 net.ifnames=0 ipv6.disable=1 rd.shell"
After fixing /etc/default/grub, generate a new /boot/grub2/grub.cfg:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Remove the NetworkManager and biosdevname RPMs:
# rpm -e $(rpm -qa | grep NetworkManager) biosdevname
It is a good idea to reboot now. The network interfaces will be renamed back to ethX and no IPv6 addresses will be assigned.
Configure the IP address in /etc/sysconfig/network-scripts/ifcfg-eth0 if not already done.
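For reference, a minimal static configuration for node01 could look like this (the /24 prefix, the 192.168.122.1 gateway and the DNS server are assumptions matching a default KVM NAT network; adjust them to your own topology):

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.122.21
PREFIX=24
GATEWAY=192.168.122.1
DNS1=192.168.122.1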
All IPs and network names used should be resolvable either by DNS or by /etc/hosts. Even if DNS works well, put everything in /etc/hosts; this speeds up name resolution for the cluster and removes the dependency on DNS.
# echo -e "$(hostname -i)\t$(hostname -f) $(hostname -s)" >> /etc/hosts
Do not forget to add an entry for the Virtual IP (VIP). Put both the short and the FQDN forms of each name. Once /etc/hosts is complete, copy it to the other node.
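The relevant part of /etc/hosts could then look like this (the example.local domain is just a placeholder; the addresses match the ones used later in this guide):

192.168.122.21   node01.example.local node01
192.168.122.22   node02.example.local node02
192.168.122.20   hajnk.example.local  hajnk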
Both nodes have to be time-synchronized; using NTP is preferable. Install the RPMs:
# yum install -y net-tools ntp ntpdate sysstat eject wget rsync
Fix /etc/ntp.conf with the correct NTP-SERVER-NAME-OR-IP and remove the IPv6-related restrict ::1 line, otherwise the service will fail to start.
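The relevant part of the edit boils down to something like this (iburst is optional; it just speeds up the initial synchronization):

server NTP-SERVER-NAME-OR-IP iburst

Once the configuration is in place, force an initial synchronization, save it to the hardware clock and start the service: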
# ntpdate -b NTP-SERVER-NAME-OR-IP
# hwclock -w
# systemctl enable ntpd.service
# systemctl start ntpd.service
After some time (>5 min), check the synchronization progress with the ntpq -p command.
It is preferable that both nodes have the same SSH host keys, otherwise switching the service between nodes can cause problems for SSH-based services (scripts). This is already true for my installation, because both VMs were cloned from the same template. If you have independent images, copy /etc/ssh/ssh_host* from one node to the other and restart the SSH service.
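If that is your case, a quick way to align the host keys is a copy like the following (run from node01; adjust as needed, and note that restarting sshd does not drop existing sessions):

root@node01:~ # scp -p /etc/ssh/ssh_host* node02:/etc/ssh/
root@node01:~ # ssh node02 systemctl restart sshd.service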
Passwordless SSH access for root between the nodes should be configured. This is probably no longer a strict requirement, because most cluster interconnections are done by the corosync protocol with its own keys and signatures, but it is very handy.
Generate a root SSH key on one node:
root@node01:~ # ssh-keygen -t rsa -b 2048 -C "root@ha"
...
root@node01:~ # cd .ssh
root@node01:~/.ssh # cp id_rsa.pub authorized_keys
Add the host keys to the known_hosts file:
root@node01:~/.ssh # ssh-keyscan $(hostname -s) | \
    sed -e 's/'$(hostname -s)'/'$(hostname -f),$(hostname -s)'/' | \
    awk '/'$(hostname -s)'/ {a=$1;gsub("01","02",a);print a","$0;}' > known_hosts
Copy the credentials to the other node:
root@node01:~/.ssh # cd
root@node01:~ # scp -pr .ssh node02:
Install the cluster packages on both nodes:
# yum install -y pcs pacemaker fence-agents-all openssl-devel
Install jenkins.rpm on both nodes as described in the Jenkins Wiki. Disable Jenkins from autostarting:
# chkconfig jenkins off
# chkconfig --del jenkins
Installing the DRBD software involves compiling it from sources. This can be done on one of the nodes, or even on another CentOS server of the same version with exactly the same kernel version. The resulting RPMs can then be copied to the cluster nodes and installed.
Install the development packages required to compile the software:
# mkdir -p /root/rpmbuild/{BUILD,BUILDROOT,RPMS,SOURCES,SPECS,SRPMS}
# yum -y install gcc make automake autoconf libxslt libxslt-devel flex rpm-build kernel-devel
Download and compile the latest DRBD tarball from Linbit:
# cd /tmp
/tmp # wget -q -O - http://oss.linbit.com/drbd/8.4/drbd-8.4.8-1.tar.gz | tar zxvfB -
/tmp # cd drbd-8.4.8-1
/tmp/drbd-8.4.8-1 # make km-rpm
Download the latest DRBD utils tarball from Linbit:
# cd /tmp
/tmp # wget -q -O - http://oss.linbit.com/drbd/drbd-utils-8.9.6.tar.gz | tar zxvfB -
/tmp # cd drbd-utils-8.9.6
A patch to drbd.spec.in should be applied, otherwise the resulting RPMs will fail to install:
--- drbd.spec.in        2017-03-27 13:59:28.930525786 +0300
+++ drbd.spec.in        2017-03-27 13:59:16.973397350 +0300
@@ -31,6 +31,7 @@
 # conditionals may not contain "-" nor "_", hence "bashcompletion"
 %bcond_without bashcompletion
 %bcond_without sbinsymlinks
+%undefine with_sbinsymlinks
 # --with xen is ignored on any non-x86 architecture
 %bcond_without xen
 %bcond_without 83support
/tmp/drbd-utils-8.9.6 # ./configure --with-pacemaker
/tmp/drbd-utils-8.9.6 # make rpm
If everything went OK, the resulting RPMs are here:
# find /root/rpmbuild -type f -name "*.rpm"
/root/rpmbuild/RPMS/x86_64/drbd-km-3.10.0_514.6.1.el7.x86_64-8.4.8-1.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-utils-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-heartbeat-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-udev-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-xen-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-debuginfo-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-bash-completion-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-pacemaker-8.9.6-1.el7.centos.x86_64.rpm
/root/rpmbuild/RPMS/x86_64/drbd-km-debuginfo-8.4.8-1.x86_64.rpm
The kernel module is compiled for the currently running kernel version. Be aware that the DRBD module must be recompiled after every kernel update (or, in other words, forget about kernel updates).
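If you choose the latter, kernel updates can be excluded from yum. This is only a suggestion, and it assumes a stock /etc/yum.conf containing just the [main] section:

# echo "exclude=kernel*" >> /etc/yum.conf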
Install the resulting RPMs, then copy them to the other node and install them there too:
# rpm -ihv /root/rpmbuild/RPMS/x86_64/*rpm
# scp /root/rpmbuild/RPMS/x86_64/*rpm node02:/tmp
# ssh node02 rpm -ihv /tmp/\*rpm
Disable autostart of the drbd service on both nodes, as it will be managed by the cluster software:
# systemctl disable drbd.service
# systemctl start drbd.service
Starting the service without any defined resources just loads the drbd kernel module. This is needed for initializing the DRBD device in the next chapter.
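You can confirm that the module is loaded with:

# lsmod | grep drbd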
Disable the LVM write cache (per DRBD recommendations) on both nodes:
# sed -i -e 's/write_cache_state.*/write_cache_state = 0/' /etc/lvm/lvm.conf
Create an LV (on both nodes) that will be used as the DRBD backing device:
# lvcreate -n jenkins -L10g /dev/rootvg
Create the /etc/drbd.d/jenkins.res resource file with content similar to this:
resource jenkins {
    meta-disk internal;
    device /dev/drbd0;
    disk /dev/rootvg/jenkins;
    net {
        protocol C;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on node01 {
        address 192.168.122.21:7791;
    }
    on node02 {
        address 192.168.122.22:7791;
    }
}
Copy the resource file to the second node:
root@node01:~ # rsync -av /etc/drbd.d/jenkins.res root@node02:/etc/drbd.d/jenkins.res
Initialize the new resource (on both nodes):
# /sbin/drbdadm -- --force create-md jenkins
# /sbin/drbdadm up jenkins
# cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by root@node01, 2017-01-20 01:23:12
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10485404
This status shows that the resource is up on both nodes (Connected), the data itself is not synchronized (Inconsistent on both nodes) and DRBD has not selected a primary (both are Secondary). Let's help it select the primary. Because the disk is still empty, the initial full resync can be skipped:
root@node01:~ # /sbin/drbdadm -- --clear-bitmap new-current-uuid jenkins/0
root@node01:~ # /sbin/drbdadm primary --force jenkins
root@node01:~ # cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by root@node01, 2017-01-20 01:23:12
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:298436 nr:0 dw:298436 dr:1404 al:69 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
This is the normal working state of DRBD: both nodes are up (Connected), the data is UpToDate on both nodes and one of them is Primary. The resource will be managed by the cluster software, therefore we should make it Secondary on both nodes for now:
root@node01:~ # while ! /sbin/drbdadm secondary jenkins ; do sleep 2 ; done
root@node01:~ # cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by root@node01, 2017-01-20 01:23:12
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:298436 nr:0 dw:298436 dr:1404 al:69 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
Set a complex password for the hacluster user (on both nodes) and make it non-expiring.
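One way to do this (chage -M -1 removes the password expiry check; run on both nodes):

# passwd hacluster
# chage -M -1 hacluster

Then enable and start the pcsd service (on both nodes):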
# systemctl enable pcsd.service
# systemctl start pcsd.service
Authorize the nodes against pcsd (on one node):
root@node01:~ # pcs cluster auth node01 node02
Username: hacluster
Password:
node02: Authorized
node01: Authorized
If you want to automate cluster formation via a script, or you dislike having unnecessary passwords in the system, pcsd authorization can be done in an alternative way.
Start and stop pcsd on one node, then inspect the content of the /var/lib/pcsd directory; there should be certificate files in it. Create a pcs_users.conf file with content similar to this:
[
  {
    "creation_date": "Wed Feb 08 16:40:54 +0200 2017",
    "username": "hacluster",
    "token": "00000000-0000-0000-0000-000000000000"
  }
]
The token is a hexadecimal string; it can probably be any string (not verified). In this example every character is replaced with a zero to force you to generate your own token instead of copy-pasting this one.
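A convenient way to generate such a token is the uuidgen utility (part of util-linux, installed by default on CentOS 7):

# uuidgen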
Then create a file named tokens with this content:
{
  "format_version": 2,
  "data_version": 2,
  "tokens": {
    "node01": "00000000-0000-0000-0000-000000000000",
    "node02": "00000000-0000-0000-0000-000000000000"
  }
}
Replace the zeroes with the token you generated for the previous file. Use the same token for both nodes. If your cluster will have more than two nodes, add all of them here.
Copy the whole /var/lib/pcsd directory to the other node, then enable and start pcsd on every node:
# systemctl enable pcsd.service
# systemctl start pcsd.service
Now create, enable and start the cluster (on one node):
root@node01:~ # pcs cluster setup --start --enable --name ha node01 node02 --transport udpu
Destroying cluster on nodes: node01, node02...
node01: Stopping Cluster (pacemaker)...
node02: Stopping Cluster (pacemaker)...
node02: Successfully destroyed cluster
node01: Successfully destroyed cluster
Sending cluster config files to the nodes...
node01: Succeeded
node02: Succeeded
Starting cluster on nodes: node01, node02...
node01: Starting Cluster...
node02: Starting Cluster...
node01: Cluster Enabled
node02: Cluster Enabled
Synchronizing pcsd certificates on nodes node01, node02...
node02: Success
node01: Success
Restarting pcsd on the nodes in order to reload the certificates...
node02: Success
node01: Success
Check the resulting status:
root@node01:~ # pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: node02 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
 Last updated: Sat Jan 21 12:14:41 2017
 Last change: Sat Jan 21 12:14:28 2017 by hacluster via crmd on node02
 2 nodes and 0 resources configured

PCSD Status:
  node01: Online
  node02: Online
Ensure fencing/STONITH is disabled for the duration of the configuration:
root@node01:~ # pcs property set stonith-enabled=false
root@node01:~ # pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ha
 dc-version: 1.1.15-11.el7_3.2-e174ec8
 have-watchdog: false
 stonith-enabled: false
Cluster configuration is done from one of the nodes using the pcs command. The changes are propagated to the other nodes via the corosync and pcsd services. All nodes must be available for the changes to succeed.
The shared disk (DRBD) will be initialized as an LVM device; three Logical Volumes (LVs) will be created and mounted where Jenkins wants them. A Virtual IP (VIP) and the jenkins service will complete the set of resources.
First we add the DRBD resource, because the rest depend on it:
root@node01:~ # pcs resource list drbd
ocf:linbit:drbd - Manages a DRBD device as a Master/Slave resource
systemd:drbd
root@node01:~ # pcs resource describe ocf:linbit:drbd
Not many parameters exist. A master/slave set should be defined too:
root@node01:~ # pcs resource create jenkins-drbd ocf:linbit:drbd drbd_resource=jenkins
root@node01:~ # pcs resource master master-jenkins-drbd jenkins-drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
root@node01:~ # pcs status
Cluster name: ha
Stack: corosync
Current DC: node01 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Sat Jan 21 18:54:30 2017
Last change: Sat Jan 21 18:54:25 2017 by root via cibadmin on node01
2 nodes and 2 resources configured

Online: [ node01 node02 ]

Full list of resources:

 Master/Slave Set: master-jenkins-drbd [jenkins-drbd]
     jenkins-drbd (ocf::linbit:drbd): FAILED node02 (blocked)
     jenkins-drbd (ocf::linbit:drbd): FAILED node01 (blocked)

Failed Actions:
* jenkins-drbd_stop_0 on node02 'not configured' (6): call=6, status=complete, exitreason='none', last-rc-change='Sat Jan 21 18:52:11 2017', queued=0ms, exec=45ms
* jenkins-drbd_stop_0 on node01 'not configured' (6): call=6, status=complete, exitreason='none', last-rc-change='Sat Jan 21 18:52:12 2017', queued=0ms, exec=43ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
It is recommended to define both the base resource and its master/slave definition simultaneously, by putting all the definitions into one file and applying it in one shot (a sketch of this approach follows the cleanup below). I configured it on a live system, therefore the resource is marked "FAILED". This is not a problem:
root@node01:~ # pcs resource cleanup jenkins-drbd
Cleaning up jenkins-drbd:0 on node01, removing fail-count-jenkins-drbd
Cleaning up jenkins-drbd:0 on node02, removing fail-count-jenkins-drbd
Waiting for 2 replies from the CRMd.. OK
root@node01:~ # pcs status
Cluster name: ha
Stack: corosync
Current DC: node01 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Sat Jan 21 18:55:20 2017
Last change: Sat Jan 21 18:55:13 2017 by hacluster via crmd on node01
2 nodes and 2 resources configured

Online: [ node01 node02 ]

Full list of resources:

 Master/Slave Set: master-jenkins-drbd [jenkins-drbd]
     Masters: [ node02 ]
     Slaves: [ node01 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
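For reference, the one-file approach mentioned above looks roughly like this (a sketch using the pcs offline-CIB workflow; drbd_cfg is just a scratch file name, not anything required by pcs):

root@node01:~ # pcs cluster cib drbd_cfg
root@node01:~ # pcs -f drbd_cfg resource create jenkins-drbd ocf:linbit:drbd drbd_resource=jenkins
root@node01:~ # pcs -f drbd_cfg resource master master-jenkins-drbd jenkins-drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
root@node01:~ # pcs cluster cib-push drbd_cfg

Because both definitions land in the CIB at the same time, the resource never appears in the intermediate "FAILED" state shown above.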
According to the pcs status output above, the DRBD master (primary) runs on node02. You can verify this with cat /proc/drbd.
The shared disk here is /dev/drbd/by-res/jenkins/0; let's initialize it (do this on the node where the primary DRBD runs, node02 according to the output above):
root@node02:~ # pvcreate --dataalignment=4k /dev/drbd/by-res/jenkins/0
root@node02:~ # vgcreate jenkinsvg /dev/drbd/by-res/jenkins/0
root@node02:~ # for lv in cache log lib ; do
    lvcreate -L2g -n $lv jenkinsvg
    mkfs.ext3 -j -m0 /dev/jenkinsvg/$lv
done
root@node02:~ # vgchange -an jenkinsvg
root@node02:~ # pvs
  PV         VG        Fmt  Attr PSize  PFree
  /dev/drbd0 jenkinsvg lvm2 a--  10.00g 4.00g
  /dev/vda2  rootvg    lvm2 a--  19.75g 3.75g
Thus, the disk was initialized by LVM and contains a Volume Group (VG) named "jenkinsvg". Let's create a cluster resource for it:
root@node02:~ # pcs resource describe ocf:heartbeat:LVM
According to the description, /etc/lvm/lvm.conf should be fixed again. The following lines should be changed:
use_lvmetad = 0
volume_list = [ "rootvg" ]
A "rootvg" is name of my current rootvg. It should be always available. Copy lvm.conf to another node and repeat the following commands on both nodes. A new initrd should be produced to reflect these changes:
# systemctl disable lvm2-lvmetad.service
# mkinitrd -f /boot/initramfs-$(uname -r).img $(uname -r) && reboot
Once the cluster nodes come back, add the resource and check /var/log/messages for successful resource activation:
# pcs resource create jenkins-vg ocf:heartbeat:LVM volgrpname=jenkinsvg exclusive=true
You will see that the new resource failed on the other node (where DRBD is slave), but succeeded on the master node.
Next, create the Filesystem resources for the three LVs:
# pcs resource describe ocf:heartbeat:Filesystem
# pcs resource create jenkins-fs-cache ocf:heartbeat:Filesystem device=/dev/jenkinsvg/cache directory=/var/cache/jenkins fstype=ext3
# pcs resource create jenkins-fs-lib ocf:heartbeat:Filesystem device=/dev/jenkinsvg/lib directory=/var/lib/jenkins fstype=ext3
# pcs resource create jenkins-fs-log ocf:heartbeat:Filesystem device=/dev/jenkinsvg/log directory=/var/log/jenkins fstype=ext3
Observe that all file systems are mounted on one node. Fix permissions on this node for future use:
# chown jenkins:jenkins /var/cache/jenkins /var/lib/jenkins /var/log/jenkins
# chmod o-rx /var/cache/jenkins /var/log/jenkins
Add the Virtual IP resource:
# pcs resource list ipaddr
# pcs resource describe ocf:heartbeat:IPaddr
# pcs resource create jenkins-ip ocf:heartbeat:IPaddr ip=192.168.122.20
Add the Jenkins service itself as an LSB resource:
# pcs resource list lsb
# pcs resource create jenkins-service lsb:jenkins
Grouping resources adds a constraint to run the whole group on the same node and also defines the start and stop order. Therefore, the order of the resources matters here: first we activate the VG, then mount the file systems, then start the service. The VIP has to come before the service.
# pcs resource group add jenkins jenkins-vg jenkins-fs-cache jenkins-fs-lib jenkins-fs-log jenkins-ip jenkins-service
All of this should happen only where the master DRBD copy runs, so add the following constraints:
# pcs constraint colocation add jenkins master-jenkins-drbd INFINITY with-rsc-role=Master
# pcs constraint order promote master-jenkins-drbd then start jenkins
Clean the error status from the previous runs and see the result:
# pcs resource cleanup --node node01
# pcs resource cleanup --node node02
# pcs status
Cluster name: ha
Stack: corosync
Current DC: node02 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Sat Jan 21 20:15:19 2017
Last change: Sat Jan 21 20:13:39 2017 by root via cibadmin on node02
2 nodes and 8 resources configured

Online: [ node01 node02 ]

Full list of resources:

 Master/Slave Set: master-jenkins-drbd [jenkins-drbd]
     Masters: [ node02 ]
     Slaves: [ node01 ]
 Resource Group: jenkins
     jenkins-vg (ocf::heartbeat:LVM): Started node02
     jenkins-fs-cache (ocf::heartbeat:Filesystem): Started node02
     jenkins-fs-lib (ocf::heartbeat:Filesystem): Started node02
     jenkins-fs-log (ocf::heartbeat:Filesystem): Started node02
     jenkins-ip (ocf::heartbeat:IPaddr): Started node02
     jenkins-service (lsb:jenkins): Started node02

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
There is no point in testing the cluster, especially one with DRBD, without fencing. These two nodes run on a KVM host, therefore fence-virt will be used. Here is a very good guide to setting up fencing for our POC environment.
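As a preview of the cluster-side part: assuming fence_virtd is already configured on the KVM host, the shared key has been distributed to /etc/cluster/fence_xvm.key on both guests, and the fence-virt package (which provides the fence_xvm agent) is installed, the setup boils down to something like the sketch below. The stonith resource name fence-kvm and the libvirt domain names kvm-node01/kvm-node02 in pcmk_host_map are hypothetical placeholders; replace them with your actual VM names.

root@node01:~ # fence_xvm -o list
root@node01:~ # pcs stonith create fence-kvm fence_xvm key_file=/etc/cluster/fence_xvm.key pcmk_host_map="node01:kvm-node01;node02:kvm-node02"
root@node01:~ # pcs property set stonith-enabled=true

fence_xvm -o list should print the libvirt domains reachable through the host's fence_virtd; only after that works does it make sense to re-enable STONITH and start failover testing.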