OpenShift offers several installation options for various cloud providers, plus one for bare metal hardware. The bare metal option is the basic one underlying all the others: once you are familiar with it, you will manage any other installation.
The current version of OpenShift only supports online installation. An offline option is promised for later.
The bare metal installation requires pre-installing an infrastructure server that will perform DNS and load balancing tasks. The installation reference gives hints about this infra server configuration, but without much detail. Thinking it through, it also makes sense to make this server the default gateway for the internal network of OpenShift servers. The following configuration plan is the result (reconstructed here from what the rest of this walkthrough relies on):
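- DNS for the cluster domain, with dynamic updates allowed;
- DHCP handing out fixed addresses and registering host names in DNS;
- PXE boot services (TFTP) plus an HTTP server (apache) for the RHCOS images and ignition files;
- haproxy load balancing for the API, the machine config server and the ingress HTTP/HTTPS traffic;
- NAT, acting as the default gateway for the internal network.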
Summarizing all these requirements, I developed several Ansible scripts that build the necessary infrastructure server. They should work on any target of the RedHat family, but they have been tested only on CentOS 7 and come without any support or guarantee. The scripts assume that the firewall and SELinux are disabled.
Edit roles/openshift.infra-server/defaults/main.yaml values to fit your needs, then run:
$ ansible-playbook -i "192.168.0.222," role-openshift.infra-server.yml

where the mentioned IP address is that of your future infra server. Of course, your server should meet the usual Ansible prerequisites (Python, SSH, user permissions).
Alternatively, you can use the hints from the installation guide and complete the full infra server setup manually.
The server is almost ready; now we need the RHCOS installation images placed where the PXE configuration expects them:
# mkdir /var/www/html/RHCOS
# wget -O /var/www/html/RHCOS/kernel \
  https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/latest/latest/rhcos-4.3.8-x86_64-installer-kernel-x86_64
# wget -O /var/www/html/RHCOS/initramfs.img \
  https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/latest/latest/rhcos-4.3.8-x86_64-installer-initramfs.x86_64.img
# wget -O /var/www/html/RHCOS/metal.raw.gz \
  https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/latest/latest/rhcos-4.3.8-x86_64-metal.x86_64.raw.gz
# ls -la /var/www/html/RHCOS/
total 862088
drwxr-xr-x 2 root root        61 Apr  5 04:10 .
drwxr-xr-x 3 root root        19 Apr  5 04:05 ..
-rw-r--r-- 1 root root  71105367 Mar 31 13:37 initramfs.img
-rw-r--r-- 1 root root   8106848 Mar 31 13:37 kernel
-rw-r--r-- 1 root root 803561085 Mar 31 13:37 metal.raw.gz
Any Linux workstation is suitable for generating the configuration files and using the OpenShift CLI. Since we already have such a server (the infra server), we will use it as the installation server. Create an ordinary unprivileged user and log in to the infra server as that user.
Use this link to create a "pull secret" and save it somewhere. Download the "latest" installer, unpack it and place it in a directory on your PATH:
$ mkdir ~/bin
$ ( cd ~/bin && \
    curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-install-linux.tar.gz | \
    tar zxvf - )
Do the same for the OpenShift CLI. We will need this much later:
$ ( cd ~/bin && \
    curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz | \
    tar zxvf - )
$ openshift-install version
openshift-install 4.3.8
built from commit f7a2f7cf9ec3201bb8c9ebb677c05d21c72e3cc5
release image quay.io/openshift-release-dev/ocp-release@sha256:a414f6308db72f88e9d2e95018f0cc4db71c6b12b2ec0f44587488f0a16efc42
Create an install-config.yaml file with sample content taken from the installation guide and modify it to suit your environment. This typically includes adding a proxy definition and setting the cluster name and base domain name. Please note that this information must match the DNS settings on the infra server.
Do not forget to update the "pull secret" data and your public SSH key. It is advisable to create a separate key for managing OpenShift, because most likely you will have to share the private key with other administrators. Keep a copy of the file, as the original file will be deleted during processing.
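For example, a dedicated key pair could be generated like this (the file name ~/.ssh/openshift is an arbitrary choice):

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/openshift -N ''

Remember that it is the contents of the resulting .pub file that go into install-config.yaml.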
"Pull secret" is what allows you to retrieve installation and update the data from RedHat. You should care that your subscription will be at least at trial mode.
$ rm -rf ~/work && mkdir ~/work
$ cp ~/install-config.yaml ~/work/
$ openshift-install create manifests --dir ~/work
INFO Consuming Install Config from target directory
WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings
Several Kubernetes manifests will appear in the working directory. According to the installation guide, mastersSchedulable must be set to false in the work/manifests/cluster-scheduler-02-config.yml file for a bare metal installation. Correct this, then convert the manifests to ignition files:
$ vi ~/work/manifests/cluster-scheduler-02-config.yml
$ openshift-install create ignition-configs --dir ~/work
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Common Manifests from target directory
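After the edit, the relevant fragment of cluster-scheduler-02-config.yml should read:

spec:
  mastersSchedulable: false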
The previous contents of the working directory will disappear, and several *.ign files will appear instead. The ignition file plays the same role for RH CoreOS as the kickstart does for RHEL. Copy the files to your PXE server to serve them over HTTP. Make them readable by the apache service, as their default permissions are too restrictive:
$ chmod 644 ~/work/*ign
$ sudo cp -av ~/work/*ign /var/www/html/RHCOS/
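To see how the images and ignition files tie together, a pxelinux menu entry for a worker might look like the sketch below (an illustration only, assuming the paths used above and an HTTP-capable loader such as lpxelinux; the Ansible role may already have generated equivalent entries):

LABEL worker
  MENU LABEL Worker
  KERNEL http://<infra-server>/RHCOS/kernel
  APPEND initrd=http://<infra-server>/RHCOS/initramfs.img coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://<infra-server>/RHCOS/metal.raw.gz coreos.inst.ignition_url=http://<infra-server>/RHCOS/worker.ign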
NOTE: The generated ignition files include SSL certificates for authenticating each other in the being created cluster. These certificates will expire in the next 24 hours for security reasons. You must complete the installation within this period, otherwise you should repeat the entire procedure, starting with deleting the working directory and creating new ignition files.
Although the installation guide recommends that you initially deploy the bootstrap server and then everyone else, my experience has shown that the order does not matter. It even works out better when the master and worker servers are already deployed and waiting for information from the bootstrap server.
Then update the /etc/dhcpd.conf.static file on the infra server with the host names and MAC addresses of all the servers being installed, for example:
..
host bootstrap {
  hardware ethernet <MAC-address-of-your-bootstrap-server-in-lower-case>;
  fixed-address XXX.XXX.XXX.XXX;
  option host-name "bootstrap.<cluster-name>.<base.domain>";
  ddns-hostname "bootstrap";
}
host etcd-0 {
  hardware ..
It is important to use fixed IP addresses for all components in the cluster, as some certificates will be signed for that IP address and may stop working after a renewal.
Once updated, restart the DHCP server to accept the changes:
# service dhcpd restart
Redirecting to /bin/systemctl restart dhcpd.service
Boot all designated worker servers using the network boot method, select "Worker" from the PXE menu and start their installation.
Boot all designated master servers using the network boot method, select "Master" from PXE menu and start their installation.
Boot the bootstrap server using the network boot method, select "Bootstrap" from PXE menu and install it. This server will be released at the end of the installation, so it can be a temporary virtual server or a server designated to be a worker later.
Follow the TFTP and apache logs on the infra server to make sure that the PXE installation succeeds:
# tail -f /var/log/messages /var/log/httpd/access_log
Our haproxy service depends on DNS information about the existing servers. The DNS is dynamically updated by DHCP, which means that until the "etcd-X" servers are deployed, the information about them is missing, and haproxy will refuse to start without it. Wait until the servers being installed reboot a couple of times and the correct host names appear on their consoles; this indicates that dynamic DNS has been updated, provided it is configured correctly. Check that the /etc/haproxy/haproxy.cfg file has entries for all the servers in play, then restart the haproxy service:
# vi /etc/haproxy/haproxy.cfg
# service haproxy restart
Redirecting to /bin/systemctl restart haproxy.service
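For illustration, the API and machine config backends might look like this (a sketch only; the backend names match the stats page mentioned below, but the exact layout comes from the Ansible role, and port 22623 for the machine config server is a standard OpenShift value not spelled out above):

backend masters-api
    balance roundrobin
    server bootstrap bootstrap.osh.example.com:6443 check
    server etcd-0 etcd-0.osh.example.com:6443 check
    server etcd-1 etcd-1.osh.example.com:6443 check
    server etcd-2 etcd-2.osh.example.com:6443 check

backend masters-config
    balance roundrobin
    server bootstrap bootstrap.osh.example.com:22623 check
    server etcd-0 etcd-0.osh.example.com:22623 check
    server etcd-1 etcd-1.osh.example.com:22623 check
    server etcd-2 etcd-2.osh.example.com:22623 check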
Now you can log in to the bootstrap server and monitor its actions. The server is accessible only from our infra server, using the user "core" and a pre-configured SSH key. This user has sudo privileges with the NOPASSWD option.
# ssh core@bootstrap
..
[core@bootstrap ~]$ journalctl -b -f -u bootkube.service
..
etcdctl failed. Retrying in 5 seconds...
The last message means that the bootstrap server has finished starting its services and is waiting for the master servers to continue. To make sure the system is ready for deploying the master servers, you can connect to port 6443 of the infra server and observe an OpenShift certificate:
# echo -n | openssl s_client -connect infra-server:6443 2>/dev/null | openssl x509 -noout -text
Another good debugging resource is our haproxy statistics page. Point your browser at http://infra-server:8404/stats to see how haproxy is doing. The masters-config backend should be UP so that workers and masters can download their configuration and continue deploying. The masters-api backend should be UP to form the future etcd cluster. You will see servers going up and down, being attached to backends and removed; do not worry, just wait.
The rest of the process is fully automated and not documented at all. This step takes at least 20 minutes, during which the servers update CoreOS and deploy the required software. If you are still watching the process on the bootstrap server, wait until the following appears:
[core@bootstrap ~]$ journalctl -b -f -u bootkube.service
..
bootkube.service complete
Another option is to verify this from the installation server:
$ openshift-install --dir work wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer v4.3.0
DEBUG Built from commit 2055609f95b19322ee6cfdd0bea73399297c4a3e
INFO Waiting up to 30m0s for the Kubernetes API at https://api.osh.example.com:6443...
INFO API v1.16.2 up
INFO Waiting up to 30m0s for bootstrapping to complete...
DEBUG Bootstrap status: complete
INFO It is now safe to remove the bootstrap resources
Both methods report that the bootstrap process has completed and that the bootstrap server can be removed. Shut it down.
There are several basic cluster operators that run on worker servers, so you need to deploy at least one worker server from the beginning.
If your bootstrap server was designated to be a worker server later, update /etc/dhcpd.conf.static to reflect this:
..
host c2 {
  hardware ethernet f2:c6:d1:c4:65:99;
  fixed-address XXX.XXX.XXX.XXX;
  option host-name "c2.<cluster-name>.<base.domain>";
  ddns-hostname "c2";
..

and restart the DHCP service:
# service dhcpd restart
Redirecting to /bin/systemctl restart dhcpd.service
Boot the worker server from the network and select "Worker" in the PXE menu. Wait until it reboots a couple of times. Again, update the haproxy configuration, this time for the ingress HTTP and HTTPS load balancers, adding the new worker servers; then restart haproxy and check that it is running.
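The ingress part of the configuration might end up looking like this sketch (the backend names here are my assumption, following the same layout as above; the worker names match the node list shown below):

backend ingress-http
    balance roundrobin
    server c0 c0.osh.example.com:80 check
    server c1 c1.osh.example.com:80 check
    server c2 c2.osh.example.com:80 check

backend ingress-https
    balance roundrobin
    server c0 c0.osh.example.com:443 check
    server c1 c1.osh.example.com:443 check
    server c2 c2.osh.example.com:443 check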
The next step, according to the installation guide, is connecting to the cluster. Export the kubeadmin credentials:
$ export KUBECONFIG=~/work/auth/kubeconfig
$ oc whoami
system:admin
The last message confirms that you can connect to the cluster with administrator privileges.
List the existing nodes:
$ oc get nodes
NAME                     STATUS   ROLES    AGE   VERSION
c0.osh.example.com       Ready    worker   28m   v1.16.2
c1.osh.example.com       Ready    worker   28m   v1.16.2
etcd-0.osh.example.com   Ready    master   28m   v1.16.2
etcd-1.osh.example.com   Ready    master   28m   v1.16.2
etcd-2.osh.example.com   Ready    master   28m   v1.16.2
The new c2 worker server has not been added. The next chapter of the installation guide explains why: recently added servers await administrator approval. Well, some security has to be provided. List the pending certificate signing requests:
$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
..
csr-pf8bb   6m12s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
If you do not see pending requests, the installation of the new server is most likely still in progress.
Then approve it:
$ oc adm certificate approve csr-pf8bb
certificatesigningrequest.certificates.k8s.io/csr-pf8bb approved
The result is a new request, this time from the node itself:
$ oc get csr
NAME        AGE   REQUESTOR                        CONDITION
..
csr-kcnzl   43s   system:node:c2.osh.example.com   Pending
Approve it too. The new worker server is added:
$ oc get nodes
NAME                     STATUS   ROLES    AGE     VERSION
c0.osh.example.com       Ready    worker   53m     v1.16.2
c1.osh.example.com       Ready    worker   53m     v1.16.2
c2.osh.example.com       Ready    worker   2m11s   v1.16.2
etcd-0.osh.example.com   Ready    master   53m     v1.16.2
etcd-1.osh.example.com   Ready    master   53m     v1.16.2
etcd-2.osh.example.com   Ready    master   53m     v1.16.2
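By the way, when several nodes join at once, all pending requests can be approved in one go (a convenience one-liner; it approves everything pending, so use it only when you expect every request):

$ oc get csr -o name | xargs oc adm certificate approve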
Good, what next? Check that all cluster operators start:
$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       23h
cloud-credential                           4.3.8     True        False         False      24h
cluster-autoscaler                         4.3.8     True        False         False      23h
console                                    4.3.8     False       True          False      23h
dns                                        4.3.8     True        False         False      23h
image-registry                             4.3.8     True        False         False      23h
ingress                                    4.3.8     True        False         False      20h
insights                                   4.3.8     True        False         False      24h
kube-apiserver                             4.3.8     True        False         False      23h
kube-controller-manager                    4.3.8     True        False         False      23h
kube-scheduler                             4.3.8     True        False         False      23h
machine-api                                4.3.8     True        False         False      23h
machine-config                             4.3.8     True        False         False      113m
marketplace                                4.3.8     True        False         False      20h
monitoring                                 4.3.8     True        False         False      112m
network                                    4.3.8     True        False         False      24h
node-tuning                                4.3.8     True        False         False      113m
openshift-apiserver                        4.3.8     True        False         False      6h59m
openshift-controller-manager               4.3.8     True        False         False      20h
openshift-samples                          4.3.8     True        False         False      20h
operator-lifecycle-manager                 4.3.8     True        False         False      23h
operator-lifecycle-manager-catalog         4.3.8     True        False         False      23h
operator-lifecycle-manager-packageserver   4.3.8     True        False         False      7h7m
service-ca                                 4.3.8     True        False         False      24h
service-catalog-apiserver                  4.3.8     True        False         False      23h
service-catalog-controller-manager         4.3.8     True        False         False      23h
storage                                    4.3.8     True        False         False      23h

Wait for all operators to become AVAILABLE. If one of them does not within a reasonable time, examine that case specifically:
$ oc get pods --all-namespaces | grep console
openshift-console-operator   console-operator-644498f9db-hb99k   1/1   Running            5     27h
openshift-console            console-669ffdcc9f-5fbzj            0/1   CrashLoopBackOff   297   27h
openshift-console            console-7cddd989d8-rjhwq            0/1   Running            301   27h
openshift-console            console-7cddd989d8-sp2bp            0/1   Running            300   27h
openshift-console            downloads-6f4898c5c9-9scbg          1/1   Running            1     27h
openshift-console            downloads-6f4898c5c9-g2q7l          1/1   Running            3     27h

Then check the logs of this failing operator:
$ oc logs console-7cddd989d8-rjhwq -n openshift-console
..
2020/04/6 14:43:24 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.osh.example.com/oauth/token failed: Head https://oauth-openshift.apps.osh.example.com: Service Unavailable
It seems the console operator depends on the authentication operator, which is also unavailable. Let's check it:
$ oc get pods --all-namespaces | grep auth
openshift-authentication-operator   authentication-operator-5954c6c9d-cq72z   1/1   Running   6   27h
openshift-authentication            oauth-openshift-569dfc5dd7-5n9pj          1/1   Running   0   11h
openshift-authentication            oauth-openshift-569dfc5dd7-hsxx8          1/1   Running   1   11h
$ oc logs oauth-openshift-569dfc5dd7-5n9pj -n openshift-authentication
Copying system trust bundle
I0406 03:35:21.357327       1 secure_serving.go:64] Forcing use of http/1.1 only
I0406 03:35:21.357443       1 secure_serving.go:123] Serving securely on [::]:6443
The service looks up and is listening on port 6443. Continue digging:
$ oc logs authentication-operator-5954c6c9d-cq72z -n openshift-authentication-operator | grep "^E" | tail -1
E0406 14:46:39.344974       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: Service Unavailable
Something is wrong with the routes:
$ oc get route --all-namespaces | egrep "NAME|auth"
NAMESPACE                  NAME              HOST/PORT                              PATH   SERVICES          PORT   TERMINATION            WILDCARD
openshift-authentication   oauth-openshift   oauth-openshift.apps.osh.example.com          oauth-openshift   6443   passthrough/Redirect   None
$ oc get endpoints -n openshift-authentication
NAME              ENDPOINTS                           AGE
oauth-openshift   10.130.0.33:6443,10.131.0.22:6443   27h
$ oc get pods -n openshift-authentication -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
oauth-openshift-569dfc5dd7-5n9pj   1/1     Running   0          11h   10.131.0.22   etcd-2.osh.example.com   <none>           <none>
oauth-openshift-569dfc5dd7-hsxx8   1/1     Running   1          11h   10.130.0.33   etcd-0.osh.example.com   <none>           <none>
A very strange state. The authentication pods run on the master nodes, while the route is registered under *.apps.<cluster-name>.<base.domain>, which, according to the installation guide, points at the load balancer for the workers. That load balancer has no definitions for port 6443, only 80 and 443 (according to the same document).
In my configuration, the apps and api IP addresses are the same, which caused port 6443 to be served by the API load balancer: the requests were forwarded to the API service instead of the apps router.
In addition, the console operator was looking for the authentication service on the plain HTTPS port of oauth-openshift.apps.osh.example.com. It looks like the oauth-openshift route should be fixed and registered on port 443.
$ oc describe route -n openshift-authentication oauth-openshift
Name:             oauth-openshift
Namespace:        openshift-authentication
Created:          20 hours ago
Labels:           app=oauth-openshift
Annotations:      <none>
Requested Host:   oauth-openshift.apps.osh.example.com
                  exposed on router default (host apps.osh.example.com) 20 hours ago
Path:             <none>
TLS Termination:  passthrough
Insecure Policy:  Redirect
Endpoint Port:    6443

Service:   oauth-openshift
Weight:    100 (100%)
Endpoints: 10.130.0.33:6443, 10.131.0.22:6443
Maybe I should change the "Endpoint Port" to 443?
$ oc edit route -n openshift-authentication oauth-openshift
$ oc patch route -n openshift-authentication oauth-openshift -p '{"spec":{"port":{"targetPort":"443"}}}'
Nothing helped. Although both commands executed without errors, the resulting route remained the same. I was stuck here for the day. To be continued...
My setup uses an external proxy to access the Internet. To do this, proxy definitions were added to the install-config.yaml file before generating manifests and ignition files.
The DNS server for "clustername.base.domain", running on the infra server, was not known to the external proxy server, so the requests to "oauth-openshift.apps.clustername.base.domain" were not successful.
There are two possible solutions to this problem: either make the external proxy able to resolve the cluster domain (for example, by forwarding the cluster DNS zones to the infra server), or exclude the cluster domain from proxying altogether with a noProxy definition. I used the second method, because this is not a production installation, but only a sandbox. The resulting proxy definitions should look similar to:
..
baseDomain: example.com
proxy:
  httpProxy: http://192.168.0.254:3128
  httpsProxy: http://192.168.0.254:3128
  noProxy: 192.168.0.0/24,192.168.1.0/24,.osh.example.com
metadata:
  name: osh
..
When doing "oc login" you can get the following error:
$ oc login -u developer
error: x509: certificate signed by unknown authority
This is because the authentication service is signed with an ingress certificate that is unknown to the workstation running oc. The easiest way is to download the "ingress-operator@XXXXX" certificate in PEM format via the browser: open the OpenShift console in the browser, inspect the certificate chain, find the ingress certificate and download it. Put the resulting PEM file into /etc/pki/ca-trust/source/anchors/ and run update-ca-trust.
The other way to get the certificate is to extract it from the secret:
$ oc describe secret router-ca -n openshift-ingress-operator
Name:         router-ca
Namespace:    openshift-ingress-operator
Labels:       <none>
Annotations:  <none>

Type:  kubernetes.io/tls

Data
====
tls.crt:  1074 bytes
tls.key:  1675 bytes
$ oc extract secrets/router-ca --keys tls.crt -n openshift-ingress-operator
tls.crt
The resulting tls.crt is the PEM-formatted ingress CA certificate.
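Either way, installing the certificate into the system trust store on a RedHat-family workstation looks like this (the destination file name is an arbitrary choice):

$ sudo cp tls.crt /etc/pki/ca-trust/source/anchors/openshift-ingress-ca.crt
$ sudo update-ca-trust extract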