The health of the system is monitored using a collection of JumpScripts, documented in Monitoring the System Health.
The Status Overview page gives you an immediate view of the health of the system.
You can access the Status Overview page in two ways:
By clicking the green/orange/red status dot in the top navigation bar:
Or via the left navigation bar: under Grid Portal, click Status Overview.
Under Process Status you get an overview of the system health based on the last health check.
Clicking Run Health Check schedules a new health check to start immediately.
Clicking any of the Details links brings you to the Node Status page, providing detailed health information for the selected node:
Clicking Run Health Check on Node will first ask you for your confirmation:
Once confirmed, all health check jobs (JumpScripts) start, which you can verify on the Jobs page:
On the Node Status page you can see more details by clicking the various section titles. You can also start the health check for any of the items listed under each section.
Depending on the type of node, the following sections are available:
You can mute or unmute specific health checks. See here how to configure it:
Depending on the node, you will see information about "orphan" disks or "orphan" virtual machines.
In case of the master node, it looks like this:
In case of a CPU node you will get an overview of all "orphan" virtual machines. These are virtual machines that are marked as destroyed in the Grid and Cloud Broker Portal but still exist on a physical node. This is unwanted, so "orphan" virtual machines are removed as part of the automatic health checks.
To remove "orphan" virtual machines manually, run the following commands on the command line of the physical machine where the "orphan" virtual machine exists:
vm="vm-8"
disks="$(virsh dumpxml $vm | grep 'source file' | cut -d "'" -f 2)"
virsh destroy $vm; virsh undefine $vm
rm $disks
rm -rf /mnt/vmstor/$vm
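Because these commands delete data, it can help to preview what will be removed first. The following is a minimal sketch, not part of the product: the function names `extract_disk_paths` and `remove_orphan_vm` and the `--yes` flag are hypothetical, and it assumes the same `/mnt/vmstor/<vm>` layout and the same `grep`/`cut` pipeline as the manual commands above.

```shell
#!/bin/sh
# Hypothetical helper: read libvirt domain XML on stdin and print the
# disk image paths, using the same grep/cut pipeline as the manual steps.
extract_disk_paths() {
    grep 'source file' | cut -d "'" -f 2
}

# Hypothetical helper: print what would be removed for one VM.
# Nothing is deleted unless the second argument is "--yes".
remove_orphan_vm() {
    vm="$1"
    confirm="${2:-}"

    # Refuse an empty name, otherwise the rm -rf below would target /mnt/vmstor/ itself.
    if [ -z "$vm" ]; then
        echo "usage: remove_orphan_vm <vm-name> [--yes]" >&2
        return 1
    fi

    disks="$(virsh dumpxml "$vm" | extract_disk_paths)"

    echo "VM:        $vm"
    echo "Disks:"
    echo "$disks"
    echo "Directory: /mnt/vmstor/$vm"

    if [ "$confirm" != "--yes" ]; then
        echo "Dry run only, pass --yes to remove."
        return 0
    fi

    virsh destroy "$vm"
    virsh undefine "$vm"
    # $disks is left unquoted on purpose so that each path becomes its own argument.
    if [ -n "$disks" ]; then
        rm -- $disks
    fi
    rm -rf "/mnt/vmstor/$vm"
}
```

Calling `remove_orphan_vm vm-8` only prints the disks and directory that would be affected; `remove_orphan_vm vm-8 --yes` then performs the same steps as the manual commands. Like the original commands, the sketch assumes disk paths do not contain spaces.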