The list below is only a snapshot in time of the monitoring JumpScripts.
To see the JumpScripts actually available in your environment, check the JumpScript page in the Grid Portal, where you can filter on monitor.health.
To check the actual system health, go to the Status Overview page in the Grid Portal, which is also reachable by clicking the green, orange or red bullet in the top navigation bar.
For more information about the Status Overview page go to the dedicated section here.
Below you will see an overview of all JumpScripts, organized in the same way (sections) as on the Node Status page.
Depending on the type of node, the following sections are available:
cpu_ctxpy_check.py checks the number of CPU context switches per second. If the value is higher than expected, an error condition is thrown.
cpu_interrupts_check.py checks the number of interrupts per second. If the value is higher than expected, an error condition is thrown.
cpu_mem_core_check.py checks memory and CPU usage/load. If the hourly average is higher than expected, an error condition is thrown.
openfd_check.py checks the number of open file descriptors for each process
swap_used_check.py checks the amount of swap used by the system
threads_check.py checks the number of threads and throws an error if it is higher than expected
db_check.py checks the status of the MongoDB and InfluxDB databases on the master. If a database is not running, an error condition is thrown.
disk_orphan.py checks for orphan disks on volume driver nodes and generates a warning if orphan disks exist on the specified volumes. It is scheduled by disk_orphan_schedule.py, running on the master, and throws an error condition for each orphan disk found.
vm_orphan.py checks if libvirt still has VMs that are not known by the system
disk_usage_check.py checks the status of all physical disks and partitions on all nodes, reporting the free disk space on each mount point. It throws an error condition for each disk that is almost full (more than 90% used).
fan_check.py checks the fans of a node using IPMItool
networkbond_check.py monitors whether a network bond, if one is present, has all of its interfaces (two or more) properly active
psu_check.py checks the power redundancy of a node using IPMItool
raid_check.py checks whether all configured RAID devices are still healthy
networkperformance.py tests bandwidth between storage nodes, volume drivers and itself
ovs_healthcheck.py calls the standard Open vStorage health checks, see: https://github.com/openvstorage/openvstorage-health-check
ovsstatus.py checks every predefined period (default 60 seconds) if all OVS processes are still running
deployment_test.py tests every predefined period (default 30 minutes) whether the test VM exists and, if it does, tests its write speed. Every 24 hours the test VM is recreated.
publicipswatcher.py checks the status of the available public IPs
routeros_check.py checks the status of RouterOS. If RouterOS was shut down unexpectedly it will be restarted. (scheduled by routeros_check_schedule.py)
redis_usage_check.py checks Redis server status
nodestatus.py checks the status of each stack (CPU node)
temp_check.py checks the CPU + disk temperature of the system
workerstatus_check.py monitors the workers, checking whether they regularly report back to their agent for new tasks
transition_check.py recovers OVC objects stuck in transition states. It applies to nodes, virtual machines, cloudspaces, accounts and disks. Logic of the healthcheck can be found here.
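
Many of the checks listed above follow the same pattern: measure a value, compare it with a threshold, and report an error condition when the threshold is exceeded. The sketch below illustrates that pattern for a disk usage check such as disk_usage_check.py. It is not the actual JumpScript: the function names, the result dictionary and the "OK"/"ERROR" state strings are assumptions made for illustration, and only the 90% threshold comes from the description above.

```python
import shutil

# Threshold taken from the ">90%" figure quoted for disk_usage_check.py.
FULL_THRESHOLD_PERCENT = 90.0


def classify_usage(used_bytes, total_bytes, threshold=FULL_THRESHOLD_PERCENT):
    """Classify one mount point (hypothetical result format, not the real one)."""
    if total_bytes <= 0:
        # A size of zero means the usage cannot be computed at all.
        return {"state": "ERROR", "percent_used": None,
                "message": "unknown disk size"}
    percent_used = 100.0 * used_bytes / total_bytes
    # Strictly greater than the threshold counts as "almost full".
    state = "ERROR" if percent_used > threshold else "OK"
    return {"state": state, "percent_used": round(percent_used, 1),
            "message": "%.1f%% used" % percent_used}


def check_mount_points(mount_points):
    """Run the classification for each given mount point path."""
    results = []
    for path in mount_points:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        result = classify_usage(usage.used, usage.total)
        result["path"] = path
        results.append(result)
    return results


if __name__ == "__main__":
    for result in check_mount_points(["/"]):
        print(result)
```

Keeping the threshold comparison in a separate function from the measurement makes the decision logic testable without touching a real disk.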