Warewulf Node Health Check (NHC)
Warewulf Node Health Check (NHC) is a script that runs periodically on each compute node to verify that the node is working properly. Nodes that are determined to be "unhealthy" can be marked as down or offline to prevent jobs from being scheduled or run on them. This increases the reliability and throughput of a cluster by reducing preventable job failures caused by misconfiguration, hardware failures, etc.
- Developed at network:cluster
- Sources inherited from project openSUSE:Factory
- 1 derived package
- Checkout Package:
osc -A https://api.opensuse.org checkout openSUSE:Leap:15.2:FactoryCandidates/warewulf-nhc && cd $_
Source Files
Filename | Size |
---|---|
warewulf-nhc-1.4.3.tar.gz | 125 KB |
warewulf-nhc.changes | 2.84 KB |
warewulf-nhc.spec | 3.03 KB |
Latest Revision
Ana Guerrero (anag+factory) accepted request 1127173 from Christian Goll (mslacken) (revision 2)
- updated to 1.4.3 with following new features:
  * toggle BASH tracing or NHC debugging via SIGUSR1/SIGUSR2, respectively
  * check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via nvidia-smi
  * Provide added detail to tracing info (-x mode)
  * Based on feedback from Moe Jette of SchedMD, pull node job data directly from Slurm via squeue instead of the previous method that only worked for single-node jobs.
  * Support for recent additions to the Slurm node states (e.g., "planned")
  * Pathname expansion has been disabled on startup, and re-enabled only when being actively used, to avoid "unintended" expansions of wildcards at random points throughout the code.
  * Correct clobbering of BASH built-in variables and add tests to prevent future recurrence
  * Switch "system UID" boundary handling to a more accurate source of truth, and ensure that the code matches the math, naming, and intent.
  * Reorder resource manager detection to improve accurate detection, especially with respect to Slurm vs. PBS (all variants)
- removed test-test_lbnl_file.nhc-Put-all-process-substitution.patch
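The changelog note about pathname expansion describes a general shell technique that can be sketched briefly. The snippet below is only an illustration and is not taken from the NHC sources: the helper name `expand_glob` is hypothetical. It assumes nothing beyond the standard `set -f` / `set +f` shell options, which turn globbing off and on.

```shell
# Illustrative sketch only; this is not NHC's actual code.

# Disable pathname expansion (globbing) for the rest of the script, so an
# unquoted variable containing '*' or '?' is never expanded by accident.
set -f

# Hypothetical helper: expand one glob pattern, enabling pathname expansion
# only for the single statement that needs it. Prints one match per line.
# If nothing matches, the pattern itself is printed unchanged.
expand_glob() {
    set +f        # re-enable pathname expansion
    set -- $1     # unquoted on purpose: the shell expands the pattern here
    set -f        # disable it again before doing anything else
    printf '%s\n' "$@"
}
```

A caller would pass the pattern quoted, for example `expand_glob "/some/dir/*.conf"` (the path is a placeholder), so that expansion happens only inside the helper. The pattern is still subject to word splitting, so this sketch assumes paths without whitespace.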