Warewulf Node Health Check (NHC)


Warewulf Node Health Check (NHC) is a script that runs periodically on each compute node to verify that the node is working properly. Nodes found to be unhealthy can be marked as down or offline so that jobs are not scheduled or run on them. This improves the reliability and throughput of a cluster by reducing preventable job failures caused by misconfiguration, hardware failures, and similar problems.

Source Files
Filename                    Size (bytes)  Size
warewulf-nhc-1.4.3.tar.gz   128253        125 KB
warewulf-nhc.changes        2912          2.84 KB
warewulf-nhc.spec           3101          3.03 KB
Latest Revision
Ana Guerrero (anag+factory) accepted request 1127173 from Christian Goll (mslacken) (revision 2)
- updated to 1.4.3 with the following new features:
  * toggle BASH tracing or NHC debugging via SIGUSR1/SIGUSR2, respectively
  * check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via
    nvidia-smi
  * Provide added detail to tracing info (-x mode)
  * Based on feedback from Moe Jette of SchedMD, pull node job data directly
    from Slurm via squeue instead of the previous method that only worked for
    single-node jobs.
  * Support for recent additions to the Slurm node states (e.g., "planned")
  * Pathname expansion has been disabled on startup, and re-enabled only when
    being actively used, to avoid "unintended" expansions of wildcards at
    random points throughout the code.
  * Correct clobbering of BASH built-in variables and add tests to prevent
    future recurrence
  * Switch "system UID" boundary handling to a more accurate source of truth,
    and ensure that the code matches the math, naming, and intent.
  * Reorder resource manager detection to improve detection accuracy,
    especially with respect to Slurm vs. PBS (all variants)
- removed test-test_lbnl_file.nhc-Put-all-process-substitution.patch
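
The changelog item about pathname expansion describes a general shell pattern: turn globbing off at startup and switch it back on only around the code that needs it. The following is a minimal sketch of that pattern, not code from the NHC sources. The helper name count_matches, the variables MATCH_COUNT and LITERAL_COUNT, and the temporary demo directory are all hypothetical, and the sketch assumes the temporary directory path contains no whitespace or wildcard characters.

```sh
#!/bin/sh
# Illustrative sketch only; not taken from the NHC sources.

# Disable pathname expansion up front, mirroring the "disabled on startup"
# behavior described in the changelog.
set -f

# count_matches PATTERN (hypothetical helper): re-enable pathname expansion
# just long enough to expand PATTERN, then disable it again.
count_matches() {
    set +f          # pathname expansion on
    set -- $1       # unquoted on purpose: the wildcard is expanded here
    set -f          # pathname expansion off again
    MATCH_COUNT=$#  # number of words the pattern expanded to
}

# Hypothetical demo data: two files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.conf" "$demo_dir/b.conf"

# With expansion disabled, an unquoted wildcard stays one literal word.
set -- $demo_dir/*.conf
LITERAL_COUNT=$#
echo "expansion disabled: $LITERAL_COUNT word"

# Inside the helper, the same pattern expands to the matching files.
count_matches "$demo_dir/*.conf"
echo "expansion enabled inside helper: $MATCH_COUNT files"

rm -rf "$demo_dir"
```

Under the stated assumptions, the first message should report a single word and the second should report two files. Keeping the set +f / set -f pair inside one small function limits the window in which a stray wildcard could expand unintentionally.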