Adding resiliency to your job scheduler can make a real difference in the overall reliability of your cluster. On a shared-memory system, a single hardware failure can bring the entire system down, forcing a restart of all jobs. In a cluster, though, a single hardware failure will usually affect only a single job... unless the failure occurs in the hardware running your scheduler! If you lose the job scheduling state, a complete restart of all jobs might also be necessary. Take a look at my suggestions for building resiliency into your job scheduler. Let me know what you think below!
