Adding resiliency to your job scheduler can make a real difference in the overall reliability of your cluster. With shared memory systems, a single hardware failure can bring your entire system down, forcing a restart of all jobs. In a cluster, though, a single hardware failure will usually affect only a single job... unless the failure occurs in the hardware running your scheduler! If you lose the job scheduling state, a complete restart of all jobs might also be necessary. Take a look at my suggestions for building resiliency into your job scheduler. Let me know what you think below!