KIT – ITI Algorithm Engineering – Group Members - Lightweight Checkpointing and Recovery for Iterative Converging Algorithms

Links:PDF
Fault tolerance is important for High Performance Computing, since the mean time between failures (hardware failures) decreases as the number of compute nodes increases. A system with 100.000 nodes each with a mean time between failures of 10 years, would have an expected failure rate of once every 50 minutes. We test various different methods to increase the fault tolerance of PageRank. We show that checkpointless methods can handle failures in the first 40% and finish in the same amount of iteration as a failure free PageRank execution. By estimating the progress of PageRank based on the convergence we manage to write checkpoints after 40% completion, relying on checkpointless methods beforehand. This allows us to reduce the checkpoint overhead by 40%. We also use different compression methods to compress the checkpoints. We try both lossless and lossy compression methods and show that checkpoints compressed with ZFP (lossy) cause around 40% less overhead than uncompressed checkpoints. We can reduce the checkpoint over- head of ZFP by another 16% by writing differential checkpoints, one ZFP checkpoint of the PageRank values followed by 3 strongly compressed SZ3 (lossy) checkpoints of the difference between the current PageRank values and the PageRank values at the time of the last ZFP checkpoint. In most of our test our adaptive approach, with a 1∼2% checkpoint overhead, manages to handle failures without increasing the iteration time of PageRank.