
Job Checkpointing

Tip

We recommend implementing checkpointing on any job running longer than a day!

Job/Application Checkpointing is the capturing of a program's state so that it can be restarted from that point in case of failure. This is especially important in long-running jobs.

How checkpointing can be implemented depends on the application/code being used; some have built-in methods, whereas others might require some scripting.

The application-specific documentation should be the first place you look when trying to implement checkpointing.

Warning

Checkpointing may add to the runtime of a job, especially if you are using large amounts of memory.

Before implementing checkpointing, consider:

  • Additional runtime added by checkpointing.
  • Length of the job being checkpointed.
  • Time taken to implement checkpointing.
  • How many times you will reuse this method.

Queuing

Checkpointing has the added advantage of letting you split your work into smaller jobs, which move through the queue faster and allow you to run work for longer than the maximum job time limit.

The simplest implementation is to split the work into smaller chunks and, in your script, load saved state from disk at the start of the job and write it back at the end.
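As a rough sketch of this pattern (the checkpoint file name and the work itself are placeholders, not a specific tool's format), a job script might look something like this:

#!/bin/bash
#SBATCH --time=01:00:00

# Load previous progress from disk, or start from zero on the first run.
# "checkpoint.txt" is a placeholder; use whatever state file suits your workflow.
STEP=$(cat checkpoint.txt 2>/dev/null || echo 0)

# Do the next chunk of work, e.g. iterations from $STEP onwards of a longer calculation.

# Save progress so the next job can carry on from this point.
echo $((STEP + 1)) > checkpoint.txt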

Below is an example of submitting the same job again, provided the previous job has run successfully.

#!/bin/bash
#SBATCH --job-name=checkpointed_job   # Usual Slurm header, #SBATCH options etc.
#SBATCH --time=01:00:00

# Queue the next job now; afterok means it will only start if this job
# finishes successfully. "$0" is the name of this script.
sbatch --dependency=afterok:${SLURM_JOB_ID} "$0"

# Read data from disk

# Do work

# Write data back to disk.

Note that this job will keep resubmitting itself indefinitely until it is stopped.

If you are writing your own code, you can exit with a non-zero code once all the work has been done; the afterok dependency of the queued job will then never be satisfied, so the chain stops.
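For example, assuming progress is tracked in a checkpoint file as in the sketch above (the file name and the total amount of work are placeholders), a check like this at the end of the script ends the chain:

STEP=$(cat checkpoint.txt 2>/dev/null || echo 0)  # progress written by this job
TOTAL_STEPS=100000                                # placeholder: total work to do

# Exiting non-zero means the job queued with afterok will never start,
# which ends the self-resubmitting chain.
if [ "$STEP" -ge "$TOTAL_STEPS" ]; then
    echo "All work finished, stopping here."
    exit 1
fi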

Using --dependency has the advantage of adding the next job to the queue before the current one starts its work, saving queue time between jobs.
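You can check that the next job is waiting with squeue; a job whose dependency has not yet been met is typically shown as pending with the reason Dependency.

# List your jobs; the resubmitted job sits in the queue as pending
# (reason "Dependency") until the current job completes successfully.
squeue --user="$USER"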

Examples