Slurm job scheduling system

Refer to the Slurm documentation for more details.

According to the sbatch man page, we can specify job dependencies with --dependency=<dependency list>. This could allow us to avoid having a master Python process running for the duration of the optimisation.

  • See, e.g., afterok:job_id[:job_id...].

  • But we still need to maintain the internal state of the optimisation routine, so perhaps this isn’t so helpful.
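As a sketch of how dependencies could chain jobs without a persistent controller, assuming hypothetical job scripts stage1.sh and stage2.sh (the --parsable flag makes sbatch print only the job ID):

```shell
# Submit the first stage and capture its job ID from stdout.
jid=$(sbatch --parsable stage1.sh)

# The second stage is held until the first exits successfully
# (state COMPLETED, exit code zero).
sbatch --dependency="afterok:${jid}" stage2.sh
```

Each stage would still need to serialise any optimisation state to disk for the next stage to pick up, which is the limitation noted above.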

We can also pass --job-name=<jobname> to Slurm when scheduling jobs, which ensures that the jobs listed in the process queue are more informative than mere job numbers. Similarly, --output=<filename pattern> records output to specific file(s).
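These options can also be set as directives in the batch script itself. A sketch (the job name, log path, and payload script are illustrative; %j is the sbatch filename pattern that expands to the job ID):

```shell
#!/bin/bash
#SBATCH --job-name=mcas-opt              # appears in squeue instead of a bare number
#SBATCH --output=logs/mcas-opt-%j.out    # %j expands to the numeric job ID

srun ./run_step.sh                       # hypothetical payload
```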

Finally, according to the documentation for --time=<time>:

A time limit of zero requests that no time limit be imposed.

This argument accepts time limits in a variety of formats, including the ability to specify limits in terms of the number of days.
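For instance, some accepted formats (the values themselves are illustrative):

```shell
#SBATCH --time=30            # 30 minutes
#SBATCH --time=02:00:00      # 2 hours (hours:minutes:seconds)
#SBATCH --time=7-00:00:00    # 7 days (days-hours:minutes:seconds)
#SBATCH --time=0             # request no time limit (subject to partition limits)
```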

Note

Partitions such as "cascade" (on which we're running MCAS) can enforce their own limits; the cascade partition's time limit is "30-00:00:0" (i.e., 30 days). This can be identified by running sinfo -p cascade.