Make any job re-submit itself

by Andrew Peterson

We keep the maximum job wall time on our system fairly short (50 hours) so that there is a high turnover of nodes. However, some jobs need much longer than 50 hours to complete. We've developed a system in which a job watches the clock and, just before it is about to expire, re-submits itself and terminates.

Self-resuming script

To use this, you first need to write your job so that it can pick up where it left off. Below is a simple example of how to do this for a geometry optimization. Notice that if the script is interrupted, you can submit the same script again and it will resume from the last saved configuration, writing to qn0001.traj instead of qn0000.traj.

import os
import ase.io
from ase.calculators.emt import EMT
from ase.optimize import QuasiNewton


def make_atoms():
    """Creates the atoms to be optimized.
    This is run only the first time this script is started."""
    from ase.build import fcc111
    atoms = fcc111('Cu', (3, 3, 3), vacuum=10.)
    return atoms


def resume():
    """Finds the last atomic configuration if trajectory files exist,
    otherwise creates a fresh atoms object to be optimized. Also returns
    the new iteration number."""
    qnfiles = [f for f in os.listdir(os.getcwd()) if 
               (f.startswith('qn') and f.endswith('.traj'))]
    if len(qnfiles) == 0:
        # This is a fresh start.
        atoms = make_atoms()
        iteration = 0
    else:
        # This is resuming.
        lastqn = sorted(qnfiles)[-1]
        atoms = ase.io.read(lastqn)
        iteration = int(lastqn[2:-5]) + 1
    return atoms, iteration


atoms, iteration = resume()
# Attach the calculator (atoms.set_calculator is deprecated in recent ASE).
atoms.calc = EMT()
opt = QuasiNewton(atoms,
                  trajectory='qn{:04d}.traj'.format(iteration))
opt.run()
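
The file-naming logic inside resume() can be checked on its own, without ASE. The following is a minimal sketch, assuming the same qnNNNN.traj naming as above; the helper name next_iteration is hypothetical and is not part of the original script.

```python
def next_iteration(filenames):
    """Return (last_file, next_iteration) for a list of file names.

    Hypothetical helper mirroring the logic of resume() above.
    last_file is None when no qn*.traj file is present (fresh start).
    """
    qnfiles = sorted(f for f in filenames
                     if f.startswith('qn') and f.endswith('.traj'))
    if not qnfiles:
        # Fresh start: nothing to resume from.
        return None, 0
    last = qnfiles[-1]
    # Strip the leading 'qn' and the trailing '.traj' to get the number.
    return last, int(last[2:-5]) + 1


print(next_iteration([]))
print(next_iteration(['qn0000.traj', 'run.py', 'qn0001.traj']))
```

The first call reports a fresh start (iteration 0); the second picks qn0001.traj as the file to resume from and gives 2 as the next iteration. Note that the plain sorted() call relies on the zero-padded, four-digit numbering to order the files correctly.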

Self-resubmitting script

Having a script like the one above can already make your life easier, since you don't need to do as much manual manipulation every time you want to restart a job. However, we can make it better by having the job automatically re-submit itself right before it would be killed by the system. To do this, we use the ReQueue module from our group's pgroup repository.

Most scripts, like the one above, have a single line that is responsible for nearly all of the computational demand; in this case, opt.run(). Instead of running this line directly, we feed it to the ReQueue module as shown below:

import os
import ase.io
from ase.calculators.emt import EMT
from ase.optimize import QuasiNewton
from pgroup.requeue import ReQueue


def make_atoms():
    """Creates the atoms to be optimized.
    This is run only the first time this script is started."""
    from ase.build import fcc111
    atoms = fcc111('Cu', (3, 3, 3), vacuum=10.)
    return atoms


def resume():
    """Finds the last atomic configuration if trajectory files exist,
    otherwise creates a fresh atoms object to be optimized. Also returns
    the new iteration number."""
    qnfiles = [f for f in os.listdir(os.getcwd()) if 
               (f.startswith('qn') and f.endswith('.traj'))]
    if len(qnfiles) == 0:
        # This is a fresh start.
        atoms = make_atoms()
        iteration = 0
    else:
        # This is resuming.
        lastqn = sorted(qnfiles)[-1]
        atoms = ase.io.read(lastqn)
        iteration = int(lastqn[2:-5]) + 1
    return atoms, iteration


atoms, iteration = resume()
# Attach the calculator (atoms.set_calculator is deprecated in recent ASE).
atoms.calc = EMT()
opt = QuasiNewton(atoms,
                  trajectory='qn{:04d}.traj'.format(iteration))

##############################################################
# Make the resubmission object.
requeue = ReQueue(maxtime=24., checktime=0.5)

# Run the compute-heavy line inside the requeue object.
status = requeue(opt.run, fmax=0.05)

# Check to see if the job finished or ran out of time.
if status == 'time_elapsed':
    os.system('sbatch --begin=now+2minutes run.py')
else:
    # Put any post-process lines here.
    pass

Behind the scenes, ReQueue starts opt.run in a separate thread. Every 30 minutes (checktime=0.5, in hours) it checks whether the thread is still running or has completed; you can change the check interval with checktime. If the thread has completed, ReQueue exits and returns the status 'job_completed'. If 24 hours (maxtime=24.) have elapsed, it abandons the thread (killing it) and returns the status 'time_elapsed'.
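
To make the mechanism concrete, here is a minimal sketch of a watchdog of this kind, written with only the standard library. It is not the actual pgroup implementation: the class name SimpleReQueue is hypothetical, and it assumes maxtime and checktime are given in hours, as in the call above. A Python thread cannot be killed directly, so this sketch uses a daemon thread, which is simply abandoned and dies when the script exits.

```python
import threading
import time


class SimpleReQueue:
    """Hypothetical stand-in for ReQueue; a sketch, not the pgroup code."""

    def __init__(self, maxtime, checktime):
        # Both arguments are assumed to be in hours; store them in seconds.
        self.maxtime = maxtime * 3600.
        self.checktime = checktime * 3600.

    def __call__(self, function, *args, **kwargs):
        # A daemon thread does not keep the interpreter alive, so an
        # abandoned calculation ends when the main script exits.
        worker = threading.Thread(target=function, args=args,
                                  kwargs=kwargs, daemon=True)
        start = time.monotonic()
        worker.start()
        while True:
            remaining = self.maxtime - (time.monotonic() - start)
            if remaining <= 0:
                return 'time_elapsed'
            # Wait at most one check interval; join() returns early
            # if the worker finishes first.
            worker.join(timeout=min(self.checktime, remaining))
            if not worker.is_alive():
                return 'job_completed'


requeue = SimpleReQueue(maxtime=1.0, checktime=0.5)
status = requeue(sum, [1, 2, 3])  # A trivially short "job".
print(status)
```

One design point worth noting in this sketch: the return value of the wrapped function is discarded, so results need to be written to disk (as the trajectory file does in the optimization above).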

Back in our script, if we get the 'time_elapsed' status, we know that we should resubmit our script (saved here as run.py), which is accomplished by the call to os.system. We added a 2-minute start-time delay to give the current script some time to terminate properly before the next one starts.
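
As an alternative to os.system, the resubmission can be done with subprocess, which raises an error if sbatch fails instead of silently continuing. This is only a sketch: the helper names build_resubmit_command and resubmit are hypothetical, and the script name run.py and the 2-minute delay are taken from the example above.

```python
import subprocess


def build_resubmit_command(script='run.py', delay_minutes=2):
    """Build the sbatch command as an argument list (hypothetical helper)."""
    return ['sbatch',
            '--begin=now+{:d}minutes'.format(delay_minutes),
            script]


def resubmit(script='run.py', delay_minutes=2):
    """Submit the script again; raises CalledProcessError if sbatch fails."""
    subprocess.run(build_resubmit_command(script, delay_minutes), check=True)


# Show the command that would be run, without actually submitting anything.
print(' '.join(build_resubmit_command()))
```

In the script above, the os.system line would then be replaced by a call to resubmit().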

Note: for NEBs, see also Easy NEB restarts.