Without queueing systems

We now cover the case where there is no queueing system and the user has full power to run on any of the nodes in the cluster.

Networking problems can be tricky, so we will test various things as we go along. You only need to perform this installation on your "master" machine, which will run maps. All other machines (which will be called the “remote” machines) only need to have VASP (or any other ab-initio code).

Before you start, you must make sure that it is possible to log in from the master machine to the remote machines without entering a password. This is essential for the program to be able to run on its own, without your intervention. Don't worry: it is generally possible to do this without compromising the security of your system. Two commands allow you to run a command on a remote machine. If the master and remote machines are connected through a secure network (e.g. a Beowulf cluster having its own local network) or if you don't care about security (for now), you can use rsh. Otherwise, ssh provides a secure way to command a machine remotely.

To set up rsh so that you can log in without typing a password, you must have the appropriate .rhosts file on the remote machine. For more information, consult the rsh man page. (One important issue, often not mentioned in the man pages, is that you need to set the file permissions of the .rhosts file so that no one else but you has “write” permission: chmod og-w ~/.rhosts).
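As an illustration of the .rhosts step, here is a minimal sketch to be run on the remote machine. The host name master and the user name alice are placeholders (they are not from this manual); replace them with the name of your master machine and your username on it.

```shell
# Run on the REMOTE machine. "master" and "alice" are placeholder names.
# Each line of ~/.rhosts has the form: <trusted host> <user on that host>
echo "master alice" >> ~/.rhosts
# Remove "write" permission for group and others, as noted above.
chmod og-w ~/.rhosts
# Check the result: only the owner should have "w".
ls -l ~/.rhosts
```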

To set up ssh so that you can log in without typing a password, consult the ssh man page, especially the section on “RSA-based host authentication”. (This is the feature that makes the login secure even if no password is needed.) In general, setting this up involves creating a .shosts file and generating public key files to be copied onto the remote machines.
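On many current systems, the simplest route is per-user public-key authentication with OpenSSH rather than the host-based .shosts mechanism described above. The following is only a sketch under that assumption: the function name setup_passwordless_ssh is made up for this example, and the key is created with an empty passphrase, which you may not want on a shared machine.

```shell
# Sketch: set up key-based login from the master to one remote machine.
# Assumes OpenSSH (ssh-keygen, ssh-copy-id) is installed on the master.
setup_passwordless_ssh() {
  remote="$1"
  mkdir -p ~/.ssh && chmod 700 ~/.ssh
  # Create a key pair only if none exists yet (empty passphrase: no prompt later).
  [ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
  # Append the public key to the remote authorized_keys (asks for the password once).
  ssh-copy-id -i ~/.ssh/id_ed25519.pub "$remote" || return 1
  # BatchMode forbids password prompts, so this fails unless key login works.
  ssh -o BatchMode=yes "$remote" true && echo "passwordless login to $remote OK"
}

# Usage (node2 is the example remote machine name used in this section):
#   setup_passwordless_ssh node2
```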

If your username is different on the remote machine and if you use ssh, use the syntax node username@host instead. If you use rsh, use the syntax node -l username host.

Once you are able to run either rsh node2 ls or ssh node2 ls and get the content of your home directory on the remote machine (assuming that you have a remote machine called node2), you can proceed to the next step. Do you have the same home directory on the master and remote machines and does it have the same access path? To check this, cd to some arbitrary subdirectory and type:

 node node2 ls
where node is a command provided with ATAT and node2 is the name of the remote machine. If you want to use ssh instead, type
 node -s node2 ls
This should print the content of the current directory on the master machine (not your home directory). If you get an error message or if you get the content of another directory, you will need to check if the following works. Make sure you are in a directory that does not contain too many files. If you want to use rsh, type:
 node -r node2 ls
while if you want to use ssh, type:
 node -s -r node2 ls
In either case, you should get the content of the current directory before continuing. The node -r command works by copying the content of the current directory into a temporary directory on the remote machine. Once the command has terminated, the new content is copied back and the remote temporary directory is deleted.
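To make the copy, run, copy-back cycle concrete, here is a rough sketch of the same idea written with ssh and tar. It is not the actual implementation of node -r; the function name run_with_copy and the REMOTE variable are made up for this illustration.

```shell
# Sketch of the copy / run / copy-back cycle; NOT the real "node -r" code.
# REMOTE is the command used to run a shell command string on the remote machine.
REMOTE="${REMOTE:-ssh node2}"

run_with_copy() {
  # 1. Create a temporary directory on the remote machine.
  tmp=$($REMOTE 'mktemp -d') || return 1
  # 2. Copy the content of the current directory into it.
  tar cf - . | $REMOTE "tar xf - -C '$tmp'" || return 1
  # 3. Run the requested command there and remember its exit status.
  $REMOTE "cd '$tmp' && $*"
  status=$?
  # 4. Copy the (possibly modified) content back into the current directory.
  $REMOTE "cd '$tmp' && tar cf - ." | tar xf -
  # 5. Delete the remote temporary directory.
  $REMOTE "rm -rf '$tmp'"
  return $status
}

# Usage: run_with_copy ls
```

Because everything is copied twice, this approach is only reasonable for directories that are small, which is why the test above should be done in a directory that does not contain too many files.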

We are now ready to automate the calculations. Type

 chl
and, as indicated on the screen, open the file ~/.machines.rc with a text editor. This file contains numerous comment lines explaining the format of the file and a few examples.

The commands in the first column (before the +) must print a single number indicating the load of the machine. It is a good check to copy and paste each of these commands, one at a time, into a shell window to see if the output is a single number. In order to extract a single number out of a complicated output, the command getvalue is provided. It extracts the single number following the token given as an argument. The first entry, with the none keyword after the +, indicates the threshold load above which a machine is considered too busy to be usable. Note that the load checking commands may be quite elaborate if, for instance, you need to “rescale” the load of some machines because they have a different reporting scheme or if you want to tweak the priority given to each machine.
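To see what "extracting the single number following a token" means in practice, here is a small stand-in written with awk. It is not the ATAT getvalue command, and the function name number_after is made up for this illustration; it only mimics the behavior described above.

```shell
# Hypothetical stand-in for the idea behind getvalue (not the ATAT code):
# print the first number that follows the given token in the input.
number_after() {
  awk -v tok="$1" '{
    i = index($0, tok)                 # position of the token in this line
    if (i == 0) next                   # token absent: try the next line
    rest = substr($0, i + length(tok)) # text after the token
    if (match(rest, /[0-9]+(\.[0-9]+)?/)) {
      print substr(rest, RSTART, RLENGTH)
      exit
    }
  }'
}

# Example with an uptime-style line; prints 0.42
echo " 12:55:16 up 3 days, load average: 0.42, 0.36, 0.30" | number_after "average:"
```

The same kind of check applies to your real load commands: run them by hand and confirm that exactly one number comes out.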

The second column (after the +) gives the command prefix needed to run on each remote machine. These prefixes will usually consist of the command node described above. It is very important that the command prefix be such that the current directory on the remote machine when the command is run is the same as on the local machine. The best way to test this is to try the prefix in front of the ls command and verify that what is printed is indeed the content of the current local directory.

Once you are done with entering the information for each of your machines (you can also enter only a few and come back later to add the remaining ones), make sure to comment out the examples provided (placing a # at the beginning of the unwanted line). Do not comment out the first line which contains the none keyword.
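For orientation, a filled-in file might look roughly like the sketch below. This is an assumption-laden illustration, not a copy of the file shipped with ATAT: the machine names node2 and node3, the threshold 0.5, and the token passed to getvalue are all placeholders, and the token must match what uptime actually prints on your machines. The comments in your own ~/.machines.rc remain the authoritative description of the format.

```shell
# Threshold entry (keep it first and uncommented): load above which a machine is too busy.
echo 0.5 + none
# One entry per remote machine: <command printing the load> + <command prefix>
rsh node2 uptime | getvalue average + node node2
ssh node3 uptime | getvalue average + node -s node3
# Example entries that came with the file are disabled with a leading "#":
# rsh examplemachine uptime | getvalue average + node examplemachine
```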

Once you have edited the ~/.machines.rc file, type chl. This should give a list of the load of all machines in the first column and a list of command prefixes in the second. Next, try the command minload. It should give the command prefix that lets you access the machine with the minimum load, or none if no machine is available. To check, once again, that the command prefixes are correct, type `minload` ls (make sure you use backward quotes!). This should print the content of the current directory (unless there are no machines available).
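The selection rule that minload applies can be pictured with the following sketch. It is not the ATAT minload program: the function name pick_min_load is made up, and the input format ("load + prefix" lines, with the threshold on the first line) is an assumption chosen to mirror the two columns described above.

```shell
# Sketch of the selection rule only; NOT the ATAT minload program.
# Input: first line "<threshold> + none", then one "<load> + <prefix>" line per machine.
pick_min_load() {
  awk -F' [+] ' '
    NR == 1 { threshold = $1 + 0; next }
    # Keep the machine with the smallest load that does not exceed the threshold.
    ($1 + 0) <= threshold && (best == "" || ($1 + 0) < min) { min = $1 + 0; best = $2 }
    END { print (best == "" ? "none" : best) }
  '
}

# Example: node2 is too busy, so the prefix for node3 is printed: node -s node3
printf '%s\n' "0.5 + none" "0.80 + node node2" "0.10 + node -s node3" | pick_min_load
```

When every machine is above the threshold, the sketch prints none, which matches the behavior described for minload.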

This approach could also be used with a queueing system, but this is much more advanced and requires good knowledge of scripting languages. You could read the “node list file” that the queueing system assigned to the job and create a local .machines.rc file on the fly. The -s option of pollmach could prove helpful to ensure that two vasp runs do not run on the same processors, which avoids the need for load-balancing code.

[email protected] Wed, Dec 6, 2023 12:55:16 PM