Attach a SLURM cluster¶
The base installation instructions install a complete working application that does all the computation on the local machine.
If, after you’ve installed the base application, you want to improve performance by attaching a SLURM cluster, these instructions will show you how.
Step 1: Prerequisites¶
- SLURM scheduler:
nextPYP uses a SLURM compute cluster to do the data processing. The login node of the SLURM cluster must be reachable on the network from the machine where
- Shared filesystem:
nextPYP requires that the web server and the SLURM cluster share a single filesystem (e.g. an NFS storage system) and that it be mounted at the same mount point on every machine. For example, if the shared filesystem is mounted on the SLURM cluster nodes as
/nfs/data, then those files should also be available on the web server machine as
- Service account:
nextPYP requires a service account to both run the web server process and access files on the shared filesystem. This user should be the same on the web server machine and the SLURM cluster. Because this user account runs the web server on the web server machine (which may be exposed to the public internet), the service account should not have administrative privileges.
- Passwordless SSH access to the SLURM login node:
The service account needs login access from the web server to the SLURM login node via SSH without a password. This requires installing the public SSH key for the service account into the login system for the SLURM node. For a stock Linux installation of sshd, that usually means copying the public key into a file like /home/account/.ssh/authorized_keys. For SLURM clusters with a networked login system or SSO, however, you’ll need to consult your organization’s IT staff for SSH key installation instructions.
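The shared-filesystem and passwordless-SSH prerequisites can be checked together from the web server machine. The following is a minimal sketch and not part of nextPYP: the function name check_shared_path, the marker file name, and the example account, host, and path are assumptions for illustration. It creates a temporary marker file in the shared folder and asks the login node, over a non-interactive SSH session, whether the same file is visible at the same path.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nextPYP): checks that a folder on the
# shared filesystem is visible at the same path on the SLURM login node.
# Usage: check_shared_path LOGIN DIR
check_shared_path() {
    local login="$1" dir="$2" marker status
    marker="$dir/.mount-check.$$"   # marker name is an arbitrary choice
    touch "$marker" || return 1     # the service account must be able to write here
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so this also exercises the passwordless login.
    ssh -o BatchMode=yes "$login" test -e "$marker"
    status=$?
    rm -f "$marker"
    return "$status"
}

# Example (assumed account, host, and path), run as the service account:
# check_shared_path account@slurm-login.myorganization.org /nfs/data && echo "shared path OK"
```

A failure here can mean either that SSH login did not work or that the path differs between the machines, so it is worth running a plain ssh command by hand to tell the two apart. Filesystem caching could also delay the marker file's appearance on the login node.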
Step 3: Configuration¶
In the config.toml file in your installation folder, add (or change) the shared setting to point to the new shared folder you just created, e.g.:
shared = '/nfs/nextpyp/shared'
You’ll also need to add (or change) the web.webhost settings to match your network configuration. Depending on how your network is configured, choose one of the following options.
Option 2: The SLURM cluster and the web server are only connected through the public internet¶
You might run into this situation if your web server and the SLURM cluster are on different networks. In this situation, the SLURM nodes must connect to the website through the reverse proxy server.
To enable access in this environment, set the web.webhost setting to the public internet URL of the web server, using the HTTPS protocol and no port number, e.g.:
webhost = 'https://nextpyp.myorganization.org'
Do not use the
web.host setting in this environment. The default value here will be correct.
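As a quick sanity check on the value chosen for web.webhost in this option, the rule "HTTPS protocol and no port number" can be expressed as a small shell function. This is only an illustrative sketch: the function name is_public_webhost is hypothetical, and the check is purely textual (it does not contact the server, and it would reject IPv6 literal addresses because they contain colons).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for an https:// URL whose host part
# contains no port number. Purely a string check; nothing is contacted.
is_public_webhost() {
    local url="$1" host
    case "$url" in
        https://?*) ;;           # must use the HTTPS scheme
        *) return 1 ;;
    esac
    host="${url#https://}"       # drop the scheme
    host="${host%%/*}"           # drop any path after the host
    case "$host" in
        ""|*:*) return 1 ;;      # empty host, or a colon (port number)
    esac
    return 0
}

# Example:
# is_public_webhost 'https://nextpyp.myorganization.org' && echo "looks fine"
```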
The web.host and the web.webhost settings are actually different from each other!
Be sure not to get them confused. With the benefit of hindsight, the
would perhaps be better named
web.url now, but we’d rather not make a breaking change there.
Then add a new [slurm] section to the config file as well. At a minimum, we’ll need to set the host setting:
host = 'slurm-login.myorganization.org'
Feel free to add any other relevant SLURM configuration here as well. You can find more information about all of the available settings in the full documentation for the configuration file.
Additionally, it may be helpful to set a few other commonly-needed options now, depending on your SLURM environment:
- path: Path to the SLURM binaries on the login node.
- queues: The names of any SLURM partitions to which users can submit jobs.
- gpuQueue: The names of any SLURM partitions with GPU hardware to which users can submit jobs.
host = 'slurm-login.myorganization.org'
path = '/opt/slurm/bin'
queues = [ 'general', 'quick' ]
gpuQueue = [ 'gpu' ]
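Before restarting, it can help to confirm that the partition names in the configuration actually exist on the cluster. The sketch below is an assumption-laden illustration, not a nextPYP tool: the function name missing_partitions is hypothetical, and it expects one partition name per line on standard input, which is what sinfo prints with the -h (no header) and -o '%P' (partition name only) options. SLURM marks the default partition with a trailing asterisk, so that character is stripped first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads available partition names (one per line) from
# standard input and prints every requested name that is not among them.
# Usage: sinfo -h -o '%P' | missing_partitions NAME...
missing_partitions() {
    local available name
    available=$(tr -d '*')   # sinfo appends '*' to the default partition
    for name in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        printf '%s\n' "$available" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}

# Example (assumed account, host, and binary path, taken from the sample config):
# ssh account@slurm-login.myorganization.org /opt/slurm/bin/sinfo -h -o "'%P'" \
#     | missing_partitions general quick gpu
```

An empty result means every listed partition was found. Any name that is printed is worth double-checking against the queues and gpuQueue settings.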
After making changes to your configuration file, restart the application:
sudo systemctl restart nextPYP
Step 4: SSH configuration¶
To process a compute job, the website will attempt to SSH into the login node of the SLURM cluster to submit jobs. For this connection to work, the website must have access to an SSH key.
To generate a new SSH key for the service account, run the following commands as the service account:
# Run these inside the service account's .ssh folder.
cd ~/.ssh
ssh-keygen -t rsa -f id_rsa
cat id_rsa.pub >> authorized_keys
chmod go-w authorized_keys
To become the service account, sudo su account works in most environments.
You may need to create the .ssh folder if it doesn’t already exist.
Be sure to set the correct filesystem permissions for .ssh folders.
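One common way to create the folder with suitable permissions is sketched below. The function name prepare_ssh_dir is hypothetical, and the modes 700 (folder) and 600 (authorized_keys) are the conventional OpenSSH choices rather than anything specific to nextPYP. With its default StrictModes setting, sshd ignores keys when these files are writable by other users.

```shell
#!/usr/bin/env bash
# Hypothetical helper: creates an .ssh folder (default: ~/.ssh) with the
# conventional OpenSSH permissions. Existing keys are left untouched.
prepare_ssh_dir() {
    local dir="${1:-$HOME/.ssh}"
    mkdir -p "$dir" || return 1
    chmod 700 "$dir"                     # only the owner may enter or list the folder
    touch "$dir/authorized_keys"         # create the file if it is missing
    chmod 600 "$dir/authorized_keys"     # only the owner may read or write it
}

# Example, run as the service account:
# prepare_ssh_dir
```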
RSA keys are known to work well with
nextPYP’s SSH client.
If your organization prefers the newer ECDSA key type, you can try to generate one of those instead. The SSH client advertises support for ECDSA keys, but we haven’t tested them ourselves.
SSH configurations other than the one suggested here may work as well. If you stray from the defaults, you may need to update the config.toml file to describe your SSH configuration to the website.
You can find more information about all of the SSH settings in the
full documentation for the configuration file.
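Before moving on, the key can be tried by hand from the web server machine. The sketch below is an illustration under assumptions: the function name can_login_without_password is hypothetical, the key path defaults to the id_rsa file generated above, and the account and host in the example are placeholders. It uses the standard OpenSSH client options BatchMode (fail rather than prompt) and ConnectTimeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given key allows a non-interactive
# login to the SLURM login node.
# Usage: can_login_without_password LOGIN [KEYFILE]
can_login_without_password() {
    local login="$1" key="${2:-$HOME/.ssh/id_rsa}"
    # BatchMode=yes disables password and passphrase prompts, so any need
    # for interactive input shows up as a failure instead of a hang.
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=10 "$login" true
}

# Example (assumed account and host), run as the service account:
# can_login_without_password account@slurm-login.myorganization.org \
#     && echo "passwordless SSH works"
```

If the login node's host key has not been accepted yet, a batch-mode connection will normally fail as well. Connecting once interactively as the service account is one way to rule that out.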
Step 5: Test the new configuration¶
After the website is restarted, go to the administration page. You can access the administration page by
clicking on your username in the upper right corner and clicking the administration link there. Or you can
just visit the administration page directly by changing the path (and hash) parts of the URL to
On the administration page, in the PYP tab, click the PYP/WebRPC Ping button.
This button will launch a short simple job on the cluster and wait for the result.
If a pong response is returned, then the new configuration was successful.
If you instead see an error, a timeout, or a no-response message of some kind, then the configuration was not successful. Finding out what went wrong will require some debugging.
The first useful place to look for error information will be the
micromon log in the
local/logs folder of
your installation. Errors with the SSH connection will appear there. See troubleshooting for more details.
The next place to look for errors is the log files in the
shared/log folder in the shared filesystem.
If worker processes can’t connect to the website, their log files will usually explain why. Usually problems
at this stage are caused by networking issues and mismatched configuration.
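When scanning these folders, a simple text search is often enough to surface the relevant lines. The following sketch is only a starting point: the function name recent_errors and the search keywords are assumptions, and the exact log file names are not given here, so it searches every file under the folder it is pointed at.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the last COUNT lines (default 20) in a log
# folder that mention common failure keywords, prefixed with file and line.
# Usage: recent_errors DIR [COUNT]
recent_errors() {
    local dir="$1" count="${2:-20}"
    if [ ! -d "$dir" ]; then
        echo "no such folder: $dir" >&2
        return 1
    fi
    # -r: recurse, -n: show line numbers, -i: ignore case, -E: extended regex
    grep -rniE 'error|exception|refused|timed out' "$dir" | tail -n "$count"
}

# Examples (run from the installation folder; the shared path is the one
# used in the sample configuration):
# recent_errors local/logs
# recent_errors /nfs/nextpyp/shared/log
```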