Attach a SLURM cluster

The base installation instructions install a complete working application that does all the computation on the local machine.

If, after you’ve installed the base application, you want to improve performance by attaching a SLURM cluster, these instructions will show you how.

Step 1: Prerequisites

  • SLURM scheduler

    nextPYP uses a SLURM compute cluster to do the data processing. The login node of the SLURM cluster must be reachable on the network from the machine where nextPYP is installed.

  • Shared filesystem

    nextPYP requires that the web server and the SLURM cluster share a single filesystem (e.g. an NFS storage system) and it be mounted at the same mount point on every machine. For example, if the shared filesystem is mounted on the SLURM cluster nodes as /nfs/data, then those files should also be available on the web server machine as /nfs/data. Additionally, you should have installed nextPYP using the PYP_SHARED_DATA and PYP_SHARED_EXEC options.

  • Service account

    nextPYP requires a service account to both run the web server process, and access files on the shared filesystem. This user should be the same on the web server machine and the SLURM cluster. Because this user account runs the web server on the web server machine (which may be exposed to the public internet), the service account should not have administrative privileges.

  • Passwordless SSH access to the SLURM login node

    The service account needs to have login access from the web server to the SLURM node via SSH without a password. This will require installing the public SSH key for the service account into the login system for the SLURM node. For a stock linux installation of sshd, that usually means copying the public key into a file like /home/account/.ssh/authorized_keys, for example, using the command ssh-copy-id user@domain. But for SLURM clusters with a networked login system or SSO, you’ll need to consult your organization’s IT staff for SSH key installation instructions.

  • Remote access enabled

    To attach a SLURM cluster, you will first need to enable remote access for your web server using the enable remote access installation instructions.

Step 2: Configuration

Website configuration

In the config.toml file in your installation folder, add (or change) the web.webhost setting. This setting configures how jobs running on compute nodes in the cluster should connect to the web server and should be set to the URL of nextPYP on the web server from the point of view of the compute node.

Tip

The web.host and web.webhost settings are different, but it’s easy to confuse them for one another. The web.host setting controls what interfaces the web server binds to, while the web.webhost setting controls what URL pyp should use to connect to the web server. With the full benefit of hindsight, the web.webhost setting might be better named web.url, but alas, that would be a breaking change.

Depending on how your network is configured, choose one of the following options:

If you have enabled remote access using a private network, then the correct value for web.webhost will look something like:

[web]
webhost = 'http://nextpyp.internal.myorganization.org:8080'

For private networks, the http protocol (not https) should be used here. And the port should be specified in the url explicitly, eg :8080.

Note

If you have changed the port from the default value using the web.port option, use that port number here instead of 8080.

If you have configured remote access to the web server using the reverse proxy, then the correct value for web.webhost will look something like:

[web]
webhost = 'https://nextpyp.myorganization.org'

For access over public networks, the https protocol (not http) should be used here. And the port (443 for https) should not be explicitly specified in the URL, since the HTTPs protocol uses the correct port implicitly.

SLURM configuration

Then add a new [slurm] section to the config file as well. At a minimum, we’ll need to set the slurm.host and the slurm.templatesDir properties, for example:

[slurm]
host = 'slurm-login.myorganization.org'
templatesDir = '/opt/nextpyp/templates'

Each of the required settings are described in more detail below.

  1. slurm.host

    The slurm.host property is the hostname of the SLURM login node, from the point of view of the nextPYP website server. The nextPYP web server process will SSH into the SLURM login node using this hostname to submit jobs to SLURM.

  2. slurm.templatesDir

    The slurm.templatesDir setting configures the location of SLURM submission templates. You’ll need to create this folder somewhere that the service account can read, but not write. This folder doesn’t need to be shared with compute nodes, so a location on the local filesystem of the web server machine is a good choice. A subfolder of your installation folder that is owned by root or your administrator account works well, like in the example above.

    Once the folder is ready, download the default SLURM submission template into the folder:

    wget https://nextpyp.app/files/pyp/latest/default.peb -O /opt/nextpyp/templates/default.peb
    

    Note

    If you’re not using the latest version of nextPYP, change the url of the default.peb file above to match the version you’re using by replacing the latest path component with your version number. e.g,. https://nextpyp.app/files/pyp/0.7.0/default.peb

    This template file allows you to configure how nextPYP submits jobs to SLURM, for example, by adding additional parameters that are specific to your computing environment. For more information about how to customize nextPYP using SLURM templates, see SLURM submission templates.

Feel free to add any other relevant SLURM configuration here as well. You can find more information about all of the available settings in the full documentation for the configuration file.

Some commonly-used options are described below:

slurm.path

Path to the SLURM binaries folder on the login node, if they are not already searchable using the usual $PATH configuration. e.g., path = '/some/nonstandard/location/slurm/bin'.

If your slurm binaries are in a standard location, like /bin or /usr/bin or a folder in the $PATH environment variable, then you won’t need to set the path explicitly.

Once you’ve finished making changes to your configuration file, restart nextPYP:

sudo systemctl restart nextPYP

Step 3: SSH access

To process a compute job, the website will attempt to SSH into the login node of the SLURM cluster to submit jobs. For this connection to work, the website must have access to an SSH key.

To generate a new SSH key for the service account, run the following commands as the service account:

cd ~/.ssh
ssh-keygen -t rsa -f id_rsa
cat id_rsa.pub >> authorized_keys
chmod go-w authorized_keys

Tip

To become the service account, sudo su account usually works in most environments.

Note

  • You may need to create the .ssh folder if it doesn’t already exist. Be sure to set the correct filesystem permissions for .ssh folders.

  • RSA keys are known to work well with nextPYP’s SSH client. If your organization prefers the newer ECDSA key type, or the even newer Ed25519 key type, you can try to generate one of those instead. Our SSH client advertises support for ECDSA and Ed25519 keys, but we haven’t tested them ourselves just yet.

Other SSH configurations than the one suggested here may work as well. If you stray from the defaults, you may need to update the config.toml file to describe your SSH configuration to the website. You can find more information about all of the SSH settings in the full documentation for the configuration file.

Step 4: Test the new configuration

After the website is restarted, go to the administration page. You can access the administration page by clicking on your username in the upper right corner and clicking the administration link there. Or you can just visit the administration page directly by changing the path (and hash) parts of the URL to /#/admin.

On the administration page, in the PYP tab, click the PYP/WebRPC Ping button.

This button will launch a short simple job on the cluster and wait for the result.

If a pong response is returned, then the new configuration was successful.

If instead, you see an error or a timeout or a no-response message of some kind, then the configuration was not successful. To find out what went wrong will require some debugging.

The first useful place to look for error information will be the micromon log in the local/logs folder of your installation. Errors with the SSH connection will appear there. See troubleshooting for more details.

The next place to look for errors is the log files in the shared/log folder in the shared filesystem. If worker processes can’t connect to the website, their log files will usually explain why. Usually problems at this stage are caused by networking issues and mismatched configuration.

Getting help

Getting nextPYP installed and working correctly can be tricky sometimes, especially since everyone’s needs are just a little different. We’ve done our best to build an install process that’s flexible enough to work in many different environments, but sometimes things still might not work out perfectly.

If you have questions, need clarification on any of the installation options, or are just looking for a little help getting through the installation, don’t hesitate to reach out using one of the options listed in the Getting help section.