Metabarcoding: from Lab to Bioinformatics (UT International Summer University, 2022)


HPC basics

For demonstration purposes, we will use the Rocket Cluster of the University of Tartu.

To connect to the head-node of the HPC cluster using the secure shell protocol, run:

ssh USER@SERVER

Substitute USER with your login ID and SERVER with the server's hostname (an IP address or a domain name).
E.g., ssh koljalg@rocket.hpc.ut.ee.
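
If you connect frequently, you may optionally set up key-based authentication so that you are not asked for a password at every login. A minimal sketch using the standard OpenSSH tools ssh-keygen and ssh-copy-id (whether key login is permitted depends on the cluster's configuration):

ssh-keygen -t ed25519      # generate a key pair on your own computer (you may set a passphrase)
ssh-copy-id USER@SERVER    # copy the public key to the cluster; the password is asked one last time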

To copy a file or multiple files to the HPC cluster, use:

scp yourfile koljalg@rocket.hpc.ut.ee:~/yourfile   # single file
scp file1 file2 koljalg@rocket.hpc.ut.ee:~/        # multiple files

To copy a file from the HPC cluster (e.g., yourfile from your home directory on the HPC to the home directory on your computer), use:

scp koljalg@rocket.hpc.ut.ee:~/yourfile ~/yourfile
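
To copy an entire directory, add the -r (recursive) flag, e.g. (mydata below is just a placeholder directory name):

scp -r mydata koljalg@rocket.hpc.ut.ee:~/    # copy a whole directory to the cluster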

If you have large files (or a large number of files), it is better to use the rsync program for file transfer, e.g.:

rsync -avz Documents/* koljalg@rocket.hpc.ut.ee:~/all/
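
Because rsync only transfers files that are missing or have changed on the receiving side, re-running the same command resumes an interrupted transfer. A couple of optional flags may be handy (a sketch; the paths below are placeholders):

rsync -avz --partial --progress Documents/* koljalg@rocket.hpc.ut.ee:~/all/   # keep partially transferred files, show progress
rsync -avz koljalg@rocket.hpc.ut.ee:~/all/ ~/Documents/all/                   # the same syntax works for downloads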

To end your session on the HPC cluster, run:

exit

Setup working environment on HPC cluster

In general, one needs admin rights to install software system-wide on HPC clusters. However, users may install software into their home directory, where they have write permissions. To make life easier, you may use Conda - a package manager that helps you find and install software together with its dependencies.
To install Miniconda, run the following code:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
~/miniconda/bin/conda init bash
source ~/.bashrc
conda update --all --yes -c bioconda -c conda-forge
conda install --yes -c conda-forge mamba
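
As a quick sanity check, you can print the version and the basic configuration of the freshly installed Conda:

conda --version
conda info         # shows the active environment, configured channels, and install location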

To install software (e.g., the seqkit program), run:

mamba install -c bioconda seqkit
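
To check that the program was installed and is available on your PATH, ask it for its version (seqkit provides a version subcommand):

seqkit version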

Conda environments

If the software you wish to use cannot be installed in the base (default) environment due to version conflicts, if you want to use a specific version of a program, or if you simply want to keep it isolated, you may create a separate environment with:

mamba create --name VSEARCHENV -c bioconda -c conda-forge vsearch=2.21.1 blast=2.13.0
conda activate VSEARCHENV           # switch to the new environment we've created

Verify which software versions are installed:

vsearch --version
blastn -version

To switch to the base environment, run:

conda deactivate
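
To see all environments you have created (the active one is marked with an asterisk), and to delete an environment you no longer need:

conda env list                        # list the available environments
conda env remove --name VSEARCHENV    # remove the environment and all of its packages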

Module system

Alternatively, if the software you wish to use is pre-installed on the HPC cluster, you may load it as an environment module.
To list all available modules, use the module avail command (scroll the list with the space bar, press q to quit).
To search for a particular module, use e.g. module -r spider '.*singularity.*'.
If the required software is available, load the corresponding module, e.g.:

module load any/singularity/3.7.3
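
To see which modules are currently loaded, and to unload them when they are no longer needed:

module list                           # show currently loaded modules
module unload any/singularity/3.7.3   # unload a single module
module purge                          # unload all loaded modules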

Scheduling jobs on the HPC cluster

The Slurm Workload Manager, a.k.a. Simple Linux Utility for Resource Management (SLURM), is used to share the HPC resources between users.

Please note that when logging in from your computer, you land on the cluster head node.
Do not run the analysis on the head node!
Use it only to schedule tasks, which SLURM will distribute across the compute nodes.

To run a program on the HPC cluster, you should:

  • Prepare a batch script with directives to SLURM about the number of CPUs*, amount of RAM*, and time duration requested for the job, along with commands which perform the desired calculations;
  • Submit the script to the batch queue. SLURM will evaluate the task’s priority and start executing the job when it reaches the front of the queue.
    When the job finishes, you may retrieve the output files.

* Unless specified otherwise, on the Rocket cluster, all jobs will be allocated 1 node with 1 CPU core and 2 GB of memory.

Batch script

Here is a basic batch script that contains a minimal set of SLURM options:

#!/bin/bash -l
#SBATCH --job-name=my_job
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --mem=10G
#SBATCH --partition=amd
#SBATCH --time=48:00:00

## If needed, you may load the required modules here
# module load X

## Run your code
some_program -i input.data -o output_1.data --threads 4
some_script.sh output_1.data > output_2.data
echo "Done" > output.log

The syntax for the SLURM directive in a script is #SBATCH <flag>, where <flag> could be:

  • --job-name, the name of the job;
  • --cpus-per-task, the number of CPUs each task should have (e.g., 4 cores);
  • --nodes, requested number of nodes (each node could have multiple CPUs);
  • --mem, requested amount of RAM (e.g., 10 gigabytes);
  • --partition, the partition (queue) on which the job will run;
  • --time, the requested time for the job (e.g., 48 hours).

To submit a job, save the code above to a file (e.g., my_job.sh) and run:

sbatch my_job.sh
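
By default, SLURM writes the job's standard output and standard error to a file named slurm-<JOBID>.out in the directory from which the job was submitted; the file name can be changed with the --output (and --error) options. To inspect the output of a running or finished job:

cat slurm-<JOBID>.out    # substitute <JOBID> with the ID reported by sbatch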

Scheduling a task directly from the command line

If the command you wish to run is relatively simple, you may run it without a batch script, but in that case, you should provide SLURM directives as arguments to the sbatch command:

sbatch \
  --job-name=my_job \
  --ntasks-per-node=4 --nodes=1 --mem=10G -p amd \
  --time=48:00:00 \
  some_script.sh input.data
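
For a single command that is not wrapped in any script at all, you may also use the --wrap option, which makes sbatch generate a minimal batch script around the given command line (some_program and its arguments below are placeholders):

sbatch --job-name=my_job --cpus-per-task=4 --mem=10G -p amd --time=48:00:00 \
  --wrap="some_program -i input.data -o output.data --threads 4"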

Job management

Once the job has been submitted, you may monitor the queue and see the status of your pending and running tasks:

squeue -u $USER

The most common job state codes (column ST) are:
PD = PENDING
R = RUNNING
S = SUSPENDED

To cancel the job, use:

scancel <JOBID>        # by job ID (e.g., 31727880; see the JOBID column in the "squeue" output)
scancel --name my_job  # one or more jobs by name
scancel -u $USER       # all jobs of the current user
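
Note that squeue lists only pending and running jobs. If job accounting is enabled on the cluster (as it typically is), finished jobs and their resource usage can be inspected with sacct:

sacct -u $USER                                                   # jobs of the current user (today, by default)
sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,MaxRSS     # resource usage of a particular job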