SLURM Array experiments

Background

The SLURM array directive provides a simple and intuitive syntax to submit a job with multiple similar tasks to the underlying HPC system. To fully exploit this capability, a few things should be kept in mind.

In the following we start from a simple, serial job and explore how the total run time of our job changes once the `--array` option is applied.

job_001: A job with one task running on a single node on a single CPU

This is likely the simplest job imaginable: we run the command `hostname` and let the process idle for 5 seconds. Before and after the `sleep` statement, we print a timestamp (`date`). The difference between the two timestamps should be 5 seconds.

 #! /bin/bash

 #SBATCH -n 1
 #SBATCH -N 1
 #SBATCH --time=00:00:10
 #SBATCH --partition=testing
 #SBATCH -J job_001
 #SBATCH -o job_001.out
 hostname
 date
 sleep 5
 date
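
The script is submitted with `sbatch`, and its progress can be watched with `squeue`. A minimal sketch, assuming the script is saved as `job_001.sh` (the file name is not shown above):

$ sbatch job_001.sh      # submit the batch script
$ squeue -u $USER        # list your own jobs and their state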

The output from this script (written to job_001.out) is:

$ cat job_001.out
highmemnode02
Wed Oct 18 14:27:28 CEST 2023
Wed Oct 18 14:27:33 CEST 2023

The first line is the node's hostname, followed by two timestamps 5 seconds apart. So far so good.

job_002: Using the array directive to run 10 tasks

Now let's use SLURM's `--array` directive so that the script gets executed 10 times. The `--array` directive takes a list or a range of task indices; optionally, a stride and a maximum number of concurrently running tasks can be specified, but we leave those details to the SLURM documentation. `job_002.sh` below differs from the first script essentially by the line `#SBATCH --array 1-10`, which tells SLURM to execute the body of the script 10 times (thereby defining 10 tasks); in addition, the output file name now contains the placeholder `%a`, which expands to the array task index, so that each task writes to its own file. It is important to note that this changes the meaning of the other SLURM directives: they now specify the resources for each single task in the array. In other words, each task runs on a single node on a single CPU with a maximum wall time of 10 seconds.
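
For reference, a few common forms of the directive (the values are purely illustrative; see `man sbatch` for the full syntax):

 #SBATCH --array=1-10      # tasks 1 through 10
 #SBATCH --array=1,3,7     # an explicit list of task indices
 #SBATCH --array=1-10:2    # range with a stride of 2: tasks 1,3,5,7,9
 #SBATCH --array=1-10%4    # at most 4 tasks running at the same time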

 #! /bin/bash

 #SBATCH -n 1
 #SBATCH -N 1
 #SBATCH --array 1-10
 #SBATCH --time=00:00:10
 #SBATCH --partition=testing
 #SBATCH --out job_002_%a.out

 hostname
 date
 sleep 5
 date

Submitting this script and watching my queue (`squeue`), I find that only one of the ten tasks is running at any given time, with the remaining tasks queued up but not yet started. Why is that? The job is in no way demanding in terms of resources, so what keeps the scheduler from executing more than one task at a time? The answer is memory: by default, the entire memory available on the allocated node is given to that one task. Hence, no memory remains for the other tasks; they have to wait until resources become available.
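
To see why the remaining tasks are waiting, one can ask SLURM directly; a sketch, with `<jobid>` standing for the array's job ID reported by `sbatch`:

$ squeue -j <jobid>          # pending tasks typically show a reason such as (Resources)
$ scontrol show job <jobid>  # the memory fields show how much memory the job was granted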

The output from task 1 (captured in `job_002_1.out`) reads

highmemnode02
Wed Oct 18 14:38:21 CEST 2023
Wed Oct 18 14:38:26 CEST 2023

Output from task 2 (`job_002_2.out`) reads

highmemnode02
Wed Oct 18 14:38:27 CEST 2023
Wed Oct 18 14:38:32 CEST 2023

The timestamps show that the tasks were run one after the other. Not what we want, especially for small tasks like these.

job_003: A job array with 10 tasks and mem-per-cpu allocation

We're going to reduce the memory allocated to a single task. Allocating a specific amount of RAM can be done for the entire job via the `--mem` directive (memory per node) or, at a finer grain, via the `--mem-per-cpu` directive; a short comparison of the two follows, and the script `job_003.sh` below then applies the per-CPU variant.
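
A minimal sketch contrasting the two directives, using the 200 MB figure from this example (SLURM interprets a plain number as megabytes):

 #SBATCH --mem=200          # request 200 MB per node of the job
 #SBATCH --mem-per-cpu=200  # request 200 MB per allocated CPU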

 #! /bin/bash

 #SBATCH -n 1
 #SBATCH -N 1
 #SBATCH --array 1-10
 #SBATCH --mem-per-cpu=200
 #SBATCH --time=00:00:10
 #SBATCH --partition=testing
 #SBATCH --out job_003_%a.out

 hostname
 date
 sleep 5
 date

This time, only a small fraction of the available memory is allocated to an individual task and hence multiple tasks can be executed in parallel.
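
Whether several or all ten tasks can start at once depends on how many CPUs and how much memory the node offers; `sinfo` can report both (the format string is just one possible choice):

$ sinfo -N -o "%n %c %m"    # hostname, CPUs, and memory (in MB) per node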

The timestamps in the task-specific output files prove the point:

$ cat job_003*.out
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
node01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023
fastnode01
Wed Oct 18 14:39:22 CEST 2023
Wed Oct 18 14:39:27 CEST 2023

All tasks were executed at exactly the same time, as intended. Somewhat of a surprise: task 2 ran on a different node than all the others. We'll leave that for a forthcoming post to investigate.
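
One way to double-check which node each array task ran on, assuming job accounting (`sacct`) is enabled on the cluster (`<jobid>` is again the array's job ID):

$ sacct -j <jobid> --format=JobID,NodeList,Start,End,State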

Conclusions

To run tasks in parallel using the job array syntax, one must not only specify CPUs but also memory. Depending on the availability of memory and CPUs on the reserved node(s), one, several, or all tasks will run concurrently.