Introduction to Modern Fortran

In February, the Research Software Engineering group hosted an ‘Introduction to Modern Fortran Course’ taught by EPSRC Research Software Engineering Fellow, Ian Bush. The course material is available at https://www.oerc.ox.ac.uk/introduction-modern-fortran-course-materials.

During the day, Ian recommended a bunch of books (below)

We’ve been working with the University library and I’m happy to announce that all of these are now available to borrow. Search for them using the University catalogue

Determining MPI placement on the HPC clusters

Say you request a 16 slot MPI job on ShARC with 3GB per-process using a submission script like the one below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#!/bin/bash
#Tell the scheduler that maximum runtime is 1 hour
#$ -l h_rt=1:00:00
#Request 16 slots
#$ -pe mpi 16
#Request 3 Gigabytes per slot
#$ -l rmem=3G

#Load gcc 4.9.4 and OpenMPI 2.0.1
module load dev/gcc/4.9.4
module load mpi/openmpi/2.0.1/gcc-4.9.4

mpirun  ./MPI_hello_world

The scheduler is free to decide where on the system your 16 slots get placed. You may have all 16 slots running on one node, one slot per node for 16 nodes or anything in between. The exact placement of your jobs may affect runtime.

We can find out where the scheculer placed your MPI processes using the $PE_HOSTFILE environment variable. When your job starts running, this points to a file that contains placement information. We make use of it in a submission script as follows

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
#!/bin/bash
#Tell the scheduler that maximum runtime is 1 hour
#$ -l h_rt=1:00:00
#Request 16 slots
#$ -pe mpi 16
#Request 3 Gigabytes per slot
#$ -l rmem=3G

#Load gcc 4.9.4 and OpenMPI 2.0.1
module load dev/gcc/4.9.4
module load mpi/openmpi/2.0.1/gcc-4.9.4

#Put placement information into node_info.txt
cat $PE_HOSTFILE  > node_info.txt

mpirun  ./MPI_hello_world

You'll now get a file called node_info.txt that contains information about which nodes your MPI slots were placed. For example

sharc-node031.shef.ac.uk 1 shortint.q@sharc-node031.shef.ac.uk UNDEFINED
sharc-node069.shef.ac.uk 1 shortint.q@sharc-node069.shef.ac.uk UNDEFINED
sharc-node112.shef.ac.uk 1 shortint.q@sharc-node112.shef.ac.uk UNDEFINED
sharc-node108.shef.ac.uk 1 shortint.q@sharc-node108.shef.ac.uk UNDEFINED
sharc-node081.shef.ac.uk 1 shortint.q@sharc-node081.shef.ac.uk UNDEFINED
sharc-node090.shef.ac.uk 2 shortint.q@sharc-node090.shef.ac.uk UNDEFINED
sharc-node080.shef.ac.uk 2 shortint.q@sharc-node080.shef.ac.uk UNDEFINED
sharc-node050.shef.ac.uk 3 shortint.q@sharc-node050.shef.ac.uk UNDEFINED
sharc-node059.shef.ac.uk 4 shortint.q@sharc-node059.shef.ac.uk UNDEFINED

In the above example, 4 slots were placed on node059, 3 slots on node 50, 2 slots on nodes 080 and 090 and one slot on the other listed nodes.

Job validation with Grid Engine

(Edit: caveats are listed in a more recent post)

Computer cluster job scheduling software is fantastic at managing resources and permitting many jobs to run efficiently and simultaneously.
However, schedulers aren't always great at giving end-users feedback when things go wrong.

For example, on our ShARC cluster, which runs the (Son of) Grid Engine (SGE) scheduler, if you request a longer run-time than is permitted by any of the cluster's job queue configurations then your job will sit there queueing indefinitely until you or someone else deletes it.
For example, let's use qsub to submit a job where we ask for 1000 hours of run time and 4 GiB of RAM:

[will@mysofa ~]$ ssh sharc
...
[te1st@sharc-login1 ~]$ qsub -l h_rt=1000:00:00 -l rmem=4G -m bea -M w.furnass@sheffield.ac.uk -N longtask myjobscript.sge

Your job 236268 ("STDIN") has been submitted
[te1st@sharc-login1 ~]$ qstat -u $USER
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 217834 0.00000 longtask   te1st        qw    03/20/2017 10:48:39                                    1        

Job 217834 will now sit queuing forever.
Not only will you not be told why, you won't be given any notification that the job will not run.

In situations like this it can be useful to ask the scheduler to validate a job.
One way of doing this is to run 'qalter -w v <myjobid>' after job submission if say you think that a job has now been queueing for longer than previously-submitted jobs of a similar nature:

[te1st@sharc-login1 ~]$ qalter -w v 217834
Job 217834 (-l h_rt=3600000) cannot run in queue "flybrain.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "gpu.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "gen2reg.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "rse.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "gpu-vis.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "insigneo-polaris.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "interactive.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "shortint.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "all.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "evolgen.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "rse-training.q" because of cluster queue
Job 217834 (-l h_rt=3600000) cannot run in queue "cstest.q" because of cluster queue
verification: no suitable queues

What this 'qalter -w v <myjobid>' command does is check to see whether the job could run in any of the job queues on the cluster if the cluster were free of other jobs.

The last line of output is key: our job will never be run given the current cluster configuration.
Looking above that, we can see that it cannot run in any of the general-purpose job queues (such as all.q) and there is specific mention of our 1000 hour (3600000s) run-time resource request.
We can therefore deduce that our run-time resource request wasn't satisfiable.

Modifying a resource request post-submission

Once we know that our job can't run we could then delete our job...

[te1st@sharc-login1 ~]$ qdel 217834 
te1st has deleted job 217834 

...then consult the cluster's documentation to discover the maximum possible run-time and resubmit using more sensible resource requests.

Alternatively we can use qalter to modify the resource requests associated with a queueing job:

qalter -l h_rt=96:00:00 -l rmem=4G 217834 

Important: using qalter in this fashion will change all resource requests for the job so here we need to re-specify the rmem request.

Job validation at submission time

You can also perform the same type of job validation at job submission time using -w v e.g.

qsub -w v -l 1000:00:00 -l rmem=4G myjobscript.sge

This won't actually submit your job; it just performs validation.

Why is validation not performed by default?

You may ask why such validation is not enabled by default for all jobs; one reason for this is that it is believed it would place undue burden on the scheduler.

Another is that sometimes a validation attempt results in a false negative that can be difficult to automatically identify (edit: see this more recent post for details).

Other types of resources

If you repeat the experiment outlined above but instead of requesting 1000 hours of runtime you ask for 100 GPUs, 9999GB of RAM or 10000 cores you'll observe the same behaviour: jobs that make requests unsatisfiable under the current cluster configuration can be submitted but will never run.

Again, job validation can help here but depending on the type of resource the validation error messages can be more or less cryptic.
For example, if you try to validate a 100000-'slot' (core) MPI job using -w v you get the following:

qsub -pe mpi 100000 -w v somejob.sge
...
Job 311838 cannot run in PE "mpi" because it only offers 0 slots

This is rather misleading but the mention of 'slots' should prompt you to check the number of cores you've requested is sensible.

'Poke' validation: consider the current cluster load

Another type of validation is poke validation, which checks if a job could be run under the current cluster load i.e. with many of the cluster's resources already in use.
See man qsub and search for -w for more information on the different types of validation.