Error performing Parallel Runs with SLURM

Parallelization with OpenMP and MPI, scalability, reproducibility, errors, problems, suggestions
Odera Dim
Posts: 55
Joined: Wed Oct 15, 2014 11:30 pm

Error performing Parallel Runs with SLURM

Post by Odera Dim » Tue Jan 05, 2021 3:23 pm

Hi All,

I am using SLURM to perform a parallel run on a cluster. The code runs for a while then quits with an error on one of the nodes. The error I receive is:

Code:

srun: error: node10: task 7: Exited with exit code 255
I believe the whole job is then aborted because of this. Note that this particular code has run successfully in serial mode.

Does anyone have an idea how I could resolve this? I have also attached the error log and the SLURM scheduler script. Thank you.
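
For longer error logs, the srun failure lines (like the one quoted above) can be pulled out so it is easy to see which nodes and tasks exited and with what code. The following is only a minimal sketch: the function name `summarize_srun_failures` is made up for illustration, and it assumes the log lines follow the same `srun: error: <node>: task <id>: Exited with exit code <code>` layout as the message above (a plural `tasks` with a list or range of ids is also accepted, on the assumption that srun groups tasks that way).

```shell
# Hypothetical helper: summarize srun failure lines.
# Reads log text on stdin and prints "<node> <task(s)> <exit code>" per match.
# Lines that do not look like an srun task failure are ignored.
summarize_srun_failures() {
    sed -n 's/^srun: error: \([^:]*\): tasks* \([0-9,-][0-9,-]*\): Exited with exit code \([0-9][0-9]*\).*$/\1 \2 \3/p'
}
```

With the `-e input.std.err%j` setting from the script, it could be run as `summarize_srun_failures < input.std.err<jobid>`, where `<jobid>` is whatever job number SLURM assigned.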

Code:

SLURM Script

#!/bin/csh
###
### serpent run
###
 

#SBATCH -J input
#SBATCH -e input.std.err%j
#SBATCH -o input.std.out%j
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=email@me.now
#SBATCH --mem-per-cpu=32768
#SBATCH -t 01:00:00
#SBATCH -N 16
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=16
#SBATCH -p cluster
module load gcc/4.9.4
cd $SLURM_SUBMIT_DIR
 
srun ./sss2 -omp $SLURM_CPUS_PER_TASK input
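
As a side calculation, the memory request implied by the `#SBATCH` lines above can be worked out by hand. This is only a sketch of the arithmetic: it assumes the bare number given to `--mem-per-cpu` is read by SLURM as megabytes, and it says nothing about whether memory is related to the failure, which the thread does not establish.

```shell
# Values copied from the #SBATCH lines in the script above.
mem_per_cpu_mb=32768   # --mem-per-cpu (assumed to be in megabytes)
cpus_per_task=16       # --cpus-per-task

# Memory requested for each task: per-CPU memory times CPUs per task.
mem_per_task_mb=$((mem_per_cpu_mb * cpus_per_task))

# Same figure in GB, taking 1024 MB per GB.
mem_per_task_gb=$((mem_per_task_mb / 1024))

echo "per-task request: ${mem_per_task_mb} MB (${mem_per_task_gb} GB)"
```

Comparing this figure with the memory actually installed on the nodes is one quick sanity check on the resource request.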

Attachments
error.png (158.14 KiB)

Ana Jambrina
Posts: 659
Joined: Tue May 26, 2020 5:32 pm

Re: Error performing Parallel Runs with SLURM

Post by Ana Jambrina » Tue Jan 05, 2021 6:49 pm

The error seems to come from the cluster side, not from Serpent. It is related to inter-node communication; it might be triggered by the server network, by the MPI implementation (MPICH/MPICH2?) and how it was built, and/or by the Slurm configuration.
- Ana
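
Two quick checks along these lines can help narrow things down before contacting the admin. This is a rough sketch, not a definitive procedure: the helper name `links_mpi` is made up for illustration, and the check only works for dynamically linked binaries whose MPI library name contains `libmpi` (an assumption that holds for common MPICH and Open MPI installs, but not for static builds).

```shell
# Hypothetical helper: succeed if ldd-style text on stdin mentions
# a shared library whose name contains "libmpi" (e.g. libmpi.so, libmpich.so).
links_mpi() {
    grep -q 'libmpi'
}

# Possible use on the cluster (binary name and node count from the script above):
#   ldd ./sss2 | links_mpi && echo "sss2 is linked against an MPI library"
#   srun -N 16 hostname    # trivial multi-node launch, no Serpent involved
```

If the trivial `srun -N 16 hostname` launch also fails on the same node, that points away from Serpent and toward the network or scheduler setup.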


Re: Error performing Parallel Runs with SLURM

Post by Odera Dim » Tue Jan 05, 2021 9:18 pm

Okay Ana,

Thanks. I will reach out to the server admin. Happy new year.
