I succeeded in running CMAQ without SLURM.
However, when I tried to run CMAQ using SLURM, the job failed with a /bin/rm error message.
The error may be due to your SLURM settings.
The message in the first screenshot says that you are trying to run 2 processes on 1 CPU.
The relevant section of your run script may look something like this:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
set PROC = mpi #> serial or mpi
@ NPCOL = 1; @ NPROW = 2
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW";
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt
However, if your system has only one processor per node, then you would need to change the SLURM settings to use 2 nodes and 1 task per node.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
This would give you 2 CPUs with one task per CPU and should successfully run a job with NPROCS = 2.
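If you are not sure how many CPUs SLURM sees on each node, a quick way to check (a sketch only, assuming standard SLURM utilities are available on your login node; replace <nodename> with one of your compute nodes) is:
sinfo -o "%N %c"                              # node names and the number of CPUs SLURM sees on each
scontrol show node <nodename> | grep CPUTot   # CPUTot is the total CPU count on that node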
Or you can try to use hyperthreading and run two processes on one CPU by changing the mpirun command to include the suggested option:
mpirun ... --bind-to core:overload-allowed
Change this line in the run script
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt
to
( /usr/bin/time -p mpirun --bind-to core:overload-allowed -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt
If you use hyperthreading, be aware that this may negatively impact the run time of the job.
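Before trying that, it may be worth confirming that hyperthreading is actually available on the compute nodes (a minimal check, assuming lscpu is installed there):
lscpu | grep -i "thread"    # "Thread(s) per core: 2" means hyperthreading is available; 1 means it is not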
In this case, you need to look up the queue policy. For the 528_queue that you are submitting to, there is a minimum limit of 45 CPUs (tasks) per job.
You could either resubmit to the debug queue using the following:
#SBATCH --partition=debug_queue
or stay on the 528_queue and increase the total number of tasks to 80, setting NPCOL to 8 and NPROW to 10.
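For example, one combination that satisfies the 45-task minimum would look like this (a sketch only; the 20-CPUs-per-node figure is an assumption, so adjust --nodes and --ntasks-per-node to match what your nodes actually provide):
#SBATCH --partition=528_queue
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20   #> 4 x 20 = 80 tasks requested
@ NPCOL = 8; @ NPROW = 10
@ NPROCS = $NPCOL * $NPROW     #> 80 MPI processes, which fits within the 80 requested tasks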
See the Table of Limits in the following link:
NPCOL x NPROW = NPROCS.
NPROCS needs to be less than or equal to the number of requested tasks in your SLURM settings:
nodes x ntasks-per-node = 50 in your case (2 nodes x 25 tasks-per-node).
You can set
NPCOL = 5
NPROW = 10
Then rerun.
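For reference, a consistent combination for your 50-task request would look something like this (a sketch based only on the values above; keep the rest of your run script as it is):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=25
@ NPCOL = 5; @ NPROW = 10
@ NPROCS = $NPCOL * $NPROW        #> 50 MPI processes, equal to the 2 x 25 requested tasks
setenv NPCOL_NPROW "$NPCOL $NPROW";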
Hi Liz, what should I do for the specific case where my runscript reports:
/usr/bin/time: Command not found.
@bambila,
While the runscript attempts to gather and keep track of timing metrics for you, it is not a necessity for running the model. If your system does not have the time command, please amend the runscript:
From:
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt
To:
mpirun -np $NPROCS $BLD/$EXEC
Thanks for the response, but it still fails to run and outputs an error message that is slightly different from the previous one:
id: cannot find name for user ID 1341697614
/bin/rm: No match.
This probably has to do with 'EXECUTION_ID' in the runscript, but I do not know how to resolve it. If you look in the job.*.out file that is attached, the job dies after the call to mpirun, but I do not think the problem is with the number of processors.
run_cctm_Bench_2016_12SE1.txt (35.1 KB)
job.12993.err.txt (63 Bytes)
job.12993.out.txt (12.5 KB)
@bambila1, the run is dying somewhere within the CMAQ code itself. I don’t think it has anything to do with the execution ID.
When running a CMAQ job in parallel, there is a “main” log file and “ancillary” log files output by each processor. By default, the ancillary logs have names like CTM_LOG_000…, while the main log content is captured in the SLURM output file.
The main log message indicates that process 0 (“MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD”) aborted, so check the CTM_LOG_000 file. There should be a “PM3EXIT” error message at the bottom.
Please also post this log file here so we can help interpret the message for you.
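To find the message quickly, something along these lines should work (a sketch; adjust the file name pattern to match the LOGFILE naming in your run script and output directory):
tail -n 30 CTM_LOG_000*            # end of the rank-0 log, where the PM3EXIT error message appears
grep -iE "abort|error" CTM_LOG_*   # scan all per-processor logs for error text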
For both /usr/bin/time and /bin/rm, it is possible that the commands are installed in a different location on your system.
Please try removing the hard-coded path and letting the shell find the command on your PATH, if it exists.
Change
( /usr/bin/time -p mpirun -np $NPROCS
to
( time -p mpirun -np $NPROCS
and change the rm command from
/bin/rm
to
rm
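You can also check where (or whether) these commands actually live on your system, for example:
which time rm                   # show what the shell finds on your PATH, if anything
ls -l /usr/bin/time /bin/rm     # confirm whether the hard-coded paths exist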
Thank you.
I checked the log files and saw that some file paths for my application (e.g., bi-directional ammonia) were wrong, so I either had to set those flags to 'N' or specify the appropriate paths.