CMAQ v5.3.3 running error using SLURM

kkkghs0828 · March 7, 2022, 7:53pm

I can run CMAQ successfully without SLURM.
But when I try to run CMAQ through SLURM, the job fails with a /bin/rm error message.

lizadams · March 7, 2022, 8:25pm

The error may be due to your SLURM settings.
The message in the first screenshot says that you are trying to run 2 processes on 1 CPU.

The relevant section of your run script may look something like this:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

 set PROC      = mpi               #> serial or mpi

 @ NPCOL  =  1; @ NPROW = 2
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW"; 

 ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

However, if your system only has one processor per node, then you would need to change the SLURM setting to use 2 nodes and 1 task per node.

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

This would give 2 CPUs with one task per CPU and should successfully run a job using NPROCS = 2.

Or you can try to use hyperthreading, and run two processes on one cpu, by changing the mpirun command to include the suggested option.

mpirun ... --bind-to core:overload-allowed

Change this line in the run script

 ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

to

 ( /usr/bin/time -p mpirun --bind-to core:overload-allowed -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

If you use hyperthreading, be aware that this may negatively impact the run time of the job.

kkkghs0828 · March 7, 2022, 9:13pm

I made the changes you mentioned, but the job failed to submit.

lizadams · March 7, 2022, 9:19pm

In this case, you need to look up the queue policy. For the 528_queue that you are submitting to, there is a minimum CPU limit of 45 tasks.
You could either try to resubmit to the debug queue using the following:

#SBATCH --partition=debug_queue

or use the 528_queue and increase your nodes to 80 and set NPCOL to 8 and NPROW to 10

See the Table of Limits in the following link:

kkkghs0828 · March 7, 2022, 9:43pm

I can use up to 12 nodes.
I got similar errors…
I changed to 50 cores.

kkkghs0828 · March 7, 2022, 10:00pm

Should I change this?

lizadams · March 8, 2022, 12:43am

NPCOL x NPROW = NPROCS

NPROCS needs to be less than or equal to the number of tasks requested in your SLURM settings.

NODES x ntasks-per-node = 50 in your case (2 nodes x 25 tasks-per-node)

You can set
NPCOL = 5
NPROW = 10

Then rerun.
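
The arithmetic above can be checked before submitting. Here is a minimal sketch in Python; the helper names `fits` and `decompositions` are made up for illustration and are not part of CMAQ or SLURM. It only compares NPCOL x NPROW with nodes x ntasks-per-node and does not know anything about queue limits.

```python
def fits(npcol, nprow, nodes, ntasks_per_node):
    """Return True if NPCOL x NPROW does not exceed the requested SLURM tasks."""
    nprocs = npcol * nprow
    requested = nodes * ntasks_per_node
    return nprocs <= requested


def decompositions(ntasks):
    """List every (NPCOL, NPROW) pair whose product is exactly ntasks."""
    return [(c, ntasks // c) for c in range(1, ntasks + 1) if ntasks % c == 0]


if __name__ == "__main__":
    # Values from this thread: 2 nodes x 25 tasks per node, NPCOL = 5, NPROW = 10.
    print(fits(5, 10, nodes=2, ntasks_per_node=25))
    print(decompositions(50))
```

Any pair returned by `decompositions(50)` uses all 50 requested tasks; which pair performs best depends on the shape of your domain.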

kkkghs0828 · March 8, 2022, 4:10pm

Of course, NPROCS is equal to ntasks x nodes.
I still get the same error.

bambila1 · May 26, 2022, 7:00pm

Hi Liz, what should I do for the specific case where my runscript reports:

/usr/bin/time: Command not found.

fsidi · May 27, 2022, 3:30am

@bambila1,

The run script tries to collect timing metrics for you, but they are not needed to run the model. If your system does not have the time command, please amend the run script:

From:

( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

To:

mpirun -np $NPROCS $BLD/$EXEC

bambila1 · May 31, 2022, 6:17pm

Thanks for the response, but it fails to run and outputs an error message that is slightly different from the former:

id: cannot find name for user ID 1341697614
/bin/rm: No match.

This probably has to do with ‘EXECUTION_ID’ in the run script, but I do not know how to resolve it. If you look at the attached job.*.out file, the job is killed after the call to mpirun, but I do not think the problem comes from the number of processors.
run_cctm_Bench_2016_12SE1.txt (35.1 KB)
job.12993.err.txt (63 Bytes)
job.12993.out.txt (12.5 KB)

fsidi · May 31, 2022, 11:36pm

@bambila1 the run is dying somewhere within the CMAQ code itself. I don’t think it has anything to do with the execution id.

When running a CMAQ job in parallel, there is a “main” log file, plus an “ancillary” log file written by each processor. By default, the ancillary logs have names like CTM_LOG_000…, while the main log is whatever SLURM writes to the job's output file.

The main log message indicates that process 0 (“MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD”) aborted, so check the CTM_LOG_000 file. There should be a “PM3EXIT” error message at the bottom.

Please also post this log file here so we can help interpret the message for you.
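
If there are many per-processor logs, a short script can pick out the ones that aborted. This is only a sketch: the function name `find_abort_logs`, the glob pattern `CTM_LOG_*`, and the default of 20 trailing lines are assumptions for illustration, and the search string "PM3EXIT" is simply the message mentioned above. Adjust the directory and pattern to wherever your run writes its logs.

```python
import glob
import os


def find_abort_logs(log_dir=".", pattern="CTM_LOG_*", marker="PM3EXIT", tail=20):
    """Return {log file: last `tail` lines} for each log that mentions `marker`."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            lines = handle.read().splitlines()
        if any(marker in line for line in lines):
            hits[path] = lines[-tail:]
    return hits


if __name__ == "__main__":
    for path, last_lines in find_abort_logs().items():
        print("==>", path)
        print("\n".join(last_lines))
```

Run it from the directory that holds the log files; the lines it prints are the part worth posting.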

lizadams · June 1, 2022, 3:20pm

For both /usr/bin/time and /bin/rm, it is possible that the commands are in a different location on your system.
Please try removing the path, and allow the shell to find the command on your PATH if it exists.

Change

( /usr/bin/time -p mpirun -np $NPROCS

to

( time -p mpirun -np $NPROCS

and change the rm command from

/bin/rm
to
rm
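
To see where these commands actually live on your system, here is a minimal sketch using Python's standard `shutil.which`. The function name `locate` and the default list of commands are assumptions for illustration. Note that it only searches PATH for executables, so a result of "not found" for time does not rule out a shell built-in of the same name.

```python
import shutil


def locate(commands=("time", "rm", "mpirun")):
    """Map each command name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in locate().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Run it in the same environment as the job (for example, after loading the same modules), since PATH on a compute node may differ from the login node.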

bambila1 · June 2, 2022, 4:10am

Thank you.

I checked the log files and saw that some file paths for my application (e.g. bi-directional ammonia) were wrong, so I had to either set the corresponding flags to ‘N’ or specify the correct paths.
