CMAQ v5.3.3 running error using SLURM

I succeeded in running CMAQ without SLURM, but when I try to run it through SLURM, the job fails with a /bin/rm error message.


The error may be due to your SLURM settings.
The message in the first screenshot says that you are trying to run 2 MPI processes on 1 CPU.

The relevant section of your run script may look something like this:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

 set PROC      = mpi               #> serial or mpi

 @ NPCOL  =  1; @ NPROW = 2

 ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

However, if your system only has one processor per node, then you would need to change the SLURM settings to use 2 nodes and 1 task per node.

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

This would give 2 CPUs with one task per CPU and should successfully run a job with NPROCS = 2.
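Putting the pieces together, a minimal SLURM header for this layout might look like the following sketch (the job name and wall-clock limit are placeholder values, not taken from your script):

```shell
#!/bin/csh
#SBATCH --job-name=CMAQ_CCTM      # placeholder job name
#SBATCH --nodes=2                 # two nodes, since each node has one CPU
#SBATCH --ntasks-per-node=1       # one MPI task per node
#SBATCH --time=08:00:00           # placeholder wall-clock limit
```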

Or you can try to use hyperthreading, and run two processes on one cpu, by changing the mpirun command to include the suggested option.

mpirun ... --bind-to core:overload-allowed

Change this line in the run script:

 ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

to:

 ( /usr/bin/time -p mpirun --bind-to core:overload-allowed -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

If you use hyperthreading, be aware that this may negatively impact the run time of the job.

I changed it as you mentioned, but the job failed to submit.

In this case, you need to look up the queue policy. For the 528_queue that you are submitting to, there is a minimum limit of 45 tasks (CPUs).
You could either try to resubmit to the debug queue using the following:

#SBATCH --partition=debug_queue

or use the 528_queue and increase your total number of tasks to 80, setting NPCOL to 8 and NPROW to 10.

See the Table of Limits in the following link:

I can use up to 12 nodes.
I got similar errors…
I changed to 50 cores

Should I change this?


NPROCS needs to be less than or equal to the number of requested tasks in your SLURM settings.

NODES x ntasks-per-node = 50 in your case (2 nodes x 25 tasks-per-node)

You can set
NPCOL = 5 and NPROW = 10

Then rerun.

Of course, NPROCS is equal to ntasks-per-node x nodes.
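To avoid this mismatch in the future, the relationship can be checked mechanically. Here is a small bash sketch (the variable values mirror this case, 2 nodes x 25 tasks per node; they are not read from your actual script):

```shell
#!/bin/bash
# Values mirroring this case: 2 nodes x 25 tasks per node = 50 MPI tasks.
NODES=2
NTASKS_PER_NODE=25
NPCOL=5
NPROW=10

# NPROCS must equal the total number of SLURM tasks...
NPROCS=$(( NODES * NTASKS_PER_NODE ))

# ...and the CMAQ domain decomposition must multiply out to NPROCS.
if [ $(( NPCOL * NPROW )) -ne "$NPROCS" ]; then
  echo "Mismatch: NPCOL x NPROW = $(( NPCOL * NPROW )), but NPROCS = $NPROCS" >&2
  exit 1
fi
echo "OK: NPCOL x NPROW = NPROCS = $NPROCS"
```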
still same error


Hi Liz, what should I do for the specific case where my runscript reports:

/usr/bin/time: Command not found.


While the runscript attempts to gather and keep track of timing metrics for you, it is not a necessity for running the model. If your system does not have the time command, please amend the runscript, changing:

( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

to:

mpirun -np $NPROCS $BLD/$EXEC
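To see which of these commands your system actually provides before editing, you can ask the shell directly (a quick diagnostic, not part of the runscript):

```shell
# Print where each command resolves; "not found" is printed for any
# command missing from PATH (shell builtins/keywords may also match).
for cmd in time tee rm mpirun; do
  printf '%s -> %s\n' "$cmd" "$(command -v "$cmd" || echo 'not found')"
done
```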


Thanks for the response, but it still fails to run, and the error message is slightly different from the one before:

id: cannot find name for user ID 1341697614
/bin/rm: No match.

This probably has to do with 'EXECUTION_ID' in the runscript, but I do not know how to resolve it. If you look in the attached job.*.out file, the job is killed after the call to mpirun, but I do not think the problem is the number of processors.
run_cctm_Bench_2016_12SE1.txt (35.1 KB)
job.12993.err.txt (63 Bytes)
job.12993.out.txt (12.5 KB)

@bambila1 the run is dying somewhere within the CMAQ code itself. I don’t think it has anything to do with the execution id.

When running a CMAQ job in parallel, there is a "main" log file plus "ancillary" log files written by each processor. By default, the ancillary logs have names like CTM_LOG_000…, while the main log is captured in the SLURM output file.

The main log message indicates that process 0 (“MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD”) aborted, so check the CTM_LOG_000 file. There should be a “PM3EXIT” error message at the bottom.
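As a sketch of how to locate the failing log quickly, the following creates two dummy per-processor logs in a scratch directory and greps for the abort text (the file names follow the CTM_LOG_nnn default; the error string here is illustrative, not copied from your run):

```shell
# Set up a scratch directory with two fake per-processor logs.
demo=$(mktemp -d)
cd "$demo"
printf 'normal model output\n'              > CTM_LOG_001.v533
printf '*** ERROR ABORT in subroutine X\n'  > CTM_LOG_000.v533

# List only the log(s) that contain the abort message, then show its tail.
grep -l 'ERROR ABORT' CTM_LOG_*
tail -n 5 CTM_LOG_000.v533
```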

Please also post this log file here so we can help interpret the message for you.


For both /usr/bin/time and /bin/rm, it is possible that the commands are installed in a different location on your system.
Please try removing the path, and allow the model to find the command if it exists.


( /usr/bin/time -p mpirun -np $NPROCS

to

( time -p mpirun -np $NPROCS

and similarly change the rm command from /bin/rm to rm.


Thank you.

I checked the log files and saw that some file paths for my application (e.g., bi-directional ammonia) were wrong, so I had to either set those flags to 'N' or specify the appropriate paths.
