CMAQ-5.4 failed when running a northern hemispheric simulation

Hi everyone,

I am trying to run a northern hemispheric simulation using CMAQ-5.4; the run script is attached below. I compiled the model with DEBUG mode on. The simulation stopped at (UTC) 0:45:00 with no error message in either the log file or the buff file. I wonder whether the error message in the cctm.o file (quoted below) is the cause. I am totally confused and don’t know how to fix this. Could you help with this?
cctm.o2568773.txt (154.6 KB)
buff_CMAQ_CCTMv54_sha=fb7856ef5c_hmao_20230814_140029_935996302.txt (14.6 KB)
CTM_LOG_000.v54_intel_108NHEMI_20190101.txt (162.7 KB)

Thanks!
Lin
run_cctm_2019_HEMI.txt (38.3 KB)

*** ERROR in INIT3/INITLOG3 ***
Error opening log file on unit 99
I/O STATUS = 10
DESCRIPTION: cannot overwrite existing file, unit 99, file /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts/CTM_LOG_032.v54_intel_108NHEMI_20190101

You need to remove the existing log files before doing a run.
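For example, from the directory shown in the error message (the path and run ID here are taken from your log; adjust the date to match the run you are starting):

rm -f /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts/CTM_LOG_???.v54_intel_108NHEMI_20190101

The ??? wildcard catches the per-processor log files (CTM_LOG_000, CTM_LOG_032, and so on).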

Thanks for your reply. I actually removed the existing log files before the new run, but this error message still appeared. Do you think this error message is what caused the run to stop? Even with this message present, the simulation still ran for 45 minutes.

Thanks!
Lin

How are you submitting this job?

Are you using

qsub run_cctm_2019_HEMI.csh

I don’t see commands such as the following in your run script:

Job Name

#PBS -N mpi_job

If you are submitting the job interactively, then you are running on the login nodes, and it will fail.

If you are submitting the job to the queue, then I would search your HPC system's help desk documentation for tips on why jobs fail, including filling up the home directory with log files.

(Documentation | ARC NCAR)

https://arc.ucar.edu/knowledge_base/72581486#Cheyennejobscriptexamples-BatchscripttorunanMPIjob

Thanks Liz,
I submitted the job to the queue and my script is this:
cctm_run.pbs.txt (460 Bytes)

Thanks!
Lin

I would add the PBS commands at the top of the run script itself, rather than calling the run script from your cctm_run.pbs.csh script.

You are calling mpirun twice: once in cctm_run.pbs.csh, and a second time in run_cctm_2019_HEMI.csh.

Do a grep on mpirun for both scripts, and you will see the issue.
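For example (using the script names as referenced above; adjust if yours differ):

grep -n mpirun cctm_run.pbs.csh run_cctm_2019_HEMI.csh

The outer call in the batch script launches 36 copies of the run script, and each copy then tries to start its own 36-process mpirun, which is not what you want.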

The comment below says to select 2 nodes with 36 CPUs each for a total of 72 MPI processes, but the #PBS directive underneath it only requests 1 node, so that is not what you are actually doing.

### Select 2 nodes with 36 CPUs each for a total of 72 MPI processes
#PBS -l select=1:ncpus=36:mpiprocs=36
### Send email on abort, begin and end
###PBS -m abe
### Specify mail recipient
###PBS -M email_address

### Run the executable
mpirun -n 36 ./run_cctm_2019_HEMI.csh

and your run script is set to use 36 processors and contains the following commands

   @ NPCOL  =  6; @ NPROW =  6
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW"; 
endif

  #> Executable call for multi PE, configure for your system 
  # set MPI = /usr/local/intel/impi/3.2.2.006/bin64
  # set MPIRUN = $MPI/mpirun
  ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

I would change your workflow to add the PBS commands at the top of run_cctm_2019_HEMI.csh and then submit it with

qsub run_cctm_2019_HEMI.csh
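A minimal sketch of what those directives might look like, placed right after the existing #!/bin/csh -f line at the top of run_cctm_2019_HEMI.csh (the job name, project code, queue, and walltime below are placeholders, not values from your scripts; follow the NCAR examples linked above):

### Job Name
#PBS -N cctm_hemi
### Project allocation (placeholder)
#PBS -A <project_code>
### Queue (placeholder)
#PBS -q regular
### 1 node with 36 CPUs and 36 MPI processes
#PBS -l select=1:ncpus=36:mpiprocs=36
### Wall-clock limit (placeholder)
#PBS -l walltime=12:00:00
### Combine stdout and stderr
#PBS -j oe

The rest of the run script stays as it is, so its mpirun -np $NPROCS line becomes the only mpirun call in the workflow.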

Alternatively, you could edit your cctm_run.pbs.txt script as follows:

### Run the executable
./run_cctm_2019_HEMI.csh

Thanks Liz,

I followed your notes and the error in the cctm.o file disappeared, but the run still stopped after 45 minutes. I guess I will need to ask the HPC help desk for help.

Best!
Lin

If you can run top or htop on one of the compute nodes, you can check whether you are close to using all of the memory. If memory is the problem, spreading the run over 72 processors on two nodes would leave more memory available per compute node.
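For example, a quick snapshot from the compute node while the job is running (how you reach the node, e.g. via ssh, depends on your system's policy):

top -b -n 1 | head -n 15     # one-shot snapshot; look at the memory summary lines
free -h                      # total versus available memory on the node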

Please ask the help desk how to set your PBS batch commands for a two-node, 72-process run. It may be as follows:

#PBS -l select=2:ncpus=36:mpiprocs=36

Then change the domain decomposition in your run script to use 72 processors:

@ NPCOL  =  8; @ NPROW =  9
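Putting the pieces together, the lines that need to change consistently for a 72-PE run would look roughly like this (the select line follows the pattern of the NCAR example above; confirm the exact syntax with your help desk):

### In the PBS header: 2 nodes x 36 CPUs = 72 MPI processes
#PBS -l select=2:ncpus=36:mpiprocs=36

### In the run script: an 8 x 9 = 72 domain decomposition
   @ NPCOL  =  8; @ NPROW =  9
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW"

### The existing executable call then picks up the 72 processes via $NPROCS
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt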