*** ERROR in INIT3/INITLOG3 ***
Error opening log file on unit 99
I/O STATUS = 10
DESCRIPTION: cannot overwrite existing file, unit 99, file /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts/CTM_LOG_032.v54_intel_108NHEMI_20190101
You need to remove the existing log files before doing a run.
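For example, something like the following would clear them; the directory is taken from the error message above, so double-check the path before deleting:
cd /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts
rm CTM_LOG_*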
Thanks for your reply. I did remove the existing log files before the new run, but this error message still appeared. Do you think this error message is what caused the run to stop? Even with the message present, the simulation still ran for 45 minutes.
I don’t see commands like the following in your run script:
### Job Name
#PBS -N mpi_job
If you are submitting the job interactively, then you are running on the login nodes, and it will fail.
If you are submitting the job to the queue, then I would check your HPC system's help desk documentation for tips on why jobs fail, including whether log files are filling up your home directory.
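For instance, to see how many log files have accumulated and how much space that directory is using (path taken from the error message above):
ls /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts/CTM_LOG_* | wc -l
du -sh /glade/u/home/hmao/CMAQ-5.4/CCTM/scripts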
I would add the PBS commands to the top of your run script, rather than calling the run script from your cctm_run.pbs.csh script.
You are calling mpirun -np 36 twice: once in cctm_run.pbs.csh, and a second time in run_cctm_2019_HEMI.csh.
Do a grep on mpirun in both scripts, as shown below, and you will see the issue.
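For instance, from the directory containing both scripts:
grep -n mpirun cctm_run.pbs.csh run_cctm_2019_HEMI.csh
With two file arguments, grep prefixes each match with its file name and line number, so the duplicate call is easy to spot.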
The comment below says to select 2 nodes with 36 CPUs each for a total of 72 MPI processes, but that is not what the directive actually requests: select=1:ncpus=36:mpiprocs=36 asks for 1 node and 36 MPI processes.
### Select 2 nodes with 36 CPUs each for a total of 72 MPI processes
#PBS -l select=1:ncpus=36:mpiprocs=36
### Send email on abort, begin and end
###PBS -m abe
### Specify mail recipient
###PBS -M email_address
### Run the executable
mpirun -n 36 ./run_cctm_2019_HEMI.csh
and your run script is set to use 36 processors and contains the following commands:
@ NPCOL = 6; @ NPROW = 6
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW";
endif
#> Executable call for multi PE, configure for your system
# set MPI = /usr/local/intel/impi/3.2.2.006/bin64
# set MPIRUN = $MPI/mpirun
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt
I would change your workflow to add the PBS commands at the top of run_cctm_2019_HEMI.csh and then submit it directly with
qsub run_cctm_2019_HEMI.csh
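For example, the top of run_cctm_2019_HEMI.csh might then start with something like this; the job name and walltime are placeholder values to adjust for your system:
#!/bin/csh -f
### Job Name
#PBS -N cctm_2019_hemi
### Select 1 node with 36 CPUs for a total of 36 MPI processes
#PBS -l select=1:ncpus=36:mpiprocs=36
### Wall-clock limit (placeholder; adjust for your system)
#PBS -l walltime=12:00:00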
Alternatively, you could edit your cctm_run.pbs.txt script as follows:
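A sketch of the likely edit, assuming the goal is to launch the run script once and let it invoke mpirun itself (see the duplicate mpirun call noted above):
### Run the driver script directly; it calls mpirun -np $NPROCS internally
./run_cctm_2019_HEMI.csh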
I followed your notes and the error in the cctm.o file disappeared, but the run still stopped after 45 minutes. I guess I will need to ask the HPC help desk for help.
If you can run top or htop on one of the compute nodes, you can check whether you are close to using all of the memory. It may be that using 72 processors would leave more memory headroom per compute node, since each MPI process then works on a smaller subdomain.
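For example (the job ID and node name are placeholders, and your site may only allow ssh to nodes where you have a running job):
qstat -f <jobid> | grep exec_host   # find which node(s) the job is running on
ssh <node_name>
top                                 # watch the memory line while CCTM runs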
Please ask the help desk how to set your PBS batch commands to do this. Note that mpiprocs is specified per chunk, so 2 chunks of 36 give 72 MPI processes in total. It may be as follows:
#PBS -l select=2:ncpus=36:mpiprocs=36
Then change the domain decomposition in your run script to use 72 processors.
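One sketch of a valid decomposition (8 columns by 9 rows; any factorization of 72 works, as long as NPCOL times NPROW equals NPROCS):
@ NPCOL = 8; @ NPROW = 9
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW"
Because the executable call passes $NPROCS to mpirun, the new process count is picked up automatically.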