ERROR - Run CCTM with 112 and 224 processors

Hi there,

I can run CCTM without problems on 128 and 256 processors using NPCOL_NPROW="8 16" and NPCOL_NPROW="8 32" respectively, in CMAQ 5.0.2.

When I try to use NPROC=112 or 224, it does not work, and I always get an out-of-memory (OOM) error:

slurmstepd: error: Detected 2 oom_kill events in StepId=1768085.0. Some of the step tasks have been OOM Killed.
srun: error: gs06r3b68: task 0: Out Of Memory
srun: Terminating StepId=1768085.0

Q1: Can CCTM run with 112 or 224 processors?
Q2: Could I change something, for example in the run_cctm script or the config, to make those processor counts work? Or is it just a custom directive?

Thanks a lot, Lucas.

Hi Lucas,

There is no such restriction: it should work fine for any processor count, as long as NPCOL and NPROW are specified correctly.

For example, if you request 112 processors from your system, possible values for NPCOL and NPROW are:

NPCOL = 14
NPROW = 8

OR

NPCOL = 8
NPROW = 14

OR

NPCOL = 16
NPROW = 7

etc…
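
To list every valid pair for a given processor count, the factor pairs can be enumerated with a few lines of code. This is only a minimal sketch in Python and is not part of CMAQ; the function name `decompositions` is made up for illustration.

```python
def decompositions(nprocs):
    """Return every (NPCOL, NPROW) pair whose product equals nprocs."""
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    return [(npcol, nprocs // npcol)
            for npcol in range(1, nprocs + 1)
            if nprocs % npcol == 0]


if __name__ == "__main__":
    for npcol, nprow in decompositions(112):
        # Printed in the same form as the NPCOL_NPROW line of the run script.
        print(f'setenv NPCOL_NPROW "{npcol} {nprow}"')
```

Any pair from this list is arithmetically valid; which one performs best depends on the shape of the domain, so it may be worth trying a few.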


Hi Fahim,

Thanks for your answer. Sadly, after trying those and other combinations, it still did not work; it always shows the out-of-memory error message.

I believe it has something to do with multiples of 112 (112, 224, 336, etc.), because with 128, 256 and 512 it works fine.

I am posting the run.cctm script here in case it helps. Maybe it is related to how the processors are configured for the domain? Thanks!


#! /bin/csh -f

# ====================== CCTMv5.0.2 Run Script ======================
# Usage: run.cctm >&! cctm_D502a.log &
# The following environment variables must be set for this script to
# execute properly:
#   setenv M3DATA = input/output data directory
# To report problems or request help with this script/program:
#   http://www.cmascenter.org/html/help.html
# ===================================================================

#> Source the config.cmaq file to set the run environment
source /exp/CMAQ/CMAQv5.0.2/scripts/config.cmaq

#> Check that M3DATA is set:
if ( ! -e $M3DATA ) then
echo " $M3DATA path does not exist"
exit 1
endif
echo " "; echo " Input data path, M3DATA set to $M3DATA"; echo " "

set PROC = mpi #> serial or mpi
set APPL = CCTM_cb05tucl_ae6_aq
set CFG = EU
set MECH = cb05tucl_ae6_aq
set EXEC = CCTM_cb05tucl_ae6_aq

#> horizontal domain decomposition
setenv NPCOL_NPROW "8 28"
set NPROCS = 224

#> Set the working directory:
set BASE = /exp/run/cctm/EU/20230304
set BLD = ${BASE}/BLD_${APPL}

cd $BASE; date; cat $BASE/cfg.$CFG; echo " "; set echo

cd $BASE; date; cat $BLD/cfg.$CFG; echo " "; set echo

#> timestep run parameters

set STDATE = 2023063 # beginning date
set STTIME = 000000 # beginning GMT time (HHMMSS)
set NSTEPS = 240000 # time duration (HHMMSS) for this run
set TSTEP = 010000 # output time step interval (HHMMSS)
set YEAR = 2023
set YR = 23
set MONTH = 03
set DAY = 04
set YMD = ${YEAR}${MONTH}${DAY}

# =====================================================================
# CCTM Configuration Options
# =====================================================================

#setenv LOGFILE $BASE/$APPL.log #> log file name; uncomment to write standard output to a log, otherwise write to screen

#setenv GRIDDESC $M3DATA/mcip/GRIDDESC #> horizontal grid defn
setenv GRIDDESC /exp/a75n/output/mcip/EU/20230304/GRIDDESC
setenv GRID_NAME EU #> check GRIDDESC file for GRID_NAME options

#setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP ANH4J ASO4I ASO4J" #> CONC file species; comment or set to "ALL" to write all species to CONC
#setenv CONC_SPCS "O3 NO CO NO2 SO2 HNO3 FORM ALD2 PAR PAN OLE ETH TOL ISOP SULF BENZENE ASO4J ASO4I ANH4I ANH4J ANO3I ANO3J AORGPAI AORGPAJ A25J ACORS ANAJ"
#setenv CONC_BLEV_ELEV " 1 4" #> CONC file layer range; comment to write all layers to CONC

#setenv AVG_CONC_SPEC "O3 CO NO NO2 SO2"
setenv AVG_CONC_SPCS "ALL"
#setenv AVG_CONC_SPCS "O3 NO CO NO2 ASO4I ASO4J NH3 SO2 ANH4I ANH4J ANO3I ANO3J AORGPAI AORGPAJ AECI AECJ A25I A25J ACORS ASOIL ANAI ANAJ ACLJ ACLI ACLK ASO4K" #> ACONC file species; comment or set to "ALL" to write all species to ACONC
setenv ACONC_BLEV_ELEV " 1 1" #> ACONC file layer range; comment to write all layers to ACONC
#setenv ACONC_END_TIME Y #> override default beginning ACON timestamp [ default: N ]

setenv CTM_MAXSYNC 100 #> max sync time step (sec) [default: 720]
setenv CTM_MINSYNC 10 #> min sync time step (sec) [default: 60]
setenv CTM_CKSUM N #> write cksum report [ default: Y ]
setenv CLD_DIAG N #> write cloud diagnostic file [ default: N ]
setenv CTM_AERDIAG Y #> aerosol diagnostic file [ default: N ]
setenv CTM_PHOTDIAG N #> photolysis diagnostic file [ default: N ]
setenv CTM_SSEMDIAG N #> sea-salt emissions diagnostic file [ default: N ]
setenv CTM_WB_DUST N #> use inline windblown dust emissions [ default: Y ]
setenv CTM_DUSTEM_DIAG N #> windblown dust emissions diagnostic file [ default: N ]; ignore if CTM_WB_DUST = N
setenv CTM_LTNG_NO N #> turn on lightning NOx [ default: N ]
setenv CTM_WVEL N #> save derived vertical velocity component to conc file [ default: N ]
setenv KZMIN Y #> use Min Kz option in edyintb [ default: Y ], otherwise revert to Kz0UT
setenv CTM_ILDEPV Y #> calculate in-line deposition velocities [ default: Y ]
setenv CTM_MOSAIC N #> landuse specific deposition velocities [ default: N ]
setenv CTM_ABFLUX N #> Ammonia bi-directional flux for in-line deposition velocities [ default: N ]; ignore if CTM_ILDEPV = N
setenv CTM_HGBIDI N #> Mercury bi-directional flux for in-line deposition velocities [ default: N ]; ignore if CTM_ILDEPV = N
setenv CTM_SFC_HONO Y #> Surface HONO interaction [ default: Y ]; ignore if CTM_ILDEPV = N
setenv CTM_DEPV_FILE N #> write diagnostic file for deposition velocities [ default: N ]
setenv CTM_BIOGEMIS N #> calculate in-line biogenic emissions [ default: N ]
setenv B3GTS_DIAG N #> write biogenic mass emissions diagnostic file [ default: N ]; ignore if CTM_BIOGEMIS = N
setenv CTM_PT3DEMIS N #> calculate in-line plume rise for elevated point emissions [ default: N ]
setenv PT3DDIAG N #> optional 3d point source emissions diagnostic file [ default: N]; ignore if CTM_PT3DEMIS = N
setenv PT3DFRAC N #> optional layer fractions diagnostic (play) file(s) [ default: N]; ignore if CTM_PT3DEMIS = N
setenv IOAPI_LOG_WRITE F #> turn on excess WRITE3 logging [ options: T | F ]
setenv FL_ERR_STOP N #> stop on inconsistent input files
setenv PROMPTFLAG F #> turn on I/O-API PROMPT*FILE interactive mode [ options: T | F ]
setenv IOAPI_OFFSET_64 YES #> support large timestep records (>2GB/timestep record) [ options: YES | NO ]
setenv EXECUTION_ID $EXEC #> define the model execution id

set DISP = delete #> [ delete | update | keep ] existing output files

@lucasb,

Could you also attach any other log files that the model writes, if there are any? Also, are you submitting it as a batch job? If so, how? And what does your system configuration look like (processor counts, per-core memory, etc.)?

If possible, could you also try another version of CMAQ to see whether the problem still exists?
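
Before resubmitting, it may also be worth confirming that the decomposition in the run script multiplies out to the task count, and that the batch job requests that same number of tasks. The following is a rough sketch in Python, not a CMAQ tool: the helper name `check_decomposition` is hypothetical, and it assumes the script uses the `setenv NPCOL_NPROW "C R"` and `set NPROCS = N` forms with plain double quotes, skipping commented-out lines.

```python
import re


def check_decomposition(script_text):
    """Parse NPCOL_NPROW and NPROCS from csh run-script text.

    Returns (npcol, nprow, nprocs, consistent), where consistent is True
    when npcol * nprow equals nprocs.
    """
    # Anchor at line start so lines commented out with '#' are ignored.
    grid = re.search(r'^[ \t]*setenv\s+NPCOL_NPROW\s+"\s*(\d+)\s+(\d+)\s*"',
                     script_text, re.MULTILINE)
    procs = re.search(r'^[ \t]*set\s+NPROCS\s*=\s*(\d+)',
                      script_text, re.MULTILINE)
    if grid is None or procs is None:
        raise ValueError("NPCOL_NPROW or NPROCS not found in script text")
    npcol, nprow = int(grid.group(1)), int(grid.group(2))
    nprocs = int(procs.group(1))
    return npcol, nprow, nprocs, npcol * nprow == nprocs


if __name__ == "__main__":
    # Values taken from the run script posted above.
    sample = 'setenv NPCOL_NPROW "8 28"\nset NPROCS = 224\n'
    print(check_decomposition(sample))
```

For the values posted above (8 x 28 and 224) the check passes, so the script itself is consistent; the job submission settings and the memory available per node are the next things to look at.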