I am running CMAQ v5.3.3 with the EQUATES 2013 case. The model ran for around 6-7 hours and then was killed. Attached are the slurm log and compute-node log files.
I contacted our supercomputer administrator for help and received the following information about the job:
login-1.zaratan.umd.edu{haohe}1403: sacct -j 3847222 -o NodeList%20,Start,Elapsed,State%16,ExitCode,AveRSS%16,MaxRSS,AllocTRES%40
NodeList              Start                Elapsed   State          ExitCode  AveRSS        MaxRSS      AllocTRES
compute-a8-[25-26,3+  2023-10-14T21:55:39  00:38:04  COMPLETED      0:0                                 billing=108,cpu=108,mem=432000M,node=4
compute-a8-25         2023-10-14T21:55:39  00:38:04  COMPLETED      0:0       31683384K     31683384K   cpu=9,mem=36000M,node=1
compute-a8-[25-26,3+  2023-10-14T21:55:39  00:38:04  COMPLETED      0:0       1798K         1816K       billing=108,cpu=108,mem=432000M,node=4
compute-a8-[26,31-3+  2023-10-14T21:55:40  00:38:03  OUT_OF_MEMORY  0:125     129761143466  192096384K  cpu=99,mem=396000M,node=3
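For completeness, here is a minimal sketch of how memory could be polled while a job is still running, to see whether usage grows steadily over time (this assumes the standard SLURM sstat utility is available on the cluster; the job ID is the one from this run):

# Print per-step RSS every minute while the job runs
watch -n 60 'sstat -j 3847222 -o JobID,AveRSS,MaxRSS,MaxRSSNode'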
My run was allocated 108 CPUs and 432 GB of memory, which should be sufficient for the 12US1 domain. The administrator suggested there could be a memory leak somewhere that consumed more and more memory until the job crashed. I have used v5.3.3 for a lot of modeling before and never encountered this problem, so I believe the code itself should be fine.
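One way I could test the memory-leak hypothesis on a rerun is to wrap each MPI rank in GNU time, so that every rank logs its peak resident set size when it exits (a sketch only; it assumes GNU time is installed at /usr/bin/time, and the executable name is illustrative):

# Each rank reports "Maximum resident set size" on stderr at exit
mpirun -np 108 /usr/bin/time -v ./CCTM_v533.exe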
The only new thing in this EQUATES study is that I used the input files from the EPA shared Google Drive. Is it possible that the input files are causing this problem? If so, how can I debug it? Suggestions and comments are highly appreciated.
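In case the Google Drive downloads were truncated or corrupted, one quick sanity check I could run on each input file is to dump its netCDF header (a sketch assuming the netCDF utilities are installed; the file name is illustrative):

# A truncated or corrupt download will usually fail on the header read
ncdump -h GRIDCRO2D_20130102.nc | head -40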
Hao
CTM_LOG_083.v533_cb6r3_ae7_aq_WR413_MYR_STAGE_EQUATES_20130102.txt (300.2 KB)
slurm-3847222.out.txt (112.4 KB)