CMAQ memory problem

I am running CMAQ v5.3.3 with the EQUATES 2013 case. The model ran for around 6-7 hours and was then killed. Please find the Slurm log and compute-node log files attached.

I contacted our supercomputer administrator for help and got the following information about the job.

login-1.zaratan.umd.edu{haohe}1403: sacct -j 3847222 -o NodeList%20,Start,Elapsed,State%16,ExitCode,AveRSS%16,MaxRSS,AllocTRES%40
NodeList              Start                Elapsed   State          ExitCode  AveRSS        MaxRSS      AllocTRES
compute-a8-[25-26,3+  2023-10-14T21:55:39  00:38:04  COMPLETED      0:0                                 billing=108,cpu=108,mem=432000M,node=4
compute-a8-25         2023-10-14T21:55:39  00:38:04  COMPLETED      0:0       31683384K     31683384K   cpu=9,mem=36000M,node=1
compute-a8-[25-26,3+  2023-10-14T21:55:39  00:38:04  COMPLETED      0:0       1798K         1816K       billing=108,cpu=108,mem=432000M,node=4
compute-a8-[26,31-3+  2023-10-14T21:55:40  00:38:03  OUT_OF_MEMORY  0:125     129761143466  192096384K  cpu=99,mem=396000M,node=3

My run with 108 CPUs was allocated 432 GB of memory, which should be sufficient for the 12US1 domain. He suggested that there could be a memory leak somewhere that consumed more and more memory until the job crashed. I have used v5.3.3 for a lot of modeling before and have not encountered this problem, so I believe the code should be fine.

The only new thing is that for this EQUATES study, I used the input files from the EPA shared Google Drive. Is it possible that the input files cause this problem? If so, how can I debug it? Suggestions and comments are highly appreciated.

Hao

CTM_LOG_083.v533_cb6r3_ae7_aq_WR413_MYR_STAGE_EQUATES_20130102.txt (300.2 KB)
slurm-3847222.out.txt (112.4 KB)

This is likely the netCDF-4 memory leak issue. The solution is to convert the EQUATES netCDF-4 files to netCDF-3 classic files using nccopy.

Please see:
Example instruction to convert EQUATES input data from nc4 to nc3
This nccopy conversion requires a netCDF build that supports netCDF-4 compression.
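As a rough sketch of what the conversion step could look like for a whole directory of inputs: the function name `convert_dir`, the `NCCOPY` override variable, and the `*.nc` file pattern are assumptions made for illustration, and the linked instructions remain the authoritative recipe. It relies on the `-k classic` option of nccopy to select the netCDF-3 classic format.

```shell
#!/bin/sh
# Hypothetical helper: copy every *.nc file in IN_DIR to OUT_DIR as netCDF-3 classic.
# NCCOPY may point at the nccopy from the netCDF-4-enabled build (assumed variable name).
convert_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.nc; do
        # Skip the unexpanded pattern when the directory has no matching files.
        [ -e "$f" ] || continue
        # -k classic asks nccopy to write the output in netCDF-3 classic format.
        "${NCCOPY:-nccopy}" -k classic "$f" "$out_dir/$(basename "$f")" || return 1
    done
}
```

It could be called as `convert_dir /path/to/equates_nc4 /path/to/equates_nc3`, where both paths are placeholders. Writing to a separate output directory keeps the original downloads untouched in case a conversion fails partway.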

Then use the I/O API and netCDF libraries that were built with netCDF-4 disabled (for example, netCDF configured with `./configure --prefix=$cwd/…/netcdf --disable-netcdf-4 --disable-dap`) to run CMAQ using the netCDF-3 classic input files.
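Before launching CMAQ, it may help to confirm that no netCDF-4 files are left among the inputs. The sketch below uses `ncdump -k`, which prints the format kind of a file (for example `classic` or `netCDF-4`). The function name `list_non_classic` and the `NCDUMP` override variable are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical check: print every file in DIR whose format is not netCDF-3 classic.
# NCDUMP may point at a specific ncdump binary (assumed variable name).
list_non_classic() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # ncdump -k reports the file kind; treat failures as unreadable files.
        kind=$("${NCDUMP:-ncdump}" -k "$f" 2>/dev/null) || kind="unreadable"
        [ "$kind" = "classic" ] || printf '%s\t%s\n' "$f" "$kind"
    done
}
```

An empty result from `list_non_classic /path/to/inputs` (placeholder path) would suggest that every file is in classic format. Note that an ncdump from a build without netCDF-4 support will likely fail on netCDF-4 files, so they would show up as `unreadable` here.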

To manage multiple versions of netCDF builds, please consider building and using custom Environment Modules.

This method works, thanks!