CMAQ configuration for new AMD EPYC 7713 machines with 1 TB RAM

We are trying to compile CMAQ on new AMD machines (AMD EPYC 7713 with 1 TB RAM). This processor is unusual in that its architecture is a set of distributed chips (dies) connected by a central interconnect chip. CMAQ works fine for the benchmark case, but when we run our larger forecast runs (which completed fine on our old Intel machines) we get SIGBUS errors and the memory usage more than doubles.

As there are no default configuration options for AMD we are trying to figure them out ourselves. Does anyone have any experience with this and suggestions on flags to use or avoid?

Please try recompiling I/O API to use the medium memory model.

Documentation regarding this flag is here: Availability/Download of the BAMS/Models-3 I/O API

What compiler are you using? It might be beneficial to use your compiler’s processor-specific flags for this processor. For example, with a recent gcc/gfortran, the “lazy way” to do this is to use -march=native -mtune=native; with Intel compilers you might well need to use an explicit -xCORE-AVX2 -march=core-avx2 instead of -xHost (some versions of Intel compilers are suspected of not handling the latter correctly for non-Intel processors).

And of course use compatible flags for compiling the entire system – not just CMAQ but also I/O API, netCDF, …
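A minimal sketch of keeping one flag set for the whole stack, assuming gcc/gfortran and bash/sh syntax. The variable names (ARCH_FLAGS, MODEL_FLAGS, FFLAGS) and the -O2 level are placeholders for illustration, not settings taken from any CMAQ or I/O API Makefile; -mcmodel=medium is the gcc/gfortran spelling of the medium memory model.

```shell
# Processor-specific flags suggested above for a recent gcc/gfortran.
ARCH_FLAGS="-march=native -mtune=native"
# Medium memory model (gcc/gfortran spelling).
MODEL_FLAGS="-mcmodel=medium"
# Assumption: -O2 is only a placeholder optimisation level.
FFLAGS="-O2 ${ARCH_FLAGS} ${MODEL_FLAGS}"
echo "FFLAGS=${FFLAGS}"

# If gfortran is installed, show which -march value "native" resolves to here.
if command -v gfortran >/dev/null 2>&1; then
    gfortran -march=native -Q --help=target 2>/dev/null | grep -- '-march=' | head -n 1
fi
```

The point is to pass the same string to the netCDF, I/O API, and CMAQ builds rather than setting flags separately in each.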

Thank you for your suggestions. We have narrowed down the issue to the I/O API XTRACT3 function. On the Intel machine it worked fine and the overall run required at most 200 GB of RAM, whereas our latest runs max out our 1 TB of RAM and eventually fail with an out-of-memory error. The XTRACT3 call in centralized_io_module.F appears to be the culprit when loading our emission files (which are ~1.1 GB).

Further testing has, I believe, just shown that without loading the emission file it still maxes out the memory; with the emission file it just gets there quicker (within 1 minute instead of 5).

Please specify the version of CMAQ that you are using.

Have you tried the most recent version, CMAQv5.3.3?

What compiler? And what compile flags? Are they consistent across netCDF, I/O API, and CMAQ?

Hi kmonk,

If your system has a sufficient number of nodes, please try doubling the number of nodes and halving the number of cores per node in your batch job script.

By the way, how large is your simulation (ncols x nrows x nlays)? Did you change any parameters in the I/O API library when you recompiled it on the AMD system?

Cheers,
David

Hi kmonk,

Since this is a brand new system, certain system parameters, such as stacksize, are sometimes not set properly. The CMAQ run script contains the following line, but most of the time it can't override the system setting.

limit stacksize unlimited

So please check with your sys. admin to make sure stacksize is not set to a lower value somewhere.
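As a quick check before involving the sys. admin, something like the sketch below reports the limits the shell actually sees. It is written in bash/sh syntax; in csh/tcsh (which the CMAQ run scripts use) the equivalents are `limit stacksize` and `limit -h stacksize`. The variable names are only for illustration.

```shell
# Soft limit: what processes started from this shell actually get.
soft_stack=$(ulimit -S -s)
# Hard limit: the ceiling a non-root user cannot raise the soft limit above.
hard_stack=$(ulimit -H -s)
echo "soft stack limit: ${soft_stack}"
echo "hard stack limit: ${hard_stack}"

if [ "${soft_stack}" != "unlimited" ]; then
    echo "WARNING: stack size is limited to ${soft_stack} (kB), not unlimited"
fi
```

Running this inside the batch job itself, not just on the login node, is the more useful test, since the two environments can have different limits.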

Cheers,
David

Update on our progress. We are using CMAQv5.3.2 and it is compiled with gcc. Flags have been kept as consistent as possible, and we have also been testing with an Intel compiler.

We were sticking with running on a single node and utilizing the 128 CPUs available. We have tested it with multithreading, using up to 240 CPUs, but we got no improvement in performance due to the HADV step.

We have been using limit stacksize unlimited.

We managed to get the runs to work by converting our input files from NETCDF4 to either CDF5 (NETCDF3_64BIT_DATA), 64BIT_OFFSET or NETCDF4_CLASSIC. The system was not handling the compressed netCDF-4 files. We believe this issue has arisen because our new system does not cache the uncompressed file (and our previous system did).
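For anyone wanting to try the same conversion, a minimal sketch using the nccopy and ncdump utilities that ship with netCDF-C is below. This is not necessarily how the files above were converted; the file names are hypothetical placeholders, and it assumes a netCDF-C build recent enough to support CDF5.

```shell
# Hypothetical file names; substitute your own emission file.
in_file="emis_in.nc"
out_file="emis_cdf5.nc"

if command -v nccopy >/dev/null 2>&1 && [ -f "${in_file}" ]; then
    # Print the format kind of the input file.
    ncdump -k "${in_file}"
    # Rewrite as CDF5 (NETCDF3_64BIT_DATA), an uncompressed classic-style format.
    # Other kinds: "-k nc6" for 64-bit offset, "-k nc7" for netCDF-4 classic model.
    nccopy -k cdf5 "${in_file}" "${out_file}"
    # Confirm the format kind of the converted file.
    ncdump -k "${out_file}"
else
    echo "nccopy or ${in_file} not found; skipping conversion"
fi
```

The converted files are larger on disk because the data are no longer compressed.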
