Different simulation results from CMAQ on the same machine with different numbers of CPU cores

Hello everyone. I ran CMAQv5.3.3 but encountered a problem: when using different numbers of CPU cores on the same machine, the CCTM simulation results for the same domain are quite different. The differences can be more than 10 ppb for O3 and 10 ug/m3 for PM2.5.
The machine I use is AMD EPYC 7713. The CPU information is listed below.
cat /proc/cpuinfo | grep "processor" | wc -l     (128)
cat /proc/cpuinfo | grep "cpu cores" | wc -l     (128)
cat /proc/cpuinfo | grep "cpu cores" | uniq      (cpu cores : 64)
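For completeness, lscpu summarizes the same information (sockets, cores per socket, threads per core) in one place. These are just the standard commands, assuming lscpu and numactl are available; I am not pasting their output here:

lscpu | grep -E "Socket|Core|Thread|Model name"   # sockets, cores per socket, threads per core
numactl --hardware                                # NUMA node layout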
The domain grid configurations are 182x138 (27 km), 98x74 (9 km), and 152x110 (3 km). The differences are larger in the 9 km and 3 km domain simulations.

When I use 64 or fewer CPU cores, the simulation results are identical, but the results differ when the number of CPU cores is larger than 64, such as 128, 96, or 88 (each of these gives a different result).

Has anyone encountered this situation before? Any suggestions on how to address this problem? Thank you in advance!

Please share the compiler version and compiler optimization flags you are using.

In recent testing of a CMAQv5.4+ version, I am not seeing differences when NPCOLxNPROW is varied.

However, we had found differences when doing testing for CMAQv5.3.3, and these were resolved by modifying the compiler options for the optimized version.

  1. Identify what version of the compiler you are using. Example:

gcc --version

Output:

gcc (GCC) 9.1.0

  2. Identify what compiler options you are using (a sketch for checking the flags in an existing build follows after this list):

In the config_cmaq.csh, examine the compiler options for your compiler.

For example, for the gcc compiler, the setting is under the case gcc section:

case gcc

setenv myFSTD "-O3 -funroll-loops -finit-character=32 -Wtabs -Wsurprising -ftree-vectorize -ftree-loop-if-convert -finline-limit=512"

In CMAQv5.3.3, we had identified that the option -march=native caused differences in the answers when the decomposition was changed: answers for NPCOLxNPROW = 8x4 were different from those for 4x8. That compiler option has since been removed from the default setting for CMAQv5.4. The CMAQv5.3.3 setting that included it was:

setenv myFSTD "-O3 -funroll-loops -finit-character=32 -Wtabs -Wsurprising -march=native -ftree-vectorize -ftree-loop-if-convert -finline-limit=512"

  3. If you are using a different compiler or set of compiler options, please let us know.
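If you want to confirm which flags your existing CCTM executable was actually built with, one option is to grep the generated Makefile in the CCTM build directory and then rebuild after editing config_cmaq.csh. A minimal sketch (the build-directory name here is illustrative; adjust it to your installation):

grep -n -- "-march=native" BLD_CCTM_v533_gcc/Makefile   # check whether the flag made it into the build
cd CCTM/scripts
./bldit_cctm.csh gcc |& tee bldit_cctm.log              # rebuild after editing config_cmaq.csh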

Thanks for your kind reply!

The compiler I used is the Intel compiler.
ifort -v  (ifort version 2021.7.1)
icc -v  (icc version 2021.7.1, gcc version 8.5.0 compatibility)

The compiler options are shown below. (Sorry, I cannot attach the file as a new user)

case intel:

   #> I/O API and netCDF root
   setenv NCDIR  /usr/local/netcdf4-intel
   setenv NFDIR  /usr/local/netcdf4-intel
   setenv NETCDF /usr/local/netcdf4-intel # Note only for  WRF-CMAQ as it requires combining the netcdf C and netcdf F into a single directory. CMAQ users - don't change this setting
   setenv IOAPI  /home/pathsys/Utils/ioapi_3.2_intel
   setenv WRF_ARCH 3                           # [1-75] Optional, ONLY for WRF-CMAQ 


    #> I/O API, netCDF, and MPI library locations
    setenv IOAPI_INCL_DIR   ${IOAPI}/ioapi/fixed_src    #> I/O API include header files
    setenv IOAPI_LIB_DIR    ${IOAPI}/Linux2_x86_64ifort #> I/O API libraries

    if ( $NETCDF == "/usr/local/netcdf4-intel" ) then
       setenv NETCDF_LIB_DIR   ${NCDIR}/lib                #> netCDF C directory path
       setenv NETCDF_INCL_DIR  ${NCDIR}/include            #> netCDF C directory path
       setenv NETCDFF_LIB_DIR  ${NFDIR}/lib                #> netCDF Fortran directory path
       setenv NETCDFF_INCL_DIR ${NFDIR}/include            #> netCDF Fortran directory path
    endif

    setenv MPI_INCL_DIR     /usr/local/intel-oneapi/mpi/latest/include              #> MPI Include directory path
    setenv MPI_LIB_DIR      /usr/local/intel-oneapi/mpi/latest/lib                  #> MPI Lib directory path

    #> Compiler Aliases and Flags
    #> set the compiler flag -qopt-report=5 to get a model optimization report in the build directory with the optrpt extension
    setenv myFC mpiifort
    setenv myCC icc       
    setenv myFSTD "-O3 -fno-alias -mp1 -fp-model source -ftz -simd -align all -vec-guard-write -unroll-aggressive"
    setenv myDBG  "-O0 -g -check bounds -check uninit -fpe0 -fno-alias -ftrapuv -traceback"
    setenv myLINK_FLAG     # -qopenmp # openMP may be required if I/O API was built using this link flag.
    setenv myFFLAGS "-fixed -132"
    setenv myFRFLAGS "-free"
    setenv myCFLAGS "-O2"
    setenv extra_lib ""

    breaksw
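
I have not checked the Intel MPI library version separately; if it is relevant, the standard version queries below should report it (just a generic sketch, nothing specific to my setup):

mpiifort -v        # MPI Fortran wrapper and the underlying ifort version
mpirun --version   # Intel MPI library version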

Sorry, I didn't describe the problem clearly before. I ran a triple-nested domain simulation, not a single domain with different NPCOLxNPROW decompositions. However, the results differ when I use different numbers of CPU cores.

Thanks again for your help!

Your description is rather vague, and it is difficult to tell exactly what you have done.
Running a triple-nested domain simulation is a lot of work (a script-form sketch follows the list):

  1. Run the CCTM for domain 1.
  2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2.
  3. Run the CCTM for domain 2.
  4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3.
  5. Run the CCTM for domain 3.
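
In script form (all script names below are placeholders, not actual CMAQ script names), that workflow looks roughly like this:

./run_cctm_d01.csh                         # (1) CCTM on domain 1
./run_icon_d02.csh; ./run_bcon_d02.csh     # (2) ICON/BCON for domain 2 from domain 1 output
./run_cctm_d02.csh                         # (3) CCTM on domain 2
./run_icon_d03.csh; ./run_bcon_d03.csh     # (4) ICON/BCON for domain 3 from domain 2 output
./run_cctm_d03.csh                         # (5) CCTM on domain 3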

Did you really do all steps (1) through (5) using differing numbers of processors? If so, did you compare the results of steps (1) through (4), or did you look only at the end results in step (5)?
Or did you perform the test of running on differing numbers of processors only for one of the above steps?

Hi Chris, thanks for your kind reply!

I ran a triple-nested domain simulation using 64 processors as follows. (Simulation64)

  1. Run the CCTM for domain 1. (64 processors) (27km,182x138)
  2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2. (1 processor)
  3. Run the CCTM for domain 2. (64 processors) (9km,98x74)
  4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3. (1 processor)
  5. Run the CCTM for domain 3. (64 processors) (3km,152x110)

Then I repeated the simulation using 128 processors. (Simulation128)

  1. Run the CCTM for domain 1. (128 processors)
  2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2. (1 processor)
  3. Run the CCTM for domain 2. (128 processors)
  4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3. (1 processor)
  5. Run the CCTM for domain 3. (128 processors)

I compared the results of all steps, and the results of every step differed between Simulation128 and Simulation64.
I also tested the simulation using 96, 88, 44, and 32 processors.

The results of Simulation64, Simulation44, and Simulation32 are identical, but
Simulation128 ≠ Simulation96 ≠ Simulation88 ≠ Simulation64.
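
In each case the processor count was set through the domain decomposition in the CCTM run script; the pattern is as below, though the NPCOL x NPROW values here are only illustrative, not necessarily the splits I used:

@ NPCOL = 8; @ NPROW = 8            # 64 processors
#@ NPCOL = 8; @ NPROW = 16          # 128 processors
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW"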

Thanks again for your help!

If the results from all steps have differences, then domain 2 and domain 3 are not relevant to this issue. Let’s just focus on the results from domain 1. You are obtaining significantly different results when you use more than 64 processors, but consistent results when using 64 or fewer processors, correct?

This is probably outside my area of competence, but let me ask a few other questions that may help others answer your question:
What is your system architecture, and how many processors per node do you have? (Is it 64?)
What compiler are you using, and what implementation of MPI?
Have you chosen any non-default CMAQ build options? In particular, what chemical mechanism and solver are you using?


Hi Chris,

You are obtaining significantly different results when you use more than 64 processors, but consistent results when using 64 or fewer processors, correct?
Yes.

The configuration of our system is shown below.

What is your system architecture?
CPUs: Dual AMD EPYC 7713 64-Core Processors
OS: Rocky Linux 8 (equivalent to Redhat 8)

How many processors per node do you have? (Is it 64?)
128 cores per node (64 cores per CPU, dual CPU).

Have you chosen any non-default CMAQ build options? In particular, what chemical mechanism and solver are you using?
I used the default build options. The mechanism and solver are cb6r3_ae7_aq and ebi, respectively.

Thanks for your help!

I am hoping that someone else with greater knowledge of high-performance computing will jump in here, but it seems to me that one possibility is that data between CPUs is getting corrupted, implying a fault in how CMAQ uses MPI. (Conceivably the underlying Intel implementation could be a problem. You might check what version of intel-oneapi you have, and see if there are updates or release notes indicating a problem with that version.)

You might try conducting two simulations (domain 1 only) in debug mode, first with 64 and then with 128 processors. These simulations will run much slower, but they may help diagnose the issue.
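
If you try that, the general procedure is to rebuild the CCTM with the debug flags and then rerun domain 1 twice. A rough sketch (if I recall correctly there is a Debug_CCTM switch in bldit_cctm.csh; the run script name here is illustrative):

# in CCTM/scripts/bldit_cctm.csh, uncomment the debug switch, e.g.
#   set Debug_CCTM      #> compile CCTM in debug mode
./bldit_cctm.csh intel |& tee bldit_cctm_debug.log

# run domain 1 with the debug executable, once with 64 and once with 128 processors
# (adjust NPCOL/NPROW in the run script between the two runs)
./run_cctm_d01.csh |& tee run_d01_debug_64pe.log
./run_cctm_d01.csh |& tee run_d01_debug_128pe.log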


Not sure if you have made any progress with this, but I agree with @cgnolte's points and suggestions to diagnose and optimize for your architecture.

Regarding setting up EPYC, it sounds from what you say that the machine is not in the cloud and that it has two CPUs.

I would, however, also follow @lizadams' suggestion and go through a setup similar to what has been tested on the cloud for EPYC (gnu compilers on Ubuntu). Intel compilers can be tricky with AMD processors, and their flags don't always work, so I wouldn't mess with that!
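
If you do go the gnu route, the rebuild itself is short once gcc-built netCDF, I/O API, and Open MPI are installed and the gcc section of config_cmaq.csh points to them; a minimal sketch:

gfortran --version
mpirun --version        # should report Open MPI (or MPICH), not Intel MPI
cd CCTM/scripts
./bldit_cctm.csh gcc |& tee bldit_cctm_gcc.log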
