Hello everyone. I ran CMAQv5.3.3 and encountered a problem. When I use different numbers of CPU cores on the same machine, the CCTM simulation results for the same domain are quite different: the differences can exceed 10 ppb for O3 and 10 ug/m3 for PM2.5.
The machine I use is AMD EPYC 7713. The CPU information is listed below.
cat /proc/cpuinfo | grep "processor" | wc -l (128)
cat /proc/cpuinfo | grep "cpu cores" | wc -l (128)
cat /proc/cpuinfo | grep "cpu cores" | uniq (cpu cores : 64)
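For reference, lscpu shows in one step whether the 128 logical processors come from two 64-core sockets or from SMT on a single socket:
lscpu | grep -E "Socket|Core|Thread"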
The domain grids are 182x138 (27 km), 98x74 (9 km), and 152x110 (3 km). The differences are larger in the 9 km and 3 km simulations.
When I use 64 or fewer CPU cores, the simulation results are identical, but each simulation gives a different result when the number of CPU cores is greater than 64 (e.g., 128, 96, or 88).
Has anyone encountered this situation before? Any suggestions on how to address it? Thank you in advance!
Please share the compiler version and compiler optimization flags you are using.
In recent testing of a CMAQv5.4+ version, I am not seeing differences when NPCOLxNPROW is varied.
However, we had found differences when doing testing for CMAQv5.3.3, and these were resolved by modifying the compiler options for the optimized version.
Identify what version of the compiler you are using.
Example:
gcc --version
Output:
gcc (GCC) 9.1.0
Identify what compiler options you are using:
In the config_cmaq.csh, examine the compiler options for your compiler.
Example for the gcc compiler, the setting is under the case gcc section:
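For illustration, the excerpt below is roughly what the gcc section of a CMAQv5.3-era config_cmaq.csh contains (the exact flag list varies by release, so check your own copy; note the -march=native flag discussed next):
case gcc:
#> Compiler Aliases and Flags
setenv myFC mpifort
setenv myCC gcc
setenv myFSTD "-O3 -funroll-loops -finit-character=32 -Wtabs -Wsurprising -march=native -ftree-vectorize -ftree-loop-if-convert -finline-limit=512"
setenv myDBG "-Wall -O0 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace"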
In CMAQv5.3.3, we had identified that the -march=native option caused differences in the answers when the NPCOLxNPROW decomposition was changed (e.g., answers for 8x4 differed from 4x8), so that compiler option has since been removed from the default settings for CMAQv5.4.
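If you are using the Intel compiler rather than gcc, the analogous settings to examine are the floating-point model flags in myFSTD. As an illustration only (an assumption to test on your system, not a documented CMAQ fix), adding the Intel options that pin down math-library and FMA behavior looks like this:
setenv myFSTD "-O3 -fp-model precise -fimf-arch-consistency=true -no-fma -ftz -align all" #> illustrative; keep your other optimization flags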
The compiler I used is the Intel compiler.
ifort -v ( ifort version 2021.7.1 )
icc -v ( icc version 2021.7.1 (gcc version 8.5.0 compatibility) )
The compiler options are shown below. (Sorry, I cannot attach the file as a new user)
case intel:
#> I/O API and netCDF root
setenv NCDIR /usr/local/netcdf4-intel
setenv NFDIR /usr/local/netcdf4-intel
setenv NETCDF /usr/local/netcdf4-intel # Note only for WRF-CMAQ as it requires combining the netcdf C and netcdf F into a single directory. CMAQ users - don't change this setting
setenv IOAPI /home/pathsys/Utils/ioapi_3.2_intel
setenv WRF_ARCH 3 # [1-75] Optional, ONLY for WRF-CMAQ
#> I/O API, netCDF, and MPI library locations
setenv IOAPI_INCL_DIR ${IOAPI}/ioapi/fixed_src #> I/O API include header files
setenv IOAPI_LIB_DIR ${IOAPI}/Linux2_x86_64ifort #> I/O API libraries
if ( $NETCDF == "/usr/local/netcdf4-intel" ) then
setenv NETCDF_LIB_DIR ${NCDIR}/lib #> netCDF C directory path
setenv NETCDF_INCL_DIR ${NCDIR}/include #> netCDF C directory path
setenv NETCDFF_LIB_DIR ${NFDIR}/lib #> netCDF Fortran directory path
setenv NETCDFF_INCL_DIR ${NFDIR}/include #> netCDF Fortran directory path
endif
setenv MPI_INCL_DIR /usr/local/intel-oneapi/mpi/latest/include #> MPI Include directory path
setenv MPI_LIB_DIR /usr/local/intel-oneapi/mpi/latest/lib #> MPI Lib directory path
#> Compiler Aliases and Flags
#> set the compiler flag -qopt-report=5 to get a model optimization report in the build directory with the optrpt extension
setenv myFC mpiifort
setenv myCC icc
setenv myFSTD "-O3 -fno-alias -mp1 -fp-model source -ftz -simd -align all -vec-guard-write -unroll-aggressive"
setenv myDBG "-O0 -g -check bounds -check uninit -fpe0 -fno-alias -ftrapuv -traceback"
setenv myLINK_FLAG # -qopenmp # openMP may be required if I/O API was built using this link flag.
setenv myFFLAGS "-fixed -132"
setenv myFRFLAGS "-free"
setenv myCFLAGS "-O2"
setenv extra_lib ""
breaksw
Sorry, I didn't describe the problem clearly before. I ran a triple-nested domain simulation rather than a single domain with different NPCOLxNPROW settings. However, the results differ when I use different numbers of CPU cores.
Your description is rather vague, and it is difficult to tell exactly what you have done.
Running a triple-nested domain simulation is a lot of work:
1. Run the CCTM for domain 1.
2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2.
3. Run the CCTM for domain 2.
4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3.
5. Run the CCTM for domain 3.
Did you really do all steps (1) through (5) using differing numbers of processors? If so, did you compare the results of steps (1) through (4), or did you look only at the end results in step (5)?
Or did you perform the test of running on differing numbers of processors only for one of the above steps?
I ran a triple-nested domain simulation using 64 processors as follows. (Simulation64)
1. Run the CCTM for domain 1 (64 processors; 27 km, 182x138).
2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2 (1 processor).
3. Run the CCTM for domain 2 (64 processors; 9 km, 98x74).
4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3 (1 processor).
5. Run the CCTM for domain 3 (64 processors; 3 km, 152x110).
Then, I re-conducted the simulation using 128 processors. (Simulation128)
1. Run the CCTM for domain 1 (128 processors).
2. Use the outputs from domain 1 to run BCON and ICON to generate inputs for domain 2 (1 processor).
3. Run the CCTM for domain 2 (128 processors).
4. Use the outputs from domain 2 to run BCON and ICON to generate inputs for domain 3 (1 processor).
5. Run the CCTM for domain 3 (128 processors).
I compared the results of all steps, and every step showed differences between Simulation128 and Simulation64.
I also tested the simulation using 96, 88, 44, and 32 processors.
The results of Simulation64, Simulation44, and Simulation32 are identical, but
Simulation128 ≠ Simulation96 ≠ Simulation88 ≠ Simulation64.
If the results from all steps have differences, then domain 2 and domain 3 are not relevant to this issue. Let’s just focus on the results from domain 1. You are obtaining significantly different results when you use more than 64 processors, but consistent results when using 64 or fewer processors, correct?
This is probably outside my area of competence, but let me ask a few other questions that may help others answer your question:
What is your system architecture, how many processors per node do you have? (Is it 64?)
What compiler are you using, and what implementation of MPI?
Have you chosen any non-default CMAQ build options? In particular, what chemical mechanism and solver are you using?
You are obtaining significantly different results when you use more than 64 processors, but consistent results when using 64 or fewer processors, correct?
Yes.
The configuration of our system is shown below.
What is your system architecture?
CPUs: Dual AMD EPYC 7713 64-Core Processors
OS: Rocky Linux 8 (equivalent to Redhat 8)
How many processors per node do you have? (Is it 64?)
128 cores per node (64 cores per CPU, two CPUs)
Have you chosen any non-default CMAQ build options? In particular, what chemical mechanism and solver are you using?
I used the default build options. The mechanism and solver are cb6r3_ae7_aq and ebi, respectively.
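For reference, those choices correspond approximately to the default build-script settings (check your bldit_cctm.csh for the exact lines):
set Mechanism = cb6r3_ae7_aq #> chemical mechanism
set ModGas = gas/ebi_${Mechanism} #> EBI gas-phase chemistry solver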
I am hoping that someone else with greater knowledge of high-performance computing will jump in here, but it seems to me that one possibility is that data between CPUs is getting corrupted, implying a fault in how CMAQ uses MPI. (Conceivably the underlying Intel implementation could be a problem. You might check what version of intel-oneapi you have, and see if there are updates or release notes indicating a problem with that version.)
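For example, with the oneAPI environment loaded and Intel MPI as the MPI in use, the library and compiler report their versions with:
mpirun --version
mpiifort -v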
You might try conducting two simulations (domain 1 only) in debug mode, first with 64 and then with 128 processors. These simulations will run much slower, but they may help diagnose the issue.
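A minimal sketch of that test, assuming the standard variable names in the released bldit_cctm.csh and run_cctm.csh scripts (verify them in your copies):
#> in bldit_cctm.csh: uncomment the debug toggle and rebuild to get a debug executable
set Debug_CCTM
#> in run_cctm.csh: run the same domain 1 day twice, changing only the decomposition
@ NPCOL = 8; @ NPROW = 8 #> 64 processors
#@ NPCOL = 16; @ NPROW = 8 #> 128 processors for the second run
Then compare the two CONC outputs cell by cell (for example with the I/O API m3diff tool) to see where differences first appear.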
Not sure if you have made any progress with this, but I agree with @cgnolte's points and suggestions to diagnose and optimize for your architecture.
Regarding setting up EPYC: from what you say, it sounds like the machine is not in the cloud and has two CPUs.
I would, however, also follow @lizadams' suggestion and go through a setup similar to what has been tested on the cloud for EPYC (GNU compilers on Ubuntu). Intel compilers can be tricky with AMD processors, and their flags don't always work as expected, so I wouldn't mess with that!
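As a minimal sketch of that route, assuming GNU compilers plus an MPI stack are installed and the gcc section of config_cmaq.csh points to netCDF/I/O API libraries built with the same toolchain:
cd ${CMAQ_HOME}/CCTM/scripts
./bldit_cctm.csh gcc |& tee bldit_cctm_gcc.log #> rebuild the CCTM with the gcc settings
Then rerun the domain 1 case with the new executable at 64 and 128 processors and check whether the differences persist.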