How to keep the results of two simulation runs completely consistent?

Hi all,

I’m using the WRF v4.5-CMAQ v5.4 coupled model to conduct some sensitivity experiments. I ran the simulation on 28 CPUs (NPROCS = 28) and found that the results of two identical runs began to deviate at the second hour, and the deviation grew as the simulation went on. Both runs used exactly the same namelist, and the microphysics scheme should not contain any random variables.

When I ran it with a single CPU (NPROCS = 1), the results of the two runs were exactly the same. Someone reported that this is related to the non-deterministic order of communication between CPUs.

Is there any way to eliminate this deviation while keeping the run efficient? Running on a single CPU takes far too long. If you are familiar with this issue, please give me some suggestions. Thanks.
namelist.input.txt (7.2 KB)

You can see deviations in some areas, usually those with more cloud water (QCLOUD).
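
In case it helps anyone reproduce the comparison, one way to quantify the difference in a single field such as QCLOUD is with the NCO tools; the directory names run1/ and run2/ and the time stamp below are just placeholders for the two runs:

    # Subtract QCLOUD of the two runs' wrfout files for the same valid time (requires NCO)
    ncdiff -v QCLOUD run1/wrfout_d01_2019-09-01_02:00:00 run2/wrfout_d01_2019-09-01_02:00:00 diff.nc

    # Reduce the QCLOUD difference to a single RMS value and print it
    ncwa -y rms -v QCLOUD diff.nc qcloud_rms.nc
    ncdump -v QCLOUD qcloud_rms.nc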

Liu

It is my experience that the initiation of convection is very sensitive to round-off error effects; this is an inherent problem (not an artifact of the modeling), because the initiation of convection in real life is likewise very sensitive. As soon as you get different initiation-times for two different simulations, those simulations will diverge ;-(

Hi Carlie

Can this problem be solved if the initiation times of the two simulations are the same? For example, the start time in &time_control of namelist.input is 2019.09.01 00:00:00 for both simulations, and the initial and boundary conditions remain the same.

Liu

The problem is convective initiation, not model initialization. As soon as there is the possibility of numerical roundoff effects (which is, basically, always), then convective initiation happens at different times and the simulations diverge.

I see, thanks for your reply.

Another possible source of differences between your simulations is that some of your processors may differ, either in hardware or in library or compiler versions.

Yes, I did take this into account. Both simulations used the same computing nodes, the same number of CPUs, and the same compiler version, but the divergence still existed. Therefore, I suspect the divergence is caused by the calculation and communication order among multiple CPUs.

Hi Fortuna,

Couple things you can try:

  1. run the model with wrf_cmaq_option = 0 (WRF only) twice and see whether there is any difference in the results
  2. run the model twice with nproc_x and nproc_y set to 4 and 7, respectively, in the &domains section of the namelist (see the sketch below), and compare the results
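
For item 2, only two variables in the &domains section of namelist.input need to change; nproc_x is the number of MPI tasks in the west-east direction and nproc_y the number in the south-north direction (4 x 7 matches your 28 processors). Roughly:

    &domains
     nproc_x = 4,
     nproc_y = 7,
    /

Everything else in your &domains section stays as it is.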

Cheers,
David

Hi David,

Sorry for my late reply. When I set wrf_cmaq_option = 0, there was still a divergence between the results of the two runs, and setting nproc_x and nproc_y as you suggested did not improve the situation.
In addition, I find that with NPROCS = 3 the divergence disappears. I can understand why only 1 or 2 CPUs give identical results, but somehow 3 CPUs also work.

Liu

Hi Fortuna,

It works with NPROCS = 3, which is quite interesting. How many cores are there in a node? What kind of inter-processor interconnect network does the system have?

Cheers,
David

Hi David,

After rechecking, the divergence still exists when NPROCS = 3; it just appears later in the run.

The node I use contains 28 CPUs. I’m not quite familiar with the inter-processor interconnect network you mentioned; do you mean NUMA? The lscpu command shows 2 NUMA nodes.

In addition, I tested multi-core runs with wrf_cmaq_option = 0. With feedback = T, the wrfout files diverge. With feedback = F, the divergence appears slightly later, as shown in the figure. At the next output time it expands to cover almost the entire domain. (Please ignore the inaccurate colorbar; I just want to show how the divergence develops.)


Liu

Hi Liu,

At the beginning you said NPROCS = 3 was fine, but now it is not, so only NPROCS = 1 or 2 gives consistent results. If that is still true, could you please do another test: run with NPROCS = 2 across 2 nodes, specifying 1 CPU per node. This forces the processor communication to go through the interconnect network, so we can see whether the issue is caused by latency.
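
With MPICH's mpiexec, one way to do this is a machine file that lists one slot per node; the node names below are placeholders for your system:

    # contents of a machine file, e.g. "hosts" (format hostname:slots)
    #   node01:1
    #   node02:1
    # then launch 2 ranks, one per node, through the interconnect:
    mpiexec -f hosts -n 2 ${OUTDIR}/wrf.exe

(This assumes your mpiexec is the MPICH/Hydra one that accepts -f; if you launch through a batch scheduler, its own node/task options would achieve the same thing.)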

Hi David,

Over the past few days I have been retesting with 1 CPU and 2 CPUs on the same node. With 2 CPUs on the same node, the results for the first simulated day were exactly the same, but a divergence appeared at the first hour of the second day. The 1-CPU case is still being tested.

Next, I will test NPROCS = 2 across 2 nodes, as you suggested.

Liu

Hi David,

I specified 1 CPU per node (“nohup mpiexec -f hostid2 -n 2 ${OUTDIR}/wrf.exe” in runscript.csh).
hostid2

The simulation results for the first day were the same, but the divergence appeared at the first TSTEP of the second day. I checked the wrfrst and CGRID files, and they were identical. I therefore manually restarted the second day and found that the wrfout divergence disappeared; however, ACONC differed at the last output time (TSTEP = 24).

This divergence seems to occur randomly and grows as the simulation continues. I have no idea why ACONC diverges only at the last time step when the wrfout files are the same, because in the first day's simulation ACONC was identical at every TSTEP, and the wrfrst and CGRID files used to restart the second day were also identical.

Does the WRF-CMAQ coupled model have floating-point consistency options, or compile-time or run-time settings that force multiple cores to compute in the same order? The computing efficiency with only 1 or 2 CPUs is quite low.

Liu

Hi Liu,

Thanks for doing all these tests. When you ran the code across multiple nodes, I expected to see a difference (due to latency across the interconnect network), but it did not appear until the second day. This might mean there is latency in file access between the nodes. I strongly recommend checking with your system folks to determine the cause.

As for your question, “Does the WRF-CMAQ coupled model have floating-point consistency options, or compile-time or run-time settings that force multiple cores to compute in the same order?”: no, it does not. If you have time, please determine whether this issue is model dependent. In other words, run WRF alone and run CMAQ alone (the offline model) to see whether the problem occurs.

Hi David,

Finally, I solved the problem. I originally used the PGI compiler; after I recompiled the coupled model with the Intel compiler, everything worked fine. I ran a 72-hour simulation with shortwave radiation feedback turned on, and no divergence occurred during that period.

My PGI version is 17.10-0 and my Intel version is 18.0.1 20170928. I would like to know why different compilers produce different results.
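
I wonder whether the compilers' default floating-point optimizations are part of the answer. For what it's worth, both compilers have flags that request stricter, more reproducible floating-point behavior; this is only a sketch, and I have not tested whether the coupled model's build scripts pass them through:

    # Intel (ifort/icc): allow only value-safe floating-point optimizations
    -fp-model precise

    # PGI (pgf90/pgcc): strict IEEE 754 floating-point arithmetic
    -Kieee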

Liu

Look, you’re asking for a pipe-dream.

REAL arithmetic is inherently subject to round-off errors, in this case notably from arranging the arithmetic to work in terms of the operations the hardware provides.

Let us consider a very simple example: compute Z = A*X + B*Y where hardware provides the following operations:

  • ADD(P,Q) computes P+Q
  • MUL(U,V) computes U*V
  • MADD(R,S,T) computes R*S+T, using internal precision for the intermediate product term

There are three plausible ways to decompose A*X + B*Y into these operations:

  1. T1=MUL(A,X); T2=MUL(B,Y); Z=ADD(T1,T2)
  2. T1=MUL(A,X); Z=MADD(B,Y,T1)
  3. T2=MUL(B,Y); Z=MADD(A,X,T2)

Because of the way round-off works, all three of these may generate different results.
Get over it. When you see different results the way you are, treat them as reflecting the inherent inadequacies of the underlying model, not as something that needs to be “fixed.” The underlying model is an approximation, not reality.
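
If you want to see this effect concretely, here is a small, purely illustrative C program (nothing to do with the model code) that evaluates A*X + B*Y with decomposition 1 (MUL, MUL, ADD) and decomposition 3 (MUL, MADD via fma), plus a simple change of summation order, in IEEE double precision:

    /* Illustrative only: mathematically equivalent decompositions of
     * A*X + B*Y can round differently in IEEE double precision.
     * Compile with contraction disabled so the compiler does not rewrite
     * the expressions itself, e.g.: gcc -std=c99 -ffp-contract=off demo.c -lm
     */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + ldexp(1.0, -27);  /* A = X = 1 + 2^-27 */
        double x = a;
        double b = -1.0, y = 1.0;          /* B*Y = -1, exactly representable */

        /* Decomposition 1: T1=MUL(A,X); T2=MUL(B,Y); Z=ADD(T1,T2)
         * The product A*X is rounded to double before the addition. */
        double z1 = (a * x) + (b * y);

        /* Decomposition 3: T2=MUL(B,Y); Z=MADD(A,X,T2)
         * fma() keeps the product A*X at full internal precision. */
        double z2 = fma(a, x, b * y);

        printf("MUL,MUL,ADD : %.17e\n", z1);
        printf("MUL,MADD    : %.17e\n", z2);
        printf("difference  : %.3e\n", z2 - z1);

        /* The same kind of effect from merely changing summation order. */
        printf("(0.1+0.2)+0.3 = %.17g\n", (0.1 + 0.2) + 0.3);
        printf("0.1+(0.2+0.3) = %.17g\n", 0.1 + (0.2 + 0.3));
        return 0;
    }

The two decompositions differ by roughly 6e-17, and the two sums print as 0.60000000000000009 and 0.59999999999999998. Differences of exactly this size, injected into a field like QCLOUD, are what eventually flip the timing of convective initiation.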

Hi Liu,

That is good news. When you used the PGI compiler, which MPICH library did you use, and was it compiled with PGI or with the Intel compiler? If it was compiled with PGI, you need to let the system folks know.