Hi all,
I’m using the WRFv4.5-CMAQv5.4 coupled model to conduct some sensitivity experiments. I ran the same case twice with 28 CPUs (NPROCS = 28) and found that the results of the two simulations started to deviate in the second hour, and the deviation grew with the length of the simulation. I used exactly the same namelist for both runs, and the microphysics scheme should not contain any random variables.
When I ran it with only 1 CPU (NPROCS = 1), the results of the two runs were exactly the same. Someone reported that this is related to the randomness of the communication order between CPUs.
Is there any way to eliminate this deviation while keeping the efficiency high? Running with only one CPU takes a very long time. If you are familiar with this issue, please give me some suggestions. Thanks.
namelist.input.txt (7.2 KB)
You can see that there are deviations in some areas, and these areas are usually the ones with more cloud water (QCLOUD).
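For reference, this is roughly how I compare the two runs to locate the divergence. It is only a minimal sketch: the file paths are placeholders, and it assumes the netCDF4 and numpy Python packages are available.

```python
# Compare QCLOUD between two otherwise-identical runs and report
# the largest absolute difference at each output time.
import numpy as np
from netCDF4 import Dataset

# Placeholder paths to the two wrfout files from the repeated runs
run_a = Dataset("run_a/wrfout_d01_2019-09-01_00:00:00")
run_b = Dataset("run_b/wrfout_d01_2019-09-01_00:00:00")

qc_a = run_a.variables["QCLOUD"][:]  # (Time, bottom_top, south_north, west_east)
qc_b = run_b.variables["QCLOUD"][:]

for t in range(qc_a.shape[0]):
    diff = np.abs(qc_a[t] - qc_b[t])
    print(f"time index {t}: max |dQCLOUD| = {diff.max():.3e}")
```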
Liu
It is my experience that the initiation of convection is very sensitive to round-off error effects; this is an inherent problem (not an artifact of the modeling), because the initiation of convection in real life is likewise very sensitive. As soon as you get different initiation-times for two different simulations, those simulations will diverge ;-(
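A toy illustration of the round-off point (plain Python, nothing WRF-specific): floating-point addition is not associative, so summing the same numbers in a different order, which is effectively what a different decomposition or communication order does, can change the last few bits of the result, and convective initiation then amplifies that tiny difference.

```python
# Summing the same numbers in a different order need not give a
# bit-identical result in floating-point arithmetic.
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100000)]

s_forward = sum(values)            # one summation order
s_reverse = sum(reversed(values))  # same numbers, opposite order

print(s_forward == s_reverse)      # usually False for a long list
print(abs(s_forward - s_reverse))  # small but typically nonzero
```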
Hi Carlie,
Can this problem be solved if the initiation times of the two simulations are the same? For example, the start time in the &time_control section of namelist.input is 2019.09.01 00:00:00 for both simulations, and the initial and boundary conditions are identical.
Liu
The problem is convective initiation, not model initialization. As soon as there is the possibility of numerical roundoff effects (which is, basically, always), then convective initiation happens at different times and the simulations diverge.
I see, thanks for your reply.
Another possible source of differences between your simulations is that the processors differ, either in their hardware or in their library or compiler versions.
Yes, I did take this into account. In both simulations I used the same computing nodes and the same number of CPUs, and I did not change the compiler version, but the divergence was still there. Therefore, I suspect the divergence is caused by the computation and communication among multiple CPUs.
Hi Fortuna,
Couple things you can try:
- run the model with option 0 (WRF only) twice and see if there is any difference in the results
- run the model twice with nproc_x and nproc_y set to 4 and 7, respectively, in the &domains section of the namelist, and compare the results
Cheers,
David
Hi David,
Sorry for my late reply. When I set wrf_cmaq_option = 0, there is still a divergence between the results of the two runs, and after I set nproc_x and nproc_y as you suggested, the situation did not improve.
In addition, I found that with NPROCS = 3 the divergence disappears. I can understand why runs with only 1 or 2 CPUs are reproducible, but somehow 3 CPUs also work.
Liu
Hi Fortuna,
With NPROCS = 3 it works, and that is quite an interesting fact. I wonder how many cores there are in a node. What kind of inter-processor interconnect is on the system?
Cheers,
David