I am currently using CMAQv5.4 and have run into an issue involving the synchronization time step and the number of processors.
I am running a 15 km grid simulation (180 x 210) with DDM and ISAM disabled, just the base CMAQ model. My processor configuration is 2x96. With CTM_MAXSYNC set to 720 and CTM_MINSYNC set to 60, the simulation stops and the job is killed. However, when I reduce the processor configuration to 2x90, the simulation runs smoothly.
I am curious about the relationship between the synchronization time step and the number of processors. Could you please help me understand why this might be happening? Also, does the grid size affect the time step settings much?
Note that the domain decomposition should be chosen carefully. In your case, each processor would be responsible for roughly 90 columns x 2-3 rows. Is there a reason you did it this way, instead of balancing the rows and columns per processor?
Also, when you say it “fails,” are there any messages in the main and ancillary log files (usually found at the very bottom)?
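To make that arithmetic concrete, here is a minimal sketch of how the per-processor subdomain size falls out of the 2x96 layout. It assumes the 180 x 210 grid is columns x rows and that cells are split as evenly as possible in each direction; the function name is hypothetical, and the exact way CMAQ assigns leftover cells may differ.

```python
def cells_per_processor(n_cells, n_procs):
    """Return (smallest, largest) number of cells any processor gets
    when n_cells are divided as evenly as possible among n_procs."""
    base, extra = divmod(n_cells, n_procs)
    # If the division is not exact, some processors get one extra cell.
    return (base, base + 1) if extra else (base, base)

# Grid from the original post (assumed to be columns x rows).
ncols, nrows = 180, 210
# The 2x96 layout: 2 processors across columns, 96 across rows.
npcol, nprow = 2, 96

print(cells_per_processor(ncols, npcol))  # columns per processor
print(cells_per_processor(nrows, nprow))  # rows per processor
```

Under these assumptions each subdomain is a strip 90 columns wide and only 2 or 3 rows tall, which is the imbalance described above.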
Thank you very much for your reply. My workstation has two compute nodes, each with 96 cores, so I set the configuration to 2x96. Is there a better way to balance the number of rows and columns per processor while keeping good computational efficiency?
Additionally, the error message primarily states “cores killed,” without providing more detailed error information.
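One rough way to explore more balanced layouts for the same 192 cores is to list every factor pair of the core count and rank the pairs by how close to square the resulting subdomains are. The sketch below is only a heuristic under stated assumptions, not an official CMAQ recommendation: the function name and the squareness metric are hypothetical, the grid is assumed to be 180 columns x 210 rows, and it ignores other factors such as load balance and I/O.

```python
def rank_decompositions(ncols, nrows, nprocs):
    """Return (npcol, nprow) pairs whose product is nprocs,
    ordered from most square subdomain to least square."""
    candidates = []
    for npcol in range(1, nprocs + 1):
        if nprocs % npcol:
            continue  # npcol must divide the total processor count
        nprow = nprocs // npcol
        sub_cols = ncols / npcol  # average columns per processor
        sub_rows = nrows / nprow  # average rows per processor
        # Aspect ratio >= 1; a value of 1.0 means a perfectly square subdomain.
        aspect = max(sub_cols / sub_rows, sub_rows / sub_cols)
        candidates.append((aspect, npcol, nprow))
    candidates.sort()
    return [(npcol, nprow) for _, npcol, nprow in candidates]

# Two nodes x 96 cores = 192 processes on a 180 x 210 grid.
for npcol, nprow in rank_decompositions(180, 210, 192)[:3]:
    print(npcol, "x", nprow)
```

With this metric the top candidate is 12 x 16 (about 15 columns by 13 rows per processor), while 2 x 96 ranks near the bottom. The total of 192 processes is unchanged; only the shape of the decomposition differs.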
This method is indeed very helpful. I can now successfully use all the nodes on my workstation for simulations.
I had always mistakenly assumed that the domain decomposition had to match the numbers set in the #PBS directives exactly, but it turns out it should be chosen according to the number of grid cells.
Thank you very much for your detailed response, and I hope others who encounter this issue in the future can refer to your reply.