Synchronization time step and processor number problem

Hi all,

I am currently using CMAQv5.4 and have encountered an issue regarding the synchronization time step and processor number.

I am attempting to run a 15 km grid simulation (180 x 210) without enabling DDM and ISAM, just running the basic CMAQ model. My processor configuration is 2x96. I found that when setting CTM_MAXSYNC to 720 and CTM_MINSYNC to 60, the simulation fails to continue and gets killed. However, when I reduce the processor number to 2x90, the simulation runs smoothly.

I am curious about the relationship between the synchronization time step and the number of processors. Could you please help me understand why this might be happening? Also, does the grid size affect the time step settings much?

Thank you for your attention

Best regards.

@bb404bb,

Could you post your entire run script here? Additionally, I would recommend reading these two links:

  1. Is it possible (how?) to run containerized CMAQ across multiple nodes using Azure CycleCloud & Slurm?
  2. CMAQ/DOCS/Users_Guide/Appendix/CMAQ_UG_appendixD_parallel_implementation.md at main · USEPA/CMAQ · GitHub

Note that picking a domain decomposition should be done carefully. In your case, each processor would be responsible for (90 columns x ~ 2-3 rows). Is there a reason you did it this way, instead of balancing the rows/columns per processor?
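To make that concrete, here is a quick back-of-the-envelope calculation (a sketch; the 180 x 210 domain and the 2 x 96 layout are the values from this thread):

```python
# Estimate the per-processor subdomain shape for a CMAQ horizontal decomposition.
# Values below are taken from this thread: 180 columns x 210 rows, NPCOL=2, NPROW=96.
import math

ncols, nrows = 180, 210        # horizontal grid dimensions (COLS x ROWS)
npcol, nprow = 2, 96           # decomposition from the run script

cols_per_pe = math.ceil(ncols / npcol)   # columns handled by each processor
rows_per_pe = math.ceil(nrows / nprow)   # rows handled by each processor (some get one fewer)

print(f"each processor: ~{cols_per_pe} columns x ~{rows_per_pe} rows")
```

With 210 rows split 96 ways, each processor ends up with only 2-3 rows, so the subdomains are long, thin strips rather than the roughly square patches that balance work and communication.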

Also, when you say it “fails,” are there any messages in the main and ancillary log files (usually found at the very bottom)?


Dear fsidi,

Sorry for the late response.

Thank you very much for your reply. Since my workstation has two computing nodes, each with 96 computing cores, I have set the configuration to 2x96. I am wondering if there is a better way to balance the number of rows and columns per processor while maintaining better computational efficiency.

Additionally, the error message primarily states “cores killed,” without providing more detailed error information.

Below is my run_script file.

run_CMAQ_15km.csh (40.0 KB)

Thank you for your assistance.

Your run script contained the following settings:

#PBS -l nodes=2:ppn=88

#> Horizontal domain decomposition
   @ NPCOL  =  2; @ NPROW = 88

(NPCOL * NPROW) needs to be set equal to (nodes * ppn) or the number of cores available.
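As a quick sanity check of that constraint, using the values from the run script above (a sketch, not part of CMAQ):

```python
# Verify that the decomposition matches the cores requested from the scheduler.
# Values are from the run script in this thread: nodes=2:ppn=88, NPCOL=2, NPROW=88.
nodes, ppn = 2, 88
npcol, nprow = 2, 88

total_cores = nodes * ppn
total_subdomains = npcol * nprow
print(total_cores, total_subdomains)   # both must be equal or CMAQ will not start
```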

However, the decomposition is best set so that each processor has a similar number of grid cells to work on.

As an example, for the 12 km US1 domain, the GRIDDESC File contains:
‘12CONUS’ -2556000.0 -1728000.0 12000.0 12000.0 459 299 1

This means the domain size is 459 x 299 x 35 (COLS x ROWS x LAYERS), with the number of layers determined by the WRF configuration.

So, to make the decomposition balanced, it would be good to have NPCOL slightly larger than NPROW for the 12US1 Domain.

If you had 176 cores, then you could set
NPCOL x NPROW = 16 x 11
using the following setting in your run script:

#> Horizontal domain decomposition
   @ NPCOL  =  16; @ NPROW = 11

Your 15km domain is 180 x 210, therefore you would want to make NPCOL slightly smaller than NPROW.

If you had 176 cores, then you could set
NPCOL x NPROW = 11 x 16
using the following setting in your run script:

#> Horizontal domain decomposition
   @ NPCOL  =  11; @ NPROW = 16
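More generally, a balanced decomposition can be found by scanning the factor pairs of the core count and picking the one whose subdomains are closest to square. This is a hypothetical helper, not part of CMAQ; the "most square subdomain" criterion is an assumption based on the balancing advice above:

```python
# Pick the NPCOL x NPROW factorization of n_cores whose subdomains are most square.
# Hypothetical helper for illustration; not part of the CMAQ distribution.
def best_decomposition(ncols, nrows, n_cores):
    best = None
    for npcol in range(1, n_cores + 1):
        if n_cores % npcol:
            continue                      # npcol must divide the core count
        nprow = n_cores // npcol
        # Aspect ratio of each subdomain; 1.0 means perfectly square.
        ratio = (ncols / npcol) / (nrows / nprow)
        score = abs(ratio - 1.0)
        if best is None or score < best[0]:
            best = (score, npcol, nprow)
    return best[1], best[2]

# For the 15 km domain (180 x 210) on 176 cores:
print(best_decomposition(180, 210, 176))   # favors NPCOL slightly smaller than NPROW
```

For the 180 x 210 domain this search picks 11 x 16, and for the 459 x 299 12US1 domain it picks 16 x 11, matching the recommendations above.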

To help us further, please attach the per-processor CTM_LOG_000* files and the main log file; you may need to rename them to have a .txt extension.


Dear lizadams,

Thank you very much for your reply!

This method is indeed very helpful. I can now successfully use all the nodes on my workstation for simulations.

I had always misunderstood that the domain decomposition simply had to match the processor count set in #PBS, but it turns out it should also be balanced according to the domain's grid dimensions.

Thank you very much for your detailed response, and I hope others who encounter this issue in the future can refer to your reply.