CMAQ version 5.4
gfortran version 11.4
OS version - MS HPC Ubuntu 22.04 LTS (Azure)
Azure HB176rs_v4 VM SKU (176-core AMD EPYC™ 9V33X (“Genoa-X”), 768 GB RAM)
I have built an Azure CycleCloud HPC environment to run WRF and CMAQ, utilising the config and VM SKU above.
WRF runs on it very nicely - maxing out all 176 cores for the duration of the run.
CMAQ CCTM doesn’t seem so good. It is really “bursty”: the CPUs max out at 100% for a second or two, drop back to nothing for a second or two, then jump back to 100%. I am not getting anywhere near the performance I expected, and I am not sure how to improve it.
I am running on 160 of the 176 cores of a single node, with NCOLS=16 and NROWS=10 (176 = 16 × 11, and 11 is prime), but I have tried various other combinations and the performance does not change much.
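For what it’s worth, here is a rough sketch of how a 16 × 10 grid maps onto my 147 × 147 output domain (this assumes CMAQ hands any remainder columns/rows to the first few ranks, one extra apiece - an assumption on my part, not something I have verified in the source):

```python
# Rough sketch: per-rank subdomain sizes for a block decomposition.
# Assumption (mine): remainder columns/rows go to the first few ranks,
# so subdomain sizes differ by at most one cell per axis.

def split(cells, nparts):
    """Chunk sizes when `cells` is divided across `nparts` ranks."""
    base, extra = divmod(cells, nparts)
    return [base + 1] * extra + [base] * (nparts - extra)

ncols, nrows = 147, 147   # CCTM output domain (see MCIP params below)
npcol, nprow = 16, 10     # my current decomposition

print("columns per rank:", sorted(set(split(ncols, npcol))))   # [9, 10]
print("rows per rank:   ", sorted(set(split(nrows, nprow))))   # [14, 15]
```

If that assumption holds, subdomains range from 9 × 14 to 10 × 15 cells, i.e. about a 19% spread in cell count between the smallest and largest rank.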
Inputs and outputs go to an NFS server, but I don’t see any I/O wait, so I don’t think this is the issue. There is plenty of RAM, and WRF runs on it a treat.
Is this expected - that it should be “bursty” as it solves each “chunk” - or should it be expected to sit at 100% CPU for the whole run if everything is correct?
If it should be running better (smoother) than this, how do I go about debugging what the problem is and what can I do to improve things?
I am the IT Admin and this is the first time we have run WRF/CMAQ - so forgive any ignorance.
Thanks in advance for any advice
Gary
Not sure if this is needed, but these are the MCIP input data set parameters:
Met domain dimensions (col, row, lay): 160 160 39
MCIP X domain dimensions (col, row, lay): 149 149 39
Output domain dimensions (col, row, lay): 147 147 39
Output grid resolution: 27.000000000000000 km
Window domain origin on met domain (col,row): 6 , 6
Window domain far corner on met domain (col,row): 155 , 155
Just on general principles, I would suspect a load-balance issue: some cell or cells are very “hot” and require much more intense computation than the rest.
BTW: unless this is a time-critical run, you will get much better computational efficiency using fewer nodes, since parallel overhead goes up rapidly (faster than quadratic) with the number of nodes involved.
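To illustrate (with invented numbers, not measured CMAQ values), a toy Amdahl-style model with a serial fraction plus a per-rank communication cost shows how the scaling curve flattens and then turns over:

```python
# Toy scaling model: Amdahl's law plus a communication term.
# Both coefficients are invented for illustration only; they are
# not measured CMAQ values.

def speedup(n, serial=0.05, comm=0.0005):
    """Speedup on n ranks given a serial fraction and a per-rank
    communication cost that grows linearly with n."""
    return 1.0 / (serial + (1.0 - serial) / n + comm * n)

for n in (16, 32, 64, 128, 160):
    print(f"{n:4d} ranks -> {speedup(n):5.1f}x")
```

With these made-up coefficients the speedup peaks at around 40 ranks and then declines; the shape of the curve, not the numbers, is the point.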
Hey, thanks so much for getting back to me…
FYI - this is just a single node, but with 176 cores (I am using 160).
So, I think you are probably saying (I am no expert) that some parts (cells) of the analysis are “hot” and need a lot of processing, while others are “cool” and need little - and this is why I am seeing the CPUs bursting between 0% and 100% all the time.
Have I understood correctly?
If so, surely if the “cool” cells don’t need much processing, then it should pass through them quickly until it gets to another “hot” cell - and hence should still run near 100% CPU all the time (no doubt I am misunderstanding something here)?
Another observation: I see only a marginal performance increase from about 30 cores all the way up to 176 cores. Why is this - surely more cores should make it run faster (unless it is spending all its time communicating between processes)?
Is there anything I can do to get this to run quicker? It’s taking too long, and we were hoping this 176-core machine might be the answer to quicker cycle times.
In our experience running CMAQ on Azure CycleCloud, we found it necessary to use either Lustre or the BeeGFS filesystem to get scalable performance.
https://cyclecloud-cmaq.readthedocs.io/en/latest/user_guide_cyclecloud/timing/parse_timing_cyclecloud.html
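If you want to see whether your step times are erratic, the per-step wall clock in the CCTM main log is a good place to start. A minimal sketch (it assumes log lines of the form “Processing completed...  12.3 seconds”; adjust the regex to whatever your build prints):

```python
# Quick-and-dirty parse of per-step wall times from a CCTM log.
# Assumes lines like "Processing completed...   12.3 seconds";
# adjust the pattern if your build's log differs.

import re
import statistics
import sys

pattern = re.compile(r"Processing completed\.\.\.\s+([\d.]+)\s+seconds")

times = []
with open(sys.argv[1]) as log:            # e.g. the main CTM_LOG file
    for line in log:
        match = pattern.search(line)
        if match:
            times.append(float(match.group(1)))

if not times:
    raise SystemExit("no step times found - adjust the regex to your log")

print(f"steps: {len(times)}  total: {sum(times):.1f} s")
print(f"min/mean/max per step: {min(times):.1f} / "
      f"{statistics.mean(times):.1f} / {max(times):.1f} s")
```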
…If so, surely if the “cool” cells don’t need much processing, then it should pass through them quickly until it gets to another “hot” cell -
Not quite: CMAQ actually divides the modeling domain into a checkerboard (16 by 10, in your case above) and assigns one processor to each of the resulting checkerboard-squares.
So what happens is, for each square: “is this a hot square?” - if so, its processor takes lots of CPU; if not, it doesn’t.
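A toy illustration of the consequence (the work numbers are invented): because the squares exchange boundary data every step, each step costs as much as the slowest square, not the average one.

```python
# Toy illustration of load imbalance under per-step synchronization.
# Work values are invented; the point is that every step costs the
# slowest rank's time, because ranks exchange boundary data each step.

import random

random.seed(1)
nranks, nsteps = 160, 100

synced = 0.0
balanced = 0.0
for _ in range(nsteps):
    # Most squares are cheap (1 unit); ~5% are "hot" and cost 5x.
    work = [5.0 if random.random() < 0.05 else 1.0 for _ in range(nranks)]
    synced += max(work)             # ranks wait for the slowest square
    balanced += sum(work) / nranks  # what perfect balancing would give

print(f"synchronized: {synced:.0f} units  balanced: {balanced:.0f} units")
```

With 160 squares, almost every step contains at least one hot square, so the synchronized total comes out roughly four times the perfectly balanced one - the other CPUs sit idle while the hot squares finish. That is your burstiness.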
Ah - I see…
So, the simulation will always take as long as the slowest square takes to process - is that it?
Is there anything that can be done to spread the load better across CPUs - so that the “hot” squares are smaller and the “cool” squares are bigger?
Hi Gary,
The domain decomposition is fixed, so under the current implementation there is no way to redistribute the workload. It looks like you are testing the CMAQ offline case. May I suggest downloading the WRF-CMAQ coupled model and its associated test data (I can’t remember whether it is 1 or 2 days of data). After you have built it (if you need help, please feel free to reach out), you can do a test run of the coupled model with the coupling option set to 0, i.e. running WRF only. The reason to do this is to determine whether the poor performance comes from the CMAQ side or not. WRF uses a similar domain decomposition to CMAQ’s. Please let me know your test results.
Cheers,
David
All - thanks for your advice on this; much appreciated, as this is all new to me.
I have tried manually running both the CMAQ container and the output folder locally on the compute node’s SSD (to see if the CMAQ run is throttled by NFS server I/O), and I can report that it improved the run time by about 12%. However, once you add the time needed to sync the files to the SSD at the start of the Slurm job and back to the NFS server at the end, it becomes a negligible improvement in overall run time.
Still - it is useful to know that the run is only marginally I/O bound and that the bottleneck is mainly compute…
I have noticed that the choice of NROW and NCOL values makes a difference to the CMAQ CCTM runtime, though… Can anyone advise me on how best to choose these values for CMAQ CCTM?
For WRF, I read that I should use the two closest multiplicative factors of the number of CPUs available, but this does not seem to be optimal for CMAQ, as I have had better runtimes when skewing towards a much higher NCOL count than NROW.
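In case it helps anyone else, a quick way to list the candidate NPCOL × NPROW factor pairs for 160 cores and the rough subdomain shape each pair gives (I am assuming the NPCOL/NPROW settings in the run script correspond to the NCOLS/NROWS I quoted above):

```python
# Enumerate candidate NPCOL x NPROW pairs for a given core count,
# with the approximate subdomain shape each pair gives.

import math

ncols, nrows, cores = 147, 147, 160   # domain size and ranks to use

for npcol in range(1, cores + 1):
    if cores % npcol == 0:
        nprow = cores // npcol
        print(f"NPCOL={npcol:3d}  NPROW={nprow:3d} -> "
              f"~{math.ceil(ncols / npcol)} x {math.ceil(nrows / nprow)} "
              f"cells per rank")
```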
@wong.david-c - thanks for the suggestion of the WRF-CMAQ coupled model; I was not aware of it. It sounds like it might be better than separate WRF and CMAQ runs (as there is a feedback loop from CMAQ back into the WRF calculations, I think?), but I will need to speak with my Air Quality colleagues first to see if it is suitable for what they want to do (I am just the IT guy!)