Turn off CTM_LOG files

Dear all,
Is there a way to turn off the CTM_LOG files? We are currently testing the performance of the CMAQ model on a new HPC cluster, and it turns out that I/O times are quite high. We are testing different options (pnetcdf, netcdf3, netcdf4), but we also wonder how these ASCII CTM_LOG files affect I/O times. Can you tell us how to turn off the production of the CTM_LOG files? (We run the simulation with 900 processors, which means 900 log files.)

thank you

Hi Dusan,

Even with 900 log files, I’d be surprised if this was a major drain on the runtime compared to writing out the data. But it would be an interesting test. Unfortunately, there is no way I’m aware of to turn off the log file output with an option. You should be able to modify the source code to mostly achieve this though.

In the file RUNTIME_VARS.F, modify the routines LOG_HEADING, LOG_SUBHEADING, and LOG_MESSAGE by either commenting out all of the WRITE statements or simply inserting a RETURN immediately after the variable declaration section in each; see the sketch below. This won’t eliminate every write, but it should take care of most of them.
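For illustration, here is a minimal hypothetical sketch of that kind of change. This is not the actual CMAQ source and the argument names are invented; in your copy of RUNTIME_VARS.F, keep the real declarations and only add the bare RETURN right after them.

   !  Hypothetical sketch only -- not the actual CMAQ source.
   SUBROUTINE LOG_MESSAGE( LOGUNIT, MSG )
      IMPLICIT NONE
      INTEGER,        INTENT( IN ) :: LOGUNIT   ! this processor's log unit (illustrative)
      CHARACTER( * ), INTENT( IN ) :: MSG       ! text that would be logged (illustrative)

      RETURN      ! added: bail out before any WRITE, silencing the log

      WRITE( LOGUNIT, '( 5X, A )' ) MSG         ! original write, now unreachable
   END SUBROUTINE LOG_MESSAGE

The same pattern would go into LOG_HEADING and LOG_SUBHEADING.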

Do you have access to a profiler that will run on 900 cores? If so, I would be very interested to know what uses most of the time in that configuration.

Best wishes,
Ben

You’re probably already doing this, but just in case: on many clusters you can write much faster to local node disk than to network/shared disk. It is therefore often much faster to send all outputs directly to local storage during the run and then move the results after each day. This is a minor change to the run scripts, but it can be very helpful. As an alternative, you could edit the source as Ben says so that only the logs go to local disk, and then move the logs afterwards.

Just a thought.

Dear Ben,

Thank you for the help, but eliminating the writes to the log files did not help. The person from the HPC company who helped me test and optimize the performance of the CMAQ model for this specific simulation finally wrote to me:

I tested a variety of processor grids, with the result that a square (quadratic) grid is optimal and that we are at the very end of MPI scalability.

Anyhow, any grid with fewer than 900 processors showed worse performance on the calculation steps. The hourly steps, on the other hand, are more demanding with regard to memory bandwidth, which limits the number of cores per socket that can be utilized to maximize performance.

This would be an ideal situation in which shared-memory parallelism (OpenMP) would pay off, but unfortunately it is not implemented in CMAQ, and implementing it is not trivial.

Also, no configuration was found in which the overhead of parallel I/O through pnetcdf was compensated for by better read/write times or lower wait times.

Thanks for this update, Dusan.

I wanted to check in and see if the solution proposed by Barronh above solved your issue with long model runtimes, or if you are still facing difficulties.

It is interesting to learn that on your system no configuration of the processors made up for the overhead cost of using pnetcdf.

On our system, we rarely use more than 500 processors ourselves for simulations. May I ask the size of your domain, in grid cells?

As your contact stated, the resources needed to implement mixed MPI/OpenMP are large; indeed, they are larger than what our team can afford at the moment. I think we would be highly interested if anyone in the community took this on, though.
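Just to give a sense of the scope: within each MPI subdomain, essentially every science loop over grid cells would need thread-level directives along the lines of the hypothetical sketch below. This is not CMAQ code, only an illustration of the OpenMP pattern, with made-up array sizes.

   !  Hypothetical illustration of the OpenMP pattern, not CMAQ code: threads
   !  on one node share the subdomain's memory and split the column/row loops
   !  among themselves, on top of the existing MPI domain decomposition.
   PROGRAM OMP_SKETCH
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NCOLS = 13, NROWS = 17, NLAYS = 19   ! placeholder subdomain size
      REAL    :: CONC( NCOLS, NROWS, NLAYS )
      INTEGER :: C, R, L

      CONC = 1.0

!$OMP PARALLEL DO COLLAPSE( 2 ) PRIVATE( C, R, L )
      DO R = 1, NROWS
         DO C = 1, NCOLS
            DO L = 1, NLAYS
               CONC( C, R, L ) = 0.5 * CONC( C, R, L )   ! stand-in for real science code
            END DO
         END DO
      END DO
!$OMP END PARALLEL DO

      WRITE( *, * ) 'Threads available:', OMP_GET_MAX_THREADS()
   END PROGRAM OMP_SKETCH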

Best wishes,
Ben

Hi Dusan,

As @Ben_Murphy suggests, it is a good idea to post your domain specs and how many chemical species you are saving in your files.

@barronh’s suggestion is also good, as saving to the node-local SSD will shave some time off, with a few caveats: those SSDs are visible only to a single node, meaning the maximum number of cores for your run script will be the cores per node, which essentially leads you to manual parallelization. If you go that way, also try tests that leave a few cores on each node unused, as this has been found to produce better results with certain processors.

Regarding:
Anyhow, any grid with fewer than 900 processors showed worse performance on the calculation steps.

This would be true if you want to find the absolute endpoint of your single-run scaling plateau. As Ben mentions, CMAQ users rarely use such a large number of processors, because efficiency at that point is quite poor (you may also be sharing the system with other projects/users, and you have the option of manual parallelization).

What would make more sense for finding your optimal configuration is a performance scaling or efficiency plot, where you run benchmark tests for your domain starting with something low (e.g. 16, 32, 64 cores) and work your way up.
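To make that concrete, here is a tiny sketch (in Fortran, since that is what CMAQ is written in) of the bookkeeping behind such a plot. The core counts and wall times are made-up placeholders, just to show how relative speed-up and parallel efficiency fall out of the benchmark timings.

   !  Sketch: compute relative speed-up and parallel efficiency from benchmark
   !  wall times; the numbers here are placeholders, not measurements.
   !  Efficiency close to 1 means the additional cores are still well used.
   PROGRAM SCALING_TABLE
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 5
      INTEGER :: CORES( N ), I
      REAL    :: TWALL( N ), SPEEDUP, EFF

      CORES = (/ 16, 32, 64, 128, 256 /)
      TWALL = (/ 7200.0, 3900.0, 2200.0, 1400.0, 1000.0 /)   ! seconds, placeholder

      WRITE( *, '( A )' ) '  CORES   WALL(S)  SPEEDUP      EFF'
      DO I = 1, N
         SPEEDUP = TWALL( 1 ) / TWALL( I )                   ! relative to the smallest run
         EFF     = SPEEDUP * REAL( CORES( 1 ) ) / REAL( CORES( I ) )
         WRITE( *, '( I7, F10.1, 2F9.2 )' ) CORES( I ), TWALL( I ), SPEEDUP, EFF
      END DO
   END PROGRAM SCALING_TABLE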

If you have tried the benchmark case that comes with 5.3.3, or the larger CONUS2 domain, it may help put the times you get in perspective against other systems. Also note whether hyperthreading is turned off on your system.

Cheers,
Christos

Dear Ben,

I do not fully understand Barronh’s proposal. At our met institute we have an HPC system with 240 nodes of 40 processors each, and the disk space, as I understand it, is shared by all nodes.

The original reason we wanted to speed up CMAQ is that the NWP model (Aladin) on the same domain (and even with more vertical layers) is more than 4 times faster than the CMAQ simulation. After many tests we found that, for the given domain, running CMAQ on 900 processors is the fastest. The NWP model was of course run on more processors and can therefore perform much faster. The external expert from the company that built the HPC concluded that the main reason for the difference in performance between the two models is that CMAQ does not have OpenMP parallelization (of course, the two models are completely different).

Our model domain has 2 km resolution, 366 x 494 horizontal grid cells, and 19 vertical layers, and the runtime for a 24 h simulation was roughly 10-15 minutes, which is not too bad; the NWP model with the same domain but 87 vertical layers can do a 24 h simulation in less than 4 minutes, although using more processors. For this simulation 12 minutes is not too much, but in the future, when we want to run 1 km (or finer) resolution simulations on a larger domain, the missing OpenMP parallelization could become a limitation. I understand, though, that it can be very difficult to implement.

best regards

Hi,

Sorry for the delay in responding. I’m excited that you are pushing the boundaries of CMAQ’s parallelization and performance in this way. Maybe in the future we will have OpenMP available. It’s certainly a limitation our team is aware of, but at present, our resources are assigned to other priorities. Good luck and stay in touch! I’m sure we’d be interested to learn more about the results you are getting and any other helpful feedback you have on the future needs of the system.

Best wishes,
Ben