Computational Efficiency of the WRF-CMAQ model

Hi all,

I’ve encountered an issue where running a one-day CMAQ simulation takes an entire day (using 64 cores with mpirun). I’m working with CMAQv5.4, and the simulation covers the CONUS 12US1 domain with a 12-km grid resolution and 459 × 299 grid cells. The simulation relies on emissions from EQUATES, including in-line point emissions.

I have examined the options in “bldit_cctm.csh” and “run_cctm.csh” and found nothing unusual. The commands related to mpirun are as follows. Could you provide some insights into potential reasons for this issue (e.g., other configurations)? Specifically, I’m curious if in-line emissions, such as point sources, lightning NOx, and biogenic emissions, might be significantly influencing the simulation duration. I have attached one logfile.
CTM_LOG_001.v54_intel_2023_12US1_20230116.txt (715.3 KB)

Thank you for your time and help!

set ParOpt

if ( $PROC == serial ) then
   setenv NPCOL_NPROW "1 1"; set NPROCS   = 1 # single processor setting
else
   @ NPCOL  =  8; @ NPROW =  8
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW";
endif

set MPI = /opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/
set MPIRUN = $MPI/mpirun
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

If you want to try running without point sources, change the run script to:
setenv N_EMIS_PT 0

This is much longer than I would expect. When we ran EQUATES 2019 for January 16 (the same day you’re borrowing emissions from) using 256 cores, we saw about 5 seconds per model time step, whereas your log shows typical values of about 200 seconds per step. Even if CMAQ scaled perfectly from 64 to 256 cores, your run would still be about 10 times slower than what we saw.

A large number of inline point sources has the potential to slow down the model somewhat, though this has been improved in CMAQv5.4. You could try running a test without inline point sources, but I don’t think it will change your runtime by an order of magnitude. System setup issues, such as communication between nodes, are more likely to be the problem. Inline lightning and biogenic emissions do not add a significant computational burden, so I wouldn’t expect much of a change from not invoking those options.

Not related to the runtime issues, but I noticed that you’re currently using 2019 EQUATES emissions to run 2023, and simply mapped January 16, 2019 to January 16, 2023. Even if you want to use 2019 anthropogenic emissions for a 2023 run, you would at least want to map weekdays to weekdays and weekends to weekends. Emissions such as fires and EGUs will clearly be off no matter how you map 2019 calendar days to 2023 calendar days.

If you want a detailed report on just where your run is spending its time, use the perf command:
perf record …. saves the performance data, and perf report then generates the report from it.

See
https://perf.wiki.kernel.org/index.php/Main_Page and https://stackoverflow.com/questions/38972147/run-perf-with-an-mpi-application
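
One common way to combine perf with an MPI run (see the second link above) is to wrap the executable in a small script so that each rank records to its own file. A minimal sketch, assuming Intel MPI, which sets PMI_RANK for each rank (Open MPI uses OMPI_COMM_WORLD_RANK instead), and a hypothetical wrapper named perf_wrap.csh (make it executable with chmod +x):

#!/bin/csh -f
# perf_wrap.csh: have each MPI rank write its own perf profile
perf record -o perf.data.${PMI_RANK} $*

Then launch CCTM through the wrapper and inspect one rank’s profile:

mpirun -np $NPROCS ./perf_wrap.csh $BLD/$EXEC
perf report -i perf.data.0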

Did you compile in debug mode? That turns off optimizations, which slows down the model considerably.
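
For reference, the relevant switch in bldit_cctm.csh is the Debug_CCTM option; the exact compiler flags it adds come from config_cmaq.csh, but in general it should stay commented out for production runs:

#set Debug_CCTM    # leave commented out; a debug build disables optimization and runs much slower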

Please let us know how many CPUs or cores are available on your machine.

lscpu

If you have only 16 cores available but you try to run using 64, then you are giving too much work to each compute node, and it will give this type of poor performance.
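
As a quick sanity check (a sketch only; verify the lscpu options on your system), you can compare the number of MPI ranks you request with the number of physical cores on a node:

set NPROCS = 64
# count unique (core, socket) pairs = physical cores
set NCORES = `lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l`
if ( $NPROCS > $NCORES ) echo "Oversubscribed: $NPROCS ranks on $NCORES physical cores"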

You can also check this using top or htop on your compute cluster. If you are on an HPC system with a login node and are using a job scheduler, you would need to log in to the compute node and run top or htop.
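
For example (a sketch only; the execution host name comes from your scheduler’s job listing, e.g. bjobs -l under LSF or squeue under SLURM):

ssh <compute-node-name>
top    # press 1 to show per-CPU load; oversubscribed ranks are immediately visible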

Another question is how much memory is available, but typically if you were exceeding the memory requirements, the model would crash with an FPE.

@cgnolte ran the same domain and determined that 89 GB of memory is required for the 12US1 domain. Is 68GB memory enough to run CMAQv5.3.1 with 12US1 platform? - #2 by cgnolte

It may also be that your filesystem is slow. If you exceed the L2 cache, the model has to read information from disk instead of from memory, and a slow filesystem would then lead to poor performance.
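
A crude way to gauge the write throughput of the disk holding your output (a sketch only; $OUTDIR stands for your CCTM output directory, and caching means the number is only indicative):

dd if=/dev/zero of=$OUTDIR/iotest.bin bs=1M count=2048
rm $OUTDIR/iotest.bin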

Performance benchmark information for this domain is also available here: [CMAQ on AWS](3. Performance and Cost Optimization - aws-cmaq documentation)


Thanks! I have done that, but found no obvious improvement.

The “#set Debug_CCTM” option is commented out in my “bldit_cctm.csh”.

Thanks Liz! The way the job is submitted might be the most likely factor. The output of the lscpu command is shown below, and I have attached the job status information.
job_status.txt (14.1 KB)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Stepping:              7
CPU MHz:               2900.000
BogoMIPS:              5800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15
NUMA node1 CPU(s):     16-31

Thank you for sharing that experience. 200 seconds per time step is unacceptably slow. You’re right that the inline emissions are not the main factor; I agree that system setup issues are more critical. I have attached my job status below; could you help me diagnose it?

I appreciate the reminder for the date mapping issue!

If you are using SLURM and want to run on 256 cores with

@ NPCOL = 16; @ NPROW = 16

then your SLURM settings need to match:

#!/bin/csh -f
#SBATCH -t 4:00:00
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32

So, depending on the number of cores you have available, you need:

nodes x ntasks-per-node = NPCOL x NPROW = ncores
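
For the example above, that works out to 8 x 32 = 256 = 16 x 16. A small sketch of a consistency check you could drop into the run script (SCHED_CORES is a made-up variable for illustration):

@ NPCOL = 16; @ NPROW = 16
@ NPROCS = $NPCOL * $NPROW
# nodes x ntasks-per-node from the scheduler request
@ SCHED_CORES = 8 * 32
if ( $NPROCS != $SCHED_CORES ) echo "Mismatch: $NPROCS ranks vs $SCHED_CORES scheduled cores"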

You will need to look at the instructions provided by your institute on how to submit jobs to their scheduler.

It looks like you are using the LSF scheduler, based on your job_status.txt:

#BSUB -n 256

Job <877060>, Job Name , User , Project , Status , Queue <single_chassis>,
Command <#!/bin/csh; #BSUB -n 256; ##BSUB -R “select[model==Plat8358]”;
##BSUB -R span[hosts=1] ##span[ptile=16]; ##BSUB -R “rusage[mem=150GB]”;
#BSUB -J CMAQ; #BSUB -W 20:00 #120:00; #BSUB -o test.out.%J; #BSUB -e test.err.%J;
module load cmaq-libs; #./run_cctm_Bench_2018_12NE3.csh;
#./run_cctm_2019_12US1_allfrom_EQUATES.csh; ./run_cctm_2023_12US1_EQUATES.csh;
#./run_cctm_2023_12US1_EQUATES_DDM3D.csh>, Share group charged

Additional resources on how to submit using bsub commands:
https://www.hpc.dtu.dk/?page_id=1428
https://labs.icahn.mssm.edu/minervalab/documentation/lsf-job-scheduler/

BSUB options from the above links:

-n Ncore          total number of CPU cores requested (default: 1)
-R span[ptile=X]  number of cores per node

Example:
#BSUB -R span[ptile=8] # 8 cores per node

In your case, where you have 32 cores per node, I think these are the options to use if you want to run on 256 cores:

-n 256
-R span[ptile=32]

Then set NPCOLxNPROW = 16x16
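
Putting it together, here is a sketch of an LSF job header for this setup (the job name, wall-clock limit, module, and run script name are taken from your job_status.txt; adjust the rest to your site's conventions):

#!/bin/csh -f
#BSUB -n 256
#BSUB -R span[ptile=32]
#BSUB -J CMAQ
#BSUB -W 20:00
#BSUB -o cmaq.out.%J
#BSUB -e cmaq.err.%J
# -n 256 with ptile=32 spreads the job over 8 hosts of 32 cores each
module load cmaq-libs
# and in the run script: @ NPCOL = 16; @ NPROW = 16
./run_cctm_2023_12US1_EQUATES.csh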

In your job_status.txt it appears you did not use -R span[ptile=32], which likely spread CMAQ across multiple hosts, and you may then have been sharing those hosts with other jobs.

In SLURM, you can also use a command

#SBATCH --exclusive

With this option, the compute nodes are not shared with any other job; they are used exclusively for the job that you submit to the queue.

I would ask your system administrator if there is an equivalent setting you can use for LSF.
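
(As the later posts in this thread show, LSF does offer an exclusive-execution request, bsub -x / #BSUB -x, but whether it is allowed depends on how your queues are configured, so it is still worth checking with your administrator.)

#BSUB -x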

Thanks! I’ll try it!

Thank you so much for your detailed instructions. I will give it a try and keep you updated on my progress.

Hi Liz,

I have submitted a job using the following settings and will report back once it has run. I also noticed another thing from my previous job: the memory consumed is significantly larger than what you mentioned. I have attached the resource usage summary. Is this related to the model configuration? Thanks!

#BSUB -n 64
#BSUB -R span[ptile=32]
#BSUB -x
Resource usage summary:

    CPU time :                                   29520.00 sec.
    Max Memory :                                 381 GB
    Average Memory :                             328.79 GB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              78
    Max Threads :                                144
    Run time :                                   513 sec.
    Turnaround time :                            531 sec.

Hi nna,

Are you still having the computational efficiency issue with your model? Please let me know before I jump in. Thanks.

Cheers,
David

Hi David,

I’ve resolved the issue by recompiling the model without modifying the configuration. I apologize for not sharing this earlier; I am unable to provide a good explanation for why it worked.

Using the “BSUB -x” option to run the job in an exclusive execution environment improved the computational efficiency slightly, but not by an order of magnitude.

Thanks!