Computational Efficiency of the WRF-CMAQ model

Hi all,

I’ve encountered an issue where running a one-day CMAQ simulation takes an entire day (using 64 cores with mpirun). I’m working with CMAQv5.4, and the simulation covers the CONUS 12US1 domain with a 12-km grid resolution and 459 × 299 grid cells. The simulation relies on emissions from EQUATES, including in-line point emissions.

I have examined the options in “bldit_cctm.csh” and “run_cctm.csh” and found nothing unusual. The commands related to mpirun are as follows. Could you provide some insights into potential reasons for this issue (e.g., other configurations)? Specifically, I’m curious if in-line emissions, such as point sources, lightning NOx, and biogenic emissions, might be significantly influencing the simulation duration. I have attached one logfile.
CTM_LOG_001.v54_intel_2023_12US1_20230116.txt (715.3 KB)

Thank you for your time and help!

set ParOpt

if ( $PROC == serial ) then
   setenv NPCOL_NPROW "1 1"; set NPROCS   = 1 # single processor setting
else
   @ NPCOL  =  8; @ NPROW =  8
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW";
endif

set MPI = /opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/
set MPIRUN = $MPI/mpirun
( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

If you want to try running without point sources, change the run script to:
setenv N_EMIS_PT 0

This is much longer than I would expect. When we ran EQUATES 2019 for January 16 (the same day you’re borrowing emissions from) using 256 cores, we saw about 5 seconds per model time step, whereas your log shows typical values of about 200 seconds per step. Even if CMAQ scaled perfectly from 64 to 256 cores, your run would still be about 10 times slower than what we saw.

A large number of inline point sources has the potential to slow down the model somewhat, though this has been improved in CMAQv5.4. You could try running a test without inline point sources, but I don’t think it will change your runtime by an order of magnitude. System setup issues, such as communication between nodes, are more likely to be the problem. Inline lightning and biogenic emissions do not add a significant computational burden, so I wouldn’t expect much of a change from not invoking those options.

Not related to the runtime issues, but I noticed that you’re currently using 2019 EQUATES emissions to run 2023, and simply mapped January 16, 2019 to January 16, 2023. Even if you want to use 2019 anthropogenic emissions for a 2023 run, you would at least want to map weekdays to weekdays and weekends to weekends. Emissions such as fires and EGUs will clearly be off no matter how you map 2019 calendar days to 2023 calendar days.

If you want a detailed report on just where your run is spending its time, use the perf command:
perf record …. saves the performance data, and perf report then generates the report from it.

See
https://perf.wiki.kernel.org/index.php/Main_Page and https://stackoverflow.com/questions/38972147/run-perf-with-an-mpi-application
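
One common way to combine perf with an MPI run (see the second link above) is to wrap the executable in a small script so that each rank records to its own file. A minimal sketch, assuming Intel MPI, which sets PMI_RANK for each rank (Open MPI uses OMPI_COMM_WORLD_RANK instead), and a hypothetical wrapper named perf_wrap.csh (make it executable with chmod +x):

#!/bin/csh -f
# perf_wrap.csh: have each MPI rank write its own perf profile
perf record -o perf.data.${PMI_RANK} $*

Then launch CCTM through the wrapper and inspect one rank’s profile:

mpirun -np $NPROCS ./perf_wrap.csh $BLD/$EXEC
perf report -i perf.data.0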

Did you compile in debug mode? That turns off optimizations, which slows down the model considerably.
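
For reference, the relevant switch in bldit_cctm.csh is the Debug_CCTM option; the exact compiler flags it adds come from config_cmaq.csh, but in general it should stay commented out for production runs:

#set Debug_CCTM    # leave commented out; a debug build disables optimization and runs much slower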

Please let us know how many CPUs or cores are available on your machine.

lscpu

If you have only 16 cores available but you try to run using 64, then you are giving too much work to each compute node, and it will give this type of poor performance.
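
As a quick sanity check (a sketch only; verify the lscpu options on your system), you can compare the number of MPI ranks you request with the number of physical cores on a node:

set NPROCS = 64
# count unique (core, socket) pairs = physical cores
set NCORES = `lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l`
if ( $NPROCS > $NCORES ) echo "Oversubscribed: $NPROCS ranks on $NCORES physical cores"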

You can also check this using top or htop on your compute cluster. If you are on an HPC system with a login node and are using a job scheduler, you would need to log in to the compute node and run top or htop.
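
For example (a sketch only; the execution host name comes from your scheduler’s job listing, e.g. bjobs -l under LSF or squeue under SLURM):

ssh <compute-node-name>
top    # press 1 to show per-CPU load; oversubscribed ranks are immediately visible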

Another question is how much memory is available, but typically if you were exceeding the memory requirements, the model would crash with an FPE.

@cgnolte ran the same domain and determined that 89 GB of memory is required for the 12US1 domain. Is 68GB memory enough to run CMAQv5.3.1 with 12US1 platform? - #2 by cgnolte

It may also be that your filesystem is slow. If you exceed the L2 cache, the model has to read information from disk instead of from memory, and a slow filesystem would then lead to poor performance.
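
A crude way to gauge the write throughput of the disk holding your output (a sketch only; $OUTDIR stands for your CCTM output directory, and caching means the number is only indicative):

dd if=/dev/zero of=$OUTDIR/iotest.bin bs=1M count=2048
rm $OUTDIR/iotest.bin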

Performance benchmark information for this domain is also available here: [CMAQ on AWS](3. Performance and Cost Optimization - aws-cmaq documentation)


Thanks! I have done that, but found no obvious improvement.

The “#set Debug_CCTM” option is commented out in my “bldit_cctm.csh”.

Thanks Liz! The way the job is submitted might be the most likely factor. The output of the lscpu command is shown below, and I have attached the job status information.
job_status.txt (14.1 KB)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Stepping:              7
CPU MHz:               2900.000
BogoMIPS:              5800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15
NUMA node1 CPU(s):     16-31

Thank you for sharing that experience. 200 seconds per time step is unacceptably slow. You’re right that the inline emissions are not the main factor; I agree that system setup issues are more critical. I have attached my job status below; could you help me diagnose it?

I appreciate the reminder for the date mapping issue!

If you are using SLURM and want to run on 256 cores with

@ NPCOL = 16; @ NPROW = 16

then your SLURM settings need to match:

#!/bin/csh -f
#SBATCH -t 4:00:00
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32

So, depending on the number of cores you have available, you need:

nodes x ntasks-per-node = NPCOL x NPROW = ncores
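
For the example above, that works out to 8 x 32 = 256 = 16 x 16. A small sketch of a consistency check you could drop into the run script (SCHED_CORES is a made-up variable for illustration):

@ NPCOL = 16; @ NPROW = 16
@ NPROCS = $NPCOL * $NPROW
# nodes x ntasks-per-node from the scheduler request
@ SCHED_CORES = 8 * 32
if ( $NPROCS != $SCHED_CORES ) echo "Mismatch: $NPROCS ranks vs $SCHED_CORES scheduled cores"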

You will need to look at the instructions provided by your institute on how to submit jobs to their scheduler.

It looks like you are using the LSF scheduler, based on your job_status.txt:

#BSUB -n 256

Job <877060>, Job Name , User , Project , Status , Queue <single_chassis>,
Command <#!/bin/csh; #BSUB -n 256; ##BSUB -R “select[model==Plat8358]”;
##BSUB -R span[hosts=1] ##span[ptile=16]; ##BSUB -R “rusage[mem=150GB]”;
#BSUB -J CMAQ; #BSUB -W 20:00 #120:00; #BSUB -o test.out.%J; #BSUB -e test.err.%J;
module load cmaq-libs; #./run_cctm_Bench_2018_12NE3.csh;
#./run_cctm_2019_12US1_allfrom_EQUATES.csh; ./run_cctm_2023_12US1_EQUATES.csh;
#./run_cctm_2023_12US1_EQUATES_DDM3D.csh>, Share group charged

Additional resources on how to submit using bsub commands:
https://www.hpc.dtu.dk/?page_id=1428
https://labs.icahn.mssm.edu/minervalab/documentation/lsf-job-scheduler/

BSUB options from the above links:

-n Ncore          total number of CPU cores requested (default: 1)
-R span[ptile=X]  number of cores per node

Example:
#BSUB -R span[ptile=8] # 8 cores per node

In your case, where you have 32 cores per node, I think these are the options to use if you want to run on 256 cores:

-n 256
-R span[ptile=32]

Then set NPCOLxNPROW = 16x16
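
Putting it together, here is a sketch of an LSF job header for this setup (the job name, wall-clock limit, module, and run script name are taken from your job_status.txt; adjust the rest to your site's conventions):

#!/bin/csh -f
#BSUB -n 256
#BSUB -R span[ptile=32]
#BSUB -J CMAQ
#BSUB -W 20:00
#BSUB -o cmaq.out.%J
#BSUB -e cmaq.err.%J
# -n 256 with ptile=32 spreads the job over 8 hosts of 32 cores each
module load cmaq-libs
# and in the run script: @ NPCOL = 16; @ NPROW = 16
./run_cctm_2023_12US1_EQUATES.csh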

In your job_status.txt it appears you did not use -R span[ptile=32], which likely spread CMAQ across multiple hosts, and you may then have been sharing those hosts with other jobs.

In SLURM, you can also use a command

#SBATCH --exclusive

With this option, the compute nodes are not shared with any other job; they are used exclusively for the job that you submit to the queue.

I would ask your system administrator if there is an equivalent setting you can use for LSF.
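
(As the later posts in this thread show, LSF does offer an exclusive-execution request, bsub -x / #BSUB -x, but whether it is allowed depends on how your queues are configured, so it is still worth checking with your administrator.)

#BSUB -x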

Thanks! I’ll try it!

Thank you so much for your detailed instructions. I will give it a try and keep you updated on my progress.

Hi Liz,

I have submitted a job using the following settings and will report back once it has run. I also noticed another thing from my previous job: the memory consumed is significantly larger than what you mentioned. I have attached the resource usage summary. Is this related to the model configuration? Thanks!

#BSUB -n 64
#BSUB -R span[ptile=32]
#BSUB -x
Resource usage summary:

    CPU time :                                   29520.00 sec.
    Max Memory :                                 381 GB
    Average Memory :                             328.79 GB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              78
    Max Threads :                                144
    Run time :                                   513 sec.
    Turnaround time :                            531 sec.

Hi nna,

Are you still having the computational efficiency issue with your model? Please let me know before I jump in. Thanks.

Cheers,
David

Hi David,

I’ve resolved the issue by recompiling the model without modifying the configuration. I apologize for not sharing this earlier; I am unable to provide a good explanation for why it worked.

Using the “BSUB -x” option to run the job in an exclusive execution environment improved the computational efficiency slightly, but not by an order of magnitude.

Thanks!