Hi,
I am a new CMAQ user running CMAQv5.4 with the GNU Fortran (GCC) 11.3.0 compiler using EQUATES data. We are planning to simulate the entire year of 2018. A test run for November to December 2017 over the 12US1 domain was successfully completed in multi-node parallel mode. However, a one-day simulation currently takes over 1 hour of wall-clock time using 512 processors. I would like to enable true parallel I/O to see if it can help reduce the runtime.
I followed these steps:
- Installed PnetCDF/1.12.3-gompi-2022a and IOAPI/3.2-20200828-gompi-2022a-noomp-pncf (the MPI version of the I/O API library); the environment setup is sketched just after this list.
- Uncommented the lines
  set build_parallel_io
  set MakefileOnly
  in bldit_cctm.csh before building CMAQ.
- Edited the Makefile to include PnetCDF and the correct I/O API paths before compiling the code:
LIB = /home/jh94030/work/CMAQ/CMAQv54_orig/CMAQ_v5.4/lib/x86_64/gcc
include_path = -I /apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/Linux2_x86_64gfortmpi \
-I /apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/ioapi/fixed_src \
-I $(LIB)/mpi/include -I.
IOAPI = -L/apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/Linux2_x86_64gfortmpi -lioapi
NETCDF = -L$(LIB)/netcdff/lib -lnetcdff -L$(LIB)/netcdf/lib -lnetcdff -lnetcdf
PNETCDF = -L$(LIB)/pnetcdf/lib -lpnetcdf
LIBRARIES = $(IOAPI) $(NETCDF) $(PNETCDF)
- Added the “MPI:” prefix to all output file paths in the job script (see the run-script excerpt after this list).
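For completeness, the build/run environment was set up roughly as follows. This is only a sketch, assuming an Lmod/Environment Modules system with the module names listed above:

#> Load the MPI-enabled I/O libraries before building and running CCTM
#> (module names as installed on our cluster)
module load PnetCDF/1.12.3-gompi-2022a
module load IOAPI/3.2-20200828-gompi-2022a-noomp-pncf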
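And for reference, the output-file definitions in my run script look roughly like the following excerpt. Only a few of the output variables are shown, and the names follow the standard CCTM run script; paths are illustrative:

#> Excerpt from the CCTM run script: output paths prefixed with "MPI:"
#> so the MPI-enabled I/O API writes them through PnetCDF
setenv CTM_CONC_1    "MPI:${OUTDIR}/CCTM_CONC_${CTM_APPL}.nc"
setenv S_CGRID       "MPI:${OUTDIR}/CCTM_CGRID_${CTM_APPL}.nc"
setenv CTM_DRY_DEP_1 "MPI:${OUTDIR}/CCTM_DRYDEP_${CTM_APPL}.nc"
setenv CTM_WETDEP_1  "MPI:${OUTDIR}/CCTM_WETDEP1_${CTM_APPL}.nc"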
After completing these steps and submitting the job, I encountered the following errors related to MPI:
[b5-23:3666158] *** An error occurred in MPI_Allreduce
[b5-23:3666158] *** reported by process [2364866560,3]
[b5-23:3666158] *** on communicator MPI_COMM_WORLD
[b5-23:3666158] *** MPI_ERR_COMM: invalid communicator
[b5-23:3666158] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b5-23:3666158] *** and potentially your MPI job)
However, if I omit the “MPI:” prefix from the output file paths, the job behaves strangely: errors appear in the log files, but the run does not terminate on its own and I have to cancel it manually. The job ultimately fails with the following error from opconc.o:
***** ERROR ABORT in subroutine OPCONC on PE 032 Could not open CTM_CONC_1.
I’m curious why adding “MPI:” to the output file paths makes such a difference. Could you help clarify this behavior?
Also, I’m wondering whether using parallel I/O is necessary in CMAQ for my case. Is there likely to be a significant improvement in simulation runtime with the PIO feature?
Any further guidance on using parallel I/O in CMAQ would be greatly appreciated!
Thank you,
Jingting