Unable to invoke true parallel I/O feature

Hi,

I am a new CMAQ user running CMAQv5.4 with the GNU Fortran (GCC) 11.3.0 compiler using EQUATES data. We are planning to simulate the entire year of 2018. A test run for November to December 2017 over the 12US1 domain was successfully completed in multi-node parallel mode. However, a one-day simulation currently takes over 1 hour of wall-clock time using 512 processors. I would like to enable true parallel I/O to see if it can help reduce the runtime.

I followed these steps:

  1. Installed PnetCDF/1.12.3-gompi-2022a and IOAPI/3.2-20200828-gompi-2022a-noomp-pncf (the MPI version of the I/O API library).
  2. Uncommented the set build_parallel_io and set MakefileOnly lines in bldit_cctm.csh before building CMAQ (see the first sketch after this list).
  3. Edited the Makefile to include PnetCDF and the correct I/O API paths before compiling the code:
LIB = /home/jh94030/work/CMAQ/CMAQv54_orig/CMAQ_v5.4/lib/x86_64/gcc
include_path = -I /apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/Linux2_x86_64gfortmpi \
              -I /apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/ioapi/fixed_src \
              -I $(LIB)/mpi/include -I.
IOAPI  = -L/apps/gb/IOAPI/3.2-20200828-gompi-2022a-noomp-pncf/Linux2_x86_64gfortmpi -lioapi
NETCDF = -L$(LIB)/netcdff/lib -lnetcdff -L$(LIB)/netcdf/lib -lnetcdf
PNETCDF = -L$(LIB)/pnetcdf/lib -lpnetcdf
LIBRARIES = $(IOAPI) $(NETCDF) $(PNETCDF)
  4. Added the “MPI:” prefix to all output file paths in the run script (see the second sketch after this list).
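
For step 2, here is a minimal sketch of the two lines I uncommented in bldit_cctm.csh (the trailing comments are my paraphrase, not the exact wording in the script):

set MakefileOnly        # generate the Makefile but do not compile
set build_parallel_io   # build CCTM with true parallel I/O (requires PnetCDF)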

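For step 4, this is a sketch of how I prepended “MPI:” to the output file definitions in the run script; OUTDIR and CTM_APPL follow the stock CCTM run script, and the second variable and both file names are just illustrative:

# "MPI:" tells the I/O API to write these files through PnetCDF
setenv CTM_CONC_1    "MPI:${OUTDIR}/CCTM_CONC_${CTM_APPL}.nc"
setenv CTM_DRY_DEP_1 "MPI:${OUTDIR}/CCTM_DRYDEP_${CTM_APPL}.nc"
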
After completing these steps and submitting the job, I encountered the following errors related to MPI:

[b5-23:3666158] *** An error occurred in MPI_Allreduce
[b5-23:3666158] *** reported by process [2364866560,3]
[b5-23:3666158] *** on communicator MPI_COMM_WORLD
[b5-23:3666158] *** MPI_ERR_COMM: invalid communicator
[b5-23:3666158] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b5-23:3666158] *** and potentially your MPI job)

However, if I omit the “MPI:” prefix from the output paths, the job behaves strangely: errors appear in the log files, but the job does not terminate on its own and I have to cancel it manually. The run fails with the following error from subroutine OPCONC:

***** ERROR ABORT in subroutine OPCONC on PE 032 Could not open CTM_CONC_1.

I’m curious why adding “MPI:” to the output paths makes such a difference. Could you help clarify this behavior?

Also, I’m wondering whether using parallel I/O is necessary in CMAQ for my case. Is there likely to be a significant improvement in simulation runtime with the PIO feature?

Any further guidance on using parallel I/O in CMAQ would be greatly appreciated!

Thank you,
Jingting

Hi Jingting,

Thanks for your interest in using “true” parallel I/O in CMAQ. To use this feature, you need appropriate hardware support, a parallel file system (e.g., Lustre or BeeGFS), and the correct version of the I/O API source code. I assume you downloaded the I/O API from the CMAS web site; that version does not work with parallel I/O. Please contact me directly (wong.david-c@epa.gov) to obtain the correct version.

In the run script, all output files must have the “MPI:” prefix so the I/O API library knows you are using the parallel I/O feature. In terms of timing, based on a study I conducted (GMD: “An approach to enhance pnetCDF performance in environmental modeling applications”), the average reduction in output time was about 50%, depending on the file, domain size, and hardware. That translates directly into time saved in the overall model run. In your case, since you are using 512 cores, you should see a substantial reduction in output time for each file compared to the traditional pseudo parallel I/O in CMAQ. If you have additional questions or want to learn more about parallel I/O, please feel free to send me an email.

Cheers,
David