Unable to write to AELMO even though new AELMO file successfully created

Problem:
When running the benchmark case with CMAQv5.5, the model fails to write to AELMO at the end of the first hour.

>>--->> WARNING in subroutine WRITE3
Invalid file.  FNAME="CTM_AELMO_1
M3WARN:  DTBUF 0:00:00   July 1, 2018  (2018182:000000)

*** ERROR ABORT in subroutine WRITE_ELMO on PE 001
Could not write CTM_AELMO_1      file
PM3EXIT:  DTBUF 0:00:00   July 1, 2018
Date and time 0:00:00   July 1, 2018   (2018182:000000)

In my output directory, I can see that the AELMO output file has been created with a name that matches CTM_AELMO_1 in my run_cctm.csh. The file has the correct dimensions, variables, etc. when I check with ncdump.

What might be causing the model to be unable to find the AELMO file at the end of the hour?

Some more details:
I have tried running in debug mode, but that did not add any information beyond pointing to ELMO_PROC.F, where the WRITE_ELMO subroutine lives. I’ve not modified the source code in any way.

When I turn off AELMO and ELMO output in CMAQ_Control_Misc.nml, the model runs the 2 day benchmark with no problems and my ACONC output compares well with the benchmark output.

I’ve included two example LOG files, as well as the output.txt file. In one LOG file, the processor creates a new AELMO file at the beginning of the simulation and then has no error at the end (until the whole simulation is aborted). The other LOG file does not report creating an AELMO output file at the beginning and reports the error message at the end of the hour. Is it potentially problematic that not all processors report creating a new AELMO file at the beginning of the simulation?

CTM_LOG_000_createsAELMO.txt (329.5 KB)
starting line 2757:

"CTM_AELMO_1" opened as NEW(READ-WRITE )
File name "/lustre/scafellpike/local/HT05673/ddb06/axh49-ddb06/CMAQ/benchmark_output/CCTM_AELMO_v55_intel_Bench_2018_12NE3_cb6r5_ae7_aq_m3dry_20180701.nc"
File type UNKNOWN
Execution ID "CMAQ_CCTMv55_axh49-ddb06_20250423_155926_809459766"
Grid name "2018_12NE3"
Dimensions: 105 rows, 100 cols, 1 lays, 151 vbles
NetCDF ID:         0  opened as VOLATILE READWRITE
Starting date and time  2018182:000000 (0:00:00   July 1, 2018)
Timestep                          010000 (1:00:00 hh:mm:ss)
Maximum current record number         0

CTM_LOG_001_noAELMO.txt (317.4 KB)
No AELMO file created, and the final error in the LOG file is that the model can’t find CTM_AELMO_1

output.txt (34.6 KB)
run_cctm_Bench_2018_12NE3_CB6R5_mpi.csh (39.4 KB)

Thanks for any help!

Hello @abhoffman ,

thank you very much for your thorough description of the problem you are encountering.

Seeing that you are using the pnetcdf library and are working on a Lustre file system, my initial guess is that there is an issue with the pnetcdf portion of the code when it comes to handling the AELMO files. Process 0 is responsible for opening output files and - in the non-pnetcdf version of the code - also for gathering information from all other processes and then writing to the output files. In the pnetcdf version, each process should write to the output files directly, but based on your main log file this fails for processes 1 - 47, which to me suggests an issue with how pnetcdf handles this AELMO file (the write to the ACONC file seems to work fine).
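For readers unfamiliar with the two I/O modes, here is a pseudocode sketch (Fortran-style) of the difference. This is illustrative only, not actual CMAQ source: GATHER_TO_PE0 and the data variable names are made up, while WRITE3 and M3EXIT are the real I/O API calls used throughout CCTM.

```fortran
!     Illustrative pseudocode only -- not actual CMAQ source.
#ifdef parallel_io
!     pnetcdf build: every PE writes its own subdomain to the shared file
      IF ( .NOT. WRITE3( FNAME, VNAME, JDATE, JTIME, LOCAL_DATA ) ) THEN
         CALL M3EXIT( PNAME, JDATE, JTIME, 'write failed', XSTAT1 )
      END IF
#else
!     serial-I/O build: PE 0 gathers the full grid and performs the write
      CALL GATHER_TO_PE0( LOCAL_DATA, GLOBAL_DATA )   ! hypothetical helper
      IF ( MYPE .EQ. 0 ) THEN
         IF ( .NOT. WRITE3( FNAME, VNAME, JDATE, JTIME, GLOBAL_DATA ) ) THEN
            CALL M3EXIT( PNAME, JDATE, JTIME, 'write failed', XSTAT1 )
         END IF
      END IF
#endif
```

In the pnetcdf mode, a file that was never opened on a given PE (as in your CTM_LOG_001_noAELMO.txt) would produce exactly the "Invalid file" WRITE3 warning you are seeing.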

I’ll flag this issue for @wong.david-c and @Ben_Murphy for their insights.

Hi @abhoffman,

Since you intend to output CMAQ files on a parallel file system with pnetCDF, I wonder where you downloaded the IOAPI library from.

Cheers,
David

Looks like a bug.
The OPEN_ELMO routine in ELMO_INIT_DEFN.F is missing an #ifdef parallel_io block.


Ah, this makes sense!

If I wanted to fix this in OPEN_ELMO, could I copy the chunk of code that starts with #ifndef mpas for opening each diagnostic file and replace the first line with #ifdef parallel_io? Would I need to make any modifications, or modify any of the other subroutines? Also, the OPEN_ELMO subroutine in my version of CMAQ is in the ELMO_PROC.F module, which I’ve attached in case it’s different from ELMO_INIT_DEFN.F.
Thanks!
ELMO_PROC.F.txt (147.0 KB)

Sorry, I was looking at our current developmental version rather than v5.5 release code. ELMO_PROC.F is the correct file.

I have not tested this because I do not have access to a parallel file system, but try inserting this block in the OPEN_ELMO routine after the #ifndef mpas line:

#ifdef parallel_io
     CALL SUBST_BARRIER
     IF ( .NOT. IO_PE_INCLUSIVE ) THEN
        IF ( .NOT. OPEN3( CTM_ELMO_1, FSREAD3, PNAME ) ) THEN
           XMSG = 'Could not open ' // TRIM( CTM_ELMO_1 )
           CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT2 )
        END IF
     END IF
#endif

Please let me know if this works so that we can get this corrected on the 5.5+ branch.

Thanks for the code suggestion! I’ve tried a few different things, but am still not able to run with ELMO output on. Here’s what I’ve done:

  1. I added the block of code to OPEN_ELMO for both the instantaneous and average output files (modifying the file name as appropriate) after the #ifndef mpas line. This resulted in a link error (undefined reference to se_barrier) because there was no USE statement for SE_MODULES.
  2. I copied the following code block to the beginning of the subroutine to pull in SE_MODULES:
#ifdef parallel
      USE SE_MODULES            ! stenex (using SE_UTIL_MODULE)
#else
      USE NOOP_MODULES          ! stenex (using NOOP_UTIL_MODULE)
#endif

  The model compiled. An AELMO file was created at the start of the run, but then the run failed when trying to create the new CONC file.
  3. I turned off AELMO output and used the same build (with the modified OPEN_ELMO) just to check that all other modules still worked. The model successfully ran the benchmark case.
  4. Some of the other output file open subroutines use CALL SE_BARRIER instead of CALL SUBST_BARRIER, so I also tried that. This did not change the error message about not being able to create a CONC file. Output and log file attached for this run.
CTM_LOG_043.v55_intel_Bench_2018_12NE3_cb6r5_ae7_aq_m3dry_20180701.txt (109.8 KB)
output.txt (19.1 KB)

I’m happy to keep trying different code modifications, though I’ll need suggestions as to what to try next since I’m stumped!

As a side note, there is also a small error in the MEGAN code when running with parallel I/O. In BDSNP_MOD, on line 613, the code refers to FLUSH3, but this produces an "undefined reference" error at link time. I added a declaration at the beginning of the subroutine to fix it.

You’re correct; SUBROUTINE HRNOBDSNP needs

LOGICAL, EXTERNAL :: FLUSH3

Your latest crash is strange. Make sure that you don’t have a lingering CTM_CONC_1 file from a previous aborted run (though our script should detect that and abort if so).
Is there a further error message on PE 043, i.e., CTM_LOG_043* ?

I’ll return to the question @wong.david-c posted upthread… where did you obtain the IOAPI library you are using? Your log file has a lot of '==d== xtract a' debugging lines that I do not see in the release CCTM code, and I am not sure where those are coming from.

Ah, sorry for not responding to the IOAPI question! I emailed David and got the IOAPI version from him.

I’d not seen the ==d== xtract before either, but since the model runs (when ELMO output is off) and compares well with the benchmark case, I didn’t think it mattered.

There is no further error message for that process, and the other log files except for PE 000 look pretty much the same. There wasn’t a CONC file from the previous run, since I always delete them to avoid this issue.

To follow up regarding the ==d== xtract lines in the LOG files, this message comes from my version of xtract3.f90 in IOAPI, which I got from emailing @wong.david-c. The code that outputs ==d== xtract does not appear to be an error message, though I’ve attached the code in case it helps. I get this message in the LOG files both when AELMO output is on and when it is off. The message also seems to come from reading the inputs rather than from generating new output files, so I don’t think it is connected to my initial problem with AELMO output.
xtract3.f90.txt (10.7 KB)

The ==d== are left over debugging lines, they are not an error message.

Please edit line 66 of wr_init.F so that it says

  CHARACTER( 16 ) :: PNAME = 'WR_INIT'

Recompile and rerun. The program will crash again. Does it say ERROR ABORT in subroutine OPCONC or ERROR ABORT in subroutine WR_INIT?

When PNAME is changed to WR_INIT, the error message is ERROR ABORT in subroutine WR_INIT

Ok, that’s some progress; now we know where the model is crashing.

Does your Makefile contain the cpp flag -Dparallel_io in addition to -Dparallel?

Yes, the Makefile contains both flags.

 cpp_flags = \
  -Dparallel \
  -Dparallel_io \

Hi abhoffman,

Please let me know in which part of WR_INIT it aborted. Is it at the very first WRITE3 statement, for the GC section?

By the way, "==d==" is my debugging signature.

Cheers,
David

The model does not make it to the WRITE3 step in WR_INIT. In the LOG files that output the error message, the final error is

     *** ERROR ABORT in subroutine WR_INIT on PE 004
     Could not open CTM_CONC_1
 PM3EXIT:  DTBUF 0:00:00   July 1, 2018
     Date and time 0:00:00   July 1, 2018   (2018182:000000)

The "could not open" message corresponds to the very first part of WR_INIT.F, starting on line 84:

#ifndef mpas
#ifdef parallel_io
      CALL SUBST_BARRIER
      IF ( .NOT. IO_PE_INCLUSIVE ) THEN
         IF ( .NOT. OPEN3( CTM_CONC_1, FSREAD3, PNAME ) ) THEN
            XMSG = 'Could not open ' // CTM_CONC_1
            CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT1 )
         END IF
      END IF
#endif

This is confusing because there is no error message from OPCONC when opening a new CONC file, and this step happens before WR_INIT. So the new CONC file should be there. The LOG files show that OPCONC wrote the new CONC file header description.

Also, just to make sure I’m including all the information, there is still an inconsistency between the LOG files regarding AELMO file creation before the WR_INIT error.

LOG files in multiples of 8 (I think because of how I set NPCOL and NPROW in run_cctm) show that AELMO was opened as new read/write and do not have the WR_INIT error message. All other LOG files do not include any reference to AELMO; instead, they start the OPEN OR CREATE CONCENTRATION FILE section immediately by trying to open CTM_CONC_1. Because of this discrepancy, it seems like there is still a problem with AELMO and communication between all processes.

Hi abhoffman,

In theory, as you said, "the new CONC file should be there": the CONC file should be available on the system, but for some reason it is not. Even so, various measures have been put in the code to ensure the CONC file is available to all non-I/O PEs in the WR_INIT subroutine. These measures include:

the FLUSH3 call in the OPCONC subroutine, forcing the CONC file to sync to disk:

  IF ( IO_PE_INCLUSIVE ) THEN   ! open new

     IF ( .NOT. OPEN3( CTM_CONC_1, FSNEW3, PNAME ) ) THEN
        XMSG = 'Could not open ' // CTM_CONC_1
        CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT1 )
     END IF

     IF ( .NOT. FLUSH3 ( CTM_CONC_1 ) ) THEN
        XMSG = 'Could not sync to disk ' // CTM_CONC_1
        CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT1 )
     END IF

  END IF

The other two measures are: the calls to OPCONC and WR_INIT are sufficiently far apart, and a barrier statement to synchronize all PEs is inserted before the OPEN3 call for non-I/O PEs in the WR_INIT subroutine.

Could you please send your WR_INIT.F to me (wong.david-c@epa.gov)? I will modify it and send it back to you for a test.

Cheers,
David

Hi abhoffman,

For the time being, please comment out (turn off) the AELMO and ELMO calls and focus on the CONC file first. Once we fix that, we will address the ELMO issue.

Cheers,
David

@wong.david-c identified the problem in the WRITE_ELMO subroutine in ELMO_PROC.F and the model is now working. Thanks so much for the help!