Segmentation Fault running Benchmark in CMAQv5.3

My bldit_cctm.csh file has the set build_parallel_io line commented out.
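For reference, the line in question looks something like this in the v5.3 bldit_cctm.csh (the trailing comment may differ between versions):

 #set build_parallel_io    #> uncomment to build with parallel I/O (pnetcdf)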

Hi again,
It looks like the CTM_LOG_000* files that you sent are the ones that we provided with the benchmark output data, not the ones created on your machine, as the path is specified for the EPA machine.

File “GRIDDESC” opened for input on unit: 98
/work/MOD3DATA/2016_12SE1/GRIDDESC

Can you look again for your CTM_LOG_000* file under your output directory:

/home/colethatcher/Documents/5.3_CMAQ/data/output_CCTM_v53_gcc_Bench_2016_12SE1/LOGS

Sorry about that; here is the right log:
CTM_LOG_000.v53_gcc_Bench_2016_12SE1_20160701.txt (559.9 KB)

Can you check to see what was written to your CGRID output file using the following command:

ncdump -h CCTM_CGRID_v53_gcc_Bench_2016_12SE1_20160701.nc > CCTM_CGRID_v53_gcc_Bench_2016_12SE1_20160701.nc.txt

It should contain one timestep if it was written out correctly; the header should look like this:

netcdf CCTM_CGRID_v53_gcc_Bench_2016_12SE1_20160701 {
dimensions:
TSTEP = UNLIMITED ; // (1 currently)
DATE-TIME = 2 ;
LAY = 35 ;
VAR = 223 ;
ROW = 80 ;
COL = 100 ;

This is what I found.

netcdf CCTM_CGRID_v53_gcc_Bench_2016_12SE1_20160701 {
dimensions:
TSTEP = UNLIMITED ; // (1 currently)
DATE-TIME = 2 ;
LAY = 35 ;
VAR = 223 ;
ROW = 80 ;
COL = 100 ;

The next step is to compare the last hour of the conc file to the cgrid file, as they should be identical. Just note that the cgrid may contain more variables than the conc file.

Please try to use m3diff or VERDI to compare the two files.
Just make sure that you put the CGRID file first, before the CONC file; the TFLAG differs between the two files, so it might crash if you put the CONC file first.
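A minimal sketch of that m3diff comparison, run from the output directory and assuming FILEA/FILEB are left as m3diff's default logical file names (it then prompts interactively for the variables and report options; the CONC filename is assumed from the same naming pattern as the CGRID file):

 setenv FILEA CCTM_CGRID_v53_gcc_Bench_2016_12SE1_20160701.nc
 setenv FILEB CCTM_CONC_v53_gcc_Bench_2016_12SE1_20160701.nc
 m3diff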

Secondly, please add print statements in wr_cgrid.F around line 299, something along the lines of:

         write(logdev,*) '==before ptrwrite3=='

IF ( .NOT. PTRWRITE3( S_CGRID, ALLVAR3, JDATE, JTIME, CGRID ) ) THEN

         write(logdev,*) '==after ptrwrite3=='

Then recompile and re-run.

This will show whether there is an issue with the parallel writing function, which is a wrapper around MPI commands and the I/O API libraries. If this is the case, we can take a look at that next.

Here is the comparison of the last hour of the simulation for O3.


cctm.log.txt (132.5 KB)

It looks like the CGRID file was not updated for the last timestep before the model failed.

Did you get any output by adding the following lines of code to the wr_cgrid.F routine?

Add the following print statements in wr_cgrid.F around line 299, something along the lines of:

     write(logdev,*) '==before ptrwrite3=='

IF ( .NOT. PTRWRITE3( S_CGRID, ALLVAR3, JDATE, JTIME, CGRID ) ) THEN

     write(logdev,*) '==after ptrwrite3=='

This will show whether there is an issue with the parallel writing function, which is a wrapper around MPI commands and the I/O API libraries. If this is the case, we can take a look next.

I added those lines to the file. Should they not show up in the cctm.log file?

There are two types of log files: the standard out/error log associated with the overall job, and the per-processor log files of an MPI run, one for each processor, whose names begin with CTM_LOG_xxx and which contain the write(logdev,*) output. At the end of the run, the log files are moved to the following directory:

data/output_CCTM_v53_intel_Bench_2016_12SE1/LOGS

If the run was unsuccessful, the CTM_LOG_xxx_* files may remain in the script run directory.

You can use the following command to see whether it was printed out. It will print the 5 lines before each match of ptrwrite3:

 grep -B 5 ptrwrite3 CTM_LOG_000*

This is what was in my CTM_LOG_000* file after I added those lines to my version of the code:

    Dimensions: 80 rows, 100 cols, 35 lays, 223 vbles
     NetCDF ID:   2752512  opened as READWRITE           
     Starting date and time  2016184:000000 (0:00:00   July 2, 2016)
     Timestep                          010000 (1:00:00 hh:mm:ss)
     Maximum current record number         0
 ==before ptrwrite3==
 ==after ptrwrite3==

To view the 10 lines after each match of ptrwrite3 using grep:

grep -A 10 ptrwrite3 CTM_LOG_000*
 ==before ptrwrite3==
 ==after ptrwrite3==

     Timestep written to S_CGRID          for date and time  2016184:000000
 
 
     ==============================================
     |>---   PROGRAM COMPLETED SUCCESSFULLY   ---<|
     ==============================================
     Date and time 0:00:00   July 2, 2016   (2016184:000000)
     The elapsed time for this simulation was     950.9 seconds.

The log file that you attached is the standard out from the job, not the mpi log file, so I can’t tell if the ptrwrite3 statement was printed out or not.

I added the lines to wr_cgrid.F, but it doesn't appear to output anything.

wr_cgrid.F.txt (11.6 KB)

CTM_LOG_000.v53_gcc_Bench_2016_12SE1_20160701.txt (567.2 KB)

One thing that I do that you could try is to edit your run script to comment out the CONC_SPCS environment variable (so that all species are written to the CONC file) and the CONC_BLEV_ELEV variable (so that all layers are written).
I don't think that should explain your error, but it may be worth a try.

#> Output Species and Layer Options
   #> CONC file species; comment or set to "ALL" to write all species to CONC
   #   setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP NH3 ANH4I ANH4J ASO4I ASO4J" 
   #setenv CONC_BLEV_ELEV " 1 1" #> CONC file layer range; comment to write all layers to CONC
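If the change takes effect, the CONC file header should then show all 35 layers and the full variable count. A quick check along the lines of the ncdump command above (CONC filename assumed from the same naming pattern):

 ncdump -h CCTM_CONC_v53_gcc_Bench_2016_12SE1_20160701.nc | grep -E "LAY =|VAR ="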

The same error happened.

Hi Zach,

It looks like the write statements are not in the correct location in your wr_cgrid.F file. In a parallel (MPI) build, only the #ifdef parallel branch that calls PTRWRITE3 is compiled in, so write statements placed in the #else branch never execute.

Can you use the attached version, which has the write statements inside the #ifdef parallel section? wr_cgrid.F.txt (11.6 KB)

#ifdef parallel
      write(logdev,*) '==before ptrwrite3=='
      IF ( .NOT. PTRWRITE3( S_CGRID, ALLVAR3, JDATE, JTIME, CGRID ) ) THEN
         XMSG = 'Could not write S_CGRID'
         CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT1 )
      END IF
      write(logdev,*) '==after ptrwrite3=='
#else
      IF ( .NOT. WRITE3( S_CGRID, ALLVAR3, JDATE, JTIME, CGRID ) ) THEN
         XMSG = 'Could not write S_CGRID'
         CALL M3EXIT( PNAME, JDATE, JTIME, XMSG, XSTAT1 )
      END IF
#endif

Thanks, Liz

I used the attached version but still nothing prints out.

Just to make sure I did this right:

  • I downloaded the file you gave
  • renamed it to wr_cgrid.F
  • replaced the file of the same name in BLD_CCTM_v53_gcc
  • ran the benchmark again
  • searched the CTM_LOG files in ~/5.3_CMAQ/data/output_CCTM_v53_gcc_Bench_2016_12SE1/LOGS

I can find the five lines before:

Grid name “SE52BENCH_CROSS”
Dimensions: 80 rows, 100 cols, 35 lays, 16 vbles
NetCDF ID: 65536 opened as READONLY
Starting date and time 2016183:000000 (0:00:00 July 1, 2016)
Timestep 010000 (1:00:00 hh:mm:ss)
Maximum current record number 25

I can't find the ten lines after, since the program doesn't complete successfully.

After you replace the file wr_cgrid.F in the BLD_CCTM_v53_gcc directory, you need to remove the existing object file and executable for that program:

rm wr_cgrid.o
rm CCTM_v53.exe

Then use the make command to recompile wr_cgrid.F and link and build CCTM_v53.exe:

make |& tee make.new_wr_cgrid.F.log

Please attach a copy of the make.new_wr_cgrid.F.log to this issue so that I can review it.
Then rerun your benchmark.
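To confirm that wr_cgrid.F was actually recompiled, a quick search of the make log (using the log name above) should show its compile line:

 grep wr_cgrid make.new_wr_cgrid.F.log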

Will do that; sorry, I'm still a fairly new user.

make.new_wr_cgrid.F.log.txt (5.5 KB)

Ran the benchmark again, but there is still no output.

So I ran the entire 2011 benchmark in serial just to see if there was any difference, and the CTM_LOG files showed up in the CCTM/scripts folder. I ran a grep search for errors and it came up with the matches shown below.
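For reference, a search along these lines produces that filename-prefixed output (the exact pattern used is an assumption):

 grep "ERROR ABORT" CTM_LOG_*.v53_gcc_Bench_2011_12SE1_20110701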

CTM_LOG_004.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_006.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_007.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_009.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_010.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_014.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_015.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_017.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_021.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS
CTM_LOG_027.v53_gcc_Bench_2011_12SE1_20110701: *** ERROR ABORT in subroutine RUNTIME_VARS