ISAM v5.4 benchmark error with PM_TOT turned on

Compiler Version ifort (IFORT) 2021.8.0 20221119
CMAQ Version BLD_CCTM_v54_ISAM_intel_debug/CCTM_v54_ISAM.exe
Run Script run_cctm_Bench_2018_12NE3.csh

Hello,

I have successfully run the ISAM benchmark with tag classes SULFATE and OZONE. However, when I simply add PM_TOT to isam_control.2018_12NE3.txt, it throws the following error:

     ================================
     |>---   TIME INTEGRATION   ---<|
     ================================

     Processing Day/Time [YYYYDDD:HHMMSS]: 2018182:000000
       Which is Equivalent to (UTC): 0:00:00  Sunday,  July 1, 2018
       Time-Step Length (HHMMSS): 000500
                 VDIFF completed...   14.7 seconds
                COUPLE completed...    1.2 seconds
                  HADV completed...   35.1 seconds
                  ZADV completed...    3.8 seconds
                 HDIFF completed...    2.5 seconds
              DECOUPLE completed...    0.6 seconds
                  PHOT completed...    2.6 seconds
               CLDPROC completed...    4.1 seconds
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.26.s  00007F7587E908E0  Unknown               Unknown  Unknown
CCTM_v54_ISAM.exe  000000000191E02B  sa_wrap_ae_               136  SA_WRAP_AE.F
CCTM_v54_ISAM.exe  00000000010DE032  sciproc_                  304  sciproc.F
CCTM_v54_ISAM.exe  00000000010C0279  cmaq_driver_              731  driver.F
CCTM_v54_ISAM.exe  00000000010B593E  MAIN__                     97  cmaq_main.F
CCTM_v54_ISAM.exe  0000000000407FFD  Unknown               Unknown  Unknown
libc-2.26.so       00007F75877B313A  __libc_start_main     Unknown  Unknown
CCTM_v54_ISAM.exe  0000000000407F29  Unknown               Unknown  Unknown

Then I set tag classes to ALL, and another error appears:

     ================================
     |>---   TIME INTEGRATION   ---<|
     ================================

     Processing Day/Time [YYYYDDD:HHMMSS]: 2018182:000000
       Which is Equivalent to (UTC): 0:00:00  Sunday,  July 1, 2018
       Time-Step Length (HHMMSS): 000500
                 VDIFF completed...   16.2 seconds
                COUPLE completed...    1.3 seconds
                  HADV completed...   44.6 seconds
                  ZADV completed...    4.8 seconds
                 HDIFF completed...    2.8 seconds
              DECOUPLE completed...    0.7 seconds
forrtl: severe (408): fort: (3): Subscript #1 of the array AE_MOLWT has value 0 which is less than the lower bound of 1

Image              PC                Routine            Line        Source
CCTM_v54_ISAM.exe  00000000017991DE  cldproc_                  586  cldproc_acm.F
CCTM_v54_ISAM.exe  00000000010DD66C  sciproc_                  279  sciproc.F
CCTM_v54_ISAM.exe  00000000010C0279  cmaq_driver_              731  driver.F
CCTM_v54_ISAM.exe  00000000010B593E  MAIN__                     97  cmaq_main.F
CCTM_v54_ISAM.exe  0000000000407FFD  Unknown               Unknown  Unknown
libc-2.26.so       00007F51E54A913A  __libc_start_main     Unknown  Unknown
CCTM_v54_ISAM.exe  0000000000407F29  Unknown               Unknown  Unknown

The same errors are found in CMAQ versions 5.4 and 5.4.0.1.

I have used ioapi-3.2-large, and the total free memory on my machine is ~135 GB. Just before the program crashes, there is still ~90 GB left.

Any guidance or attempts to reproduce this are welcome.

Thanks,
Liu

Hi Liu,

Both of those errors look like they could be array-dimensioning issues. I will take a look and try to reproduce them on our systems. Thank you for the “-traceback” reports.

Have you made any modifications to the code?

Sergey

Hi Sergey,

I use the following script to modify the code:

cd $CMAQ_HOME/CCTM/scripts
sed -i 's/!'"<Example> *'ALL' *,'ISAM_REGIONS'/'ALL','ISAM_REGIONS'/" \
    BLD_CCTM_v54_ISAM_*/CMAQ_Control_DESID.nml
sed -i 's/SULFATE, OZONE/SULFATE, OZONE, PM_TOT/' isam_control.2018_12NE3.txt
sed "14ilimit stacksize unlimited
s/ v54 / v54_ISAM /
s/CTM_ISAM N/CTM_ISAM Y/" run_cctm_Bench_2018_12NE3.csh \
    > run_cctm_Bench_2018_12NE3.ISAM.csh

And then run the new run_cctm_Bench_2018_12NE3.ISAM.csh.
No other scripts, source or ancillary files are modified.
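Since sed exits successfully even when a substitution matches nothing, it can help to confirm that each edit above actually landed. A minimal sketch in POSIX shell, assuming it is run from $CMAQ_HOME/CCTM/scripts; `require_match` is a hypothetical helper and not part of CMAQ:

```sh
#!/bin/sh
# Hypothetical helper (not part of CMAQ): succeed only if FILE contains
# at least one line with the fixed string PATTERN.
require_match() {
    pattern=$1
    file=$2
    if grep -qF -- "$pattern" "$file"; then
        echo "ok: '$pattern' found in $file"
    else
        echo "MISSING: '$pattern' not found in $file" >&2
        return 1
    fi
}

# Example usage after running the sed commands:
#   require_match 'PM_TOT' isam_control.2018_12NE3.txt
#   require_match 'CTM_ISAM Y' run_cctm_Bench_2018_12NE3.ISAM.csh
#   require_match 'limit stacksize unlimited' run_cctm_Bench_2018_12NE3.ISAM.csh
```

A non-zero return from any of these calls would suggest that the corresponding pattern did not match the file as shipped.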

Thanks,
Liu

Hi Liu,

I ran this on our systems and it behaved as expected using the “SULFATE, OZONE, PM_TOT” tagline, the “SULFATE, OZONE, ALL” tagline and the “ALL” tagline, since some of the species in the first 2 are redundantly defined. Everything ran for a day and provided output.

What do you think is different for your simulation? How did you compile? Also, does it crash for you immediately on the first timestep or later?

Sergey

Hi Sergey,

The crash occurs on the first timestep.

The compile options that differ from the tutorials are listed below:

  • I have netcdf-c and netcdf-fortran built with netcdf-4 support since it can be utilized by WRF.
  • When compiling I/O API, IOAPI_NCF4 is not defined. I just found the option after making this post.
  • I have also disabled static libraries (lib*.a) for hdf5, netcdf-c and netcdf-fortran.

I will try again with --disable-netcdf4 --disable-shared, and report the result later.

Thanks,
Liu

Sounds good. Let me know how it goes. I can’t seem to reproduce the behavior you are experiencing.

Sergey

Hi Sergey,

The errors are exactly the same for the --disable-netcdf4 --disable-shared build.
Switching to gcc and gfortran doesn’t work either.

I’m really unsure of what to do next. Could you offer any advice? Is there any additional detail I can provide?

Thanks,
Liu

The only thing I can think of is that your Intel compiler version is higher than what we use. Perhaps that could be an issue. You said that gcc and gfortran didn’t work, but is it possible for you to try intel/18.0.1? That is the exact one I am using right now.

Sergey

Hi Sergey,

It seems that Intel has removed the 18.* versions from its website (please correct me if I am wrong).
Is there any other tested version which is still publicly available?

Thanks,
Liu

Liu,

We have higher versions of the Intel Fortran compiler here. I will see how high I can go for a PM_TOT case. I do believe others here have used intel/21.4 successfully. Similarly, pgi 17.4 and gcc 6.1 are often used successfully for benchmark tests here as well.

Sergey

Hi Sergey,

Intel 21.4 failed with the same errors.
I don’t have a PGI license, so I cannot test PGI 17.4.
The GCC 6.1 test is still running, and I will post the result soon.

Thanks,
Liu

The gcc-6.1 build gives me a different error when tag classes are set to SULFATE, OZONE, PM_TOT:

     ================================
     |>---   TIME INTEGRATION   ---<|
     ================================

     Processing Day/Time [YYYYDDD:HHMMSS]: 2018182:000000
       Which is Equivalent to (UTC): 0:00:00  Sunday,  July 1, 2018
       Time-Step Length (HHMMSS): 000500
                 VDIFF completed...       4.9284 seconds
                COUPLE completed...       0.3741 seconds
                  HADV completed...      12.5092 seconds
                  ZADV completed...       2.1627 seconds
                 HDIFF completed...       1.0108 seconds
              DECOUPLE completed...       0.1962 seconds
                  PHOT completed...       1.0879 seconds
               CLDPROC completed...       1.6158 seconds
At line 337 of file SA_WRAP_AE.F
Fortran runtime error: Index '0' of dimension 4 of array 'isam1' below lower bound of 1

Error termination. Backtrace:

Could not print backtrace: unrecognized DWARF version in .debug_info at 6
#0  0xcf33ce in sa_wrap_ae_
	at /home/ec2-user/intel/cmaq-isam/CCTM/scripts/BLD_CCTM_v54_ISAM_gcc_debug/SA_WRAP_AE.F:337
#1  0x953707 in sciproc_
	at /home/ec2-user/intel/cmaq-isam/CCTM/scripts/BLD_CCTM_v54_ISAM_gcc_debug/sciproc.F:304
#2  0x9402f3 in cmaq_driver_
	at /home/ec2-user/intel/cmaq-isam/CCTM/scripts/BLD_CCTM_v54_ISAM_gcc_debug/driver.F:729
#3  0x939e06 in cmaq
	at /home/ec2-user/intel/cmaq-isam/CCTM/scripts/BLD_CCTM_v54_ISAM_gcc_debug/cmaq_main.F:97
#4  0x93a15c in main
	at /home/ec2-user/intel/cmaq-isam/CCTM/scripts/BLD_CCTM_v54_ISAM_gcc_debug/cmaq_main.F:32

The error for tag classes ALL is the same as the Intel build above.

Liu

Liu,

I am not sure what is going on for your application. I know you said you are using the “large” IOAPI library and that you have a lot of memory available during the crash. But it seems like it could still be somehow a memory issue, because all of the arrays that the errors point to on various compilers are correctly defined and don’t cause issues for other users. Maybe there is a stack size limit still?

One thing you could try is to run just PM_TOT and no other TAGCLASSES. That should be smaller array sizes than ALL. If that one works, I would suggest further exploration of memory availability and use on your system.

Another thought is that I don’t think it’s necessary to use “large” IOAPI for your application. I have been running all my tests to try to replicate your issue on “regular” IOAPI.

Unfortunately, I don’t have any more concrete suggestions to offer.

Sergey

This may be the point.

On my machine, the command

csh -c 'limit stacksize unlimited && limit stacksize'

prints stacksize 10240 kbytes, while the command

sudo csh -c 'limit stacksize unlimited && limit stacksize'

prints stacksize unlimited.

I use && here to prove that limit stacksize unlimited fails silently if not elevated.
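For anyone launching the model from bash rather than csh, the same check can be sketched with the `ulimit` builtin, where `-Ss` and `-Hs` report the soft and hard stack limits. This is only an illustration under that assumption; the function names are hypothetical:

```sh
#!/bin/bash
# Print the soft and hard stack limits of the current shell.
# Values are in kbytes, or the literal word "unlimited".
report_stack_limits() {
    echo "soft=$(ulimit -Ss) hard=$(ulimit -Hs)"
}

# Try to raise the soft limit inside a subshell, so the calling shell
# is left untouched. Returns non-zero when the request is refused,
# for example because the hard limit is lower.
can_unlimit_stack() {
    ( ulimit -Ss unlimited ) 2>/dev/null
}

report_stack_limits
if can_unlimit_stack; then
    echo "soft stack limit can be raised to unlimited"
else
    echo "soft stack limit cannot be raised; check the hard limit"
fi
```

Checking the return status explicitly avoids relying on the shell to report a refused request.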

Then I add the following lines to the end of /etc/security/limits.conf:

* hard stack unlimited
* soft stack unlimited

Now limit stacksize always returns unlimited, but the same errors still exist when I run ISAM.

I have used PM_TOT as the only tag class and regular I/O API as you suggested.
Something may be going wrong before the stack size limit is reached.

Thanks,
Liu

Liu,

With the help of a colleague here, we may have uncovered something in the code relating to your issues. I will get back to you after I run some more tests.

Sergey

Hi Liu,

We found that one of the subroutines was coded with the incorrect assumption that organic aerosol water is present in all 3 aerosol modes, when actually it only exists in the accumulation mode. In the modes where it is absent, the index JAER(IM) is 0, which matches the out-of-bounds index in your gfortran trace. The changes below should prevent the code from performing calculations on the missing species. This should not impact any results, but we are still running tests. It appears that the model crashes are triggered by specific TAGCLASS/compiler optimization combinations that were missed in the testing so far. When our tests are complete, we will release the changes, probably through the 5.4+ branch on GitHub.

Please have a look at the code below and let me know if it addresses your crashes.

Thank you for pointing this out!

Sergey

Subroutine SA_WRAP_AE.F starting on line 330 is now as follows:

                 ! Assume that net gains are apportioned like current non-water aerosol
                 ! Determine the total non-water mass separating
                 ! inorganic and organic
                 SPEC_BULK0( C,R,L ) = SUM( SUM( ISAM0( C,R,L,:,: ), DIM=2),
     &                      MASK = (L_MASK_AERO .AND.
     &                            L_MASK_TYPE .NE. 'H2O'  .AND.
     &                            OMH2O .EQV. L_MASK_OM   .AND.
     &                            IM .EQ. L_MASK_IM   ) )
                 DO K = 1,NTAG_SA
                   IF ( JAER(IM) .NE. 0 ) THEN
                      ISAM1( C,R,L,JAER(IM),K ) = ISAM0( C,R,L,JAER(IM),K ) +
     &                      SUM( ISAM0( C,R,L,:,K ),
     &                          MASK = (L_MASK_AERO .AND.
     &                                L_MASK_TYPE .NE. 'H2O'  .AND.
     &                                OMH2O .EQV. L_MASK_OM   .AND.
     &                                IM .EQ. L_MASK_IM  ) )
     &                          /SPEC_BULK0( C,R,L ) * BULK_TRANS_SRC( C,R,L )
                   END IF
                 END DO

Basically, there are 2 additions of JAER(IM) checks.
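One way to confirm that the guards made it into a patched build directory is to count them. A minimal sketch, assuming a POSIX shell with grep; `count_jaer_guards` is a hypothetical helper, and the count is only meaningful relative to an unpatched copy of the file, since the routine may contain the same test elsewhere:

```sh
#!/bin/sh
# Hypothetical helper: print how many lines of FILE contain a
# "IF ( JAER(IM) .NE. 0 ) THEN" test, ignoring case and spacing.
count_jaer_guards() {
    grep -ciE 'IF *\( *JAER\( *IM *\) *\.NE\. *0 *\) *THEN' "$1" || true
}

# Example usage (paths are illustrative):
#   count_jaer_guards BLD_CCTM_v54_ISAM_intel_debug/SA_WRAP_AE.F
```

If the patched file reports two more matches than the original, both additions are in place.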

Hi Sergey,

All tests have passed with tag class PM_TOT or ALL for the gcc 6.1, Intel 21.4 and 2021.8 builds on my machine now.
Thank you very much. I look forward to the next release.

Liu

I’m glad to hear that it worked! I did just realize that I didn’t post the full block of code! But I guess the top missing part doesn’t trigger. This is the full change we are testing.

Starting on line 319:

            DO C = 1,NCOLS
            DO R = 1,NROWS
            DO L = 1,NLAYS
               IF ( BULK_TRANS_SRC( C,R,L ) .LT. 0.0 ) THEN
                 ! Assume that net losses pull proportionally from water species
                 IF ( JAER(IM) .NE. 0 ) THEN
                   SPEC_BULK0( C,R,L ) = SUM( ISAM0( C,R,L,JAER(IM),: ) )
                   ISAM1( C,R,L,JAER(IM),: ) = ISAM0( C,R,L,JAER(IM),: ) *
     &                       ( 1.0 + BULK_TRANS_SRC( C,R,L )/SPEC_BULK0( C,R,L ) )
                 END IF
               ELSE
                 ! Assume that net gains are apportioned like current non-water aerosol
                 ! Determine the total non-water mass separating
                 ! inorganic and organic
                 SPEC_BULK0( C,R,L ) = SUM( SUM( ISAM0( C,R,L,:,: ), DIM=2),
     &                      MASK = (L_MASK_AERO .AND.
     &                            L_MASK_TYPE .NE. 'H2O'  .AND.
     &                            OMH2O .EQV. L_MASK_OM   .AND.
     &                            IM .EQ. L_MASK_IM   ) )
                 DO K = 1,NTAG_SA
                   IF ( JAER(IM) .NE. 0 ) THEN
                      ISAM1( C,R,L,JAER(IM),K ) = ISAM0( C,R,L,JAER(IM),K ) +
     &                      SUM( ISAM0( C,R,L,:,K ),
     &                          MASK = (L_MASK_AERO .AND.
     &                                L_MASK_TYPE .NE. 'H2O'  .AND.
     &                                OMH2O .EQV. L_MASK_OM   .AND.
     &                                IM .EQ. L_MASK_IM  ) )
     &                          /SPEC_BULK0( C,R,L ) * BULK_TRANS_SRC( C,R,L )
                   END IF
                 END DO
               END IF
            END DO
            END DO
            END DO

Sergey

Hi Sergey,

Both the benchmark case and my real case run well without the JAER(IM) check at line 324.

Liu