MPIRUN error for CMAQ 4.7.1 (Segmentation fault - invalid memory reference)

@cjcoats @wong.david-c

Dear Carlie and David,

I am using the CMAQ adjoint, and my MPI runs no longer work after a cluster update. I know it is a very old version and you might not be interested, but I desperately need your help to resolve this issue. I have tried both the GNU and Intel compilers. Here is what I get with gcc.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

#0 0x14c6ba797acf in ???
#1 0x14c6ba815c95 in ???
#2 0x14c6ba799c50 in ???
#3 0x404d2e in ???
#4 0x5193c5 in par_init_
at src/par_init.F:162
#5 0x5e0e85 in driver_fwd
at src/driver_fwd.F:179
#6 0x5e3161 in main
at src/driver_fwd.F:44

Here is driver_fwd.F around line 179 (the CALL PAR_INIT in the backtrace):

C Start up processor communication and define horizontal domain decomposition
C and vertical layer structure
      CALL PAR_INIT(COLROW, NSPCSD, CLOCK, PAR_ERR)
      IF ( PAR_ERR /= 0 ) THEN
         XMSG = 'Error in PAR_INIT'
         CALL M3EXIT(PNAME, JDATE, JTIME, XMSG, XSTAT1)
      END IF

      LOGDEV = INIT3()

      IF ( NSPCSD .GT. MXVARS3 ) THEN
         WRITE(XMSG,'(5X, A, I5, A)') 'The number of variables,', NSPCSD,
     &        ' to be written to the State CGRID File'
         WRITE(LOGDEV, '(A)') XMSG
         WRITE(XMSG,'(5X, A, I5)') 'exceeds the I/O-API limit:', MXVARS3
         WRITE(LOGDEV, '(A)') XMSG
         XMSG = 'Recompile with an I/O-API lib having a larger MXVARS3'
         WRITE(LOGDEV, '(5X, A)') XMSG
         CALL M3EXIT(PNAME, JDATE, JTIME, ' ', XSTAT1)
      END IF

Also, below is distr_env, which is called from par_init:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

extern char **environ;

/* TEMP_BUF_SIZE, CURR_STR_SIZE, and the DEBUG macro are defined
   elsewhere in the pario source. */
extern void distr_env_ (int *myid_p, int *numprocs_p)
{
   char **environ_ptr;
   int env_size, total_size, total_size_0, str_size, avail_size;
   int myid, numprocs;
   char temp_buf[TEMP_BUF_SIZE], curr_str[CURR_STR_SIZE], *curr_ptr, *curr_name, *curr_val;
   int ret, i, error;

   myid = *myid_p;
   numprocs = *numprocs_p;

   if (myid == 0)
      { environ_ptr = environ;
        env_size = 0;
        total_size = 0;
        i = 0;
        while (environ_ptr[i++] != NULL)
          {   env_size++;
              total_size = total_size + strlen(environ_ptr[i-1]) + 1;
          }

        DEBUG( printf ("last of environment context is %s, total_size is %d. \n",
                environ_ptr[env_size-1], total_size); )

        total_size_0 = total_size;
        curr_ptr = temp_buf;
        avail_size = TEMP_BUF_SIZE;

        for (i=0; i<env_size; i++)
            { str_size = strlen(environ_ptr[i]);
              if ( (environ_ptr[i] != NULL)&&(avail_size > str_size) )
                 { strcpy (curr_ptr, environ_ptr[i]);
                   curr_ptr = curr_ptr + str_size + 1;
                   avail_size = avail_size - str_size - 1;
                 }
              else
                 {
                   printf ("your temp_buf in distr_env may not be big enough to ");
                   printf ("hold the next environment pair \n");
                   exit (1);
                 }
            }

      }

   error = MPI_Bcast (&total_size_0, 1, MPI_INT, 0, MPI_COMM_WORLD);

   error = MPI_Bcast (temp_buf, total_size_0, MPI_CHAR, 0, MPI_COMM_WORLD);

   if (myid != 0)
   {
      DEBUG( printf ("total_size_0 is: %d \n", total_size_0); )

      curr_ptr = temp_buf;
      while (curr_ptr < temp_buf+total_size_0)
      {
         if (strlen(curr_ptr) < CURR_STR_SIZE)   /* strcpy needs strlen+1 bytes for the '\0' */
         {
            strcpy (curr_str, curr_ptr);
            curr_ptr = curr_ptr+strlen(curr_str)+1;
         }
         else
         {
            printf ("The curr_str buffer is not big enough! \n");
            exit (1);
         }

         DEBUG( printf ("The current environment name=value pair is: %s \n", curr_str); )

         curr_name = strtok (curr_str, "=");
         curr_val = strtok (NULL, "\0");

         if ( (ret = setenv (curr_name, curr_val, 0)) != 0 )
         {
            printf ("error in setting environment variable %s = %s. \n", curr_name, curr_val);
            exit (1);
         }

         DEBUG( printf ("check the environment variable %s = %s. \n", curr_name, getenv(curr_name)); )
      }

      /* MPI_Barrier (MPI_COMM_WORLD); */
   }
/*
   else
   {
      MPI_Barrier (MPI_COMM_WORLD);
   }
*/
      MPI_Barrier (MPI_COMM_WORLD);

}

Your help is greatly appreciated.

I found a description of how to rebuild the PARIO library, which you likely need to do since you are using new netCDF and I/O API libraries.

https://www.airqualitymodeling.org/index.php/CMAQ_version_5.0_(February_2010_release)_OGD#Description_9

PARIO compilation

First, it is assumed that you have already installed and compiled the I/O API, netCDF, and MPICH libraries (see Section 3.2.3), or that these are already available from a previous CMAQ compilation.

Section 3.3 provides an overview of how to install and compile the CMAQ programs for the tutorial simulation. Follow the steps outlined in Section 3.3 (summarized below) to compile new versions of PARIO:

        If you have not already done so, compile Bldmake, the CMAQ source code and compilation management program. This needs to be done only once—the first time CMAQ is installed.
        If needed, configure the PARIO build script to use the available I/O API and MPICH libraries.
        Invoke the build script to create an executable:

./bldit.pario

PARIO execution options

Because PARIO is not a program, it does not have an associated run script.
PARIO output files

Successful compilation of PARIO will produce the library file libpario.a, along with several module files, in the $M3LIB/pario/$OS directory.

Did you do this?

Thank you @lizadams.

Yes. I am using STENEX and PARIO that come with the package, and they were working fine before the cluster update.

My concern is that if the cluster update required new versions of I/O API, netCDF, and MPICH, then you may need to also rebuild the pario library.

I appreciate your time @lizadams.

I had been using netCDF 3.6.3 and I/O API 3.1 before the update. My friend was able to run CMAQ 5.2 with these same libraries after the cluster update.

Are you using a new version of MPI due to the cluster update?

Yes, I believe so. It is newer than the previous one.

If that is the case, then you likely need to recompile the pario library, in addition to recompiling CMAQ.

Thank you @lizadams, I have done that too.

A few more recommendations:

Add the following to your .cshrc file.

limit stacksize unlimited

Try the -fbounds-check compiler option.

Thank you @lizadams. It is still not working. It is not working with intel either.

@cjcoats Can you please help me with this?

ifort: everything OK, but with gfortran: segmentation fault

Aha!

You need to compile everything (and that includes MPI as well as netCDF, I/O API, PARIO, and the CMAQ CCTM) with the same compiler set and with compatible compiler flags. gfortran and ifort use different run-time libraries, so in a mixed-compiler build, code compiled with one compiler may not (and probably will not) be able to use the other compiler's libraries.