CMAQ v5.4 Benchmark Runtime Error: Illegal Seek

Hi all. I’ve compiled v5.4 successfully and am now trying to run it on my university’s cluster. I tested it in an interactive session a week ago and it worked fine, so I stopped the script with ctrl-c. Today, I’m running into errors. My module list has GCC/9.3.0 and OpenMPI/4.0.3, which I used to compile. Here is a paste of my run_cctm script. I had a previous issue where the program wasn’t finding my netCDF installation due to an error in my LD_LIBRARY_PATH; I edited my ~/.cshrc accordingly, which fixed it, and my LD_LIBRARY_PATH now contains my netCDF installation:

/scratch/gap6/LIBRARIES/netcdf/lib:/scratch/gap6/LIBRARIES/netcdf/lib:/opt/apps/software/OpenMPI/4.0.3-GCC-9.3.0/lib:/opt/apps/software/hwloc/2.2.0-GCCcore-9.3.0/lib:/opt/apps/software/libpciaccess/0.16-GCCcore-9.3.0/lib:/opt/apps/software/libxml2/2.9.10-GCCcore-9.3.0/lib:/opt/apps/software/XZ/5.2.5-GCCcore-9.3.0/lib:/opt/apps/software/numactl/2.0.13-GCCcore-9.3.0/lib:/opt/apps/software/zlib/1.2.11-GCCcore-9.3.0/lib:/opt/apps/software/GCCcore/9.3.0/lib64:/opt/apps/software/GCCcore/9.3.0/lib
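For reference, the relevant line in my ~/.cshrc looks something like this (a sketch; $NCDIR is just my shorthand for the netCDF prefix above):

setenv NCDIR /scratch/gap6/LIBRARIES/netcdf
setenv LD_LIBRARY_PATH ${NCDIR}/lib:${LD_LIBRARY_PATH}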

This pastebin link is my output, with an error, from running CMAQ on the cluster. Given this error, I’m assuming it’s not getting past the first Fortran file:

At line 88 of file setup_logdev.F
Fortran runtime error: Illegal seek

Line 88 here is

REWIND( LOGDEV )

Any advice would be appreciated. Thank you!

Your run script is using the following command:

( /usr/bin/time -p srun  -n $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

This differs from the default method, which uses mpirun:

  ( /usr/bin/time -p mpirun -np $NPROCS $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

It appears from your log file that the I/O API banner is repeated 16 times, once for each task, when typically it should appear once, followed by the environment variable report. It is almost as if each task is trying to run its own copy of CMAQ, rather than the work within CMAQ being divided evenly among the tasks.

Please try the default method on your cluster to see if it works.

You may need to load the modules required to run CMAQ, such as OpenMPI, in addition to setting LD_LIBRARY_PATH.
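For example, matching the versions you compiled with:

module load GCC/9.3.0 OpenMPI/4.0.3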

Per my university’s cluster guidance, I’m using srun to launch MPI jobs. I am a bit worried that it is trying to start 16 separate CMAQ runs, one on each processor, but I’m not really sure how to fix that. I’ve loaded OpenMPI version 4.0.3.

Perhaps use srun without the -n $NPROCS.

( /usr/bin/time -p srun  $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt

The srun command should pick up the number of tasks you request in the #SBATCH header of the run script.
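For example, a minimal sketch of such a header (the job name and time limit here are placeholders; adjust for your cluster):

#!/bin/csh -f
#SBATCH --job-name=CMAQ_bench
#SBATCH --ntasks=16
#SBATCH --time=04:00:00

( /usr/bin/time -p srun $BLD/$EXEC ) |& tee buff_${EXECUTION_ID}.txt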

I tried that; it’s the same issue, unfortunately. The part that confuses me the most is the Fortran error. LOGDEV seems to be a log-file unit managed by I/O API, so I wonder if I’ve built I/O API incorrectly. I did not install netCDF-4 support, per recommendations on this forum and the I/O API website. Could that be why I can’t use the 2018 benchmark? I know it was mentioned somewhere that this may cause problems.

Please try using this version of the benchmark data that has been uncompressed:

https://cmas-cmaq.s3.amazonaws.com/index.html#CMAQv5.4_2018_12US1_Benchmark_2Day_Input_uncompressed/

Sorry, I just remembered that you are using the 12NE3 domain, which should already be uncompressed. You can test whether I/O API is compiled correctly by using one of the m3tools, such as m3stat.
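For example (a sketch, assuming m3stat is on your PATH and follows the usual I/O API convention of reading the input file through a logical name):

setenv INFILE GRIDBDY2D_20171222.nc
m3stat INFILE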

You can also use the following command to download a file that uses netCDF-4, for comparison with netCDF-3:

wget https://cmas-cmaq.s3.amazonaws.com/CMAQv5.4_2018_12US1_Benchmark_2Day_Input/2018_12US1/met/WRFv4.3.3_LTNG_MCIP5.3.3_compressed/GRIDBDY2D_20171222.nc4

Then use the ncdump -k command to identify the file format:

ncdump -k GRIDBDY2D_20171222.nc4

Output:

netCDF-4 classic model

A file that uses the netCDF-3 classic model can be obtained similarly:

wget https://cmas-cmaq.s3.amazonaws.com/CMAQv5.4_2018_12US1_Benchmark_2Day_Input/2018_12US1/met/WRFv4.3.3_LTNG_MCIP5.3.3_compressed/GRIDBDY2D_20171222.nc

Then use the ncdump -k command on this file:

ncdump -k GRIDBDY2D_20171222.nc

Output:

classic

Are you also following this guidance:

https://kb.rice.edu/page.php?id=108436

To ensure that your job will be able to access an mpi runtime, you must load an mpi module before submitting your job as follows:

module load GCC OpenMPI

I discovered the path to my netcdf/bin directory was malformed in ~/.cshrc and fixed it (note: the CMAQ install docs show $NCDIR\bin instead of $NCDIR/bin, which was my issue). I also added my I/O API binaries to my PATH. I rebuilt CMAQ (in case the incorrect path was the culprit), but I still get the same “Illegal seek” error. m3stat ran fine on the netCDF-3 classic file you sent me, so I don’t know whether I should try rebuilding I/O API.
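For reference, the corrected ~/.cshrc lines now look something like this (a sketch; $IOAPI_DIR is just my placeholder for wherever the I/O API executables were built):

# $NCDIR is my netCDF prefix; $IOAPI_DIR is a placeholder, not a standard variable
setenv PATH ${NCDIR}/bin:${PATH}
setenv PATH ${IOAPI_DIR}/bin:${PATH}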

Yes, I have loaded the same versions of GCC and OpenMPI I used for compiling. Here’s the output of module list:

Make sure that all CTM_LOG files have been deleted before submitting a new run.

rm CTM_LOG*

There are no CTM_LOG files in CCTM/scripts/, and $CMAQ_DATA/output_CCTM_v54_gcc9.3.0_Bench_2018_12NE3/LOGS is also empty.

Please verify that you are using the sbatch command to submit the run script to the SLURM scheduler.

sbatch run_cctm_Bench_2018_12NE3.csh

@cjcoats and @fsidi do you have any other suggestions for what may be wrong?

Yes, that’s my run command. Apologies for how difficult this issue is turning out to be.

@gpara35,

Unfortunately, I don’t have much advice for you, because on my system srun doesn’t start parallel MPI jobs as expected, most likely because my version of SLURM wasn’t installed with the proper linkages to MPI. However, I am following up with my system administrator.

From my tests, with OpenMPI, srun doesn’t work at all and crashes immediately with an error saying that SLURM wasn’t configured with this option. Intel MPI seems to try, but launches several copies of the executable instead of one copy with the work divided.

I advise you to do two things:

  1. Work with your system administrator to understand the linkages between MPI and SLURM on your system.
  2. Work with a smaller problem first, to verify that you can run an MPI job at all. For my test cases I don’t run CMAQ but a simple “Hello, World” MPI code; once I verify that works, I know how to submit MPI jobs on my system. Something like the below:

program hello_mpi
   ! Minimal MPI test: each task reports its rank and the total task count.
   use mpi
   implicit none
   integer :: ierr, mype, npes

   call MPI_INIT( ierr )
   call MPI_COMM_SIZE( MPI_COMM_WORLD, npes, ierr )   ! total number of tasks
   call MPI_COMM_RANK( MPI_COMM_WORLD, mype, ierr )   ! this task's rank
   write(*,*) 'Hello, world from process ', mype, ' of ', npes
   call MPI_FINALIZE( ierr )
end program hello_mpi

I compiled this program using my compiler of choice and then tried to run it in parallel. A successful run should look something like this with 4 processors:

$shell: mpirun -np 4 hello_world_mpi.exe
Hello, world from process 3 of 4
Hello, world from process 1 of 4
Hello, world from process 2 of 4
Hello, world from process 0 of 4
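For reference, compiling the example with the MPI wrapper might look like this (assuming the source is saved as hello_mpi.f90):

mpifort -o hello_world_mpi.exe hello_mpi.f90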


Thanks for the detailed response. I was able to compile and run the “Hello, world” program, with the output below. I compiled with the flags OpenMPI reported from running mpifort --showme:link.

[gap6@nlogin1 gap6]$ srun --partition=interactive --ntasks=4 --mem=1G --time=00:01:00 hello_world
srun: job 7849588 queued and waiting for resources
srun: job 7849588 has been allocated resources
 Hello, world from process            0  of            4
 Hello, world from process            1  of            4
 Hello, world from process            2  of            4
 Hello, world from process            3  of            4

Looks like it worked. I will also try to get in touch with a previous lab member, as they were able to run CMAQ on Rice computing resources. The code of theirs that I’ve seen takes a similar approach to mine, so I’m not sure what I’m doing wrong.

EDIT 1: One possible source of error: for compiling the dependencies (netCDF, netCDF-Fortran, and I/O API), I believe I used gfortran and gcc for my FC and CC, respectively. Was I supposed to use mpifort and mpicc so the wrapper versions would pass the right flags for my MPI configuration?

EDIT 2: I tried recompiling all dependencies, consistently using mpifort and mpicc for compilation. Still the same problem.

As you know, mpifort and mpicc are really “wrappers” for a pair of underlying Fortran and C compilers (gfortran and gcc in your case, I think). Unless you’re using the distributed-I/O variant, neither the I/O API nor netCDF uses MPI at all, so it is sufficient to compile them with those underlying Fortran and C compilers.

If you are using distributed-I/O, then you of course do need to use the same mpifort and mpicc as for your CCTM.
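For example, a serial netCDF-C configure along those lines might look like this (a sketch; the --disable flags reflect the no-netCDF-4 recommendation mentioned earlier in this thread):

# plain gcc/gfortran are sufficient for a serial (non-MPI) build
setenv CC gcc
setenv FC gfortran
./configure --prefix=$NCDIR --disable-netcdf-4 --disable-dap
make check install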


Thank you all for the help. I was able to successfully run the CMAQ benchmark in parallel. Only two things changed:

  1. My cluster updated their filesystem. While this was not obviously related to my issue, it could have been part of the problem.
  2. I recompiled everything using more recent Intel compilers. I suspect this is what helped, as switching to Intel has resolved similar issues in previous users’ forum posts.