MPI_ABORT CMAQ v5.3.1

Hello,

I am trying to run the two-day benchmark on the new CMAQ v5.3.1, but I keep running into the error below. The same Open MPI setup did work with v5.3.
MPI_ABORT was invoked on rank 9 in communicator MPI_COMM_WORLD
with errorcode 538976288.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[AdamsTower2:27095] 11 more processes have sent help message help-mpi-api.txt / mpi-abort
[AdamsTower2:27095] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Command exited with non-zero status 32
real 4336.04
user 0.20
sys 0.83


** Runscript Detected an Error: CGRID file was not written. **
** This indicates that CMAQ was interrupted or an issue **
** exists with writing output. The runscript will now **
** abort rather than proceeding to subsequent days. **


==================================
***** CMAQ TIMING REPORT *****

Start Day: 2016-07-01
End Day: 2016-08-01
Number of Simulation Days: 1
Domain Name: 2016_12SE1
Number of Grid Cells: 280000 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 12
All times are in seconds.

Num Day Wall Time
01 2016-07-01 4336.04
Total Time = 4336.04
Avg. Time = 4336.04

cctm.log.txt (943 Bytes)

When running a job in parallel, there is a “main” log file, and “ancillary” log files output by each processor. By default, these have names like CTM_LOG_000..

The message indicates that process 9 aborted, so check the CTM_LOG_009 file.
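If it is not obvious which per-process log holds the failure, one option is to scan all of them for abort messages. The helper below is a hypothetical sketch, not part of CMAQ: it assumes the ancillary logs are in a single directory, that their names start with CTM_LOG_, and that a failure is marked by the text "ERROR ABORT".

```python
from pathlib import Path


def find_aborts(log_dir, pattern="CTM_LOG_*", marker="ERROR ABORT"):
    """Return {log file name: [lines containing marker]} for logs that aborted.

    The file pattern and the marker text are assumptions; adjust them to
    match the log names and messages on your system.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        text = path.read_text(errors="replace")
        lines = [line.strip() for line in text.splitlines() if marker in line]
        if lines:
            hits[path.name] = lines
    return hits
```

Calling find_aborts(".") from the directory that holds the logs returns a dictionary keyed by log file name, so the processes that actually reported an abort can be read first.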

Is it possible that “CSQY_DATA_cb6r3_ae7_aq not found” caused the abort?

That message seems to be coming from the shell, rather than from the program. But yes, if the CSQY file is missing, the model will abort.

This is the error I see in the log file:

 >>--->> WARNING in subroutine RDTFLAG
 Error reading netCDF time step flag for MET_CRO_3D
 M3WARN:  DTBUF 1:00:00   July 2, 2016  (2016184:010000)

 >>--->> WARNING in subroutine XTRACT3
 Time step not available for file:  MET_CRO_3D
 M3WARN:  DTBUF 1:00:00   July 2, 2016  (2016184:010000)


 *** ERROR ABORT in subroutine retrieve_time_de on PE 001
 Could not extract MET_CRO_3D       file

PM3EXIT: DTBUF 1:00:00 July 2, 2016
Date and time 1:00:00 July 2, 2016 (2016184:010000)

It happens as soon as the model switches to the new day in the two-day benchmark run:

Processing Day/Time [YYYYDDD:HHMMSS]: 2016184:000000
Which is Equivalent to (UTC): 0:00:00 Saturday, July 2, 2016
Time-Step Length (HHMMSS): 000500

The error message says that the MET_CRO_3D file has no data for 1:00:00 on July 2, 2016 (2016184:010000). What is the value of the MET_CRO_3D environment variable? Check the file with that name: it does not contain data for that time stamp.
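The stamps in these messages use the form YYYYDDD:HHMMSS, that is, year plus day of year followed by the time. As a quick check of which calendar date a stamp refers to, here is a small Python sketch (the function name is made up for illustration):

```python
from datetime import datetime


def parse_ioapi_stamp(stamp):
    """Convert 'YYYYDDD:HHMMSS' (year + day of year, then time) to a datetime."""
    return datetime.strptime(stamp, "%Y%j:%H%M%S")


when = parse_ioapi_stamp("2016184:010000")
print(when.isoformat())  # 2016-07-02T01:00:00
```

Day 184 of 2016 is July 2, so the model was asking for the 01:00 record on the second day.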

The distributed run script is set up to process a series of 24-hour runs. Each daily run has its own set of input and output files set up by the script, so that when you begin simulating July 2, the MET_CRO_3D file will be something like METCRO3D_20160702.nc.

In principle, one can instead do a 2-day run with a single execution of the model. However, the run script needs to be modified, and (most importantly) the input files need to be prepared so that they have all the needed time steps for the entire model run.

Is the input benchmark data not fully prepared? That's how it is described in the tutorial on GitHub.

There is a METCRO3D_20160702.nc file.

Also, in the log: OPEN3 will write the file's dimensions and time step sequence parameters to the log. What does that say? Is it a time step sequence that should contain this date and time?
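To make that check concrete, here is a rough Python sketch that tests whether a requested date and time falls on one of a file's time steps, given a starting date, starting time, time step, and number of records. The function names are hypothetical and the numbers in the example (an hourly file with 25 records starting at 2016183:000000) are purely illustrative; read the real values from the OPEN3 output in your own log.

```python
from datetime import datetime, timedelta


def hhmmss_to_timedelta(hhmmss):
    """Convert an integer in HHMMSS form (10000 means 1 hour) to a timedelta."""
    hours, rest = divmod(hhmmss, 10000)
    minutes, seconds = divmod(rest, 100)
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def file_covers(sdate, stime, tstep, nrecs, target):
    """True if `target` is one of the nrecs time steps starting at sdate:stime.

    sdate is YYYYDDD, stime and tstep are HHMMSS integers; tstep must be > 0.
    """
    start = datetime.strptime(f"{sdate:07d}", "%Y%j") + hhmmss_to_timedelta(stime)
    step = hhmmss_to_timedelta(tstep)
    last = start + (nrecs - 1) * step
    on_grid = (target - start) % step == timedelta(0)
    return start <= target <= last and on_grid


# Illustrative values only: hourly file, 25 records, starting July 1 at 00:00.
print(file_covers(2016183, 0, 10000, 25, datetime(2016, 7, 2, 0, 0)))  # True
print(file_covers(2016183, 0, 10000, 25, datetime(2016, 7, 2, 1, 0)))  # False
```

With those illustrative values the file ends at 0:00 on July 2, so a request for 1:00 on July 2 falls outside it, which is the kind of mismatch the XTRACT3 warning reports.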

@lawless:
I think you are following this tutorial.
You have gotten to the step where you execute this command:

./run_cctm_Bench_2016_12US1.csh |& tee cctm.log

That script runs the model for one day: July 1, 2016. Did that execute correctly? The log file you posted above is much shorter than you should get if the first day ran correctly.

If the first day did run correctly, how did you modify the script to run a second day?

That is where I am. I'm just making sure I can get CMAQ v5.3.1 working before switching over from my already working v5.3.

A one-day run executes correctly.

This is the only section I changed:

#> Set Start and End Days for looping
setenv NEW_START TRUE #> Set to FALSE for model restart
set START_DATE = "2016-07-01" #> beginning date (July 1, 2016)
set END_DATE = "2016-07-02" #> ending date (July 14, 2016)

#> Set Timestepping Parameters
set STTIME = 000000 #> beginning GMT time (HHMMSS)
set NSTEPS = 480000 #> time duration (HHMMSS) for this run
set TSTEP = 010000 #> output time step interval (HHMMSS)

I've been running the benchmark in the background after redownloading the input data, and it completed successfully.

Hi,

Please leave the NSTEPS variable at 240000.
The comment should be changed to: time duration (HHMMSS) for each day
The script is designed to run the CMAQ executable once for each day (24-hour period).

set NSTEPS = 480000

should be

set NSTEPS = 240000

The intended way to run that script for multiple days is to leave NSTEPS at 240000 and to modify END_DATE (as you have done). The model then runs twice, with each run executing for 1 day (24 hours).
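The looping behaviour can be pictured with a short Python sketch. This is only an illustration of the logic described here, not the csh run script itself, and the function name is hypothetical. It treats END_DATE as inclusive, which matches the model running twice for 2016-07-01 through 2016-07-02.

```python
from datetime import date, timedelta


def daily_runs(start_date, end_date, nsteps="240000"):
    """List (day, NSTEPS) pairs: one model execution per day, END_DATE inclusive."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    n_days = (end - start).days + 1
    return [((start + timedelta(days=i)).isoformat(), nsteps) for i in range(n_days)]


for day, nsteps in daily_runs("2016-07-01", "2016-07-02"):
    print(day, nsteps)
# 2016-07-01 240000
# 2016-07-02 240000
```

Each execution covers 24 hours and uses that day's input files, so extending the simulation means moving END_DATE, not increasing NSTEPS.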
