CMAQ v5.2.1 run-time error/stop

Hello,
I have CMAQv5.2.1 installed and have run it successfully before, but a new run I’ve set up stops after a couple of hours, and I’ve hit a wall trying to figure out why.

This new (failed) run has inputs identical to a successful run on a second machine (also CMAQv5.2.1, but built with different compilers and libraries), except for the BCON and met files. I’m confident that the BCON files are not the issue. The met files could be the issue, but (1) I’ve run tests with different days (June 20, January 1) and the model crashes regardless, and (2) changing MAXSYNC doesn’t help (see the snippet below my questions). So my questions are:

Is there anything that stands out in my configuration that could cause the run to stop a couple hours in?
Is there anything you recommend I test?
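
For reference, here is how I changed the synchronization time step in the CCTM run script. The variable names below follow the stock v5.2.1 run script, so treat this as a sketch if your script differs:

setenv CTM_MAXSYNC 300    # max sync time step (sec); this is the value I varied
setenv CTM_MINSYNC 60     # min sync time step (sec)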

Thank you!

output log file:
https://uwmadison.box.com/s/9niwqsyhwm60wsbqxouhxqhy1qvbarct

If I’m reading you correctly, you have successfully used this version of the code before, but you have now changed (a) the met files, (b) the BC files, and (c) the compilers and libraries, and the model dies after 4 hours with a hangup.

The fact that this happens for three different days strongly suggests it is not an issue with the met data. However, it would be easy to test that by using the exact same met data you ran successfully with before: can you get that day to complete, or not? Or can you get the benchmark tutorial domain to run successfully?

Your log file includes a lot of MPI (PMI) debug lines that I have not seen in CMAQ output before. Did you (or your sysadmin) compile mpich in debug mode? Could it be running so slowly (due to debug flags) that a timeout occurs? For the 4 hours that do complete, do they execute in a reasonable amount of time, or is the model running very slowly?
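
If you want to check how MPICH was configured before asking your sysadmin, the mpichversion utility that ships with MPICH prints the configure options it was built with. This sketch assumes MPICH specifically (not another MPI) and that the utility is on your PATH:

mpichversion | grep -i 'configure'
# debug builds typically show an option such as --enable-g=dbg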

Finally, consider updating to CMAQv5.3.2. I don’t recall anything specific that would fix this, but you might review the release notes between v5.2.1 and now.

Thanks for the speedy reply!
I did test my install with the met files that worked on the other machine/install, and that test run also stopped after a couple of hours. I wasn’t sure whether to chalk that up to those met files having been created with MCIP and CMAQ compiled against IOAPI 3.2, whereas the version I’m testing now was built against IOAPI 3.1.
I’ll try running the WRF files from the other machine through MCIP with IOAPI 3.1 to see what happens.
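
In case it helps with the comparison, the netCDF headers of the two sets of met files can be dumped and diffed with ncdump, which comes with netCDF; the file name here is just an example:

ncdump -h METCRO2D_160620 | head -40
# compare dimensions and global attributes between the two machines' files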

I was able to successfully benchmark the version I’m testing now, and the model doesn’t seem to be running unusually slowly for the number of processors on this machine. I will check with our sysadmin about mpich; I’ve already asked him to upgrade our ifort and netCDF libraries so I can install CMAQv5.3.2.

Edit: here are the log files.
From a test using met from another day: https://uwmadison.box.com/s/1w5h4b1cr4hy7tph1m0c6m9boqgr80tw
And from a test using the met that was used in a successful simulation on a different machine: https://uwmadison.box.com/s/frmx2i8k4jvd39mkthamhiqf9ir8o6v0

The log file that you provided has the following output at the end.
Timestep written to CTM_AVIS_1 for date and time 2016172:030000
after AERO G 1.2345996E-01 A 1.4021042E+09 N 1.2794050E-04

 After NEXTIME: returned JDATE, JTIME 2016172 040000

APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
58554.009u 1929.346s 1:54:11.91 882.7% 0+0k 143402712+58204384io 1827770pf+0w

Can you share the CTM_LOG files that were moved to this location?
/archive5/harkey/CMAQv5.2/cctm/July2016/LOGS
Typically, you need to check both the standard output in the June20_test-SYNC300.log and the per-processor log output that gets moved to the $OUTPUT/LOGS directory.

CTM_LOG_001.v521_intel_12US2_cb6r3_ae6_aq_2016fi_16j_20160620
where CTM_LOG_001* is the log from the first processor, and so on.
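
A quick way to scan all of the per-processor logs at once (the path follows the location above; adjust to your setup):

grep -iE 'error|abort' /archive5/harkey/CMAQv5.2/cctm/July2016/LOGS/CTM_LOG_*
# a rank that stopped for a model reason usually says so near the end of its log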

Do you have the following setting in your .cshrc file?
limit stacksize unlimited
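
If any part of your job runs under bash rather than csh/tcsh, the equivalent setting there is:

ulimit -s unlimited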

Yes, my .tcshrc has limit stacksize unlimited. It seemed like everything in the CTM_LOG files was covered in the piped screen-output log, but your eyes are more tuned to these details than mine! Here are the CTM_LOG files for the main test run: https://uwmadison.box.com/s/mmgxauayox5jpzcu6t4ukadwrwz9wsek

Thank you.

Your Jan01_test is dying after 1 hour, your June20_test after 4 hours, and June20_test_YNTSSTmettest in the first time step.
Since everything is crashing on this machine, it is not a met problem.
Your log files are full of PMI messages I have not seen before, including those saying “we don’t understand the response”.

Since it seems like an MPI problem, I suggest you try running a smaller domain on a single processor. If that works, then talk to your sysadmin about configuring MPI.
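
In the stock run script, that usually means setting the domain decomposition to one column and one row; the variable names below follow the v5.2.1 run script, so adjust if yours differs:

@ NPCOL = 1
@ NPROW = 1
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW"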

Do all of your nodes have the same /etc/hosts file?
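
A quick way to compare them, assuming passwordless ssh between nodes and substituting your actual host names for node1 and node2:

foreach h (node1 node2)
    ssh $h md5sum /etc/hosts
end
# identical checksums mean identical files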