These statements have worked fine until now. I believe the CCTM finished successfully as usual, but the “PROGRAM COMPLETED SUCCESSFULLY” message did not appear in CTM_LOG_000, so the whole process stopped. The message was present in the other CTM_LOG files, though.
I would like to know whether the CCTM has finished successfully only if the “PROGRAM COMPLETED SUCCESSFULLY” message appears in CTM_LOG_000. Thanks!
What UNIX/Linux actually supports (at the system level) is the exit-status number of the executable: 0 for success, 1 for I/O errors, 2 for algorithm errors, negative values for system-killed, and other positive values for programmer-defined errors. The I/O API routine M3EXIT was designed to set these exit-status numbers.
What all these scripts should actually be checking is that status; the code for that is something like the following:
mpirun ${model}
set runstat = ${status}
if ( ${runstat} != 0 ) then
    echo "ERROR ${runstat} on program ${model} for ..."
    exit ( ${runstat} )
endif
And of course as soon as one program-execution fails, the entire script-system should likewise fail, rather than blindly barging on afterwards, generating garbage and a large set of logs which must all be examined in order to find the original failure and the reason for it.
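For readers who drive their runs from Python rather than csh, the same fail-fast idea might look roughly like the sketch below. The function names `describe` and `run_steps` are made up for illustration, and the commands passed in are whatever your own workflow runs. Note that Python's `subprocess` module reports a process killed by signal N as return code -N.

```python
import subprocess


def describe(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # Python's subprocess reports death-by-signal as a negative number.
        return f"killed by signal {-returncode}"
    return f"failed with exit status {returncode}"


def run_steps(steps):
    """Run each command in order and stop at the first non-zero status.

    `steps` is a list of argument lists (hypothetical placeholders for
    the real executables). Returns (index_of_failed_step, returncode),
    or (None, 0) if every step succeeded.
    """
    for index, command in enumerate(steps):
        returncode = subprocess.run(command).returncode
        print(f"step {index}: {describe(returncode)}")
        if returncode != 0:
            # Do not barge on: later steps would only produce garbage.
            return index, returncode
    return None, 0
```

A calling script would then exit with a non-zero status of its own whenever `run_steps` reports a failed step, so that any outer scheduler or wrapper also sees the failure.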
Of course, two generations of script-programmers ignorant of UNIX/Linux system programming have insisted on trying the sort of search-for-messages approach that you describe. This approach usually works, but can sometimes fail.
FWIW –
Carlie J. Coats, Jr., Ph.D.
I/O API Author/Maintainer
Original CMAQ systems architect.
I agree with @cjcoats that basing script error checking on exit status is the preferred approach.
That said, you wrote
What does the end of your CTM_LOG_000 file show? Could you please post a copy of that file? In my experience, the “PROGRAM COMPLETED SUCCESSFULLY” message appears in CTM_LOG_000 if a run was successful, just as it appears in all the other CTM_LOG files in that case.
In the “wrong” CTM_LOG_000, the messages on lines 21715 to 21720 were missing. In the other log files (e.g. CTM_LOG_100), lines 21715 to 21720 were present.
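If it helps with diagnosis, a quick way to see which per-processor logs lack the completion message is a small scan like the one below. This is only a sketch: the function name `logs_missing_marker` is hypothetical, and the `CTM_LOG_*` glob pattern is an assumption based on the file names mentioned in this thread, so adjust it to your actual log names. It is a diagnostic aid, not a substitute for checking the exit status.

```python
from pathlib import Path

MARKER = "PROGRAM COMPLETED SUCCESSFULLY"


def logs_missing_marker(log_dir, pattern="CTM_LOG_*", marker=MARKER):
    """Return the sorted names of log files in log_dir that lack the marker.

    The default pattern is an assumption; real log names may carry
    extra suffixes.
    """
    missing = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes.
        text = path.read_text(errors="replace")
        if marker not in text:
            missing.append(path.name)
    return missing
```

An empty list means every matching log contains the message; a single entry such as CTM_LOG_000 would reproduce the situation described here.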
Thanks for sharing. That’s interesting, and I don’t have a good explanation; I have not seen this happen before. That said, I also haven’t run any CCTM simulations that include DDM and end at 12:00, so maybe something in that setup triggers this behavior, though I don’t know what it might be. But if your CGRID and S_CGRID files were created successfully and all other output files have the correct number of time steps and reasonable fields for the last hour, it’s probably not worth your time to investigate this further.
One reason a run can stop without any error messages is that the disk where the output is being written is full, especially if the run script and the log files are on that same disk.
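A pre-run check for free space could be sketched as follows, using Python's standard `shutil.disk_usage`. The helper names `free_gib` and `has_room` are hypothetical, and the amount of space a run actually needs is something you would have to estimate from your own output sizes.

```python
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """True if at least `needed_gib` GiB are free where `path` lives."""
    return free_gib(path) >= needed_gib
```

Calling `has_room` on the output directory before launching the model, and again on the log directory if it is on a different disk, would catch the full-disk case before the run starts rather than after it dies silently.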
Actually, we have been running the CMAQ model (previously version 5.0.2, currently 5.3.3) for several years as an air quality forecast, and this is the first time we have encountered this problem. I think it is a low-probability event. Thank you for your time.