Does SMKRUN use UNIX/Linux exit-status correctly for error-detection and run-control?
BACKGROUND
When the current modeling-system (of which SMOKE is a part) was first being developed thirty-odd years ago, a major problem was error-detection, and getting the automated run-control system to use that detection in order to behave correctly: in particular, to avoid running additional jobs after an error had happened. At the time, text-searches of program-logs were used for that detection. Particularly troublesome was the case when the system killed a job before the log could “scream for help.”
Fortunately, UNIX/Linux exit-status can be used to handle all of the problematic situations:
- If a job succeeds, its exit-status should be 0.
- If the system kills a job, its exit-status is a non-zero 8-bit integer describing the reason why: the shell reports 128 plus the number of the signal that killed the job, with signal numbers as given by the table in the system’s file "/usr/include/signal.h".
- All the programs were required to finish with I/O API routine M3EXIT, which “cleans up” and returns a user-specified exit-status, which should be 0 for success, 1 for I/O related error, and 2 for other errors.
This allows robust, fool-proof error detection which can (and should) be used, among other things, for job-control: only proceed with the next job if that status is 0; otherwise, log an appropriate error message and exit with an appropriate non-zero error status.
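As a rough illustration of the three cases above, here is a minimal sketch in Bourne-style shell (not csh, purely so that it is easy to try anywhere). The `sh -c` commands are hypothetical stand-ins for real modeling programs, and the self-inflicted SIGKILL only simulates the system killing a job.

```shell
#!/bin/sh
# Illustrative only: observe the exit status the shell reports in three cases.
# The "sh -c" children are stand-ins for real programs (an assumption).

# Case 1: a job that succeeds.
ok_status=0
sh -c 'exit 0' || ok_status=$?

# Case 2: a job that finishes on its own with a program-chosen error status
# (2 here, matching the "other errors" convention described above).
err_status=0
sh -c 'exit 2' || err_status=$?

# Case 3: a job killed by a signal before it can log anything.
# The child sends itself SIGKILL; the parent shell reports 128 + signal number
# (137 where SIGKILL is signal 9, as on Linux).
kill_status=0
sh -c 'kill -KILL $$' || kill_status=$?

echo "success: ${ok_status}  program error: ${err_status}  killed: ${kill_status}"
```

In all three cases the calling script learns the outcome from the status alone, without reading any log.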
What this means is that after running a job, the run-script should have something like the following csh-construct:
    # run the job, then save its exit-status before anything else overwrites it
    ${job}
    set foo = ${status}
    if ( ${foo} != 0 ) then
        echo "ERROR ${foo} on program ${job}..."
        exit ( ${foo} )
    endif
This should be true both for the low-level scripts that run single modeling programs, and for the automated run-control scripts that sequence multiple lower-level scripts: the original error would be detected, and it would quickly “bubble up” through the system, not running additional jobs that are guaranteed to fail because of failures in previous jobs.
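The “bubble up” behavior can be sketched as follows, again in Bourne-style shell rather than csh. The helper names `run_step` and `run_sequence`, and the three `step_*` functions standing in for lower-level scripts, are all hypothetical; this is one possible arrangement under those assumptions, not the actual SMOKE or CMAQ run-control.

```shell
#!/bin/sh
# run_step (hypothetical helper): run one job, log a message if it fails,
# and hand the job's own exit status back to the caller unchanged.
run_step() {
    "$@"
    step_status=$?
    if [ "${step_status}" -ne 0 ]; then
        echo "ERROR ${step_status} on program $1" >&2
    fi
    return "${step_status}"
}

# run_sequence (hypothetical helper): run the named jobs in order and stop
# at the first failure, returning that failure's status.
run_sequence() {
    for job in "$@"; do
        run_step "${job}" || return $?
    done
    return 0
}

# Stand-ins for three lower-level scripts; the second fails with status 2.
step_a() { echo "step_a ran"; }
step_b() { echo "step_b ran"; return 2; }
step_c() { echo "step_c ran"; }

seq_status=0
output=$(run_sequence step_a step_b step_c 2>/dev/null) || seq_status=$?
echo "${output}"
echo "sequence status: ${seq_status}"
```

Here `step_c` never runs, and the status of the sequence is the status of the job that failed. A real run-control script would finish with `exit ${seq_status}` so that whatever invoked it sees the same status in turn.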
CURRENT SYSTEMS for both CMAQ and SMOKE:
Unfortunately, a couple of decades ago some ignorant and/or lazy programming-contractor(s) did not know Linux systems-programming, did not read the documentation, and instead tried to re-invent the unreliable system used by the models forty years ago: exit-status is ignored, log-file text-searches are (unreliably) used in an attempt to detect errors, and the run-control system blindly keeps going in spite of whatever errors may have occurred.
And so, for many failures one has to chase back through dozens of log-files in order to find where the first error happened – which may be confusing, since the first error may have happened in a program before the first program that generated an error-message.
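To make the weakness of log-file text-searches concrete, here is a small sketch in Bourne-style shell. The “job” is a hypothetical stand-in that writes one normal progress line and is then killed (simulated with SIGKILL) before it can log any error; the search string "ERROR" is likewise an assumption about what a log-scanner might look for.

```shell
#!/bin/sh
# Temporary log file for the simulated job.
log_file=$(mktemp)

# Stand-in job: logs normal progress, then dies from SIGKILL without
# ever writing an error message to its log.
job_status=0
sh -c 'echo "Processing hour 1"; kill -KILL $$' > "${log_file}" 2>&1 || job_status=$?

# Detection method 1: search the log for an error message.
if grep -q "ERROR" "${log_file}"; then
    log_verdict="failed"
else
    log_verdict="looks fine"
fi

# Detection method 2: check the exit status.
if [ "${job_status}" -ne 0 ]; then
    status_verdict="failed"
else
    status_verdict="ok"
fi

echo "log search says: ${log_verdict}; exit status says: ${status_verdict}"
rm -f "${log_file}"
```

The log search finds nothing wrong because the job never had the chance to write an error, while the exit status reports the failure immediately.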