'Could not write CTM_DEPV_DIAG file' error

Does the CMAQ runtime error ‘Could not write CTM_DEPV_DIAG file’ indicate anything specific? A screenshot of the error is below:
[screenshot of the CMAQ log showing the ‘Could not write CTM_DEPV_DIAG file’ error]

The problem occurs on about 5 of the 176 PEs that I am using on the cluster. I cannot think of a reason why the deposition velocity diagnostic file could not be written after having been written successfully for 40 days of the simulation.

Update: when I change only the value of the environment variable
setenv CTM_DEPV_FILE Y ###N #> deposition velocities diagnostic file [ default: N ]
back to the default value N, CMAQ runs smoothly. What would lead CMAQv5.3.2 to fail to write the CTM_DEPV_DIAG file and abort the run?

There is probably a bug somewhere. Maybe it is only in the code that writes out the diagnostic file, so it is not affecting your actual simulation, but we should try to track this down.
If you restart the simulation on Feb. 12, 2014, with the CTM_DEPV_DIAG flag set to Y, does the model crash at the same place?
If so, recompile the model in debug mode and rerun it. See where it crashes and what the error message is.

Thanks @cgnolte, I actually tried to restart the simulation on Feb 12, 2014 with CTM_DEPV_DIAG set to Y, and it crashed at the exact same time.

Then I recompiled CMAQ in debug mode (make clean; make DEBUG=TRUE) and ran a job, but the job exited within 5 seconds without producing any log, intermediate, or output files. The part of the standard output containing the error is below:

setenv CTM_STDATE 2014043
setenv CTM_STTIME 000000
setenv CTM_RUNLEN 240000
setenv CTM_TSTEP 010000
setenv INIT_CONC_1 /nas/scratch/cmaq_run_18may2020/cmaq_out/CCTM_CGRID_v532_intel_2016_CONUS_20140211.nc
setenv BNDY_CONC_1 /nas/scratch/cmaq_run_18may2020/cmaq_in/icbc/m3interp.3hrInterv.m3tshift.minus.3hr.bc.12km.2014H1.cmaq.cb6r3_ae7.lst.ncf.24nov2020.36layers
setenv OMI /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/OMI_1979_to_2019.dat
setenv OPTICS_DATA /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/PHOT_OPTICS.dat
set TR_DVpath = /nas/scratch/cmaq_run_18may2020/cmaq_in/met
set TR_DVfile = /nas/scratch/cmaq_run_18may2020/cmaq_in/met/d02_2014H1_02_21nov2020/METCRO2D_d02_2014H1_02_21nov2020.nc
setenv gc_matrix_nml /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/GC_cb6r3_ae7_aq.nml
setenv ae_matrix_nml /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/AE_cb6r3_ae7_aq.nml
setenv nr_matrix_nml /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/NR_cb6r3_ae7_aq.nml
setenv tr_matrix_nml /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/Species_Table_TR_0.nml
setenv CSQY_DATA /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/CSQY_DATA_cb6r3_ae7_aq
if ( ! ( -e /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/CSQY_DATA_cb6r3_ae7_aq ) ) then
if ( ! ( -e /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/PHOT_OPTICS.dat ) ) then
if ( 0 != 0 ) then
echo

echo CMAQ Processing of Day 20140212 Began at `date`
date
CMAQ Processing of Day 20140212 Began at Tue Dec  1 18:09:40 EST 2020
echo

limit stacksize unlimited
srun -n 176 --mpi=pmi2 /proj/MYLIB_PROJ/cmaqv5.3.2_official_git_17oct2020/CMAQ_Project_17oct2020/CCTM/scripts/BLD_CCTM_v532_intel/CCTM_v532.exe
tee buff_CMAQ_CCTMv532_20201201_230940_450992317.txt
srun: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
[cli_135]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(490)............:
MPID_Init(381)...................: channel initialization failed
MPIDI_CH3_Init(320)..............: rdma_get_control_parameters
rdma_get_control_parameters(1534):
rdma_open_hca(701)...............: No active HCAs found on the system!!! 0

The same error repeats through [cli_175] in the standard output:

[cli_175]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(490)............:
MPID_Init(381)...................: channel initialization failed
MPIDI_CH3_Init(320)..............: rdma_get_control_parameters
rdma_get_control_parameters(1534):
rdma_open_hca(701)...............: No active HCAs found on the system!!! 0

Given that the error is about MPI, could I have missed something in running CMAQ in debug mode, compared to the ‘normal’ mode, that is causing it?

Since the debug build of CMAQ did not run on multiple nodes, I ran it on a single node with multiple (44) cores. This time, the log files for Feb 12, 2014 were stuck at one of two points. One group of log files ended with:


continuing from gas phase species to aerosol species

The remaining CPUs had this at the end of their log files:

[log excerpt not preserved]

Thus, looking at the log files, all CPUs complete the emissions scaling preparation, but one group then fails to log the ‘set up of gravitational settling’, while the other group makes some progress in setting up gravitational settling!

The standard output, towards the end, shows a floating divide-by-zero error during Time Integration:


[screenshot of the standard output showing the floating divide-by-zero traceback]

Thanks for posting the results of your debug run. This points to the following line from m3dry.F as the source of the crash:

rgnd = 200. + 300. * MET_DATA%SOIM1(c,r)/GRID_DATA%WFC (c,r)

Specifically, GRID_DATA%WFC (c,r) is zero when being used in this computation.

This issue has come up before in a post by @dazhong.yin; it was listed as issue #2 in that post.

One of my colleagues later had some offline email exchanges with @dazhong.yin and in my recollection, the underlying reason was that the WRF files had an inconsistency between the spatial coverage of land vs. water cells as indicated by the LWMASK array and the spatial coverage of soil types.

The computation above only takes place for land grid cells, i.e. cells for which LWMASK is 1 and the vegetation fraction is greater than 0. For such cells, the soil type should be <= 12, but in @dazhong.yin’s case there were cells where LWMASK was 1 (land) while the soil type was 14 (a placeholder for water). For that soil type, WFC (soil field capacity) is left at its initialized value of 0, which leads to the ‘division by zero’ error you also encountered.

Here is the information my colleague shared with @dazhong.yin:

The part of m3dry that uses WFC is for land only so ISLTYP=14 should not occur there. The only way this can happen is if LWMASK = 1 (land) and ISLTYP=14. Usually the soil data has greater extent at coastlines so this does not occur. You could compare SLTYP in the METCRO2D file to the LWMASK in the GRIDCRO2D file (or both in wrfout) to see if there are any grid cells where LWMASK=1 and SLTYP=14 or anything >11
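
For reference, a minimal Python sketch of this kind of check might look like the following (this assumes the netCDF4 and numpy packages; the METCRO2D and GRIDCRO2D file names below are placeholders, not the actual files from this run):

import numpy as np
from netCDF4 import Dataset

# Placeholder file names; point these at your actual MCIP outputs.
met  = Dataset("METCRO2D_d02_2014H1.nc")
grid = Dataset("GRIDCRO2D_d02_2014H1.nc")

# MCIP variables are dimensioned (TSTEP, LAY, ROW, COL); take the first time step.
sltyp  = np.asarray(met.variables["SLTYP"][0, 0, :, :])
lwmask = np.asarray(grid.variables["LWMASK"][0, 0, :, :])

# Flag land cells (LWMASK = 1) whose soil type is water (14) or anything > 11.
bad = (lwmask == 1) & (sltyp > 11)
print(f"{int(bad.sum())} suspect land cells")
for r, c in zip(*np.where(bad)):
    print(f"row={r + 1}, col={c + 1}, SLTYP={int(sltyp[r, c])}")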

@dazhong.yin then confirmed that there were indeed many cells in his fields with SLTYP=14 and LWMASK=1, but I don’t have any record of whether or how this problem was ultimately resolved. Based on my colleague’s experience, a problem like this shouldn’t happen and points to some issue in the WRF setup. Maybe you can reach out to @dazhong.yin to find out if/how this was resolved.

Thanks @hogrefe.christian, I will look into the WRF land and soil parameters and get back!

The floating point exception (divide by zero) issue in m3dry.F is most probably not related to the “not writing CTM_DEPV_DIAG file” problem.

The floating point exception occurs when a WRF version older than 4.0 with the P-X LSM was used to generate meteorological inputs for CMAQ, at cells with LWMASK of 1 (indicating land), vegcr greater than 0 (indicating vegetation cover), and SLTYP of 14 (indicating a water surface). When WRF v4+ with the P-X LSM is used for the met inputs, the floating point exception does not happen for the same cells, because variables such as WFC are non-zero for water surfaces. Ideally this case should not exist, but it happens due to data and/or methodology inconsistencies.

I made the following two changes in m3dry.F with CMAQv5.3; I have no experience with newer CMAQ versions yet.

! IF ( ( NINT(GRID_DATA%LWMASK( c,r )) .NE. 0 ) .AND. ( vegcr .GT. 0.0 ) ) THEN ! land
IF ( ( NINT(GRID_DATA%LWMASK( c,r )) .NE. 0 ) .AND. ( vegcr .GT. 0.0 ) .AND.
& ( GRID_DATA%SLTYP(c,r) .NE. 14 ) ) THEN

! IF ( ( NINT(GRID_DATA%LWMASK( c,r )) .EQ. 0 ) .OR. ( vegcr .EQ. 0.0 ) ) THEN ! water
IF ( .NOT. ( ( NINT(GRID_DATA%LWMASK( c,r )) .NE. 0 ) .AND. ( vegcr .GT. 0.0 ) .AND.
& ( GRID_DATA%SLTYP(c,r) .NE. 14 ) ) ) THEN

Thanks @dazhong.yin. I used WRFv3.9.1.1 but with the NOAH LSM, so WRF SLTYP would not be an issue. Actually, I just started another thread, ‘NaN values of NH3_Emis variable in CCTM_DRYDEP_* file’, about NaN values of the NH3 and NH3_Emis variables in the CCTM_DRYDEP_* file. However, I do not know whether this error is related to those NaN values of NH3, or where those NaN values came from.

The error you described here and the one you described in the other thread are unlikely to be related. The only way they could be related is for the error described here to be causing downstream effects in the bidi calculations, but not the other way around.

The error described here is definitely caused by a problem in the WRF fields. To fix the root cause, you’d need to trace back your WRF processing to find out why it generated grid cells with SLTYP > 11 and LWMASK = 1, and then address the problem there. To “patch” the problem in CMAQ, you could apply the code modification to m3dry.F posted by @dazhong.yin, which has the effect of reducing the number of cells being treated as land cells, but you’d still want to dig into it to better understand what’s going on, since it will affect your results in coastal areas.

The problem you described in the other thread may point to a problem with the EPIC input data, but I don’t have experience with it. In any case, that other problem cannot be causing the problem you’re describing in this thread.

Edited version: The Great Lakes, the larger lakes in Florida and Utah, and some smaller lakes (appearing as red dots on the maps below) tend to be ‘inland water’, so they have been designated as water, similar to the oceans, in both the soil categorization and the land mask in WRF:
[two maps: the WRF land mask (first) and soil category (second), with inland water bodies shown as red dots]

I think some very small red dots on the second map with SLTYP=14 are actually classified as land on the first map (e.g. the red dot in the westernmost part of the second map is not visible in the first map). I will investigate further with an ‘if-then’ script in Python to see how many such dots exist.

skunwar,

Could you email me directly at gilliam.robert@epa.gov?

I’d like to get a wrfinput_d01 file for your domain to look at these fields a bit closer in my sandbox.

It looks to me that this originates in the WRF preprocessing. LWMASK in MCIP comes from LANDMASK in WRF. The soil type in MCIP (SLTYP) comes from ISLTYP in WRF. These are derived during the real.exe processing step that produces the wrfinput_d01 file. The PX LSM does some processing of the soil type, but you are using the Noah LSM, so I am not as familiar with that path.

If the LANDMASK and ISLTYP in your wrfinput_d01 file are inconsistent, this could be something deeper in terms of how Noah is initialized via real.exe. I know the ISLTYP is derived from SOILCTOP/SOILCBOT in the met_em* files from the WPS metgrid processing. It could go as deep as the geogrid processing and the interpolation of these fields; I am just not sure. Please email so I can give you a private FTP site and I’ll do some investigating.
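
As a quick illustration of how one might check this consistency at the WRF level, here is a minimal Python sketch (assuming the netCDF4 and numpy packages; the wrfinput_d01 path is a placeholder for the actual file):

import numpy as np
from netCDF4 import Dataset

# Placeholder path; point this at your actual wrfinput_d01.
wrfin = Dataset("wrfinput_d01")

landmask = np.asarray(wrfin.variables["LANDMASK"][0, :, :])  # 1 = land, 0 = water
isltyp   = np.asarray(wrfin.variables["ISLTYP"][0, :, :])    # 14 = water in the 16-category soil set

# Count land cells whose dominant soil category is water.
mismatch = (landmask == 1) & (isltyp == 14)
print(f"{int(mismatch.sum())} land cells with ISLTYP = 14 in wrfinput_d01")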

Thank you @hogrefe.christian @dazhong.yin @rgilliam for your concern and information! I performed a check of the MCIP output variables LWMASK=0 and SLTYP=14 in (a) Python, (b) VERDI when zoomed in, and (c) VERDI when zoomed out. In the Python code, I could not detect any differences between the two. In VERDI, when zoomed in, the two variables look similar:
[VERDI maps of LWMASK and SLTYP, zoomed in]

However, when fully zoomed out, the same two maps look like this:
[VERDI maps of LWMASK and SLTYP, fully zoomed out]

Look how SLTYP has two red dots (water=14=red) while LWMASK has only one red dot (water=0=red) inside the blue circle over Florida. You will also notice that the SLTYP map has thicker dots (or rather grid cells) than the LWMASK map does. Given my Python check and the VERDI zoom-in check, I am inclined to say that VERDI may be drawing grid cells at different sizes on the two maps, and that LWMASK=1 and SLTYP=14 are not overlapping in this case.

However, the suggestion from the colleague of @hogrefe.christian was that any grid cell where LWMASK=1 and SLTYP>11 is a problem. When I checked the WRF soil types, I saw:
[table of the 16 WRF soil type categories]
I can understand that soil type = 16 is land-ice, whose properties may be more like water than land. But I am unsure why classifying soil type = 12 (clay), 13 (organic material), or 15 (bedrock) as land would be a problem for CMAQ?
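
To move from eyeballing the maps to actual counts, one option is to tabulate which soil categories occur over land cells, for example with a short numpy sketch along these lines (again assuming the netCDF4 and numpy packages; the file names are placeholders):

import numpy as np
from netCDF4 import Dataset

# Placeholder file names; substitute your actual MCIP files.
sltyp  = np.asarray(Dataset("METCRO2D_d02_2014H1.nc").variables["SLTYP"][0, 0, :, :]).astype(int)
lwmask = np.asarray(Dataset("GRIDCRO2D_d02_2014H1.nc").variables["LWMASK"][0, 0, :, :])

# Histogram of soil categories over land cells (LWMASK = 1).
cats, counts = np.unique(sltyp[lwmask == 1], return_counts=True)
for cat, n in zip(cats, counts):
    print(f"SLTYP {cat:2d}: {n} land cells")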

skunwar (sorry, I did not get your full name)

Please contact me directly at gilliam.robert@epa.gov so I can get you to upload a wrfinput_d01 file to my ftp.

I’d like to look at this more closely. FYI, these values come from WRF directly as far as I know. You are using the Noah LSM; we typically do not use Noah in most runs. In the PX LSM there is some internal processing of the LANDMASK and soil to force consistency. I’d like to look at the origin of these fields more closely. It may also help if you can provide the namelist.wps file that was used for the geogrid.exe processing. These maps do help by pointing to an inconsistency, but they cannot specifically tell you why unless we do some testing. It could be something in the geogrid processing and the interpolation of the land use/soil datasets, or an oddity of this domain. I do not think this is a CMAQ issue, because CMAQ simply takes the WRF inputs as given; MCIP does nothing but put WRF in a different format, at least for these fields.

I’ll provide an FTP address/login/password if you email me directly; I just did not want to post those on a forum.

Regards,

Rob