CMAQ Segmentation Fault Exactly at Met File Interval

Hello CMAS Community!

I’m consistently encountering segmentation faults (SIGSEGV) when running coupled WRF 4.7.0 and CMAQ (cb6r5_ae7_aq) simulations (see the end of this message for the error text from rsl.error.0001). These crashes always occur precisely at the interval at which the MCIP-generated meteorological data was processed:

  • For instance, when running a simulation starting at 12:00:00 with MCIP met data generated at a 3-hour interval, the model successfully runs until exactly 15:00:00, then crashes.
  • Similarly, if I run a simulation starting at 12:00:00 using MCIP met data generated at a 6-hour interval, the model will run successfully up to exactly 18:00:00 before crashing.

These crashes occur reliably and exactly at these specified MCIP intervals. My domain is relatively small (~400 km²; 100 × 100 grid cells at 200 m resolution), and I’m running with a small WRF timestep (1 second). I’m coupling every 120 WRF timesteps (wrf_cmaq_freq = 120), i.e. coupling occurs every 2 minutes. I’ve tried changing CTM_MAXSYNC and CTM_MINSYNC but saw no improvement.
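
For reference, here is a minimal sketch of the timing-related settings involved (csh run-script style; CTM_MAXSYNC/CTM_MINSYNC are shown at the usual CMAQ defaults rather than the exact values I experimented with):

set wrf_time_step  = 1       # WRF time step (s), from namelist.input
set wrf_cmaq_freq  = 120     # call CMAQ every 120 WRF steps -> 1 s x 120 = 120 s
setenv CTM_MAXSYNC 300       # maximum CMAQ synchronization time step (s); CMAQ default
setenv CTM_MINSYNC 60        # minimum CMAQ synchronization time step (s); CMAQ default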

I reviewed this forum post on segmentation faults and CFL issues: Segmentation Faults and CFL Errors | WRF & MPAS-A Support Forum. I checked for CFL error messages in the output but saw none.

I also tried the recommended memory and stack size tweaks:

ulimit -s unlimited
export MP_STACK_SIZE=64000000
export OMP_STACKSIZE=64M

These did not resolve the issue.

You can access all the relevant input and output files here: Model inputs and outputs - Google Drive

Has anyone experienced a similar issue, or does anyone have suggestions on how to debug or resolve these crashes?

I’m attempting to model the hyper-local cooling effect of reflecting sunlight via the injection of reflective aerosols over a small domain, including how quickly the aerosols disperse and the cooling effect is lost. This is an unusual use case, so if anyone has suggestions on how to better model something occurring on the scale of a few km², I’d be happy to hear them!

Thanks very much for your time!

rsl.error.0001:
taskid: 1 hostname: ip-172-31-83-148
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 8
Domain # 1: dx = 200.000 m
WRF V4.7.0 MODEL
git commit e204519f0dc13c99bbaf39e8a818993cc36209ad 3 files changed, 0 insertions(+), 0 deletions(-)


Parent domain
ids,ide,jds,jde 1 100 1 100
ims,ime,jms,jme 19 57 -4 20
ips,ipe,jps,jpe 26 50 1 13


DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 38940216 bytes allocated
med_initialdata_input: calling input_input
Input data is acceptable to use:
CURRENT DATE = 2024-08-07_12:00:00
SIMULATION START DATE = 2024-08-07_12:00:00
Max map factor in domain 1 = 1.00. Scale the dt in the model accordingly.
D01: Time step = 1.00000000 (s)
D01: Grid Distance = 0.200000003 (km)
D01: Grid Distance Ratio dt/dx = 5.00000000 (s/km)
D01: Ratio Including Maximum Map Factor = 4.98540926 (s/km)
D01: NML defined reasonable_time_step_ratio = 6.00000000
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2024 , Julian day = 220
CO2 = 4.2868195693653640E-004 volume mixing ratio
N2O = 3.3564240469132870E-007 volume mixing ratio
CH4 = 2.0060859710514744E-006 volume mixing ratio
CFC11 = 2.7307591809652235E-010 volume mixing ratio
CFC12 = 4.6381286791458461E-010 volume mixing ratio
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND 61 CATEGORIES 2 SEASONS WATER CATEGORY = 17 SNOW CATEGORY = 15
INITIALIZE THREE Noah LSM RELATED TABLES
d01 2024-08-07_12:00:00 Input data is acceptable to use:
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 26 IE 50 JS 1 JE 13
WRF NUMBER OF TILES = 1
d01 2024-08-07_12:00:00 ----------------------------------------
d01 2024-08-07_12:00:00 W-DAMPING BEGINS AT W-COURANT NUMBER = 1.00000000
d01 2024-08-07_12:00:00 ----------------------------------------
d01 2024-08-07_18:00:00 Input data is acceptable to use:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x56fa59b0d44b in ???
#1 0x56fa59b0c2df in ???
#2 0x56fa59a108f7 in ???
#3 0x56fa5a241ce8 in __memcpy_sve
at ../sysdeps/aarch64/multiarch/memcpy_sve.S:77
#4 0x56fa5a5ca63f in ???
#5 0x56fa5a03826b in ???
#6 0x56fa5a61bea7 in ???
#7 0x56fa5a59421f in ???
#8 0x56fa59e979fb in ???
#9 0x56fa59f04027 in ???
#10 0x56fa59f39493 in ???
#11 0x56fa59eade1b in ???
#12 0xc02bb3386037 in ???
#13 0xc02bb380a08f in ???
#14 0xc02bb38312eb in ???
#15 0xc02bb388ae97 in ???
#16 0xc02bb388f407 in ???
#17 0xc02bb30bbf87 in ???
#18 0xc02bb30ab783 in ???
#19 0xc02bb30aad7b in ???
#20 0x56fa5a1c84c3 in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#21 0x56fa5a1c8597 in __libc_start_main_impl
at ../csu/libc-start.c:360
#22 0xc02bb30aadef in ???
#23 0xffffffffffffffff in ???

Hi WilloughbyWinograd,

Before I attempt to answer your question, let me get a better picture of your situation. From your description, it seems you are running the WRF-CMAQ coupled model, yet you mention using MCIP met data generated at a specific interval. This confuses me: the WRF-CMAQ coupled model does not use MCIP-generated met data; all the met information comes directly from WRF. Could you please clarify this?

In addition, since the resolution is 200 m, I wonder whether you have an LES setup in your case. If you are using the Intel ifort compiler, you can add the -traceback flag (-fbacktrace for gfortran) to the FCOPTIM line in configure.wrf. With this flag, the backtrace will likely tell you where the code crashed.
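
For concreteness, the change would look roughly like this (illustrative only; the existing FCOPTIM flags in configure.wrf vary by platform and compiler):

# In configure.wrf, append the debug flag to the FCOPTIM line, e.g.
#   ifort:     FCOPTIM = <existing flags> -traceback
#   gfortran:  FCOPTIM = <existing flags> -fbacktrace
# then rebuild:
./compile em_real >& compile.log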

Cheers,
David

Hi David! Thank you for the response. I misspoke about the met files. Yes, they are generated by WRF.

As for adding the flag, I’m using a precompiled WRF-CMAQ setup from Odycloud hosted on an AWS instance, so I can’t recompile with that flag without changing my setup. If I have to do that, I can, but I would prefer not to. Have you heard of this crash-at-interval issue before?

Hi WilloughbyWinograd,

I am sorry, I still do not quite follow your notion of a time interval. In the WRF-CMAQ coupled model there is no such concept of a time interval (3 hours, 6 hours, or any other setting); there is one on the output side, typically hourly, which you can modify in the run script. WRF continuously produces met information and sends it to CMAQ at a specific time step defined by the WRF time step and the environment variable wrf_cmaq_freq. I am very curious to find out how you have such a “time interval” set up. Is it in your run script, and if so, how?

We have never encountered such a scenario, and we have not heard from any user with the same issue. Without recompiling the code (indeed, you only need to recompile the dynamics, physics, and cmaq portions), it is difficult to determine the root cause, in my opinion.

Cheers,
David

Hi David,

Thank you for your patience with the terminology confusion! I now have a better idea of the root of the segmentation fault: it is directly tied to a namelist.input setting used during the real.exe step. When I run real.exe before running the WRF-CMAQ run script, I set interval_seconds = 21600 (6 hours). This defines the spacing of lateral boundary updates via the wrfbdy file, and I set it intentionally to match the temporal resolution of my input data.
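
One quick way to confirm that spacing, assuming the standard wrfbdy_d01 file name (just a sketch):

grep -i interval_seconds namelist.input   # should show 21600
ncdump -v Times wrfbdy_d01 | tail -n 8    # boundary-update times should be 6 hours apart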

Here’s the key point:
When I run WRF-CMAQ with coupling off using this setup, everything works fine: boundary updates at 6-hour intervals do not cause any crash or instability, and the WRF run continues past the 6-hour mark (I am trying to do a 24-hour run).

But when I run WRF-CMAQ in two-way coupled mode, the model crashes precisely at the first boundary update time (e.g., 18:00 for a 12:00 start and a 6-hour interval). It doesn’t matter if I set interval_seconds to three hours and re-run real.exe, MCIP, BCON, ICON, etc.; the crash simply moves to the 3-hour mark. This makes it clear the crash is not a WRF issue, but rather something in how CMAQ interacts with WRF’s boundary update during coupled execution, likely due to memory handling, field synchronization, or pointer access during the update step.

Here, you can see the interval settings in namelist.input and the associated met files spaced by 6 hours.

So this is not about MCIP, MCIP intervals, or even met_em file processing — it’s a boundary-handling issue that only manifests during coupling.

Have you encountered this kind of WRF-CMAQ coupling sensitivity to boundary update intervals before?

Best,
Willoughby Winograd

Hi Willoughby,

Thanks for the clarification.

When you said "When I run WRF-CMAQ with coupling off", I believe you meant running the offline CMAQ model, i.e. using the MCIP-generated met files to drive CMAQ. At this point I am puzzled by your case; we have never encountered anything like it. Could you please:

  • Confirm that with a 3-hour-interval boundary file, your coupled run crashed at the 3rd hour.
  • Provide the settings of these environment variables in your run script: CMAQ_COL_DIM, CMAQ_ROW_DIM, TWOWAY_DELTA_X, and TWOWAY_DELTA_Y.
  • Run a test with the following settings: option = 3 (I assume you are currently using 2), wrf_cmaq_freq = 3600, and make sure the environment variables PGRID_DOT_2D, PGRID_CRO_2D, PMET_CRO_2D, PMET_DOT_3D, and PMET_CRO_3D point to specific output files (see the sketch below).

Once this test is done, please compare the PGRID_DOT_2D, PGRID_CRO_2D, PMET_CRO_2D, PMET_DOT_3D, and PMET_CRO_3D files with the corresponding MCIP output files from a similar setup. The results should be almost identical, only off by one WRF time step.
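
For example, the test settings could look roughly like this in the run script (a sketch only: the "option" variable is written out here as wrf_cmaq_option, and $OUTDIR and the file names are placeholders for your own paths):

set wrf_cmaq_option = 3                        # coupled mode, per the suggestion above
set wrf_cmaq_freq   = 3600                     # with a 1 s WRF step, 1 s x 3600 = hourly coupling
setenv PGRID_DOT_2D $OUTDIR/PGRIDDOT2D_d01.nc  # diagnostic grid/met files written by the
setenv PGRID_CRO_2D $OUTDIR/PGRIDCRO2D_d01.nc  # coupled model, to be compared against the
setenv PMET_CRO_2D  $OUTDIR/PMETCRO2D_d01.nc   # corresponding MCIP outputs
setenv PMET_DOT_3D  $OUTDIR/PMETDOT3D_d01.nc
setenv PMET_CRO_3D  $OUTDIR/PMETCRO3D_d01.nc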

Cheers,
David

Hi Willoughby,

Sorry, I forgot to mention that setting wrf_time_step * wrf_cmaq_freq = 3600 is equivalent to running the WRF-CMAQ coupled model as in the regular offline mode (only off by one WRF time step).

Cheers,
David

Hi David,

Thanks again for your help on this! To clarify a key point: when I previously said “WRF-CMAQ with coupling off,” I meant that I set wrf_time_step * wrf_cmaq_freq to a value greater than the total runtime, effectively disabling coupling without changing the code path. When I did this, I did not get the boundary-condition error at 15:00, 18:00, or any other time stamp. So CMAQ was technically included but never actually called, simulating uncoupled execution, and the simulation ran fine.
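
A minimal sketch of what that looks like in run-script terms (100000 is just an illustrative value; anything larger than the 86400 WRF steps in a 24-hour, 1-second-timestep run has the same effect):

set wrf_time_step = 1        # 24-hour run -> 86400 WRF steps in total
set wrf_cmaq_freq = 100000   # > 86400, so the coupling step is never reached and CMAQ is never called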

I have been working by default with coupling active (option = 3) and wrf_time_step = 1; I set wrf_cmaq_freq = 120, so coupling occurs every 2 minutes.

When using a 3-hour boundary interval (interval_seconds = 10800), the crash occurs exactly at 15:00 (3 hours in). When using a 6-hour boundary interval (interval_seconds = 21600), the crash occurs exactly at 18:00 (6 hours in).

Environment variables are set as follows:

export CMAQ_COL_DIM=87
export CMAQ_ROW_DIM=87
export TWOWAY_DELTA_X=6
export TWOWAY_DELTA_Y=6

I’ll also try setting wrf_cmaq_freq = 3600 to match the 1-hour interval you suggested and see if that resolves the crash; I’ll report back shortly with what I find. In addition, I’ll check the diagnostic outputs PGRID_DOT_2D, PGRID_CRO_2D, PMET_CRO_2D, PMET_DOT_3D, and PMET_CRO_3D to see if they are all written as expected, compare them to the MCIP outputs, and report any inconsistencies.

Thank you!

Best,
Willoughby

Hi Willoughby,

"I set wrf_time_step * wrf_cmaq_freq to a value greater than the total runtime

Parent domain" this is quite an interesting idea. Indeed that means running WRF only and setting option = 0 will do the same thing.

Thanks for providing additional information:

export CMAQ_COL_DIM=87
export CMAQ_ROW_DIM=87
export TWOWAY_DELTA_X=6
export TWOWAY_DELTA_Y=6

With the other information you have provided:
ids,ide,jds,jde 1 100 1 100

I believe you have the following environment variable settings. Please confirm.

setenv WRF_COL_DIM 101 # wrf west_east_stag
setenv WRF_ROW_DIM 101 # wrf south_north_stag

If that is the case, please try to set TWOWAY_DELTA_X and TWOWAY_DELTA_Y to 7 (2 x 7 + 87 = 101).
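
To make the arithmetic explicit, here is a rough consistency check (a sketch; it assumes the relation behind the suggestion above, i.e. the CMAQ window plus the buffer on each side should span the staggered WRF dimension):

set WRF_COL_DIM    = 101   # west_east_stag (ide = 100 -> 101 staggered points)
set CMAQ_COL_DIM   = 87
set TWOWAY_DELTA_X = 6
@ span = $CMAQ_COL_DIM + $TWOWAY_DELTA_X + $TWOWAY_DELTA_X
echo "$span vs $WRF_COL_DIM"   # 99 vs 101 with a delta of 6; 101 vs 101 with a delta of 7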

Cheers,
David