Slurmstepd: error: Detected 1 oom_kill event

Hello all,

I am running CMAQ for the 12CONUS1 domain, starting 01 Feb 2023, for 2 days. After running for some time, it reports:

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1697494 RUNNING AT ec5
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

slurmstepd: error: Detected 1 oom_kill event in StepId=5112172.0. Some of the step tasks have been OOM Killed.
srun: error: ec5: task 0: Out Of Memory
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
real 9378.48
user 0.05
sys 0.04

I used the following commands in Slurm to run this:
srun --time=168:00:00 -c 16 --mem=32G --pty bash
sbatch -n 128 --time=168:00:00 run_cctm_202302_12US1.csh

Please find attached my runscript and log script.
slurm-5112172.txt (19.3 KB)
run_cctm_202302_12US1.txt (35.7 KB)
CTM_LOG_074.v532_gcc_12US1_459X299_20230201.txt (75.2 KB)

Thank you for your help.

Hasibul

The error indicates that the program was killed because it ran out of memory. I am not familiar with the form of the Slurm commands you are using, but you could try increasing the --mem=32G value (or removing that argument entirely) to let the run proceed.
Alternatively, you could try running on more processors, if you have more available.
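
As a rough sketch of the first suggestion, the memory request can be placed on the sbatch line itself. The 4 GB per CPU figure below is only an assumed placeholder, not a value from the original post or a CMAQ recommendation, and whether --mem-per-cpu is accepted depends on how your cluster is configured. The script only builds and prints the command so you can review it before submitting.

```shell
#!/bin/bash
# Sketch only: builds an sbatch command with an explicit memory request.
# MEM_PER_CPU_GB is an assumed placeholder; pick a value your nodes can provide.
NTASKS=128
MEM_PER_CPU_GB=4

# Total memory the job would request across all tasks.
TOTAL_GB=$((NTASKS * MEM_PER_CPU_GB))

# --mem-per-cpu sets memory per allocated CPU; --mem would set it per node.
CMD="sbatch -n ${NTASKS} --mem-per-cpu=${MEM_PER_CPU_GB}G --time=168:00:00 run_cctm_202302_12US1.csh"

echo "Total memory requested: ${TOTAL_GB}G"
echo "$CMD"

# After a run, peak memory per step can be compared with the request, e.g.:
#   sacct -j 5112172 --format=JobID,State,ReqMem,MaxRSS
```

If the printed command looks right, run it directly. Comparing MaxRSS from sacct against ReqMem for the failed job should show how far the request needs to be raised.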