Multi-Node CCTM on Azure CycleCloud with Slurm and OpenMPI has started crashing

Hi,

Running Multi-Node CCTM v5.4 on Azure CycleCloud (8.6.5) and Slurm Scheduler (23.11.7-1) with Open-MPI (5.0.2) installed from source alongside the CMAQ install.

We are running a (static) custom VM image for the compute nodes based on the Microsoft Ubuntu 22.04 HPC image with CMAQ installed (so the image has not changed).

It’s been running 3x Domains of CMAQ every day for over a month now.

But, since 25th December we have been seeing occasional failures for one or more of the domains right at the start of CCTM - it seems to be something to do with inter-node communication initialization:

CMAQ Processing of Day 20250106 Began at Mon Jan  6 05:49:34 GMT 2025
[aqpayg-hpchb120rsv3-2:13636] [[31353,1],201] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx
[aqpayg-hpchb120rsv3-2:13626] [[31353,1],193] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx

It seems that one node selects one PML (ob1) while the other selects ucx, and the job fails.
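Not a fix for the root cause, but one common mitigation for this class of error is to pin all ranks to the same PML so no node can silently fall back to ob1 while its peers pick ucx. A minimal sketch, assuming your build includes the UCX PML (the variable names are standard Open MPI MCA environment variables; the mpirun line is a placeholder, not our actual run command):

```shell
#!/bin/sh
# Force every rank to select the same Open MPI point-to-point layer (PML).
# OMPI_MCA_pml=ucx makes startup fail fast if UCX is unavailable on a node,
# instead of that node quietly falling back to ob1 and mismatching its peers.
export OMPI_MCA_pml=ucx

# Equivalent on the command line (rank count and binary are placeholders):
# mpirun --mca pml ucx -np 240 ./CCTM_v54.exe

echo "$OMPI_MCA_pml"
```

The same setting can go in the Slurm batch script or in `$HOME/.openmpi/mca-params.conf` (as `pml = ucx`) so it applies cluster-wide.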

If a job fails, it is immediately retried on the same cluster nodes that were started for the first attempt, but the retry fails with the same error. If you manually rerun the exact same job later (on a new set of compute nodes), it seems to run fine.

I have seen a similar issue: https://github.com/open-mpi/ompi/issues/12475 which suggests the problem might be fixed in PMIx v5.0.3 (does that come bundled with Open-MPI 5.0.3?). But it seems strange that everything had been running fine until now…

Does anyone have any advice as to how I should go about either debugging or resolving this?

I do not have any experience with this problem on CycleCloud.

Hi Liz,

Thanks for dropping by.

In the end, I just added this environment variable to the CMAQ multi-node run script and the issue has not recurred:

setenv PMIX_MCA_gds ^shmem2
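For anyone whose batch script is sh/bash rather than csh, a sketch of the equivalent setting (the component name comes from the setenv line above; the `^` prefix is standard PMIx/MCA syntax for excluding a component):

```shell
#!/bin/sh
# sh/bash equivalent of the csh "setenv PMIX_MCA_gds ^shmem2" above.
# The leading ^ tells PMIx to exclude the shmem2 gds component, which was
# implicated in the linked Open MPI issue; quote it so the shell does not
# interpret the caret.
export PMIX_MCA_gds='^shmem2'

echo "$PMIX_MCA_gds"
```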

I'm not sure whether it was a hardware issue with a particular Azure rack/node/switch at the time (as it just started happening out of the blue) or whether this variable actually fixed things…

Rgds

Gary
