Hi,
We are running multi-node CCTM v5.4 on Azure CycleCloud (8.6.5) with the Slurm scheduler (23.11.7-1) and Open MPI (5.0.2) built from source alongside the CMAQ install.
The compute nodes use a static custom VM image, based on the Microsoft Ubuntu 22.04 HPC image with CMAQ installed, so the image has not changed.
It has been running three CMAQ domains every day for over a month now.
However, since 25th December we have been seeing occasional failures for one or more of the domains at the start of CCTM. It appears to be something to do with inter-node communication initialization:
CMAQ Processing of Day 20250106 Began at Mon Jan 6 05:49:34 GMT 2025
[aqpayg-hpchb120rsv3-2:13636] [[31353,1],201] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx
[aqpayg-hpchb120rsv3-2:13626] [[31353,1],193] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx
It seems that the ranks on one node select one PML (ob1) while the ranks on the other select ucx, and the job fails.
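For anyone hitting the same mismatch: a minimal sketch of how one might force a consistent PML on every rank and see why each rank makes its choice, using the standard Open MPI MCA mechanisms (`<cctm_executable>` and the rank count are placeholders for your actual job script; I am assuming mpirun is used as the launcher):

```shell
# Pin the PML explicitly so mixed ob1/ucx selection cannot occur
# (either on the mpirun command line...)
mpirun --mca pml ucx -np <nranks> <cctm_executable>

# ...or via an environment variable exported to all ranks in the Slurm job script:
export OMPI_MCA_pml=ucx

# Raise the PML selection verbosity to see why each rank picks ob1 vs ucx:
mpirun --mca pml ucx --mca pml_base_verbose 10 -np <nranks> <cctm_executable>
```

Note that pinning pml to ucx would turn a silent fallback to ob1 (e.g. if UCX fails to initialize on a node) into a hard error on that node, which may make the underlying cause easier to spot.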
If a job fails, it is immediately retried on the same cluster nodes that were started for the first attempt, and it fails with the same error. If the exact same job is manually re-run later (on a new set of compute nodes), it seems to run fine.
I have seen this similar issue: https://github.com/open-mpi/ompi/issues/12475, which seems to suggest a problem that might be fixed in PMIx v5.0.3 (does that ship with Open MPI 5.0.3?). But it seems strange that it has been running fine until now…
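To answer part of my own question, this is roughly how one could check which PMIx a given Open MPI build is using (Open MPI 5.x bundles PMIx by default, though it can also be built against an external one; the exact output format of these commands may vary by version):

```shell
# Look for the PMIx lines in the build information of this Open MPI install:
ompi_info | grep -i pmix

# If PMIx is installed standalone (external), pmix_info reports its version directly:
pmix_info --version
```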
Does anyone have any advice on how I should go about debugging or resolving this?