Hi,
We are running multi-node CCTM v5.4 on Azure CycleCloud (8.6.5) with the Slurm scheduler (23.11.7-1) and Open MPI (5.0.2) built from source alongside the CMAQ install.
The compute nodes use a static custom VM image, based on the Microsoft Ubuntu 22.04 HPC image with CMAQ installed, so the image has not changed.
It has been running three CMAQ domains every day for over a month now.
However, since 25th December we have been seeing occasional failures for one or more of the domains at the start of CCTM. It appears to be something to do with inter-node communication initialization:
CMAQ Processing of Day 20250106 Began at Mon Jan 6 05:49:34 GMT 2025
[aqpayg-hpchb120rsv3-2:13636] [[31353,1],201] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx
[aqpayg-hpchb120rsv3-2:13626] [[31353,1],193] selected pml ob1, but peer [[31353,1],0] on aqpayg-hpchb120rsv3-1 selected pml ucx
It seems that the ranks on one node select one PML (ob1) while the ranks on the other select ucx, and the job fails.
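For anyone hitting the same mismatch: a minimal sketch of how one might force a consistent PML on every rank and see why each rank makes its choice, using the standard Open MPI MCA mechanisms (`<cctm_executable>` and the rank count are placeholders for your actual job script; I am assuming mpirun is used as the launcher):

```shell
# Pin the PML explicitly so mixed ob1/ucx selection cannot occur
# (either on the mpirun command line...)
mpirun --mca pml ucx -np <nranks> <cctm_executable>

# ...or via an environment variable exported to all ranks in the Slurm job script:
export OMPI_MCA_pml=ucx

# Raise the PML selection verbosity to see why each rank picks ob1 vs ucx:
mpirun --mca pml ucx --mca pml_base_verbose 10 -np <nranks> <cctm_executable>
```

Note that pinning pml to ucx would turn a silent fallback to ob1 (e.g. if UCX fails to initialize on a node) into a hard error on that node, which may make the underlying cause easier to spot.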
If a job fails, it is immediately retried on the same cluster nodes that were started for the first attempt, and it fails with the same error. If the exact same job is manually re-run later (on a new set of compute nodes), it seems to run fine.
I have seen this similar issue: https://github.com/open-mpi/ompi/issues/12475, which seems to suggest a problem that might be fixed in PMIx v5.0.3 (does that ship with Open MPI 5.0.3?). But it seems strange that it has been running fine until now…
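To answer part of my own question, this is roughly how one could check which PMIx a given Open MPI build is using (Open MPI 5.x bundles PMIx by default, though it can also be built against an external one; the exact output format of these commands may vary by version):

```shell
# Look for the PMIx lines in the build information of this Open MPI install:
ompi_info | grep -i pmix

# If PMIx is installed standalone (external), pmix_info reports its version directly:
pmix_info --version
```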
Does anyone have any advice on how I should go about debugging or resolving this?