Is it possible (how?) to run containerized CMAQ across multiple nodes using Azure CycleCloud & Slurm?

Hi,

I have set up an Azure CycleCloud environment to run WRF and CMAQ (run from within Docker containers using Apptainer/Singularity), and I am using the Slurm scheduler to submit and run single-node jobs on the cluster perfectly fine.

The problem that I now have is that single node containerized CMAQ / CCTM runs are taking too long, so I now need to find a way to submit and run multi-node containerized CMAQ jobs via the Slurm scheduler - and I don’t know if this is possible, and if so how - does anyone know / have experience of this?

Using sbatch, I have been running single node CMAQ Containers successfully with a job script like this:

#SBATCH --partition=hpchb176rsv4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=176
#SBATCH --cpus-per-task=1
apptainer exec --contain --bind [outside init folder]:/usr/local/init,[outside data folder]:/data cmaq_cctm_5p4_ubuntu_2204.sif /usr/local/init/cctm-init.sh

But if I am going to run a multi-node MPI job (I have made sure I have the same version of Open MPI inside the CMAQ container and on the compute nodes), then I presume that the CMAQ processes running inside each container need to know the hostnames and processes of the containers on the other compute nodes, so that they can find each other and communicate - any ideas how I do this?

If anyone can share how they do this (if it is possible to do) - that would be a huge help.

Thanks in advance for any advice

Gary

Where are you on the scaling curve for your particular modeling scenario?
Recall that while the compute time per job tends to fall roughly in proportion to 1/N as the number of CPUs N grows, the parallel overhead tends to go up quadratically, so that at some point adding more CPUs increases the overall run time. There is a “sweet spot” for each model configuration, which generally must be found by experimentation.

Back many years ago, we saw that on a particular IBM supercomputer every 64-CPU configuration ran slower than any 32-CPU configuration…

If your 176 cores are past the “sweet spot” for your problem, then going multi-node will only make that worse. In that case, the only way to improve the situation is to improve the quality of the scalar code that each CPU runs. Again, I’ve seen a model with a “sweet spot” at 12 CPUs out-performed by a properly optimized model on just 2 CPUs (and which had a “sweet spot” at 7 CPUs…)
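
One rough way to see where a run sits on that curve is to time the same case at two core counts and work out the speedup and parallel efficiency. The sketch below only does that arithmetic; the function name and the timings are made up for illustration and are not measurements from any CMAQ run.

```shell
#!/bin/sh
# parallel_efficiency BASE_CORES BASE_SECONDS CORES SECONDS
# Prints "speedup efficiency" of the second run relative to the first.
parallel_efficiency() {
    awk -v c0="$1" -v t0="$2" -v c="$3" -v t="$4" 'BEGIN {
        speedup = t0 / t                  # how much faster the larger run was
        efficiency = speedup / (c / c0)   # 1.00 would be perfect scaling
        printf "%.2f %.2f\n", speedup, efficiency
    }'
}

# Hypothetical timings: 44 cores took 3600 s, 176 cores took 1800 s.
parallel_efficiency 44 3600 176 1800
```

With those made-up numbers the run on 4x the cores is only 2x faster, so the efficiency is 0.50. An efficiency that keeps falling as cores are added is a sign that the run is at or past the sweet spot.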

Hi @cjcoats - it’s me again from the previous post where I was suffering from CMAQ perf issues…

I have shown that faster I/O does not help dramatically, so I am left with trying extra compute.

I think it was you that said that I am probably suffering from some “hot” cells and the run will take as long as it takes for these to solve, with the other cells not taking much cpu?

My understanding from that post is that there is no way to optimise the CMAQ grid (in the CAE world, mesh density) so that there are more cells in the “hot” areas and fewer cells in the “cool” areas - and that it has to be an equi-spaced grid.

So (I think?), if I want to reduce the run time - then the only option is to try and sub-divide all the grid further by moving to more processors (unfortunately that means scaling to more nodes, even if it is not computationally/financially efficient), to reduce the amount of cpu work each cell has got to do.

Now, I understand that this may not work, as it’s a law of diminishing returns and the job may then spend more time on inter-process communication, but I at least wanted to try it to prove it.

So, do you know if it is possible to run containerised CMAQ across multiple nodes and have all the processes inside each container on each node communicate via MPI across the hosts?

Or, does this just not work, and I will have to try running CMAQ from a file system instead of a container to get multi-node to work?

Do you have any experience of this? I see from my reading that you are an expert in this area, and that you posted some CMAQ container build instructions (for CentOS 7) a while back.

Best

Gary

Please submit the modeling domain and number of grid cells for your domain by sharing your GRIDDESC.

To verify that this is not an issue with your domain being too small to scale to 176 cores on 1 node, please try running a larger domain such as the 12US1 Domain.

We have instructions on how to obtain the input data for a 12US1 Domain from the CMAS Center Open Data Program.
https://cyclecloud-cmaq.readthedocs.io/en/latest/user_guide_cyclecloud/benchmark_cmaqv54%2B_hbv3_beeond/index.html

GRIDDESC
'12US1'
'LAM_40N97W' -2556000. -1728000. 12000. 12000. 459 299 1
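
As a rough sizing check, the 12US1 grid line above gives 459 columns and 299 rows, so the number of horizontal grid cells and the share per core can be worked out directly. The sketch below assumes the column and row counts are the 6th and 7th whitespace-separated fields of the grid line, as in the 12US1 entry shown; the function name is hypothetical.

```shell
#!/bin/sh
# cells_per_core GRID_LINE CORES
# Prints "total_cells cells_per_core" (per-core figure rounded down).
cells_per_core() {
    printf '%s\n' "$1" | awk -v cores="$2" '{
        cells = $6 * $7                   # NCOLS * NROWS
        printf "%d %d\n", cells, int(cells / cores)
    }'
}

cells_per_core "'LAM_40N97W' -2556000. -1728000. 12000. 12000. 459 299 1" 176
# prints: 137241 779
```

The same call with your own grid line and core count gives a quick feel for how little work each core is left with on a small domain.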

Hi Liz,

Thanks for the response. I need to wait for my Air Quality team to get back from holidays before I can respond, as I am just the “IT guy” and don’t know about Air Quality simulations per se :) and am trying to get this running as quickly as I can with brute force whilst they are away.

For my general interest - is it possible to run containerised CMAQ across multiple nodes (and has anyone ever done so), or does it have to be run from a file system load point?

Hi Gary,

If your model domain is small, and you were not able to see scaling on 1 node, then you will not see scaling across nodes. To run a tightly coupled model like CMAQ across nodes, you need an HPC cluster. One method to create an HPC cluster is to use Azure CycleCloud to get access to InfiniBand networking and fast I/O.

I did not have luck a few years ago using containers to run CMAQ on Azure CycleCloud across multiple nodes.

Here is an example that uses Singularity on Azure CycleCloud. I would start by trying to run a hello world job similar to what is demonstrated in this link just to verify that everything works, and then try a singularity container with your CMAQ install.
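
A minimal smoke test along those lines might look like the sketch below. It is not taken from the linked example. It asks Slurm for one task per node and has each task report its hostname from inside the container, so two different hostnames in the output would suggest the container starts once on each node. The partition and image name are copied from earlier in this thread; the RUN dry-run switch and the function name are assumptions added so the script only prints the command when it is not inside a Slurm job.

```shell
#!/bin/sh
#SBATCH --partition=hpchb176rsv4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Image name taken from earlier in this thread; adjust to your own path.
SIF_CONTAINER=${SIF_CONTAINER:-cmaq_cctm_5p4_ubuntu_2204.sif}

# Outside a Slurm allocation, only print the command (dry run).
if [ -z "${SLURM_JOB_ID:-}" ]; then RUN=echo; else RUN=; fi

smoke_test() {
    # Each task prints its hostname and Slurm task ID from inside the container.
    $RUN srun apptainer exec "$SIF_CONTAINER" \
        sh -c 'echo "$(hostname) task $SLURM_PROCID"'
}

smoke_test
```

If this prints one line per node, the next step would be the same test with --mpi=pmix and a small MPI hello-world program built against the Open MPI inside the container.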

Hi Liz,

I can launch the CMAQ container on multiple nodes, using an sbatch job script similar to this:

#SBATCH --partition=hpchb176rsv4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=176
#SBATCH --cpus-per-task=1
srun --mpi=pmix apptainer exec --contain --fakeroot --bind $OUT_INIT_DIR:$IN_INIT_DIR,$OUT_DATA_DIR:$IN_DATA_DIR $SIF_CONTAINER /usr/local/init/cctm-init.sh

But the obvious problem with this is that it starts two nodes OK and then runs 176 containers on each node. Each container then runs the /usr/local/init/cctm-init.sh script, which in turn launches many copies of CMAQ - causing it all to crash out.

So, it’s almost there - the issue seems to be, how to launch just one container on each node (rather than 176!), and then, how the CCTM launch scripts handle the MPI init / inter-node inter-process communications.

Am I wasting my time trying to get this working, has anyone got multi-node containerised CMAQ CCTM working?

Best

Gary
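
For what it is worth, 176 containers per node is what Apptainer's documented "hybrid" MPI model expects: the host-side launcher (srun here) starts one container instance per MPI rank, and each instance runs a single copy of the MPI executable. A crash like the one described would be consistent with cctm-init.sh calling mpirun itself, so that each of the 352 tasks tries to start its own full MPI job, but that is a guess since the script is not shown. Below is a minimal sketch under that assumption. cctm-rank.sh is a hypothetical variant of the init script that sets the same CMAQ environment but ends by running the CCTM executable directly, with no mpirun. The host-side directories are placeholders, the other values and flags are copied from the commands in this thread, and whether --mpi=pmix works depends on the Open MPI inside the container having been built with PMIx support that is compatible with Slurm's.

```shell
#!/bin/sh
#SBATCH --partition=hpchb176rsv4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=176
#SBATCH --cpus-per-task=1

# Placeholder host paths (assumptions); container paths as used in this thread.
OUT_INIT_DIR=${OUT_INIT_DIR:-/path/to/init}
OUT_DATA_DIR=${OUT_DATA_DIR:-/path/to/data}
IN_INIT_DIR=${IN_INIT_DIR:-/usr/local/init}
IN_DATA_DIR=${IN_DATA_DIR:-/data}
SIF_CONTAINER=${SIF_CONTAINER:-cmaq_cctm_5p4_ubuntu_2204.sif}

# Hypothetical per-rank script: sets the CMAQ environment, then runs the
# CCTM executable directly (one MPI rank), without calling mpirun.
RANK_SCRIPT=${RANK_SCRIPT:-/usr/local/init/cctm-rank.sh}

# Outside a Slurm allocation, only print the command (dry run).
if [ -z "${SLURM_JOB_ID:-}" ]; then RUN=echo; else RUN=; fi

launch_cctm() {
    # srun starts one container per task; PMIx wires the ranks together.
    $RUN srun --mpi=pmix apptainer exec --contain --fakeroot \
        --bind "$OUT_INIT_DIR:$IN_INIT_DIR,$OUT_DATA_DIR:$IN_DATA_DIR" \
        "$SIF_CONTAINER" "$RANK_SCRIPT"
}

launch_cctm
```

The CCTM domain decomposition (NPCOL_NPROW in the standard CMAQ run scripts) would also need to multiply out to the total task count, 352 in this case, for example 16 x 22.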