CCTM performance is much slower running in a container than from a filesystem: is this expected?

I am seeing a huge difference in CMAQ CCTM performance on an Azure CycleCloud / Slurm cluster when running in a container (a Docker container converted to run under Apptainer/Singularity, otherwise unmodified) versus running from a filesystem in a custom VM image.

I expected some performance drop due to the container filesystem overhead, but I did not expect a factor of 2-4x. Can anyone confirm whether this is expected? If it is not, what might I need to do to improve containerised run performance?

I have run a bunch of tests on the same CMAQ CCTM run data (but no multi-node containerised runs, as I just could not get those to work…), and it looks like the only way to get CMAQ CCTM to run quickly is from a filesystem (especially multi-node with BeeGFS). Is this correct / what others have found?
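For clarity, by "running in a container" I mean a single-node launch along these lines (the image name, bind paths and executable path below are placeholders, not my exact scripts):

```shell
# Single-node containerised launch (illustrative names and paths).
# The input/output directory is bind-mounted from the host so that
# CCTM's NetCDF I/O goes to the host filesystem, not the image overlay.
apptainer exec \
    --bind /shared/cmaq_data:/data \
    cmaq_cctm.sif \
    mpirun -np 160 /opt/CMAQ/CCTM/BLD_CCTM_v54_gcc/CCTM_v54.exe
```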

Single Node Container - 1x 160 cores (HB176rs_v4) NFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.

Num Day Wall Time
01 2024-09-10 1663.12
02 2024-09-11 2007.64
03 2024-09-12 1478.06
Total Time = 5148.82
Avg. Time = 1716.27

Single Node Filesystem - 1x 160 cores (HB176rs_v4) NFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.
Num Day Wall Time
01 2024-09-10 905.39
02 2024-09-11 1189.18
03 2024-09-12 882.81
Total Time = 2977.38
Avg. Time = 992.46

Single Node Filesystem - 1x 160 cores (HB176rs_v4) BeeGFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.
Num Day Wall Time
01 2024-09-10 456.00
02 2024-09-11 448.76
03 2024-09-12 447.95
Total Time = 1352.71
Avg. Time = 450.90

Two Node Filesystem - 2x 120 cores (HB120rs_v3) BeeGFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 240
All times are in seconds.
Num Day Wall Time
01 2024-09-10 380.24
02 2024-09-11 371.60
03 2024-09-12 368.80
Total Time = 1120.64
Avg. Time = 373.54

Note: CMAQ CCTM v5.4, built with the GCC compiler.

Hi @lizadams @cjcoats - please forgive my prompting, but you both seem to have experience in this area. Would you have any knowledge or observations you could share relating to my tests running CMAQ CCTM in a container vs from a filesystem?

My AQ specialists would like to run their simulations from Docker containers on our Azure CycleCloud/Slurm cluster environment, but we don’t seem to be able to hit the project’s simulation cycle-time requirements when using them:

  • I can’t get containers to work across multiple cluster nodes.
  • Syncing the containers and their input/output data to the compute nodes’ local NVMe drives gives only a marginal (~10%) performance increase.
  • The simulations don’t scale well even within a single node: there is not much performance increase going from 60 cores to 120 cores.
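On the multi-node point: my understanding is that the usual "hybrid" approach is to have Slurm's host-side PMI launch one containerised process per MPI rank, roughly as sketched below (node counts aside, the image name, bind paths and executable path are placeholders). I have not managed to get this working yet, so corrections are welcome:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=120

# Hybrid-model launch (illustrative names and paths): srun starts one
# container instance per MPI rank via PMI2, so the MPI wire-up happens
# on the host side and inter-node traffic can use the host InfiniBand
# stack rather than anything inside the image.
srun --mpi=pmi2 apptainer exec \
    --bind /beegfs/cmaq_data:/data \
    cmaq_cctm.sif \
    /opt/CMAQ/CCTM/BLD_CCTM_v54_gcc/CCTM_v54.exe
```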

So, I have built a custom HPC VM image with CMAQ installed in it (running from a filesystem rather than a container, which also lets me run across multiple nodes in parallel), and I sync the input/output data to a clustered BeeGFS filesystem built on the nodes’ local NVMe drives. This gives a huge speed-up (4x) on just one node, even more with two nodes (4-6x), and diminishing returns thereafter.
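The data staging step here is nothing exotic; it is essentially the following (paths are illustrative, not my exact layout):

```shell
# Stage inputs onto the BeeGFS volume built from the nodes' local NVMe
# drives, run CCTM against it, then copy the results back off.
rsync -a /nfs/cmaq/inputs/ /beegfs/cmaq/inputs/
# ... run CCTM with its input/output directories pointed at /beegfs ...
rsync -a /beegfs/cmaq/outputs/ /nfs/cmaq/outputs/
```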

Am I likely out of luck trying to get containerised CMAQ to perform the way it does on a clustered filesystem?

Would you expect such a performance difference (presumably due to the container’s IO layer)?

I believe Apptainer (Singularity) is a commonly used container runtime in HPC environments (due to not needing root access on the nodes), but are there better-performing solutions?

Best Regards
Gary