CCTM performance is much slower running in a container than from a filesystem: is this expected?

I am seeing a huge difference in CMAQ CCTM performance on an Azure CycleCloud / Slurm cluster when running in a container (a Docker container converted to run under Apptainer/Singularity, otherwise unmodified) versus running from a filesystem in a custom VM image.

I expected some performance drop due to the container filesystem overhead, but I did not expect a factor of 2-4x. Can anyone confirm whether this is expected? If it is not, what might I need to do to improve containerised run performance?

I have run a bunch of tests on the same CMAQ CCTM run data (but no multi-node containerised runs, as I just could not get those to work…), and it looks like the only way to get CMAQ CCTM to run quickly is from a filesystem (especially multi-node with BeeGFS). Is this correct / what others have found?
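For clarity, by "running in a container" I mean a single-node launch along these lines (the image name, bind paths and executable path below are placeholders, not my exact scripts):

```shell
# Single-node containerised launch (illustrative names and paths).
# The input/output directory is bind-mounted from the host so that
# CCTM's NetCDF I/O goes to the host filesystem, not the image overlay.
apptainer exec \
    --bind /shared/cmaq_data:/data \
    cmaq_cctm.sif \
    mpirun -np 160 /opt/CMAQ/CCTM/BLD_CCTM_v54_gcc/CCTM_v54.exe
```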

Single Node Container - 1x 160 cores (HB176rs_v4) NFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.

Num Day Wall Time
01 2024-09-10 1663.12
02 2024-09-11 2007.64
03 2024-09-12 1478.06
Total Time = 5148.82
Avg. Time = 1716.27

Single Node Filesystem - 1x 160 cores (HB176rs_v4) NFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.
Num Day Wall Time
01 2024-09-10 905.39
02 2024-09-11 1189.18
03 2024-09-12 882.81
Total Time = 2977.38
Avg. Time = 992.46

Single Node Filesystem - 1x 160 cores (HB176rs_v4) BeeGFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 160
All times are in seconds.
Num Day Wall Time
01 2024-09-10 456.00
02 2024-09-11 448.76
03 2024-09-12 447.95
Total Time = 1352.71
Avg. Time = 450.90

Two Node Filesystem - 2x 120 cores (HB120rs_v3) BeeGFS:

***** CMAQ TIMING REPORT *****

Start Day: 2024-09-10
End Day: 2024-09-12
Number of Simulation Days: 3
Domain Name: MDE27
Number of Grid Cells: 864360 (ROW x COL x LAY)
Number of Layers: 40
Number of Processes: 240
All times are in seconds.
Num Day Wall Time
01 2024-09-10 380.24
02 2024-09-11 371.60
03 2024-09-12 368.80
Total Time = 1120.64
Avg. Time = 373.54

Note: CMAQ CCTM v5.4, built with the GCC compiler.

Hi @lizadams @cjcoats - please forgive my prompting, but you both seem to have experience in this area. Would you have any knowledge or observations you could share relating to my tests running CMAQ CCTM in a container vs from a filesystem?

My AQ specialists would like to run their simulations from Docker containers on our Azure CycleCloud/Slurm cluster environment, but we don’t seem to be able to hit the project’s simulation cycle-time requirements when using them:

  • I can’t get containers to work across multiple cluster nodes.
  • Syncing the containers and their input/output data to the compute nodes’ local NVMe drives gives only a marginal (~10%) performance increase.
  • The simulations don’t scale well even within a single node: there is not much performance increase going from 60 cores to 120 cores.
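On the multi-node point: my understanding is that the usual "hybrid" approach is to have Slurm's host-side PMI launch one containerised process per MPI rank, roughly as sketched below (node counts aside, the image name, bind paths and executable path are placeholders). I have not managed to get this working yet, so corrections are welcome:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=120

# Hybrid-model launch (illustrative names and paths): srun starts one
# container instance per MPI rank via PMI2, so the MPI wire-up happens
# on the host side and inter-node traffic can use the host InfiniBand
# stack rather than anything inside the image.
srun --mpi=pmi2 apptainer exec \
    --bind /beegfs/cmaq_data:/data \
    cmaq_cctm.sif \
    /opt/CMAQ/CCTM/BLD_CCTM_v54_gcc/CCTM_v54.exe
```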

So, I have built a custom HPC VM image with CMAQ installed in it (running from a filesystem rather than a container, which also lets me run across multiple nodes in parallel), and I sync the input/output data to a clustered BeeGFS filesystem built on the nodes’ local NVMe drives. This gives a huge speed-up (4x) on just one node, even more with two nodes (4-6x), and diminishing returns thereafter.
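The data staging step here is nothing exotic; it is essentially the following (paths are illustrative, not my exact layout):

```shell
# Stage inputs onto the BeeGFS volume built from the nodes' local NVMe
# drives, run CCTM against it, then copy the results back off.
rsync -a /nfs/cmaq/inputs/ /beegfs/cmaq/inputs/
# ... run CCTM with its input/output directories pointed at /beegfs ...
rsync -a /beegfs/cmaq/outputs/ /nfs/cmaq/outputs/
```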

Am I likely out of luck trying to get containerised CMAQ to perform the way it does on a clustered filesystem?

Would you expect such a performance difference (presumably due to the container’s IO layer)?

I believe Apptainer (Singularity) is a commonly used container runtime in HPC environments (due to not needing root access on the nodes), but are there better-performing solutions?

Best Regards
Gary