Some questions about an OpenMP restructuring of CMAQ

Hello there,

I am trying to restructure CMAQ into an OpenMP version. Right now I am working on the AERO module, which costs a lot of time in my perf results, and I have met some obstacles.

The first obstacle is that when I test some OpenMP test code in aeroproc, the OpenMP thread count is always 1, even though I actually compiled with -fopenmp and set -x OMP_NUM_THREADS=2 in my mpirun script.

When I use the same code as a standalone program, the OpenMP threads work fine; but when I use it as a subroutine in aero_driver.F, it doesn't work. (I have tried preprocessing aero_driver.F with #undef parallel so that !$omp parallel does not get macro-expanded into !$omp 1.)

My first question is: why does the above happen?

I found an NROW/NLAY/NCOL loop nest in aero_driver.F, and in my experience the NLAY loop can usually be parallelized with OpenMP, so I began the restructuring at this point.

The second obstacle is that the AERO module has some global variables, which form the context before the parallel region. These globals are mostly assigned in the EXTRACT_xxx subroutines but do not change inside the parallel region, and all of the EXTRACT_xxx routines are called before the parallel region. I tried THREADPRIVATE + COPYIN to copy these globals to every thread, but I got some core dumps.

My second question is: is the above method good enough for the OpenMP restructuring? Is there an easier way to do it?

Thank you for your attention~Any opinion will be appreciated~ :blush:

Have you correctly inserted OpenMP directives for the loops you wish to parallelize?

There are a number of OpenMP codes in the I/O API M3Tools; you might want to look at them as examples of how this should be done. For example, look at vertintegral.f90 and presz.f90 as being the most “model-like” of the codes there.

Note that both load-balancing and parallel overhead are significant issues: you generally want to parallelize outer loops, in order to maximize the amount of work per parallel-loop invocation; and you also generally get better results when you parallelize over a large subscript-range (e.g., the "remainder-imbalance" is much smaller for, say, 200 rows than it is for 30 layers).

I did a thorough OpenMP-parallelization of CMAQ-cousin MAQSIP-RT thirty years ago for the world’s first air-quality forecast system (in a joint NC Supercomputer Center – Penn State project), and then similarly for CAMx for a CARB project, and achieved quite good scalability (far better than the scalability for CMAQ at the time). CMAQ is structured in a way that makes efficient OpenMP parallelization difficult, at best – and (for example) whose “centralized-I/O module” prevents task-parallelization of the input/time-interpolate tasks (something that was trivial with MAQSIP, since the I/O API INTERP3 for doing that is thread-safe).

Hi rzbbc,

For Q1, please try to use setenv OMP_NUM_THREADS 2 in the run script rather than on the mpirun command line. Also you can double-check this setting in your code with the following Fortran statement (note that OMP_GET_NUM_THREADS() returns 1 when called outside a parallel region, so use OMP_GET_MAX_THREADS() for this check):

write (6, *) ' ==d== ', omp_get_max_threads()

For Q2, please provide a more detailed description of how you defined your parallel regions and in which subroutine. Also, if you are using the ifort compiler, please add -traceback, which could help you identify the origin of the core dump. You have the right idea moving forward with OpenMP. However, you need to ask yourself why you want to adopt the hybrid approach, MPI+OpenMP (I assume that is the case; please correct me if I am wrong). When all the data are private, OpenMP operates in the most efficient manner. As you know, that is not the case in CMAQ, where there are quite a few global (shared) variables. You have to copy them into private data, which costs CPU time. Would it be more efficient to convert those thread resources into MPI processes? This is my personal thought.

Cheers,

David

Thank you for your reply :blush: ~ I have just returned from my National Day vacation today.

I’m very interested in the load-balancing part.

My parallel region now is the AERO module, which is located in

CMAQ-5.4/CCTM/src/aero/aero6/aero_driver.F line 351-471

link is CMAQ/CCTM/src/aero/aero6/aero_driver.F at 5.4 · USEPA/CMAQ · GitHub

In my test case, NLAY is 33, and NCOL and NROW may be around 400 and 300.

But in my experience, although it may give somewhat worse load balancing, choosing the NLAY loop as the parallel region avoids many data-dependency problems.

I think we have similar experience :blush: ~ I also did OpenMP parallelization for CAMx. I did some optimization in CAMx/emiss.f (newest version) that achieved a good speedup. All you need to do is resolve the data dependency in the latter loop and hoist the string comparison out of the loop. Unfortunately I can't provide the code.

I'm sure I am using (not just inserting) the OpenMP code correctly.

It is code like the following:

      USE OMP_LIB
!     ... other code ...

!$omp parallel default(shared)
      tn = OMP_GET_NUM_THREADS()
      ti = OMP_GET_THREAD_NUM()
      write(*,*) "parallel rate: tn=", tn, " ti=", ti
!$omp end parallel

I have cleaned up the code in the loop and the loop itself, and it turns out that tn=2, ti=1.

I think some setting may be wrong on my machine; I will try to fix it later.

Thank you for your reply~ :blush:

My parallel region now is the AERO module, which is located in

CMAQ-5.4/CCTM/src/aero/aero6/aero_driver.F line 351-471


I will try your suggestion; it must be something wrong in my settings.

And your assumption is right: I am trying to do an MPI+OpenMP hybrid restructuring. In my case, it turns out that with pure MPI and an MPI rank count > 256, the scaling benefit decreases very fast.

Thank you for pointing out that global variables in OpenMP will introduce fork and memory costs.

In fact I don't actually have a clear reason why I want to try MPI+OpenMP; maybe it's just a try :sweat_smile:.

On the basis of my experience parallelizing MAQSIP-RT, I strongly disagree with wong.david-c: there is no extra cost for using SHARED data.

You do have a number of problems to deal with.

First, the treatment of the global variables: they should almost-certainly be declared SHARED rather than doing a copy-in/copy-out: see, for example, the OMP-parallel directive at lines 450-453 of M3Tools program VERTINTEGRAL https://github.com/cjcoats/ioapi-3.2/blob/master/m3tools/vertintegral.f90. “Scratch” work-variables should be declared PRIVATE, and you need to ensure there are no loop-dependencies for them.
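As a concrete illustration of that directive style, here is a hedged sketch modeled loosely on the VERTINTEGRAL example; the variable names (WTS, CONC, TOTAL, SUM) are illustrative, not CMAQ's:

```fortran
!$OMP PARALLEL DO DEFAULT( NONE ),
!$OMP&            SHARED( NCOLS, NROWS, NLAYS, WTS, CONC, TOTAL ),
!$OMP&            PRIVATE( C, R, L, SUM )
      DO R = 1, NROWS
          DO C = 1, NCOLS
              SUM = 0.0
              DO L = 1, NLAYS
                  SUM = SUM + WTS( L ) * CONC( C,R,L )   ! read-only globals stay SHARED
              END DO
              TOTAL( C,R ) = SUM        ! each (C,R) is written by exactly one thread
          END DO
      END DO
!$OMP END PARALLEL DO
```

The pattern to note: read-only data stays SHARED at no cost, the "scratch" accumulator SUM is PRIVATE, and no two threads ever write the same array element.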

Next (and quite probably the core-dump problem): You must ensure that all the routines called inside a parallel region are thread-safe – and there are a lot of them:

  • EXTRACT_AERO
  • EXTRACT_PRECURSOR
  • EXTRACT_SOA
  • UPDATE_AERO
  • UPDATE_PRECURSOR
  • UPDATE_ORGVAPOR
  • calcmoments
  • GETPAR
  • FEEDBACK_WRITE

Third: about load-balancing: for NLAYS=33 layers and 16 processors, for example, one of the processors will have 3 loop-iterations while all the rest have 2, so that you have idle (un-balanced) processors 34% of that loop’s compute-time. If instead you parallelize on rows and re-organize the main loop as

DO R = 1, NROWS
   DO L = 1, NLAYS
      DO C = 1, NCOLS
         ...
      END DO
   END DO
END DO

you will trade off a tiny amount of cache-locality for much better division-properties in the load balancing, at least for “reasonable” numbers of processors (say 64 or fewer).
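A hedged sketch of what that row-parallel version might look like (the array name CGRID and the per-cell routine AERO_CELL are placeholders, not CMAQ's actual code):

```fortran
!$OMP PARALLEL DO DEFAULT( NONE ),
!$OMP&            SHARED( NROWS, NLAYS, NCOLS, CGRID ),
!$OMP&            PRIVATE( R, L, C )
      DO R = 1, NROWS            ! 300-400 iterations: small remainder-imbalance
          DO L = 1, NLAYS
              DO C = 1, NCOLS
                  CALL AERO_CELL( CGRID, C, R, L )   ! hypothetical per-cell work
              END DO
          END DO
      END DO
!$OMP END PARALLEL DO
```

With, say, 16 threads over 350 rows, the remainder is at most one row out of ~22 per thread, versus one layer out of 2 per thread in the 33-layer case.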

Fourth, because the CMAQ interpolate_var is not thread-safe the way the I/O API INTERP3 is, you cannot reduce the serial-code overhead using parallel sections the way I did in MAQSIP-RT:

!$OMP PARALLEL SECTIONS
!$OMP& DEFAULT( NONE ),
!$OMP& SHARED( MDATE, MTIME, NLAYS, TA, QV, PRES, DENS ),
!$OMP& PRIVATE( MESG )

!$OMP SECTION

    IF ( .NOT. INTERP3( METCRO3D, 'TA', PNAME, .TRUE.,
 &                      NLAYS, MDATE, MTIME,
 &                      TA ) ) THEN
        MESG = 'Could not read TA from ' // METCRO3D
        CALL GRID_EXIT( PNAME, MDATE, MTIME, MESG, 2 )
    END IF      !  if INTERP3 failed

C… Read & Interpolate QV

!$OMP SECTION

    IF ( .NOT. INTERP3( METCRO3D, 'QV', PNAME, .TRUE.,
 &                      NLAYS, MDATE, MTIME,
 &                      QV ) ) THEN
        MESG = 'Could not read QV from ' // METCRO3D
        CALL GRID_EXIT( PNAME, MDATE, MTIME, MESG, 2 )
    END IF      !  if INTERP3 failed

C… Read & Interpolate PRES

!$OMP SECTION

    IF ( .NOT. INTERP3( METCRO3D, 'PRES' , PNAME, .TRUE.,
 &                      NLAYS, MDATE, MTIME,
 &                      PRES ) ) THEN
        MESG = 'Could not read PRES from ' // METCRO3D
        CALL GRID_EXIT( PNAME, MDATE, MTIME, MESG, 2 )
    END IF  !  if INTERP3 failed

C… Read & Interpolate DENS

!$OMP SECTION

    IF ( .NOT. INTERP3( METCRO3D, 'DENS' , PNAME, .TRUE.,
 &                      NLAYS, MDATE, MTIME,
 &                      DENS ) ) THEN
        MESG = 'Could not read DENS from ' // METCRO3D
        CALL GRID_EXIT( PNAME, MDATE, MTIME, MESG, 2 )
    END IF  !  if INTERP3 failed

!$OMP END PARALLEL SECTIONS

Hi rzbbc,

Let me address a few points here:
1\. Dr. Coats has stated “CMAQ is structured in a way that makes efficient OpenMP parallelization difficult, at best – and (for example) whose “centralized-I/O module” prevents task-parallelization of the input/time-interpolate tasks”. Well when I parallelized entire CMAQ code in 1998, I took the MPI approach and did not consider MPI-OpenMP hybrid methodology. With this design in mind, it might now be 100% ready to bring in OpenMP construct. You have also noticed that interpolate_var subroutine in centralized_io_module.F is not thread safe due to the fact that it uses Fortran90 type of assign statement (a = b + c). Only it is converted into loop, with OpenMP directive, it will be thread safe. IOAPI was designed for serial operation and shared memory platforms, so its applicable environment is limited. CMAQ is designed to work on distributed memory architecture (so far have not considered hybrid approach) and Dr. Al Bourgeois and I have to modify/expand IOAPI so it can be run on distributed memory machines. Yes CMAQ needs some work in order to make it to be in MPI-OpenMP mode. Will this approach bring in any performance increase is another topic and the basis to drive the work?

2. Dr. Coats said "I strongly disagree with wong.david-c: there is no extra cost for using SHARED data." I guess it depends on how one defines the term cost. In reality, there is a potential for memory contention which can hinder performance. That is the reason why, when you learn the OpenMP approach, you are told to put things in PRIVATE as much as possible. Putting things in PRIVATE through a copy operation can be costly. Remember, there is no free lunch in this world.

3. Load-imbalance issue: I agree with Dr. Coats' assessment that when you use 16 threads on 33 layers, one thread will have 50% more work than the rest. However, I suggest using 2 or 3 threads. The reason is that in MPI-OpenMP hybrid mode, the MPI side should be the dominant portion, in my opinion. Consider the following: given a system with M nodes where each node has N cores, in your run script you typically need to declare how many cores in each node will be used as MPI processes (say n1); the rest of the cores in that node, n2 (n1 + n2 = N), can be used for the OpenMP side (for discussion and simplicity purposes, I don't consider hyper-threads), along with how many nodes you need in total. With this picture in mind, when the code is in MPI-only mode, those n2 cores are idle; in OpenMP mode, those n1 cores are idle. You can see that the optimal way to utilize the system w.r.t. CMAQ (or your code) is a huge question. This leads to the following question: which one is better, MPI only or MPI-OpenMP hybrid?

4. Dr. Coats has suggested swapping the NROWS and NLAYS loops. I don't see a need, since the magnitude of NROWS is very similar to NLAYS once the domain is decomposed.

Cheers,

David


For array operations:

!$omp workshare
C = A + B
!$omp end workshare

I recall that back in the early 2000's, MAQSIP-RT was still on the "good" part of the scaling curve for 32 processors (32-core runs were about 2% faster than 31-core runs, where the "perfect" speedup would be 3.1%; that was the biggest shared-memory machine to which we had access), while CMAQ at that time did not scale past 8 processors (my employer was running both). I took the time to diagnose the reason for that CMAQ problem, so that it was eventually fixed…

On 2: When there are memory-contention issues in OpenMP, it is almost always because you have a lousy data distribution to begin with – something that in practice does not happen in (air quality, meteorological, or hydrological) environmental models. Otherwise SHARED and PRIVATE have essentially the same cost, except in the case of PRIVATE run-time-dimensioned arrays.

For 3: In principle, for a target machine with M nodes, each of which has N-way cores, you will get your best M*N-way-parallel performance by doing M-way distributed parallelism of N-way shared-parallel computation.

Hi, Dr. wong.david-c and Dr. Coats,

Thank you for your replies~ Your discussion has enlightened me a lot.

In my opinion, for an OpenMP restructuring of the current CMAQ code, fork and memory-occupancy costs are unavoidable. Some background and analysis from my point of view are below, and I will try to give my solution.

Background: code features

B.1. CMAQ tries to use an OOP design, using Fortran modules like C++ classes.

B.2. CMAQ's modules define many global variables, and such a module usually has an "init global" subroutine to allocate or initialize them.

B.3. CMAQ's subroutines sometimes have ALLOCATABLE, SAVE variables, which may cause data conflicts.

B.4. Modules in CMAQ sometimes use the USE, ONLY syntax, which can leave a global variable not visible in the parallel region; in this case neither PRIVATE nor SHARED works.

B.5. Many subroutines have a LOGICAL, SAVE :: FIRSTIME variable; this may be designed to avoid repeated allocate/deallocate costs. But I think wrapping the first-time block in !$OMP CRITICAL can solve this problem.
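A minimal sketch of the !$OMP CRITICAL treatment of B.5 (WORKBUF and its shape are hypothetical placeholders; the named critical section serializes only the first-time setup, and !$OMP SINGLE inside the parallel region would be an alternative):

```fortran
      LOGICAL, SAVE :: FIRSTIME = .TRUE.

!$OMP CRITICAL( FIRSTIME_INIT )
      IF ( FIRSTIME ) THEN
          ALLOCATE( WORKBUF( NCOLS, NROWS ) )   ! hypothetical one-time setup
          FIRSTIME = .FALSE.
      END IF
!$OMP END CRITICAL( FIRSTIME_INIT )
```

Every thread still pays the cost of entering the critical section on every call, so hoisting the one-time initialization out of the parallel region entirely, when possible, is cheaper.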

Analysis: the reason why some variables cannot simply use PRIVATE/SHARED

Some variables are allocated before the parallel region.

For example, the subroutine MAP_AERO, which is called by EXTRACT_AERO, is the "init global" subroutine of the AERO module (B.2), and it is called outside the whole aero method. This means that before the parallel region these variables may already hold values. If we don't want to do more analysis, those values should be inherited into the threads.

In most cases, a global variable as described above can be set SHARED, because it never changes in the parallel region. But some global variables are changed in the parallel region and are used to generate the module's output; typical examples are aerospc_conc and moment2_conc. Because they are allocated and initialized before the parallel region, if set PRIVATE their values would be undefined inside it. I can't be sure whether their values must be preserved, so I used the THREADPRIVATE + COPYIN method to carry their values into the parallel region.

A fact is that not all global variables must be set SHARED or THREADPRIVATE + COPYIN, which means some fork cost can be avoided. But, as I imagine it, the cost of copying a large global variable like aerospc_conc may be quite high, because aerospc_conc is a matrix.
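The THREADPRIVATE + COPYIN pattern described above looks roughly like this (a sketch: aerospc_conc is the real CMAQ name, but the declaration shape and the region contents here are placeholders):

```fortran
!     In the module:
      REAL, ALLOCATABLE, SAVE :: aerospc_conc( :,: )
!$OMP THREADPRIVATE( aerospc_conc )

!     In the driver: COPYIN replicates the master thread's allocation
!     and values into each thread's private copy at every fork --
!     this per-fork copy is exactly the cost being discussed here.
!$OMP PARALLEL DEFAULT( SHARED ) COPYIN( aerospc_conc )
!     ... per-thread aerosol work using aerospc_conc ...
!$OMP END PARALLEL
```

Note that COPYIN of an allocatable THREADPRIVATE variable requires a reasonably recent OpenMP level; older compilers may reject it, which could be related to the core dumps.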

Code refs:

subroutine MAP_AERO: CMAQ/CCTM/src/aero/aero6/AERO_DATA.F at 5.4 · USEPA/CMAQ · GitHub, line 841

aerospc_conc assignment and allocation: CMAQ/CCTM/src/aero/aero6/AERO_DATA.F at 5.4 · USEPA/CMAQ · GitHub, assigned at line 723, allocated at line 1019 (in MAP_AERO)

subroutine EXTRACT_AERO: CMAQ/CCTM/src/aero/aero6/AERO_DATA.F at 5.4 · USEPA/CMAQ · GitHub, line 1285

parallel region: CMAQ/CCTM/src/aero/aero6/aero_driver.F at 5.4 · USEPA/CMAQ · GitHub, lines 351-471

MAP_AERO called before the parallel region: CMAQ/CCTM/src/driver/driver.F at 5.4 · USEPA/CMAQ · GitHub, line 391 (the AERO module is called later inside subroutine sciproc, line 731)

The above is most of my view on CMAQ's OpenMP restructuring, and there are still many other jobs to do. As Dr. Coats pointed out, interpolate_var is not thread safe, and there may be other subroutines with the same problem.

Maybe my method is not so good; it's just a trade-off between efficiency and functional completeness.

I’m looking forward to your comment.

Hi, Dr. wong.david-c, thank you for your reply~

For Q1:

We should think about some questions below.

Will CMAQ be asked to solve pollution cases at higher precision, which may require more compute resources? I think CPU core counts will keep growing to satisfy the higher compute requirements, due to Moore's Law. In that case, if the MPI scaling benefit decays fast, OpenMP can be a good supplement to the scaling benefit.

for Q2:

I agree with you and Dr. cjcoats. You are actually discussing two different things, SHARED/PRIVATE versus THREADPRIVATE + COPYIN; on the former I agree with Dr. cjcoats, and on the latter I agree with you (the cost of copying the context at thread fork may be large). Some questions in my previous reply are basically due to my limited knowledge of the CMAQ application. Maybe a more efficient way to restructure exists.

for Q3:

My machine is NUMA. To avoid data striding across sockets or dies, I use a shell script like the one below to achieve better data locality.

# assume the node has 128 CPU cores: 2 NUMA domains, 64 cores per domain
numacore=2
NPCOL=8
NPROW=8
threads=2
NPROCS=$(( NPCOL * NPROW ))
ppr=$(( NPROCS / numacore ))
mpirun ... --map-by ppr:$ppr:numa:PE=$threads -np $NPROCS $EXEC

to control the MPI-OpenMP hybrid.

Considering the config above:

When stepping into the hybrid MPI-OpenMP parallel region, all 128 cores will be used.

When not in a hybrid parallel region, CMAQ uses MPI only, which occupies 64 cores.

I think MPI-OpenMP will be the better choice for "bigger" cases.

Cheers,

rzbbc

What you describe about aerospc_conc is exactly what I meant when I said

CMAQ is structured in a way that makes efficient OpenMP parallelization difficult, at best…

This is a "scratch variable". As such, it never should have been a MODULE variable, nor should it have been ALLOCATEd – it should be a (local) "auto" variable. This is a violation of modularity, and is bad programming practice in general, not just for an MPI code. See the "Principle of Locality" in the I/O API's updated copy of the original set of coding standards for CMAQ.
If this were properly a local variable, there would have been no trouble declaring it PRIVATE for OpenMP. Note also that making it a MODULE variable prevents optimizations which the compiler otherwise would have done…
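A sketch of the "auto" variable style being recommended (the subroutine and array names here are hypothetical, not CMAQ's):

```fortran
      SUBROUTINE AERO_CELL_WORK( NSPCS, NMODES )
      INTEGER, INTENT( IN ) :: NSPCS, NMODES
!     local "auto" scratch array: sized from the dummy arguments,
!     stack-allocated on entry, distinct for every call and every
!     thread -- no ALLOCATE, no MODULE state, trivially safe when
!     the caller invokes this routine inside an OpenMP loop
      REAL :: CONC_SCRATCH( NSPCS, NMODES )
!     ... use CONC_SCRATCH locally ...
      END SUBROUTINE AERO_CELL_WORK
```

Because the scratch storage has no life outside the call, the compiler is also free to keep it in registers or cache-friendly stack frames.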

Hi rzbbc,

Q1: Parallelism will reduce the run time. In reality, as you increase the number of processors, the efficiency will deviate from the ideal speedup curve. This is true for all applications, even for embarrassingly parallel programs. The only difference is how quickly the application's performance deviates from the ideal curve. One side note: the performance also depends on how the domain is decomposed (related to the computation-to-communication ratio). For example, for an application that depends heavily on communication along the x-axis, a 1 x n decomposition will perform better than n x 1. The bottom line is: use the computational resources wisely. Whether OpenMP can be a good supplement to the scaling benefit is, to me, still an open question (application dependent). Please note that you have misused Moore's Law :slight_smile:

Q2. Please note that CMAQ was parallelized based on MPI only (an MPI-OpenMP hybrid was not in mind whatsoever). As you have noticed, you need some restructuring to give CMAQ the hybrid capability. I don't deny that. We always welcome constructive criticism, and otherwise we respect everyone's opinion.

Q3. You are absolutely correct. Binding a thread to a particular processor is critical for performance in an OpenMP application. Please let us know what you find with respect to your statement "I think mpi-openmp will be a better choice for 'bigger' case."

Cheers,

David

Hi, wong.david-c

The statement is from my tests, but I am not very sure it's correct. When scaling the MPI rank count from 128 → 256 → 512, the time consumption goes from 23 → 19 → 17, which is very bad scaling in my opinion. However, the MPI comm% reported by IPM does not differ much between the runs; it is about 40%.

I think something that costs time is increasing linearly with the MPI rank count, but I haven't located the code yet. I'd be grateful if you could suggest some possible code blocks for this case~ :blush:

Cheers,

rzbbc

Hi, Dr. cjcoats~ Thank you for your reply :blush: , it looks like we agree on the difficulty of an OpenMP restructuring of CMAQ.

It does sometimes bother me~ However, as Dr. Wong says, the CMAQ code was written a long time ago, and at that time a hybrid parallel method was not taken into account.

It will be a long road to a good MPI+OpenMP version of CMAQ, considering the engineering complexity.

I will try some local OpenMP on the CMAQ hot spots I detect, rather than a "module range" OpenMP, which may avoid some of the problems we talked about. But I think this is a temporary solution.

My opinion is that it's hard to restructure CMAQ by hand now. A better way may be to write an automatic OpenMP tool designed to parallelize CMAQ (or other applications with CMAQ's code design). In the future, I think Fortran code debt will appear more and more frequently.

Hi rzbbc.

In your test, what is the size of the domain, i.e. ncols x nrows?

Cheers,

David

Hi, Dr. wong.david-c

In fact, I use the 2018_12US1 case as my test case~ :blush:

cheers,

rzbbc

Hi rzbbc,

Thanks for the info. Internally we are using 256.

Cheers,

David

Hi, Dr. wong.david-c, I tested a zero-emission test case with 512 MPI ranks, on relatively better storage hardware, and the run time vs. 256 ranks is about 5/8, i.e. almost 62.5%, which I think is a good scaling rate (except for the emission imbalance).

But I found that CLDPROC's time consumption increases as the timesteps advance, which confuses me. Does this phenomenon happen on your machine too?

Supplement: after some perf profiling, I located a code block in CMAQ/CCTM/src/cloud/acm_ae6/rescld.F at 93099d8159f50e525120c790df674fe837a57edf · USEPA/CMAQ · GitHub, which is a loop over nrow, ncol, nspc.

Using GPT to analyze this code, it told me that as this block runs, the number of grid cells with cloud increases, which increases the compute workload. Is that right?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.