Hi rzbbc,
Let me address a few points here:
1. Dr. Coats has stated: “CMAQ is structured in a way that makes efficient OpenMP parallelization difficult, at best – and (for example) whose “centralized-I/O module” prevents task-parallelization of the input/time-interpolate tasks”. When I parallelized the entire CMAQ code in 1998, I took the MPI approach and did not consider an MPI-OpenMP hybrid methodology. With that design in mind, the code might not be 100% ready to bring in OpenMP constructs. You have also noticed that the interpolate_var subroutine in centralized_io_module.F is not thread safe, because it uses Fortran 90 array assignment statements (a = b + c). Only once it is converted into an explicit loop with an OpenMP directive will it be thread safe. IOAPI was designed for serial operation on shared-memory platforms, so its applicable environment is limited. CMAQ is designed to work on distributed-memory architectures (so far we have not considered a hybrid approach), and Dr. Al Bourgeois and I had to modify/expand IOAPI so it could run on distributed-memory machines. Yes, CMAQ needs some work to run in MPI-OpenMP mode. Whether this approach will bring any performance increase is another topic, and it should be the basis that drives the work.
2. Dr. Coats said: “I strongly disagree with wong.david-c: there is no extra cost for using SHARED data.” I guess it depends on how he defines the term “cost”. In reality, there is potential for memory contention, which can hinder performance. That is the reason why, when you learn the OpenMP approach, you are told to put things in PRIVATE as much as possible. But putting things in PRIVATE through copy operations can itself be costly. Remember, there is no free lunch in this world.
3. Load-imbalance issue: I agree with Dr. Coats’s assessment that when you use 16 threads on 33 layers, one thread will have 50% more work than the rest. However, I suggest using 2 or 3 threads instead. The reason is that in MPI-OpenMP hybrid mode, the MPI side should be the dominant portion, in my opinion. Consider the following: given a system with M nodes, each with N cores, your run script (with its job description) typically declares how many cores per node will be used as MPI processes (say n1); the rest of the cores on that node, n2 (n1 + n2 = N), can be used for the OpenMP side (for discussion and simplicity purposes, I don’t consider hyperthreads), along with how many nodes you will use in total. With this picture in mind, when the code runs in MPI-only sections those n2 cores are idle, and in OpenMP sections those n1 cores are idle. You can see that the optimal way to utilize the system w.r.t. CMAQ (or your code) is a big question. This leads to the following question: which one is better, MPI only or MPI-OpenMP hybrid?
4. Dr. Coats has suggested swapping the NROWS and NLAYS loops. I don’t see a need, since the magnitude of NROWS is very similar to that of NLAYS once the domain is decomposed.
Cheers,
David