Hi there~
Thank you for reading this post~
I have been trying to optimize cldproc recently, and I located a hotspot in subroutine RESCLD.
After inserting some instrumentation and testing, I realized that there is a big load imbalance in RESCLD, mostly in subroutine scavwdep. Unfortunately, the code in scavwdep contains a loop with a condition in it, which makes it hard for me to keep inserting code to watch its behavior.
I read through the whole of scavwdep.F, and my conclusion is that it calculates ALFA in four different parts (GC, AE, NR, and TR), and calculates CEND and REMOV for each. During the calculation, subroutine GETALPHA is called to get ALFA0, ALFA2, and ALFA3, and function HLCONST is called to get KH.
Two possible optimizations I found are below:
1. Separate the ALFA calculation from the CEND/REMOV calculation; then the EXP(-ALFA * TAUCLD) in the CEND calculation could be vectorized (see the sketch after this list). But that code does not seem to account for a large share of the runtime.
2. The search algorithm in HLCONST is a linear search, which could be optimized into a hash lookup (I think this may be the core hotspot).
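To illustrate idea 1, here is a minimal sketch of the loop split. The names (ALFA, TAUCLD, CEND, REMOV) follow the discussion above, but the subroutine, its arguments, and the ALFA formula are invented placeholders, not the actual scavwdep.F code:

```fortran
! Hypothetical sketch of idea 1, not the actual scavwdep.F code:
! split the branchy ALFA computation from the EXP evaluation so the
! second loop is a straight-line kernel the compiler can vectorize.
subroutine split_scav( n_spc, kh, taucld, cend, remov )
  implicit none
  integer, intent(in)  :: n_spc
  real,    intent(in)  :: kh(n_spc)    ! Henry's law constants
  real,    intent(in)  :: taucld       ! cloud timescale (s)
  real,    intent(out) :: cend(n_spc), remov(n_spc)
  real    :: alfa(n_spc)
  integer :: i

  ! pass 1: the conditional, per-species logic stays in its own loop
  do i = 1, n_spc
    if ( kh(i) > 0.0 ) then
      alfa(i) = 1.0 / kh(i)            ! placeholder formula only
    else
      alfa(i) = 0.0
    end if
  end do

  ! pass 2: branch-free, so EXP(-ALFA*TAUCLD) can be vectorized
  do i = 1, n_spc
    cend(i)  = exp( -alfa(i) * taucld )
    remov(i) = 1.0 - cend(i)
  end do
end subroutine split_scav
```

With the branches confined to the first loop, the second loop becomes a clean candidate for the compiler's vectorized EXP.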
Another phenomenon I found is that the cost of cldproc grows as the timestep count grows. Is there any mechanism that would explain this?
Does anyone know where the actual hotspot is? Any suggestion is appreciated.
In its way, this is much akin to what SMOKE is about: thirty years ago, emissions modeling was mostly a bunch of redundantly-repeated ad-hoc linear searches using character-string search-keys (the same search being repeated for every time step). The key insight was eliminating the redundancy by turning the ad-hoc problem into a “vector” problem with a fixed, sorted ordering (a “vector” structure), using binary searches instead of linear ones, and wherever possible replacing character-string operations (which are horribly expensive) by integer operations: then it comes down to the use of structured sparse operators (which in the case of SMOKE were sparse-matrix multiplications; for HLCONST, the operators are sparse-exponential). The same algorithmic-complexity ideas still apply.
The result was to replace an overnight supercomputer run by two minutes forty-three seconds on a desktop SPARC-II workstation.
Note that your hash-table idea is a bit of an improvement over the present code, but still not as much of an improvement as you would get with a more structured vector approach.
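A minimal sketch of the kind of setup meant here (all names invented for illustration; nothing is taken from SMOKE or HLCONST): character keys are translated to integers once at startup, the integer keys are kept sorted, and the time-step loop then does an O(log N) binary search on integers instead of a repeated linear search on strings.

```fortran
! Hypothetical "structured vector" lookup; all names are invented.
! The string-to-integer translation happens once at setup; after
! that, the time-step loop never touches CHARACTER data.
module vec_lookup
  implicit none
  integer, parameter :: nkey = 5
  integer :: skey(nkey) = (/ 3, 8, 12, 20, 41 /)   ! sorted once at setup
contains
  integer function findkey( k )   ! binary search over sorted integers
    integer, intent(in) :: k
    integer :: lo, hi, mid
    lo = 1
    hi = nkey
    findkey = 0                   ! 0 means "not found"
    do while ( lo <= hi )
      mid = ( lo + hi ) / 2
      if ( skey(mid) == k ) then
        findkey = mid
        return
      else if ( skey(mid) < k ) then
        lo = mid + 1
      else
        hi = mid - 1
      end if
    end do
  end function findkey
end module vec_lookup
```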
[Note also that SMOKE today no longer follows all the improvements of the original]
FWIW – Carlie J. Coats, Jr., Ph.D.
I/O API Author/Maintainer
original SMOKE Author
Thanks for continuing to profile the CMAQ model. I am not a cloud person, but by examining the code, I saw two major subroutines, SCAVWDEP and AQ_MAP, which are called in the RESCLD subroutine. In addition, there is a conditional statement “IF ( QCRGCOL .GT. 0.0 ) THEN” that determines whether they will be called or not, so load imbalance is inevitable. I have done a small test, and this hypothesis has been confirmed.
Your other observation is that within HLCONST there is an expensive operation, i.e. string comparison (Dr. Coats has pointed that out as well). I have devised a new way to handle that, and I have done a quick test confirming the new modification does not alter results. If you don’t mind, please contact me directly (wong.david-c@epa.gov) and I will share the code with you so you can conduct an independent verification.
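A minimal sketch of one generic pattern for this (invented illustration code, not the actual modification being shared): resolve the species name to an integer index once during setup, then do only integer indexing in the time-step loop.

```fortran
! Invented illustration only, not the actual modification: look up the
! species name once during setup, then the hot loop uses only the
! precomputed integer index and never touches CHARACTER data.
program hl_index_demo
  implicit none
  integer, parameter :: nspc = 3
  character(16) :: names(nspc)
  real :: kh_tab(nspc), total
  integer :: idx, i, n

  names(1) = 'SO2'
  names(2) = 'O3'
  names(3) = 'H2O2'
  kh_tab = (/ 1.2, 1.1e-2, 8.3e4 /)   ! made-up values, not real constants

  ! setup phase: one linear string search, done once
  idx = 0
  do i = 1, nspc
    if ( names(i) == 'O3' ) then
      idx = i
      exit
    end if
  end do

  ! "time-step" loop: pure integer indexing, no string comparisons
  total = 0.0
  do n = 1, 100
    if ( idx > 0 ) total = total + kh_tab(idx)
  end do
  print *, 'sum =', total
end program hl_index_demo
```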
Thank you for your reply~
I tried the code Dr. Wong provided me. It turns out that the “structured vector approach” runs faster on my computer than my previous implementation~
Thank you for your reply. I have tested your code, and the test results are in the post above. It turns out that your structured vector approach idea is right.
cheers, rzbbc.
This is a great discussion. @cgnolte replaced the string comparisons in hlconst with integer comparisons about a year ago, and that change will likely be included in the next public CMAQ release.