Is r_interpolate_var_2db optimizable?

hello there~
I found a possible bottleneck in CMAQ 5.4 on my machine.
The code is below:

In this code, store_beg_ind to store_end_ind spans a flattened 3D array, but if lvl is present, the output only uses a flattened 2D slice of it.
With -Dparallel, I think most calls will have lvl present.
The reference call is below:

I have verified that cio_bndy_data changes in the first time step of every hour, but I'm wondering whether it changes in any other situation.
If it doesn't, maybe we can cache cio_bndy_data(store_beg_ind:store_end_ind) to gain some performance.
E.g.:
Once subroutine r_interpolate_var_2db (vname, date, time, data, type, lvl) has been called, save cio_bndy_data(store_beg_ind:store_end_ind) to a cache. When it is called again with vname, date, time, and type unchanged but a different lvl, the value has already been computed, so we can use the cached data and avoid a lot of copying and multiply-add work.
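Roughly, the cache idea can be sketched like this (illustrative Python, not CMAQ code; the key choice and all names here are my assumptions):

```python
# Sketch of the proposed cache: the interpolated chunk is computed once per
# (vname, date, time) key; later calls that differ only in lvl slice the
# cached array instead of recomputing the whole chunk.

cache_key = None    # (vname, date, time) of the last interpolation
cache_data = None   # flattened interpolated chunk for that key

def interpolate_chunk(head, tail, ratio1, ratio2):
    """Linear time interpolation over the whole flattened chunk."""
    return [h * ratio1 + t * ratio2 for h, t in zip(head, tail)]

def get_level(vname, date, time, lvl, head, tail, ratio1, ratio2, size_2d):
    """Return one level's 2D slice, reusing the cached chunk when the
    (vname, date, time) key is unchanged."""
    global cache_key, cache_data
    key = (vname, date, time)
    if key != cache_key:                 # miss: recompute the whole chunk
        cache_data = interpolate_chunk(head, tail, ratio1, ratio2)
        cache_key = key
    beg = (lvl - 1) * size_2d            # 0-based bounds of the slice
    return cache_data[beg:beg + size_2d]
```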
I'm basically new to CMAQ and time interpolation, so I'm worried this optimization might cause a bug or other unanticipated behavior.
Thank you for any idea~

In interpolate_var_2db, which is located in centralized_io_module.F at line 6693, cio_bndy_data is calculated over the range (store_beg_ind:store_end_ind), which is actually a 3D cube in the form shown below


But if lvl is defined, the code at line 6711 actually uses only a slice of the 3D cube, as shown below (the red line points out the slice).

Is it possible to calculate only the indices in the slice rather than the whole cube?
That means:

#before optimization:
if (present(lvl)) then
  cio_bndy_data(store_beg_ind:store_end_ind) =
   & (head and tail weight ratio in 3D cube)
else
  maintain raw code
endif
#after optimization:
if (present(lvl)) then
  begin_index= store_beg_ind+(lvl-1)*size_b2d 
  end_index = store_beg_ind+lvl*size_b2d-1
  cio_bndy_data(begin_index:end_index ) = 
   & (head and tail weight ratio in 2D slice)
else
  maintain raw code
endif

I am not very familiar with this code's branching and call patterns, but at line 6642 there is a case where the date and time equal cio_bndy_data_tstamp, which causes cio_bndy_data not to change, so the old data is reused. I'm not sure whether different lvl values of a vname (which I think is a pollution species) at the same date and time would take this branch.
Thank you for any idea~ :blush:

Hi rzbbc,

I am sorry for the slow response due to furlough and I am not actively monitoring the CMAS Forum.

Before I answer your question, here is some background information. For simplicity, in the centralized I/O implementation, rather than keeping track of 2D and 3D data separately, all the boundary data, which are a combination of 2D and 3D data, are stored in the 1D array cio_bndy_data. In terms of routine-name nomenclature, in #interpolate_var*db, # = i (integer operation) or r (real-number operation) and * = 1, 2, or 3, so *d denotes the number of dimensions of the main routine argument being handled, and b denotes a boundary data type.

Within cio_bndy_data there are n chunks, each of which corresponds to a variable in either the 2D or 3D data. In each chunk there are three sub-chunks, which hold the time t, t+1, and interpolated t_k (t <= t_k <= t+1) data, respectively. Clearly, when interpolate_var is called for a specific variable for the very first time, it reads in two time steps of data and then performs the interpolation. In addition, within each of these sub-chunks the data is organized in Fortran column-major order, i.e. first column, second column, …, of the first layer slab, then the second layer slab, and so forth.
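The layout described above can be sketched with some illustrative offset arithmetic (Python for brevity; this is not the actual CMAQ indexing code, and it assumes every variable is 3D with the same number of layers):

```python
# 0-based offset into a flat 1D array laid out as: per-variable chunks, each
# holding three sub-chunks (t, t+1, interpolated t_k), each sub-chunk stored
# layer slab by layer slab in column-major order.

def offset(var, sub, lvl, size_b2d, nlays):
    """Offset of layer `lvl` (1-based) within sub-chunk `sub`
    (0 = t, 1 = t+1, 2 = interpolated t_k) of variable `var` (0-based)."""
    chunk = 3 * nlays * size_b2d         # one variable's whole chunk
    return var * chunk + sub * nlays * size_b2d + (lvl - 1) * size_b2d
```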

Your cache performance concern is well taken. In the hadv routine (hadvppm.F), subroutine RDBCON is called by layer, i.e. one layer at a time. Inside RDBCON, it calls interpolate_var (in fact r_interpolate_var_2db) with a specific level. The store_beg_ind and store_end_ind are associated with a specific layer/slab. All the interpolate_var calls access different parts of cio_bndy_data, so cache performance should not be an issue (cache reuse is not in the picture, and there is no bug or unwanted behavior). However, in general, RDBCON can be called with the same time stamp (having previously been called with the same time stamp in another part of the code); to avoid the unnecessary copy and multiply-add operations you mentioned (again, this has nothing to do with cache reuse), the t_k data is used to address this dilemma.

In centralized_io_module.F, it has the following block in subroutine r_interpolate_var_2db:

if (present(lvl)) then
beg_k = lvl
end_k = lvl
else
beg_k = 1
end_k = nlays
end if

This does not look like the one you mentioned in your most recent post (the block labeled “before optimization”; there must be some misunderstanding). Anyway, I believe the above explanation addresses your latest post as well. Let me reiterate that each sub-chunk (the cube in your picture), which is associated with a particular variable, is recalculated unless the same time stamp has been used before, and a new time step of data is read in when the time stamp falls outside the current t to t+1 range.

If you have additional questions, please let me know and feel free to contact me directly (wong.david-c@epa.gov).

Cheers, David

Hello, Dr. David.
Thank you for your reply.
I learned a lot about time-interpolated variables from your introduction to the background of time interpolation. Thank you very much~

Quote 1: “However, in general, RDBCON can be called with the same time stamp (having previously been called with the same time stamp in another part of the code); to avoid the unnecessary copy and multiply-add operations you mentioned (again, this has nothing to do with cache reuse)”:
Regarding RDBCON's behavior, I think your description is not entirely consistent with my experiment. In my experiment, during one complete hadvppm loop, which involves all lvls, cio_bndy_data_tstamp in the RDBCON part also changes, e.g. 1m15s, 2m30s, 3m45s, with these times appearing alternately.
This may keep the “avoid unnecessary” mechanism from working well. In my experiment, apart from lvl 1, which unavoidably has to calculate, lvls 2-23 do take the “avoid unnecessary” branch, but lvls 24-31 still take the “calc” branch.
“This does not look like the one you mentioned in your most recent post” — you are absolutely right.
In the first post I tried to optimize with a cache, but it turned out the memory cost was too high, which resulted in cross-socket memory allocation and ultimately achieved little. So that cache-based optimization approach turned out not to be good.

In the second post, I thought about the problem in a different way. The CORE optimization point is: if lvl is present, too much calculation is done, as in the picture below:


When present(lvl) is true, let us analyze the code behavior step by step:
1. Assume lvl = 25; only the red slice's boundary is required to set the data correctly, but the store_beg_ind to store_end_ind range calculates the whole cube.
2. Assume lvl = 26; only the blue slice's boundary is required to set the data correctly, but the store_beg_ind to store_end_ind range again calculates the whole cube.
This means some levels' data is calculated repeatedly.
So my code is actually a second kind of optimization: calculate only the data that will be used, not a cache optimization. The “optimized” code I showed in the second post actually restricts cio_bndy_data's calculation range from the whole sub-chunk to part of the sub-chunk. As you mentioned in your post,
“if (present(lvl)) then
beg_k = lvl
end_k = lvl
else
beg_k = 1
end_k = nlays
end if”
which has already restricted the index range of the sub-chunk that is used.
Today I looked into the code and saw the tstamp problem, and I found a solution; can you help me figure out whether it is right? In my newest optimized code, I expand the dimensions of cio_bndy_data_tstamp, adding a dimension: from cio_bndy_data_tstamp(2, 0:2, n_cio_bndy_vars) to cio_bndy_data_tstamp(2, 0:2, n_cio_bndy_vars, 0:NLAYS), in which 0 means the situation with no lvl present.
In this way, each lvl (including no lvl, which is 0) owns and manages its own time stamp, and when lvl is present, only the relevant region is calculated.
That is my “optimization” idea; would you please give me some guidance~ thank you~ :blush:
cheers, rzbbc

Hi rzbbc,

“In the rdbcon’s behavior part, i think your description is not very consistent to my experiment. In my experiment,during a complete hadvppm loop,which involve all the lvls , rdbcon part, cio_bndy_data_tstamp will also change, like 1m15s, 2m30s, 3m45s, and the times alternating appear.”

The timing (three numbers) you have provided is a little bit odd, i.e. the last two numbers are multiples of the very first one. Could you please show me how you captured the timing in your experiment?

“This may cause the “avoid unnecessary” mechanism works not so good. In my experiment, except lvl1 which is unavoidable to calc, lvl 2-23 do goto the “avoid unnecessary” branch, lvl24-31 still goto the “calc” branch.”

Could you please elaborate on the terms “avoid unnecessary branch” and “calc branch”?

“In the second post, i think problem in a different way. The CORE optimization point is that: if lvl is present, too much calc is done, like picture below:”

Your concern is that some calculations have been done more than once in the boundary interpolation routine. Anyway, here are the actual calculations when you step through the routine r_interpolate_var_2db:

  • determine loc_head and loc_tail

  • determine time step size

  • the next big if-then block to make sure there are two time steps of data ready

  • the next big if-then block to determine whether the date and time have changed; if they have not, interpolation is skipped, otherwise interpolation is done for the entire slab, i.e. all levels

  • the next if block determining the starting and ending level depends on the presence of lvl

  • assign the interpolated data to the appropriate level

Clearly, there are no duplicate calculations. In addition, when lvl = 1 it will take the most time; the rest of the levels should each take about the same time.

Please take a look at this step-through explanation of the routine and let me know whether you agree with me that there is no extra calculation.
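For reference, the last two steps amount to a plain linear interpolation in time, which can be sketched as follows (illustrative Python with assumed names, not the Fortran source):

```python
# Linear time interpolation between the two stored time steps t and t+1.

def time_ratios(t_head, t_tail, t_now):
    """Weights for interpolation at t_now, with t_head <= t_now <= t_tail."""
    ratio2 = (t_now - t_head) / (t_tail - t_head)
    return 1.0 - ratio2, ratio2

def interpolate(head, tail, t_head, t_tail, t_now):
    """head * ratio1 + tail * ratio2, element by element."""
    r1, r2 = time_ratios(t_head, t_tail, t_now)
    return [h * r1 + t * r2 for h, t in zip(head, tail)]
```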

Cheers, David

Hi, Dr. David~
Thank you for your reply~
Q1: The timing (three numbers) you have provided is a little bit odd, i.e. the last two numbers are multiples of the very first one. Could you please show me how you captured the timing in your experiment?
A1: I inserted write(*,*) "skip/calc", vname, date, time, lvl at line 6644 and line 6693 respectively, recompiled, and reran the job. It is basically an even four-way split of a 5-minute interval; in HHMMSS form it is 000000, 000115, 000230, 000345, 000500.
In my experiment, a time step is 5 minutes; I was trying to figure out what happens in the interpolation method, so I inserted code into it.
Q2: Could you please elaborate on the terms “avoid unnecessary branch” and “calc branch”?
A2: The “avoid unnecessary” branch, I think, is at line 6644, which only does count = count + 1; the “calc” branch is at lines 6647-6696, which calculates the ratios and interpolates into cio_bndy_data.
And finally, I agree with your description of the r_interpolate_var_2db routine. One point you did not mention: after “the next big if-then block to determine whether the date and time have changed”, if the time has changed, the time stamp is updated to date/time in

                cio_bndy_data_tstamp(1, 2, var_loc) = date
                cio_bndy_data_tstamp(2, 2, var_loc) = time

This causes different lvls with different time steps to enter the calc branch, so duplicate calculations happen; I am 100% sure of this.
The current protection against duplicate calculation relies on the tstamp comparison, which works fine if every lvl of a vname has the same time. I do not deny the current code's mechanism for avoiding redundant calculation, but there are exceptions, as in the phenomenon in my experiment where lvls 24-31 still go to the calc branch, in which lvl 29's time is 345 and lvl 30's time is 230, which also confuses me sometimes. Anyway, my test case is 2018_12US1.

Let me describe my optimization again, in case I did not describe it clearly.
The core optimization follows the Razor Principle. As you described in your last post,

* the next if block determining the starting and ending level depends on the presence of lvl
* assign the interpolated data to the appropriate level

If we dive into the code, we find that when lvl is present, the interpolated data used to produce the variable's output is only a single slice. The code is below:

             if (present(lvl)) then
                beg_k = lvl
                end_k = lvl
             else
                beg_k = 1
                end_k = nlays
             end if

             data = 0.0
             store_beg_ind = cio_bndy_data_inx(1,2,var_loc)
             DO k = beg_k, end_k
                starting_pt = store_beg_ind + (k - 1) * size_b2d - 1

Let's look at the variable starting_pt: it is basically the first address of one level's slice within the cube. So the calculation in

                   cio_bndy_data(store_beg_ind:store_end_ind) =   cio_bndy_data(head_beg_ind:head_end_ind) * ratio1
     &                                                          + cio_bndy_data(tail_beg_ind:tail_end_ind) * ratio2

computes far more than the following code requires. I get the feeling this code was written by two people, or by one person at different times, especially the present(lvl) part.
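To make the point concrete, here is a small sketch (illustrative Python with assumed names) showing that interpolating only one level's slice yields exactly the values the whole-cube interpolation produces for that slice:

```python
# Full-cube interpolation vs. single-slice interpolation; the slice bounds
# mirror the starting_pt arithmetic quoted above.

def interp_full(head, tail, r1, r2):
    """Interpolate the entire flattened cube."""
    return [h * r1 + t * r2 for h, t in zip(head, tail)]

def interp_slice(head, tail, r1, r2, lvl, size_b2d):
    """Interpolate only level `lvl` (1-based) of the cube."""
    beg = (lvl - 1) * size_b2d
    end = lvl * size_b2d
    return [h * r1 + t * r2 for h, t in zip(head[beg:end], tail[beg:end])]
```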

By the way, I have tried to solve the tstamp and calculation issues I described, which accelerated hadvppm from 3.2 s to 0.8 s per step, and the results are “the same” in a 2-hour simulation. The way I compare results is to compare CCTM_BUDGET_v54_cb6r5_ae7_aq_WR413_MYR_clang4.0.0_2018_12US1_2x96_classic_20171222.txt in the output dir, and I am wondering whether this is the right method.

Another data point: in the perf results for CMAQ 5.4 on my machine, interpolate_var_2db accounts for 15% of the time, and the main hotspot lies in

cio_bndy_data(store_beg_ind:store_end_ind) =   cio_bndy_data(head_beg_ind:head_end_ind) * ratio1
     &                                                          + cio_bndy_data(tail_beg_ind:tail_end_ind) * ratio2

That is why I am trying to optimize it; I think working out the right optimization could benefit CMAQ's speed a lot~
Cheers, rzbbc.

Hi rzbbc,
Thanks for your clarification. I misunderstood the meaning of those timing numbers; I thought they were the computational timings for each level, meant to demonstrate that duplicated calculation was taking place.
Let me provide a more detailed explanation of the following step:
• the next big if-then block to make sure there are two time steps of data ready
Before interpolate_var (in the RDBCON case, r_interpolate_var_2db) is called, subroutine retrieve_boundary_data is called inside subroutine centralized_io_init, which runs at the very beginning of the model. The main function of retrieve_boundary_data is to read in two time steps of data, t and t+1. That big if-then block determines whether the current model step is still within the window between t and t+1. If not, it brings in the t+2 time step of data to ensure the interpolation, which takes place later in the subroutine, is done properly.
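The windowing logic just described can be sketched as follows (illustrative Python; the names are assumptions and read_next stands in for the actual file read):

```python
# Keep two bracketing time steps; when the model time moves past the second
# one, shift the window and read in the next step of data.

def ensure_window(t_head, t_tail, t_now, step, read_next):
    """Advance the (t_head, t_tail) window until t_head <= t_now <= t_tail,
    reading one new time step of data for every shift."""
    while t_now > t_tail:
        t_head = t_tail
        t_tail = t_tail + step
        read_next(t_tail)       # bring in the next time step of data
    return t_head, t_tail
```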
In the next block of the if-then conditional statements, the following two lines will be executed only for the first level in the subroutine hadv:
cio_bndy_data_tstamp(1, 2, var_loc) = date
cio_bndy_data_tstamp(2, 2, var_loc) = time

From a coding standpoint, the code does what it is supposed to do correctly. I know that is not the case in your model. I could work with you to determine the cause (you can't artificially manipulate the date and time in the middle of the lvl loop in the hadv subroutine). As a starter, you can print out jdate and jtime in the hadv subroutine before calling RDBCON at each level (print out the lvl number as well). In RDBCON, print out var, lvl, mdate, and mtime before the call to interpolate_var for BCNAME(VAR). Also print out date and time in r_interpolate_var_2db, and compare these dates and times across the different routines after running the model.
You are right that starting_pt points to the beginning of each level in a particular slab. It is used to calculate the North, East, South, and West boundary for that particular level rather than the entire slab, as you hinted. Anyway, this code was developed by me.
When you want to compare results from two different runs, directly compare the netCDF output files such as ACONC (I have developed a dif tool for this).
I am a bit surprised to learn that 15% of the entire model run is spent in r_interpolate_var_2db. Would you mind sharing your entire timing analysis with us? In your case, what is the domain size, and how many processors were used? I hope I did not miss any of your questions.
Cheers, David


Hi, Dr. David
Thank you for your reply~
My test case is 2018_12US1, in which I changed the config file to disable almost all the emission calculations according to the tutorial (including setting some config options to N, and set )

I am sorry that I am too busy finding optimization points in CMAQ to continue figuring out experimentally why lvl changes within one RDBCON loop. But I may have found some code evidence that FDATE/FTIME can change during the RDBCON loop. In hadvppm.F:

#start from line 209
110    CONTINUE
         CALL RDBCON ( FDATE, FTIME, ASTEP( LVL ), LVL, BCON, L_WRITE_WARNING )
....
# start from line 255
         DSTEP = DSTEP + STEP
         IF ( DSTEP .LE. SYNCSTEP ) THEN
            CALL NEXTIME( FDATE, FTIME, SEC2TIME( STEP ) )
            GO TO 101
         END IF

If the branch after line 255 is taken, I think the NEXTIME routine will change FDATE/FTIME and the GO TO will return to the 101 CONTINUE at the top, which can cause RDBCON to receive different inputs within the same loop.
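A toy reconstruction of that loop structure (plain seconds instead of HHMMSS, and a Python function in place of NEXTIME; this shows the structure only, not the real code) illustrates why RDBCON can see several different FTIME values within one synchronization step:

```python
# Record the FTIME value seen by each simulated RDBCON call within one
# synchronization step of the hadvppm-style sub-stepping loop.

def hadv_subloop(ftime, step, syncstep, calls):
    dstep = step
    while True:
        calls.append(ftime)      # stands in for CALL RDBCON(..., FTIME, ...)
        dstep += step
        if dstep <= syncstep:
            ftime += step        # stands in for CALL NEXTIME(...)
        else:
            break
    return calls
```

With step = 75 and syncstep = 300, the recorded call times are 0, 75, 150, and 225 seconds, i.e. the 1m15s / 2m30s / 3m45s pattern seen in the experiment.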

As for how to modify the code, we only need to change centralized_io_module.F. I can't offer the code directly, but I will give some hints; following them you can easily reproduce the code.
1. Expand the dimensions of cio_bndy_data_tstamp by adding a dimension (0:NLAYS), in which 0 denotes the all-lvls case and 1~NLAYS denote each individual lvl.
2. For initialization, because a dimension has been added, append ,: to the tail of the cio_bndy_data_tstamp indices so that all lvls are set the old way; don't forget there is a jdate+999 line.
3. If lvl is present, update that lvl's tstamp at line 6845, and calculate only that lvl's data, like below:

if (present(lvl)) then
  begin_index= store_beg_ind+(lvl-1)*size_b2d 
  end_index = store_beg_ind+lvl*size_b2d-1
  cio_bndy_data(begin_index:end_index ) = 
   & (head and tail weight ratio in 2D slice)
else
  maintain raw code
endif

4. If lvl is not present, update the tstamps of all lvls and keep the original code.
5. If the if (cio_bndy_var_name(var_loc, 2) == 'bc') then condition at line 6657 is triggered, update the tstamps of all lvls.
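The per-level timestamp check behind hints 1 and 3 can be sketched like this (illustrative Python, not the actual Fortran changes):

```python
# One (date, time) stamp per level, with index 0 reserved for the
# no-lvl-present case, lets each level skip recomputation independently.

def need_recalc(tstamp, var, lvl, date, time):
    """Return True (and update the stamp) when level `lvl` of variable
    `var` must be recomputed; False when its stamp already matches."""
    if tstamp[var][lvl] == (date, time):
        return False             # same stamp: reuse the existing data
    tstamp[var][lvl] = (date, time)
    return True
```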

The 15% weight comes directly from the perf tool; maybe it is because the case is a bit large and our machine differs from yours. Our machine uses more cores, but each core is not that fast, which may be a cause. The total number of processors is 512.

I'm very interested in the dif tool; comparing .nc files is a more persuasive way to verify the accuracy of a code change. Is that script on GitHub?

Cheers, rzbbc

Hi rzbbc,

Thanks for pointing out the potential time change in that part of the hadv code. My apologies for sloppily overlooking that portion of the code. I will work on a solution to improve the computational performance of hadv.

The dif tool was developed by me for internal use only. If you are interested in it, please contact me directly (wong.david-c@epa.gov).

Cheers, David

If you want to compare I/O API netCDF files, the “standard” way is to use the M3Tools program m3diff, which has been freely available for more than thirty years. It offers a variety of options, including differences, ratios, and normalized differences, and reports statistics (max, min, mean, sigma) for the result.

Hi, Dr. David
Thank you for your reply~ It is an honor to receive your guidance here~ :blush:
Cheers, rzbbc

Hi, Dr. cjcoats.
Thank you for your solution~ I will try it later~
Cheers, rzbbc
