CMAQ v5.3.3 DESID: run time slows sharply as the number of variables increases (beyond roughly 20 or 30; I do not know the exact number)


With the Stream Label set to ALL, the model runs well even with more than 50 variables.


But when the Stream Label is set to anything other than ALL, the model's run time becomes extremely slow.

Default: 1 day with 120 procs = about 1 hour
With the Stream Label changed for many variables: 1 day with 120 procs = about 10 hours

As a result, the DESID module may not be usable for subtracting one whole set of emissions from another, because emission files usually contain 50 to 70 variables. It can still subtract a smaller number of target variables.

Can anyone test this? I am not sure whether it is a problem with our server.

Thanks

Hello,

We ran into an issue similar to this and the problem ended up being inefficiency in vectorized Fortran calculations for large arrays. If you go to lines 1069-1073 in EMIS_DEFN.F:

       VDEMIS( :,1:NL,:,: ) = 0.0
       FORALL( ISTR = 1:N_EMIS_ISTR, MAP_EMtoDIFF( ISTR ) .NE. 0 ) 
     &        VDEMIS( MAP_EMtoDIFF( ISTR ),1:NL,:,: ) = 
     &                VDEMIS( MAP_EMtoDIFF( ISTR ),1:NL,:,: ) + 
     &                VDEMIS0( ISTR,1:NL,:,: )

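try rewriting this snippet as conventional loops (the version below uses the names DESID_N_ISTR and MAP_ISTRtoDIFF where the snippet above has N_EMIS_ISTR and MAP_EMtoDIFF; keep whichever names appear in your copy of EMIS_DEFN.F):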

        DO L = 1,NL
          VDEMIS( :,L,:,: ) = 0.0
          DO ISTR = 1,DESID_N_ISTR 
            IF ( MAP_ISTRtoDIFF( ISTR ) .NE. 0 ) 
     &        VDEMIS( MAP_ISTRtoDIFF( ISTR ),L,:,: ) = 
     &                VDEMIS( MAP_ISTRtoDIFF( ISTR ),L,:,: ) + 
     &                VDEMIS0( ISTR,L,:,: )
          END DO
        END DO

Does that alleviate the slow-down? We have made this update in v5.4.

Regards,
Ben

See "Optimizing Environmental Models for Microprocessor Based Systems: The Easy Stuff" :-), which is an updated set of notes from a lecture for EPA ORD in 2002 (and note that the effects described in it have grown even worse over the years!). See particularly the section "Typical Cache Behavior".

Fortran stores arrays with the left-most subscript varying fastest in memory. The subscript order VDEMIS0( ISTR,L,:,: ) etc., where the active subscripts are the right-most two, therefore guarantees that this code is both a “cache-buster” and a “bad-stride code”: as the problem gets larger and larger, the processor has to go out more and more often to (slow) main memory to do its job. This can easily multiply the computational cost by a factor larger than 100 for larger problems.
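
To see the stride effect in isolation, here is a minimal free-form Fortran sketch. It is not CMAQ code: the program name, array names, and dimensions are all hypothetical and chosen only for illustration. It sums the same number of elements twice, once from an array laid out like VDEMIS0( ISTR,L,C,R ) and once from an array with the column subscript left-most. The timings will depend on compiler, optimization flags, and hardware, and an aggressive optimizer may reorder the loops, so treat the result as indicative only.

```fortran
! Hypothetical timing sketch; names and sizes are illustrative, not CMAQ's.
program stride_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: nv = 60, nl = 35, nc = 100, nr = 100
  real, allocatable :: a(:,:,:,:)   ! species-first layout, like VDEMIS0( ISTR,L,C,R )
  real, allocatable :: b(:,:,:,:)   ! column-first layout: ( C,R,L,ISTR )
  integer :: v, l, c, r
  integer(i8) :: t0, t1, t2, rate
  double precision :: sa, sb

  allocate( a(nv,nl,nc,nr), b(nc,nr,nl,nv) )
  a = 1.0
  b = 1.0
  sa = 0.0d0
  sb = 0.0d0

  call system_clock( t0, rate )
  ! Species and layer are fixed in the outer loops, as in a( v,l,:,: ).
  ! Consecutive values of c are nv*nl elements apart in memory.
  do v = 1, nv
    do l = 1, nl
      do r = 1, nr
        do c = 1, nc
          sa = sa + a(v,l,c,r)
        end do
      end do
    end do
  end do
  call system_clock( t1 )
  ! Same loop order, but c is now the left-most subscript,
  ! so the inner loop reads consecutive memory locations.
  do v = 1, nv
    do l = 1, nl
      do r = 1, nr
        do c = 1, nc
          sb = sb + b(c,r,l,v)
        end do
      end do
    end do
  end do
  call system_clock( t2 )

  ! Printing the sums keeps the compiler from discarding the loops.
  print *, 'strided    sum, seconds:', sa, real(t1-t0)/real(rate)
  print *, 'contiguous sum, seconds:', sb, real(t2-t1)/real(rate)
end program stride_demo
```

Any Fortran 90 or later compiler should build this (for example `gfortran -O2 stride_demo.f90`). Increasing nc and nr makes the working set larger, which is the regime where the difference between the two layouts is expected to grow.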