NHSE Review™ 1996 Volume, Second Issue

Comparison of 3 HPF Compilers



Chapter 4 -- Loop Jamming

The following trivial program demonstrates how Fortran 90 assignments are "lowered" to per-node execution. No inter-processor communication should be needed for the three assignments. A literal translation of the Fortran 90 semantics to Fortran 77 without recognition of the conformance and alignment of all six arrays might lead to more control structure than is needed. This is the case for PGI pghpf, whereas APR xHPF and IBM XL HPF employ a translation technique that can be called "loop jamming".

      PROGRAM three_assignments
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2

      a2 = a1
      b2 = b1
      c2 = c1

      END
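
As a point of reference before looking at the compiler output, the following serial sketch contrasts a literal, statement-by-statement lowering with a jammed one. It is an illustration written for this comparison, not output from any of the three compilers; the program name and the extra arrays a3, b3 and c3 are assumptions introduced only so that the two forms can be checked against each other.

```fortran
program jamming_sketch
  implicit none
  integer, parameter :: n1 = 30, n2 = 40
  real, dimension(n1, n2) :: a1, a2, b1, b2, c1, c2
  real, dimension(n1, n2) :: a3, b3, c3   ! hypothetical: targets of the jammed form
  integer :: i, j

  call random_number(a1)
  call random_number(b1)
  call random_number(c1)

  ! Literal lowering: one loop nest per array assignment.
  do j = 1, n2
    do i = 1, n1
      a2(i, j) = a1(i, j)
    end do
  end do
  do j = 1, n2
    do i = 1, n1
      b2(i, j) = b1(i, j)
    end do
  end do
  do j = 1, n2
    do i = 1, n1
      c2(i, j) = c1(i, j)
    end do
  end do

  ! Jammed lowering: the arrays are conformable and the statements
  ! do not depend on each other, so one loop nest can carry all three.
  do j = 1, n2
    do i = 1, n1
      a3(i, j) = a1(i, j)
      b3(i, j) = b1(i, j)
      c3(i, j) = c1(i, j)
    end do
  end do

  if (all(a2 == a3) .and. all(b2 == b3) .and. all(c2 == c3)) then
    print *, 'jammed and unjammed results agree'
  else
    print *, 'results differ'
  end if
end program jamming_sketch
```

The jammed form pays the loop control overhead once instead of three times, which is the saving the rest of this chapter looks for in the generated code.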

4.1 APR xHPF

The above is translated into the following:

      do a3 = dtx, dtx0, dtx1
      dtx3=dtx3+1
      dtx2=-1
        DO a0 = 1, 30
      dtx2=dtx2+1
      a2(hie+dtx2*hif+dtx3*hi10)=a1(hi+dtx2*hi0+dtx3*hi1)
      b2(hi8+dtx2*hi9+dtx3*hia)=b1(hi2+dtx2*hi3+dtx3*hi4)
      c2(hib+dtx2*hic+dtx3*hid)=c1(hi5+dtx2*hi6+dtx3*hi7)
        ENDDO
      ENDDO
which can be seen in context in the _mpf.f file generated by the translator. All three assignments are handled in one loop nest: a good example of "loop jamming".

Note that each node has its own values for dtx, dtx0 and dtx1 computed prior to the loop by APR support routines from the details of the current distribution of each array. (See discussion of APR's Run Time Model, below.) In fact, the form of this code would not change if the mapping were changed from BLOCK to CYCLIC. Note also that this is the code-form xHPF generates for "SHRUNK" array allocation (each processor containing only its own portion of the mapped array) -- a "FULL" array allocation would have significantly simpler subscripts.
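
The remark that the code form survives a change from BLOCK to CYCLIC can be made concrete with a small serial simulation. The sketch below is an assumption-laden illustration, not APR's run time: it supposes 4 processors, computes a (lower, upper, stride) triplet for each one in the manner of dtx, dtx0 and dtx1, and runs the same loop for both mappings. The routine name sweep and the owner arrays are hypothetical.

```fortran
program loop_triplet_sketch
  implicit none
  integer, parameter :: n = 40       ! extent of the distributed dimension
  integer, parameter :: nproc = 4    ! assumed number of processors
  integer :: me, bs
  integer :: owner_block(n), owner_cyclic(n)

  owner_block = -1
  owner_cyclic = -1
  bs = (n + nproc - 1) / nproc       ! block size, rounded up

  do me = 0, nproc - 1
    ! BLOCK: a contiguous range traversed with stride 1.
    call sweep(me*bs + 1, min(n, (me + 1)*bs), 1, me, owner_block)
    ! CYCLIC: every nproc-th index, starting at me + 1.
    call sweep(me + 1, n, nproc, me, owner_cyclic)
  end do

  print '(a, 40i2)', ' BLOCK :', owner_block
  print '(a, 40i2)', ' CYCLIC:', owner_cyclic
  print *, 'every index owned: ', &
       all(owner_block >= 0) .and. all(owner_cyclic >= 0)

contains

  ! The loop is identical for both mappings; only the triplet differs.
  subroutine sweep(lo, hi, step, me, owner)
    integer, intent(in) :: lo, hi, step, me
    integer, intent(inout) :: owner(:)
    integer :: j
    do j = lo, hi, step
      owner(j) = me
    end do
  end subroutine sweep

end program loop_triplet_sketch
```

Because the mapping is captured entirely in the three loop-control values, the loop body does not need to be regenerated when the distribution changes.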

This translation used the default preference of xHPF for actually decomposing only on one dimension, the rightmost with either a BLOCK or CYCLIC keyword, regardless of the rank of the array or the number of mapped dimensions of the array. This default can be overridden by command-line arguments.

Each of the other scalar variables assigned to in the loop body or used in the subscript expressions is used to access the actually allocated array. APR reports that most native Fortran compilers with sufficient optimization power can simplify these expressions or recognize the loop invariances, so execution does not suffer from their apparent complexity; IBM xlf is reported to have the required capability.
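
One way to read a subscript such as hi+dtx2*hi0+dtx3*hi1 is as a base offset plus two strides applied to zero-based loop counters. That reading is an interpretation of the generated code, not something stated by APR. Under that assumption, the sketch below checks that such a linearized subscript reaches the same elements as ordinary two-dimensional indexing of a local portion; the names base, s_row and s_col, and the 30 x 10 local shape, are hypothetical.

```fortran
program linear_subscript_sketch
  implicit none
  integer, parameter :: nrow = 30, ncol = 10   ! assumed local portion
  real :: local(nrow, ncol), flat(nrow*ncol)
  integer :: base, s_row, s_col, i0, j0
  logical :: ok

  call random_number(local)
  ! Column-major copy of the local portion as a one-dimensional array.
  flat = reshape(local, (/ nrow*ncol /))

  ! Loop-invariant values, playing the role of hi, hi0 and hi1.
  base  = 1       ! position of element (1,1)
  s_row = 1       ! stride between consecutive rows
  s_col = nrow    ! stride between consecutive columns

  ok = .true.
  do j0 = 0, ncol - 1        ! zero-based, like dtx3
    do i0 = 0, nrow - 1      ! zero-based, like dtx2
      if (flat(base + i0*s_row + j0*s_col) /= local(i0 + 1, j0 + 1)) then
        ok = .false.
      end if
    end do
  end do

  print *, 'linearized subscripts match 2-D indexing: ', ok
end program linear_subscript_sketch
```

Since base and the two strides never change inside the loop nest, an optimizing back-end compiler can reduce the subscript arithmetic to simple increments.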

4.1.1 xHPF Run Time Model

The code that precedes the loop in the generated _mpf.f file:

      call dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, b2, 162, -11, 1, 1
     .    , 30, 3, 1, 1, 10)
      dtx3=-1
      call dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     .    , hi1)
      call dl_mem_by_dl(b1, 180, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     .    , hi4)
      call dl_mem_by_dl(c1, 144, 5, hi5, -11, 1, 1, 30, hi6, 3, 1, 1, 10
     .    , hi7)
      call dl_mem_by_dl(b2, 162, 4, hi8, -11, 1, 1, 30, hi9, 3, 1, 1, 10
     .    , hia)
      call dl_mem_by_dl(c2, 126, 4, hib, -11, 1, 1, 30, hic, 3, 1, 1, 10
     .    , hid)
      call dl_mem_by_dl(a2, 198, 4, hie, -11, 1, 1, 30, hif, 3, 1, 1, 10
     .    , hi10)
      call dd_preloop_xchng(11, 25, 'three_assignments.f90.F77', dtx, 
     .    dtx0, dtx1)
      call dl_modify(hi10, hif, hie, hid, hic, hib, hia, hi9, hi8, hi7, 
     .    hi6, hi5, hi4, hi3, hi2, hi1, hi0, hi)

anticipates APR's dynamic memory mapping model. The xHPF run time system accommodates more dynamic redistribution than the static HPF Subset directives would imply: the extra facilities would be available to users of the APR-specific !APR$ PARTITION... directive.

The dd_dstloop(...) call determines the "control distribution" (APR calls this "spreading" in its manuals) of the parallelized "DO a3" loop (the rightmost dimension of each array) based on an owner-computes analysis of the b2 array (any of the three left-hand-side arrays would suffice). Each of the dl_mem_by_dl(...) calls determines the current mapping for its subject array, filling-in values that will be used in the loop. Note that a determination is made in this routine about required inter-processor communication for any array elements, but no actual communication is initiated yet. Then dd_preloop_xchng(...) compares the current node "locality" recorded earlier in the "control distribution" with values saved within the runtime system by each dl_mem_by_dl call (a "work list"), and actually performs any needed inter-processor communication. Finally, dl_modify(...) updates the scalar variables that might be affected by such a communication.

For this trivial code with the directives as given, of course there is no required communication.

After the loop code there is:

      call dd_postloop_xchng(18, 25, 'three_assignments.f90.F77')
This anticipates that some later array-using code might require a different mapping of the data, and performs any anticipatory inter-processor communication. Again, the saved "work list" is used to make the determination, and again, for this code, no communication is needed.

Our experience with xHPF has shown that its overhead, even in these "null communication" cases, is fairly small. An actual measurement of that overhead can be obtained from the instrumented library that is part of the xHPF product.


4.2 PGI pghpf

The three assignments are translated into the following:

      call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l,i$$u)
      call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l1,i$$u1)
!     forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
!    +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-
!    +l$$b+a1$p)

      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-
     +l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$p)
         enddo
      enddo

      call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l2,i$$u2)
      call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l3,i$$u3)
!     forall (i$i=i$$l3:i$$u3:1, i$i1=i$$l2:i$$u2:1) b2((u$$b2-l$$b2+1)*
!    +(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
!    +-l$$b2+b1$p)

      do i$i = i$$l3, i$$u3
         do i$i1 = i$$l2, i$$u2
            b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b2-
     +l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b1$p)
         enddo
      enddo

      call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l4,i$$u4)
      call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l5,i$$u5)
!     forall (i$i=i$$l5:i$$u5:1, i$i1=i$$l4:i$$u4:1) c2((u$$b2-l$$b2+1)*
!    +(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
!    +-l$$b2+c1$p)

      do i$i = i$$l5, i$$u5
         do i$i1 = i$$l4, i$$u4
            c2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b2-
     +l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c1$p)
         enddo
      enddo
which is seen in context in the Fortran file that can be saved during a pghpf translation. Clearly there has been no attempt at loop jamming.

The commented-out "forall" lines are the best indication of the pghpf compiler's parallelization accomplishments. Here both dimensions of the distribution are acknowledged. Also, since this compiler determines that there can be no inter-processor communication, it generates no extra calls. PGI's pghpf has no facility corresponding to xHPF's "FULL" allocations, so all distributed arrays are locally allocated in per-node-portion form in the generated code.
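
To give a feel for what localizing loop bounds under a (BLOCK, BLOCK) distribution involves, the following serial sketch computes per-node row and column ranges for the 30 x 40 arrays. It assumes a 2 x 2 processor grid and a round-up block size; the routine local_bounds is hypothetical and is not the pghpf_localize_bounds routine, whose internals are not shown in the generated code.

```fortran
program block_block_sketch
  implicit none
  integer, parameter :: n1 = 30, n2 = 40   ! array extents from the example
  integer, parameter :: p1 = 2, p2 = 2     ! assumed 2 x 2 processor grid
  integer :: r, c, l1, u1, l2, u2, total

  total = 0
  do c = 0, p2 - 1
    do r = 0, p1 - 1
      call local_bounds(n1, p1, r, l1, u1)   ! first dimension
      call local_bounds(n2, p2, c, l2, u2)   ! second dimension
      print '(a,i1,a,i1,a,i2,a,i2,a,i2,a,i2)', &
           ' node (', r, ',', c, ') rows ', l1, ':', u1, &
           ' cols ', l2, ':', u2
      total = total + max(0, u1 - l1 + 1) * max(0, u2 - l2 + 1)
    end do
  end do
  print *, 'elements covered:', total, ' of ', n1*n2

contains

  ! Global index range owned by node "me" (0-based) out of "np" nodes
  ! for a BLOCK-distributed dimension of extent "n".
  subroutine local_bounds(n, np, me, lo, hi)
    integer, intent(in) :: n, np, me
    integer, intent(out) :: lo, hi
    integer :: bs
    bs = (n + np - 1) / np      ! block size, rounded up
    lo = me*bs + 1
    hi = min(n, (me + 1)*bs)
  end subroutine local_bounds

end program block_block_sketch
```

In the pghpf output the same pair of bound computations is repeated before each of the three loop nests, although the bounds are identical for all six aligned arrays.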


4.3 IBM XL HPF

XL HPF translates the three assignment statements as:

C 1585-501  Original Source Line 6
       do i_4=iown_l_14,MIN0(iown_u_15,40),1
C 1585-501  Original Source Line 6
         do i_5=iown_l_16,MIN0(iown_u_17,30),1
           a2_27(i_5,i_4) = a1_26(i_5,i_4)
           b2_29(i_5,i_4) = b1_28(i_5,i_4)
           c2_31(i_5,i_4) = c1_30(i_5,i_4)
         end do
       end do
as extracted from the pseudo-Fortran listing. Clearly there is good loop jamming here.

This is the form of the code with the compiler's "-qhot" (High Order loop Transformations) optimization, and this is the portion of the listing elicited by the "-qreport=hotlist" command-line option. (Examination of earlier portions of that listing shows all the invocation options and the more line-by-line "hpflist"-elicited HPF Parallelization Report.)

This compiler generated a two-dimensional parallel spreading of the assignments. The arrays are allocated locally with each processor holding only its portion. No communication is needed and so none is generated.
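
The MIN0 calls in the loop bounds suggest that the owned upper bound can exceed the array extent when the extent is not a multiple of the block size. That reading is an assumption about the XL HPF output, not something the listing states. Under it, the serial sketch below simulates a 4 x 4 processor grid, where 30 rows in blocks of 8 would give the last row of nodes an unclamped range of 25:32, and runs a jammed loop nest with clamped bounds on each node. All names are hypothetical.

```fortran
program clamp_sketch
  implicit none
  integer, parameter :: n1 = 30, n2 = 40
  integer, parameter :: p1 = 4, p2 = 4     ! assumed 4 x 4 processor grid
  real, dimension(n1, n2) :: a1, a2, b1, b2, c1, c2
  integer :: r, c, i, j, bs1, bs2
  integer :: il, iu, jl, ju
  logical :: ok

  call random_number(a1)
  call random_number(b1)
  call random_number(c1)
  a2 = -1.0                  ! sentinel: random_number never returns -1
  b2 = -1.0
  c2 = -1.0

  bs1 = (n1 + p1 - 1) / p1   ! rounded-up block size, first dimension
  bs2 = (n2 + p2 - 1) / p2   ! rounded-up block size, second dimension

  do c = 0, p2 - 1
    do r = 0, p1 - 1
      ! Unclamped owned ranges, in the manner of iown_l_* and iown_u_*.
      il = r*bs1 + 1
      iu = (r + 1)*bs1
      jl = c*bs2 + 1
      ju = (c + 1)*bs2
      ! Jammed loop nest; min() keeps the last block inside the array.
      do j = jl, min(ju, n2)
        do i = il, min(iu, n1)
          a2(i, j) = a1(i, j)
          b2(i, j) = b1(i, j)
          c2(i, j) = c1(i, j)
        end do
      end do
    end do
  end do

  ok = all(a2 == a1) .and. all(b2 == b1) .and. all(c2 == c1)
  print *, 'all elements copied exactly once per array: ', ok
end program clamp_sketch
```

Without the clamp, the nodes holding the last block of rows would index past row 30; with it, the union of the per-node loops covers the arrays exactly.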

Copyright © 1996




presberg@tc.cornell.edu
Last modified: Fri Sep 27 18:23:27 1996