
| <- Prev | Index | Next -> |
NHSE ReviewTM: Comments · Archive · Search
PROGRAM three_assignments
REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2
a2 = a1
b2 = b1
c2 = c1
END
xHPF translates the above into the following:
do a3 = dtx, dtx0, dtx1
dtx3=dtx3+1
dtx2=-1
DO a0 = 1, 30
dtx2=dtx2+1
a2(hie+dtx2*hif+dtx3*hi10)=a1(hi+dtx2*hi0+dtx3*hi1)
b2(hi8+dtx2*hi9+dtx3*hia)=b1(hi2+dtx2*hi3+dtx3*hi4)
c2(hib+dtx2*hic+dtx3*hid)=c1(hi5+dtx2*hi6+dtx3*hi7)
ENDDO
ENDDO
which can be seen in context in the _mpf.f file generated by the translator. All three assignments are handled in one loop nest: a good example of "loop jamming".
Note that each node has its own values for dtx, dtx0 and dtx1, computed prior to the loop by APR support routines from the details of the current distribution of each array. (See the discussion of APR's Run Time Model, below.) In fact, the form of this code would not change if the mapping were changed from BLOCK to CYCLIC. Note also that this is the code form xHPF generates for "SHRUNK" array allocation (each processor containing only its own portion of the mapped array) -- a "FULL" array allocation would have significantly simpler subscripts.
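One way to see why the loop form need not change between BLOCK and CYCLIC is that a start/end/stride triple can describe either mapping. The sketch below is not APR's runtime logic: it assumes, from the shape of the generated "do a3 = dtx, dtx0, dtx1" statement, that the three scalars act as start, end and stride, and it uses a hypothetical node count of 4. The names first, last and step are illustrative only.

```fortran
program loop_triple_sketch
  implicit none
  integer, parameter :: ncols = 40   ! extent of the distributed (rightmost) dimension
  integer, parameter :: nproc = 4    ! hypothetical node count (assumption)
  integer :: p, blk, first, last, step, j
  integer :: hits_block(ncols), hits_cyclic(ncols)

  blk = (ncols + nproc - 1) / nproc  ! HPF BLOCK size: ceiling(ncols / nproc)
  hits_block = 0
  hits_cyclic = 0

  do p = 1, nproc
    ! BLOCK: a contiguous range of columns with unit stride.
    first = (p - 1) * blk + 1
    last = min(p * blk, ncols)
    step = 1
    do j = first, last, step
      hits_block(j) = hits_block(j) + 1
    end do

    ! CYCLIC: every nproc-th column, starting at the node number.
    first = p
    last = ncols
    step = nproc
    do j = first, last, step
      hits_cyclic(j) = hits_cyclic(j) + 1
    end do
  end do

  ! Under either mapping every column must be visited by exactly one node.
  if (any(hits_block /= 1) .or. any(hits_cyclic /= 1)) stop 1
  print *, 'BLOCK and CYCLIC triples each cover every column once'
end program loop_triple_sketch
```

Only the three values differ between the two mappings; the DO statement itself is identical, which is consistent with the per-node values being computed at run time rather than fixed in the generated code.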
This translation used xHPF's default preference for decomposing on only one dimension: the rightmost one carrying a BLOCK or CYCLIC keyword, regardless of the rank of the array or the number of its mapped dimensions. This default can be overridden by command-line arguments.
The other scalar variables assigned in the loop body or used in the subscript expressions serve to address the array as it is actually allocated. APR reports that most native Fortran compilers with sufficient optimization power can simplify these expressions or recognize the loop invariances, so that execution does not suffer from their apparent complexity; IBM xlf is reported to have the required capability.
The code that precedes the loop in the generated _mpf.f file:
call dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, b2, 162, -11, 1, 1
. , 30, 3, 1, 1, 10)
dtx3=-1
call dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
. , hi1)
call dl_mem_by_dl(b1, 180, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
. , hi4)
call dl_mem_by_dl(c1, 144, 5, hi5, -11, 1, 1, 30, hi6, 3, 1, 1, 10
. , hi7)
call dl_mem_by_dl(b2, 162, 4, hi8, -11, 1, 1, 30, hi9, 3, 1, 1, 10
. , hia)
call dl_mem_by_dl(c2, 126, 4, hib, -11, 1, 1, 30, hic, 3, 1, 1, 10
. , hid)
call dl_mem_by_dl(a2, 198, 4, hie, -11, 1, 1, 30, hif, 3, 1, 1, 10
. , hi10)
call dd_preloop_xchng(11, 25, 'three_assignments.f90.F77', dtx,
. dtx0, dtx1)
call dl_modify(hi10, hif, hie, hid, hic, hib, hia, hi9, hi8, hi7,
. hi6, hi5, hi4, hi3, hi2, hi1, hi0, hi)
anticipates APR's dynamic memory mapping model. The xHPF run time
system accommodates more dynamic redistribution than the static HPF
Subset directives would imply: the extra facilities would be available
to users of the APR-specific !APR$ PARTITION... directive.
The dd_dstloop(...) call determines the "control
distribution" (APR calls this "spreading" in its manuals) of the
parallelized "DO a3" loop (the rightmost dimension of each array) based on an
owner-computes analysis of the b2 array (any of the three
left-hand-side arrays would suffice). Each of the
dl_mem_by_dl(...) calls determines the current mapping
for its subject array, filling in values that will be used in the
loop. Note that a determination is made in this routine about
required inter-processor communication for any array elements, but no
actual communication is initiated yet. Then
dd_preloop_xchng(...) compares the current node
"locality" recorded earlier in the "control distribution" with values
saved within the runtime system by each dl_mem_by_dl call
(a "work list"), and actually performs any needed inter-processor
communication. Finally, dl_modify(...) updates the
scalar variables that might be affected by such a communication.
For this trivial code with the directives as given, of course there is no required communication.
After the loop code there is:
call dd_postloop_xchng(18, 25, 'three_assignments.f90.F77')
This anticipates that some later array-using code might require a
different mapping of the data, and performs any anticipatory
inter-processor communication. Again, the saved "work list" is used to
make the determination, and again, for this code, no communication is
needed.
Our experience with xHPF has shown that its overhead, even in these "null communication" cases, is fairly small. An actual measurement of that overhead can be obtained from the instrumented library that is part of the xHPF product.
pghpf translates the three assignments into the following:
call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l,i$$u)
call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l1,i$$u1)
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
! +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-
! +l$$b+a1$p)
do i$i = i$$l1, i$$u1
do i$i1 = i$$l, i$$u
a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-
+l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$p)
enddo
enddo
call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l2,i$$u2)
call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l3,i$$u3)
! forall (i$i=i$$l3:i$$u3:1, i$i1=i$$l2:i$$u2:1) b2((u$$b2-l$$b2+1)*
! +(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
! +-l$$b2+b1$p)
do i$i = i$$l3, i$$u3
do i$i1 = i$$l2, i$$u2
b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b2-
+l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b1$p)
enddo
enddo
call pghpf_localize_bounds(c2$d1(c2$dp1),1,1,30,1,i$$l4,i$$u4)
call pghpf_localize_bounds(c2$d1(c2$dp1),2,1,40,1,i$$l5,i$$u5)
! forall (i$i=i$$l5:i$$u5:1, i$i1=i$$l4:i$$u4:1) c2((u$$b2-l$$b2+1)*
! +(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
! +-l$$b2+c1$p)
do i$i = i$$l5, i$$u5
do i$i1 = i$$l4, i$$u4
c2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b2-
+l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c1$p)
enddo
enddo
which is seen in context in the Fortran file that can be saved during a pghpf translation. Clearly there has been no attempt at loop jamming.
The commented "forall" lines are the best indication of the pghpf compiler's parallelization accomplishments. Here both dimensions of the distribution are acknowledged. Also, since this compiler determines that there can be no inter-processor communication, it generates no extra calls. PGI's pghpf has no facility corresponding to xHPF's "FULL" allocations, so all distributed arrays are locally allocated in per-node-portion form in the generated code.
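The subscripts in the pghpf output have the shape extent*(j - lower2) + i - lower1 + offset, which is ordinary column-major linearization of a two-dimensional local section. The sketch below checks that reading on a small case. The section bounds are hypothetical (one block of an assumed 2 x 2 processor grid), base = 1 is an assumed stand-in for the compiler's a2$p-style offsets, and none of the names come from the pghpf runtime.

```fortran
program linear_subscript_sketch
  implicit none
  ! Hypothetical local section: rows 16..30 and columns 21..40 of the
  ! 30 x 40 global array (an assumption, not taken from the listing).
  integer, parameter :: lb1 = 16, ub1 = 30, lb2 = 21, ub2 = 40
  integer, parameter :: ext1 = ub1 - lb1 + 1, ext2 = ub2 - lb2 + 1
  integer, parameter :: base = 1   ! assumed offset of the first local element
  real :: sect(lb1:ub1, lb2:ub2)   ! the section with its global index ranges
  real :: flat(ext1 * ext2)        ! the same storage viewed as rank one
  integer :: i, j, k

  ! Fill the section with values that encode their own indices.
  do j = lb2, ub2
    do i = lb1, ub1
      sect(i, j) = real(100 * i + j)
    end do
  end do

  ! Column-major copy of the section into the rank-one array.
  flat = reshape(sect, (/ ext1 * ext2 /))

  ! The linearized subscript must address the same element as (i, j).
  do j = lb2, ub2
    do i = lb1, ub1
      k = ext1 * (j - lb2) + i - lb1 + base
      if (flat(k) /= sect(i, j)) stop 1
    end do
  end do
  print *, 'linearized subscripts match the two-dimensional section'
end program linear_subscript_sketch
```

The extent factor and the lower bounds are loop-invariant, so a native compiler that hoists them is left with a simple strided address computation in the inner loop.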
XL HPF translates the three assignment statements as:
C 1585-501 Original Source Line 6
do i_4=iown_l_14,MIN0(iown_u_15,40),1
C 1585-501 Original Source Line 6
do i_5=iown_l_16,MIN0(iown_u_17,30),1
a2_27(i_5,i_4) = a1_26(i_5,i_4)
b2_29(i_5,i_4) = b1_28(i_5,i_4)
c2_31(i_5,i_4) = c1_30(i_5,i_4)
end do
end do
as extracted from the pseudo-Fortran listing.
Clearly there is good loop-jamming here.
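As a reminder of what loop jamming (loop fusion) amounts to, the sketch below performs the three assignments first as three separate loop nests and then as a single jammed nest, and checks that the results agree. It is a serial illustration with made-up test data and array names (a2j, b2j, c2j), not output from any of the compilers discussed.

```fortran
program loop_jamming_sketch
  implicit none
  integer, parameter :: n1 = 30, n2 = 40
  real, dimension(n1, n2) :: a1, b1, c1
  real, dimension(n1, n2) :: a2, b2, c2      ! filled by separate loop nests
  real, dimension(n1, n2) :: a2j, b2j, c2j   ! filled by one jammed loop nest
  integer :: i, j

  ! Arbitrary, distinct test data.
  do j = 1, n2
    do i = 1, n1
      a1(i, j) = real(i + n1 * (j - 1))
      b1(i, j) = 2.0 * a1(i, j)
      c1(i, j) = 3.0 * a1(i, j)
    end do
  end do

  ! Unjammed: one loop nest per assignment.
  do j = 1, n2
    do i = 1, n1
      a2(i, j) = a1(i, j)
    end do
  end do
  do j = 1, n2
    do i = 1, n1
      b2(i, j) = b1(i, j)
    end do
  end do
  do j = 1, n2
    do i = 1, n1
      c2(i, j) = c1(i, j)
    end do
  end do

  ! Jammed: a single loop nest carries all three assignments.
  do j = 1, n2
    do i = 1, n1
      a2j(i, j) = a1(i, j)
      b2j(i, j) = b1(i, j)
      c2j(i, j) = c1(i, j)
    end do
  end do

  if (all(a2 == a2j) .and. all(b2 == b2j) .and. all(c2 == c2j)) then
    print *, 'jammed and unjammed loops agree'
  else
    stop 1
  end if
end program loop_jamming_sketch
```

Jamming is safe here because no array appears in more than one of the three statements, so the order in which the statements visit a given (i, j) cannot change the result; the jammed form pays the loop overhead once instead of three times.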
This is the form of the code with the compiler's "-qhot" (High Order loop Transformations) optimization, and this is the portion of the listing elicited by the "-qreport=hotlist" command-line option. (Examination of earlier portions of that listing shows all the invocation options and the more line-by-line "hpflist"-elicited HPF Parallelization Report.)
This compiler generated a two-dimensional parallel spreading of the assignments. The arrays are allocated locally with each processor holding only its portion. No communication is needed and so none is generated.
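A two-dimensional spreading of this kind can be sketched as follows. The code assumes that the iown_l_* and iown_u_* scalars in the listing hold each processor's owned lower and upper bounds, and that the MIN0 against 40 and 30 clamps an upper bound that overshoots the array extent; both readings are inferences from the listing, not statements from IBM documentation. The 2 x 3 processor grid is hypothetical and chosen so that the clamp actually matters.

```fortran
program two_dim_spreading_sketch
  implicit none
  integer, parameter :: n1 = 30, n2 = 40
  integer, parameter :: p1 = 2, p2 = 3     ! hypothetical processor grid (assumption)
  integer :: blk1, blk2, q1, q2, i, j
  integer :: lo1, hi1, lo2, hi2
  integer :: visits(n1, n2)

  blk1 = (n1 + p1 - 1) / p1    ! BLOCK size along dimension 1
  blk2 = (n2 + p2 - 1) / p2    ! BLOCK size along dimension 2
  visits = 0

  ! Emulate every processor of the grid in turn.
  do q2 = 1, p2
    do q1 = 1, p1
      lo1 = (q1 - 1) * blk1 + 1
      hi1 = q1 * blk1
      lo2 = (q2 - 1) * blk2 + 1
      hi2 = q2 * blk2          ! may exceed n2 on the last processor column
      ! The upper bounds are clamped to the array extents, as in the listing.
      do j = lo2, min(hi2, n2)
        do i = lo1, min(hi1, n1)
          visits(i, j) = visits(i, j) + 1
        end do
      end do
    end do
  end do

  ! Every element must be assigned by exactly one processor.
  if (any(visits /= 1)) stop 1
  print *, 'each of the', n1 * n2, 'elements is assigned exactly once'
end program two_dim_spreading_sketch
```

With 40 columns over three processor columns the block size is 14, so the last block nominally ends at column 42; without the clamp the loop would run past the end of the array.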