
| <- Prev | Index | Next -> |
NHSE ReviewTM: Comments · Archive · Search
The following trivial example is examined:
PROGRAM sum_prog
  REAL, DIMENSION(30,40) :: a1, a2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2
  s = sum(a1+a2)
END
The argument of the array reduction is itself a two-dimensional
elemental summation, a1+a2.
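The two steps hidden in `sum(a1+a2)` can be modeled in a few lines of Python. This is only a minimal sketch: the fill values for a1 and a2 are assumptions (the example program leaves the arrays uninitialized), and the helper names `elemental_sum`, `reduce_sum` and `fused_sum` are hypothetical.

```python
# Pure-Python model of s = sum(a1 + a2) for two 30 x 40 arrays.
# The fill values are illustrative assumptions, not taken from the program.
ROWS, COLS = 30, 40

a1 = [[float(i + j) for j in range(COLS)] for i in range(ROWS)]
a2 = [[1.0 for _ in range(COLS)] for _ in range(ROWS)]


def elemental_sum(x, y):
    """Step 1: element-by-element addition, giving a new 30 x 40 array."""
    return [[xv + yv for xv, yv in zip(xrow, yrow)] for xrow, yrow in zip(x, y)]


def reduce_sum(x):
    """Step 2: reduce a two-dimensional array to a single scalar."""
    total = 0.0
    for row in x:
        for value in row:
            total += value
    return total


def fused_sum(x, y):
    """Both steps in one loop nest, without a temporary array."""
    total = 0.0
    for xrow, yrow in zip(x, y):
        for xv, yv in zip(xrow, yrow):
            total += xv + yv
    return total


s_two_step = reduce_sum(elemental_sum(a1, a2))
s_fused = fused_sum(a1, a2)
print(s_two_step, s_fused)
```

Whether the elemental step gets its own temporary array or is fused into the reduction loop is one of the points on which the generated codes differ.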
For the assignment statement, xHPF generates the following:
      s = 0.0
      call dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 108, -11, 1, 1
     . , 30, 3, 1, 1, 10)
      dtx3=-1
      call dd_def_red(s, dd_tdesc(s, 162, 12), 2, 1)
      call dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     . , hi1)
      call dl_mem_by_dl(a2, 126, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     . , hi4)
      call dd_preloop_xchng(11, 16, 'sum_prog.f90.F77', dtx, dtx0, dtx1
     . , s)
      call dl_modify(hi4, hi3, hi2, hi1, hi0, hi)
      do a3 = dtx, dtx0, dtx1
        dtx3=dtx3+1
        dtx2=-1
        DO a0 = 1, 30
          dtx2=dtx2+1
          s=s+a1(hi+dtx2*hi0+dtx3*hi1)+a2(hi2+dtx2*hi3+dtx3*hi4)
        ENDDO
      ENDDO
      call dd_postloop_xchng(16, 16, 'sum_prog.f90.F77', s)
This can be seen in the generated _mpf.f file.
Those lines that are significant to the elemental summation and to the
reduction are highlighted. The "work list" for communication is
indicated by the dd_def_red(...) call. In the inner "DO a0" loop, each
node adds the elements of its own portions of a1 and a2 and accumulates
the result into its local partial sum, s. Finally, the inter-processor
communication needed to complete the array reduction (and to convey the
scalar result to all the participating processors) is performed by the
dd_postloop_xchng(...) call.
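The per-node strategy described above can be imitated in Python as a rough sketch. It assumes a 2 x 2 processor grid, so that (BLOCK, BLOCK) gives each node a 15 x 20 tile; the grid shape, the fill values and the names `TILES` and `local_partial_sum` are all assumptions made for illustration, and the final `sum(partials)` merely stands in for the inter-processor exchange.

```python
# Simulation of per-node partial sums for a (BLOCK, BLOCK) distribution.
# Assumptions: a 2 x 2 processor grid and illustrative fill values.
ROWS, COLS = 30, 40

a1 = [[float(i + j) for j in range(COLS)] for i in range(ROWS)]
a2 = [[1.0 for _ in range(COLS)] for _ in range(ROWS)]

# Zero-based, half-open (row_start, row_stop, col_start, col_stop) per node.
TILES = [
    (0, 15, 0, 20),
    (0, 15, 20, 40),
    (15, 30, 0, 20),
    (15, 30, 20, 40),
]


def local_partial_sum(tile):
    """Fuse the elemental addition and the local reduction in one loop nest."""
    r0, r1, c0, c1 = tile
    s = 0.0
    for j in range(c0, c1):        # outer loop over the owned columns
        for i in range(r0, r1):    # inner loop over the owned rows
            s += a1[i][j] + a2[i][j]
    return s


partials = [local_partial_sum(tile) for tile in TILES]

# Stand-in for the exchange that combines the partial sums across nodes.
s = sum(partials)
print(partials, s)
```

No node ever touches an element outside its own tile, so the only data that must cross the network is one scalar per node.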
At CTC we have examined the communication pattern involved and found it to be a tree with log P levels over the P participating processors. The MPI communication is accomplished with blocking sends and receives.
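A tree reduction of this general kind can be sketched in Python without MPI. The sketch below simulates the ranks as list slots: each assignment stands in for one blocking send and its matching receive. The pairing order and the reverse-tree broadcast are assumptions, since the text only states that the tree has log P levels; `tree_allreduce` is a hypothetical name.

```python
def tree_allreduce(partials):
    """Combine one partial sum per rank along a binary tree, then pass the
    total back down the same tree.

    Returns (values_per_rank, reduction_rounds).
    """
    p = len(partials)
    vals = list(partials)
    rounds = 0
    stride = 1

    # Reduction phase: in each round, rank r + stride "sends" to rank r.
    while stride < p:
        for r in range(0, p, 2 * stride):
            partner = r + stride
            if partner < p:
                vals[r] += vals[partner]
        stride *= 2
        rounds += 1

    # Broadcast phase: rank 0 now holds the total; replay the tree in reverse.
    stride //= 2
    while stride >= 1:
        for r in range(0, p, 2 * stride):
            partner = r + stride
            if partner < p:
                vals[partner] = vals[r]
        stride //= 2

    return vals, rounds


values, rounds = tree_allreduce([1.0, 2.0, 3.0, 4.0, 5.0])
print(values, rounds)
```

With five ranks the reduction takes three rounds, the smallest integer not less than log2(5), after which every rank holds the same total.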
PGI pghpf instantiates a temporary array, a1$a, to hold the elemental summation, and then invokes its own run time system summation-reduction routine:
      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),1,1,30,1,i$$l,i$$u)
      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),2,1,40,1,i$$l1,i$$u1)
!     forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a1$a((u$$b-l$$b+1)*(
!    +i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
!    +-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+a2$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a1$a((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2
     +-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-
     +l$$b5)+i$i1-l$$b4+a2$p)
         enddo
      enddo
      call pghpf_sums(a1$a$r,a1$a(a1$a$p),.true.,27,a1$a$d1(a1$a$dp1),19
     +)
      s = a1$a$r
This can be seen in the context of the full saved Fortran file.
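The subscripts in the pghpf loop nest have the shape of a column-major linearization: the extent of the first dimension times the offset in the second dimension, plus the offset in the first dimension, plus a base. The Python sketch below illustrates that arithmetic. It assumes that u$$b and l$$b are the upper and lower bounds of the first dimension and that l$$b1 is the lower bound of the second; the generated code does not document these names, so this reading, the 15 x 20 local tile, and the name `linear_index` are assumptions.

```python
def linear_index(i, j, lb1, ub1, lb2, base=0):
    """Column-major offset of element (i, j).

    lb1..ub1 are the bounds of the first dimension, lb2 is the lower bound
    of the second dimension, and base is the offset of the first element.
    """
    extent1 = ub1 - lb1 + 1
    return extent1 * (j - lb2) + (i - lb1) + base


# An assumed 15 x 20 local tile with one-based bounds.
LB1, UB1 = 1, 15
LB2, UB2 = 1, 20

# Same loop order as the generated code: second dimension outermost.
visited = [
    linear_index(i, j, LB1, UB1, LB2)
    for j in range(LB2, UB2 + 1)
    for i in range(LB1, UB1 + 1)
]
print(visited[:5], visited[-1])
```

Running the first-dimension index in the inner loop makes the offsets consecutive, so the temporary and both operands are traversed with unit stride.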
The IBM XL HPF strategy for this case is similar to that of xHPF in
that it uses a local scalar variable (here the compiler-generated
"SCALAR_28") for per-node partial sums. It also resembles xHPF in
that it accomplishes the elemental array summation in the same
"do i_9" loop that computes the local sum. Then it invokes its own
run-time system routine, _xlhpf_reduce_sum(...), to complete
the array reduction:
      s = 0.
      SCALAR_28 = 0.
C 1585-501 Original Source Line 6
      do i_8=iown_l_18,MIN0(iown_u_19,40),1
C 1585-501 Original Source Line 6
        do i_9=iown_l_20,MIN0(iown_u_21,30),1
          SCALAR_28 = SCALAR_28 + (a1_34(i_9,i_8) + a2_35(i_9,i_8))
        end do
      end do
      Recv_index_31(1) = (-2)
      Recv_index_31(2) = (-2)
      Send_index_32(1) = 0
      Send_index_32(2) = 1
      DS_SAS_33(1) = 0
      DS_SAS_33(2) = MIN0(29 / D_17(1),PGB_13(1) - 1)
      DS_SAS_33(3) = 1
      DS_SAS_33(4) = 0
      DS_SAS_33(5) = MIN0(39 / D_17(3),PGB_13(2) - 1)
      DS_SAS_33(6) = 1
      call _xlhpf_reduce_sum(9,SCALAR_28,s,PG_15,2,Send_index_32,DS_SAS
     &_33,Recv_index_31)
Complete details can be seen in the
pseudo-Fortran listing.
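The loop bounds in the XL HPF code clamp each node's upper ownership bound to the array extent with MIN0. The Python sketch below shows why such a clamp is needed under a BLOCK distribution, where the block size is the array extent divided by the number of processors, rounded up. The function names are hypothetical. The `last_owner` helper has the same shape as the `MIN0(29 / D_17(1), PGB_13(1) - 1)` expression only under the assumption that D_17(1) is a block size and PGB_13(1) a processor count, which the listing does not confirm.

```python
def block_size(n, p):
    """BLOCK distribution block size: n / p rounded up."""
    return -(-n // p)


def owned_range(n, p, k):
    """One-based inclusive (lower, upper) bounds owned by processor k.

    k is zero-based. The range is empty when lower > upper.
    """
    b = block_size(n, p)
    lower = k * b + 1
    upper = min((k + 1) * b, n)   # clamp to the array extent
    return lower, upper


def last_owner(n, p):
    """Zero-based index of the last processor that owns any element."""
    b = block_size(n, p)
    return min((n - 1) // b, p - 1)


print(owned_range(30, 4, 3))   # last of 4 processors along an extent of 30
print(owned_range(30, 7, 6))   # 7 processors: the last one owns nothing
print(last_owner(30, 7))
```

With 30 elements over 4 processors the block size is 8, so the last processor owns only elements 25 to 30; with 7 processors the block size is 5 and the seventh processor owns nothing.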
At CTC we have observed that the reduction communication pattern is accomplished with an MPI collective communication REDUCE_ALL followed by an MPI broadcast of the scalar value.