NHSE Review™ 1996 Volume Second Issue

Comparison of 3 HPF Compilers



Chapter 5 -- Array Reduction

Fortran 90 reduction intrinsic functions, such as SUM, PRODUCT, and MAXVAL, are an unavoidable non-embarrassingly-parallel part of any serious application written in HPF. How the compiler handles the necessary single-node code generation and communication can have a large bearing on application performance.

The following trivial example is examined:

      PROGRAM sum_prog
      REAL, DIMENSION(30,40) :: a1, a2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2

      s = sum(a1+a2)

      END

The argument passed to the array reduction, a1+a2, is itself a two-dimensional elemental summation.
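
To make the decomposition concrete, the following Python sketch simulates what any of the three compilers must arrange: each node adds and sums only the block it owns, and the partial sums are then combined. It is a simulation under stated assumptions, not compiler output. The 2 x 2 processor grid, the test data, and the names block_bounds and local_partial_sum are all hypothetical; the example program does not declare a processor arrangement.

```python
# Simulation only. The 2 x 2 processor grid is an assumption: the example
# program leaves the processor arrangement to the compiler.
ROWS, COLS = 30, 40          # extents of a1 and a2 in the example
P_ROWS, P_COLS = 2, 2        # assumed processor grid


def block_bounds(extent, nprocs, p):
    """0-based half-open index range owned by processor p under BLOCK."""
    size = -(-extent // nprocs)          # ceiling division gives the block size
    lo = min(p * size, extent)
    hi = min(lo + size, extent)
    return lo, hi


# Arbitrary test data. a2 is aligned with a1, so both share one ownership map.
a1 = [[float(i + j) for j in range(COLS)] for i in range(ROWS)]
a2 = [[float(i * j % 7) for j in range(COLS)] for i in range(ROWS)]


def local_partial_sum(pr, pc):
    """Elemental add and partial sum over the block owned by node (pr, pc)."""
    r_lo, r_hi = block_bounds(ROWS, P_ROWS, pr)
    c_lo, c_hi = block_bounds(COLS, P_COLS, pc)
    return sum(a1[i][j] + a2[i][j]
               for i in range(r_lo, r_hi)
               for j in range(c_lo, c_hi))


partials = [local_partial_sum(pr, pc)
            for pr in range(P_ROWS) for pc in range(P_COLS)]
s = sum(partials)            # stands in for the inter-processor reduction
```

Because a2 is aligned with a1, the elemental addition needs no communication; only the final combination of the partial sums does.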


5.1 APR xHPF

The assignment statement generates the following:

      s = 0.0
      call dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 108, -11, 1, 1
     .    , 30, 3, 1, 1, 10)
      dtx3=-1
      call dd_def_red(s, dd_tdesc(s, 162, 12), 2, 1)
      call dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     .    , hi1)
      call dl_mem_by_dl(a2, 126, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     .    , hi4)
      call dd_preloop_xchng(11, 16, 'sum_prog.f90.F77', dtx, dtx0, dtx1
     .    , s)
      call dl_modify(hi4, hi3, hi2, hi1, hi0, hi)
      do a3 = dtx, dtx0, dtx1
      dtx3=dtx3+1
      dtx2=-1
        DO a0 = 1, 30
      dtx2=dtx2+1
      s=s+a1(hi+dtx2*hi0+dtx3*hi1)+a2(hi2+dtx2*hi3+dtx3*hi4)
        ENDDO
      ENDDO
      call dd_postloop_xchng(16, 16, 'sum_prog.f90.F77', s)

This can be seen in the generated _mpf.f file.

Those lines that are significant to the elemental summation and to the reduction are highlighted. The "work list" for communication is indicated by the dd_def_red(...) call. Each node computes the elemental summation of its local portions of a1 and a2 and accumulates its local partial sum into s in the inner "DO a0" loop. Finally, the inter-processor communication needed to perform the array reduction (and to convey the scalar result to all the participating processors) is performed by the dd_postloop_xchng(...) call.

At CTC we have examined the communication pattern involved, and it is a tree with log P levels over the P participating processors. The MPI communications are accomplished with blocking sends and receives.
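
The tree pattern can be sketched as follows. This is a Python simulation rather than MPI code, and it assumes a simple binary tree rooted at node 0; the exact pairing that xHPF uses is not shown in the generated code above. The function name tree_reduce is hypothetical, and the step that conveys the result back to every node is omitted.

```python
def tree_reduce(partials):
    """Combine per-node partial sums pairwise; return (total, levels).

    Assumed pattern: at each level, node r receives the running value of
    node r + stride and adds it to its own. The stride doubles per level.
    """
    vals = list(partials)
    nprocs = len(vals)
    levels = 0
    stride = 1
    while stride < nprocs:
        for r in range(0, nprocs, 2 * stride):
            if r + stride < nprocs:
                vals[r] += vals[r + stride]   # one send/receive pair
        stride *= 2
        levels += 1
    return vals[0], levels


# Eight nodes need three levels, since 2**3 == 8.
total, levels = tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

With P nodes the loop runs ceil(log2 P) times, which is the sense in which the tree has log P levels.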


5.2 PGI pghpf

PGI pghpf instantiates a temporary array, a1$a, to hold the elemental summation, and then invokes its own run time system summation-reduction routine:

      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),1,1,30,1,i$$l,i$$u)
      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),2,1,40,1,i$$l1,i$$u1)
!     forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a1$a((u$$b-l$$b+1)*(
!    +i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
!    +-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+a2$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a1$a((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2
     +-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-
     +l$$b5)+i$i1-l$$b4+a2$p)
         enddo
      enddo
      call pghpf_sums(a1$a$r,a1$a(a1$a$p),.true.,27,a1$a$d1(a1$a$dp1),19
     +)
      s = a1$a$r

This can be seen in the context of the full saved Fortran file.
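
The structural difference between this approach and a fused loop can be sketched in Python. The two functions below are hypothetical illustrations operating on flat lists, not translations of the generated code. The first materialises the elemental sum before reducing it, as the a1$a temporary does; the second accumulates in a single pass. The second value each returns is the number of temporary elements allocated.

```python
def sum_with_temporary(a1, a2):
    """Two passes: build the elemental sum, then reduce it."""
    tmp = [x + y for x, y in zip(a1, a2)]   # plays the role of a1$a
    return sum(tmp), len(tmp)               # reduction over the temporary


def sum_fused(a1, a2):
    """One pass: accumulate directly, with no temporary storage."""
    s = 0.0
    for x, y in zip(a1, a2):
        s += x + y
    return s, 0


print(sum_with_temporary([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(sum_fused([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Both forms produce the same sum; the temporary costs one extra local array the size of each node's block and a second traversal of it. Whether that cost matters in practice is not measured here.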


5.3 IBM XL HPF

The IBM XL HPF strategy for this case is similar to that of xHPF in that it uses a local scalar variable (here the compiler-generated "SCALAR_28") for per-node partial sums. It also resembles xHPF in that it accomplishes the elemental array summation in the same "do i_9" loop that computes the local sum. Then it invokes its own run time system routine, _xlhpf_reduce_sum(...), to complete the array reduction:

       s = 0.
       SCALAR_28 = 0.
C 1585-501  Original Source Line 6
       do i_8=iown_l_18,MIN0(iown_u_19,40),1
C 1585-501  Original Source Line 6
         do i_9=iown_l_20,MIN0(iown_u_21,30),1
           SCALAR_28 = SCALAR_28 + (a1_34(i_9,i_8) + a2_35(i_9,i_8))
         end do
       end do
       Recv_index_31(1) = (-2)
       Recv_index_31(2) = (-2)
       Send_index_32(1) = 0
       Send_index_32(2) = 1
       DS_SAS_33(1) = 0
       DS_SAS_33(2) = MIN0(29 / D_17(1),PGB_13(1) - 1)
       DS_SAS_33(3) = 1
       DS_SAS_33(4) = 0
       DS_SAS_33(5) = MIN0(39 / D_17(3),PGB_13(2) - 1)
       DS_SAS_33(6) = 1
       call _xlhpf_reduce_sum(9,SCALAR_28,s,PG_15,2,Send_index_32,DS_SAS
     &_33,Recv_index_31)

Complete details can be seen in the pseudo-Fortran listing.

At CTC we have observed that the reduction communication pattern is accomplished with an MPI collective communication REDUCE_ALL followed by an MPI broadcast of the scalar value.
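
The MIN0 clamps on the loop bounds in the listing above matter when the block size does not divide the array extent. The following Python sketch shows the idea in one dimension and then mimics the reduction and the broadcast. It is a simulation under assumptions: four nodes, an extent of 30, and the names nominal_bounds and node_partial are hypothetical, as is the guess that the iown_u_* values are unclamped block limits.

```python
EXTENT, NPROCS = 30, 4        # assumed: 30 elements over 4 nodes


def nominal_bounds(p):
    """1-based inclusive (lower, upper) block limits before clamping."""
    block = -(-EXTENT // NPROCS)          # ceiling division: 8 here
    return p * block + 1, (p + 1) * block


a = [float(i) for i in range(1, EXTENT + 1)]   # a[i - 1] holds element i


def node_partial(p):
    """Per-node partial sum, playing the role of SCALAR_28."""
    lower, upper = nominal_bounds(p)
    scalar = 0.0
    # min(upper, EXTENT) corresponds to MIN0(iown_u, extent) in the listing.
    for i in range(lower, min(upper, EXTENT) + 1):
        scalar += a[i - 1]
    return scalar


partials = [node_partial(p) for p in range(NPROCS)]
total = sum(partials)                     # stands in for the reduction
s_on_each_node = [total] * NPROCS         # stands in for the broadcast
```

Here the last node's nominal upper limit is 32, so without the clamp its loop would run past the end of the array.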

Copyright © 1996




presberg@tc.cornell.edu
Last modified: Fri Sep 27 19:31:45 1996