From owner-hpff-task  Tue Jan  2 11:24:26 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id LAA17257 for hpff-task-out; Tue, 2 Jan 1996 11:24:26 -0600 (CST)
Received: from [128.42.1.213] (morpheus.cs.rice.edu [128.42.1.213]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id LAA17244 for <hpff-task>; Tue, 2 Jan 1996 11:24:21 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530507ad0f19fdde37@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 2 Jan 1996 11:25:57 -0600
To: hpff-task
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: HPF reduce directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Fabien -
Actually, the mailing list that is discussing REDUCE is hpff-task.

hpff-task -
Forwarded from hpff-distribute.  Apologies to those of you who see this
twice.  I'll comment that I generally like the "external" syntax, modulo a
couple syntax nits (for example, INDEPEDENT does not name the loop index,
and doesn't apply to nested loops.)
        Chuck

Hi Rob,

I'm not sure you're in charge of the reduce directive, but anyway, I guess
someone on the "hpff-distribute" list is:

I just had a look the "reduce" directive proposal. I was wondering how
the compiler should guess at which level in a loop nest the reduction
must be performed.

My point is that the way the "reduce" directive is placed in the loop nest
does not give this information to the compiler, thus some analysis will be
needed. This analysis could be avoided with some other syntax in the "new"
and "independent" spirit.


EXAMPLE:
--------

Here is an example to illustrate this point.  (This example is not
designed to be meaningful from the computation point of view, but to
provide a set of problems in the same loop nest.) The reduction on s1 must
be performed at the j level, the ones on s2 and s3 at the i level. The s3
reduction has two contributions from B and A. The reduction on s1 is
initialized with B. It is used for the p reduction.

      real A(n,n), B(n)
! there may be some external loop around this piece of code
      ...
      p = 1.
      s2 = 0.
      s3 = 0.
      do i=1, n
         B(i) = 2.*real(i-j)
         s1 = B(i)
         s3 = s3 + n*B(i)
         do j=1, n
            A(j,i) = 1./real(i+j)
            s1 = s1 + A(j,i)
            s2 = s2 + A(j,i)
            s3 = s3 + A(j,i)
         enddo
         p = p * s1
      enddo
      ...
! p, s2 and s3 might be used...


CURRENT "INTERNAL" REDUCE SYNTAX:
---------------------------------

With the current syntax, it's quite verbose and the compiler must decide
at with level the reduction must be performed. Another issue is whether
or not several contributions are allowed:

      p = 1.
      s2 = 0.
      s3 = 0.
chpf$ independent(i), new(s1, j)
      do i=1, n
         B(i) = 2.*real(i-j)
         s1 = B(i)                 ! initialization of s1
chpf$ reduce                       ! first contribution to s3...
         s3 = s3 + n*B(i)
chpf$ independent(j)
         do j=1, n
            A(j,i) = 1./real(i+j)
chpf$ reduce                       ! must guess that it is at the j level
                                   ! what are the needed analysis for this?
                                   ! should it trust the "new" outside?
                                   ! "see" that s1 is "simply" (as opposed
                                   ! to s3) assigned and used in the i loop?
                                   ! note that both s1 and s3 are read and
                                   ! written within the i loop.
            s1 = s1 + A(j,i)
chpf$ reduce                       ! i level (2 loops outward)
            s2 = s2 + A(j,i)
chpf$ reduce                       ! i level, second contribution to s3...
            s3 = s3 + A(j,i)
         enddo
chpf$ reduce                       ! i level
         p = p * s1
      enddo


"EXTERNAL" REDUCE SYNTAX:
-------------------------

I would rather suggest the less verbose (and nothing/few to guess):

      p = 1.
      s2 = 0.
      s3 = 0.
chpf$ independent(i), new(j,s1), reduce(s2,s3,p)
      do i=1, n
         B(i) = 2.*real(i-j)
         s1 = B(i)
         s3 = s3 + n*B(i)
chpf$ independent(j), reduce(s1[,s2,s3]) ! [] optional?
         do j=1, n
            A(j,i) = 1./real(i+j)
            s1 = s1 + A(j,i)
            s2 = s2 + A(j,i)
            s3 = s3 + A(j,i)
         enddo
         p = p * s1
      enddo

Apart from the reduced (!) verbosity, the very fact that the reduce
directive is attached to a loop gives the compiler a hint about the level
at which the reduction (=> [partial] synchronization) is to be performed.

For checking purposes every scalar assigned in an independent loop should
appear in some "new" or "reduce" directive...  Also the multi-contribution
to a reduction does not require multiple annotations from the user (rising
the question of the semantics of the operation if some are forgotten...)

The compiler has to guess what are the reduction statement(s) for each
scalar, that is it has to find all the assignments to that scalar within
the loop nest.

The spirit of such an "external" reduce directive is quite different from
the "internal" one. It asserts at a higher level in the code a general
property about a loop nest and a scalar manipulated in this loop nest.
This is homogeneous with the existing "independent" and "new" directives,
ans thus improves the readability of the code.


NOTES:
------

As far as the forall is concerned, since it may not be independent, the
"external" reduce should exist without attachment:

chpf$ reduce(s)
      forall(i=2:n-1)
        s = s + a(i)
        a(i) = a(i)+a(i-1)+a(i+1)
      end forall

Also it is not very clear to me what to do with reductions into arrays.
(I'm not sure from the minutes whether this is allowed or not)...

chpf$ independent(i,j), reduce(S(i)) ???
      do i=1, n
        do j=1, n
          S(i) = S(i) + A(k,j)
        enddo
      enddo

means that an "internal" directive might be of some help?


CONCLUSION:
-----------

To conclude, I suggest that another "external" (as opposed to the current
"internal") syntax for the reduce directive would (1) leave less work to
the compiler about "guessing" things, (2) be more high level and readable
(3) be less verbose.


Hope this help. Thanks for having read this mail. Comments are welcome.
Have a nice day,

Fabien.

--
Fabien COELHO __ http://www.cri.ensmp.fr/~coelho __ coelho@cri.ensmp.fr


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Jan  2 11:48:57 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id LAA18163 for hpff-task-out; Tue, 2 Jan 1996 11:48:57 -0600 (CST)
Received: from [128.42.1.213] (morpheus.cs.rice.edu [128.42.1.213]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id LAA18155; Tue, 2 Jan 1996 11:48:53 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530508ad0f1bd34ca9@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 2 Jan 1996 11:50:28 -0600
To: Fabien COELHO <coelho@chailly.ensmp.fr>, hpff-task
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: not atached "new" directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Hi, Fabien -

Again, a suggestion probably better directed at hpff-task (considering task
and other forms of parallelism) than at hpff-distribute (considering data
mapping)...


>Hi Chuck,
>
>> chpf$ reduce(s)
>>       forall(i=2:n-1)
>>         s = s + a(i)
>>         a(i) = a(i)+a(i-1)+a(i+1)
>>       end forall
>
>The example I just posted on the net reminds me my (really real)
>application with "private" scalars that required a "new" clause outside
>a loop. Forall constructs may also use scalars to avoid redundant
>computations:
>
>chpf$ new(x) ! apply on the next loop, here the forall...
>      forall(i=2:n)
>        x = a(i-1)+2*a(i)+a(i+1) ! x used to store an intermediate value.
>        b(i) = f1(x)             ! some pure function.
>        a(i) = f2(x)             ! idem.
>      end forall
>
>Here the loop is NOT independent and x must be private to each iteration
>for parallelism. I guess the compiler must expand the scalar into an array
>for such a loop. I think that this kind of loop should be allowed by HPF,
>and that it is not with the current definition of forall/new.
>The other way around is to assume that a scalar in a forall *must* be
>private (?) or to perform some analysis to prove it, but I understood that
>one of the goal of the HPF Forum is that no advanced analysis technique is
>needed by compilers for handling HPF.
>
>Hope this help. Have a nice day,
>
>Fabien.


I'm confused what you are asking for.

The intent of FORALL was to provide a different syntax and more general
shapes for array assignments.  This makes it orthogonal to reduction
operators like the one in your first example.  Is it really a problem to
call the SUM intrinsic outside the FORALL?

You could write the second loop as an INDEPENDENT DO loop without trouble.
Why is it vital that you use the keyword FORALL?  Many other cases of
temporary scalars can be handled by declaring a local variable in a PURE
function.  (This doesn't work here, obviously.)

It seems that you want FORALL to be a general-purpose construct for "do
this in parallel".  However, FORALL is something more limited - this was a
concious decision to make the construct simpler.  The INDEPENDENT DO loop
seems to be closer to what you want in any case.

                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Jan  2 11:49:07 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id LAA18188 for hpff-task-out; Tue, 2 Jan 1996 11:49:07 -0600 (CST)
Received: from [128.42.1.213] (morpheus.cs.rice.edu [128.42.1.213]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id LAA18177; Tue, 2 Jan 1996 11:49:02 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v0153050cad0f20706200@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 2 Jan 1996 11:50:37 -0600
To: coelho@chailly.ensmp.fr, hpff-task
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re:  HPF reduce directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Well, I tried to forward Fabien's message as fast as possible.  Still drew
Rob's response on the other list...

                                                Chuck


Dear Fabien,

I know this is a subtle issue.   I thought we had unabiguously defined the
answer to the "shich loop" question as the outermost DO loop that:

i)  Is independent

ii)  Does not have a NEW clause for the reduction variable, and

iii)  Does not *contain* in its range an independent DO loop with a NEW
      clause for the reduction variable.

Also, the reduction statement must be in the lexical range of the DO loop.
Thus, we have addressed the issues of whether more than one reduction statment
may occur (there is no restriction on the number or the place)
and which loops are the scope of the reduction.   Nevertheless, your
points about economy of expression are very well taken.    We should discuss
this alternative.

Rob


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Jan  2 13:01:10 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id NAA20906 for hpff-task-out; Tue, 2 Jan 1996 13:01:10 -0600 (CST)
Received: from [128.42.1.213] (morpheus.cs.rice.edu [128.42.1.213]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id NAA20899; Tue, 2 Jan 1996 13:01:05 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530511ad0f305b1f72@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 2 Jan 1996 13:02:40 -0600
To: Fabien COELHO <coelho@chailly.ensmp.fr>, hpff-task
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: HPF reduce directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

>> I'll comment that I generally like the "external" syntax, modulo a
>> couple syntax nits (for example, INDEPEDENT does not name the loop index,
>> and doesn't apply to nested loops.)
>
>Do you mean that my inner loop cannot be tagged as independent?
>Because it is not perfectly nested?
>
>I have not seen such a restriction in HPF 1.0 (or it is not listed in the
>constraint list?). I cannot see any rational for it. The compiler is not
>required to execute the loop in parallel, it's just a matter of asserting
>some semantical property of a piece of code, in my understanding.

Sorry for the ambiguous wording.  What I meant was, "A single INDEPENDENT
directive applies to only one DO loop or FORALL."  It is legal to nest
INDEPENDENT directives as your example did, but this does not happen
automatically.

>BTW, Happy new year!
>
>Fabien.

Happy Holidays to you as well.

                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Jan  2 13:22:32 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id NAA21564 for hpff-task-out; Tue, 2 Jan 1996 13:22:32 -0600 (CST)
Received: from fontainebleau.ensmp.fr (root@fontainebleau.ensmp.fr [192.54.148.100]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id NAA21544; Tue, 2 Jan 1996 13:22:22 -0600 (CST)
Received: from cri.ensmp.fr (chailly.ensmp.fr) by fontainebleau.ensmp.fr with SMTP id AA09325
  (5.67a8/IDA-1.5); Tue, 2 Jan 1996 20:22:01 +0100
Received: from provins.caii by cri.ensmp.fr (4.1/SMI-4.0)
	id AA14116; Tue, 2 Jan 96 20:22:00 +0100
Date: Tue, 2 Jan 96 20:22:00 +0100
Message-Id: <9601021922.AA14116@cri.ensmp.fr>
Received: by provins.caii (4.1/SMI-4.1)
	id AA03523; Tue, 2 Jan 96 20:21:59 +0100
From: Fabien COELHO <coelho@chailly.ensmp.fr>
To: schreiber@hpl.hp.com
Cc: chk@cs.rice.edu, hpff-task@cs.rice.edu
Subject: hpff-task: Re: HPF reduce directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Dear Rob,

> I know this is a subtle issue.   I thought we had unabiguously defined the
> answer to the "shich loop" question as the outermost DO loop that:
> 
> i)  Is independent
> ii)  Does not have a NEW clause for the reduction variable, and
> iii)  Does not *contain* in its range an independent DO loop with a NEW
>       clause for the reduction variable.
> 
> Also, the reduction statement must be in the lexical range of the DO loop.

Ok. These conditions do make sense and allow the compiler to find the
"right" outer loop by going up the syntax tree. Note that there is no
contradiction with my point: 

>> My point is that the way the "reduce" directive is placed in the loop nest
>> does not give this information to the compiler, thus some analysis will be
>> needed. This analysis could be avoided with some other syntax in the "new"
>> and "independent" spirit.

I haven't claimed that such a condition would not exist or would be that
difficult to check. My point is just that this search for the "right" loop
nest is *not* needed at all if the reduction is expressed at the loop level. 
Thus the compiler is directly at the right point and knows where to insert
the parallel reduction code for a given scalar, instead of having first to
look for the loop level. You can switch from some analysis to none.

> Thus, we have addressed the issues of whether more than one reduction
> statment may occur (there is no restriction on the number or the place)
> and which loops are the scope of the reduction. 

Great!

> Nevertheless, your points about economy of expression are very well
> taken. We should discuss this alternative.

As a matter of taste, I like vey much the "homegeneity" (with new, but not
independent...) argument. A simple syntactic issue for a language is that
things from one point of the language are the same at another point of the
language, if possible, so that the end-user won't have to learn different
"look and feel" syntax within the same language. This is not a simple
issue to look at for a language being discussed by a forum, where
aesthetics is not the primary goal. However it is an issue when you have
to teach/learn the language. With this in mind, I think that 

chpf$ independent(x), new(y), reduce(z)...
                 ^^^
could be allowed in HPF because it is simpler to teach and remember, every
body needs a list of or argument between parenthesis. On the other hand,
to be consistent with the "old" Fortran look and feel, one might have
prefered

chpf$ independent x
chpf$ new y
chpf$ reduce z

because it looks like declarations:

      real a
chpf$ dynamic w
chpf$ template t
chpf$ processors p

Well, not that easy an issue:-) It becomes an issue with teh *many*
suggested extensions considered by the forum for HPF-later...

Well, happy new year, 

Fabien.

--
Fabien COELHO __ http://www.cri.ensmp.fr/~coelho __ coelho@cri.ensmp.fr
  CRI, ENSMP, 35, rue Saint Honor�, 77305 Fontainebleau Cedex, France
     phone: 33 1 64 69 {voice: 48 52, fax: 47 09, standard: 47 08}
       ________  All opinions expressed here are mine  ________
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Jan  2 14:13:30 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id OAA22899 for hpff-task-out; Tue, 2 Jan 1996 14:13:30 -0600 (CST)
Received: from hplms26.hpl.hp.com (hplms26.hpl.hp.com [15.255.168.31]) by cs.rice.edu (8.7.1/8.7.1) with ESMTP id OAA22894 for <hpff-task@cs.rice.edu>; Tue, 2 Jan 1996 14:13:26 -0600 (CST)
Received: from hplpp3.hpl.hp.com by hplms26.hpl.hp.com with ESMTP
	($Revision: 1.36.108.11 $/15.5+ECS 3.3+HPL1.1S) id AA261593604; Tue, 2 Jan 1996 12:13:25 -0800
Received: by hplpp3.hpl.hp.com
	(1.37.109.14/15.5+ECS 3.3+HPL1.1) id AA215953606; Tue, 2 Jan 1996 12:13:26 -0800
Date: Tue, 2 Jan 1996 12:13:26 -0800
From: Rob Schreiber <schreibr@hplpp3.hpl.hp.com>
Message-Id: <199601022013.AA215953606@hplpp3.hpl.hp.com>
To: hpff-task@cs.rice.edu
Subject: hpff-task: reductions
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


Here is a new rev of the reduction proposal.   Note that except for the last
section, only intrinsic reduction is considered.




    Proposed HPF Extension for Reduction Operations in Independent Loops

    Rob Schreiber and Chuck Koelbel, and Joel Saltz

    January 2, 1996

0.  Changes Since the Last Version.

RSS is now 46 years old.

Changes reflect response at the November, 1995 HPFF meeting.

There is a proposal for intrinsic reduction, followed by a proposed extension for
user defined reduction.

1.  Rationale

It is often the case that a data parallel computation cannot be
expressed in HPF Version 1.1 as an INDEPENDENT loop because several of
the loop iterations update one or more variables.  In such cases
parallelism may be possible and desirable, because the order of updates
is immaterial to the final result.  This is most often the case with
accumulations, such as the following loop:

        do i = 1, 1000000000
            x = x + complicated_function(i)
        end do

This loop can run in parallel as long as its iterations make their
modifications to the shared variable (X) in an atomic manner.
Alternately, the loop can be run in parallel by making updates to
temporary copies of X, with a (short) final phase to merge these
changes.  In either case, the computation is conceptually parallel but
not allowed by the original definition of INDEPENDENT.

When programming this kind of computation to run in parallel in HPF
Version 1, the programmer must store update information in a temporary
array whose size is equal to the number of loop iterations, and then
use an intrinsic reduction function or XXX_SCATTER library function
after the loop.  The problems with this approach are twofold:  the
temporary array may be excessively large; and the reduction operation
has to be intrinsic.

For these reasons, we have added a reduction feature to HPF that allows
loops that update shared variables (shared means that they are updated
by more than one loop iteration) to be labeled as independent.

2.   Technical Proposal

2.1 Add a REDUCE Directive

H499 reduce-directive         is  !HPF$ REDUCE

Constraint:  The first non-comment, non-blank line following a
reduce-directive must be an assignment-stmt of a special form, called a
reduction statement.

2.2 Reduction Statment

A reduction statement is an assignment statement of a special form, that
occurs in the range of an independent DO loop.

H498 reduction-stmt           is  assignment-stmt

Any statement immediately preceded by a REDUCE directive is a reduction
statement.  An assignment to a whole scalar or array variable in the
range of an independent DO loop must be a reduction statement and
therefore must be preceded by a reduce directive.

Constraint 1: The form of a reduction-stmt must be

     reduction-variable = expr reduction-op reduction-variable

or

     reduction-variable = reduction-variable reduction-op expr
or
     reduction-variable = &
       reduction-function(reduction-variable, expr)

or
     reduction-variable = &
       reduction-function(expr, reduction-variable)



A variable that appears on the left hand side of a reduction statement
is said to be a reduction variable.  The operator or function that
precedes or follows the reduction variable reference after the = sign
is the reduction operator.

H498.1  reduction-variable    is  variable

H498.2  reduction-op          is  intrinsic-op 

H498.3  reduction-function    is  MAX
                              or  MIN
                              or  IAND
                              or  IOR
                              or  IEOR
                              or  MATMUL

Constraint 2:  The two reduction-variable references that occur in a
reduction statement must be lexically identical.

Constraint 3:  A reduction-op may not be a rel-op; and it may not be
power-op (i.e. **).

Constraint 4:  The following two forms of the reduction-stmt are not allowed:

    reduction-variable = expr - reduction-variable

    reduction-variable = expr / reduction-variable

Constraint 5:  A reduction-stmt must occur in the range of an
INDEPENDENT DO loop in which the reduction variable does not have the
NEW attribute.    Additional constraints on the location of a reduction
statement are covered in Section 2.5.

The intrinsic operators and functions that are allowed are all associative
(at least in their mathematical definitions, 
even though the usual implementations of the arithmetic operators
by Fortran language processors
and the underlying hardware are not);
they are listed below (in Section 3).  All are commutative except for
MATMUL.  (We apply the terms associative and commutative to the
subtraction and division operators, since their meaning is addition of
the additive or multiplication by the multiplicative inverse,
respectively:   A - B means A + (-B).)

According to Constraint 5 above, a reduction statement occurs in an
INDEPENDENT loop in which the reduction variable is not NEW.  In fact,
it may occur in a nest of such loops.   It is said to be *protected* in
the outermost of these.  While this loop or loop nest is active, the
reduction variable may occur only as a reduction variable; it does not
have a defined value until control passes to the statement following
the loop.   This issue is discussed fully in Section 2.5.

2.3 Relaxation of the HPF Standard's requirement for INDEPENDENT loops

In Section 4.4 of the HPF standard, 
the first proviso on what may not occur in the loop is modified to:

*  Any two operations that assign to the same object interfere with
each other, (begin added text) unless all assignments in the loop to
the object occur in reduction statements.  (end added text)

Thus, the user's assertion implied by INDEPENDENT about the behavior of
the program is weakened, and no longer applies to reduction variables
that are protected in the DO construct.  On exit from the outermost
loop in which it is protected the value of a reduction variable is
processor dependent, although it is constrained by the model
implementation mechanism given in Section 3.

2.4  Example.

Here is an ADD_SCATTER done as an independent loop:

    !hpf$ independent
    do i = 1, n
        !hpf$ reduce
        X(index(i)) = X(index(i)) - f(i)
    enddo

The implementation will most likely make a local copy of an array of the
same shape as X, and initialize it to zero.  Each iteration will
subtract the value of f(i) from its local X(index(i)).  To
create the final result, it will combine all the local arrays with the
initial value of X:  the combining operator is addition, so that the
result is the sum of the initial value of X and the local arrays.

2.5  Restrictions on the Lexical Location of Reduction Statements, and on the
Use of Reduction Variables

In this discussion, we use the term *range* (of a DO construct) and
*active* (an active DO construct) in their technical sense as defined
by Fortran 90.  The range of a DO construct consists of the statements
lexically bounded by the DO statement and the terminal statement of the
construct.  Thus, a subprogram invoked from inside a DO loop is not in
its range, although the CALL statement or function reference that
invokes it is.    On the other hand, the DO construct is active from
the time the DO statement is encountered until the time that the
matching execution of the terminal statement is executed, or the
program stops, or control is transferred out of the construct in one of
the other ways allowed for INDEPENDENT loops (see page 302 of the
Handbook and Section 4.4 of the HPF Standard.)

When a reduction operation is executed, some nest of DO constructs will
be active; this is guaranteed by the Constraint 5 in Section 2.2.  The
reduction variable is said to be protected in some, and possibly all of
the active nest; the rule is this:  it is protected in the outermost DO
construct that is INDEPENDENT, has no NEW clause that names the
reduction variable, and does not *contain* an independent directive in
which the reduction variable occurs in a NEW clause (see Example 2,
below).   It is protected in all of this DO construct, including any
other control structures nested within it.

HPF requires that any reduction statement occur in the range of the DO
construct in which its variable is protected: it must be lexically
contained in the body of the outermost loop in which it is protected.

A protected reduction variable may only occur in reduction statements,
and only in references to the left of the assignment operator and the
matching reference to the right of the assignment operator.  Outside of
the loop in which it is protected, however, it is treated as an ordinary
variable.

2.5.1 Example Loop Nests

When loops are nested, a reduction variable may be protected in an
independent outer loop even though the reduction operations in which it
occurs are nested inside an inner loop.  Moreover, the inner loop and
any intervening loops may or may not be independent.

!   Nested Loop Example 1.  Inner loop is sequential

      x = 10
      outer: do while (x < 1000)
         !hpf$ independent, new(j)
         middle: do i = 1, n
	   inner: do j = 1, m
             !hpf$ reduce
	     x = x + j
             !  Note that it would be incorrect
             !  to refer to x here, except in a reduction statement
	   enddo inner
           !  Note that it would be incorrect
           !  to refer to x here, except in a reduction statement
         enddo middle
         print *, x
      enddo outer

Since the variable x occurs in a reduction statement in the range of
the independent middle loop, it is a protected reduction variable
throughout that loop, including inside the inner loop.   The outermost
loop is not INDEPENDENT, and so x is not protected in that part of its
range not included in the middle loop.

This remains true if the inner loop is independent.   On the other
hand, a variable that occurs in a NEW clause may not be a reduction
variable in the range of a loop, although it may be in the range of a
contained loop:

!   Nested Loop Example 2.  Outer loop NEW clause.

      !hpf$ independent
      outer: do k = 1, 100
         !hpf$ independent, new (x)
         middle: do i = 1, n
            x = 10
            !hpf$ independent
            inner: do j = 1, m
              !hpf$ reduce
              x = x + j**2
              !  Note that it would be incorrect
              !  to refer to x here, except in a reduction statement
            enddo inner
            y(i) = x
         enddo middle
      enddo outer

Here, x is a protected reduction variable only in the inner loop.


The reduction statement may not occur in a subprogram called from the range of the
independent loop.   To get this effect, return the updating value from the
subprogram and perform the update with a reduction statement in the lexical loop
body.

	program example
        integer x, xinc, i
	
	x = 10
        !hpf$ independent, new(xinc)
	do i = 1, n
            call sub(i, xinc)
            !hpf$ reduce
	    x = x + xinc
	enddo

	subroutine sub(i, xinc)
	integer i, xinc, local_b
	  local_b = (i-1)**2
          xinc = i * local_b
	end subroutine sub

2.6  Reduction Operators May Not Be Mixed

In most cases, only one operator will be used in the reduction
statements (if there are more than one) that update a given reduction
variable.   It is sensible, however, to use different operators, such
as + and -, together on the same reduction variable: mathematically,
subtraction is just addition of the additive inverse.  The same is true
for multiplication (*) and division (/).  No other mixing of operators
is allowed, however.

3.  Implementation and Semantics  

The result of the Fortran 90 intrinsic function  SUM  is defined to
be a processor dependent approximation to the sum of the elements of
its argument array.  On exit from the outermost loop in which it is
protected, the value of a reduction variable is likewise not completely
specified by HPF.  One possible value is that which would have been
computed by sequential execution of the loop, but other,
processor-dependent approximations to this value may be produced.

While HPF does not explicitly define
the set of possible values, it does give a precise implementation
mechanism that serves to define this set.

Since no reference to a protected reduction variable can occur outside of
a reduction statement in the loop range, it is not necessary to define
the values that these variables may have inside the loop.

In this discussion, the term processor means a single physical
processor OR a group of physical processors that together sequentially
execute some or all of the iterations of an independent loop.

We first give a simple implementation mechanism that applies to
commutative reduction operations.  Each processor allocates a local
reduction variable and initializes it to the identity element for the
intrinsic reduction operator.  The local reduction variable has the
same type and kind type parameter
as the global reduction
variable.  Its shape is determined by the shape of the global reduction variable,
and is identical to it, except as noted below.

The identity elements for the intrinsic operators are as
follows:

	Operator	Identity Element
	-------		----------------
	+		0
	-		0
	*		1
	/		1
	.and.		.true.
	.or. 		.false.
	.eqv.		.true.
	.neqv.          .false.

	//		''

        Table 1.   Identity elements for intrinsic reduction operators.

The following intrinsics may be used as reduction-functions, with the indicated
identity elements:


	Operator	Identity Element
	-------		----------------
	IAND(I,J)	(all ones)
	IOR(I,J)        0
	IEOR(I,J)       0
	MIN(X,Y)	-HUGE(X) where x has the same type and kind type
			parameter as the reduction variable
	MAX(X,Y)	+HUGE(X) where x has the same type and kind type
                        parameter as the reduction variable
        MATMUL(A,B)     MATMUL is noncommutative; it is the only allowed
                        associative and noncommutative intrinsic reduction function.
			As a consequence of the form A = MATMUL(A, B)
			of the reduction operation, it must be
			true that SIZE(A,2) = SIZE(B,1) = SIZE(B,2).
			As a consequence of the alternate form A = MATMUL(B, A)
			of the reduction operation, it must also be
			true that SIZE(A,1) = SIZE(B,2) = SIZE(B,1).  Thus,
			the identity is the identity matrix of shape (/N, N/),
                        of the same type and kind type parameters as the reduction
                        variable; for the left identity, N = SIZE(A,1), for
			the right identity, N = SIZE(A,2).

        Table 2.   Identity elements for intrinsic reduction functions.

Let L be the local reduction variable, A the global reduction variable.
For operators other than MATMUL, the shape of L will be the same as the shape of A,
except that if all the expr's in reduction statements for A are scalar, then a
scalar L may be used.   In the case of MATMUL, L will have shape
determined by that of A, as described in Table 2.

As an example, consider:

        integer, external :: intfun
	double precision y(100)
	real x(100)

  	!hpf$ independent, new(y)
	do i = 1, n
           y = ...
           !hpf$ reduce
	   x = intfun(i) + x
           !hpf$ reduce
	   x = x - y
        enddo

The local reduction variable will be a real array of shape (/ 100 /).
Thus, the double precision  values (of L + y) will be rounded to single precision 
after every local addition.

Each processor performs a subset of the loop iterations;  when it
encounters a reduction statement, it updates its own copy of the
reduction variable.    A processor is free to perform its loop
iterations in any order; furthermore, it may start an iteration,
suspend work on it, do some or all of the work of other iterations, and
resume work on the suspended iteration.   However, any update of a
local reduction variable occurs through the execution of a reduction
statement, and reduction statements *are* executed atomically.

The final value of the reduction variable is computed by combining the
local values with the value of the global reduction variable on entry
to the loop, using the combining operator.  The ordering of this
reduction is language processor dependent, just as it is for the
intrinsic reduction functions (SUM, etc.)

Next, we give an implementation for the noncommutative operator MATMUL.

First, assume that a given processor executes a contiguous set of the
loop iterations -- i.e. there is a block mapping of loop iterations to
processors.  In this case, the processor allocates a local reduction
variable to accumulate all updates applied from the right  (as in X =
MATMUL(X, expr)) and initializes it to the right identity element.  it
allocates another local reduction variable to accumulate all updates
applied from the left  (as in X = MATMUL(expr, X)) and initializes it
to the value of the left identity element.  It may omit either of these
if there are no corresponding updates in the loop.

When all iterations of the loop have been completed, the local
reduction variables are combined to produce the final result.   The
combination is an evaluation of the expression

(1)   lrv(P) <op> lrv(P-1) <op> ... <op> lrv(1) 
                     <op> A <op> 
                            rrv(1) <op> ... <op> rrv(P)

where A is the initial value of the reduction variable, lrv(k), k=1,
..., P is the value of the local reduction variable that has
accumulated the updates from the left due to the kth group of loop
iterations (on processor k, in the case of a block mapping of
iterations to processors), and rrv(k), k = 1, ..., P is the value of
the local reduction variable that has accumulated the updates from the
right due to the kth group of loop iterations.  In making this
evaluation, the language processor is free to employ any associated
rearrangement of the given expression; i.e. it is not required to
evaluate the expression left to right, but may use any parenthesization
of it.   It may not interchange the order of the expressions, since
the operator is not commutative.

If the mapping of iterations to processors is not a block mapping, then
a left, a right, or both a left and a right reduction variable must be
created and used for every contiguous group of loop iterations, with
updates accumulated and combined as above; in expression (1), P is the
number of such contiguous groups of iterations.

4.   Examples.

The reduction variable may be a scalar array element:

      !hpf$ independent
      do i = 1, n
          !hpf$ reduce
          x(indx(i)) = x(indx(i)) + <expr>
      enddo

It may be to a section of an array:

      !hpf$ independent
      do i = 1, n
          !hpf$ reduce
          x(i: i+4) = x(i: i+4) + <expr>   ! as many as 5 updates to elements of x
      enddo

Note that the compiler, if it distributes iterations of this loop in a block-wise
manner, will not need to make a private copy of the entire array x.

5.   Alternative syntax proposed.

In the proposal above, the programmer explicitly identifies the reduction statements,
and only implicitly the loop in which each reduction variable is protected.
The alternative reverses these decisions.  Thus, we have:

      x = 10
      outer: do while (x < 1000)
         !hpf$ independent, new(j), reduction(x)
         middle: do i = 1, n
	   inner: do j = 1, m
	     x = x + j
             !  Note that it would be incorrect
             !  to refer to x here, except in a reduction statement
	   enddo inner
           !  Note that it would be incorrect
           !  to refer to x here, except in a reduction statement
         enddo middle
         print *, x
      enddo outer

Now, the loop in which a reduction variable is protected is labeled explicitly.
Each and every occurrence of X encountered while the loop is active must be
as a reduction variable in a reduction statement of the form specified in Section
2.2 above, that occurs in the lexical range of the loop.

This alternative syntax has the advantages of economy (only one
directive regardless of the number of reduction statements) and clarity
(the protected loop is explicitly indicated).

6.  Extension to user-defined operators.
This section is a separate and detachable part of the proposal.

6.1  COMMUTATIVE attribute for functions.

First, allow COMMUTATIVE as an addition attribute of functions:

        PURE, COMMUTATIVE FUNCTION FOOBAR(A,B)

The user asserts that the value of FOOBAR(A,B) is equal to the value
of FOOBAR(B,A) for all inputs A and B.

6.2  Initial value for local variables in reduction clause

The reduction clause is allowed to designate an initial value
for the local reduction variable:

!hpf$ independent, new(B), reduction (A = MY_OP_IDENT(SIZE(B,1)))
do i = 1, n
   B = ...
   A = MY_OP(A, B)
enddo

The reason to assign the identity to the reduction directive rather than
make it an attribute of the function is the difficulty of restricting
the kind of expression allowed in the later case.  In this proposal, any
expression whose value is conformable with the reduction variable is
permitted.  Note that the expression is evaluated ONCE on entry to the
loop nest, and its value is conceptually replicated, so that all
processors initialize their local reduction variables to this value.
Thus, the expression could use a non-pure function.  We might disallow
this, in order to permit concurrent, one-per-processor evaluations of
the identity expression.







---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Jan  3 04:27:11 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id EAA07281 for hpff-task-out; Wed, 3 Jan 1996 04:27:11 -0600 (CST)
Received: from fontainebleau.ensmp.fr (root@fontainebleau.ensmp.fr [192.54.148.100]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id EAA07274; Wed, 3 Jan 1996 04:26:29 -0600 (CST)
Received: from cri.ensmp.fr (chailly.ensmp.fr) by fontainebleau.ensmp.fr with SMTP id AA12765
  (5.67a8/IDA-1.5); Wed, 3 Jan 1996 11:26:01 +0100
Received: from provins.caii by cri.ensmp.fr (4.1/SMI-4.0)
	id AA19126; Wed, 3 Jan 96 11:26:00 +0100
Date: Wed, 3 Jan 96 11:26:00 +0100
Message-Id: <9601031026.AA19126@cri.ensmp.fr>
Received: by provins.caii (4.1/SMI-4.1)
	id AA04277; Wed, 3 Jan 96 11:25:57 +0100
From: Fabien COELHO <coelho@chailly.ensmp.fr>
To: chk@cs.rice.edu
Cc: hpff-task@cs.rice.edu
In-Reply-To: <v01530508ad0f1bd34ca9@[128.42.1.213]> (chk@cs.rice.edu)
Subject: hpff-task: Re: not attached "new" directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Chuck,

> >chpf$ new(x)
> >      forall(i=2:n)              ! NOT independent!
> >        x = a(i-1)+2*a(i)+a(i+1) ! refer to old A values.
> >        b(i) = f1(x)
> >        a(i) = f2(x)             ! new A defined here.
> >      end forall
> >
> >Here the loop is NOT independent [...]
> 
> The intent of FORALL was to provide a different syntax and more general
> shapes for array assignments.  This makes it orthogonal to reduction
> operators like the one in your first example.

My understanding of "orthogonal" is that two concepts may apply on
something with no special constraint or interaction. This is just my
point! the "reduction", "new" and possibly "on" concepts apply to parallel
loops, whether "independent" or not. I think they also apply on sections
of code, as I argued for the corner computations of my basic seismic
application in a previous mail. I'll argue on that in another mail, 
some day.

> Is it really a problem to call the SUM intrinsic outside the FORALL?

Just a matter of scanning the array once or twice, for the reduction
example. Just cache misses and control overheads issues. Just a
performance problem. (I guess this has to do with HPF:-).

The same remark holds for independent do loops with reductions. You may
call the SUM or PROD or whatever instrinsic outside of the loop instead.
But nevertheless the Forum is considering a "reduction" directive for
independent loops. Why shouldn't it apply also on foralls? Let's discuss
this point:

For the second example, you need to use explicitely a temporary
array do simulate the SIMD execution order, that is you have to declare it
and it may be big, while if the compiler manages it I guess it will be
stack-allocated and the space reused if needed for the next loop. And your
code will be cleaner. 

> You could write the second loop as an INDEPENDENT DO loop without trouble.
> Why is it vital that you use the keyword FORALL? 

I'm not promoting the FORALL. It is neither Fortran 77 nor Fortran 90, so
any code that do use a forall cannot be compiled and is not portable apart
on HPF compilers. I hope F95 (F96?) will have the same one? Well...

There are 2 "parallel" loops in HPF. The FORALL may or may not be
"independent", it is nevertheless parallel, with a SIMD execution
order. HPF allows "independent" to characterize these loops, but not "new"
(constraint line 10, page 81 of HPF 1.0). I cannot see why, apart the
history of the proposal, where forall is an extension of the array
assignment syntax. I'm arguing that *if possible*, "new" and "reduction"
and "on" or whatever directive should apply on foralls as well as on do
loops. 

I would summarize the FORALL as a loop for which you don't have to bother
about dependences on a same array because a temporary is managed somehow
by the compiler. If it is independent it should be tagged so because it
helps the compiler by relaxing some potential dependences. As a user I
cannot see obvious reasons to explain why I would be prevented from using
some directives for a loop and not for another, if it does make sense to
do so in my parallel application. If I can write

chpf$ independent, new(x)
      do i=2, n
         x = A(i-1)+2*A(i)+A(i+1)
         B(i) = f1(x)
         C(i) = f2(x) ! C array
       enddo

I should be allowed to write in HPF

chpf$ new(x) ! but NOT independent...
      forall(i=2:n)
         x = A(i-1)+2*A(i)+A(i+1) 
         B(i) = f1(x)
         A(i) = f2(x) ! A array written
      endforall

instead of the explicite temporary management:

      real TMP(n)
chpf$ align TMP with A
      TMP(2:n-1) = A(1:n-2) + 2*A(2:n-1) + A(3:n)
      B(2:n-1) = f1(TMP(2:n-1))
      A(2:n-1) = f2(TMP(2:n-1))


BUT:

A point against my case is that if the FORALL is allowed in some
sequential Fortran, such a loop would not be legal because the "new"
directive is needed for correctness, and I guess no such directive would be
provided to the user. What is actually needed for "new" is the ability to
declare variables local to the body. For reductions, this is a problem.

This issue is not related to HPF as a parallel language. It is related to
"sequential compatibility". It may be THE case against "new" and
"reduction" for foralls. But it is the only one I can see, and the
language loses on expressivity, in my opinion. 

> Many other cases of
> temporary scalars can be handled by declaring a local variable in a PURE
> function.  (This doesn't work here, obviously.)

You do not necessary want to pay for a function call within the loop nest.

> It seems that you want FORALL to be a general-purpose construct for "do
> this in parallel". 

Ya. With a "SIMD" execution order on each assignment. That's the way I
perceive it. 

I've seen codes where the forall is used as the only mean for parallelism,
that is all the computations where simply expressed with foralls, while
they were independent. They were not tagged as independent. Because the
compiler was not analyzing this directive "yet".  I think many people feel
the forall as more powerful and general than the do loop. Proof is that it
is an HPF extension, nothing to do with old Fortran! Joking:-)

> However, FORALL is something more limited - this was a
> concious decision to make the construct simpler. 

You might limit the content of body inside the loop. But I guess my
example make sense, as a point that HPF directives should apply on any
piece of code where it does mean something and it may help the compiler.
Something is lost in the language by disallowing orthogonality.

> The INDEPENDENT DO loop seems to be closer to what you want in any case.

Means I have to manage a temporary array by hand to do something else...
So I'll do it by hand:-)

Conclusion: I think I must give up on "new" and "reduction" directives on
foralls, but the language loses expressivity and orthogonality. If the
forall is not to be considered by some sequential fortran, it should be
allowed in HPF.

I guess it is worth the discussion, at least...

Fabien.

--
Fabien COELHO __ http://www.cri.ensmp.fr/~coelho __ coelho@cri.ensmp.fr
  CRI, ENSMP, 35, rue Saint Honor�, 77305 Fontainebleau Cedex, France
     phone: 33 1 64 69 {voice: 48 52, fax: 47 09, standard: 47 08}
       ________  All opinions expressed here are mine  ________
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Jan  3 05:35:50 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id FAA09174 for hpff-task-out; Wed, 3 Jan 1996 05:35:50 -0600 (CST)
Received: from fontainebleau.ensmp.fr (fontainebleau.ensmp.fr [192.54.148.100]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id FAA09169; Wed, 3 Jan 1996 05:35:40 -0600 (CST)
Received: from cri.ensmp.fr (chailly.ensmp.fr) by fontainebleau.ensmp.fr with SMTP id AA13429
  (5.67a8/IDA-1.5); Wed, 3 Jan 1996 12:34:43 +0100
Received: from provins.caii by cri.ensmp.fr (4.1/SMI-4.0)
	id AA20126; Wed, 3 Jan 96 12:34:43 +0100
Date: Wed, 3 Jan 96 12:34:43 +0100
Message-Id: <9601031134.AA20126@cri.ensmp.fr>
Received: by provins.caii (4.1/SMI-4.1)
	id AA04341; Wed, 3 Jan 96 12:34:42 +0100
From: Fabien COELHO <coelho@chailly.ensmp.fr>
To: schreiber@hpl.hp.com
Subject: hpff-task: HPF directives (reduce, on, new and so again)
Cc: chk@cs.rice.edu, hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Dear Rob,

> 	By the way, many directives are orthogonal. I cannot remember from
> 	the "ON" proposal whether something like
> 	
> 	chpf$ independent, new(y), reduction(z(:)), on home(a(i))
> 	      do i=...
> 	
> 	would make sense.
>
> ON is now not viewed as a clause of the independent directive. It can apply
> to any code block including, but not exclusively a loop body.

Sure; I would like to suggest a scope directive (in the spirit of what is
suggested for the on directive, and as a general replacement for it) to
allow some time things like: 

chpf$ reduction(x=0.), new(z,z2) [, on ...]
chpf$ scope
      z = 1./epsilon
      z2 = .5*z
      A(1) = z*(B(2)-B(1))
      x = x + A(1)
chpf$ independent
      do i=2, n-1
         A(i) = z2*(B(i+1)-B(i-1))
         x = x + A(i)
      enddo
      A(n) = z*(B(n)-B(n-1))
      x = x + A(n)
chpf$ end scope
      ! now x is available as SUM(A). 
      ! (x = f(B(1),B(2),B(n-1),B(n)), 
         the computation does not make sense here:-)

the reduction involves the loop and the corners which are computed.
this causes not much problem as far as compilation is concerned, in my
opinion. The benefit is to avoid the global synchronizations at 
x = x + A(1) and x = x + A(n), and every thing is made after the 
"end scope" as it would have been for a reduction in a do loop.

Even if this is not yet allowed, I think that the scope provider and
the directives should be distinct for allowing this in the future if
it may be desired... at least for HPF-later...

My opinion is that it could also be considered for HPF{2,kernel}...

> 	Moreover all these directives should commute and be optionnal?
> Reduction is not optional if independent is used and the loop contains
> reductions....

I just mean on a syntactic point of view, that all directives should be:

HPF-directive-list is HPF-directive [, HPF-directive-list ]

HPF-directive is independent-directive
              or new-directive
              or reduction-directive
              or on-directive
              or ...

Constraint 1: After an HPF-directive-list is a scope provider [or maybe
              another HPF-directive-list], that is:
              - a do loop
              - a forall loop
              - a "scope-directive"

Constraints 2: are disallowed:
              - independent for "scope"
              - new and reduction for "forall" (okay, okay:-)
              (that's all?!)


We could also think of "independent(i)" in a scope, making all loops as
parallel in this scope, relaxing one of constraints...

chpf$ independent(i), reduction(x), new(z), on ...
chpf$ scope ! all needed directives are declared before the code...
      ...
      do i = ... ! independent!
      ...
      do i = ... ! independent!
      ...
      do j = ... ! not necessarily ...!
      ...
chpf$ end scope
      ...
      do i = ... ! not necessarily independent!


I'll finalize this some day. The issues to be discussed:

 - sofware engineering (quality improved, I think)
 - ease of use (sure!)
 - economy (sure!)
 - clarity (I think so!)
 - compilability (I think so, quite easily!)
 - teachability (improved because it is in a general framework, it is
                 homegeneous, ...)

Well, part is syntactic sugar, but not all of it... 
Have a nice day, regards,

Fabien.

--
Fabien COELHO __ http://www.cri.ensmp.fr/~coelho __ coelho@cri.ensmp.fr
  CRI, ENSMP, 35, rue Saint Honor�, 77305 Fontainebleau Cedex, France
     phone: 33 1 64 69 {voice: 48 52, fax: 47 09, standard: 47 08}
       ________  All opinions expressed here are mine  ________
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Jan  3 11:10:48 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id LAA13931 for hpff-task-out; Wed, 3 Jan 1996 11:10:48 -0600 (CST)
Received: from [128.42.5.167] (pasyn-39.rice.edu [128.42.5.167]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id LAA13920; Wed, 3 Jan 1996 11:10:38 -0600 (CST)
Date: Wed, 3 Jan 1996 11:10:38 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530504ad100b60a035@[128.42.5.167]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Fabien COELHO <coelho@chailly.ensmp.fr>
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: not attached "new" directive
Cc: chk@cs.rice.edu, hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

At 11:26 01/03/96, Fabien COELHO wrote:
>>
>> The intent of FORALL was to provide a different syntax and more general
>> shapes for array assignments.  This makes it orthogonal to reduction
>> operators like the one in your first example.
>
>My understanding of "orthogonal" is that two concepts may apply on
>something with no special constraint or interaction. This is just my
>point! the "reduction", "new" and possibly "on" concepts apply to parallel
>loops, whether "independent" or not. I think they also apply on sections
>of code, as I argued for the corner computations of my basic seismic
>application in a previous mail. I'll argue on that in another mail,
>some day.

What I meant was "FORALL and reductions solve different problems".  While I
agree that two constructs that solve the same problem should have the same
restrictions, I do not agree that constructs doing different things must
have the same behavior.

You are arguing that FORALL should be (a? the?) general parallel construct.
I am arguing that the INDEPENDENT DO is a better candidate for this.  In
particular, making FORALL fully general would require major changes - for
starters, we would have to define the semantics of CALL and WRITE inside of
FORALL.  (We tried this in HPFF 1 meetings for CALL, and could not reach
concensus.)  INDEPENDENT requires fewer changes, essentially relaxing
constraints rather than defining new behavior.

If you want to pursue generalizing FORALL, my suggestion is to write up a
full proposal.  (Feel free to start with the text in the HPF language spec
if you want.)  That will at least ensure that we're talking about the same
thing.

>I'm not promoting the FORALL. It is neither Fortran 77 nor Fortran 90, so
>any code that do use a forall cannot be compiled and is not portable apart
>on HPF compilers. I hope F95 (F96?) will have the same one? Well...

Fortran 95 will include FORALL, defined as in HPF 1.0.

>There are 2 "parallel" loops in HPF. The FORALL may or may not be
>"independent", it is nevertheless parallel, with a SIMD execution
>order. HPF allows "independent" to characterize these loops, but not "new"
>(constraint line 10, page 81 of HPF 1.0). I cannot see why, apart the
>history of the proposal, where forall is an extension of the array
>assignment syntax. I'm arguing that *if possible*, "new" and "reduction"
>and "on" or whatever directive should apply on foralls as well as on do
>loops.

The problem is that NEW and REDUCTION and ON are directives.  They are not
first-class statements of the language.  Therefore, they cannot change the
meaning of the program by themselves.  They can change the efficiency of
the implementation (that's their job).  You make this point later in your
letter.

FORALL is a first-class statement, because its meaning is different from DO
(or any other construct.  Therefore, NEW cannot apply to FORALL, since the
FORALL would be incorrect without the NEW directive.

INDEPENDENT is an assertion of fact about the program.  The program will
produce the same answer whether INDEPENDENT is used or not, although how
quickly it produces the answer may be very different.  NEW modifies this
assertion (in particular, it says certain variables "don't count" in
determining dependences).

As a programmer, you need to understand what a program means.  If you can
explain what FORALL with a non-NEW scalar assignment means (and get the
rest of the world to agree with you), then we have a basis for adding NEW
to FORALL.  For example, what does this print *without NEW*?
        FORALL ( i = 1:n )
          x = a(i)
        END FORALL
        print x

>Conclusion: I think I must give up on "new" and "reduction" directives on
>foralls, but the language loses expressivity and orthogonality. If the
>forall is not to be considered by some sequential fortran, it should be
>allowed in HPF.
>
>I guess it is worth the discussion, at least...
>
>Fabien.

I don't necessarily agree that the language loses much expressivity.
However, I wish I could think of a way to make the difference between
FORALL and INDEPENDENT clearer.  HPF students (tutorials, etc.) always seem
to have trouble with this point, as do HPF implementors.

                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Jan  3 11:29:56 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id LAA14737 for hpff-task-out; Wed, 3 Jan 1996 11:29:56 -0600 (CST)
Received: from [128.42.5.167] (pasyn-39.rice.edu [128.42.5.167]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id LAA14726; Wed, 3 Jan 1996 11:29:45 -0600 (CST)
Date: Wed, 3 Jan 1996 11:29:45 -0600 (CST)
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530505ad1015920575@[128.42.5.167]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Fabien COELHO <coelho@chailly.ensmp.fr>
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: HPF directives (reduce, on, new and so again)
Cc: schreiber@hpl.hp.com, chk@cs.rice.edu, hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


>Sure; I would like to suggest a scope directive (in the spirit of what is
>suggested for the on directive, and as a general replacement for it) to
>allow some time things like:
>
>chpf$ reduction(x=0.), new(z,z2) [, on ...]
>chpf$ scope
>...
>chpf$ end scope

This idea seems appealing.  But I think it is better sent to the Fortran
2000 committee than to HPF.  Many things in F95 and HPF would be simpler if
we had a block-structured language.  In particular, we would not need NEW
if there were real local variables within a loop.

I'm a little curious - is SCOPE just a syntactic marker, or do you have
some further meaning for it?  For example, if there is no HPF-directive
before SCOPE, does it signify anything?

>>       Moreover all these directives should commute and be optionnal?
>> Reduction is not optional if independent is used and the loop contains
>> reductions....
>
>I just mean on a syntactic point of view, that all directives should be:
>
>HPF-directive-list is HPF-directive [, HPF-directive-list ]
>
>HPF-directive is independent-directive
>              or new-directive
>              or reduction-directive
>              or on-directive
>              or ...

I assume that "..." doesn't include DISTRIBUTE, REALIGN, and the other
mapping directives?  If it does, what does
        !HPF$ REALIGN x(BLOCK)
        !HPF$ SCOPE
          x = CSHIFT(x,1)
        !HPF$ END SCOPE
        y = x
mean (in particular, what is the distribution of x when it is assigned to y)?

>We could also think of "independent(i)" in a scope, making all loops as
>parallel in this scope, relaxing one of constraints...

I tend to think this is a bad idea, since it moves the assertion pretty far
from where it is used.  But it's reasonable to discuss.

>I'll finalize this some day. The issues to be discussed:
>
> - sofware engineering (quality improved, I think)
> - ease of use (sure!)
> - economy (sure!)
> - clarity (I think so!)
> - compilability (I think so, quite easily!)
> - teachability (improved because it is in a general framework, it is
>                 homegeneous, ...)
>
>Well, part is syntactic sugar, but not all of it...
>Have a nice day, regards,
>
>Fabien.

If you really want it to be discussed in HPF 2, you'd better get it written
up quickly.  The next HPFF 2 meeting is next week, and there aren't that
many more planned.

There is more time to send it off to Fortran 2000, which might be the
better choice anyway.  Why improve just HPF when you can help all of
Fortran?

                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Jan  3 12:31:09 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id MAA16598 for hpff-task-out; Wed, 3 Jan 1996 12:31:09 -0600 (CST)
Received: from fontainebleau.ensmp.fr (root@fontainebleau.ensmp.fr [192.54.148.100]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id MAA16584; Wed, 3 Jan 1996 12:30:55 -0600 (CST)
Received: from cri.ensmp.fr (chailly.ensmp.fr) by fontainebleau.ensmp.fr with SMTP id AA15630
  (5.67a8/IDA-1.5); Wed, 3 Jan 1996 19:30:15 +0100
Received: from provins.caii by cri.ensmp.fr (4.1/SMI-4.0)
	id AA26174; Wed, 3 Jan 96 19:30:14 +0100
Date: Wed, 3 Jan 96 19:30:14 +0100
Message-Id: <9601031830.AA26174@cri.ensmp.fr>
Received: by provins.caii (4.1/SMI-4.1)
	id AA04832; Wed, 3 Jan 96 19:30:13 +0100
From: Fabien COELHO <coelho@chailly.ensmp.fr>
To: chk@cs.rice.edu
Cc: hpff-task@cs.rice.edu
In-Reply-To: <v01530504ad100b60a035@[128.42.5.167]> (chk@cs.rice.edu)
Subject: hpff-task: Re: not attached "new" directive
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Dear Chuck,

> You are arguing that FORALL should be (a? the?) general parallel
> construct.

Not exactely. There is a bit of misunderstanding with the goal of my
previous mail and the opinion which is actually mine. I guess my poor
English and quite passionate arguments are no stranger to this...
Sorry about it.

> I am arguing that the INDEPENDENT DO is a better candidate for this. 
> [...]

Although this may not be clear from my mail, I definitely agree with you
on that point, and on the discussion that followed in your mail. 

I'm arguing more about the perception that people have/may have/should
have of the HPF language constructs, and their relations and
interactions. 

I don't parse FORALL with my prototype compiler, and I'm not planning to
add it, because I feel it as a particular case compared to the general
INDEPENDENT (do) loop. Also I want my test programs to be compiled by a
F77 compiler, for checking that my compilation is correct.

On the other hand the HPF compilers I had a look at did not parse/consider
the INDEPENDENT directive but only rely on the forall and array sections
for parallelism (I think this is not case with the new versions now).

If you look at examples in papers about HPF compilation, you may notice
that most of the time the examples are based on the FORALL syntax even if
the loop is INDEPENDENT, and the INDEPENDENT directive is not specified.

These facts show that the FORALL is perceived as an essential part
of the HPF language by some people (but *not* me).

If (other) people feel that FORALL is important, I was trying to argue
about giving them the same "services" that they can have with the
sympathetic INDEPENDENT do loops, namely "new" and "reduction", if
possible. 

The conclusion was that it is not possible, although I think (but you
don't necessarily think, fine) it would have been desirable (for
orthogonality) and that in some cases it it would have facilitate
expressivety (by not requiring from the programmer to manage explicitely
temporary arrays in their code while it could have been avoided with the
forall with extension, the compiler managing the temporary by itself).

> If you can
> explain what FORALL with a non-NEW scalar assignment means (and get the
> rest of the world to agree with you), then we have a basis for adding NEW
> to FORALL. 

I can't:-) And I don't want to!

> For example, what does this print *without NEW*? [...]

>         FORALL ( i = 1:n )
>           x = a(i)
>         END FORALL
>         print x

I think that the intent is that without new it should be "semantical"
error. As if you mofify a constant passed as an argument to a function for
instance. 

However, if I simply extend the "Scalarization of forall statement" 
(Section 4.1.4 page 63 to 65) very precisely defined in the HPF
specification to this case I could answer "if n<1 then initial x else
a(n)":-). This is definitely deterministic and useless:-) 

for 

FORALL(i=1:n)
  x = a(i)
  b(i) = x
END FORALL

it would mean "b(1:n)=a(n)". A little bit disappointing for the user:-)

However you may have noticed in my previous mail that I gave up on my
initial claim because of this very problem: whatever the meaning without a
"new" clause, it cannot be the same in parallel and in sequential...

> I don't necessarily agree that the language loses much expressivity.

Well, what meant is just about the "expressivity" point is that it
requires the user to use an explicite temporary array to do the 
equivalent with independent loop. This is not big an issue, and doesn't
change the fact that it is not possible to have simply the new directive
for the forall.

> However, I wish I could think of a way to make the difference between
> FORALL and INDEPENDENT clearer. HPF students (tutorials, etc.) always seem
> to have trouble with this point, as do HPF implementors.

Suggestion (50% joke): don't tell them about the FORALL!

For people that never used a SIMD machine, the SIMD execution order of the
forall is not intuitive, although they have no problem with array
sections...

Regards,

Fabien.
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Jan  4 03:34:50 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id DAA04829 for hpff-task-out; Thu, 4 Jan 1996 03:34:50 -0600 (CST)
Received: from fontainebleau.ensmp.fr (root@fontainebleau.ensmp.fr [192.54.148.100]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id DAA04824; Thu, 4 Jan 1996 03:34:40 -0600 (CST)
Received: from cri.ensmp.fr (chailly.ensmp.fr) by fontainebleau.ensmp.fr with SMTP id AA19828
  (5.67a8/IDA-1.5); Thu, 4 Jan 1996 10:31:38 +0100
Received: from provins.caii by cri.ensmp.fr (4.1/SMI-4.0)
	id AA00141; Thu, 4 Jan 96 10:31:38 +0100
Date: Thu, 4 Jan 96 10:31:38 +0100
Message-Id: <9601040931.AA00141@cri.ensmp.fr>
Received: by provins.caii (4.1/SMI-4.1)
	id AA05470; Thu, 4 Jan 96 10:31:29 +0100
From: Fabien COELHO <coelho@chailly.ensmp.fr>
To: chk@cs.rice.edu
Cc: schreiber@hpl.hp.com, hpff-task@cs.rice.edu
In-Reply-To: <v01530505ad1015920575@[128.42.5.167]> (chk@cs.rice.edu)
Subject: Re: hpff-task: Re: HPF directives (reduce, on, new and so again)
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Hi Chuck,

> >chpf$ reduction(x=0.), new(z,z2) [, on ...]
> >chpf$ scope
> >...
> >chpf$ end scope
> 
> [...]
> Many things in F95 and HPF would be simpler if we had a block-structured
> language.  In particular, we would not need NEW if there were real local
> variables within a loop.

Sure. This is no more than the {} of C. Allowing local declarations would
drop the new directive, or just make it obsolete...

What stuck me is that there are TWO directives to provide a scope which
are being independently discussed by the HPF Forum: one for ON and one for
TASK. I think this should be noticed and stopped before having them in the
language... While a more general and pretty approach might unify both (and
possibly more) constructs...

In the example above, I do not necessary which to specify a ON or a
task, but I want one for reduction and new. Could be new alone or
reduction alone. Just mean that there is a need for some "scoping" in HPF
which is not related to loops. I don't think it would be nice to have two
different constructs in the language and not to be allowed to write the
reduction example above... Thus the Forum should consider such an
extension just for the language to be nice/pretty. For the syntax to be
homogeneous accross different HPF extensions...

If some BLOCK/END BLOCK or any other block structure is proposed by
F-later, fine, it will just obsolete the corresponding HPF structure and
few corrections will be needed to the standard to add the Fortran
structure as another possible scope provider...
 
> I'm a little curious - is SCOPE just a syntactic marker, or do you have
> some further meaning for it?  For example, if there is no HPF-directive
> before SCOPE, does it signify anything?

Not in my mind. I was just thinking of putting some constraints on it so
that there is only one entry point (scope) and one exit point (end scope),
because such constraints would ease the compilation process, maybe.
Additional constraint might be added when directives are added. 
Say for instance no I/O...

> >HPF-directive-list is HPF-directive [, HPF-directive-list ]
> >
> >HPF-directive is independent-directive
> >              or new-directive
> >              or reduction-directive
> >              or on-directive
> >              or ...
> 
> I assume that "..." doesn't include DISTRIBUTE, REALIGN, and the other
> mapping directives? 

Not in my mind. I just thought about declaration and advices to the
compiler about the code. But it is worth duscussing the point. 
I don't think that static mapping directives would make sense. 
I don't think that the idea of "executable" directive should be changed to
have some kind of implicite meaning in this context. 

> >We could also think of "independent(i)" in a scope, making all loops as
> >parallel in this scope, relaxing one of constraints...
> 
> I tend to think this is a bad idea, since it moves the assertion pretty far
> from where it is used.  But it's reasonable to discuss.

Ya. The motivation here was just to relax one of the constraint between
directives and scope providers. 

> If you really want it to be discussed in HPF 2, you'd better get it written
> up quickly.  The next HPFF 2 meeting is next week, and there aren't that
> many more planned.

I don't think I'll have time for next week... 

There are a couple of things that I would like the Forum to think about
and at least to consider as a possible path for improving/cleaning/...
HPF and the kernel. But I won't be sent to the US for talking about this
to the forum:-) All I can do is to send mail describying and arguing about
these ideas, so that people chat about them, and if they find them
interesting then some more formal process for including them in the 
language could be launched. At the time these ideas are:

[mail of Dec 13 to hpff-distribute]

 - disallowing ambiguity for remappings and putting them into the kernel
   (if no "dynamic" ambiguity about the mapping is allowed, and since the
    mappings are quite simplified in the kernel, it is not that
    difficult to handle them. I can suggest a simple algorithm to manage
    them with "renamings" and copie. It is implemented in my prototype,
    so it must be quite easy:-) 

 - allowing several static descriptive mappings to be specified for an
   array in a subroutine. for the kernel.
   (=> (maybe) automatic cloning for the different cases by the compiler.
    better than inherit I think, because it gives some information
    to the compiler, and simple implementation)

[mail of Jan 2 to hpff-distribute, forwarded to hpff-task]

 - the reduction syntax attached to the loop instead of the statement.
   was included by Rob in the last proposal, so it's ok.
 
[mail of Jan 3 to hpff-task]

 - some scope/split/endscope directive as a scope provider and hook
   for some HPF-directives (independent/new/reduction/on/task/???).

[???]

 - independent(i,j) as syntactic sugar for "independent, new/ind..."
   before a loop nest.

That's all at the moment:-)

> There is more time to send it off to Fortran 2000, which might be the
> better choice anyway.  Why improve just HPF when you can help all of
> Fortran?

Well, because I don't like Fortran as a programming language, while HPF
compilation is my PhD subject, so I want to improve HPF features:-)
Seriously, I think the HPF Forum might be quicker to consider such ideas
if they  appear to be meaningful and desirable, while the Fortran
definition is much slower.

Have a nice day.

Fabien.

--
Fabien COELHO __ http://www.cri.ensmp.fr/~coelho __ coelho@cri.ensmp.fr
  CRI, ENSMP, 35, rue Saint Honor�, 77305 Fontainebleau Cedex, France
     phone: 33 1 64 69 {voice: 48 52, fax: 47 09, standard: 47 08}
       ________  All opinions expressed here are mine  ________
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Jan  4 13:13:14 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id NAA01569 for hpff-task-out; Thu, 4 Jan 1996 13:13:14 -0600 (CST)
Received: from hplms26.hpl.hp.com (hplms26.hpl.hp.com [15.255.168.31]) by cs.rice.edu (8.7.1/8.7.1) with ESMTP id NAA01563 for <hpff-task@cs.rice.edu>; Thu, 4 Jan 1996 13:13:11 -0600 (CST)
Received: from hplpp3.hpl.hp.com by hplms26.hpl.hp.com with ESMTP
	($Revision: 1.36.108.11 $/15.5+ECS 3.3+HPL1.1S) id AA235542783; Thu, 4 Jan 1996 11:13:04 -0800
Received: by hplpp3.hpl.hp.com
	(1.37.109.14/15.5+ECS 3.3+HPL1.1) id AA185562785; Thu, 4 Jan 1996 11:13:05 -0800
Date: Thu, 4 Jan 1996 11:13:05 -0800
From: Rob Schreiber <schreibr@hplpp3.hpl.hp.com>
Message-Id: <199601041913.AA185562785@hplpp3.hpl.hp.com>
To: hpff-task@cs.rice.edu
Subject: hpff-task: CCI for Mtg.
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


Group C CCI LIST AS OF JAN 4, 1996
Rob Schreiber


Please, please please read the high priority items -- we will try to 
slay them at the meeting.


CONTENTS

Item    About                       From    Date      Status
 36     Remapping in INDEPENDENT    Larry   10/7/95   Open   HIGH PRIORITY
 37     NEW status for locals       Larry   10/7/95   Open   HIGH PRIORITY
 38     returned array indices      Henry   ??        Open   HIGH PRIORITY
 AA     New for indep loop index    Henry   ??        Open   MEDIUM
 BB     PURE list concat function   Rob/Joel ??       Open   MEDIUM

 18     Defined assign in forall    Henry   10/23/95  Resolved - almost
 29     Extrinsic called in indep   Rob      8/03/95  Resolved


HIGH PRIORITY ITEMS:

=====================================

CCI #36
Independent and remapping
Group C
Larry Meadows
Submitted: 10/7/95
status: new

Question:
In Section 4.4 of the standard, one of the conditions on INDEPENDENT is
that realignment and redistribution cannot occur, since they may change
the processor storing a particular array element.

I would argue that the same reasoning applies to remapping of arguments
in subroutines called inside of INDEPENDENT DO loops. For example:

!hpf$ distribute a(block)
!hpf$ independent
        do i = 1,n
            call sub(a)
        enddo

        subroutine sub(a)
        real a(:)
!hpf$ distribute a(cyclic)
        ...
        return
        end


>From an implementation point of view, remapping of arguments is a
collective operation, just as is realignment or redistribution, so
is difficult to implement inside INDEPENDENT DO loops.

Couple of other points:

- Would be nice to have some examples in 4.4.1 where subroutines were
  called, legally and illegally.

(see cci #37 for other point)
Thanks.
lfm
=====================================
=====================================

CCI #37
Status of locals in procedures called inside Independent DO
Group C
Larry Meadows
Submitted 10/7/95
Status: new
Question:
- I seem to recall that we decided that local variables of subroutines
  called inside INDEPENDENT DO loops were automatically NEW (and I
  assume that this doesn't apply to SAVE or COMMON variables). Is this
  documented anywhere?
Thanks.
lfm

=====================================
=========================

CCI #38 -- Base of coordinates for functions returning array indices

Hello,

     A number of the procedures in the HPF_LOCAL_LIBRARY refer to their
arguments as containing or receiving "coordinates".  It's not clear whether the
values returned or required are supposed to be based on the corresponding lower
bounds of the array (or processor as the case may be) or whether they should be
one-based.

     For example,

           PROGRAM P
             INTERFACE
               EXTRINSIC(HPF_LOCAL) SUBROUTINE SUB(D)
                 INTEGER D(4:)
     !HPF$       PROCESSORS PROC(NUMBER_OF_PROCESSORS())
     !HPF$       DISTRIBUTE D(BLOCK) ONTO PROC
               END SUBROUTINE SUB
             END INTERFACE
             INTEGER A(5:10)

             CALL SUB(A)
           END PROGRAM P

           EXTRINSIC(HPF_LOCAL) SUBROUTINE SUB(D)
             USE HPF_LOCAL_LIBRARY
             INTEGER D(4:), L(1)

             CALL GLOBAL_TO_LOCAL(D, G_INDEX=(/5/), L_INDEX=L)
           END SUBROUTINE SUB

Assuming this program is run on one processor, what should be the value of L
after the call to GLOBAL_TO_LOCAL?  If G_INDEX and L_INDEX are indices that
are based on the actual lower bounds of A and D, respectively, then
G_INDEX=(/5/) would refer to the first element of A, and the corresponding
element of D would be D(4), and so L_INDEX should be (/4/).  If G_INDEX and
L_INDEX are based on one, then G_INDEX=(/5/) would refer to the fifth element
of A (i.e, A(9)), and the corresponding element of D would be D(8), which is
the fifth element of D, and so L_INDEX should be (/5/).

     This shows up in ABSTRACT_TO_PHYSICAL and PHYSICAL_TO_ABSTRACT (the INDEX
argument), LOCAL_TO_GLOBAL and GLOBAL_TO_LOCAL (the G_INDEX and L_INDEX
arguments) and (I think) the results of LOCAL_LINDEX and LOCAL_UINDEX.

     One thing to consider is that the results of GRADE_UP and GRADE_DOWN seem
to be values that can be used to index the array that's been sorted, but the
results of MAXLOC and MINLOC are one-based, so there's a precedent for each.
(Were the results of GRADE_UP and GRADE_DOWN intentionally designed to be
different from those of MAXLOC and MINLOC?)

     It should also be noted that the INDEX argument of ABSTRACT_TO_PHYSICAL
and PHYSICAL_TO_ABSTRACT gives indices in a processor arrangement, but there's
no direct way of finding out the bounds of such a processor arrangement -
HPF_DISTRIBUTION and GLOBAL_DISTRIBUTION will only give you the *shape*.

Thanks,

Henry
------------------

I am sorry to say we did not discuss CCI 38.  It's the one on
index bounds in hpf local library functions like global_to_local, etc?

My inclination, given that we have already got a difference of opinion
between grade_up/down and max/minloc is to choose the interpretation that seems
right, and make that the default (by a vote of 2 to 1).  Are these the only
intrinsic/library functions that return indices?

I think it makes sense to return usable indices, so that in your example,
I would vote for returning "4".

Remind me to discuss this at the next meeting, in January, please

Rob

--------------------
	Hi Rob,

	> I am sorry to say we did not discuss CCI 38.  It's the one on
	> index bounds in hpf local library functions like global_to_local, etc?
	>
	> My inclination, given that we have already got a difference of opinion
	> between grade_up/down and max/minloc is to choose the interpretation that seems
	> right, and make that the default (by a vote of 2 to 1).  Are these the only
	> intrinsic/library functions that return indices?
	>
	> I think it makes sense to return usable indices, so that in your example,
	> I would vote for returning "4".

	    One thing to consider, though, is that GRADE_UP and GRADE_DOWN, as defined
	now, can't be written in standard Fortran 90.  The reason is that assumed-shape
	arrays only assume shape - they can't assume lower bounds.  So if the interface
	for the specific function that performs GRADE_UP for default integers of rank
	one looked like this:

	      FUNCTION I_GRADE_UP(ARRAY, DIM)
	        INTEGER, INTENT(IN) :: ARRAY(:)
	        INTEGER, INTENT(IN), OPTIONAL :: DIM
	        INTEGER :: I_GRADE_UP(SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY)))
	        !
	        !  Code omitted.
	        !
	      END FUNCTION I_GRADE_UP

	the value of LBOUND(ARRAY, 1) would be 1, regardless of what the lower bound of
	the actual argument was.

	     Though it's more usable, it can also present implementation difficulties.
	That's why I wonder whether the current definitions of GRADE_UP and GRADE_DOWN
	are in error.

Hmmm.   They've been implemented, so its possible -- to bad you have to use
extralinguistic means.

Let's discuss this next time -- its on my agenda;

Rob

===================================

MEDIUM PRIORITY ITEMS 

===================================
CCI #AA  --  NEW clause for the loop index of an independent loop?

     One of my coworkers has the following question:

     May the index variable of a DO-loop appear in the NEW clause of an
INDEPENDENT directive on that DO?  For example,

         !HPF$ INDEPENDENT, NEW(I)
               DO I = 1, 10
               END DO

If so, what does it mean?  Does it simply mean that the programmer is
guaranteeing that the value of I is not used after the DO-loop?

Thanks,

Henry
To: hpff-interpret@cs.rice.edu
Subject: hpff-interpret: remapping of arguments inside independent
Sender: owner-hpff-interpret@cs.rice.edu
Status: RO


===================================
CCI #BB  --  Is this a PURE procedure?

Hi,

Joel and I have a question about functions that produce linked structures
as a result, like a list concatenation function.   Can they be made PURE?

In such a function you really have to copy the data out of the input
lists when computing concat(A,B).


	type node_t
	  real value
	  type (node_t), pointer :: next
	end type

        function concat(a,b)
        type (node_t), pointer :: a, b. concat, head, tail
	integer len

	nullify(concat) 
	if (list_length(a) + list_length(b) == 0) return
	allocate(concat)
        tail => concat
        len = 0

	head => a
	do while (associated(head))
	    if (len > 0) then
	      allocate(tail%next)
	      tail => tail%next
            endif
	    tail%value = head%value
            nullify(tail%next)
	    head => head%next
	    len = len + 1
	end do
	
	head => b
        do while (associated(head))
	    if (len > 0) then
              allocate(tail%next)
              tail => tail%next
            endif
            tail%value = head%value
            nullify(tail%next)
            head => head%next
	    len = len + 1
        end do
	return
	end function concat

Now, ignoring the issue of whether I wrote it correctly, is concat
PURE?  It may be -- except maybe due to  a restriction on the use of
allocate in pure procedures.   Chuck, Henry, do you remember what
action we took on that question, in response to a CCI item from Henry
about allocate causing a status variable to be set by allocate?   I
hope we disallowed only the status, not the allocate!   If we did, I
view the inability to create pure functions that build a linked
structure and return a pointer to it as a serious problem.

Rob

To: zongaro@VNET.IBM.COM
Subject: Re: pure list-oriented fns
Cc: schreibr@riacs.edu
Status: RO

	From zongaro@VNET.IBM.COM Thu Nov  9 16:31:57 1995
	Subject: Re: pure list-oriented fns
	To: schreibr@riacs.edu (Rob Schreiber)
	Cc: saltz@cs.umd.edu
	X-Mailer: ELM [version 2.4 PL24alpha3]
	Content-Type: text
	Content-Length: 822

	> Here is a concat function -- pure.
	>
	>         pure function concat(a,b)
	>         type(node), pointer :: a, b, concat
	>
	>         lenresult = listlen(a) + listlen(b)
	>         if (lenresult == 0) then
	>             nullify(concat)
	>             return
	>         else
	>           allocate(concat)
	>             if (listlen(a) == 0) then
	>                 concat%val = b%val
	>                 concat%next = concat(0, b%next)
	>             else
	>                 concat%val = a%val
	>                 concat%next => concat(a%next, b)
	>           endif
	>       endif
	>       end

	     Except that a subobject of a dummy argument of a pure function can't be
	associated with a dummy argument with the POINTER attribute.  So the calls to
	listlen and concat are invalid.  (Did you mention last week that that was still
	a problem?)


Curses; the designers of PURE function fiendishly contrived to thwart me!

I mentioned last week that I thought recursion was a solution; you have
now set me straight.  Lets put this on the agenda, too.  The F90 boys
should be interested in this one.

Rob

=================================


LOW PRIORITY ITEMS


=================================

CCI # 18
Defined assignment in FORALL
Group C
Submitted: 05/04/95
Updated: 10/23/95
Submitted by: Henry Zongaro
Status: in progress

Maybe Resolved.  CHK will provide clarification in HPF 2 document.    It will be a change to the "sequentialization" of the forall to account for defined  assignments.   BUT Henry suggests alternative based on new X3J3 action.

Original question:
     I was wondering whether there's not a problem with allowing defined assignment to appear within a FORALL.  Consider the following example.

     module mod
       integer :: a(3) = (/1,2,3/)
     contains
       pure subroutine def_assign(lhs, rhs)
         integer, intent(inout) :: lhs
         character, intent(in) :: rhs

         lhs = a(ichar(rhs)+1)
       end subroutine def_assign
     end module mod

     program p
       use mod
       interface assignment(=)
         module procedure def_assign
       end interface

     forall (i = 1:2) a(i) = char(i)  ! Sneaky way of passing 
                                      ! "i" to def_assign
    end program p

     The rules of forall specify that the right-hand side and the
indices of the left-hand side are evaluated, in any order, prior to
assignment, which also takes place in any order.  In the above example,
we have

          a(1) = char(1)
          a(2) = char(2)

as the two defined assignments which take place.  Inside of def_assign,
there's a host-associated reference to a, so what ends up happening is
the following:

          a(1) = a(2)
          a(2) = a(3)

     The order in which these assignments occurs affects the result.
The value of a after the forall statement is executed could be (/2,3,3/)
or (/3,3,3/).

     Basically, the problem is that in defined assignment, completely
evaluating the right-hand side for all active combinations does not
necessarily let the compiler precompute everything which might also
appear on the left-hand side.  Thanks, Henry

DISCUSSION AT MEETING:

y, Henry, Jerry to circulate proposed wording for this definition.


No action from the July meeting was recorded ... was this resolved or not?
-----
Rob's recollection:
Resolved.  CHK will provide clarification in HPF 2 document.   It will be a change to the "sequentialization" of the forall to account for defined assignments.

-----
new note from Henry

    X3J3 is taking a different approach on CCI 18.  They're trying to
prohibit references in the procedure that defines the assignment to the
variable that appears on the left-hand side of the defined assignment.
I believe WG5 will decide on this in their November meeting.

     We might want to pick up whatever they decide.  One advantage of
the restriction is that no extra compiler mechanisms are required for
this somewhat obscure case.

=====================================
CCI #29
Calling hpf_local from independent loop
Group C
Updated: 10/23/95
Rob Schreiber
Submitted: 8/3/95
Status: in progress
QUESTION:
I have two CCI questions.
Question I.  Can an extrinsic(hpf_local) be invoked in an independent loop?
In a Forall?
     Ex:

Forall (1 = 1:10) a(i) = f(i,a(i))

Note that part of the calling sequence, as specified in Ver 1.1, appendix A, is 
"The processors are synchronized.  In other words, all actions that logically precede the call are completed."

It seems clear that when this was written it was tacitly assumed that the call did not occur in an independent loop or forall.

Part 2:  May ony other kind of extrinsic be called in a forall or independent loop?
================
Discussion begins here:     Summary provided by Rob S. 

Status:   Under discussion.   Summary of the discussion in September:

1.  Only and HPF type routine can be PURE.  Thus, only and HPF_LOCAL
routine could be local, pure, and hence called in a forall.

2.  There is no semantic problem with this, or with invocation of any
extrinsic routine in an independent loop.

3.  An example
        real x(N, 100000)
        !hpf$ distribute x(*, block)

        !hpf$ independent
        do i = 1, N
           call extrinsic_fn(x(i,:))
        enddo
The independent loop is a mechanism for spawning N threads per processor, each independent of the others;  the load per thread may be variable.  It would
possibly be useful here to NOT synchronize after each call to the extrinsic
routine!   Is there any semantic reason to force the synchronization?

4.  There are some very important issues in the implementation, with possible
language impacts.

Let us assume an MPI based implementation of HPF calling an extrinsic
local routine that uses MPI for communication.  Because of the unity of
purpose and of hotel between HPF and MPI, it is arguably necessary for
HPFF to make this work cleanly and efficiently.

Issue MPI-1:   Does HPF handle MPI_Init and MPI_Finalize, automatically.

Issue MPI-2:   Are any of the MPI routines PURE?

Issue MPI-3:   Thread safety.   In a naive implementation, HPF does a barrier
before and after the call to the extrinsic;  but there is no guarantee that
there are no outstanding, nonreceived messages in the messaging system.  Thus, to be safe, any extrinsic routine should use its own communicator.   To prevent
interference between separate calls to the routine, a new communicator should 
be created for every call.   An obvious way to do this is to call
      MPI_Comm_Dup(MPI_COMM_WORLD, New_comm)
at the beginning of every such extrinsic.    However, an extrinsic
that consumes all its messages would be justified in doing this once, on its
first invocation, and saving the communicator for reuse on later invocations.
   But consider what happens if the extrinsic is called in an independent DO loop
as in the example above, and there is no barrier used.  Now we really need a
separate communicator per thread. On the other hand, a call to MPI_Comm_Dup is a collective call, which synchronizes the processes.
     Perhaps this should be done by the calling HPF routine, so that the
MPI_COMM_WORLD communicator is different on every call.

Issue MPI-4:  If called from the range of an ON_HOME directive, what
set of processors does MPI_COMM_WORLD correspond to?   If it corresponds to the subset executing the ON block, then how can the called routine access 
nonresident data?   Should there be a way to access a communicator that 
corresponds to these executing processors, while MPI_COMM_WORLD 
always corresponds to all of the processors?

Issue MPI-5:  If called from separate ON_HOME blocks in the scope of
a TASK directive, with disjoint processors groups, so that the two
ON blocks may be executed concurrently, what communicators correspond to the two processor groups?   (If, in issue 4 above, the answer is that MPI_COMM_WORLD corresponds to the executing subset of the processors, then the answer here is MPI_COMM_WORLD.)

=========== comments from Jim Cownie =================

> Issue MPI-1:  Does HPF handle MPI_Init and MPI_Finalize,
automatically.  I would say that the HPF run-time should have called
MPI_Init before any user code has run, therefore user extrinsic
functions which need MPI can just use it.

This is actually not a big deal, since the user routine can always do
use MPI_Initialized() to guard her call to MPI_Init.  (Though if this is
done, then the HPF run-time needs to do the same, since MPI_Init should
only be called once.)  That's why it's simpler to say that MPI_Init has
already been called before any user HPF code has run.

> Issue MPI-2:   Are any of the MPI routines PURE?
Probably. For instance one could cast reductions as functions which
return the result, and only read the arguments (though why you'd want
to use an MPI reduction extrinsically rather than an HPF one is beyond
me).
 ... after all 5 MPI issues ...


     I would suggest that in all of these cases the HPF run-time should
provide a "current communicator" which includes the set of processes
running the current construct.  In some cases this will be
MPI_COMM_WORLD (or a Comm_Dup of COMM_WORLD), in others (ON_HOME, task
parallelism, processor subsets) it will represent a subset of the
available processes.  In MPI MPI_COMM_WORLD is always available as the
set of all processes (until MPI-2 introduces a dynamic process model,
though that shouldn't worry HPF implementations).

     Therefore I think
1) MPI_COMM_WORLD is *always* the set of all processes. (This is the
   current MPI view).
2) If you need subsets then you should create new communicators, and
   provide a way for the user code to access them.

MPI_COMM_WORLD should mean the same thing in a routine called from HPF
extrinsic as it did in a "raw" MPI program.  The HPF extrinsic MPI
environment should contain additions to the raw MPI environment (new
communicators, maybe pre-defined datatypes giving array distributions,
etc), but should not change the meaning of things in the raw MPI world.
In other words you may need to learn more to work in the HPF extrinsic
environment, but you shouldn't have to unlearn things you already knew
about MPI.

-- Jim


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Mon Jan  8 16:14:37 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id QAA06650 for hpff-task-out; Mon, 8 Jan 1996 16:14:37 -0600 (CST)
Date: Mon, 8 Jan 1996 16:14:37 -0600 (CST)
Message-Id: <199601082214.QAA06650@cs.rice.edu>
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: ON clause proposal
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------




Apologies to HPFF attendees, who probably will not see this before they
arrive at the meeting.

Updates:
"ALL" option is now "EVERY"
"LOCAL" clause is now "RESIDENT"
Omitting the local-var-list in RESIDENT means "all references are local"
Added a "declarative" form of the ON clause.
Options for ON applied to a CALL statement -
        1. Propagate ON through the call, but the user
           must declare the matching ON in the callee.
        2. All CALLs execute on every processor, i.e.
           ON is ignored for calls
I highly favor the first option.

Hoping that, just this once, Eudora doesn't convert the attachment into
BinHex format...

                                                Chuck




The ON Clause

Charles Koelbel and Rob Schreiber

The purpose of the ON directive is to allow the programmer to control the
distribution of computations among the processors of a parallel machine.
In a sense, this is the dual of the DISTRIBUTE and ALIGN directives for
data.  Modern parallel machines will achieve their best performance if all
operations are performed by separate processors (as specified by ON) with
each processor accessing its local data (as specified by DISTRIBUTE).
If either explicit computation partitioning or explicit data mapping is not
present, the system must choose one.

A secondary purpose of the ON directive is to identify data accessed by
the computation as local to the executing processor.  The compiler can use
this information to avoid generating communication or to simplify array address
calculations.  Note that whether a given data element is local depends on
two facts:
        Where the data is stored (i.e. its DISTRIBUTE and ALIGN attributes)
        Where the computation is executed (i.e. its ON directive)
For these reasons, the LOCAL clause is added to the ON directive, which is
effectively the earliest point in the program where the needed facts might
be available.  Note that changing the ON directive may invalidate some
LOCAL clauses, or may make some new LOCAL clauses true.


Syntax

There are two flavors of the ON directive: a single-statement form, and a
multi-statement form.  The BNF for both is

        simple-on-directive     IS      ON HOME ( home-expr ) [ , resident-clause ]
        decl-on-directive IS ENVIRONMENT :: simple-on-directive
        
        resident-clause IS      RESIDENT [ ( local-var-list ] ) ]
        
        block-on-directive      IS      simple-on-directive BEGIN
        end-directive   IS      END [ BEGIN ]
        
        on-block        IS      
                block-on-directive
                block
                end-directive

        home-expr       IS      variable
                OR      template-elmt
                OR      processors-elmt
        
        template-elmt   IS      template-name [ ( section-subscript-list ) ]
        
        processors-elmt IS      processors-name [ ( section-subscript-list ) ]
        
        local-var       IS      variable
                OR      EVERY = variable-name

Notes:
Constraints need to be added above...
Also, changes are needed in some higher-level syntax (for example, adding
simple-on-directive as an option to executable-directive).

home-expr, template-elmt, and processors-elmt are just auxiliary syntax
categories.  Note that "variable" is an F90 syntax term that means
(roughly) "a reference, including an array element, array section, or
derived type field"; "variable" doesn't include template or processor
elements since those aren't first-class language constructs.  Note also
that "block" is an F90 syntax term for "a series of statements treated as
a group" - for example, the body of a DO construct.

simple-on-directive should be an option under executable-directive (HPF
syntax term), as redistribute-directive is now.  This means that
simple-on-directive can be used as an executable statement.

decl-on-directive should be an option under specification-directive, as
distribute-directive is now.  This means that decl-on-directive is used as
a declaration.  Any expressions in home-expr or resident-clause of
decl-on-directive must be specification expressions; this is not a
constraint on simple-on-directive.

on-block most naturally fits as a F90 executable-stmt, since that makes
constraints about "properly nesting" easier to state.

Any top-level variables in the RESIDENT clause must be explicitly
mapped.  Otherwise, the assertion (see below) is a statement about how the
compiler works, not about the program.


Rationale:
The keyword HOME is required, even when its argument is a processor.  It
seems more natural to eliminate the HOME keyword there, but this leads to
the following ambiguity:

        INTEGER X(4)    ! X(I) will be on processor I
        !HPF$ PROCESSORS HOME(4)
        !HPF$ DISTRIBUTE X(BLOCK)
        X = (/ 4,3,2,1 /)
        !HPF$ ON HOME(X(2))
        X(2) = X(1)

If HOME were not required, where should the computation be done:
1. Processor HOME(2) (i.e. the owner of X(2))?
2. Processor HOME(3) (i.e. the value of X(2), before the assignment)?
3. Processor HOME(4) (i.e. the value of X(2), after the assignment)?
The definition of ON clearly indicates interpretation 1 is correct.  One
can get the effect of interpretation 2 by the directive
        !HPF$ ON HOME(HOME(X(2)))
There is no way to get the effect of interpretation 3, short of building a
clairvoyant computer.  Introducing reserved keywords into Fortran was
suggested as a better solution to this problem, but was seen as too large a
change to the underlying language.

Separate syntax is needed for simple-on-directive and decl-on-directive
to avoid ambiguity.  If the same directive can be used as both both an
executable statement and a declaration, its role is unclear if it is the
last statement in the declarations/first statement in the executable
section.  This is particularly bad since the "scope" of the directive is
rather different in the two uses.  That said, I'd be grateful if somebody
could suggest a better declaration syntax.
End of rationale.


Semantics

The ON directive advises the compiler to perform a computation using the
processor(s) named in the argument to HOME.  It has two forms:
        * simple-on-directive, which applies to the next executable statement or
          a block of executable statements.
        * decl-on-directive, which is a declaration for an entire scope.
Like the mapping directives ALIGN and DISTRIBUTE, this is advice rather
than an absolute commandment; the compiler may override an ON directive if
necessary.  Also like ALIGN and DISTRIBUTE, the ON directive may affect the
efficiency of computation, but not the final results.

Advice to implementors:
If the compiler may override the user's advice in an ON directive, then
the compiler should also offer the user an option to force all directives
to be obeyed.
End advice to implementors.

The single-statement ON directive (i.e.  simple-on-directive) precedes the
statement for which it is advising behavior, and is said to apply to that
statement.  If the statement is a compound statement (e.g.  a DO loop or an
IF-THEN-ELSE construct), then the ON directive also applies to all nested
statements therein.  Similarly, the block ON directive (i.e.  on-block)
applies the initial ON clause to all statements up to the matching END
directive.  The declarative form of the ON directive (i.e.
decl-on-directive) applies to all statements and declarations in a scope;
its intended use is for functions and subroutines that may be called on a
subset of the processors.

The HOME clause can name a program variable, a template, or a processors
arrangement.  For each of these possibilities, it can further specify a
single element or multiple elements.  This is translated into the
processor executing the computation as follows:
*       If the HOME clause names a program variable (that is, an array element
        or section), then every processor owning any part of that variable
        should execute the computation.  For example, if a is a distributed array
                !HPF$ ON HOME ( a(2:4) )
        tells the compiler to perform the statement on the processors owning
        a(2), a(3), and a(4).  If a were distributed BLOCK, this might be one
        processor; if a were distributed CYCLIC, it would be three processors.
*       If the HOME clause names a template element or section, then every
        processor owning any element of the template should execute the computation.
        The example above applies here as well, if a is a template rather than an
        array.
*       If the HOME clause names a processors arrangement, then the processor(s)
        referenced there should execute the computation.  For example, if p
        is a processors arrangement,
                !HPF$ ON HOME ( p(2:4) )
        will execute the statement on the three processors p(2), p(3), and p(4).
For the executable forms of the directive, the value(s) of the HOME clause
are determined as if the HOME argument was evaluated when control flow
reached the ON directive.  For the declarative form, the values are
determined on entry to the scope.  That is, if the value of a variable used
in the HOME clause changes within the on-block, work is not migrated to a
new processor.  If evaluation of the argument of the HOME clause would
change the value of any variable in the program, then the value of that
variable becomes undefined after the ON clause is reached.  Finally, note
that the HOME clause only specifies how computation is partitioned among
processors; it does not indicate processors that may be involved in data
transfer.

Rationale:
This is the heart of the ON clause.

The "as if" wording and making side effects undefined avoids the following
problem: ON is a directive.  Therefore, it cannot have side effects.  But
implementing the ON clause may require evaluating some functions, therefore
causing the side effects.
End of rationale.

Advice to implementors:
If the HPF program is compiled into Single-Program-Multiple-Data (SPMD)
code, then the ON clause can always be implemented (albeit inefficiently)
by having all processors compare their processor id to an id (or list of
ids) generated from the HOME clause.  (Similar naive implementations can be
constructed in other paradigms as well.) If the ON clause will be executed
repeatedly, for example in a DO loop, it is probably worthwhile to invert
this process.  That is, instead of all processors executing all the HOME
clause tests, the compiler should determine the range of loop iterations
that will give a true test.  (See the "Advice to implementors" in the
Examples section below for more details.) For example, consider the
following complex case:
        DO I = 1, N
                !HPF$ ON HOME( A(MY_FCN(I)) ) BEGIN
                ...
                !HPF$ END ON
        END DO
Here, the generated code can perform an "inspector" (i.e.  a skeleton loop
that only evaluates the HOME clause of each iteration) to produce a list of
iterations assigned to each processor.  This list can be produced in
parallel, since MY_FCN must be side-effect free; however, distributing it
to all processors may require unstructured communications patterns,
possibly negating the advantage of parallelism.  In general, more advanced
compilers will be able to efficiently invert more complex HOME clauses.  It
is recommended that the abilities (and limitations) of a particular
compiler be documented clearly for users.

Note that processors "screened out" by the naive implementation may still
be required to participate in data transfer.  (The RESIDENT clause can
eliminate this, but it is optional.)  If the underlying architecture
allows one-sided communication, this is not a problem.  On traditional
message-passing machines, a request-reply protocol may be used.  This
requires the inactive processors to enter a wait loop until the ON block
completes, or requires the runtime system to handle requests
asynchronously.  Again, it is recommended that the documentation tell
programmers which cases are likely to be efficient and which inefficient
on a particular system.
End advice to implementors.

Advice to programmers:
The argument to HOME in an ON clause can be arbitrarily complex.  This is a
two-edged sword; it can express very complicated computation partitioning,
but the implementation of these partitions may not be efficient.  More
concretely, it may express a perfectly load-balanced computation, but force
the compiler to serialize the computation of the HOME clauses.  Although
the amount of overhead for an ON clause will vary based on the HPF code,
the compiler, and the hardware, one can expect that compilers will generate
very good code based solely on array mappings or a named processor
arrangement, and progressively worse code as the complexity of the HOME
clause increases.  A rough measure of the complexity of an ON directive is
the amount of run-time data used to compute it; for example, a constant
offset is fairly simple, while a permutation array is very complex.  See
the Examples section below for more concrete examples of this phenomenon.

It is worth noting that the ON clause alone does not address data
movement.  (The RESIDENT clause will do this.)  Therefore, on some
machines additional processors will have to enter the ON block to take
part in communication.
End of advice to programmers.

It is legal to nest ON directives, provided that the processors named by
the inner ON directive are also named by the outer directive.  The syntax
of on-block automatically ensures that it is properly nested inside other
compound statements, and that compound statements properly nest inside of
it.  As with other Fortran 90 compound statements, transfer of control to
the interior of an on-block from outside the block is prohibited, while
transfers within a block may occur.  However, HPF also prohibits transfers
of control from the interior of a on-block to outside the on-block.  Note
that this is stricter than Fortran 90.  If ON clauses are nested, then the
innermost HOME clause effectively controls execution of the statement(s).
A programmer can think of this as successively restricting the set of
processors at each level of nesting; clearly, the last restriction must be
the strongest.  Alternately, the programmer can think of this as a
fork-join approach to nested parallelism.

Rationale:
The restrictions about control flow into and out of an ON block essentially
make it a single-entry single-exit region, thus simplifying the semantics
considerably.
End Rationale.



OPTIONS

Note to language designers:
Here follow two possible semantics for ON applied to procedure calls.  The
first is close in spirit to the original proposal - roughly speaking, ON
applies to statements inside the called procedure.  (I've attempted to
avoid some compilation problems by requiring ON clauses in the callee.
More stringent restrictions, like only allowing PURE functions, could avoid
one-way communication, maybe.)  The second is my take on the
"everybody participates in CALLs" proposal at the September meeting.  (My
memory is a bit vague, so I apologize if it is not what was meant.)  I
believe these are mutually exclusive.  I guess that a third proposal would
be to disallow CALL inside an ON directive, but this makes ON
much less useful.
End of note to language designers.



OPTION 1: ON IS PROPOGATED THROUGH CALL STATEMENTS.

It is legal for an ON directive to apply to a CALL statement or a statement
containing a function invocation.  In this case, the treatment is similar
to that for a compound statement.  The ON directive tells the compiler
which processors should execute the statements in the procedure body.  Any
processor named by an ON directive inside the subroutine must be included
by the ON directive in the caller.  If ON clauses appear in a subroutine
call chain, then the deepest ON clause (i.e.  the one in the last called
routine that is still active) effectively controls the computation.

In addition, any procedure whose caller is controlled by an ON directive
must include a valid declarative ON directive which applies to the entire
called procedure.  If the procedure has an explicit interface, this
ON declaration which must be included in this interface.  (I.e.  if the
interface is though an INTERFACE block, then the INTERFACE specification
must include the ENVIRONMENT :: ON clause.) If the procedure uses an
implicit interface, then the compiler may assume that the directive is in
fact present.

Note that EXTRINSIC procedures always have an explicit interface, and thus
will require an ON declaration if they are called from an ON block.  In
this case, the effect of the call is still as if the procedure were called
on all processors.

If the procedure uses alternate return, then the target of the return must
be in the same ON block as the CALL statement.  This implies that labels
passed as arguments must refer to statements in the same ON block as the
CALL statement.

Rationale:
It is natural to treat a procedure call within an ON clause as if it were
an in-lined program block.  This leads to "propagating" the ON directive
into the procedure.  Making the ON directive explicit in the callee allows
separate compilation (i.e.  the compiler need not perform interprocedural
analysis in order to select the calling convention).

The constraint on alternate return is similar to the prohibition against
jumping out of an ON block, and has the same justification.
End of Rationale.

Note to language designers:
I just had an epiphany:
        * The declarative ON constrains the declarations to only map to the ON
          processor set.
        * Global data must be declared.
        * Therefore, only global data stored on the ON processor set can be
          accessed.
        * Therefore, we don't need one-way communication for global data inside
          CALLS from ON clauses.
        * Similar reasoning applies to dummy arguments.
Somebody check this.  If this really finesses the one-way communications
problem, then we need to point it out!  (It's also a pretty significant
restriction on HPF expressibility - can't get to global data sometimes.)
        
Is this what we want for EXTRINSIC procedures?  On one hand, ON and
EXTRINSIC ought to be orthogonal.  On the other, it seems more natural that
an EXTRINSIC called within an ON would synchronize only the processors
from the ON, not all the processors.  But that would mean that ON would
change the semantics of EXTRINSIC (maybe...)

We could disallow alternate returns from procedures called from ON
clauses.  I'd sort of prefer that, but in the spirit of backward
compatibility I've just specified the minimum (I think) constraints.
End of note to language designers.

Advice to implementors:
Notice that nested ON clauses do not present additional problems for the
naive implementation.  If a processor failed the outer HOME comparison,
then it would fail any tests inside the ON as well.  Thus, it need not even
make the tests.

The difficulties of implementing one-way communication on message-passing
machines remain, however.
End of advice to implementors.

Advice to users:
A CALL statement from within an ON block effectively executes the
subroutine on a subset of processors.  In conjunction with other
declarations, this allows a fork-join style of parallelism.

Note that the earlier advice regarding data movement still applies.
Message-passing machines may need to perform the call statement on many
(even all) processors in order to properly exchange data between
processors.
End of advice to users.




OPTION 2: ALL SUBROUTINES PARTICIPATE IN ALL CALL STATEMENTS


It is legal for an ON directive to apply to a CALL statement or a statement
containing a function invocation.  In this case, the semantics are
somewhat different from an equivalent ON block.  The procedure invocation
is performed on all processors, although ON clauses in the called procedure
may again restrict execution to a subset of processors.

If the procedure uses alternate return, then the target of the return must
be in the same ON block as the CALL statement.  This implies that labels
passed as arguments must refer to statements in the same ON block as the
CALL statement.

Rationale:
Procedures may declare local variables and (re)declare globals, both of
which may be mapped onto all processors.  Therefore, the CALL must be
executed on all processors to properly handle memory allocation.  Note
that this is the same semantics as EXTRINSIC subroutines, which are also
conceptually invoked on all processors.

The constraint on alternate return is similar to the prohibition against
jumping out of an ON block, and has the same justification.
End of Rationale.

Note to language designers:
We could disallow alternate returns from procedures called from ON
clauses.  I'd sort of prefer that, but in the spirit of backward
compatibility I've just specified the minimum (I think) constraints.
End of note to language designers.

Advice to implementors:
This simplifies separate compilation of subroutines.  The compiler can
assume that the procedure is always invoked on all processors, and
therefore be assured that global operations (e.g. barriers) will not deadlock
due to inactive processors.  The CALL statement can be implemented by a
new request type in the request-reply loop.
End of advice to implementors.

END OF OPTIONS



Operations controlled by an ON clause must follow certain restrictions:

* If an ON directive applies to a DISTRIBUTE directive, then the set of
processors named in the HOME clause must include
        - All processors named in the DISTRIBUTE's ONTO clause if one is present
        - All processors in the default processors arrangement, if there is no
          ONTO clause
* If an ON directive applies to a REDISTRIBUTE directive, then the set of
processors named in the HOME clause must include
        - All processors that stored any element of the distributee before
          the REDISTRIBUTE was encountered
        - The processors required by an equivalent DISTRIBUTE directive
* If an ON directive applies to a REALIGN directive, then the set of
processors named in the HOME clause must include
        - All processors that store any element of the align-with-clause
* If an ON directive applies to a REALIGN directive, then the set of
processors named in the HOME clause must include
        - All processors that stored any element of the alignee before
          the REALIGN was encountered
        - The processors required by an equivalent ALIGN clause
* If an ON directive applies to an ALLOCATE statement which creates an
explicitly mapped variable, then the set of processors named in the HOME clause
must include the processors required by the associated DISTRIBUTE or ALIGN
directive.
* If an ON directive applies to any other declaration of an explicitly
mapped variable, then the set of processors named in the HOME clause
must include the processors required by the associated DISTRIBUTE or ALIGN
directive.

If operations within an ON block do not follow these constraints, then the
program is not HPF-conforming.

Note to language designers:
An alternative for the DISTRIBUTE would be that an missing ONTO clause
would mean "distribute on all processors in the nearest enclosing ON
directive" rather than "distribute on all processors".  In some sense, this
makes the nested fork-joins more modular (declarations inside a forked
region only need to know about the region they are part of).  It is
especially appropriate now that we have the declarative ON directive.  I
have not investigated the difficulty of writing things this way, but it
probably requires fine-tooth combing through the data mapping chapter and
inquiry intrinsics.
End of note to language designers.

Rationale:
Operations which allocate memory require the cooperation of all processors
that will own that memory.  Therefore, ON clauses must schedule those
operations to execute on the cooperating processors.
End of rationale.

Advice to implementors:
These restrictions ensure that HPF data distribution directives inside
an ON block (either lexically or in a dynamic call chain) can be
implemented without relying on one-way communication outside of the
"current" processing group.
End of advice to implementors.

The RESIDENT clause, if present, is an assertion to the compiler that
certain array references made within the ON are stored in local memory if
the computation is performed by the processor(s) named in the HOME clause.
Specifically, if an element in the local-var list is a variable, array, or
array section, then that element will be stored on at least one of the
processors from the HOME clause.  If the local-var is of the form
EVERY=variable-name, then all references to that variable that the ON
applies to will be stored on at least one of the named processors.  If
there is no local-var-list, then all references to all variables that the
ON clause applies to will be stored on at least one of the named
processors.  As with the HOME clause, the value(s) of the RESIDENT clause
are determined as if the local-var-list was evaluated when control flow
reached the ON directive, and possible side effects in the RESIDENT clause
cause the affected variables to become undefined.

Rationale:
The EVERY=variable-name form is syntactic sugar for listing every reference
to an array in the ON block.  In fact, it is more powerful than such a list
because the values of subexpressions within a reference need not be defined
when EVERY is encountered.  This is similar in spirit to APR's grouping
IGNORE directives by variable.  (Note that EVERY(variable-name) doesn't
work, since then a program with a variable named EVERY would be ambiguous.)

RESIDENT without any local-var-list is syntactic sugar for listing EVERY
for all variables referenced within an ON block.  It might well have been
named the ALL_RESIDENT clause; the present form, however, does not add yet
another keyword to the directive sublanguage.

The pseudo-evaluation of the RESIDENT clause is the same as the HOME clause,
for the same reason.  Treating the RESIDENT clause as naming sets of
variables, rather than as a list of lexical references, gives a reasonably
firm  logical basis to what the assertion means.  It is unclear whether
compilers would rather see this "symbolic" information or longer lists that
need to be pattern-matched.  It is also unclear which approach would
appeal more to programmers.
End Rationale.


OPTIONS

Note to language designers:
Here are two options for RESIDENT when applied to CALL.  They mirror the
above propagate/don't propagate options.
End of note to language designers.

OPTION 1: RESIDENT IS PROPAGATED TO CALLEES

If a RESIDENT clause applies to a CALL statement, then it is an assertion
about the internals of the called procedure as well as an assertion about
code in the caller.  For example,
        * RESIDENT( A(I) ) means that for the current value of I, if the called
          procedure accesses element A(I) then that element is stored on a
          processor named by this ON directive.
        * RESIDENT( EVERY=B ) means that all elements of B that are accessed by
          the callee are stored on a processor named by this ON directive.
        * RESIDENT with no local variable list means that all references made by
          the callee are stored on a processor named by this ON directive.

Rationale:
This provides maximum information to the compiler.  It is also in line with
the propagation of ON directives to the callee.
End of rationale.

Advice to implementors:
RESIDENT without a variable list guarantees that no one-sided
communication outside of the ON processor set will be generated by the
callee.  Such a procedure can be called only on the "active" processors,
unless the runtime system has additional constraints (for example, if the
runtime system requires all processors to participate in collective
communications).

The other forms of RESIDENT provide information that could be propagated
interprocedurally.  If the information is not propagated, the only result
will be less optimization.
End of advice to implementors.

Advice to programmers:
Although the RESIDENT assertion applies interprocedurally, it is by no
means certain that all compilers will make use of this information.  In
particular, separate compilation limits the propagation that can take
place.  It is therefore good practice to include a RESIDENT clause
both in the caller's ON directive and in the callee's ON declaration.
This ensures that the compiler has the RESIDENT information available when
it is compiling both ends of the procedure call.  This is especially
useful for RESIDENT clauses without a variable list; knowing that all data
accessed is local allows many optimizations that are not otherwise
possible.
End of advice to programmers.

OPTION 2: RESIDENT IS NOT PROPAGATED TO CALLEES

A RESIDENT clause that applies to a CALL statement does not assert
anything about the internals of the called procedure.  In particular,
        * RESIDENT( EVERY=A ) asserts that all references to A in the caller
          (including actual arguments to a subroutine) are stored on the ON
          processor set, but does not guarantee that the subroutine makes no
          nonlocal accesses.
        * RESIDENT without a local variable list ensures that the actual
          arguments to the CALL are all stored locally, but allows the subroutine
          to access global data from other processors.
Other RESIDENT assertions (in particular, asserting that particular
elements or subsections of an array are local) are true in both the caller
and the callee.  (Note that these are essentially statements of fact about
the array section bounds, so they must be universally true.)  However,
compilers are not required to propagate this information to the callee.

Rationale:
Propagated EVERY clauses are susceptible to becoming invalid if a procedure
is modified.  In particular, adding a reference in a subroutine could
invalidate EVERY assertions in all of its callers.
End of rationale.


END OF OPTIONS

Note that if the HOME clause specifies more than one processor, then RESIDENT
only asserts that the variables are stored on one of the processors.  For
example, if a statement is executed on a section of the processors
arrangement, then communication within that section may be needed for some
variables in the RESIDENT clause.  Communication with processors outside of
the section will not be needed for those variables, however.

Rationale:
The alternative to this interpretation would be that any variable named in
the RESIDENT clause would be local to all processors, i.e. replicated.
While that certainly allows more extensive optimizations, it is a less
common case.  In addition, it does not seem to capture the intent of ON
directives applied to CALL statements or compound statements.  For example,
        !HPF$ PROCESSORS PROCS(MP,MP)
        !HPF$ DISTRIBUTE X(BLOCK,BLOCK) ONTO PROCS
        !HPF$ ON HOME(PROCS(1,1:MP)), RESIDENT( X(K,1:N) )
        CALL FOO( X(K,1:N) )
would presumably call FOO on a row of the processors arrangement, passing
elements of X in place.  This is what the current definition does; if
RESIDENT meant "resident on every processor", the call would force X to be
replicated.
End Rationale.

The RESIDENT assertion is similar to the INDEPENDENT directive, in that if it
is correct it does not change the meaning of the program.  If the RESIDENT
clause in incorrect, the program is not standard-conforming (and is thus
undefined).  Like the INDEPENDENT directive, the compiler may use the
information in the RESIDENT clause, or ignore it if it is insufficient for the
compiler's purposes.  If the compiler can detect that the RESIDENT clause is
incorrect (i.e.  that a RESIDENT variable is definitely nonlocal), it is
justified in producing a warning.  Unlike the INDEPENDENT directive,
however, the truth of the RESIDENT clause depends on the mapping of
computations (specified with the ON clause) and the mapping of data
(specified with DISTRIBUTE and ALIGN clauses); if the compiler overrides
either of these, then it may not be able to use information in the RESIDENT
directive.

Rationale:
Knowing that a reference is local is valuable information for the
optimizer.  It is in keeping with the spirit of HPF to phrase this as an
assertion of fact, which the compiler can use as it pleases.  Expressing
it as advice to the compiler seems to have disadvantages.  Some possible ways
this advice could be phrased, and the counter-arguments, are
*       "Don't generate communication for this reference" has great potential
        for changing the meaning of the program.  Some programmers want this
        capability, but it violates the "correct directives should not change the
        meaning of a program" principle of HPF.  Also, once communication is
        "turned off" for a reference, it's not clear how to turn it back on.
*       "Generate communication for this reference" is not a useful directive,
        since the compiler has to do this anyway.
*       "Generate communication for this reference, and place it here" is useful,
        since it can override the default placement by the compiler.  It still
        has potential for changing program meaning.  It also has the potential to
        create programs as complex as message-passing, as programmers try to move
        communication out of loops.
End of rationale.


Examples

The following are valid examples of ON directives.  Most of them are
"reasonable" in the sense that they illustrate idioms that programmers
might want to use, rather than contrived situations.  For simplicity, the
first several examples assume the following array declarations:

        REAL A(N), B(N), C(N), D(N)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)

One of the most commonly requested capabilities for HPF as to control how
loop iterations were assigned to processors.  (Historically, the ON clause
first appeared to perform exactly this role in the Kali FORALL construct.)
This can be done by the ON directive, as shown in the following examples:

        !HPF$ INDEPENDENT
        DO I = 2, N-1
          !HPF$ ON HOME(A(I))
          A(I) = (B(I) + B(I-1) + B(I+1))/3
        END DO
        
        !HPF$ INDEPENDENT
        DO J = 2, N-1
          !HPF$ ON HOME(A(J+1)) BEGIN
            A(J) = B(J+1) + C(J+1) + D(J+1)
          !HPF$ END ON
        END DO

The ON directive in the I loop advises the compiler to have each processor
run over its local section of the A array (and therefore B as well).  The
references to B(I-1) and B(I+1) must be fetched from off-processor for the
first and last iterations on each processor (except for the boundary
processors); note that those processors are not mentioned in the HOME
clause.  The ON directive in the J loop advises the compiler to "shift"
computations so that each processor does a vector sum of its local sections
of B, C, and D, stores the first element of the result on the processor to
its left, and stores the rest of the result (shifted by one) in A.  It is
worth noting that the directives would still be valid (and minimize
nonlocal data accesses) if the arrays were distributed CYCLIC, although the
number of nonlocal references would be much higher.

Advice to implementors:
It is highly recommended that compilers concentrate on optimizing DO loops
with a single ON clause including the entire loop body.  Schematically,
the code will be:

        DO i = lb, ub, stride
                !HPF$ ON HOME(array(f(i))) BEGIN
                body
                !HPF$ END ON
        END DO

Where array has some data mapping.  Assume the mapping give processor p
the elements my_set(p).  (In a BLOCK distribution, for example, my_set(p)
is a contiguous range of integers.)  Then the generated code on processor
p should be

        DO i in [lb:ub:stride] intersect f^-1(my_set(p))
                body
        END DO

(This schematic does not show where communication or synchronization must
be placed; that must be derived from analysis of the body.) Moreover, f is
most likely to be the identity function or a linear function with integer
coefficients, both of which can be inverted easily.  Given this, techniques
for iterating through the set can be found in several recent conferences.
End of advice to implementors.

Advice to users:
One can expect the I loop above to generate efficient code for the
computation partitioning.  In effect, the compiler will arrange for each
processor to iterate over its own section of array A.  The J loop is
slightly more complex, since the compiler must find the inverse of the HOME
clause's subscripting function.  That is, the compiler must solve K=J+1 for
J, where K ranges over the local elements of A.  Of course, in this case
J=K-1; in general, linear functions can be inverted by the compiler.  (It
should be pointed out, however, that complex combinations of ALIGN and
DISTRIBUTE may make the description of K unwieldy, and this may add
overhead to the inversion process.)
End of advice to users.
Sometimes it is advantageous to "split" an iteration between processors.
The following case shows one example of this:
        
        !HPF$ INDEPENDENT
        DO I = 2, N-1
          !HPF$ ON HOME(A(I))
          A(I) = (B(I) + B(I-1) + B(I+1))/3
          !HPF$ ON HOME C(I+1)
          C(I+1) = A(I) * D(I+1)
        END DO

Due to the first ON clause, the reference to A(I) is local in the first
statement. The second ON clause makes A(I) nonlocal (for some values of I)
there.  This maximizes the data locality in both statements, but does
require data movement between the two.

Advice to implementors:
If there are several non-nested ON clauses in a loop, then the schematic
above needs to be generalized.  In essence, the iteration range for each
individual ON clause must be generated.  A processor will then iterate over
the union of these ranges. Statements guarded by an ON directive must now
be guarded by an explicit test.  In summary, the code for

        DO i = lb, ub, stride
                !HPF$ ON HOME(array1(f1(i)))
                stmt1
                !HPF$ ON HOME(array2(f2(i)))
                stmt2
        END DO

on processor p becomes

        set1 = [lb:ub:stride] intersect f1^-1(my_set1(p))
        set2 = [lb:ub:stride] intersect f2^-1(my_set2(p))
        DO i in set1 union set2
          IF (i in set1) THEN
            stmt1
          ENDIF
          IF (i in set2) THEN
            stmt2
          ENDIF 
        END DO

where my_set1(p) is the local set for array1, and my_set2(p) is the local
set for array2.  (Again, synchronization and communication must be handled
by other means.)  Code transformations such as loop distribution and loop
peeling can be used to eliminate the tests in many cases.  They will be
particularly profitable if there are data dependences between the ON
blocks.
End of advice to implementors.

Advice to users:
Splitting an iteration like this is likely to require either additional
tests at runtime or additional analysis by the compiler.  Even if the
compiler can generate low-overhead scheduling for the individual ON
clauses, combining them is not necessarily low-overhead.  The locality
benefits must be rather substantial for this to pay off, but there are
cases where multiple ON clauses are valuable.  (All these statements are
particularly true if one ON block uses data computed in another one.)
End of advice to users.


Because ON clauses nest naturally, they can be useful for expressing
parallelism along different dimensions.  Consider the following examples:

        REAL X(M,M)
        !HPF$ DISTRIBUTE X(BLOCK,BLOCK)
        
        !HPF$ INDEPENDENT, NEW(I)
        DO J = 1, M
          !HPF$ ON HOME(X(:,J)) BEGIN
            DO I = 2, M
              !HPF$ ON HOME(X(I,J))
              X(I,J) = (X(I-1,J) + X(I,J)) / 2
            END DO
          !HPF$ END ON
        END DO

Each iteration of the J loop is executed by a column of the processors
arrangement.  The I loop further subdivides the computation, giving each
processor responsibility for computing the elements it owns.  Many
compilers would have chosen this computation partitioning automatically for
such a simple example.  However, the compiler might have attempted to fully
parallelize the outer loop, executing each inner loop sequentially on one
processor.  (This might be attractive on a machine with very fast
communications.) By inserting the ON clauses, the user has advised against
this strategy, thus trading additional locality for restricted parallelism.
Notice that the ON directive neither requires nor implies the INDEPENDENT
assertion.  In both nests, each iteration of the I loop depends on the
preceeding iteration, but the ON directive can still partition the
computation among processors.  The ON directive does not automatically make
a loop parallel.

Advice to implementors:
"Dimension-based" nesting, as above, will probably be a common case.  The
HOME clauses can be inverted at each level, treating indices from outer
loops as run-time invariants.
End of advice to implementors.

Advice to programmers:
Nested ON directives will tend to have efficient implementations if their
HOME clauses refer to different dimensions of the processors arrangements,
as in the above example.  This minimizes the interaction between the levels
of the loops, simplifying the implementation.
End of advice to programmers.

Consider the following variation on the above example:

        !HPF$ DISTRIBUTE Y(BLOCK,*)

        !HPF$ INDEPENDENT, NEW(I)
        DO J = 1, M
          !HPF$ ON HOME(Y(:,J)) BEGIN
            DO I = 2, M
              !HPF$ ON HOME(Y(I,J))
              Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
            END DO
          !HPF$ END ON
        END DO

Note that the ON clauses have not changed, except for the name of the
array.  The interpretation is similar to the above, except that the outer
ON directive assigns each iteration of the J loop to all of the processors.
The inner ON directive again implements a simple owner-computes rule.  The
programmer has directed the compiler to distribute a serial computation
across all the processors.  There are a few scenarios where this is more
efficient than parallelizing the outer loop:

1. Parallelizing the outer loop will generate many non-local references,
   since only a part of each column is on any processor. If nonlocal references
   are very expensive (or if M is relatively small), this overhead
   may outweigh any gain from parallel execution.
2. The compiler may take advantage of the INDEPENDENT directive to avoid
   inserting any synchronization.  This allows a natural pipelined execution.
   A processor will execute its part of the I loop for one value of J, then
   immediately go on to the next J iteration.  Thus, the first processor
   will start on J=2 while the second receives the data it needs (from
   processor one) for J=1.  (A similar pipeline would develop in the X
   example above.)

Clearly, the suitability of these ON clauses will depend on the underlying
parallel architecture.

Advice to programmers:
This example points out how ON may improve software engineering.  While the
"value" of HOME(X(I)) will change if X's mapping changes, its intent will
usually stay the same - run the loop "aligned with" the array X.  Moreover,
the form of the clauses is portable, and they simplify experimenting with
alternative computation partitioning.  Both qualities are similar to the
advantages of DISTRIBUTE and ALIGN over low-level data layout mechanisms.
End advice to programmers.

ON directives are particularly useful when the compiler cannot accurately
estimate data locality, for example when the computation uses indirection
arrays.  Consider three variations of the same loop:

        REAL X(N), Y(N)
        INTEGER IX1(M), IX2(M)
        !HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
        !HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)
        
        !HPF$ INDEPENDENT
        DO I = 1, N
          !HPF$ ON HOME( X(I) )
          X(I) = Y(IX(I)) - Y(IY(I))
        END DO
        
        !HPF$ INDEPENDENT
        DO J = 1, N
          !HPF$ ON HOME( IX(J) )
          X(J) = Y(IX(J)) - Y(IY(J))
        END DO

        !HPF$ INDEPENDENT
        DO K = 1, N
          !HPF$ ON HOME( X(IX(K)) )
          X(K) = Y(IX(K)) - Y(IY(K))
        END DO

In the I loop, each processor runs over its section of the X array.  Only
the reference X(I) is guaranteed to be local.  (If M<>N, then IX and IY
have a different block size than X, and thus a different mapping.)
However, if it is _usually_ the case that X(I), Y(IX(I)), and Y(IY(I)) are
located on the same processor, then this mapping may be the best one
available.  If X(I) and Y(IX(I)) are _always_ on the same processor, then
the RESIDENT clause should be added:
        !HPF$ ON HOME( X(I) ), RESIDENT( Y(IX(I)) )
This will avoid communication setup overhead on most systems, and there is
little chance that the compiler would deduce this automatically.  If both
references to Y are _always_ on the same processor as X(I), then further
improvement is possible and desirable:
        !HPF$ ON HOME( X(I) ), RESIDENT( EVERY=Y )
In the J loop, references IX(J) and IY(J) are always local.  This is the
most common array reference class in the loop, so it minimizes the
number of nonlocal data references in the absence of any special
properties of IX and IY.  It may not evenly balance the load among
processors; for example, if N=M/2 then half the processors will be idle.
As before, if the values in IX or IY ensure that one of the Y references is
always local, a RESIDENT assertion should be added.  In the K loop, only
reference Y(IX(K)) is guaranteed to be local (because Y and X have the same
distribution).  However, the values stored in IX and IY may ensure that
Y(IY(K)) and X(K) always local, a fact that should be noted if true:
  !HPF$ ON HOME( X(IX(K)) ), RESIDENT( Y(IY(K)), X(K) )
Even if the three REAL values are not always, but merely "usually" on the
same processor, this may be a good computation partitioning for both
locality and parallelism.  However, these advantages must be weighed
against the cost of computing this partitioning.  Since the HOME clause
depends on a (presumably large) array of runtime values, substantial time
may be required to determine which iterations are assigned to each
processor.  It should be clear from this discussion that there is no magic
solution for handling complex computation partitionings; the best answer
is usually a combination of application knowledge, careful data structure
design (including ordering of the elements), and efficient compilation
methodology and runtime support.

Advice to implementors:
The K loop is the situation that the inspector strategy described above is
designed for.  If there is an outer loop around any of these examples, and
that loop does not modify the distribution of X or the values of IX, then
a record of each processor's iterations can be saved for reuse.  The cost
is at worst linear in the sizes of the arrays.
End of advice to implementors.

Advice to users:
It is unlikely that any production compiler will generate low-overhead code
for K loop above in the near term.  The difference from previous examples
is that the HOME clause is not a function that can be easily inverted by
the compiler.  Some compilers may choose to execute every iteration on all
processors, testing the HOME clause at run-time; others may pre-compute a
list of iterations for every processor.  Of course, the cost of computing
the list will be substantial.

In practice, one would make all the arrays the same size to avoid some of
the alignment problems above; the example was written this way for
pedagogical reasons, not as an example of good data structure design.
End advice to programmers.

Explicit use of processors arrangements in ON directives is usually
associated with task parallelism.  Many examples can be found there (I
assume...) The following example illustrates how processors can be used
for a one-dimensional domain decomposition algorithm:

        !HPF$ PROCESSORS (PROCS(NP))
        !HPF$ DISTRIBUTE X(BLOCK) ONTO PROCS

        ! Compute ILO(IP) = lower bound on PROCS(IP)
        ! Compute IHI(IP) = upper bound on PROCS(IP)
        DONE = .FALSE.
        DO WHILE (.NOT. DONE)
          !HPF$ INDEPENDENT, NEW( ILO, IHI )
          DO IP = 1, NP
            !HPF$ ON HOME(PROCS(IP)), RESIDENT( X(ILO(IP):IHI(IP)) )
            CALL SOLVE_SUBDOMAIN( IP, X(ILO(IP):IHI(IP)) )
          END DO
          !HPF$ ON HOME(X) BEGIN
            CALL SOLVE_BOUNDARIES( X, ILO(1:NP), IHI(1:NP) )
            DONE = CONVERGENCE_TEST( X, ILO(1:NP), IHI(1:NP) )
          !HPF$ END ON
        END DO

The algorithm divides the entire computational domain (array X) into NP
subdomains, one for each processor.  The INDEPENDENT IP loop performs a
computation on each subdomain's interior.  The processors then collaborate
to update the boundaries of the subdomains and test for convergence.  The
subroutine SOLVE_SUBDOMAIN can use a transcriptive or descriptive
mapping for its array argument, placing it on a single processor.  An ON
block encompassing the entire procedure could then ensure the computation
proceded on a single processor.  Subroutines SOLVE_BOUNDARIES and
CONVERGENCE_TEST may well have their own loops similar to the IP loop, with
similar RESIDENT clauses.  Note that only the lower and upper bound of each
subdomain is recorded; this allows different processors to process
different-sized subdomains.  However, each subdomain must "fit" into one
processor's section of the X array.

Advice to implementors:
The IP loop above is likely to be a common idiom among programmers doing
block-structured codes.  In general, it can be implemented by inverting the
HOME clause as was done above.  In the one-to-one case shown here (probably
very popular with programmers), it can be implemented by assigning the
processor id to the loop index variable and testing the range of the loop
(once).
End of advice to implementors.

Advice to programmers:
Some compilers will propogate the ON information from the caller to the
callee at compile time, and some at run time.  Repeating the ON clause in
the caller and callee will tend to give the compiler better information,
resulting in better generated code.

Again, note the usefulness of RESIDENT clauses in giving the compiler
information.  Few compilers would be able to unravel nontrivial assignments
to ILO and IHI, and no current compiler would even attempt to understand
the comments in the above code fragment.
End of advice to programmers.

Because it is an assertion of act, the compiler can draw many inferences
from a single RESIDENT clause.  For example, consider the following case:

        !HPF$ ALIGN Y(I) WITH X(I)
        !HPF$ ALIGN Z(J) WITH X(J+1)
        
        !HPF$ ON HOME( X(K) ), RESIDENT( X(INDX(K)) )
        X(K) = X(INDX(K)) + Y(INDX(K)) + Z(INDX(K))

The compiler is justified in making the following assumptions in compiling
the assignment statement (assuming it honors both the ALIGN directives and
the ON directive):

* X(K) requires no communication (because of the HOME clause)
* X(INDX(K)) requires no communication (because of the RESIDENT clause)
* Y(INDX(K)) requires no communication (because Y has the same mapping as X)

The compiler cannot make any assumption about INDX(K) or Z(INDX(K)) from
the above code.  There is no indication how INDX is mapped relative to X,
so the ON directive gives no guidance.  Note that the fact that an
expression (here, X(INDX(K))) is local does not imply that its
subexpressions (here, INDX(K)) are also local.  Similarly, Z's mapping does
not determine if Z(INDX(K)) would be local; it indicates that Z(INDX(K)-1)
is local, but that isn't a great help.  If the compiler has additional
information (for example, X is distributed by BLOCK and INDX(K) is not near
a block boundary), it might be able to make additional deductions.

Advice to implementors:
One mark of a good compiler will be that it aggressively propagates RESIDENT
assertions.  This is likely to significantly reduce communication costs.
Note the cases under "Advice to users" below.
End advice to implementors.

Advice to users:
One can expect compilers to differ in how aggressive they are in drawing
these deductions.  Higher-quality compilers will be able to identify more
references as local, and use this information to eliminate data movement.
All compilers should recognize that if an element of one array is local,
then the same element of any other arrays with the same static mapping
(i.e.  arrays ALIGNed together, or with the same DISTRIBUTE pattern and
array size) will also be local.  That is, any compiler should recognize
Y(INDX(K)) in the above example as local.  Dynamically changing array
mappings (i.e. REALIGN and REDISTRIBUTE) will tend to limit such
information and information propagation.
End advice to users.

Another example of drawing further inferences from RESIDENT clauses is the
following:

        !HPF$ ON HOME(PROCS(IP)), RESIDENT( X(ILO(IP):IHI(IP)) )
        DO I = ILO(IP)+1, IHI(IP)-1
          X(I) = (X(I-1) + 2*X(I) + X(I+1)) / 4
        END DO

(Note that this is the construct form of the DO, so the ON clause applies
to all three lines below it.) Here, I is alwas between ILO(IP) and IHI(IP).
Therefore, the two references to X(I) require no communication, since they
always refer to an element of the RESIDENT array section.  Slightly deeper
reasoning reveals that the same is true of X(I-1) and X(I+1).  In short,
the entire DO loop can be executed without communication on PROCS(IP).

Advice to implementors:
Analysis of ranges with symbolic bounds will pay big dividends for
applications such as block-structured CFD codes.  Of course, the expense
of doing this analysis must be weighed against its benefits.  See also
the "Advice to programmers" below.
End advice to implementors.

Advice to programmers:
As with the previous example, one can expect the capabilities of compilers
to differ widely in making these inferences.  Most compilers should
recognize and use compile-time constants in RESIDENT clauses and the
executable code.  Many compilers will extend this to variables that are
invariant within the ON clause and will propogate the bounds of DO loops;
others will use pattern recognition to achieve some of this effect.  Such
pattern recognition suffices to recognize X(I) in the example.  Detailed
reasoning about symbolic values is beyond the capabilities of most
compilers.  For example, determining the range of X(I-1) in the above
example may require symbolic manipulation (or rather aggressive
pattern-matching).
End advice to users.

Consider the statements

        !HPF$ PROCESSORS PROC(NP)
        !HPF$ DISTRIBUTE X(BLOCK)
        ...
        !HPF$ ON HOME(PROC(1)), RESIDENT( EVERY=X )
        X(I) = X(I+2) + X(I) + X(I-2)
        ...
        !HPF$ ON HOME(PROC(2)), RESIDENT( X )
        X(I) = X(I+2) + X(I) + X(I-2)

The two RESIDENT clauses have slightly different meanings.  The EVERY
clause refers to the elements of X actually referenced; thus, it is a
statement about X(I), X(I+2), and X(I-2).  In effect, the first ON clause
asserts that I is at least two elements away from any block boundary.
Without EVERY, the RESIDENT clause applies to the entire array.  In effect,
the second ON directive asserts that all of X is mapped to processor
PROC(2).  This is clearly impossible given the distribution of X, so the
compiler would be justified in warning the programmer.  A different mapping
of X could make the second ON clause true (for example, if X were ALIGNed
to a subsection of a larger array, that subsection might reside entirely on
PROC(2)).  Both ON clauses are valid simultaneously if and only if X is
replicated.

Advice to programmers:
The EVERY clause is typically the right way to assert that all references
to an array are local.
End of advice to programmers.



--============_-1391005225==_============--
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Mon Jan 15 18:48:45 1996
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id SAA18573 for hpff-task-out; Mon, 15 Jan 1996 18:48:45 -0600 (CST)
Received: from moe.rice.edu (moe.rice.edu [128.42.5.4]) by cs.rice.edu (8.7.1/8.7.1) with ESMTP id SAA18557 for <hpff-task@cs.rice.edu>; Mon, 15 Jan 1996 18:48:42 -0600 (CST)
Received: from hplms26.hpl.hp.com (hplms26.hpl.hp.com [15.255.168.31]) by moe.rice.edu (8.7.1/8.7.1) with ESMTP id SAA15440 for <hpff-task@cs.rice.edu>; Mon, 15 Jan 1996 18:08:05 -0600 (CST)
Received: from hplpp3.hpl.hp.com by hplms26.hpl.hp.com with ESMTP
	($Revision: 1.36.108.11 $/15.5+ECS 3.3+HPL1.1S) id AA185850879; Mon, 15 Jan 1996 16:07:59 -0800
Received: by hplpp3.hpl.hp.com
	(1.37.109.14/15.5+ECS 3.3+HPL1.1) id AA091170883; Mon, 15 Jan 1996 16:08:03 -0800
Date: Mon, 15 Jan 1996 16:08:03 -0800
From: Rob Schreiber <schreibr@hplpp3.hpl.hp.com>
Message-Id: <199601160008.AA091170883@hplpp3.hpl.hp.com>
To: hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

One more time.   Here is the reduction proposal WITH the external syntax,
and WITH examples of derived type reductions.   Please comment

\chapter{Proposed HPF Extension for Reduction Operations in Independent Loops}

    Rob Schreiber and Chuck Koelbel, and Joel Saltz

    January 15, 1996




\section{Changes since the last version}

Changes reflect response at the November, 1995 HPFF meeting.

There is a proposal for intrinsic reduction, followed by a proposed extension for
user defined reduction.   

The ``external" form in which the reduction variable is given as a
clause of the INDEPENDENT directive is described.  Now, the loop in
which a reduction variable is protected is labeled explicitly.  Each and
every occurrence of a reduction variable 
encountered while the loop is active must be
as a reduction variable in a reduction statement of the form specified
in Section~\ref{sec-redux}.
This alternative syntax has the advantages of economy (only one
directive regardless of the number of reduction statements) and clarity
(the protected loop is explicitly indicated).

\section{Rationale}

It is often the case that a data parallel computation cannot be
expressed in HPF Version 1.1 as an {\tt INDEPENDENT} loop because several of
the loop iterations update one or more variables.  In such cases
parallelism may be possible and desirable, because the order of updates
is immaterial to the final result.  This is most often the case with
accumulations, such as the following loop:
\CODE
        do i = 1, 1000000000
            x = x + complicated_function(i)
        end do
\EDOC
This loop can run in parallel as long as its iterations make their
modifications to the shared variable ({\tt X)} in an atomic manner.
Alternately, the loop can be run in parallel by making updates to
temporary copies of {\tt X}, with a (short) final phase to merge these
changes.  In either case, the computation is conceptually parallel but
not allowed by the original definition of {\tt INDEPENDENT}.

When programming this kind of computation to run in parallel in HPF
Version 1, the programmer must store update information in a temporary
array whose size is equal to the number of loop iterations, and then
use an intrinsic reduction function or {\tt XXX_SCATTER} library function
after the loop.  The problems with this approach are twofold:  the
temporary array may be excessively large; and the reduction operation
has to be intrinsic.

For these reasons, we have added a reduction feature to HPF that allows
loops that update shared variables (shared means that they are updated
by more than one loop iteration) to be labeled as independent.

\section{Technical proposal}

In this discussion, we use the term {\em range} (of a {\tt DO} construct) and
{\em active} (an active {\tt DO} construct) in their technical sense as defined
by Fortran 95.  
The range of a {\tt DO} construct consists of the statements
lexically bounded by the {\tt DO} statement and the terminal statement of the
construct.  Thus, a subprogram invoked from inside a {\tt DO} loop is not in
its range, although the {\tt CALL} statement or function reference that
invokes it is.    On the other hand, the {\tt DO} construct is active from
the time the {\tt DO} statement is encountered until the time that the
matching execution of the terminal statement is executed, or the
program stops, or control is transferred out of the construct in one of
the other ways allowed for {\tt INDEPENDENT} loops (see page 302 of the
Handbook and Section 4.4 of the HPF Standard.)


\subsection{Enhanced {\tt INDEPENDENT} directive}
The syntax of the INDEPENDENT directive is modified as follows
\BNF
independent-directive      \IS   !HPF$ INDEPENDENT [,new-clause] [,reduction-clause]
\FNB
The reduction clause lists the reduction variables that are active in the following
loop:
\BNF
reduction-clause           \IS  REDUCTION(reduction-var-list)

reduction-var              \IS  array-variable-name
                           \OR  scalar-variable-name
                           \OR  structure-component
\FNB

\begin{constraints}
\item
A reduction variable must be of intrinsic type.
\end{constraints}

Note that the syntax does not allow an array element, array section, or substring
to be designated as a reduction variable in the reduction clause of the INDEPENDENT
directive.    Even though such a subobject may occur in a reduction statement, it is the
entire array or character variable that is treated as a reduction variable.

Any variable whose name occurs in a reduction clause is said to be
{\em protected} while the immediately following {\tt INDEPENDENT DO} loop is active.
A protected variable may not be referred to at all, {\em except}  in a reduction statement
(see below).   In particular, it may not occur in any HPF directive, including the
variable list in a {\tt NEW} clause.   This includes any {\tt NEW} clause in the same
{\tt INDEPENDENT} directive.

\subsection{Reduction statement}
\label{sec-redux}

A reduction statement is an assignment statement of the following special form that  
occurs in the range (i.e. the lexical body) of an
independent {\tt DO} loop for which the name of its reduction variable occurs in a
reduction clause.
\BNF
reduction-stmt          \IS  

     reduction-var-ref = expr reduction-op reduction-var-ref
			\OR
     reduction-var-ref = reduction-var-ref reduction-op expr
			\OR
     reduction-var-ref = &
       reduction-function(reduction-var-ref, expr)
			\OR
     reduction-var-ref = &
       reduction-function(expr, reduction-var-ref)

reduction-var-ref   \IS  variable

reduction-op         \IS  intrinsic-op 

reduction-function   \IS        MAX
                           \OR  MIN
                           \OR  IAND
                           \OR  IOR
                           \OR  IEOR
                           \OR  MATMUL
\FNB
The operator or function that precedes or follows the reduction variable
reference after the equals sign is the reduction operator.

\begin{constraints}
\item
An assignment statement of the requisite form is a reduction statement if and only if
it occurs in the range of an {\tt INDEPENDENT DO} loop in which the name of its reduction-var
occurs in a
reduction clause on the {\tt INDEPENDENT} directive that precedes the {\tt DO}
statement.

\item  The two reduction-var-refs that occur in a
reduction statement must be lexically identical.

\item  A reduction-op must be an intrinsic, associative
operator; therefore, reduction-op may not be a rel-op; and it may not be
power-op (i.e. **).

\item  The following two forms of the reduction-stmt are not allowed:
\CODE
    reduction-var-ref = expr - reduction-var-ref

    reduction-var-ref = expr / reduction-var-ref
\EDOC

\item In strict HPF 2.0, the noncommutative intrinsics {\tt //} and {\tt MATMUL}
are forbidden in reduction statements.
\end{constraints}

The intrinsic operators and functions that are allowed are all
associative (at least in their mathematical definitions, even though the
usual implementations of the arithmetic operators by Fortran language
processors and the underlying hardware are not); they are listed below
(in Section~\ref{sec-implem}).  All are commutative except for character
concatenation and {\tt MATMUL}.  (We apply the terms associative and
commutative to the subtraction and division operators, since their
meaning is addition of the additive or multiplication by the
multiplicative inverse, respectively:  {\tt A - B} means {\tt A + (-B)}.)

{\bf  Note:
In HPF version 2.0, the associative but noncommutative intrinsic functions and operators
(i.e. MATMUL and concatenate) are disallowed in reductions.
The description of their use in reduction that occurs here is of an endorsed extension.
}

\subsection{Relaxation of the HPF standard's requirement for {\tt INDEPENDENT} loops}

In Section 4.4 of the HPF standard, 
the first proviso on what may not occur in the loop is modified to:

*  Any two operations that assign to the same object interfere with
each other, (begin added text) except that the allowed 
occurrences of a reduction variable in a
reduction statement do not cause interference.  (end added text)

Thus, the user's assertion implied by {\tt INDEPENDENT} about the behavior of
the program is weakened, and no longer applies to reduction variables
that are protected in the {\tt DO} construct.  

On exit from the outermost {\tt DO} loop in which it is protected
(the one whose INDEPENDENT directive contains a reduction clause naming it),
the value of a reduction variable is
processor dependent, although it is constrained by the model
implementation mechanism given in Section~\ref{sec-implem}.

\subsection{Example}

Here is an {\tt ADD_SCATTER} done as an independent loop:
\CODE
    !hpf$ independent, reduction(X)
    do i = 1, n
        X(index(i)) = X(index(i)) - f(i)
    enddo
\EDOC
The implementation will most likely make a local copy of an array of the
same shape as {\tt X}, and initialize it to zero.  Each iteration will
subtract the value of {\tt f(i)} from its local {\tt X(index(i))}.  To
create the final result, it will combine all the local arrays with the
initial value of {\tt X}.  The combining operator is addition, so that the
result is the sum of the initial value of {\tt X} and the local arrays.

\subsection{Restrictions on the 
lexical location of reduction statements, and on the
use of reduction variables}
\label{sec-location}

A variable that is updated by reduction statements in an independent
loop must be protected by explicit appearance in a reduction clause in
the INDEPENDENT directive for the outermost independent loop that
contains no NEW clause for that variable.

When a reduction operation is executed, some nest of {\tt DO} constructs will
be active; this is guaranteed by definition.
The reduction variable is protected while the loop in which it occurs in the reduction
clause of the INDEPENDENT directive is active.
If there are several nested INDEPENDENT DO loops surrounding the reduction statements
in which the variable is updated, which one is the right one to get the reduction
clause?  The answer is the outermost one, subject to the constraint that a reduction
variable may not appear in a NEW clause for that loop and may not occur at all
in that loop outside of reduction statements, in particular, it may not get the NEW
attribute.

On the other hand, if a statement that has the form of a reduction statement occurs
while an independent loop is active, but the updated variable is not a protected reduction
variable, then the programmer is guaranteeing that no two iterations of the independent
loop will update the same location.  For example:
\CODE
    !hpf$ independent
    do i = 1, n
        ! I know there are no repeated values in indx(1:n)
        x(indx(i)) = x(indx(i)) + f(i)
    end do
\EDOC

HPF requires that any reduction statement occur in the range of the {\tt DO}
construct in which its variable is protected: it must be lexically
contained in the body of the outermost loop in which it is protected.

A protected reduction variable may only occur in reduction statements,
and only in references to the left of the assignment operator and the
matching reference to the right of the assignment operator.  Outside of
the loop in which it is protected, however, it is treated as an ordinary
variable.

\subsection{Example loop nests}

When loops are nested, a reduction variable may need to be protected in an
independent outer loop even though the reduction operations in which it
occurs are nested inside an inner loop.  Moreover, the inner loop and
any intervening loops may or may not be independent.
\CODE
!   Nested Loop Example 1.  Inner loop is sequential

      x = 10
      outer: do while (x < 1000)   ! This loop is sequential
         !hpf$ independent, new(j), reduction(x)
         middle: do i = 1, n       
           inner: do j = 1, m
             x = x + j
             !  Note that it would be incorrect
             !  to refer to x here, except in a reduction statement
           enddo inner
           !  Note that it would be incorrect
           !  to refer to x here, except in a reduction statement
         enddo middle
         print *, x
      enddo outer
\EDOC
Since the variable {\tt x} occurs in a reduction clause for loop {\tt middle},
it is a protected reduction variable
throughout that loop, including inside the inner loop.   
If {\tt inner} was an independent loop,
it would be incorrect to include {\tt x} in a reduction clause of its INDEPENDENT
directive, either in addition to or instead of loop {\tt middle}.   

The outermost
loop is not {\tt INDEPENDENT}, and so {\tt x} need not and cannot be protected 
in that part of its range outside the middle loop.

This remains true if the inner loop is independent.   On the other
hand, a variable that occurs in a {\tt NEW} clause may not be a reduction
variable in the range of a loop, although it may be in the range of a
contained loop:
\CODE
!   Nested Loop Example 2.  Outer loop NEW clause.

      !hpf$ independent, new(i)
      outer: do k = 1, 100
         !hpf$ independent, new (j,x)
         middle: do i = 1, n
            x = 10
            !hpf$ independent, reduction(x)
            inner: do j = 1, m
              x = x + j**2
              !  Note that it would be incorrect
              !  to refer to x here, except in a reduction statement
            enddo inner
            y(i) = x
         enddo middle
      enddo outer
\EDOC
Here, {\tt x} is a protected reduction variable only in the inner loop.


The reduction statement may not occur in a subprogram called from the range of the
independent loop.   To get this effect, return the updating value from the
subprogram and perform the update with a reduction statement in the lexical loop
body.
\CODE

        program example
        real y, z(10)
        integer x, xinc, i
        
        x = 5
        !hpf$ independent, new(y), reduction(x)
        do i = 1, 10
            x = x + update_with_side_effect(i)
        enddo

        contains
        integer function update_with_side_effect(i)
          integer i, local_b, j, k

          !  Must return the increment to x
          !  A reduction statement for x may not appear inside sub
          !  local_b is implicitly NEW on each outer iteration.
          !
          local_b = (i-1)**2
          xinc = i * local_b

          !hpf$ independent, reduction(y)
          do j = 1, 10
             y = y + 1
          enddo

          !  Sequentially update a different part of z
          !  No reduction or new clause needed for z
          !
          do k = 1, 10
             z(i) = z(i) + k*y
          enddo
        end subroutine sub
	end program
\EDOC
\subsection{Reduction operators may not be mixed}

In most cases, only one operator will be used in the reduction
statements (if there are more than one) that update a given reduction
variable.   It is sensible, however, to use different operators, such
as + and -, together on the same reduction variable: mathematically,
subtraction is just addition of the additive inverse.  The same is true
for multiplication (*) and division (/).  No other mixing of operators
is allowed, however.

\section{Implementation and semantics}
\label{sec-implem}

The result of the Fortran 95 intrinsic function  {\tt SUM}  is defined to
be a processor dependent approximation to the sum of the elements of
its argument array.  On exit from the loop in which it is
protected, the value of a reduction variable is likewise not completely
specified by HPF.  One possible value is that which would have been
computed by sequential execution of the loop, but other,
processor-dependent approximations to this value may be produced.

While HPF does not explicitly define
the set of possible values, it does give a precise implementation
mechanism that serves to define this set.

Since no reference to a protected reduction variable can occur except in 
a reduction statement, it is not necessary to define
the values that these variables may have while protected.

In this discussion, the term processor means a single physical
processor or a group of physical processors that together sequentially
execute some or all of the iterations of an independent loop.

We first give a simple implementation mechanism that applies to
commutative reduction operations.  
On entry to an independent loop,
every executing processor allocates a local
reduction variable associated with each variable in the reduction clause on the 
{\tt INDEPENDENT} directive, and initializes it to the identity element for the
intrinsic reduction operator.  The local reduction variable has the
same type and kind type parameter
as the global reduction
variable.  Its shape is determined by the shape of the global reduction variable,
and is identical to it, except as noted below.

The identity elements for the intrinsic operators are as
follows:

\begin{table}
\begin{verbatim}
        Operator      Identity Element
        -------       ----------------
        +                0
        -                0
        *                1
        /                1
        .and.            .true.
        .or.             .false.
        .eqv.            .true.
        .neqv.           .false.

        //               ''

\end{verbatim}
\caption{Identity elements for intrinsic reduction operators.}
\end{table}

The following intrinsics may be used as reduction-functions, with the indicated
identity elements:

\begin{table}
\begin{tabular}{l@{~~~}p{3in}}
        Operator    &   Identity Element \\
\hline
        {\tt IAND(I,J)}   &   (all ones)\\
        {\tt IOR(I,J)}    &   0         \\
        {\tt IEOR(I,J)}   &   0         \\
        {\tt MIN(X,Y)}    &   {\tt -HUGE(X)} where {\tt X} 
                              has the same type and kind type
                              parameter as the reduction variable\\
        {\tt MAX(X,Y)}    &   {\tt +HUGE(X)} where {\tt X} has the same 
                        type and kind type
                        parameter as the reduction variable\\
        {\tt MATMUL(A,B)} &   {\tt MATMUL} is noncommutative; 
                        it is the only allowed
                        associative and noncommutative intrinsic reduction function.
                        As a consequence of the form {\tt A = MATMUL(A, B)}
                        of the reduction operation, it must be
                        true that {\tt SIZE(A,2) = SIZE(B,1) = SIZE(B,2)}.
                        As a consequence of the alternate form 
                        {\tt A = MATMUL(B, A)}
                        of the reduction operation, it must also be
                        true that {\tt SIZE(A,1) = SIZE(B,2) = SIZE(B,1)}.  Thus,
                        the identity is the identity matrix of shape 
                        {\tt (/N, N/)},
                        of the same type and kind type parameters as the reduction
                        variable; for the left identity, {\tt N = SIZE(A,1)}, for
                        the right identity, {\tt N = SIZE(A,2)}.   \\
\end{tabular}
\caption{Identity elements for intrinsic reduction functions.}
\end{table}

Let {\tt L} be the local reduction variable, {\tt A} the global reduction variable.
For array operators other than {\tt MATMUL}, 
the shape of {\tt L} will be the same as the shape of {\tt A},
except that if all the expr's in reduction statements for {\tt A} are 
scalar, then a
scalar {\tt L} may be used.   
(In the case of {\tt MATMUL}, {\tt L} will have shape
determined by that of {\tt A}, as described in Table 2.)

As an example, consider:
\CODE
        double precision y(100)
        real x(100), z(100)

        z = 5.
        x = (/ (real(i), i = 1:100) /)
        !hpf$ independent, new(y), reduction(x, z)
        do i = 1, 10
           z = i + z
           y = ...
           x = x - y
        enddo
\EDOC
The local reduction variables will be real arrays {\tt local_x} and 
{\tt local_z} of shape {\tt (/ 100 /)}.   All of these are initially zero.
An optimizing compiler will use a scalar {\tt local_z},
one per processor.
The double precision  values (of {\tt local_x + y}) will be rounded to single precision 
after every local addition.    

Each processor performs a subset of the loop iterations;  when it
encounters a reduction statement, it updates its own copy of the
reduction variable.    A processor is free to perform its loop
iterations in any order; furthermore, it may start an iteration,
suspend work on it, do some or all of the work of other iterations, and
resume work on the suspended iteration.   However, any update of a
local reduction variable occurs through the execution of a reduction
statement, and reduction statements {\em are} executed atomically.

The final value of the reduction variable is computed by combining the
local values with the value of the global reduction variable on entry
to the loop, using the combining operator.  
The ordering of this
reduction is language processor dependent, just as it is for the
intrinsic reduction functions ({\tt SUM}, etc.)   
Thus, in the example above, the final value of {\tt z} will be (approximately)
60.

Next, we give an implementation for the noncommutative operator {\tt MATMUL}.

First, assume that a given processor executes a contiguous set of the
loop iterations -- i.e. there is a block mapping of loop iterations to
processors.  In this case, the processor allocates a local reduction
variable to accumulate all updates applied from the right  (as in 
{\tt X = MATMUL(X, <expr>))} and initializes it to the right identity element.  it
allocates another local reduction variable to accumulate all updates
applied from the left  (as in {\tt X = MATMUL(expr, X))} and initializes it
to the value of the left identity element.  It may omit either of these
if there are no corresponding updates in the loop.

When all iterations of the loop have been completed, the local
reduction variables are combined to produce the final result.   The
combination is an evaluation of the expression
\CODE
(1)   lrv(P) <op> lrv(P-1) <op> ... <op> lrv(1) 
                     <op> A <op> 
                            rrv(1) <op> ... <op> rrv(P)
\EDOC
where A is the initial value of the reduction variable, lrv(k), k=1,
..., P is the value of the local reduction variable that has
accumulated the updates from the left due to the kth group of loop
iterations (on processor k, in the case of a block mapping of
iterations to processors), and rrv(k), k = 1, ..., P is the value of
the local reduction variable that has accumulated the updates from the
right due to the kth group of loop iterations.  In making this
evaluation, the language processor is free to employ any associated
rearrangement of the given expression; i.e. it is not required to
evaluate the expression left to right, but may use any parenthesization
of it.   It may not interchange the order of the expressions, since
the operator is not commutative.

If the mapping of iterations to processors is not a block mapping, then
a left, a right, or both a left and a right reduction variable must be
created and used for every contiguous group of loop iterations, with
updates accumulated and combined as above; in expression (1), P is the
number of such contiguous groups of iterations.

In the endorsed extension to HPF for noncommutative intrinsic reduction, updates
may occur on either the left or the right, but not both, in a given independent loop.

\section{Examples}

The reduction variable may be a scalar array element:
\CODE
      !hpf$ independent, reduction(x)
      do i = 1, n
          x(indx(i)) = x(indx(i)) + <expr>
      enddo
\EDOC
It may be to a section of an array:
\CODE
      !hpf$ independent, reduction(x)
      do i = 1, n
          x(i: i+4) = x(i: i+4) + <expr>   ! as many as 5 updates to elements of x
      enddo
\EDOC
Note that the compiler, if it distributes iterations of this loop in a block-wise
manner, will not need to make a private copy of the entire array x.

\section{Extension to user-defined operators}
This section is a separate and detachable part of the proposal.

\subsection{{\tt COMMUTATIVE} attribute for functions}

First, allow {\tt COMMUTATIVE} as an addition attribute of functions:
\CODE
        PURE, COMMUTATIVE FUNCTION FOOBAR(A,B)
\EDOC
The user asserts that the value of {\tt FOOBAR(A,B)} is equal to the value
of {\tt FOOBAR(B,A)} for all inputs A and B.

If it is desired to make no change to the language, but simply add an assertion directive,
then this can be done instead.

\subsection{Initial value for local variables in reduction clause}

The reduction clause is allowed to designate an initial value
for the local reduction variable:
\CODE
!hpf$ independent, new(B), reduction (A = MY_OP_IDENT(SIZE(B,1)))
do i = 1, n
   B = ...
   A = MY_OP(A, B)
enddo
\EDOC
The reason to assign the identity to the reduction directive rather than
make it an attribute of the function is the difficulty of restricting
the kind of expression allowed in the later case.  In this proposal, any
expression whose value is conformable with the reduction variable is
permitted.  Note that the expression is evaluated only once on entry to the
loop nest, and its value is conceptually replicated, so that all
processors initialize their local reduction variables to this value.
Thus, the expression could use a non-pure function.  We might disallow
this, in order to permit concurrent, one-per-processor evaluations of
the identity expression.  

In this proposal, all restrictions that apply to intrinsic reduction apply equally.
Thus, reduction functions may not be mixed.   If a reduction function is applied
``from the right" (as in the example) then it may not be applied from the left in the same
loop.

\subsection{Examples of derived type reduction}
First, one may use extended precision arithmetic:
\CODE

    interface operator (+)
       pure commutative type (bignum) function big_plus(x,y)
       type (bignum) x,y
       end function big_plus
    end interface

    type (bignum) b, big_zero
    external big_zero

    !hpf$ independent, reduction(b = big_zero())
    do i = 1, n
        b = b + <bignum-expr>
    enddo
\EDOC

In the following example the user decides to accumulate sparse local updates to
a distributed dense matrix.

\CODE
      module sparsity
      type real_sparse_matrix
          integer nrows, ncols, nnz ! shape, number of nonzeros
          real, pointer :: vals(:)
          integer, pointer :: row_index(:), col_index(:)
      end type real_sparse_matrix

      interface operator (+)
        pure function full_plus_sparse(x,y)
        real x(:,:), full_plus_sparse(size(x,1), size(x,2))
        type (real_sparse_matrix) :: y
        end function full_plus_sparse

        !  THIS IS THE REDUCTION FUNCTION
        pure commutative function sparse_plus_sparse(x,y)
        type (real_sparse_matrix) :: x, y, sparse_plus_sparse
        !hpf$ commutative, &
              identity (spzeros(x%nrows, x%ncols)) :: sparse_plus_sparse
        end function sparse_plus_sparse
      end interface

      contains
      pure function spzeros(nr, nc) result(res)
        type (real_sparse_matrix) res
        integer nr, nc
        res%nrows = nr;   res%ncols = nc;   res%nnz = 0
        allocate(res%vals(0))
        allocate(res%row_index(0))
        allocate(res%col_index(0))
      end function spzeros

      pure function elt(i, j, x) result(res)
        type (real_sparse_matrix) res
        integer i,j
        real x
        res%nrows = nr;   res%ncols = nc;   res%nnz = 1
        allocate(res%vals(1))
        allocate(res%row_index(1))
        allocate(res%col_index(1))
        res%vals(1) = x
        res%row_index(1) = i
        res%col_index(1) = j
      end function elt
      end module

      program test
      use sparsity
      real gs(n,n)
      type (real_sparse_matrix) :: s

      s = spzeros(n, n)
      !hpf$ independent, new(row, col, val), reduction(s = spzeros(n, n))
      do i = 1, num_updates
         s = s + elt(row, col, val) ! adds in val at position (row, col)
      enddo
      gs = full_plus_sparse(gs, s)
      end program

\EDOC





---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------