From owner-hpff-task  Thu Sep  7 12:19:45 1995
Received: by cs.rice.edu (MAA13069); Thu, 7 Sep 1995 12:19:45 -0500
Received: from cat.syr.edu by cs.rice.edu (MAA13058); Thu, 7 Sep 1995 12:19:35 -0500
From: choudhar@cat.syr.edu
Received: from localhost.syr.edu by cat.syr.edu (4.1/1.0-6/5/90)
	id AA14730; Thu, 7 Sep 95 13:18:48 EDT
Message-Id: <9509071718.AA14730@cat.syr.edu>
To: hpff-task@cs.rice.edu
Cc: choudhar@cat.syr.edu, chk@cs.rice.edu, ken@cs.rice.edu,
        rajesh@nova.npac.syr.edu
Subject: hpff-task: Preliminary proposal for Supporting Out-Of-Core Arrays in HPF
Date: Thu, 07 Sep 95 13:18:46 -0400
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


  Preliminary Proposal to provide support for OOC arrays in HPF

(The proposal is based on addressing some basic questions which are
listed at the end of this message)

We would like to start a discussion on this.

thanks
Alok, Chuck, Ken


The following proposal presents a way to support out-of-core array in HPF.
The main objectives are consistency with the "normal" HPF data mapping
directives, simplicity, and minimal extensions.

The following illustrates an example of declaring out-of-core arrays.
The only addition here is the directive OUT-OF-CORE. (Other directives
and information will follow later in this writeup)

!HPF$ TEMPLATE TEMP(100,100)
!HPF$ DISTRIBUTE TEMP(CYCLIC(B),CYCLIC(B))
!HPF$ ALIGN WITH TEMP :: A,B,C
!HPF$ OUT-OF-CORE: TEMP

This directive simply says that if nodes had infinite memory then
elements of the array will be in the memory of the processor as described
by the distribution directives. In other words, the directives describe
which processor's memory will an element be brought into when in-core.
This is a logical and simple extension of the HPF data mapping directives.

As usual, directives can be directly applied to arrays (rather than through
a template).

Restrictions: 
1) One cannot align an in-core array with an out-of-core array. E.g.,
   if A(10,10) is an OOC array and B(10) is an in-core array, it is
   not permitted to align B with A (or vice-versa) in any form, because
   not all the elements of A may be in-core, and therefore, it is
   difficult to enforce the meaning of align in such cases.
2) Others?


Association of File(s) with Arrays.

If the data for an out-of-core array comes from an input file then 
the file name must be specified. This is different from traditional
open and read/write of a file because the user does not explicitly access the
file.

For this we propose a directive

!HPF$ ASSOCIATE (<fn>, [,other parameters]) :: OOC array name

e.g., 

!HPF ASSOCIATE (filex) :: A

By default, the storage order in the file is the fortran storage order
and the element (1,1) corresponds to the first elment of the
file.

[,Other parameters] can be used to specify.
1) Storage order
2) starting element of the dataset within the file.

1) is needed if the storage order is anything other than the default
   column major.

   There are two possibilities. Use key words such as row major, column
major, or chucnks with parameters (such as chunk size in each dimension);

 OR use an explicit order, e.g., (2,1,3) which says that the outermost
dimension is 2 followed by 1 followed by 3. This may not allow chunking.

2) Starting element of the dataset is necessary if file contains some
   meta data in the beginning (e.g., a description of storage order from
   which the info in 1 will be derived) of a file, record size etc.

   This would require an open of a file, read of meta data and close
   of a file before the corresponding ASSOCIATE may be executed.

Note that it is not allowed to have a file open for regular access
 as well as for out-of-core computation because OOC computation
 provides implicit access ot a file and any explicit access
presents consistency problems.

However, the following is allowed.

	OPEN the file
	READ/WRITE/INQUIRE to your heart's content
	CLOSE the file
	(somewhere down the call chain) declare the OOC array
	use the array
	RETURN (i.e. exit the scope where the array is declared)
	now you can access the file again
The CLOSE acts as a sync point.  Similarly, the array being
deallocated acts as a sync point.

Note that this is necessary if metadats is contained within the file.


If an OCC array is used purely for scratch purposes, then it is not
necessary to associate a file with it. Compiler can choose
a name and create files in whatever way necessary. HOWEVER,
any array with which no file name is associated, must not be
read before anything is written into it (for obvious reasons).
In fact, it should be a runtime error if there is a part of the
OOC array which has not been assigned any values is read.

ASSOCIATE directive is also an executable (like redistribute) because
a user may want to process several files using the same array
(at different points in time, e.g., pipelined computations).
At any point, however, only one association exists. 


E.G., 

!HPF$ TEMPLATE TEMP(100,100)
!HPF$ DISTRIBUTE TEMP(CYCLIC(B),CYCLIC(B))
!HPF$ ALIGN WITH TEMP :: A,B,C
!HPF$ OUT-OF-CORE: TEMP

 DO i=1,numer_of_frames

    ASSOCIATE (frame.i) :: A ! Syntax to be decided of course!
    EXTRACT_NICE_FEATURES(A)
 END Do

Note one can associate multiple OOC arrays with the same file but
the consistency (if multiple assignments are done) is user's
responsiblity.

			TILING


Even when OOC distribution is provided, it may require significant
compiler analysis to figure out in what order to bring data in
to be processed. User, based on her knowledge of computation
(philosophy of HPF) can provide this hint.

For this purpose, a directive IO_DISTRIBUTE can be used, which essentially
provides a hint as to how data should be fetched into the memory.

For example, say each processors "owns" chunk of OOC data after
a (block,block) distribution. How data is to be fetched from whichin this
chunk (bunch of rows, bunch of columns, two-dim blocks) can be
described by IO_DISTRIBUTE.

Note that IO_DISTRIBUTE will also help in reorganizing data into local
files of each processor wrt to storage order within each file, if the
implementation chooses to do such a reorganization.

However, note that block size here would depend on the amount of memory
(tile size) available (and not on the NOP). The syntax for this
size specification within IO_DISTRIBUTE needs to be resolved.

One way is to do it explicitly; e.g., IO_DISTRIBUTE A(5,10), meaning
blocks of 5X10 elements...

But I believe this is a detail.
----------------------------------------------------------------

The above discussion is based on the following questions. Some of them
are not answered (in terms of a concrete proposal)

 1. Type of array: is the given array in-core or out-of-core? What is
the distribution of the array?
 2. If out-of-core, is there a corresponding file?
 3. If there exists a file, is the file persistent (input/output) or
temporary?
 4. Information about array mapping
    1. Multiple arrays mapped to the same file
    2. Multiple files mapped to the same array
 5. Information about File Mapping
    1. File Ordering
    2. File Distribution Information
 6. Hints about tiling
    1. Available Memory
    2. Tiling Parameters

* Execution Model
 Compiler has to decide underlying execution model. Two possible
models are
  1. Local Placement Model (OOC data in local space)
  2. Global Placement Model (OOC data in Shared Space)




==============================================================================
Alok Choudhary                 Phone: 315-443-4280 (Off.)
Associate Professor                   315-423-8065(home)
ECE, CIS and                    (Fax): 315-443-2583
121 Link Hall                   choudhar@cat.syr.edu (preferred)
Syracuse University             choudhar@nova.npac.syr.edu
Syracuse, NY 13244.             http://www.cat.syr.edu/~choudhar

=============================================================================

---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Fri Sep  8 17:21:35 1995
Received: by cs.rice.edu (RAA09106); Fri, 8 Sep 1995 17:21:35 -0500
Received: from frey.riacs.edu by cs.rice.edu (RAA09092); Fri, 8 Sep 1995 17:21:20 -0500
Received: by frey.riacs.edu (8.6.12/1.35)
	id PAA02022; Fri, 8 Sep 1995 15:25:14 -0700
Date: Fri, 8 Sep 1995 15:25:14 -0700
From: schreibr@frey.riacs.edu (Rob Schreiber)
Message-Id: <199509082225.PAA02022@frey.riacs.edu>
To: hpff-task@cs.rice.edu
Subject: hpff-task: Reduction proposal
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


Although this is still a work in progress, I thought it is mature
enough to share it, and thereby allow enough time for you to read
and comment before the meeting:



    Proposed HPF Extension for Reduction Operations in Independent Loops

    Rob Schreiber and Chuck Koelbel, and Joel Saltz

    September 8, 1995


1.  Rationale

It is often the case that a data parallel computation cannot be
expressed in HPF as an INDEPENDENT loop because several of the loop
iterations update one or more variables.  In such cases, parallelism is
possible because the order of updates is immaterial to the final result.
This is most often the case with accumulations, such as the following loop.

        DO I = 1, 1000000000
            X = X + COMPLICATED_FCN(I)
        END DO

This loop can run in parallel as long as its iterations make their
modifications to the shared variable (X) in an atomic manner.  Alternately,
the loop can be run in parallel by making updates to temporary copies of X,
with a (short) final phase to merge these changes.  In either case, the
computation is conceptually parallel but not allowed by the original
definition of INDEPENDENT.

When programming this kind of computation to run in parallel in HPF
Version 1, the programmer must store update information in a temporary
array whose size is equal to the number of loop iterations, and then
use a reduction intrinsic or XXX_SCATTER library function after the
loop.  The problems with this approach are twofold:  the temporary
array may be excessively large; and the reduction operation has to be
intrinsic.

For these reasons, we have added a reduction feature to HPF that allows
loops that update shared variables (shared means that they are updated
by more than one loop iteration) to be labeled as independent.
Interestingly, with this extension it is feasible to write general
purpose _SCATTER and reduction functions, that can work with
user-defined operators.

2.   Technical Proposal

2.1 Add a REDUCE Directive

H499 reduce-directive         is  !HPF$ REDUCE

Constraint:  The first non-comment, non-blank line following a reduce-directive
an assignment-stmt of a special form, called a reduction statement.

2.2 Reduction Statments

H498 reduction-stmt           is  assignment-stmt

A reduction statement is any statement immediately preceded by a reduce directive.

Constraint: The form of a reduction-stmt must be

     reduction-variable = reduction-variable reduction-op expr

or
     reduction-variable = &
       reduction-function(reduction-variable, expr)

Constraint:  The two reduction-variable references that occur in a
reduction statement must be lexically identical.

H498.1  reduction-variable    is  variable

H498.2  reduction-op          is     intrinsic-op 
                                 or  defined-binary-op

H498.3  reduction-function    is  function-name

Constraint:  A reduction-function must be PURE.

Constraint:  A reduction-op may not be a rel-op; and it may not be power-op (i.e. **).

Constraint:  A reduction-stmt must occur in the range of an INDEPENDENT DO loop
in which the reduction variable does not have the NEW attribute.    Addition
constraints on the location of a reduction statement are covered in Section 2.4.

The operator or function that follows the reduction variable reference
after the = sign is the reduction operator.

Any variable that appears on the left hand side of a reduction
statement is said to be a reduction variable.   

2.3 Allow a reduction-clause in the independent directive:

H413 independent-directive    is  INDEPENDENT [, new-clause ] [, reduction-list]

H413.1 reduction-clause       is   
    REDUCTION(reduction-variable-list; COMBINE = combining-op; IDENTITY = value)

H413.2 combining-op     is      intrinsic-op  
                            or  defined-binary-op 
                            or  reduction-function

Constraint: The combining operator must be a binary operator or a
function of two arguments, defined for two objects of the type of each
of the the reduction variables, and producing an object of that same
type (it may not be a rel-op); it may not be the power operator **.   The only
intrinsic operators and functions that are allowed are commutative; they are listed
below (in Section 3).

Constraint:  The value in the identity clause must be an expression
conformable with each of the reduction variables, and for which
the assignment var = value is well-defined for each of the reduction variables; 
or it may the name of a PURE function of one argument that can accept every member
of the reduction variable list as an argument and that produceds a result
conformable with its argument and for which the assignment
var = value(var) is well-defined.

The reduction clause is mandatory for user-defined reduction operators;
it is optional for intrinsic operators applied to a reduction variable of intrinsic type.

In either case, it must appear in the correct INDEPENDENT directive.
This is the INDEPENDENT directive that labels the outermost 
DO construct in which the
reduction variable does not occur in a NEW clause, and which contains no INDEPENDENT
directive in which the reduction variable occurs in a NEW clause.
See Section 2.4.

Examples:

There may be more than one reduction clause in the INDEPENDENT directive:

        !HPF$ INDEPENDENT, REDUCTION(X, COMBINE=+, IDENTITY=0), &
        !HPF$ & REDUCTION(Y, COMBINE=MAX, IDENTITY= (-HUGE(Y)))
        DO I = 1, N
            !HPF$ REDUCE
            X = X + A(I)
            !HPF$ REDUCE
            Y = MAX(Y,A(I))
        END DO

The reduction variable may be of derived type, and the reduction operator
may be an overloaded operator:

interface operator (+)
   type (bignum) function rank0_big_plus(x,y)
   type (bignum) x,y
   end function rank0_big_plus
end interface

interface zero
   real function rank0_real_zero(x)
   real x
   end function rank0_real_zero

   real function rank1_real_zero(x)
   real x(:), rank1_real_zero(size(x))
   end function rank1_real_zero

   double precision function rank0_dbl_zero(x)
   double precision x
   end function rank0_dbl_zero

   type (bignum) function rank0_big_zero(x)
   type (bignum) x
   end function rank0_big_zero
end interface

real x0, x1(10)
double precision d
type (bignum) b

!HPF$ INDEPENDENT, REDUCTION(x0, x1, d, b; combine = +; identity = zero)
do ...
   ...
    !hpf$ reduce
    b = b + <bignum-expr>
   ...
    !hpf$ reduce
    b = b - <bignum-expr>
enddo


2.4  Restrictions on the Lexical Location of Reduction Statements, and on the
Use of Reduction Variables.

In this discussion, we use the term *range* (of a DO construct) and
*active* (an active DO construct) in their technical sense as defined by Fortran 90.
The range of a DO construct consists of the statements lexically bounded by the DO
statement and the terminal statement of the construct.  Thus, a subprogram invoked from
inside a DO loop is not in its range, although the CALL statement or function reference
that invokes it is.    On the other hand, the DO construct is active from the time the DO
statement is encountered until the time that the matching execution of the terminal
statement is executed, or the program stops, or control is transferred out of the
construct in one of the other ways allowed for INDEPENDENT loops
(see page 302 of the Handbook and Section 4.4 of the HPF Standard.)

When a reduction operation is executed, some nest of DO constructs will be active;
this is guaranteed by the third constraint in Section 2.2.
The reduction variable is said to be *protected* in some, and possibly all of the
active nest; the rule is this:  it is protected in the outermost DO construct that is
INDEPENDENT, has no NEW clause that names the reduction variable, 
and does not *contain* an independent
directive in which the reduction variable 
occurs in a NEW clause (see Example 2, below).   It is protected in
all of this DO construct, including any other control structures nested within it.

HPF requires that any reduction statment occur in the range of the DO construct in which
its variable is protected: it must be lexically contained in the body of the
outermost loop in which it is protected.

A protected reduction variable may only occur in reduction statments,
and only in references to the left of the assignment operator and the
matching reference to the right of the assignment operator.  Outside of
the loop in which it is protected, however, it is treated as an ordinary
variable.

2.4.1 Example loop nests.

When loops are nested, a reduction variable may be protected in an independent outer loop
even though the reduction operations in which it occurs are nested inside an inner loop.
Moreover, the inner loop and any intervening loops may or may not be independent.

!   Nested Loop Example 1.  Inner loop is sequential

      x = 10
      outer: do while (x < 1 000 000)
         !hpf$ independent
         middle: do i = 1, n
	   inner: do j = 1, m
             !hpf$ reduce
	     x = x + j
   !         illegal to refer to x here, except in a reduction statement
	   enddo inner
   !       illegal to refer to x here, except in a reduction statement
         enddo middle
         print *, x
      enddo outer

Since the variable x occurs in a reduction statement in the body on the 
independent
middle loop, it is a protected reduction variable throughout that loop, including
inside the inner loop.   The outermost loop is not INDEPENDENT, and so x is not protected
in that part of its range not included in the middle loop.

This remains true if the inner loop is independent.   On the other
hand, a variable that occurs in a new clause may not be a reduction
variable in the body of a loop, although it may be in the body of a
contained loop:

!   Nested Loop Example 2.  Outer loop new clause.

      !hpf$ independent
      outer: do k = 1, 100
         !hpf$ independent, new x
         middle: do i = 1, n
            x = 10
            !hpf$ independent
            inner: do j = 1, m
              !hpf$ reduce
              x = x + j**2
              ! illegal to refer to x here, except in a reduction statement
            enddo inner
            y(i) = x
         enddo middle
      enddo outer

Here, x is a protected reduction variable only in the inner loop.


The reduction statment may not occur in a subprogram called from the body of the
independent loop.   To get this effect, return the updating value from the
subprogram and perform the update with a reduction statement in the lexical loop
body.

	program example
        integer x, xinc, i
	
	x = 10
        !hpf$ independent, new(xinc), reduction(x, combine=+, identity = 0)
	do i = 1, n
            call sub(i, xinc)
            !hpf$ reduce
	    x = x + xinc
	enddo

	subroutine sub(i, xinc)
	integer i, xinc, local_b
	  local_b = (i-1)**2
          xinc = i * local_b
	end subroutine sub

2.5 Relaxation of the HPF Standard's requirement for INDEPENDENT loops.

In Section 4.4, the first proviso on what may not occur in the loop is modified to:

*  Any two operations that assign to the same object interfere with each
other, 
(begin added text)
unless the object is assigned a new value in a reduction statement.
(end added text)

Thus, the user's assertion implied by INDEPENDENT about the behavior of
the program is weakened, and no longer applies to reduction variables
that are protected in the DO construct.   The value of such a variable
on exit from the loop is processor dependent, although it is
constrained by the model implementation mechanism given in Section 3.


2.6  Reduction Operators May Be Mixed.

In most cases, only one operator will be used in the reduction
statements (if there are more than one) that update a given reduction
variable.   It is sensible, however, to use different operators, such
as + and -, together on the same reduction variable: mathematically,
subtraction is just addition of the additive inverse.  The language
makes no restriction on the mix of operators the programmer chooses to
use.

3.  Implementation and Semantics  

Following an independent loop, the value of any reduction variable that
was protected in the loop is not completely defined by HPF.   One possible
value is that which would have been computed by sequential execution of
the loop, but others are also possible.

While HPF does not explicitly define
the set of possible values, it does give a precise implementation
mechanism that serves to define this set.

Since no reference to a protected reduction variable can occur outside of
a reduction statement in the loop body, it is not necessary to define
the values that these variables may have inside the loop.

In this discussion, the term processor means a single physical
processor OR a group of physical processors that together sequentially
execute some or all of the iterations of an independent loop.

The value on exit from the loop is to be computed as follows.  Each
processor makes a local copy of the reduction variable, and initializes
it to the identity value specified in the reduction-clause if any, or
the the identity element for the intrinsic reduction operator if there
is no reduction-clause.
The identity elements for the intrinsic operators are as follows:

	Operator	Identity Element
	-------		----------------
	+		0
	-		0
	*		1
	/		1
	.and.		.true.
	.or. 		.false.
	.eqv.		.true.
	.neqv.          .false.

	//		''

The following intrinsics may be used as reduction-functions, with the indicated
identity elements:


	Operator	Identity Element
	-------		----------------
	IAND(I,J)	(all ones)
	IOR(I,J)        0
	IEOR(I,J)       0
	MIN(X,Y)	-HUGE(X) where x has the same type and kind type
			parameter as the reduction variable
	MAX(X,Y)	+HUGE(X) where x has the same type and kind type
                        parameter as the reduction variable
        MATMUL(A,B)     The identity matrix of the same shape, type, and kind type
			parameter as A, provided the shape of A is square, i.e.
			RANK(A) is 2 and SIZE(A,1) is the same as SIZE(A,2).   As a
                        consequence of the form A = MATMUL(A, B) of the reduction
			operation, it must also be true that 
			SIZE(A,2) = SIZE(B,1) = SIZE(B,2).
			

Each processor performs a subset of the loop iterations;  when it encounters
a reduction statement, it updates its
own copy of the reduction variable.    A processor is free to perform
its loop iterations in any order; furthermore, it may start an iteration, suspend
work on it, do some or all of the work of other iterations, and resume work on the
suspended iteration.   However, any update of a local reduction variable occurs through
the execution of a reduction statement, and reduction statements *are* executed
atomically.

The final value of the reduction variable is computed by combining the
local values with the value of the global reduction variable on entry
to the loop, using the combining operator.  The ordering of this
reduction is language processor dependent, just as it is for the
intrinsic reduction functions (SUM, etc.)

3.1   Loop nests; protection of reduction variables

When loops are nested, a reduction variable may be protected in an independent outer loop
even though the reduction operations in which it occurs are nested inside an inner loop.
Moreover, the inner loop and any intervening loops may or may not be independent.
The general rule is that 
it is protected while is the *outermost* independent loop in which it does not appear
in a NEW clause is active.
It is therefore protected throughout all iterations
of any independent loop in which it occurs in a reduction statment, unless it
appears in a NEW clause in the independent directive that applies to the loop.

For example:

!   Nested Loop Example 1.  Inner loop is sequential

      x = 10
      !hpf$ independent
      do i = 1, n
	do j = 1, m
          !hpf$ reduce
	  x = x + j
!         illegal to refer to x here, except in a reduction statement
	enddo
!       illegal to refer to x here, except in a reduction statement
      enddo
      print *, x

Since the variable x occurs in a reduction statement in the body on the independent
outer loop, it is a protected reduction variable throughout that loop, including
inside the inner loop.

This remains true if the inner loop is independent.   On the other
hand, a variable that occurs in a new clause may not be a reduction
variable in the body of a loop, although it may be in the body of a
contained loop:

!   Nested Loop Example 2.  Outer loop new clause.

     !hpf$ independent, new x
      do i = 1, n
        x = 10
        !hpf$ independent
        do j = 1, m
        !hpf$ reduce
          x = x + j**2
!         illegal to refer to x here, except in a reduction statement
        enddo
        y(i) = x
      enddo
!  In contrast to the example above, the value of x is not defined after the loop,
!  although it can be redefined here

Here, x is a protected reduction variable only in the inner loop.

The reduction statment may not occur in a subprogram called from the body of the
independent loop.   To get this effect, return the updating value from the
subprogram and perform the update with a reduction statement in the lexical loop
body.

	program example
        integer x, xinc, i
	
	x = 10
        !hpf$ independent, new(xinc), reduction(x, combine=+, identity = 0)
	do i = 1, n
            call sub(i, xinc)
            !hpf$ reduce
	    x = x + xinc
	enddo

	subroutine sub(i, xinc)
	integer i, xinc, local_b
!  Since local_b is automatic it is effectively NEW
	  local_b = (i-1)**2
          xinc = i * local_b
	end subroutine sub

4.   Examples.

For intrinsic operators and scalars, the reduction-clause is unnecessary.

      !hpf$ independent
      do i = 1, n
         ...
         !hpf$ reduce
         x = x + <expr1>
         ...
         !hpf$ reduce
         x = x - <expr2>
      enddo

The reference may be to a scalar array element:
      !hpf$ independent
      do i = 1, n
          !hpf$ reduce
          x(indx(i)) = x(indx(i)) + <expr>
      enddo


It may be to a section of an array:

      !hpf$ independent
      do i = 1, n
          !hpf$ reduce
          x(i: i+4) = x(i: i+4) + <expr>   ! as many as 5 updates to elements of x
      enddo

Note that the compiler, if it distributes iterations of this loop in a block-wise
manner, will not need to make a private copy of the entire array x.

Consider the loop:

      !hpf$ independent
      do i = 1, n
          !hpf$ reduce
          x(ix(i)) = <expr>
      enddo

This is syntactically disallowed.   The effect that is apparently wanted
here, namely to store in x(k) some unspecified one of the subset of instances of
<expr> for which ix(i) == k (a copy_scatter) can be achieved by using the
second form of reduction statement.

In the following example the user decides to accumulate sparse local updates to 
a distributed dense matrix.

      type real_sparse_matrix
          integer nrows, ncols, nnz ! shape, number of nonzeros
          real, pointer :: vals(:)
      end type real_sparse_matrix

      interface operator (+)
        function full_plus_sparse(x,y)
        real x(:,:)
        type (real_sparse_matrix) :: y
	end function full_plus_sparse

        function sparse_plus_sparse(x,y)
        type (real_sparse_matrix) :: x, y
	end function sparse_plus_sparse
      end interface

      function spzeros(nr, nc) result(res)
        type (real_sparse_matrix) res
	integer nr, nc
	res%nrows = nr
	res%ncols = nc
	res%nnz = 0
	allocate(res%vals(0))
      end function spzeros

      function elt(i, j, x) result(res)
        type (real_sparse_matrix) res
	integer i,j
	real x
	res%nrows = nr
	res%ncols = nc
	res%nnz = 1
	allocate(res%vals(1))
	res%vals(1) = x
      end function elt

      real gs(n,n)
      !hpf$ distribute (block, block) :: gs
      type (real_sparse_matrix) :: s

      s = spzeros(n,n)
      !hpf$ independent, reduction(s; combine = +, identity = spzeros(n,n))
      do i = 1, num_updates
         ...
         !hpf$ reduce
         s = s + elt(row, col, val) ! adds in elt at position (row, col)
      enddo
      gs = gs + s     ! full_plus_sparse













---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Sep 12 11:10:55 1995
Received: by cs.rice.edu (LAA08659); Tue, 12 Sep 1995 11:10:55 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (LAA08654); Tue, 12 Sep 1995 11:10:51 -0500
From: Jaspal.Subhlok@n2.sp.cs.cmu.edu
Message-Id: <199509121610.LAA08654@cs.rice.edu>
Date: Tue, 12 Sep 95 12:08:25 EDT
To: hpff-task@cs.rice.edu
Subject: hpff-task: Proposal for task parallelism
Cc: bwolen@CS.CMU.EDU
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Here is a modified task parallelism proposal. This proposal is not
quite formal since the final wording will be somewhat dependent on
SUBGROUP and ON features which are under development. On the bright
side, I think it is relatively easy to read and better debugged 
than earlier proposals.

Jaspal and Bwolen
----------------------------------------------------------------------------
Assumptions of functionality available from related features:

1) A way to group processors into subgroups P1, P2, P3 and distribute
   variables onto them, i.e. a1, a2, a3.

2) An ON construct  that directs execution on groups of processors
   P1, P2, P3 etc. for a block of code.


PROPOSAL:

A "task region" is a single entry, single exit region
delimited by (say) TASK REGION  .... END TASK REGION.
A task region can have blocks of code that are directed
to execute ON a processor subgroup. All other code executes
on all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region.

A code block directed to execute ON a subgroup must be a single
entry single exit region.

A code block directed to execute ON a subgroup P may access a variable
location not mapped to P  only if that variable location is:

a) accessed exclusively in code  directed to execute ON P. 
               OR
b) not written to in the task region.

[There are no other access constraints: Code  executing on ALL processors
 has unrestricted  access to all variable locations. A code block directed
 to execute ON a subgroup P has unrestricted access to all variable locations
 mapped to P]


An I/O operation in a code section directed to execute ON a subgroup
may not ``interfere'' with an I/O operation in a code section not
explicitly directed to execute on that subgroup. The interference 
of I/O operations is detailed in Section 4.4 (INDEPENDENT).

For a subroutine call inside an ON block, "all available processors"
are processors in the corresponding subgroup. This is the number that is
used for mapping the parameters of the subroutine. [This part will
become more specific after the syntax etc. of creating subgroups is
decided. There should probably be a system inquiry function for the
number of processors in the current subgroup, if NUMBER_OF_PROCESSORS()
is supposed to return the total number of processors for the program]

COMPILATION/EXECUTION MODEL

The execution model for a subgroup is  to unconditionally execute code
ON it, unconditionally skip code  ON others, and participate in the
execution of common code (on ALL processors) as normal data parallel
code. 
[The access restrictions  guarantee that the results will be consistent
 with pure data parallel execution. A processor group cannot be "invisibly"
 writing to a location being accessed by ALL or another processor group, and
 vice versa]

Following is one model for accessing variables in a task region:

Accesses to variables owned by other processors is cooperative, i.e.
the owner sends the value, and the user receives it, with one exception -
when code ON a subgroup has to access a variable not mapped to it, it
use a remote fetch/deposit. (It can also cache remote locations locally in
the subgroup for the duration of the execution of the task region since 
computation not ON that  subgroup cannot access it)

EXAMPLE: 2DFFT

Sequential:


      real, dimension(n,n) :: a1, a2

      do while(.true.)
          read (unit = iu, end = 100) a1
          call rowfft(a1)
          a2 = a1
          call colfft(a2)
          write (unit = ou) a2
          cycle
100       continue
          exit
      enddo


Pipelined Data/Task Parallel HPF

	real dimension(n,n) :: a1,a2
        boolean done1
!hpf$   disjoint processor groups P1, P2 (Syntax TBA)
!hpf$   distribute a1(block,*) onto P1
!hpf$   distribute a2(*,block) onto P2
!hpf$   distribute done1 onto P1
                 
!hpf$   TASK REGION
        done1 = .false.
        do while (.true.)
!hpf$       ON HOME(P1) BLOCK 
              read (unit = iu,end=100) a1
              call rowfft(a1)
              goto 101
    100       done1 = .true.
    101       continue
!hpf$       END BLOCK
            
            if (done1) exit
            a2 = a1

!hpf$       ON HOME(P2) BLOCK
               call colfft(a2)
               write(unit = ou) a2
!hpf$       END BLOCK
        enddo
!hpf$   END TASK REGION


The data parallel code on the two processor
groups will look something like this, after
the task region is compiled.

Processor Group P1:

	real dimension(n,n) :: a1
!hpf$   distribute a1(block,*)
        boolean done1

	done1 = .false.
	do while (.true.)
           read (unit = iu,end=100) a1
           call rowfft(a1)
           goto 101
    100    done1 = .true.
    101    continue
           _send(done1,P2)
           if (done1) exit
	   _send(a1,P2)
	enddo


Processor group P2:

	real dimension(n,n) :: a2
!hpf$   distribute a2(*,block)
        boolean local_done1

	do while (.true.)
           _receive(local_done1,P1)
          if (local_done1) exit 
          _receive(a2,P1)
          call colfft(a2)
          write(unit = ou) a2
	enddo




 COMMENTS:
 
1) The main difference with the simple parallel section/region,
   (or using an INDEPENDENT do loop to achieve parallel sections),
   is that  task regions presented can have code that executes on  ALL
   processors also. If it has no such code, it is similar to parallel
   section/region. However, allowing other code makes this construct more
   general, and implicitly allows pipelining in particular. At the same 
   time, existence of code in ALL can constrain parallelism due to data
   dependence, and in the worst case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an ON
   block, any variable being read (or used for control flow) cannot be written
   by any other processor group - it can be only written by ALL processors,
   in which case the control flow from the subgroup must also reach that point.

   Outside an ON block, all processor groups execute all control flow (and
   other) statements. If a subgroup skips a control construct because it is
   not involved( i.e. its variables are not involved and there is no code 
   inside the scope of the control construct that is directed to execute
   ON it) and continues to execute its next ON block, the constraints
   ensure that it cannot write to a location that is used for managing control
   flow.
   
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Sep 13 12:52:30 1995
Received: by cs.rice.edu (MAA26436); Wed, 13 Sep 1995 12:52:30 -0500
Received: from theory.tc.cornell.edu by cs.rice.edu (MAA26424); Wed, 13 Sep 1995 12:52:23 -0500
Received: (from presberg@localhost) by theory.tc.cornell.edu (8.6.9/8.6.6) id NAA169376; Wed, 13 Sep 1995 13:52:22 -0400
Date: Wed, 13 Sep 1995 13:52:22 -0400
From: David Presberg <presberg@tc.cornell.edu>
Message-Id: <199509131752.NAA169376@theory.tc.cornell.edu>
To: hpff-task@cs.rice.edu
Cc: schreibr@frey.riacs.edu (Rob Schreiber), presberg@tc.cornell.edu
Subject: re: hpff-task: Reduction proposal
In-Reply-To: <199509082225.PAA02022@frey.riacs.edu>
References: <199509082225.PAA02022@frey.riacs.edu>
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Quick reaction concerning the simplest reduction cases after having
only read about 1/6th of the text:

 - it is good that the "reduction-clause" is optional for the simple
cases: "... intrinsic operators applied to a reduction variable of
intrinsic type."  BUT that leaves the status of reductions formed from
(Fortran) intrinsic functions in doubt as read with the mandate for
"reduction-clause" for "...user-defined reduction operators". I.e., no
mention of optional/mandatory for "reduction-function"-using forms. Note
that your first example in section 2.3 uses MAX(...) --- clearly not a
"user-defined reduction operator" nor an "intrinsic operator".

  [<Formalistic-aside> The proper Fortran standardeze phrase
  would be '"intrinsic-operator" of an "intrinsic binary operation"'.
  There is English pseudo-syntax in F90 Section 7.1.2 that covers that
  topic.  It is there because the actual syntax doesn't slice the topic in
  a convenient way to allow talking collectively about all binary
  operators. </Formalistic-aside>]

 - having seen the !HPF$ REDUCTION early in the text, I wondered
immediately about the need for the "reduction-clause" when I
encountered it.  Not having read further ahead (yet) I don't know if
_both_ are required...  But I note that the simplest cases (e.g., that
intrinsic operator case and the one that is the easy generalization
from "x = max( x, expr )" ) are very easily discovered by any compiler
with the technology for doing the rest of HPF.  (I cite an early
1970's automatic parallelizing (extended) Fortran 77 compiler for a
SIMD machine -- David L. and Carl O. stop grimacing.)  Thus for that
simple case, neither the (statement form) directive nor the clause
ought to be required.

 - there is a serious flaw in the design of the "reduction-clause":
the "; IDENTITY = value" phrase immediately leads to some (potential)
semantic difference in the HPF as processed by an HPF system and the
(serial) Fortran as processed by a non-HPF Fortran system.  

==> Please drop that initialization phrase in favor of:
    
    (a) tabulating in the HPF standard the "zero"-element-value for
    each of the intrinsic operators and suitable binary intrinsic
    functions, and
    
    (b) thinking a bit harder about how to convey that same idea for
    "defined-binary-op"s or "reduction-function"s (particularly
    user-defined functions).  Perhaps you'll have to settle on e.g. "0
    of the correct data type" for these cases.

(Is there text later in the discussion about an initial value of the
reduction variable prior to the INDEPENDENT loop with the reduction?
Slap me down if I didn't read far enough before reacting.)


Have fun next week.

-- Pres
- ------------------------------------------------------------------------
- David L. Presberg, Parallel Computing Specialist, CTC/ST/Tools
-   Cornell Theory Center, 740 Frank H.T. Rhodes Hall, Cornell University,
- Ithaca, NY 14853-3801, USA  607-254-8861  presberg@tc.cornell.edu
- Nickname:  Pres
- ------------------------------------------------------------------------
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Wed Sep 13 14:07:12 1995
Received: by cs.rice.edu (NAA27425); Wed, 13 Sep 1995 13:13:52 -0500
Received: from [128.42.1.213] by cs.rice.edu (NAA27407); Wed, 13 Sep 1995 13:13:43 -0500
Date: Wed, 13 Sep 1995 13:13:43 -0500
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v0153050aac7c80d69ce0@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Jaspal.Subhlok@n2.sp.cs.cmu.edu
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: Re: hpff-task: Proposal for task parallelism
Cc: hpff-task
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

>Here is a modified task parallelism proposal. This proposal is not
>quite formal since the final wording will be somewhat dependent on
>SUBGROUP and ON features which are under development. On the bright
>side, I think it is relatively easy to read and better debugged
>than earlier proposals.

I tend to agree, and will make sure that it gets printed for the next meeting.

A couple details follow...

>The following restrictions must hold for the code inside a task region.
>
>A code block directed to execute ON a subgroup must be a single
>entry single exit region.

Good news, this is part of the ON proposal already.  (Albeit phrased as "no
jumps into or out of the region)

>A code block directed to execute ON a subgroup P may access a variable
>location not mapped to P  only if that variable location is:
>
>a) accessed exclusively in code  directed to execute ON P.
>               OR
>b) not written to in the task region.

Shouldn't the second constraint be
b) not written to by code in another ON block in the task region.
?

Another way of asking this is, "Aren't the sequential regions also in the
task region?"

>COMPILATION/EXECUTION MODEL
>
>The execution model for a subgroup is  to unconditionally execute code
>ON it, unconditionally skip code  ON others, and participate in the
>execution of common code (on ALL processors) as normal data parallel
>code.
>[The access restrictions  guarantee that the results will be consistent
> with pure data parallel execution. A processor group cannot be "invisibly"
> writing to a location being accessed by ALL or another processor group, and
> vice versa]

Right motivation.  But if processor groups can't write to locations read by
ALL, how is data going to flow from group to group?

>The data parallel code on the two processor
>groups will look something like this, after
>the task region is compiled.

Definitely "might" look something like this.  I expect many compilers,
especially those on scalable machines, to produce SPMD code with IF
statements (checking local data, like myproc()) about where the ON
directives are.

>2) No explicit control dependence constraints are required. Inside an ON
>   block, any variable being read (or used for control flow) cannot be written
>   by any other processor group - it can be only written by ALL processors,
>   in which case the control flow from the subgroup must also reach that point.
>
>   Outside an ON block, all processor groups execute all control flow (and
>   other) statements. If a subgroup skips a control construct because it is
>   not involved( i.e. its variables are not involved and there is no code
>   inside the scope of the control construct that is directed to execute
>   ON it) and continues to execute its next ON block, the constraints
>   ensure that it cannot write to a location that is used for managing control
>   flow.

In the following sequence, does P1 execute the GOTO?  It doesn't involve
any data mapped to P1, nor is it directed ON P1.

        !HPF$ DISTRIBUTE A1(BLOCK) ONTO P1
        !HPF$ DISTRIBUTE A2(BLOCK) ONTO P2

        !HPF$ TASK REGION
        IF (A2(2) > 0) THEN
                !HPF$ ON HOME(P2) BLOCK
                ... do something with A2 ...
                !HPF$ END BLOCK
                GOTO 100
        END IF
        !HPF$ ON HOME(P1) BLOCK
        ... do something with A1 ...
        !HPF$ END BLOCK
        100 CONTINUE
        !HPF$ END REGION

I think I know what you mean.  Probably just need to be a little more
careful about the execution model, in particular, what does "normal
data-parallel model" mean?


                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Sep 14 15:23:50 1995
Received: by cs.rice.edu (OAA22148); Thu, 14 Sep 1995 14:48:12 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (OAA22131); Thu, 14 Sep 1995 14:48:01 -0500
From: Jaspal_Subhlok@n2.sp.cs.cmu.edu
Received: from N2.SP.CS.CMU.EDU by N2.SP.CS.CMU.EDU id aa19220;
          14 Sep 95 15:12:41 EDT
To: Chuck Koelbel <chk@cs.rice.edu>
cc: hpff-task@cs.rice.edu
Subject: Re: hpff-task: Proposal for task parallelism 
In-reply-to: Your message of "Wed, 13 Sep 95 13:13:43 CDT."
             <v0153050aac7c80d69ce0@[128.42.1.213]> 
Date: Thu, 14 Sep 95 15:12:39 -0400
Message-ID: <19218.811105959@N2.SP.CS.CMU.EDU>
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


Chuck,

(This refers to  the task parallelism proposal and Chuck's last email
about it, neither of which is included)

You point out some text on access restrictions that seems to be in 
error. I think the restrictions may be more subtle than they appear, 
but the text is what is intended. I will add a general motivation
section along the lines of the next paragraph in the next revision -
please take a look and mail again if you think something needs to be changed.


   TASKING MODEL: Subgroups ``normally'' read and write only to/from the
   variables that are mapped to them. The code in ALL has unrestricted
   access to all variables, and data is exchanged between subgroups by
   copying the variables of one subgroup to the variables of another
   subgroup in the ALL code.

   In some circumstances, a subgroup may want to access a variable NOT mapped
   to it (e.g. a common block). Such access is allowed, but it must not cause
   a  data dependence during the entire duration of the execution of the
   task region. (Even dependence between a subgroup and ALL are not allowed
   since that would imply that when ALL is executing no other subgroup
   can execute as there is no easy way for a compiler to figure out what
   variable a subgroup may access ). A sufficient condition for
   "no dependences" is that such accesses should only be to variables that 
   are ``read only'' in the task region, or ``read and written'' only by code 
   ON only one subgroup.


Other points are well taken and I will add corrections/clarifications in
the next revision.  Yes, the compiled code is just one way
to do it, and meant only for illustration of the task constructs.
Every processor executes control constructs in ALL unless the compiler
can determine that it is not necessary. In general, a GOTO is certainly
executed by everybody. The  "normal data-parallel model" is supposed to
mean that execution follows the normal HPF semantics (some say no such
thing exists :) but I am not going to fix that) without any notion of
task parallelism constraints.

Let me know if there are unresolved things or other comments.



jaspal
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Sep 14 15:34:01 1995
Received: by cs.rice.edu (PAA24304); Thu, 14 Sep 1995 15:34:01 -0500
Received: from [128.42.1.213] by cs.rice.edu (PAA24254); Thu, 14 Sep 1995 15:33:46 -0500
Date: Thu, 14 Sep 1995 15:33:46 -0500
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530511ac7dfa7c698e@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Jaspal_Subhlok@n2.sp.cs.cmu.edu
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: Proposal for task parallelism
Cc: hpff-task
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Sounds to me like you're addressing my concerns.  Maybe it would also help
to explicitly mention that a processor group always has write access to
data mapped to it.  Also, note that "normal data-parallel mode" on some
machines may mean GET/PUT access; copying data in this way in ALL may not
be sufficient for the synchronization that you need.

                                                Chuck



---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Sep 14 16:06:13 1995
Received: by cs.rice.edu (QAA25939); Thu, 14 Sep 1995 16:06:13 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (QAA25923); Thu, 14 Sep 1995 16:06:01 -0500
From: Jaspal_Subhlok@n2.sp.cs.cmu.edu
Received: from N2.SP.CS.CMU.EDU by N2.SP.CS.CMU.EDU id aa20485;
          14 Sep 95 17:05:42 EDT
To: Chuck Koelbel <chk@cs.rice.edu>
cc: hpff-task@cs.rice.edu
Subject: hpff-task: Re: Proposal for task parallelism 
In-reply-to: Your message of "Thu, 14 Sep 95 15:33:46 CDT."
             <v01530511ac7dfa7c698e@[128.42.1.213]> 
Date: Thu, 14 Sep 95 17:05:41 -0400
Message-ID: <20482.811112741@N2.SP.CS.CMU.EDU>
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------



> 
> Sounds to me like you're addressing my concerns.  Maybe it would also help
> to explicitly mention that a processor group always has write access to
> data mapped to it. 

Actually it is mentioned that processor groups have unrestricted access
to their variables, but only as a comment. Excerpt:

[There are no other access constraints: Code  executing on ALL processors
 has unrestricted  access to all variable locations. A code block directed
 to execute ON a subgroup P has unrestricted access to all variable locations
 mapped to P]

Perhaps it should be in the main text.

> Also, note that "normal data-parallel mode" on some
> machines may mean GET/PUT access; copying data in this way in ALL may not
> be sufficient for the synchronization that you need.
> 

If GET/PUT access is used, the compiler has to include explicit
synchronization like barriers for normal data parallel processing.
The compiler has to use "subset barriers" to be able to exploit task
parallelism. One concern is that if a full barrier is used around
ALL sections, then the program may be oversynchronized, so subgroup
barriers should be used in ALL as needed. e.g. if a variable in
group1 is copied to a variable in group2, correct use of subgroup
barriers will allow group3 processors to continue if there is no
other code in ALL that needs them.

Anyway, I think this is about efficient  compilation in GET/PUT model 
and I don't think it changes the specification or the basic execution
model.

jaspal
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Thu Sep 14 16:57:12 1995
Received: by cs.rice.edu (QAA28618); Thu, 14 Sep 1995 16:57:12 -0500
Received: from [128.42.1.213] by cs.rice.edu (QAA28588); Thu, 14 Sep 1995 16:56:59 -0500
Date: Thu, 14 Sep 1995 16:56:59 -0500
X-Sender: chk@titan.cs.rice.edu
Message-Id: <v01530515ac7e086eb06d@[128.42.1.213]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Jaspal_Subhlok@n2.sp.cs.cmu.edu
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: hpff-task: Re: Proposal for task parallelism
Cc: Chuck Koelbel <chk@cs.rice.edu>, hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

At 17:05 09/14/95, Jaspal_Subhlok@N2.SP.CS.CMU.EDU wrote:
>>
>> Sounds to me like you're addressing my concerns.  Maybe it would also help
>> to explicitly mention that a processor group always has write access to
>> data mapped to it.
>
>Actually it is mentioned that processor groups have unrestricted access
>to their variables, but only as a comment. Excerpt:
>
>[There are no other access constraints: Code  executing on ALL processors
> has unrestricted  access to all variable locations. A code block directed
> to execute ON a subgroup P has unrestricted access to all variable locations
> mapped to P]
>
>Perhaps it should be in the main text.

I think so.  It's an important point.

>> Also, note that "normal data-parallel mode" on some
>> machines may mean GET/PUT access; copying data in this way in ALL may not
>> be sufficient for the synchronization that you need.
>>
>...
>Anyway, I think this is about efficient  compilation in GET/PUT model
>and I don't think it changes the specification or the basic execution
>model.
>
>jaspal

Agreed.  Is it true that your execution model assumes that the owners of A
and B both participate in A=B (at least in the sense that they do some
synchronization)?  If so, I'd feel better if this were explicitly stated
somewhere.  Yes, in some sense the synchronization is also implied in
straight HPF - but here the model is explicitly multi-threaded, as opposed
to the single-threaded model in HPF-default mode.

                                                Chuck


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Fri Sep 15 10:35:05 1995
Received: by cs.rice.edu (KAA27636); Fri, 15 Sep 1995 10:35:05 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (KAA27626); Fri, 15 Sep 1995 10:34:58 -0500
From: Jaspal_Subhlok@n2.sp.cs.cmu.edu
Received: from N2.SP.CS.CMU.EDU by N2.SP.CS.CMU.EDU id aa26200;
          15 Sep 95 10:56:33 EDT
To: Chuck Koelbel <chk@cs.rice.edu>
cc: hpff-task@cs.rice.edu
Subject: Re: hpff-task: Re: Proposal for task parallelism 
In-reply-to: Your message of "Thu, 14 Sep 95 16:56:59 CDT."
             <v01530515ac7e086eb06d@[128.42.1.213]> 
Date: Fri, 15 Sep 95 10:56:32 -0400
Message-ID: <26198.811176992@N2.SP.CS.CMU.EDU>
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


> 
> Agreed.  Is it true that your execution model assumes that the owners of A
> and B both participate in A=B (at least in the sense that they do some
> synchronization)?  If so, I'd feel better if this were explicitly stated
> somewhere.  Yes, in some sense the synchronization is also implied in
> straight HPF - but here the model is explicitly multi-threaded, as opposed
> to the single-threaded model in HPF-default mode.
> 
>                                                 Chuck
>

OK, it will probably make things clearer to state that owners of A
and B both participate in an A=B. But the main point is that execution 
model in ALL is same as regular HPF, (with extra information that subgroup
computations can be assumed to only read and modify subgroup variables
from ALL's perspective), and it should be stated in those 
terms, else we have to detail the behavior in all scenarios. Perhaps
some illustration will make things easier.

jaspal

---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Fri Sep 15 14:40:13 1995
Received: by cs.rice.edu (OAA12944); Fri, 15 Sep 1995 14:40:13 -0500
Received: from frey.riacs.edu by cs.rice.edu (OAA12930); Fri, 15 Sep 1995 14:40:00 -0500
Received: by frey.riacs.edu (8.6.12/1.35)
	id MAA05389; Fri, 15 Sep 1995 12:43:54 -0700
Date: Fri, 15 Sep 1995 12:43:54 -0700
From: schreibr@frey.riacs.edu (Rob Schreiber)
Message-Id: <199509151943.MAA05389@frey.riacs.edu>
To: hpff-task@cs.rice.edu
Subject: hpff-task: reduction comments
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

Original comments by Pres, and replies by me.

-- Rob


	Major items:

	 - I am still not comfortable with the " ; IDENTITY = value" clause.
	What if the user codes:

	       program AbsurdInitialization
	       x=0
	!HPF$ INDEPENDENT, REDUCTION( x; COMBINE = +; IDENTITY = 2. )
	       do i=1,100
	!HPF$    REDUCE
	         x = x + log(i)
	       enddo
	       print *,x
	       end

	Should they be allowed to state an absurd "zero" value for an
	intrinsic operation?  
Absurd, hardly.  The semantics are well defined.   It prints a
processor dependent approiximation to 2*number_of_processors() +
sigma(i=1 to 100) log(i)

I have changed it, however, to forbid combine and identity phrases (clauses?) in the
optional reduction clause for an intrinsic operators.   
Maybe we should just bar the whole REDUCTION clause on the independent directive
for intrinsic reductions.   Should we define the term "intrinsic reduction" in the
text?

        (I wonder if the keyword would be more intuitive
	as "ZERO" rather than "IDENTITY".)  
As in, like, 
!HPF$ INDEPENDENT, REDUCTION(X, COMBINE = *, ZERO = 1)
The mind boggles.

        If each processor initializes its
	local value of x to 2., it would seem that the HPF interpretation of
	this code (if run on more than one processor) would give wildly
	different results than a sequential F90 interpretation.
Yes, a very BAD processor dependent approximation to the sequential result.
This is disconcerting, but I dont see any way around this for user defined
operators, except to REQUIRE that the user create a value "foo" for
his operator bar(x,y) that has the property that bar(foo,y) = y for all y.
Should we.

	I agree that you need a hook for the "zero" value of a user-defined
	operator or reduction function, but I think you also need either:

	 (a) constraint language about the defined state, _and_ _defining_
	_value_ of the reduction variable prior to entry to the loop nest with
	the reduction, or
Do you mean what I said above about the "identity element" property of the
reduction operator and the identity value?

	 (b) prohibit use of the clause for the trivial cases you have
	enumerated in section 3.
Those are the intrinsic cases, where we could certainly make this prohibition.

	 - Again, I think you've surrounded the syntax with so many lexical
	restrictions that any undergraduate senior-level compiler course
	project could find and deal with the reductions with absolutely _no_
	directives!!  Could you at least relax the form of the reduction
	statement to include:

	  var = expr op var

	and maybe even:

	  var = expr1 op var op expr2   ! with constraint that the 2 ops are indentical

No problem here semantically, I guess;   not my call.


	 - It would seem that one cannot:

	 ...
	!HPF$ REDUCE
	          var = var .op. f(x)
	 ...
	!HPF$ REDUCE
	          var = var .op. f(x)
	 ...

	Why not?
Oh?  Where is this rule?


	 - I think the constraints on types of lists of variables and on other
	things that should agree in type, etc., are a little slack.
Got to show me a bug, however, before I can fix them.


	Bottom line:  For user-defined operators and reduction functions, and
	defined-type reduction variables, the clause on the INDEPENDENT and
	the REDUCTION directive are (probably) necessary.  I think you really
	_only_ have to define *protected* for the other cases without any
	clauses and directives, and then you can do the relaxation of
	constraints on the body of INDEPENDENT for the simple situations and
	not have to require any extra stuff for those simple cases.

Can you be a little clearer?   If i understand you, in the case of
reductions with intrinsic operators, you suggest no reduction clause on
the loop and no reduce directive on the operation.   I like this idea;
however, we have, in that case, to prohibit mixed operators, so that
the combine operator and identity value are implied by the (one class
of) reduction operator/function that occurs in the loop.

Rob


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Mon Sep 18 21:36:57 1995
Received: by cs.rice.edu (VAA23040); Mon, 18 Sep 1995 21:36:57 -0500
Received: from squid.icase.edu by cs.rice.edu (VAA23032); Mon, 18 Sep 1995 21:36:52 -0500
Received: by squid.icase.edu (8.6.11/lanleaf8.6.4)
	id WAA06317; Mon, 18 Sep 1995 22:36:53 -0400
Date: Mon, 18 Sep 1995 22:36:52 -0400 (EDT)
From: Piyush Mehrotra <pm@icase.edu>
To: hpff-task@cs.rice.edu
Subject: hpff-task: extension to the on-clause and alternative for task regions
In-Reply-To: <199509121610.LAA08654@cs.rice.edu>
Message-ID: <Pine.SUN.3.91.950918223446.6305B-100000@squid.icase.edu>
Phone: (804)864-2188
Fax: (804)864-6134
WWW: http://www.icase.edu/~pm
Address: ICASE MS 132C NASA Langley Research Center Hampton VA 23681
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


I would like to propose an extension to Chuck's on-clause proposal
as follows:

simple-on-directive  IS  ON HOME ( home-expr ) [, LOCAL [ (local-var-list)] ]

That is the local-var-list on the LOCAL is optional with the semantics
that if the keyword LOCAL appears with no variable list, then
the programmer is asserting the fact that if the on-block (including 
any procedure calls) is executed on the specified set of processor(s),
all its references will be confined to the set of processor(s) 
and will not access any data outside this set.


Given this extension to the on-clause, I would like to
ask Jaspal if this solves his TASK region issue without introducing
any new syntax. It certainly seems to solve his pipeline problem as
follows:
        real dimension(n,n) :: a1,a2
        boolean done1
!hpf$   disjoint processor groups P1, P2 (Syntax TBA)
!hpf$   distribute a1(block,*) onto P1
!hpf$   distribute a2(*,block) onto P2
!hpf$   distribute done1 onto P1
                 
        done1 = .false.
        do while (.true.)
!hpf$       ON HOME(P1), LOCAL, BLOCK
              read (unit = iu,end=100) a1
              call rowfft(a1)
              goto 101
    100       done1 = .true.
    101       continue
!hpf$       END BLOCK
            
            if (done1) exit
            a2 = a1

!hpf$       ON HOME(P2), LOCAL, BLOCK
               call colfft(a2)
               write(unit = ou) a2
!hpf$       END BLOCK
        enddo

Thus the LOCAL clause without the local var list will imply
at least a part of what I understand Jaspal was proposing with 
his TASK regions.  This proposal does not allow an on-block with 
a LOCAL keyword to access data mapped onto another processor subset.
Question is how important is that?


	- Piyush
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Sep 19 02:05:17 1995
Received: by cs.rice.edu (CAA27803); Tue, 19 Sep 1995 02:05:17 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (CAA27797); Tue, 19 Sep 1995 02:05:11 -0500
From: Jaspal.Subhlok@n2.sp.cs.cmu.edu
Message-Id: <199509190705.CAA27797@cs.rice.edu>
Date: Tue, 19 Sep 95 00:15:38 EDT
To: hpff-task@cs.rice.edu, pm@icase.edu
Subject: Re:  hpff-task: extension to the on-clause and alternative for task regions
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


(Response to Piyush's proposal about using  LOCAL with ON to
achieve task parallelism, thus eliminating the need for
TASK REGION..END TASK REGION syntax)


Piyush,

  ON clauses with LOCAL would make it possible to write a class
  of task parallel programs. The tradeoffs with using a task
  region are:

  1. As you pointed out, there is no reasonable way to access
     global/common variables from tasks. At the previous HPF
     meetings, several people conveyed that the code in tasks should
     be general and should allow access to common blocks. 

  2. I am not sure how I/O constraints can be specified in this model
     (short of something  ugly like specifying unit numbers as LOCAL
     to an ON region). In the task region model (and in INDEPENDENT
     loops) I/O constraints can be specified relatively since a
     relevant code section is defined. Without a task region, the entire
     program has to be considered.


jaspal
  
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Sep 19 08:24:43 1995
Received: by cs.rice.edu (IAA03829); Tue, 19 Sep 1995 08:24:43 -0500
Received: from squid.icase.edu by cs.rice.edu (IAA03823); Tue, 19 Sep 1995 08:24:38 -0500
Received: by squid.icase.edu (8.6.11/lanleaf8.6.4)
	id JAA06425; Tue, 19 Sep 1995 09:24:31 -0400
Date: Tue, 19 Sep 1995 09:24:29 -0400 (EDT)
From: Piyush Mehrotra <pm@icase.edu>
To: Jaspal.Subhlok@n2.sp.cs.cmu.edu
cc: hpff-task@cs.rice.edu
Subject: Re: hpff-task: extension to the on-clause and alternative for task regions
In-Reply-To: <199509190705.DAA17140@icase.edu>
Message-ID: <Pine.SUN.3.91.950919090421.6411A-100000@squid.icase.edu>
Phone: (804)864-2188
Fax: (804)864-6134
WWW: http://www.icase.edu/~pm
Address: ICASE MS 132C NASA Langley Research Center Hampton VA 23681
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

> 
>   ON clauses with LOCAL would make it possible to write a class
>   of task parallel programs. The tradeoffs with using a task
>   region are:
> 
>   1. As you pointed out, there is no reasonable way to access
>      global/common variables from tasks. At the previous HPF
>      meetings, several people conveyed that the code in tasks should
>      be general and should allow access to common blocks. 
> 

There is no restriction on accessing global and/or common variables
with the LOCAL proposal, the only restriction is that they be mapped
onto the set of processors specified in the on clause.
Thus if a global variable or common block is mapped onto
a different subset of processors (or all the processors)
then the on-block cannot access the variable.

The problem with allowing access to any global variable is that
it involves other processors during the execution unless we make the
assumption that either a) there is support for one-sided communication
in hardware or software, or b) the compiler/runtime system determines
all globals accessed by an on-block (including all the way down the 
call chain) and then ensures that they are communciated to the processors
of the on-block before execution of the on-block.  Neither of these
approaches seems to be very palatable.


>   2. I am not sure how I/O constraints can be specified in this model
>      (short of something  ugly like specifying unit numbers as LOCAL
>      to an ON region). In the task region model (and in INDEPENDENT
>      loops) I/O constraints can be specified relatively since a
>      relevant code section is defined. Without a task region, the entire
>      program has to be considered.

One can certainly propose that replicated variables can be read in
any on-block to solve the above problem.  Note that in your pipeline
example (that I repeated) the variable done1 had to be mapped onto P1.

In your task region proposal, you have broken the specification 
into two parts, one a task region directive and another the 
on-block directive. The two combined together provide 
the effect that you want. In my proposal,  I was trying to 
bring the two together into the same directive thus hoping to
avoid the need for more syntax.

> 
> jaspal
>   
> 

	- Piyush

---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Sep 19 15:05:29 1995
Received: by cs.rice.edu (PAA21765); Tue, 19 Sep 1995 15:05:29 -0500
Received: from N2.SP.CS.CMU.EDU by cs.rice.edu (PAA21756); Tue, 19 Sep 1995 15:05:20 -0500
From: Jaspal_Subhlok@n2.sp.cs.cmu.edu
Received: from N2.SP.CS.CMU.EDU by N2.SP.CS.CMU.EDU id aa01346;
          19 Sep 95 15:02:25 EDT
To: Piyush Mehrotra <pm@icase.edu>
cc: hpff-task@cs.rice.edu
Subject: Re: hpff-task: extension to the on-clause and alternative for task regions 
In-reply-to: Your message of "Tue, 19 Sep 95 09:24:29 EDT."
             <Pine.SUN.3.91.950919090421.6411A-100000@squid.icase.edu> 
Date: Tue, 19 Sep 95 15:02:10 -0400
Message-ID: <1344.811537330@N2.SP.CS.CMU.EDU>
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


I will be happy if we the ON construct obviates the need for
begin...end task region, but I think the model you propose may
be too restrictive. Our earlier proposal was functionally equivalent
to this and my sense from the previous HPF meetings is:

a) Access to distributed global structures (common blocks) from
   tasks is important. This cannot be supported well without 1
   way communication.

b) Since 1 way communication is needed anyway to compile INDEPENDENT
   loops, there is no reason to constrain task parallelism for that
   reason.


>One can certainly propose that replicated variables can be read in
>any on-block to solve the above problem.  Note that in your pipeline
>example (that I repeated) the variable done1 had to be mapped onto P1.

I am not sure if I understand this. The problem I was thinking of was 
how to disallow parallel tasks from performing I/O from the same unit 
in parallel.

Anyway, perhaps we can discuss further this tomorrow.

jaspal
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------