From owner-hpff-task  Mon Oct  9 16:47:18 1995
Received: by cs.rice.edu (QAA10051); Mon, 9 Oct 1995 16:47:18 -0500 (CDT)
Received: from icarus.riacs.edu by cs.rice.edu (QAA10028); Mon, 9 Oct 1995 16:47:08 -0500 (CDT)
Received: from frey.riacs.edu by icarus.riacs.edu (8.6.12/2.7G)
	   id OAA26274; Mon, 9 Oct 1995 14:47:04 -0700
Received: by frey.riacs.edu (8.6.12/2.0N)
	   id OAA00853; Mon, 9 Oct 1995 14:51:25 -0700
Message-Id: <199510092151.OAA00853@frey.riacs.edu>
Date: Mon, 9 Oct 1995 14:51:25 -0700
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-task@cs.rice.edu, zosel@phoenix.ocf.llnl.gov
Subject: hpff-task: CCI's
Sender: owner-hpff-task
Precedence: bulk

---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF.  Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------


Here is my understanding of the four CCI items on our agenda.
Please correct:


Group C, CCI status.


Number 18.   Source:  Zongaro   date 5-4-95
Subject:  Defined assignment in Forall
Status:

Resolved.  CHK will provide clarification in the HPF 2 document.   It will be
a change to the "sequentialization" of the forall to account for defined
assignments.


Number 19:  Source:   Rosing    date 6-27-95
Subject:  independent Do; implementation questions

Status:  Resolved.   No unanticipated or controversial questions about the
semantics of HPF were raised, and we do not have much to say about the
implementation issues he raised.


Number 29:   Source:  Schreiber    date 8-3-95
Subject:   extrinsic local routines called in forall/independent


Status:   Under discussion.   Summary of the discussion in September:

1.  Only an HPF type routine can be PURE.  Thus, only an HPF_LOCAL
routine could be local, pure, and hence called in a forall.

2.  There is no semantic problem with this, or with invocation of any
extrinsic routine in an independent loop.

3.  An example

	real x(N, 100000)
	!hpf$ distribute x(*, block)

	!hpf$ independent
	do i = 1, N
           call extrinsic_fn(x(i,:))
	enddo

The independent loop is a mechanism for spawning N threads per processor, each
independent of the others;  the load per thread may be variable.  It would
possibly be useful here to NOT synchronize after each call to the extrinsic
routine!   Is there any semantic reason to force the synchronization?

4.  There are some very important issues in the implementation, with possible
language impacts.

Let us assume an MPI based implementation of HPF calling an extrinsic
local routine that uses MPI for communication.  Because of the unity of
purpose and of hotel between HPF and MPI, it is arguably necessary for
HPFF to make this work cleanly and efficiently.

Issue MPI-1:   Does HPF handle MPI_Init and MPI_Finalize automatically?

Issue MPI-2:   Are any of the MPI routines PURE?

Issue MPI-3:   Thread safety.   In a naive implementation, HPF does a barrier
before and after the call to the extrinsic;  but there is no guarantee that
there are no outstanding, nonreceived messages in the messaging system.  Thus, to
be safe, any extrinsic routine should use its own communicator.   To prevent
interference between separate calls to the routine, a new communicator should be
created for every call.

An obvious way to do this is to call

      MPI_Comm_Dup(MPI_COMM_WORLD, New_comm)

at the beginning of every such extrinsic.    However, an extrinsic
that consumes all its messages would be justified in doing this once, on its
first invocation, and saving the communicator for reuse on later invocations.

But consider what happens if the extrinsic is called in an independent DO loop
as in the example above, and there is no barrier used.  Now we really need a
separate communicator per thread.
On the other hand, a call to MPI_Comm_Dup is a collective call, which
synchronizes the processes.

Perhaps this should be done by the calling HPF routine, so that the
MPI_COMM_WORLD communicator is different on every call.
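
The interference concern can be illustrated with a toy model, written in
Python rather than MPI. ToyComm and extrinsic are hypothetical names, and
the model deliberately ignores source and tag matching; it only shows that
a message left unreceived by one call can be picked up by the next call
when both share a context, and that a duplicated context avoids this.

```python
from collections import deque


class ToyComm:
    """Toy stand-in for a communicator: one FIFO of pending messages."""

    def __init__(self):
        self.pending = deque()

    def dup(self):
        # A duplicate gets its own, initially empty, message context.
        return ToyComm()

    def send(self, msg):
        self.pending.append(msg)

    def recv(self):
        return self.pending.popleft()


def extrinsic(comm, tag, consume_all):
    """Hypothetical extrinsic: sends two messages, receives one or both."""
    comm.send((tag, 1))
    comm.send((tag, 2))
    first = comm.recv()
    if consume_all:
        comm.recv()
    return first


# Shared context: call 1 leaves a message behind, call 2 receives it.
world = ToyComm()
extrinsic(world, "call-1", consume_all=False)
shared = extrinsic(world, "call-2", consume_all=True)

# One duplicate per call: call 2 only ever sees its own messages.
world2 = ToyComm()
extrinsic(world2.dup(), "call-1", consume_all=False)
isolated = extrinsic(world2.dup(), "call-2", consume_all=True)

print(shared)    # ('call-1', 2)
print(isolated)  # ('call-2', 1)
```

In real MPI the duplication is itself collective, which is the cost
discussed in the text; the toy model does not capture that.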

Issue MPI-4:  If called from the range of an ON_HOME directive, what
set of processors does MPI_COMM_WORLD correspond to?   If it corresponds to the
subset executing the ON block, then how can the called routine access nonresident
data?   Should there be a way to access a communicator that corresponds to these
executing processors, while MPI_COMM_WORLD always corresponds to all of the
processors?

Issue MPI-5:  If called from separate ON_HOME blocks in the scope of
a TASK directive, with disjoint processor groups, so that the two
ON blocks may be executed concurrently, what communicators correspond to the
two processor groups?   (If, in issue 4 above, the answer is that MPI_COMM_WORLD
corresponds to the executing subset of the processors, then the answer
here is MPI_COMM_WORLD.)


Number ????     Source:   Meadows     date:   10-7-95
Subject:   Remapping at subprogram interface in INDEPENDENT

Status:   To be discussed.


---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
(un)subscribe <email-address>
---------------------------------------------------------------------------

From owner-hpff-task  Tue Oct 10 07:03:49 1995
Received: by cs.rice.edu (HAA04524); Tue, 10 Oct 1995 07:03:49 -0500 (CDT)
Received: from felix.dircon.co.uk by cs.rice.edu (HAA04516); Tue, 10 Oct 1995 07:03:34 -0500 (CDT)
Received: by felix.dircon.co.uk id AA08802
  (5.67b/IDA-1.5 for <hpff-task@cs.rice.edu>); Tue, 10 Oct 1995 13:03:32 +0100
Message-Id: <199510101203.AA08802@felix.dircon.co.uk>
Received: from ai072.du.pipex.com(193.130.248.72) by amnesiac via smap (V1.3)
	id sma008796; Tue Oct 10 13:03:25 1995
Received: by jim (5.x) id AA00537; Tue, 10 Oct 1995 09:59:06 +0100
To: hpff-task@cs.rice.edu
Subject: Re: hpff-task: CCI's 
In-Reply-To: Your message of "Mon, 09 Oct 1995 14:51:25 PDT."
             <199510092151.OAA00853@frey.riacs.edu> 
Date: Tue, 10 Oct 1995 09:59:06 +0100
From: "James Cownie <jcownie@bbn.com>" <jcownie@bbn.com>
Sender: owner-hpff-task
Precedence: bulk

A few comments on Rob's MPI related issues

> Issue MPI-1:   Does HPF handle MPI_Init and MPI_Finalize, automatically.
I would say that the HPF run-time should have called MPI_Init before
any user code has run, therefore user extrinsic functions which need
MPI can just use it.

This is actually not a big deal, since the user routine can always
use MPI_Initialized() to guard her call to MPI_Init. (Though if this
is done, then the HPF run-time needs to do the same, since MPI_Init
should only be called once.) That's why it's simpler to say that
MPI_Init has already been called before any user HPF code has run.
	
> Issue MPI-2:   Are any of the MPI routines PURE?
Probably. For instance one could cast reductions as functions which
return the result, and only read the arguments (though why you'd want
to use an MPI reduction extrinsically rather than an HPF one is beyond
me).

> Issue MPI-3: Thread safety.  In a naive implementation, HPF does a
> barrier before and after the call to the extrinsic; but there is no
> guarantee that there are no outstanding, nonreceived messages in the
> messaging system.  Thus, to be safe, any extrinsic routine should use
> its own communicator.  To prevent interference between separate calls
> to the routine, a new communicator should be created for every call.
> 
> An obvious way to do this is to call
> 
>       MPI_Comm_Dup(MPI_COMM_WORLD, New_comm)
> 
> at the beginning of every such extrinsic.  However, an extrinsic that
> consumes all its messages would be justified in doing this once, on
> its first invocation, and saving the communicator for reuse on later
> invocations.
> 
> But consider what happens if the extrinsic is called in an independent
> DO loop as in the example above, and there is no barrier used.  Now we
> really need a separate communicator per thread.
> 
> On the other hand, a call to MPI_Comm_Dup is a collective call, which
> synchronizes the processes.
> 
> Perhaps this should be done by the calling HPF routine, so that the
> MPI_COMM_WORLD communicator is different on every call.
> 
> Issue MPI-4:  If called from the range of an ON_HOME directive, what
> set of processors does MPI_COMM_WORLD correspond to?   If it corresponds to the
> subset executing the ON block, then how can the called routine access nonresident
> data?   Should there be a way to access a communicator that corresponds to these
> executing processors, while MPI_COMM_WORLD always corresponds to all of the
> processors?

> Issue MPI-5:  If called from separate ON_HOME blocks in the scope of
> a TASK directive, with disjoint processors groups, so that the two
> ON blocks may be executed concurrently, what communicators correspond to the
> two processor groups?   (If, in issue 4 above, the answer is that MPI_COMM_WORLD
> corresponds to the executing subset of the processors, then the answer
> here is MPI_COMM_WORLD.)

I would suggest that in all of these cases the HPF run-time should
provide a "current communicator" which includes the set of processes
running the current construct. In some cases this will be
MPI_COMM_WORLD (or a Comm_Dup of COMM_WORLD), in others (ON_HOME, task
parallelism, processor subsets) it will represent a subset of the
available processes. In MPI, MPI_COMM_WORLD is always available as the
set of all processes (until MPI-2 introduces a dynamic process model,
though that shouldn't worry HPF implementations).

Therefore I think
1) MPI_COMM_WORLD is *always* the set of all processes. (This is the
   current MPI view).
2) If you need subsets then you should create new communicators, and
   provide a way for the user code to access them. 

MPI_COMM_WORLD should mean the same thing in a routine called from HPF
extrinsic as it did in a "raw" MPI program. The HPF extrinsic MPI
environment should contain additions to the raw MPI environment (new
communicators, maybe pre-defined datatypes giving array distributions,
etc), but should not change the meaning of things in the raw MPI
world. In other words you may need to learn more to work in the HPF
extrinsic environment, but you shouldn't have to unlearn things you
already knew about MPI.
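
One way to picture the "current communicator" idea is the following toy
model in Python. ToyHpfRuntime, current_comm and on are hypothetical names,
not part of HPF or MPI, and the check that a nested subset must lie inside
the enclosing one is an assumption of the sketch. Communicators are modeled
simply as sets of process ranks.

```python
from contextlib import contextmanager


class ToyHpfRuntime:
    """Toy run-time that tracks the set of processes running a construct."""

    def __init__(self, nprocs):
        # Models MPI_COMM_WORLD: always the set of all processes.
        self.comm_world = frozenset(range(nprocs))
        self._stack = [self.comm_world]

    def current_comm(self):
        # The processes executing the innermost active construct.
        return self._stack[-1]

    @contextmanager
    def on(self, procs):
        # Models entering an ON block restricted to a processor subset.
        procs = frozenset(procs)
        if not procs <= self.current_comm():
            raise ValueError("subset must lie within the current communicator")
        self._stack.append(procs)
        try:
            yield procs
        finally:
            self._stack.pop()


rt = ToyHpfRuntime(4)
with rt.on({0, 1}):
    # Inside the block the current communicator shrinks; comm_world does not.
    inside = (rt.current_comm(), rt.comm_world)
outside = rt.current_comm()
```

The point of the design is that comm_world is never rebound, so code
written for a "raw" MPI program keeps its meaning, while subset-aware code
asks the run-time for the current communicator.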

-- Jim 

James Cownie 
BBN UK Ltd
Phone : +44 117 9071438
E-Mail: jcownie@bbn.com



From owner-hpff-task  Tue Oct 10 08:58:49 1995
Received: by cs.rice.edu (IAA07166); Tue, 10 Oct 1995 08:58:49 -0500 (CDT)
Received: from VNET.IBM.COM by cs.rice.edu (IAA07156); Tue, 10 Oct 1995 08:58:40 -0500 (CDT)
Received: from TOROLAB by VNET.IBM.COM (IBM VM SMTP V2R3) with BSMTP id 9049;
   Tue, 10 Oct 95 09:58:35 EDT
Received: by TOROLAB (XAGENTA 4.0) id 0612; Tue, 10 Oct 1995 10:00:23 -0400 
Received: by twinpeaks.torolab.ibm.com (AIX 3.2/UCB 5.64/4.03)
          id AA19887; Tue, 10 Oct 1995 09:57:45 -0400
From: <zongaro@vnet.ibm.com> (Henry Zongaro)
Message-Id: <9510101357.AA19887@twinpeaks.torolab.ibm.com>
Subject: Re: hpff-task: CCI's (fwd)
To: hpff-task@cs.rice.edu
Date: Tue, 10 Oct 1995 09:57:43 -0400 (EDT)
X-Mailer: ELM [version 2.4 PL24alpha3]
Content-Type: text
Sender: owner-hpff-task
Precedence: bulk

I just sent this to Rob.  I forgot to copy hpff-task as well.

Thanks,

Henry

Forwarded message:
> From henry Tue Oct 10 09:55:42 1995
> Subject: Re: hpff-task: CCI's
> To: schreibr@riacs.edu (Rob Schreiber)
> Date: Tue, 10 Oct 1995 09:55:42 -0400 (EDT)
> In-Reply-To: <199510092151.OAA00853@frey.riacs.edu> from "Rob Schreiber" at Oct 9, 95 02:51:25 pm
> X-Mailer: ELM [version 2.4 PL24alpha3]
> Content-Type: text
> Content-Length: 756
>
> > Group C, CCI status.
> >
> >
> > Number 18.   Source:  Zongaro   date 5-4-95
> > Subject:  Defined assignment in Forall
> > Status:
> >
> > Resolved.  CHK will provide clarification in HPF 2 document.   It will be
> > achange to the "sequentializatio" of the forall to account for defined
> > assignments.
>
>      X3J3 is taking a different approach on CCI 18.  They're trying to prohibit
> references in the procedure that defines the assignment to the variable that
> appears on the left-hand side of the defined assignment.  I believe WG5 will
> decide on this in their November meeting.
>
>      We might want to pick up whatever they decide.  One advantage of the
> restriction is that no extra compiler mechanisms are required for this somewhat
> obscure case.
>
> Thanks,
>
> Henry
>

From owner-hpff-task  Thu Oct 26 14:11:03 1995
Received: (from daemon@localhost) by cs.rice.edu (8.7.1/8.7.1) id NAA08169 for hpff-task-out; Thu, 26 Oct 1995 13:52:42 -0500 (CDT)
Received: from N2.SP.CS.CMU.EDU (N2.SP.CS.CMU.EDU [128.2.250.82]) by cs.rice.edu (8.7.1/8.7.1) with SMTP id NAA08158 for <hpff-task@cs.rice.edu>; Thu, 26 Oct 1995 13:52:29 -0500 (CDT)
From: Jaspal.Subhlok@n2.sp.cs.cmu.edu
Message-Id: <199510261852.NAA08158@cs.rice.edu>
Date: Thu, 26 Oct 95 14:48:17 EDT
To: schreibr@FREY.RIACS.EDU
Subject: hpff-task: Task proposal
Cc: hpff-task@cs.rice.edu
Sender: owner-hpff-task
Precedence: bulk


I am attaching a revised proposal for task parallelism. There are some
minor corrections and clarifications, a few words about SMPs, and I have
stated the scheme Rob Schreiber proposed as a separate section. 

I will not be able to attend the next meeting because of an ARPA site
visit. I will be in email contact right through. If you send me any
comments relatively soon, I can prepare a revised proposal if needed.

jaspal

------------------------------------------------------------------------

PROPOSAL FOR TASK PARALLELISM IN HPF

[Contact: Jaspal Subhlok (jass@cs.cmu.edu)]

------------------------------------------------------------------------
Assumption of functionality available from related features:

1) A way to group processors into subgroups P1, P2, P3. A way to
   attach and distribute variables onto subgroups, i.e. a1, a2, a3
   can be mapped onto subgroups P1, P2, P3. (For SMPs
   explicit distribution of variables to subgroups is not necessary
   and may not mean anything. However, a way to attach variable names
   to subgroups is still needed.)

2) An ON construct that directs execution on groups of processors
   P1, P2, P3 etc. for a block of code.

-------------------------------------------------------------------------

1. GENERAL IDEA AND MODEL:

Task parallelism is expressed by mapping different data objects on
different subgroups of processors and specifying that blocks of code
be executed on named subgroups of processors using an ON directive. Code
executing on a processor subgroup inside a designated ``task region''
normally reads and writes only the variables that are mapped to that
subgroup. The code inside a task region that is not directed to execute ON
a subgroup executes (at least conceptually) on ALL processors and
has unrestricted access to all variables. Data is exchanged
between subgroups by copying the variables of one subgroup to the
variables of another subgroup in the ALL code.

A subgroup is allowed access to variables not mapped to it, if that would
not cause a data dependence. A sufficient condition for ``no dependences''
is that such accesses should only be to variables that are ``read only''
in the task region, or ``read and written'' only by code ON any one subgroup.
The proposal essentially offers a way in which the programmer can tell the
compiler that these rules will be followed in designated code regions.

2. PROPOSAL:

A ``task region'' is a single entry, single exit region delimited by
(say) TASK REGION .... END TASK REGION.  A task region can have blocks
of code that are directed to execute ON  processor subgroups. All
other code executes on all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region:  

[This is the core of the proposal]
A code block executing on ALL processors has unrestricted  access to
all variables. A code block directed to execute ON a subgroup P
has unrestricted access to any variable mapped to P. A code block directed
to execute ON a subgroup P can access a variable not mapped to P only 
if the following constraint holds for the entire code in the task region:

a) The variable is ``read only''.
               OR
b) The variable is accessed only in code directed to execute ON P.
              
(Variable in this context means any addressable location)
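
The access rule above can be sketched as a small checker, here in Python.
This is only an illustration of the rule as stated: the function name
violations and the encoding of an access as a (where, variable, mode)
triple, with mode "r" or "w" and where either "ALL" or a subgroup name,
are assumptions of the sketch.

```python
def violations(mapping, accesses):
    """Return the accesses in a task region that break the access rule.

    mapping:  dict from variable name to the subgroup it is mapped to.
    accesses: list of (where, var, mode) for the whole task region.
    """
    # Variables written anywhere in the region are not "read only".
    written = {var for (_, var, mode) in accesses if mode == "w"}

    # For each variable, the set of places (ALL or subgroup) that touch it.
    accessors = {}
    for where, var, _ in accesses:
        accessors.setdefault(var, set()).add(where)

    bad = []
    for where, var, mode in accesses:
        # Code on ALL, and code ON the owning subgroup, is unrestricted.
        if where == "ALL" or mapping.get(var) == where:
            continue
        read_only = var not in written            # condition (a)
        only_here = accessors[var] == {where}     # condition (b)
        if not (read_only or only_here):
            bad.append((where, var, mode))
    return bad


mapping = {"a1": "P1", "a2": "P2", "done1": "P1"}
region = [
    ("P1", "a1", "w"), ("P1", "done1", "w"),
    ("ALL", "done1", "r"), ("ALL", "a1", "r"), ("ALL", "a2", "w"),
    ("P2", "a2", "w"),
]
print(violations(mapping, region))                        # []
print(violations(mapping, region + [("P2", "a1", "r")]))  # [('P2', 'a1', 'r')]
```

The first call encodes accesses in the style of a two-stage pipeline and
finds no violation; adding a read of a1 by code ON P2 is flagged because
a1 is written in the region and is not private to P2.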

An I/O operation in a code section directed to execute ON a subgroup
may not ``interfere'' with an I/O operation in a code section not
explicitly directed to execute on that subgroup. The interference of
I/O operations is detailed in Section 4.4 (INDEPENDENT).

For a subroutine call inside an ON block, ``all available processors''
are the processors in the corresponding subgroup. Their count is the
number used for mapping the parameters of the subroutine. [This part will
become more specific after the syntax etc. of creating subgroups is
decided. There should probably be a system inquiry function for the
number of processors in the current subgroup, if NUMBER_OF_PROCESSORS()
is supposed to return the total number of processors for the program]

3. COMPILATION/EXECUTION MODEL:

3.1 Basic Execution:

The execution model for a subgroup is to unconditionally execute code
ON it, unconditionally skip code ON others, and participate in the
execution of common code (on ALL processors) as normal data parallel code.
An operation in ALL involving a set of variables starts only when all
processors of the subgroups owning those variables  reach that point
of execution. This is the basic execution model for shared and 
distributed memory machines.

[The access restrictions guarantee that the results will be consistent
 with pure data parallel execution. A processor group cannot be
 ``invisibly'' writing to a location being accessed by ALL or another
 processor group, and vice versa]

3.2 Variable Access:

We state ``one'' model for accessing variables in a task region for a
distributed memory machine. (This is important for building an
efficient compilation scheme although not really a part of the
execution model).

Access to variables owned by other processors is cooperative, i.e.
the owner sends the value, and the user receives it, with one
exception: when code ON a subgroup has to access a variable not
mapped to it, it uses a remote fetch/deposit. (It can also cache remote
locations locally in the subgroup for the duration of the execution of
the task region, since computation not ON that subgroup cannot access them.)

4. EXAMPLE: 2DFFT

Sequential:


      real, dimension(n,n) :: a1, a2

      do while(.true.)
          read (unit = iu, end = 100) a1
          call rowfft(a1)
          a2 = a1
          call colfft(a2)
          write (unit = ou) a2
          cycle
100       continue
          exit
      enddo


Pipelined Data/Task Parallel HPF:

        real, dimension(n,n) :: a1,a2
        logical done1
!hpf$   disjoint processor groups P1, P2 (Syntax TBA)
!hpf$   distribute a1(block,*) onto P1
!hpf$   distribute a2(*,block) onto P2
!hpf$   distribute done1 onto P1
                 
!hpf$   TASK REGION
        done1 = .false.
        do while (.true.)
!hpf$       ON HOME(P1) BLOCK 
              read (unit = iu,end=100) a1
              call rowfft(a1)
              goto 101
    100       done1 = .true.
    101       continue
!hpf$       END BLOCK
            
            if (done1) exit
            a2 = a1

!hpf$       ON HOME(P2) BLOCK
               call colfft(a2)
               write(unit = ou) a2
!hpf$       END BLOCK
        enddo
!hpf$   END TASK REGION


The data parallel code on the two processor groups might look something
like this, after the task region is compiled.

Processor Group P1:

        real, dimension(n,n) :: a1
!hpf$   distribute a1(block,*)
        logical done1

        done1 = .false.
        do while (.true.)
           read (unit = iu,end=100) a1
           call rowfft(a1)
           goto 101
    100    done1 = .true.
    101    continue
           _send(done1,P2)
           if (done1) exit
           _send(a1,P2)
        enddo

Processor group P2:

        real, dimension(n,n) :: a2
!hpf$   distribute a2(*,block)
        logical local_done1

        do while (.true.)
           _receive(local_done1,P1)
           if (local_done1) exit
           _receive(a2,P1)
           call colfft(a2)
           write(unit = ou) a2
        enddo
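
The send/receive structure of the two compiled loops can be mimicked on a
single machine with two threads and a FIFO queue. The Python sketch below
is only a model of the control flow: run_pipeline is a hypothetical name,
the list of frames stands in for the input unit, and rowfft and colfft are
trivial placeholders (row reversal and row-order reversal), not FFTs.

```python
import queue
import threading


def rowfft(a):
    # Placeholder for the row transform: reverse each row.
    return [row[::-1] for row in a]


def colfft(a):
    # Placeholder for the column transform: reverse the order of the rows.
    return a[::-1]


def run_pipeline(frames):
    chan = queue.Queue()   # models the P1 -> P2 message channel
    out = []               # models the output unit written by P2

    def p1():
        it = iter(frames)
        while True:
            try:
                a1 = next(it)        # read (unit = iu, end = 100) a1
            except StopIteration:
                chan.put(True)       # _send(done1, P2) with done1 true
                return
            a1 = rowfft(a1)
            chan.put(False)          # _send(done1, P2) with done1 false
            chan.put(a1)             # _send(a1, P2)

    def p2():
        while True:
            if chan.get():           # _receive(local_done1, P1)
                return
            a2 = chan.get()          # _receive(a2, P1)
            out.append(colfft(a2))   # write (unit = ou) a2

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because P1 can queue the next data set while P2 is still working on the
previous one, the two stages overlap, which is the pipelining that the
task region is meant to expose.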

4. AN ALTERNATE MODEL:
   (Rob Schreiber proposed this at the last meeting)

   A related but different model for task parallelism is as follows:

   1) All code in a task region is directed to execute ON some
      (arbitrary) subgroup of processors. If no ON directive
      is present, ALL is assumed.

   2) If a variable is ``read only'' in the region, there are no
      other restrictions. Otherwise, if two subgroups P1 and P2
      access a variable x, P1 and P2 must have at least one common
      processor.

The execution model is that all processors execute the code that
is mapped to a subgroup they belong to, and skip other code.
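
Rule 2 of this alternate model can be sketched as a check over the
accesses in a task region. The Python below is an illustration under
assumed encodings: alt_model_ok is a hypothetical name, each subgroup is
given as a set of processor numbers, and each access is a
(subgroup, variable, mode) triple with mode "r" or "w".

```python
from itertools import combinations


def alt_model_ok(accesses):
    """True if every variable written in the region satisfies rule 2."""
    # Variables written anywhere in the region are not "read only".
    written = {var for (_, var, mode) in accesses if mode == "w"}

    # For each variable, the distinct subgroups that access it.
    groups = {}
    for procs, var, _ in accesses:
        groups.setdefault(var, set()).add(frozenset(procs))

    for var in written:
        # Any two subgroups touching a written variable must overlap.
        for g1, g2 in combinations(groups[var], 2):
            if not (g1 & g2):
                return False
    return True


P1, P2, ALL = {0, 1}, {2, 3}, {0, 1, 2, 3}

print(alt_model_ok([(P1, "x", "w"), (ALL, "x", "r")]))  # True
print(alt_model_ok([(P1, "x", "w"), (P2, "x", "r")]))   # False
print(alt_model_ok([(P1, "x", "r"), (P2, "x", "r")]))   # True
```

A check like this needs the complete list of accesses, which is part of
why compile-time verification of the user's assertion is hard.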

4.1 TRADEOFFS:

This model is very clean and simple to state. It separates the
control aspect of task parallelism from the data aspect. Other
mechanisms are used for mapping data to the tasks in a distributed 
memory machine for performance.

The cons are that even though it is simple to state, it is a
subtle construct for task parallelism, and there is no clear
user programming model. The compilation model is also not as clear,
and it is extremely hard for the compiler to check for any
violations of the user assertions.  There is no experience in
using something like this.

5. GENERAL COMMENTS:
 
1) The main difference from the simple parallel section/region (or
   using an INDEPENDENT do loop to achieve parallel sections) is that
   the task regions presented here can also have code that executes
   on ALL processors. A task region with no such code is similar to
   a parallel section/region. However, allowing other code makes this
   construct more general, and implicitly allows pipelining in
   particular. At the same time, the existence of code in ALL can
   constrain parallelism due to data dependence, and in the worst
   case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an
   ON block, any variable being read (or used for control flow) cannot
   be written by any other processor group - it can be only written by
   ALL processors, in which case the control flow from the subgroup
   must also reach that point.  Outside an ON block, all processor
   groups execute all control flow (and other) statements. If a
   subgroup skips a control construct because it is not involved
   (i.e. its variables are not involved and there is no code inside the
   scope of the control construct that is directed to execute ON it)
   and continues to execute its next ON block, the constraints ensure
   that it cannot write to a location that is used for managing
   control flow.

3) There may be some issues with respect to extrinsic subroutine calls
   to ensure that the basic model works in their presence. It is
   probably best to address them after the subroutine call execution
   model is more clearly defined for ON regions in general.