From: remedy@SDSC.EDU
To: hanson@math.uic.edu
Cc: milfeld@hpc.utexas.edu
Subject: [0028930] Update

Update from milfeld:

Hi Floyd,

I compiled your code on our T3E at the Univ. of Texas, and it worked just fine; however, we have just changed to the latest and greatest MPT (Message Passing Toolkit), and there were modifications to the MPI buffering that occurs in the send/recv, so the compiled code uses a new default buffer size. (Try using /tmp/laplace.exe. It was compiled on the UT system.)

 o The barrier is not causing any problem. However, you do not need two barriers: remove the _barrier call and keep MPI_Barrier. (Using two barriers, one immediately following the other, serves no purpose.)

 o The problem is that your sends have created a deadlock, and should be changed to avoid deadlocking. Your code has two sets of sends, followed by two sets of receives, and the sends are "standard" sends. If the default buffer size is too small, a standard send will block until a receive is posted. This is what has happened in your code: all sends are waiting for a receive to be posted. This occurs now because the default buffer sizes have changed with subsequent revisions. Correct coding will avoid this dependency.

There are two ways to do this:

1.) Post immediate receives before the sends, and then wait on the immediate receives once the code has fallen through the receive and performed the send.

     MPI_Irecv(...)
     ...
     MPI_Send(...)
     ...
     MPI_Wait(...)

2.) Another option is to simply move the first set of receives between the two sets of sends:

     MPI_Send(...) 1   --->   MPI_Send(...) 1
     MPI_Send(...) 2   --->   MPI_Recv(...) 1
     MPI_Recv(...) 1   --->   MPI_Send(...) 2
     MPI_Recv(...) 2   --->   MPI_Recv(...) 2

This will avoid the deadlock produced by a send "right", send "left", followed by a receive "left", receive "right" pair.
The send right, receive left (non-deadlocking) paradigm is explained below:

     Send Right:          0 -> 1 -> 2 -> 3
     Receive from Left:   0 -) 1 -) 2 -) 3

On the send, processors 0, 1 and 2 block. Once PE 3 posts its receive, PE 2 "falls through" its send and posts its own receive, and in a similar fashion PEs 1 and 0 do the same thing. (A second set of sends and receives in your code does likewise.)

I have placed the modified code in /tmp on the T3E; it is named "l.c" (/tmp/l.c).

Hope this helps,
Kent Milfeld