MCS 572 TMC CM-5 & IBM SP2 Group Project Suggestions
Fall 1995

Professor F. B. HANSON


DUE Friday 01 December 1995 in class.

NCSA TMC CM-5 and CTC IBM SP2 Group Projects (2 report copies) are due on 01 December in class, but project presentations will begin earlier, on Monday 27 November, to accommodate students attending Supercomputing '95. The same instructions and rules apply to the CM-5/SP2 projects as to the Cray C90/T3D group projects [12Ways], [BBDS], [OSW]. See the Local Guide for the CM-5 [LG] for CM-5 information and the Class HomePage outline for the IBM SP2.

Project Suggestions

  1. Own Project:
    A CM-5/SP2 project of your own design, such as implementation, design, or optimization coding connected with your own thesis research or other course work.
  2. Study CM-5 Vector Unit Performance:
    Compare the performance of the CM-5 Processing Nodes (PNs) WITH the Vector Units (VUs, cmf option -cm5 -vu) and WITHOUT the Vector Units (cmf option -cm5 -sparc) for computing several array or other operations. For each operation, try to determine how many CM-5 VUs are equivalent to one CM-5 PN. See [GSCMF], [GSCS], [CMEGS] for example programs, especially paris/random.fcm for filling arrays with random entries.
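    The measurement loop itself can be sketched serially in Python as a stand-in for the CM Fortran timing utilities; the names time_op, add, and mul below are illustrative, not part of any CM library, and the best-of-several-trials policy is one reasonable choice, not a prescription:

```python
import random
import time

def time_op(op, a, b, trials=5):
    """Return the best-of-`trials` wall-clock time for op(a, b)."""
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        op(a, b)
        best = min(best, time.perf_counter() - t0)
    return best

# Elementwise array operations to compare (serial stand-ins for
# CM Fortran array expressions such as A + B and A * B).
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

n = 1 << 16
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]

for name, op in [("add", add), ("mul", mul)]:
    t = time_op(op, a, b)
    print(f"{name}: {t:.6f} s")
```

    On the CM-5 the same structure applies, with the array expression timed once with VUs and once without, and the ratio of the two times giving the VU-to-PN equivalence for that operation.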
  3. Comparison of SP2 and CM-5 Computation:
    Compare the performance of the CM-5 with the SP2 for computing several array or other operations. For each operation, try to determine how many SP2 floating-point processors are equivalent to one CM-5 processor (PN plus VUs). See [GSCMF], [GSCS], [CMEGS] for example programs, especially paris/random.fcm for filling arrays with random entries.
  4. Comparison of Front-End (Partition Managers (PMs) on the CM-5 and their analogues on the SP2) and CM (Processing Nodes (PNs) with Vector Units (VUs) on the CM-5) Computation:
    Compare the performance of the CM Front-End with the CM proper for computing several array or other operations. For each operation, try to determine how many CM processors are equivalent to one CM Front-End processor. See [GSCMF], [GSCS], [CMEGS] for example programs, especially paris/random.fcm for filling arrays with random entries. You will need to generalize this comparison for the SP2.
  5. Optimal Virtual Processor Ratio:
    Pick an application like matrix-matrix multiplication or iterative methods for Laplace's equation, and determine the most efficient VP ratio (VPR), the ratio VPR = #VPs/#PPs, where #VPs is the number of Virtual Processors (the number of elements in the array) and #PPs is the number of Physical Processors. Users could once set the number of VPs directly, but that has not been possible since the Slice-Wise computation model was introduced on the CM-2; it is also a basic execution model on the CM-5. You will therefore have to estimate the VPR from the largest array sizes. Also, measure the effect of CM fill-in or wrap-around by using problem array sizes that are not powers of 2.
  6. Statistics Project:
    Test the performance in computing basic statistics (mean, variance, chi^2) using the CM-5 and/or SP2, similar to the Cray projects. See the example program paris/random.fcm in [CMEGS] for filling arrays with random entries. Find a significant performance model for chi^2. How well do statistical data and algorithms parallelize? How good is CMF_random as a Random Number Generator?
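    The parallel structure of these statistics is a reduction; a serial Python sketch of the pairwise (fan-in) sum that the CM's intrinsic sum function performs in log2(n) combining steps might look as follows (the names tree_sum and basic_stats are illustrative):

```python
import random

def tree_sum(xs):
    """Pairwise (fan-in) reduction: the serial analogue of a parallel
    sum over processors, which takes O(log n) steps in parallel."""
    xs = list(xs)
    while len(xs) > 1:
        nxt = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:            # odd element carried to the next round
            nxt.append(xs[-1])
        xs = nxt
    return xs[0] if xs else 0.0

def basic_stats(xs):
    """Sample mean and variance via two tree reductions."""
    n = len(xs)
    mean = tree_sum(xs) / n
    var = tree_sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var

data = [random.random() for _ in range(4096)]
mean, var = basic_stats(data)
print(f"mean={mean:.4f}  var={var:.4f}")  # uniform(0,1): roughly 0.5 and 1/12
```

    On the CM the two reductions would simply be calls to the extended intrinsic sum over the distributed array.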
  7. Square Matrix Multiplication --- Cannon's Algorithm:
    This is a project of testing the basic data-parallel algorithm (a result of L.E. Cannon's 1969 Montana Ph.D. thesis, not available) [DPS], [MMCM], [CMFUG] for square matrix multiplication. The algorithm and the implementation are the same as discussed in class. Use CM Fortran timing tools to test the timing performance on the NCSA CM-5 or CTC SP2. Your project must contain both double-precision and single-precision results. Your performance summary table needs to list results for different problem sizes (such as problems larger than the number of physical processors). It would be more interesting if you can include the VP ratio in your discussion. See the CM example programs cannon_CMF.fcm, mm.fcm and paris/cannon.fcm in [CMEGS], as well as the intrinsic function matmul(A,B), for use in this project.
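    As a reference for what the data movement looks like, here is a serial Python simulation of Cannon's algorithm on an n x n grid of "processors", each holding one element of A, B, and C (a sketch of the communication pattern, not of the CM Fortran implementation itself):

```python
def cannon_matmul(A, B):
    """Serial simulation of Cannon's algorithm: skew A and B, then
    n rounds of local multiply-accumulate plus wrap-around shifts."""
    n = len(A)
    # Initial skew: shift row i of A left by i, column j of B up by j.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]   # local multiply-accumulate
        # Shift a left by one and b up by one, with wrap-around.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C

print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

    In practice each "processor" holds a block rather than a single element, but the skew-and-shift schedule is the same.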
  8. Evaluation of Recurrence Relation:
    This is a special case of the two-term linear recurrence [PAMC],
    (1.1)  x(i) = a(i)*x(i-1) + b(i)*x(i-2),    (i = 2, 3, ..., n) 
    
    to be solved for x(2), x(3), ..., x(n), given the values of x(0) and x(1) and the coefficients a(i) and b(i). At first sight, the parallelism in (1.1) is not obvious, but it becomes apparent if we reformulate the equation in matrix-vector notation. Write
    
                 | x(i)   |
          y(i) = |        |,   (i = 1, 2, ..., n)
                 | x(i-1) |
     
    
    Then (1.1) can be written
    
                                      | a(i) b(i) |
          y(i) = M(i)*y(i-1),  M(i) = |           |,   (i = 2, 3, ....., n)
                                      |  1    0   |
    
    
    therefore
    
          | x(n)   |                                | x(1) | 
          |        |  = y(n) = M(n) M(n-1) ... M(2)*|      |
          | x(n-1) |                                | x(0) | 
    
    
    Code the algorithm for data-parallel computation (say, by pairwise or fan-in multiplication) for sufficiently large n, say n = 8k (8192). Discuss in detail the implementation data structures as well as any optimization techniques. The coefficients a(i), b(i) may be chosen as appropriate (for example, filled with random values).
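    A serial Python sketch of the fan-in approach: represent each M(i) as a 2x2 matrix, and combine the ordered list M(n), M(n-1), ..., M(2) pairwise, exactly the tree a parallel machine would evaluate in log2(n) rounds (the function names are illustrative; note that matrix multiplication is not commutative, so adjacent pairing must preserve the order):

```python
def mat2_mul(P, Q):
    """2x2 matrix product; matrices stored as ((p11, p12), (p21, p22))."""
    return (
        (P[0][0]*Q[0][0] + P[0][1]*Q[1][0], P[0][0]*Q[0][1] + P[0][1]*Q[1][1]),
        (P[1][0]*Q[0][0] + P[1][1]*Q[1][0], P[1][0]*Q[0][1] + P[1][1]*Q[1][1]),
    )

def fan_in_product(mats):
    """Pairwise (tree) product of an ordered matrix list: a serial
    stand-in for the log2(n)-step parallel combine."""
    mats = list(mats)
    while len(mats) > 1:
        nxt = [mat2_mul(mats[i], mats[i + 1])
               for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            nxt.append(mats[-1])       # odd matrix carried to next round
        mats = nxt
    return mats[0]

def recurrence_tail(a, b, x0, x1):
    """Return (x(n), x(n-1)) for x(i) = a[i]*x(i-1) + b[i]*x(i-2),
    i = 2..n, with coefficient lists indexed so a[i], b[i] are used."""
    n = len(a) - 1
    # y(n) = M(n) M(n-1) ... M(2) * y(1); list the factors in that order.
    mats = [((a[i], b[i]), (1, 0)) for i in range(n, 1, -1)]
    M = fan_in_product(mats)
    return (M[0][0]*x1 + M[0][1]*x0, M[1][0]*x1 + M[1][1]*x0)
```

    With a(i) = b(i) = 1 and x(0) = x(1) = 1 this reproduces the Fibonacci numbers, which makes a convenient correctness check.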
  9. Find All Primes Within 1 to N --- Sieve Algorithm:
    The Sieve algorithm [GSCMF], [GSCS] works as follows:
    1. Assign processor P_i the index number i, for i = 1, 2, ..., N.
    2. Mark all processors except P_1 with ``true''.
    3. Find the smallest index not yet used as a divisor from the index set
      {i_1, i_2, ..., i_k | where P_{i_j} is marked true for all j = 1, 2, ..., k} .
    4. Use this smallest index to test the divisibility of all OTHER (larger) indices whose processors are marked ``true'':
      • Mark the processor ``false'' if its index is divisible;
      • Otherwise, keep the original mark.
    5. If all indices have been processed, then stop; otherwise go to step 3.
    Use CM Fortran or C* to code this algorithm; discuss your implementation, the data structure, and any optimization techniques. Compare the implementations in different cases (with and without optimization techniques). See the CM example programs primes0.f (serial), primes1.fcm and primes2.fcm in [CMEGS].
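    For reference, a serial Python sketch of the same scheme: one boolean flag per "processor", and each round's divisibility marking (the inner loop below) is exactly the step that would run in parallel across processors (the function name sieve_primes is illustrative):

```python
def sieve_primes(N):
    """Sieve in the data-parallel style: 'processor' i holds marked[i];
    each smallest still-true index is a prime, and all larger indices
    divisible by it are turned false (conceptually in one parallel step)."""
    marked = [True] * (N + 1)
    marked[0] = marked[1] = False      # P_1 starts false; 1 is not prime
    primes = []
    for i in range(2, N + 1):
        if marked[i]:
            primes.append(i)
            # Parallel step: every larger index divisible by i goes false.
            for j in range(2 * i, N + 1, i):
                marked[j] = False
    return primes

print(sieve_primes(30))
```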
  10. Matrix-Vector Computation:
    Matrix-Vector Multiplication Ab = x on the CM-5 can be performed in two ways [XHC]:
    1. Use of the spread function
                              real A(M,N), b(N), x(M)
                        CMF$  ALIGN b(I) WITH A(1,I)
                                  :
                              x = sum(spread(b,dim=1,ncopies=M)*A,dim=2)
      
    2. Use of the broadcasting technique
                              real A(M,N), b(N), x(M)
                        CMF$  LAYOUT A(:serial, :news)
                                  :
                              DO 20 L = 1, M
                                x(L) = sum(A(L,:)*b)
                         20   CONTINUE
      
    Is the Fortran spread function always valid in the first computational method? If not, why not, and when? Give your analysis of both computational methods. Make sure you use both double-precision and single-precision in your discussion. See the CM example program matvec.fcm in [CMEGS] for purposes of comparison. How does this generalize to the SP2?
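    The two methods can be mirrored serially in Python to check that they agree (the function names are illustrative; the list B below plays the role of the spread copy, which is where the memory cost of method 1 shows up):

```python
def matvec_spread(A, b):
    """Method 1: replicate b across M rows (the role of Fortran's
    spread along dim=1), multiply elementwise with A, reduce each row."""
    M, N = len(A), len(b)
    B = [b[:] for _ in range(M)]       # explicit M x N copy of b
    return [sum(A[i][j] * B[i][j] for j in range(N)) for i in range(M)]

def matvec_rows(A, b):
    """Method 2: broadcast b once and reduce each row A(L,:)*b."""
    return [sum(aij * bj for aij, bj in zip(row, b)) for row in A]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [1.0, 1.0]
print(matvec_spread(A, b), matvec_rows(A, b))
```

    The explicit replicated copy in method 1 makes the storage question concrete: for large M and N the spread temporary may not fit, which is one direction to pursue in the analysis.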
  11. CM Random Numbers:
    CMF_random is a utility routine supplied with CM Fortran that fills an array with random numbers. It takes two arguments: the first is a CM array of type integer, real, or double precision; the second (used only if the first is an integer array) is an exclusive limit on the random numbers to be generated. The format is:
                           call CMF_random(A, 10)
    
    Generate suitable sets of random numbers, each with a different sample size n. For each set, compute basic statistics (like mean, variance, chi-square, and other moments) with a high level of parallelization (i.e., make use of the extended intrinsic sum function on the CM). See [CMEGS] for an example program, paris/random.fcm, which fills arrays with random entries and contains another sample random subroutine.
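    As a sketch of the chi-square quality check, here is a serial Python version using the standard library generator in place of CMF_random (the function name chi_square_uniform and the choice of k = 10 bins are illustrative assumptions):

```python
import random

def chi_square_uniform(samples, k=10):
    """Chi-square statistic of samples in [0,1) against k equal bins;
    for a good uniform generator it should be near k - 1, the expected
    value of a chi-square variable with k - 1 degrees of freedom."""
    n = len(samples)
    counts = [0] * k
    for x in samples:
        counts[min(int(x * k), k - 1)] += 1
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

random.seed(1995)   # illustrative seed for reproducibility
stat = chi_square_uniform([random.random() for _ in range(100_000)])
print(f"chi^2 with 9 degrees of freedom: {stat:.2f}")
```

    The binning loop and the final sum are both reductions, so on the CM they map onto the extended intrinsic sum in the obvious way.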
  12. Forward Gaussian Elimination:
    Implement the Forward Gaussian Elimination with Back Substitution method on the CM-5 or SP2 with a genuine CMF code for a properly sized problem. Recall the sample code fragment illustrating an application of the CMF spread intrinsic function that was discussed in class (the idea is mentioned in [GSCMF]); are there efficiency advantages to this type of code, or are there faster ways to code it for the CM? Test your code on randomly filled matrices. Consider modifications for row pivoting with row scaling, or full pivoting. Then discuss your work.
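    A serial Python reference for checking a CMF implementation: forward elimination with partial (row) pivoting followed by back substitution (gauss_solve is an illustrative name; on the CM the inner update loop over rows i is the part that becomes a rank-1 array operation):

```python
def gauss_solve(A, b):
    """Solve A x = b by forward elimination with row pivoting and
    back substitution. A is a list of n rows; b has length n."""
    n = len(A)
    A = [row[:] for row in A]          # work on copies
    b = b[:]
    for k in range(n - 1):
        # Row pivoting: bring the largest |A[i][k]|, i >= k, to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):      # eliminate below the pivot
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):     # back substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

    In the CMF version, the double loop over i and j collapses into one array statement per pivot step, which is where the spread idea from class comes in.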
  13. LU-decomposition:
    Implement the LU decomposition method on the SP2 and CM-5 with a genuine F90 code for a properly sized problem. Test your code on randomly filled matrices. See the related comments in the previous suggestion. Then discuss your work.
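    For reference, a serial Python sketch of the Doolittle form of the factorization, without pivoting (an assumption: nonzero pivots; lu_decompose is an illustrative name):

```python
def lu_decompose(A):
    """Doolittle LU factorization without pivoting: returns (L, U)
    with L unit lower triangular, U upper triangular, and A = L*U."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of L
            L[j][i] = (A[j][i]
                       - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

    Verifying that L*U reproduces A on randomly filled matrices is the natural correctness test before any timing runs.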
  14. Iterative Methods on the CM:
    Test the performance of the CM-5 and SP2 for computing iterations. See Richard Shapiro's sample program laplace.fcm in [CMEGS], or on CMS getdisk hanson, with timers, for an example of solving Laplace's equation on the unit square by Point Jacobi iterations. What happens if genuine (not simple) boundary conditions are imposed? Modify the code for more general elliptic PDE problems (i.e., with first derivatives and variable coefficients), or modify the code for 3D. Other variations of this problem are implementations of the Conjugate Gradient Method or the MultiGrid Method. For a useful reference, see Ortega's textbook [IPVSLS].
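    The Point Jacobi scheme itself can be sketched serially in Python (jacobi_laplace is an illustrative name; the boundary is supplied as a function so that genuine, non-constant boundary conditions can be plugged in):

```python
def jacobi_laplace(boundary, n, iters=500):
    """Point Jacobi for Laplace's equation on an n x n grid of interior
    points; boundary(i, j) gives values on the frame of the
    (n+2) x (n+2) grid. Every interior update uses only old values,
    so the whole sweep parallelizes trivially."""
    m = n + 2
    u = [[boundary(i, j) if i in (0, m - 1) or j in (0, m - 1) else 0.0
          for j in range(m)] for i in range(m)]
    for _ in range(iters):
        new = [row[:] for row in u]
        for i in range(1, m - 1):
            for j in range(1, m - 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1])
        u = new
    return u
```

    On the CM the double loop is a single shifted-array expression (e.g., with cshift), and the copy `new` corresponds to the old/new array pair the data-parallel code must keep.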
  15. Black-Red Schemes for Gauss-Seidel Method:
    The Gauss-Seidel Method is an inherently serial method because, unlike Jacobi, it updates node values within each iteration, making it highly data dependent. By alternately pretending to color the nodes black and red (using white instead, the method is called a white-black scheme), so that in 2D no red node is adjacent to another red node, and then by separately iterating the black set with old red values and the red set with old black values, interchanging values only at the end of one complete iterate, an essentially decoupled modified Gauss-Seidel method is created. Compare this scheme with the Point-Jacobi Method, a naturally parallelizable method since updating is only done after a complete iteration, for the Laplace iteration problem. For a useful reference, see Ortega's textbook [IPVSLS].
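    A serial Python sketch of the red-black sweep for the Laplace problem (redblack_gauss_seidel is an illustrative name; each half-sweep reads only points of the opposite color, which is what makes it data-parallel):

```python
def redblack_gauss_seidel(boundary, n, iters=500):
    """Red-black Gauss-Seidel for Laplace's equation on an n x n
    interior grid: update all red points (i + j even), then all black
    points (i + j odd); within each half-sweep the updates are
    independent and can run in parallel."""
    m = n + 2
    u = [[boundary(i, j) if i in (0, m - 1) or j in (0, m - 1) else 0.0
          for j in range(m)] for i in range(m)]
    for _ in range(iters):
        for color in (0, 1):                   # 0 = red, 1 = black
            for i in range(1, m - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 == color:
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1])
    return u
```

    Comparing iteration counts to convergence against the Point Jacobi sketch makes the trade-off concrete: red-black Gauss-Seidel needs no second array copy and typically converges in fewer iterations.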
  16. "Arrayizing" the Cray Starter Problem on the CMs:
    Take the original Cray starter problem and see how much you can tune it on the CM-5 and SP2 to make it run faster. Compare your results with your Cray results, both in overall speed-up ratio and in detail, loop by loop. Use n = 256 or n = 128 on both the Cray and the CMs.