MCS572 UIC User's Local Guide to
PSC Terascale Computing System (TCS) Cluster

version 0.80

24 March 2003


F. B. Hanson

Mail address:

Office address:

Hanson World Wide WEB Home Page:

 

UIC Fall 2003 Course:

MCS 572 Class World Wide WEB Home Page:

Acknowledgement:


 

Table of Contents

  • Preface.

  • TCS Overview.

  • Guide Notation.

  • Background References.

  • UNIX Command Dictionary. AVAILABLE As Separate File.
  • Interrupts Dictionaries Telnet and UNIX.
  • MPI Message Passing Programming on TCS. UNDER RECONSTRUCTION

  • TCS Fortran90 and other Extensions. UNDER RECONSTRUCTION
  • f90 and cc Timing Utility Functions. UNDER RECONSTRUCTION


    Preface

    This User's Local Guide is intended to be a sufficient, hands-on introduction to the Pittsburgh Supercomputing Center TCS (Terascale Computing System) Parallel Cluster for our MCS 572 Introduction to Supercomputing class. The TCS runs a Compaq variation of the UNIX operating system called Tru64 UNIX.

    The PSC Class Account for MCS572 Fall 2000 is `sc70jpp' for the PSC Grant SEEE030003P.


    TCS Overview.

    The PSC TCS MC512 is a large scale parallel cluster with 64 Compaq (HP) AlphaServer compute nodes, each with four 667 MHz processors, making a total of 256 processors. The PSC TCS's internet address is

    with the prompt of `%'. For TCS information from PSC, see

    Remark: There are a lot of inaccuracies in this outdated page, some of which are corrected in this local guide.

    The TCS is a prototype for a larger, final terascale system called Lemieux with 750 compute nodes and a total of 3000 processors. Lemieux's web page should be consulted for updated system information:

    The TCS and Lemieux AlphaServer System Reference card is found at

    What does the PSC TCS look like? PSC TCS Picture

    A simple view of the TCS architecture is given for the larger system:

     

    TCS Compute Nodes and Processors.

    Each compute node is an HP AlphaServer SC ES40/EV67 node configured as a 4 processor Symmetric MultiProcessor or Shared Memory Processor (SMP) with 4 GigaBytes (GB) of memory (RAM). The TCS cluster nodes are connected with a proprietary Quadrics Interconnection (IC) network. For more information on the compute nodes, see the nice Compaq slide show of N. Srivastava:

     

    TCS Benchmark Performance.

    The PSC TCS, installed at PSC in April 2001, ranks as the 246th top computer in the world (Top 500 Computer Reports, November 2002, Source: http://www.top500.org). It achieved a maximal speed of Rmax = 264 GigaFlops (GF) on the LINPACK linear algebra benchmarks, with Hockney Linear Model (see MCS572 class notes) parameters of asymptotic peak speed Rpeak = 342 GF (also called Rinfinity) and N1/2 = 20,000, with maximum order run Nmax = 106,000 given at the web link above, or see the class summary

     

    TCS Memory Units.

    The random access memory (RAM) is a globally shared 4 GB memory within each 4 processor node, but is distributed memory of 256 GB with respect to the cluster of 64 nodes, so the TCS has a hybrid memory system when viewed globally as a 256 processor system. The processors or CPUs each have an 8 MB L2 cache memory (level 2 local memory).

     

    TCS Operating System.

    The operating system is the Compaq Tru64 UNIX V5.1A (Rev. 1885). However, since compilation and execution on the TCS are by remote batch scheduling, the user works with a combination of the Compaq Portable Batch System (PBS) and the UNIX Network Queueing System (NQS), and should refer to the subsections on those topics. See

     

    TCS Login Shells.

    The operating system environment is set by a UNIX shell and the default shell on the PSC TCS is the C-Shell. The shell can be changed with the Change Shell command "chsh", although that is not recommended; the command has the format:

    where the Shell Path "[shellpath]" can be found with system "which" in the format:

    where "[shell]" is the standard system shell "sh", the Bourne again shell "bash", the Korn shell "ksh", or others. However, all of the NQS QSUB job scripts given here assume the C-shell, which uses the resource configuration file ".cshrc" that resides in the user's home directory and can be used to define commands and make aliases (format: "alias [aliasname] [aliasdefinition]"; quotation marks are needed around definitions containing special command characters). A sample of a ".cshrc" file for use on the TCS is

     

    TCS-UIC Login Access.

    Users MUST access the PSC TCS directly using the Secure Shell (ssh), such as from UIC `icarus' or from department systems,

    or

    If your computer system does not have ssh, you will have to find one that does, like the UIC student computer server icarus.uic.edu, since every student should have a UIC netid. If ssh has difficulty with the Unix file ".ssh/known_hosts" (the file will differ on other platforms), then edit the file by deleting the entry for the node that is giving the problem, since the ssh key may be expired, and try the ssh command again.

    SSH works like the Unix remote login command `rlogin', but encrypts your password so that it is nearly impossible to steal. The commands `rlogin' or `telnet' do not work with the `tcs', resulting in the response "tcs.PSC.edu: Connection refused". See "man ssh" for help from the UNIX manual pages.

    SSH is a UNIX command found on many UNIX systems, but you can get a free MS Windows version that comes in two main flavors:

     

    TCS-UIC File Transfer.

    Users MUST do their file transfer between the PSC TCS and UIC using the Secure Shell (ssh) commands such as secure copy scp or secure FTP sftp. For example, from UIC

    SCP Secure Copy:

    or from PSC TCS

    This form of the command works well for a single file, which can also have a directory path, but the user password has to be given each time. For multiple files a wild card version can be used, e.g., for all C files, omitting the target file name from PSC:

    See "man scp" for help from the UNIX manual pages.

    SFTP Secure File Transfer Protocol:

    Also, you can use the secure File Transfer Protocol (FTP) called sftp that works like the usual FTP, except that you can not use any abbreviations of the FTP subcommands (i.e., type each subcommand, such as "put", in full), but SFTP secures your session better. For example, from UIC,

    or from PSC TCS

    Remark: If your username is the same at both UIC node and PSC node, then the "[username]@" is optional. See "man sftp" for help from the UNIX manual pages.

     

    TCS File Systems.

    HOME Directory:

    Each PSC User has a home directory to keep files and subdirectories with the full path specified by "/usr/users/[n]/[username]" where [n] = 0:9. The home directory can be more simply referenced by the UNIX symbol ~ or the UNIX meta or environmental variable representation ${HOME}, as in cd $HOME to change directory back to home or ls ${HOME}/mcs572 to list contents of a home subdirectory "mcs572" (note that the curly brackets are optional in the first example but required in the second example where "HOME" is followed by nonblank characters). Home directory quotas are 100MB (Mb?), but may not be enforced.

    SCRATCH Directory:

    Each user has a scratch or work directory "/usr/scratch/[nx]/[username]" where [nx] = 0:9 or [nx] = 0x:9x, and these directories are linked to the disks /scratch1/ or /scratch2/. The user's scratch directory can simply be referenced by the meta representation ${SCRATCH}, where the curly brackets are optional if ${SCRATCH} is used as a sole argument. It is strongly recommended that the scratch directory be used for scheduling batch jobs (essentially the only ones allowed) on the TCS cluster with the qsub queueing submit command, including all necessary input files. Caution: On the tcs.html webpage the no longer existing $STAGE is incorrectly listed for the work directory.

    LOCAL Directory:

    Each TCS cluster computing node has global node memory accessible to all four of its processors and that memory is accessible to the user only when the user's code is executing, technically beginning with the qsub script required shell identification, e.g., "#!/bin/csh" escape to the C-shell. Hence, it should not be necessary to change the current directory to ${LOCAL}. However, the parallel run command prun needs the seemingly redundant "./[executable]" form of the file name. On the tcs.html webpage the no longer existing $TMPDIR is incorrectly listed for the compute node directory.

    Remark: The commands "qsub" and "prun" are described more below.

    FAR File ARchiver System:

    The FAR system runs on golem.psc.edu and is accessible from TCS and Lemieux for large file storage for long periods of time. To use it, you will need information on the Andrew File System (AFS) and the FAR special instructions. More information on FAR is at

    However, for the class, you will likely not be needing FAR.

     

    TCS Programming Languages.

    The TCS programs are compiled directly on the TCS, given here with some typical options, using the

    Fortran90 Compiler:

      f90 -O -lmpi -lelan -arch ev67 -lm -o [executable] [source].f
    or the

    C Compiler:

      cc -O -lmpi -lelan -arch ev67 -lm -o [executable] [source].c
    or the

    C++ HP Compiler:

      cxx -O -lmpi -lelan -arch ev67 -lm -o [executable] [source].c
    See "man cxx" for help from the UNIX manual pages or http://h30097.www3.hp.com/cplus/cxx_ref.htm

    In the above compilation commands, the options are

    • "-O": a typical level of optimization for that compiler (see the Quick Reference Card for more information on other optimization options).

    • "-lmpi" or "-l mpi": references the Message Passing Interface (MPI) Library that is called by Fortran or C compilers so that the code can execute in parallel, permitting the use of MPI parallel programming in the code. In addition to the MPI Library option, the code itself must include the MPI header directives in the code preface (code beginning), "include 'mpif.h'" for Fortran 90 and "#include <mpi.h>" for the C family of programming languages. Both the MPI Library and MPI Include statements are needed.

    • "-lelan": allows parallel communication between the compute nodes by linking to the ELAN Library.

    • "-arch ev67": allows the parallel execution to be tuned to the TCS AlphaServer chip in the ES40/EV67 series from Hewlett Packard (HP).

    • "-lm": allows the use of the UNIX Math Library in conjunction with the math.h header file (if you are not using math functions, then you do not need them, but you need both library and header if you do).

    • "-o [executable]": names the output executable object file "[executable]"; if this option is missing, the executable is given the generic default name "a.out". Execution of the executable is by the massively parallel envelope run command prun.

     

    PRUN Parallel Run Command:

      prun -N [Number_Nodes] -n [Number_Processors] [executable] < [data]
    where "< [data]" means the data file is directed into standard UNIX input. An executable can not run in parallel without "prun". Usually, the number of nodes "[Number_Nodes]" and the number of processors "[Number_Processors]" are specified by the local meta environmental variables, ${RMS_NODES} and ${RMS_PROCS}, respectively, since both must be initially set by a PBS statement in the QSUB script or in the options of the "qsub" command, which automatically initializes the meta variables. See "man prun" for help from the UNIX manual pages.

     

    TCS Batch Queueing Systems: PBS and NQS with MPI.

    NQS Job Scripts:
    Remote job scheduling on the TCS is accomplished by using the UNIX Network Queueing System (NQS) job scripts, but the script directives use the so-called Portable Batch System (PBS) Directives used on the HP Alphaservers, in place of the usual NQS Directives. A sample PSC TCS 4 processor target job script for C code is given in

    and a sample PSC TCS 4 processor job script for Fortran 90 code is given in

    The new user should study these sample job scripts and others listed on the class homepage:

     

    Executable Job Scripts:
    Before any job script can be used as an argument of the qsub command, the job script must be made executable for all, e.g., using the UNIX change mode command:

      chmod 755 cpgm4.job
               or
      chmod a+x cpgm4.job
           for C Languages

               Else

      chmod 755 fpgm4.job
               or
      chmod a+x fpgm4.job
             for Fortran 90

    where in the second form, the files should already be readable (r).

     

    NQS qsub Submit Command:

    These job scripts are run with the NQS QSUB submit command from the user's `${SCRATCH}' scratch directory, for example,

      qsub cpgm4.job
         for C Languages
             or
      qsub fpgm4.job
         for Fortran 90
    where `${SCRATCH}' denotes the meta-name of the user's scratch directory on the TCS cluster. See "man qsub" for help from the UNIX manual pages.

     

    NQS qstat Status Command:

    The job status can be checked by the NQS QSTAT status command:

      qstat -u [tcs-username]
    and when done, the user can view the output if any. Under the table heading called "S", for example, "Q" means that the job is queued waiting to run, "R" means running, and "E" means exiting. See "man qstat" for help from the UNIX manual pages.

     

    NQS qdel Delete Command:

    If for any reason you need to kill the job before the end, first note the job id number `[job_id]' at the beginning of your job line in the "qstat -u [tcs-username]" output, then enter the command:

      qdel [job_id]
    which should stop a running job, unless the system is busy. See "man qdel" for help from the UNIX manual pages.

     

    Job Script Examples:

    A user can try out the class sample NQS QSUB job scripts by downloading and copying one of the following sample codes

    • Pi calculation for any number of processors and MPI, but needs cdata:
      • pi_mpi.c, C version;
      • pi_mpi.f, F90 version;
      • cdata, sample data file for the Pi code that also works as a dummy data file for the Trap and Laplace Codes below;
    • Trapezoidal Rule calculation for any number of processors and MPI:
    • Laplace Equation Iteration for 4 Processor - 4 Subdomain MPI:

    to your home directory and then recopying it, say "[ExampleCode].c" or "[ExampleCode].f" to the recyclable source file of the form `*pgm.*' as follows:

      cp [ExampleCode].c    cpgm.c
               or
      cp [ExampleCode].f    fpgm.f
    for C or F90, respectively.

    The user will also have to create a simple input data file called "cdata", or use the Pi Code example data file, since the qsub scripts are written to take a data file as standard input (e.g., use the editor "vi" to revise the set of integration points in cdata, terminated by zero). Then, in the home directory, enter the queue submit command for 4 processors on a single node:

      qsub cpgm4.job
               or
      qsub fpgm4.job

    then check for a finished job with "qstat -u [psc-username]" until your queue record is no longer displayed, and finally look for the standard output and standard error files, for example with "ls -l *pgm4.output *pgm4.error". You can always modify the sample job scripts to suit your particular job requirements, your own file naming preferences or if you prefer to open and close files in the code by hand.

     

    TCS Message Passing Interface (MPI) Sources.

    MCS 572 Class MPI webpages:

    PSC MPI Basics:

    The Cray native SHMEM communication library is also available, but, like ELAN, it is optimized only for communication between nodes and not within a node:

    OpenMP is supported in Tru64 UNIX for C and Fortran, but not C++:

     

    More TCS Information.

    For TCS information from PSC, see

     


    Guide Notation.

    This local-guide is meant to indicate ``what works'' primarily for access from UNIX systems to PSC TCS. The use of the Unix C-Shell on the TCS is assumed throughout most of this local guide.

    UNIX is a trademark of AT&T.

    Computer prompts or broadcasts will be enclosed in double quotes (``_''), background comments will be enclosed in curly braces ({_}), commands cited in the comments are highlighted by single quotes or double quotes depending on emphasis (`_') or ("_") {do not type the quotes when typing the commands}, and optional or user specified arguments are enclosed in square brackets ([_]) {However, do not enter the square brackets.}. The symbol (CR) will denote an immediate carriage return or enter. {Ignore the blanks that precede it as in `[command] (CR)', making it easier to read.} The symbol (Esc) will denote an immediate pressing of the Escape-key {Use no brackets please.} The symbol (SPACE) will denote an immediate pressing of the Space-bar {Warning: Do not type any of these notational symbols in an actual computer session.}



    Background References

    For further information, please consult the sources (you can just click on the highlighted topics to access if you are surfing the World Wide Web):

    1. Professor Hanson's MCS 572 Introduction to Supercomputing Home Page provides a large variety of links to useful supercomputing information.

    2. Pittsburgh Supercomputing Center (PSC) Home Page on the World Wide Web permits the direct search of the PSC public web information directories.

    3. Resources Available to PSC Users Page.

    4. Pittsburgh Supercomputing Center Hardware.

    5. TCS Prototype Cluster (MCS 572 class cluster).

    6. Lemieux TCS Final Cluster.

    7. Terascale Computing System (TCS) FAQ.

    8. TCS Compaq AlphaServer SC System Overview, HP Workshop notes (you may need an AFS password for these nice PowerPoint Slides and will need a MS PowerPoint Viewer for MS Windows).

    9. Kai Hwang's "Basic Network-Based Cluster Computing,".

    10. Raj Buyya's Cluster Computing Information Center

    11. Rajkumar Buyya (editor) and Hai Jin (slide author), High Performance Cluster Computing: Architectures and Systems -- Lecture Notes for selected chapters (MS PowerPoint Viewer needed, but available free on-line).

    12. Raj Buyya's Trends in Cray Supercomputer versus Killer Micros.

    13. Raj Buyya's "Cluster Computing Architecture,".

    14. PSC UNIX overview.

    15. Compaq DTKS C Overview for TCS.

    16. HP Fortran User Manual for Tru64 UNIX for TCS.

    17. Getting Started With MPI: A Message Passing Interface for Parallel Programming: An Introduction to MPI at SDSC.

    18. MCS572 MPI General Information Page.

    19. MPICH Reference Card

    20. MCS572 TCS Cray MPI Example Page.

    21. OpenMP Resource Page.

    22. man [command] (CR), when invoked in a UNIX-like system such as UNICOS, produces an on-line listing of the manual pages on the command [command], or similar function.

    23. Consultation concerning problems related to using the Crays can be obtained from Professor Hanson {718 SEO, X3-2142, hanson@uic.edu}. It is recommended that Professor Hanson contact TCS consultants for this class, if they are necessary.


     


    MPI Message Passing Programming on TCS.


    MPI or Message Passing Interface is a library of subroutines in Fortran (procedures in C) that facilitate the message passing form of parallel programming in a distributed computer or network environment. At NPACI, MPI is especially useful for writing parallel programs for the Cray T3E (T3E) massively parallel processors. Eventually, MPI will replace PVM, but currently there is more information about PVM than about MPI. MPI is more abstract and complicated than PVM, since a lot of the features of MPI are hidden behind its functions and its own compile and execution commands. For relevant information on MPI, consult the following pages, especially the example page:

     

    UNDER RECONSTRUCTION


    UNIX Command Dictionary.


     

    UNICOS T90 Fortran90 (f90) Compile, Load and Execution Commands


      f90 -r3 -[other options] [source].f [other source files] (CR) : Compiles source file `[source].f' and `[other source files]' with the Cray level 3 report compiler option `-r3' both with the default full optimization (`noaggress bl noinline recurrence norecursion scalar vector ....... nozeroinc'), producing an object file `[source].o' and compiler annotated listing file `[source].l' with vectorization information:

      Marking                       Meaning 
         S          scalar loop optimization (major marker)
         V          vector optimization (major marker)
         P          Parallel optimization (major marker)
         Vs         short vector optimization
         W          unwound     (major marker) {short inner-most loops with trip 
                    counts of not more than 5 are collapsed or transformed to single
                    statements so that the next inner-most loop can be vectorized
                    provided there are no dependencies}
         b          bottom loading     {pre-fetching is used for the next
                    iteration of scalar loops, only and `-o nobl' kills it} 
         c          conditionally vectorized, {subject to run-time
                    determination of recurrence vector length}
         k          kernel scheduling
         i          unconditionally vectorized with IVDEP
         r          loop unrolling     {a set of loop iterations is
                    collapsed into one iteration that has been enabled by the `-e'
                    enabling option with its `m' loop marking sub-option}
         D          delete loop
      
      Use `-emx' in place of `-em' if you want a cross reference listing also. Use the `-b [binfile]' option to name the object file with a name other than the default `[source].o' name. Use `-o aggress' to turn on a more aggressive form of optimization, but be careful of the results. Use `-o inline' or `-I [inline-source]' to get inlining of subprograms to avoid their overhead. Use the compiler directives `NORECURRENCE' or `IVDEP' and `RECURRENCE' to turn off and on the optimization of loop recurrences. Use `-o recursion' to enable subprograms to be recursive. Use `-o zeroinc' if zero increments of do loops indices or constant increment variables (CIV) are used, because the default assumes there are none. Use `segldr' command to load the execution module, which then can be used to execute the program. See below and the last section for more on the options. It is much better to use makefiles for such commands.

      f90 -eS [source].f (CR) : Creates a Cray Assembly Language (CAL) file or calfile named `[source].s' for the Fortran program `[source].f' that can be used with the Cray Assembler or to determine how the Cray compiler has carried out the optimization, particularly how it has used the vector registers. The option `[name].s' can be used to name the calfile with something other than the default name. No object or binary file `[source].o' is produced, and a nasty message will be given instead.
      f90 -g [source].f (CR) : Compiles the f90 and generates a symbol table for the debugger, like `cdbx' (use `man cdbx'). See also `-G debug_lvl', where `-G 0' is the same as `-g'.
      segldr -o [executable-file] -l [library-list] [source].o (CR) : This segment loader links and loads the object module `[source].o' from the `f90' step into the execution module named `[executable-file]' by the `-o' option. Without the `-o' option, the executable is the standard `a.out' file. The library option may not be needed because many libraries are searched by default: Pascal (libp.a), I/O (libio.a), utility (libu.a), Fortran (libf.a), C (libc.a), Math (libm.a), and Science (libsci.a). Numerical Recipes in Fortran or C of Press et al. are not directly available in UNICOS.
      f90 [-options] -o [executable] [source].f (CR) : The `f90 -o [executable]' command combines both `f90' compile and `segldr' load functions in one command; e.g.,
      f90 -limsl [source].F (CR) : This Fortran90 parallel form is for using the IMSL mathematical and statistical library; if more than one processor is used, then `setenv NCPUS [nn]' must be executed first, where `[nn]' is the number of CPU's requested. For more information, click on:
      [executable-file] < [input-file] > [output-file] & (CR) : Executes the executable module taking input from the file `[input-file]' and redirecting output to `[output-file]' as a background process.


    UNICOS C Language Commands


      cc -o run [file].c (CR) : Compiles source [file].c, using the standard C compiler `scc2.0' and producing an executable named run. In place of `cc', use `scc3.0' or `scc' for the latest version of standard C or `pcc' for portable C.
      cc -c [file].c (CR) : Compiles source [file].c, using the standard C compiler `scc2.0' and producing an object file named [file].o.
      cc -hnoopt -o run [file].c (CR) : Compiles source [file].c, using the standard C compiler `scc3.0' and producing an executable file named run without scalar optimization or vector optimization, while `hopt' enables scalar and vector optimization. Some other optimization related options are `-hinline' for inlining while `-hnone' is the default no inlining, `-hnovector' for no vector (vector is the default), and `-h listing' for a pseudo-assembler (CAL) listing. Some standard C options are `-htask3' for automatic parallelization (autotasking in "crayese") and `-hvector3' for more powerful vector restructuring. Other `-h' suboptions are `ivdep' for ignore vector dependence and `-hreport=isvf', which generates messages about inlining (i), scalar optimization (s) and vector optimization (v), with (f) writing the same messages to `[file].v'. A commonly used form will be
      cc -o run -h report=isvf [file].c (CR)

      See `man cc' or `docview' for more information.
      #define fortran : Form of C header statement to permit the call to a fortran subroutine from a C program. For example:

      #include <stdio.h>
      #include <fortran.h>
      #define fortran
      main()
      {
               fortran void SUB();
               float x = 3.14, y;
               SUB(&x, &y);
               printf("SUB answer: y = %f for x = %f\n", y, x);
      }
      

      #pragma _CRI [directive] : Form of C compiler directive placed within the C code, where some example directives are `ivdep' for ignoring vector dependence, `novector' for turning off the default vectorization, `vector' for turning it back on, `inline' for procedure inline optimization, `shortloop', `noreduction', `getcpus [p]', `relcpus', `parallel ........', and `end parallel'. See `vector directives' for instance in `docview' for more information and examples.
      segldr -o [executable-file] -l [library list] [source].o (CR) : This segment loader links and loads the object module `[source].o' from the `cc -c' pure compile step into the execution module named `[executable-file]' by the `-o' option. Without the `-o' option, the executable is the standard `a.out' file. The library option may not be needed because many libraries are searched by default: Pascal (libp.a), I/O (libio.a), utility (libu.a), Fortran (libf.a), C (libc.a), Math (libm.a), and Science (libsci.a). Numerical Recipes in Fortran or C of Press et al. are not directly available in UNICOS.
      [executable-file] < [input-file] > [output-file] & (CR) : Executes the executable module taking input from the file `[input-file]' and redirecting output to `[output-file]' as a background process.


    UNICOS Performance Commands


      Cray Prof Profiling Facility:


        Cray Error Explaining Command:

          explain [error-message-code] (CR) : Elaborates on the command error message '[error-message-code]' for many commands; use `man explain' for a complete list.

        Cray Job Accounting (ja) Command:

          ja (CR)
          [executable] (CR)
          ja -csf (CR)
          : This command sequence enables Job Accounting storing the information in a file of the form `.jacct[jobid]', with options `c' giving a command report, `f' giving a command flow report, `s' giving a multitasking breakdown summary report. Note that the NPACI service unit charges are approximately one cpu hour on the T90 and one element hour on the T3E, assuming average memory (about 16MW) usage. Caution: In general, parallel processing on the YMP series like the T90 is very expensive.

        Cray Perftrace (perf) or Performance Trace Facilities:

          f90 -ef [source].f (CR)
          segldr -l perf [source].o (CR)
          a.out > [source].perf (CR)
          segldr - l perf perf[n] [source].o (CR)
          a.out >> [source].perf (CR)
          : Compiles the FORTRAN 77 program `[source].f' for use for the Cray Perftrace or Performance Trace facilities. (Flowtrace results are similarly found in the output of the executable file executed after loader statement.) The library suboption here is `perf' for referencing the libperf.a library, which has several levels, where `[n]' is the level `', `1', `2' or `3'.

        Cray Hardware Performance Monitor (hpm):

          hpm -g[n] -d [executable] > [source].hpm[n] (CR) : Simulates the Hardware Performance Monitor with `[executable]' and level `[n]' = `0' (scalar activity), `1' (hold issue conditions), `2' (memory use), or `3' (instruction and vector operations). The option `-d' means that a dedicated machine is simulated.

        Cray JumpTrace (jt) and JumpView (jumpview): JumpTrace and JumpView help gather performance statistics in the form of a report. Some use examples are:

          Fortran Example:
          f90 -ef [pgm].f
          jt ./a.out
          jumpview
          

          C Example:
          cc -ltrace -Gp [cpgm].c
          jt ./a.out
          jumpview -Luch >[cpgm].listing
          

          JumpView Main Menu:
          ----------------------------------------------------- MAIN MENU
          1  Master Summary          |  7  List by Average Time/Call
          2  Routines: List by Time  |  8  Operating Environment
          3  List by Megaflops       |  9  Long Report by Routine Name
          4  List by In-Line Factor  | 10  Detail Report by Symbol
          5  List by Name            | 11  Detail Report by Block
          6  List by Calls           | 12  Options
          ----------------------------------------
            H  HELP
            Q  QUIT
                    Enter Number/Letter of Action Desired
          ---------------------------------------------------------------
          

        Cray Autotasking Expert Performance System (atexpert):

          atexpert [options] (CR) : Autotasking expert performance system, needing X-windows display for full power. See also `atchop' and `atscope'.


    UNICOS makefile Commands


      make [-options] [step-name] (CR) : Makes the files [files] according to the template in the `makefile'. E.g., the file `makefile.unicos_2':
      # Use ``make -f makefile.unicos_2 mrun >& pgm.l &; run < data > out''.
      SOURCES = pgm.f
      OBJECTS = pgm.o
      FLAGS = -em
      mrun : $(OBJECTS)
      	segldr -o run $(OBJECTS)

      .f.o :
      	f90 $(FLAGS) $*.f
      
      {CAUTION: The commands, like `segldr' or `f90', must be preceded by a `Tab-key' tab as a delimiter, but the tab will not be visible in the UNIX listing.}
      fmgen -m [make-name] -c f90 -f [-flag] -o [executable] [source].f (CR) : Automatically generates a makefile for compiling under the `f90' compiler and loading up the executable file named `[executable]'. Invoke with `make -f [make-name] [executable](CR)' and then execute `[executable]'. Also produces steps for profiling, flow-traces, performance traces, and clean-up, in the heavily documented makefile. For example, `fmgen -c f90 -f -r3 -o run pgm.f (CR)' produces a makefile named `makefile', executable named `run', an information listing named `[name in program statement].l' with loops marked by optimization type, etc.; the making is done with `make run (CR)'. Caution: the makefile uses the source name only when that coincides with the name used in the Fortran `program' statement, and only one type of `f90' flag can be used. These flaws can be corrected by editing the resulting makefile `[make-name]'.


    UNICOS Mail Commands


      mailx (CR) : Shows user's mail; caution: `mailx' is close to the usual Unix mail, whereas the UNICOS `mail' command is NOT; use the subcommand `t [N] (CR)' to list message number `[N]'; `s [N] mbox (CR)' to append message `[N]' to your mailbox `mbox' file or `s [N] [file] (CR)' to append `[N]' to another file; `e [N] (CR)' to edit number [N] or look at a long file with `ex' {see Section on `EX' below}; `v [N] (CR)' to edit number [N] or look at a long file with `vi'; `d [N] (CR)' deletes {your own mail!} `[N]'; `m [user] (CR)' permits you to send mail to another account `[user]'; a `~m [N] (CR)' inside the message, after entering a subject, permits you to forward message `[N]' to `[user]'; `\d (CR)' ends the new message {see the send form below}; `x (CR)' quits `mailx' without deleting {use this when you run into problems}; and `q (CR)' quits.
      mailx [user] (CR) : Sends mail to user `[user]'; the text is entered immediately in the current blank space; carriage return to enter each line; enter a file with a `~r[filename] (CR)'; route a copy to user `[userid]' by `~c[userid] (CR)'; enter the `ex' line editor with `~e (CR)' or `vi' visual editor with `~v (CR)' (see Sections on EX and on VI) to make changes on entered lines, exiting `ex' with a `wq (CR)' or `vi' with a `:wq (CR)'; exit `mailx' by entering `\d (CR)'. {A bug in the current version of Telnet does not allow you to send a copy using the `cc:' entry.}
      mailx [name]@[machine].[dept].uic.edu < [filename] (CR) : Sends the UNICOS file `[filename]' to user `[name]' on some UNIX or other machine.


    UNICOS Network Queueing System (NQS)


      qsub [options] (CR) : Submit a batch job to the queue; see `man qsub (CR)' for more information. For example, the option `-lM [16Mw]' permits running jobs with up to 16 megawords of memory. The argument `[myjob].script' provides the script instructions for running a background job. Note that NPACI users must specify, instead of an option of `qsub', the script line
        #QSUB -lM [memory-amount]
      which specifies a memory of `[memory-amount]' bytes for a job, using `Mw' to denote megawords; and also required is
        #QSUB -lT [CPU-time-amount]
      specifying the amount of CPU (user plus system) time in seconds. In addition, T3E users must also specify
        #QSUB -l mpp_p=[t3e_procs],mpp_t=[t3e_time]
      giving the number of T3E processors and time on the T3E; and also
        #QSUB -q mpp
      giving the T3E queue name `mpp', whereas the default queue is `batch'. (Caution: you must be in the `mpp' group to use this queue, but you can check it by the command
        grep [username] /etc/group (CR)
      on the T90.)

      For more information about batch processing with NQS, click on:


      qstat [options] (CR) : Display status of queued batch jobs; see `man qstat (CR)' for more information.
      /mpp/bin/mppstat (CR) : Not an NQS command, but displays the current T3E configuration and the number of available processors (PEs).
      /usr/local/adm/access/bin/qstatmpp (CR) : Not an NQS command, but displays the currently queued T3E jobs.


    T90 Fortran90 (f90) and other Extensions

    For optimization, it is recommended that your f90 program follow the f90 vector model, i.e., structure the code so that the compiler can automatically recognize loops as vectorizable. Usually only the innermost loop is vectorizable. Avoid GOTOs and IFs in loops. Avoid CALLs within loops. Avoid READs and WRITEs in loops. Use vectorizable functions. Avoid data dependencies. Use compiler directives, such as `!DIR$ VECTOR' and `!DIR$ NOVECTOR'. Minimize vector strides. Tune code to the Fortran column-wise storage order in the physically linear memory. Don't even think about using tabs, except in makefiles.
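    The loop-ordering advice above can be illustrated with a small sketch. It assumes only standard Fortran 90; the program name `colmajor' and the array sizes are made up for illustration, and nothing here is specific to the T90 compiler.

```fortran
      program colmajor
! Sketch only: Fortran stores arrays column by column, so the
! first subscript should vary fastest in the innermost loop.
      integer, parameter :: m = 4, n = 3
      integer :: i, j
      real, dimension(m,n) :: a, b
      do j = 1, n
         do i = 1, m
            b(i,j) = real(i + 10*j)
         end do
      end do
! Inner loop runs over i: stride-1 access, and no IF, CALL or
! I/O statement inside the loop body.
      do j = 1, n
         do i = 1, m
            a(i,j) = 2.0*b(i,j) + 1.0
         end do
      end do
! Array syntax states the same operation with no loops at all.
      if (any(a .ne. 2.0*b + 1.0)) stop 'mismatch'
      print *, 'a(m,n) =', a(m,n)
      end program colmajor
```

    The last element printed should be 2*(4+30)+1 = 69. Swapping the two DO statements gives the same answer but steps through memory with stride `m', which is the pattern the advice above says to avoid.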


    T90 Fortran90 (f90) Compiler Options

    See also Section ``Execution of Cray T90 Fortran90 (f90)'' and Subsection ``T90 UNICOS f90 Compile, Load and Execution Commands''. Also see the appropriate sections, `docview' and `man cc' for items on Cray Standard C.


    T90 Fortran90 (f90) Miscellaneous Extensions


      ``FORTRAN90 Array Notation'' {f90 allows Fortran90 extensions for arrays, permitting array statements like `AS = S', `C = A + B', `A(1:50) = B(1:100:2)' for appropriately dimensioned arrays AS, A, B and C, and scalar S (i.e., like AS(i,j) = S, for all i and j within subscript bounds); in general `A([start]:[end]:[step])' references the single subscript array section for i = [start] to [end] in steps of [step]. Other examples are `a(i,:)' for the i-th row of array `a', `a(:,j)' for the j-th column, `a(1::2)' for the odd vector elements, `a(n:1:-1)' for the `n' vector elements of `a' in reverse order, and `z(1:n) = -log(z(1:n))' or `z(1:n) = ranf()'.}
      real [variables-list] {The f90 `real' declaration declares variables and array elements as 32-bit (4-byte) words with only 23 bits allotted to the fraction for IEEE precision. This is somewhat different from the old non-IEEE precision Cray where real meant an 8-byte or 64-bit real. Thus in f90 code, use the built-in functions `abs', `sqrt', `exp', `amax1' and so forth. The IEEE precision f90 `double precision' declaration is 64-bit with a 52-bit fraction (53 bits of precision counting the implicit leading bit), and hence is entirely different from old non-IEEE precision Cray `double precision'.}
      POINTER (P,A) {The f90 `pointer' statement declares that the declared integer (usually) variable holds (points to, for C-fans) the shifted initial (base) address of the declared array A.}
      ``Execution Time Allocation'' {f90 allows execution time storage of temporary arrays within subprograms, rather than at compile time; means that f90 will be less sensitive to array bounds over-runs.}
      open ([unit],file=`[fn]',status='unknown') {Format of f90 OPEN statement assigning unit number [unit] to filename [fn]; place in program after declarations; [unit] = 5 defaults to UNIX `stdin' as does [unit] = * for read statements or reads from the terminal unless it is redirected by an `open' or a `<'; [unit] = 6 defaults to UNIX `stdout' as does [unit] = * for write statements or writes to the terminal unless it is redirected by an `open' or a `>'; [unit] = 0 defaults to UNIX `stderr' or writes diagnostics to the terminal unless it is redirected by an `open' or a `>&'; note that file names are placed in quotes in the OPEN statement; see also `man' for UNICOS `assign' and `env' statements. }
      save [variable or array name list separated by commas] {The save statement is essential in f90 subroutines to save parameter variable values for later calls to a subroutine; the `-ev' option of f90 provides a better solution to this problem; if not used, this can lead to logic errors, especially for users accustomed to F66 Fortran, in which variables are saved after the RETURN statement is executed, whereas they are lost in f90.}
      recursive [function or subroutine]([subprogram arguments]) {The 'recursive' prefix is required on subprograms called recursively, but also the recursive suboption is needed in the compiler statement.}
      [statement] ! [embedded comment] {The line embedded comment is now legal in Cray Fortran.}
      intrinsic [f90-function1][,[f90-function2]] {An `intrinsic' statement is needed in `f90' to declare any Fortran90 intrinsics, such as ANY, DOT_PRODUCT, MAXVAL, RESHAPE, ALL, EOSHIFT, MINLOC, SPREAD, COUNT, FLOAT, MINVAL, SUM, CSHIFT, MATMUL, PACK, TRANSPOSE, MAXLOC, PRODUCT, UNPACK.}
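    The array section notation described in this list can be tried with the following minimal sketch. It assumes a standard Fortran 90 compiler; the program and variable names are illustrative only. Sections are copied into temporaries before printing, in keeping with the caution elsewhere in this guide that the T90 f90 does not accept array sections in print statements.

```fortran
      program sections
! Sketch only: array sections with start:end:step subscripts.
      integer, parameter :: n = 6
      integer :: i
      integer, dimension(n) :: v, r
      integer, dimension(n/2) :: odd
      integer, dimension(2,3) :: a
      integer, dimension(3) :: row1
      integer, dimension(2) :: col2
      v = (/ (i, i = 1, n) /)
      a = reshape((/1,2,3,4,5,6/), (/2,3/))
! Odd-numbered elements of v.
      odd = v(1::2)
! All n elements of v in reverse order.
      r = v(n:1:-1)
! First row and second column of the 2 x 3 array a.
      row1 = a(1,:)
      col2 = a(:,2)
      print *, 'odd  =', odd
      print *, 'rev  =', r
      print *, 'row1 =', row1
      print *, 'col2 =', col2
      end program sections
```

    With `a' filled column by column from 1 to 6, the values are odd = 1 3 5, rev = 6 5 4 3 2 1, row1 = 1 3 5 and col2 = 3 4.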


    Fortran90 Array Construction Functions


      PACK([array],[mask-array][,[vector]]) {Transforms (packs) the elements of the array `[array]' selected by the true values of the `[array]'-conformable, logical mask `[mask-array]' into a vector returned as the value of the function; the optional argument `[vector]', if present, sets the length of the result and supplies its trailing elements. }
      UNPACK([vector],[mask-array],[field-array]) {Transforms (unpacks) the vector `[vector]' into an array returned as the value of the function, placing its elements at the true positions of the logical mask `[mask-array]' and taking the remaining elements from the mask-conformable `[field-array]'. }
      SPREAD([array],[dim],[ncopies]) {Transforms (spreads) the source array `[array]' into the output value of the function with `[ncopies]' copies along the dimension `[dim]' (horizontal copies if `[dim]'=1 and vertical if `[dim]'=2). }
      RESHAPE([array],[shape][,[pad]][,[order]]) {Transforms (reshapes) the source array `[array]' into the output value of the function with shape `[shape]', filling subscripts in the order `[order]' and padding with elements of the array `[pad]' if `[array]' is too small. }
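    As a hedged illustration of the four construction functions, the sketch below uses only positional arguments and standard Fortran 90 semantics; the program name and the test values are assumptions made for the example, and behavior on the T90 f90 has not been checked here.

```fortran
      program construct
! Sketch only: pack, unpack, reshape and spread on small arrays.
      integer, dimension(2,3) :: b, u, s
      logical, dimension(2,3) :: mask
      integer, dimension(3) :: p
      integer, dimension(3,2) :: r
      b = reshape((/1,2,3,4,5,6/), (/2,3/))
      mask = b .gt. 3
! pack gathers the selected elements in array element order.
      p = pack(b, mask)
! unpack scatters p to the true positions; 0 fills the rest.
      u = unpack(p, mask, 0)
! reshape refills the six elements, column by column, as 3 x 2.
      r = reshape(b, (/3,2/))
! spread stacks 2 copies of a 3-vector along a new dimension 1.
      s = spread((/7,8,9/), 1, 2)
      print *, 'p =', p
      print *, 'u =', u
      print *, 'r =', r
      print *, 's =', s
      end program construct
```

    Printed in storage (column-wise) order, the values are p = 4 5 6, u = 0 0 0 4 5 6, r = 1 2 3 4 5 6 and s = 7 7 8 8 9 9.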

    Fortran90 Array Reduction Functions

    The reduction functions reduce the input to a scalar output, or to an array of one lower dimension when `[dim]' is given.


      SUM([array][,[dim][,[mask]]]) {The `SUM' function computes the sum of the elements of the array `[array]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2) according to the true values in the conditional mask `[mask]', if present. This function makes the Cray sum function the same as the Connection Machine version. }
      PRODUCT([array][,[dim][,[mask]]]) {The `PRODUCT' function computes the product of the elements of the array `[array]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2) according to the true values in the conditional mask `[mask]', if present. }
      MAXVAL([array][,[dim][,[mask]]]) {The `MAXVAL' function computes the maximum value of the elements of the array `[array]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2) according to the true values in the conditional mask `[mask]', if present. }
      MINVAL([array][,[dim][,[mask]]]) {The `MINVAL' function computes the minimum value of the elements of the array `[array]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2) according to the true values in the conditional mask `[mask]', if present. }
      COUNT([mask][,[dim]]) {The `COUNT' function computes the number of the true elements of the logical array `[mask]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2), if present. }
      ANY([mask][,[dim]]) {The `ANY' function computes if there are any true elements in the logical array `[mask]' along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2), if present, and returns a logical true or false answer. }
      ALL([mask][,[dim]]) {The `ALL' function determines whether all elements of the logical array `[mask]' are true along the dimension `[dim]' (by columns if `[dim]'=1 or by rows if `[dim]'=2), if present, and returns a logical true or false answer. }
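    The logical reductions COUNT, ANY and ALL are not exercised in the test code of this guide, so here is a minimal sketch under standard Fortran 90 assumptions (program and variable names are illustrative). With `[dim]'=1 each function returns one result per column.

```fortran
      program reduce
! Sketch only: count, any and all, with and without dim.
      integer, dimension(2,3) :: b
      logical, dimension(2,3) :: big
      integer :: nbig
      integer, dimension(3) :: ncol
      logical, dimension(3) :: anycol, allcol
      b = reshape((/1,2,3,4,5,6/), (/2,3/))
      big = b .gt. 3
! Whole-array count, then reductions along dimension 1.
      nbig = count(big)
      ncol = count(big, 1)
      anycol = any(big, 1)
      allcol = all(big, 1)
      print *, 'nbig   =', nbig
      print *, 'ncol   =', ncol
      print *, 'anycol =', anycol
      print *, 'allcol =', allcol
      end program reduce
```

    For b = 1 3 5 / 2 4 6 the values are nbig = 3, ncol = 0 1 2, anycol = F T T and allcol = F F T.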

    Fortran90 Array Manipulation Functions

    The manipulation functions rearrange the elements of the target matrix.


      TRANSPOSE([array]) {The `TRANSPOSE' function transposes the 2-subscript array `[array]' with the result array of reversed dimensions. }
      EOSHIFT([array],[shift][,[boundary][,[dim]]]) {The `EOSHIFT' function does an end-off shift on the array `[array]' along the dimension `[dim]' using the boundary value(s) `[boundary]' to fill in, if necessary. Caution: Connection Machine arguments have a different order. }
      CSHIFT([array],[shift][,[dim]]) {The `CSHIFT' function does a circular shift on the array `[array]' along the dimension `[dim]'; elements shifted off one end re-enter at the other, so no boundary value is needed. Caution: Connection Machine arguments have a different order. }

    Fortran90 Array Location Functions

    The location functions find the location of elements of the target matrix.


      MAXLOC([array][,[mask]]) {The `MAXLOC' function returns the subscripts of the first element of target array `[array]' having the maximum value, relative to the conditional mask `[mask]', if present. }
      MINLOC([array][,[mask]]) {The `MINLOC' function returns the subscripts of the first element of target array `[array]' having the minimum value, relative to the conditional mask `[mask]', if present. }
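    The location functions return an integer vector with one subscript per dimension of the array. The sketch below assumes standard Fortran 90 behavior and passes the mask positionally, matching the argument lists above; the program name and values are illustrative.

```fortran
      program locate
! Sketch only: maxloc and minloc return subscript vectors.
      integer, dimension(2,3) :: b
      integer, dimension(2) :: imx, imn, imk
      b = reshape((/1,2,3,4,5,6/), (/2,3/))
      imx = maxloc(b)
      imn = minloc(b)
! Largest element among those less than 4.
      imk = maxloc(b, b .lt. 4)
      print *, 'maxloc(b)         =', imx
      print *, 'minloc(b)         =', imn
      print *, 'maxloc(b, b.lt.4) =', imk
      end program locate
```

    For b = 1 3 5 / 2 4 6 the subscripts are (2,3) for the maximum 6, (1,1) for the minimum 1, and (1,2) for the masked maximum 3.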

    Fortran90 Array Matrix Multiply Functions

    The matrix multiply functions compute the matrix products of the target matrices.


      MATMUL([array1],[array2]) {The `MATMUL' function computes the matrix product of target arrays `[array1]' and `[array2]' commensurate for multiplication, with the result matrix of appropriate size. This function is also used for matrix-vector multiplication. }
      DOT_PRODUCT([vector1],[vector2]) {The `DOT_PRODUCT' function computes the scalar, dot product of target vectors `[vector1]' and `[vector2]', with the scalar result. Caution: the Connection Machine function is `dotproduct'. }

    Fortran90 Array Functions TEST CODE

    T90 Fortran90 (f90) Differences:

    The following f90 code contains examples of use of many of the Fortran90 array intrinsic functions mentioned above. There are some rules:

    1. An `intrinsic' statement is needed for all f90 intrinsics within f90 codes.
    2. Constructors of the form b=(/1,2,3/) work with the f90 compiler.
    3. Fortran90 array intrinsics used within f90 will take no auxiliary markers or keywords like "dim=" or "mask=".
    4. Array sections cannot be used in print statements: NOT print*,b(1:3)
    5. How do you sum an entire array only subject to a mask, but with no dimension restrictions?
          If  b =  1  3  5            logical mask=b.gt.3
                   2  4  6
      
          then   s3=sum(b,1,mask)  or  s2=sum(b,2,mask) work when real s3(3),s2(2)
      
          but    isum=sum(b,mask)  or  isum=sum(b,,mask) or isum=sum(b,:,mask)
                 do NOT work.
          That is how do I enter a scalar dim for the whole array?
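    One possible answer to question 5, offered as a sketch rather than as tested T90 practice: standard Fortran 90 lets the mask be named by keyword with `[dim]' omitted, and where keywords are rejected (as reported above for this compiler) the mask can be applied first with `pack' or `merge', so that `sum' needs only one argument. The program name is illustrative.

```fortran
      program masksum
! Sketch only: three ways to sum a whole array under a mask.
      integer, dimension(2,3) :: b
      integer :: isum1, isum2, isum3
      b = reshape((/1,2,3,4,5,6/), (/2,3/))
! Standard form: name the mask by keyword and omit dim.
      isum1 = sum(b, mask = b .gt. 3)
! Keyword-free alternatives: apply the mask before summing.
      isum2 = sum(pack(b, b .gt. 3))
      isum3 = sum(merge(b, 0, b .gt. 3))
      print *, isum1, isum2, isum3
      end program masksum
```

    All three results should be 4+5+6 = 15. The `merge' form replaces the unselected elements by 0, so it works only for sums; for `product' the fill value would have to be 1.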
      
    Here is a sample T90 Fortran 90 code `pgm.f' = ` t90f90test.f' with many examples, heavily commented and followed by the actual output run on t90.npaci.edu using the commands
       f90 -O3 -r3 -o run pgm.f&
       run>&pgm.out&
    %%%%%%%%%%% pgm.f=t90f90test.f %%%%%%%%%
          program f90test
    code98:  compare ranf() and random_number pseudo random number generators
    code97:  update by removing old comments to cmfortran
    code96:  retest=f90test.f redone on borg = convex spp1200/xa-16
          integer, parameter :: m = 6
          integer, parameter :: n = 4
          integer :: i,j
          integer, dimension(2) :: s2, ctr1, ctr2, ctr3, b2
          integer, dimension(3) :: s3 ,at ,ar1 ,ar2 ,br1 ,br2
          integer, dimension(4) :: as(4)
          integer, dimension(2,2) :: c ,bi
          integer, dimension(2,3) :: b, a
          integer, dimension(3,2) :: ct
          integer, dimension(3,4) :: cs
          integer, dimension(4,3) :: cst
          logical, dimension(2,3) :: test
          logical, dimension(64,64) :: inmask
          real, parameter :: tol = 0.5e-5
          integer, parameter :: niter = 5000
          real :: diffav
          real, dimension(8,8) :: us
          real, dimension(64,64) :: u , du
          real :: ranf, xran
          real, dimension(m,n) :: uniranf, uniran
          real, dimension(n,m) :: truniranf, truniran
          intrinsic  sum,maxval,minval,product
         & ,dot_product,matmul,transpose
         & ,cshift,eoshift,spread
          data b/1,2,3,4,5,6/     !replace constructors initialization
          data as/2,3,4,5/
          data at/2,3,4/
    c --------------------Array Constructors:
           b(1,1:3) = (/1, 3, 5/)  ! initialize first row, along dimension 2.
           b(2,1:3) = (/2, 4, 6/)  ! initialize second row, along dimension 2.
          print*,'Note: constructors like "(/1,2/)" allowed in fc9.5'
          br1 = b(1,:)
          br2 = b(2,:)
          print60,br1,br2
    60    format(' b(2,3)'/(3i3))
    c --------------------Sum Function sum:
          isum = sum(b) ! => isum = 21; i.e., Front-End scalar.
          print61,' isum=sum(b)=',isum
    61    format(1x,a36,i4)
          isum = sum(b(:,1:3:2)) ! => isum = 14; sole ':' means all values '1:2'.
          print61,' isum = sum("b(:,1:3:2)")=',isum
          bi=b(:,1:3:2)
          isum=sum(bi)
          print61,' isum = sum("b(:,1:3:2)")=',isum
          print*,'CAUTION: "dim=", etc., markers= NOT allowed in intrinsics'
          s2 = sum(b,2) ! redeclared with the correct array section shape.
          print62,' s2 = sum(b,2)=',s2  ! => s2 = (/9,12/), row sums
    62    format(1x,a32,2i3)
          s3 = sum(b,1)  ! => s3 = (/3,7,11/); column sums.
          print63,' s3 = sum(b,1)=',s3
    63    format(1x,a32,3i3)
          print*,'CAUTION:  "mask=" marker= STILL not allowed either.'
          s3 = sum(b,1,b.gt.3) ! => s3 = (/0,4,11/); i.e., conditional col sum
          print63,' s3 = sum(b,1,"b.gt.3") =',s3  
          test=b.gt.3
          s3 = sum(b,1,test) ! => s3 = (/0,4,11/); i.e., conditional col sum
          print63,' s3 = sum(b,1,"b.gt.3") =',s3  
          s2 = sum(b,2,test) ! => s2 = (/5,10/); i.e., conditional row sum
          print62,' s2 = sum(b,2,b.gt.3) =',s2  
    cf8er:isum = sum(b,0,test) ! => isum = 15; i.e., add only elements
    cf8er:print61,' isum = sum(b,0,b.gt.3) =',isum ! that are greater than three.
          print*,' CAUTION:  If "sum(array[dim[,mask]])", CANT use zero (0)'
         &      ,' for [dim] for whole array when there is a mask.'
    c --------------------Maximum Value Function maxval:
          imax = maxval(b) ! => imax = 6; array maximum value.
          print61,' imax = maxval(b)=',imax
          s3 = maxval(b,1) ! => s3 = (/2,4,6/); column maximums.
          print63,' s3 = maxval(b,1)=',s3
          s2 = maxval(b,2) ! => s2 = (/5,6/); row maximums.
          print62,' s2 = maxval(b,2)=',s2
    c --------------------Minimum Value Function minval:
          imin = minval(b) ! => imin = 1; array minimum value.
          print61,' imin = minval(b)=',imin
    c --------------------Product Function product:
          s2 = product(b,2) ! => s2 = (/15,48/); products of row elements.
          print62,' s2 = product(b,2)=',s2
    c --------------------Dot Product Function dot_product:
          idot = dot_product(br1,br2) ! => idot = 44; dot product of row
          print61,' idot = dot_product(b(1,:),b(2,:))=',idot ! vectors of b.
          print*,' CAUTION:  Array syntax not allowed in actual arguments.'
    c --------------------Matrix Multiplication Function matmul:
          ! assuming array b of the previous section.
          ![Ans] = matmul([Array_1],[Array_2]) ! computes matrix multiplication
                                               ! of two rank two matrices.
          c = matmul(b(:,1:2),b(:,2:3)) ! => c(1,:)=(/15,23/);c(2,:)=(/22,34/).
          c=transpose(c)
          print623,'c=matmul(b(:,1:2),b(:,2:3))=',c
    623   format(1x,a36/(2i3))
          ![Ans] = transpose([Array]) ! transforms an array to its transpose.
          ct = transpose(b) ! => ct(1,:)=(/1,2/);ct(2,:)=(/3,4/);ct(3,:)=(/5,6/).
          ctr1 = ct(1,:)
          ctr2 = ct(2,:)
          ctr3 = ct(3,:)
          print623,'ct = transpose(b)=',ctr1,ctr2,ctr3
    c --------------------Circular Shift Function cshift:
            ! assume b is again initialized as
            !        b =  1 3 5
            !             2 4 6
          a = cshift(b,1,2)  ! => a = 3 5 1
                             !        4 6 2
    cshift  EG1:
          ar1 = a(1,:)
          ar2 = a(2,:)
          print633,'a = cshift(a,1,2)=',ar1,ar2
    633   format(1x,a36/(3i3))
        ! i.e., b(i,(j+shift) "mod" n) -> a(i,j) for j=1:2, etc.;
        ! nonstandard modulus fn: 0 "mod" n = n; 1 "mod" n = 1; ...;  n "mod" n = n
        ! i.e., the result is computed from shifting subscript in specified
            ! dimension of the source array by the specified shift.
          a = cshift(b,-1,2)  ! => a = 5 1 3
                              !        6 2 4
    cshift  EG2:
          ar1 = a(1,:)
          ar2 = a(2,:)
          print633,'a = cshift(b,-1,2)=',ar1,ar2
            ! i.e., b(i,(j+shift) "mod" n) -> a(i,j) for j=2:3, etc.
    cshift  EG3:
          s2(1) = 1
          s2(2) = 2
          a = cshift(b,s2,2)  ! a = 3 5 1
                              !     6 2 4
            ! i.e., an array-valued shift, or shift per row.
          ar1 = a(1,:)
          ar2 = a(2,:)
          print633,'a = cshift(b,(/1,2/),2)=',ar1,ar2
    cshift Laplace Example:
            ! Jacobi Iteration for a 5-star discretization of 
            !        2D Laplace's equation:
          u = 0
          u(1,:)=2
          u(64,:)=2
          u(:,1)=2
          u(:,64)=1
          inmask = .FALSE.
          inmask(2:63,2:63) = .TRUE.
          diffav = 1
          iter=0
          do while (diffav.gt.tol.and.iter.lt.niter)
             iter=iter+1
             du = 0
             where(inmask)
                du = 0.25*(cshift(u,1,1)+cshift(u,-1,1)+cshift(u,1,2)
         &          +cshift(u,-1,2)) - u
                u = u + du
             end where
             du = du*du
             diffav = sqrt(sum(du)/(62*62))
          end do 
            ! which is the main program fragment of laplace.fcm.
          print*,'CAUTION: array sections not allowed in print'
          us = u(1:64:9,1:64:9)
          us=transpose(us)
          print66,'u = laplace-shift(u)= ; iter=',iter,'; av-diff ='
         &       ,diffav,us
    66    format(1x,a36,i5,a11,e10.3/(8f8.4))
    c --------------------End Off Shift Function eoshift:
          a = eoshift(b,-1,0,1) ! a = 0 0 0 note default boundary value is 0.
                                !     1 3 5
          ar1 = a(1,:)
          ar2 = a(2,:)
          print633,'a = eoshift(b,-1,0,1)=',ar1,ar2
          s2=(/-1,0/)
          b2=(/7,8/)
          a = eoshift(b,s2,b2,2) ! => a = 7 1 3
                                 !        2 4 6
          ar1 = a(1,:)
          ar2 = a(2,:)
          print633,'a = eoshift(b,(/-1,0/),(/7,8/),2)=',ar1,ar2
          a = eoshift(b,2,0,2) ! => a = 5 0 0
                               ! =>     6 0 0
          ar1 = a(1,:)
          ar2 = a(2,:)
          print623,'a = eoshift(b,2,2)=',ar1,ar2
    c --------------------Spread Function spread:
          cs = spread(as,1,3)
             ! contents of cs:
             !        2 3 4 5
             !        2 3 4 5
             !        2 3 4 5
          cst = transpose(cs)
          print64,'as =',as
    64    format(1x,a32,4i3)
          print643,'cs = spread(as,1,3)=',cst
    643   format(1x,a36/(4i3))
    c --------------------
          cs = spread(at,2,4)
             ! contents of cs:
             !        2 2 2 2
             !        3 3 3 3
             !        4 4 4 4
          cst = transpose(cs)
          print63,'at =',at
          print643,'cs = spread(at,2,4)=',cst
    c ---------------------------------------------------------------------------
    ! i.e., b=spread(a,d,c)  =>
    ! a(n_1,n_2,...,n_(d-1),n_d,...,n_r) -> b(n_1,n_2,...,n_(d-1),c,n_d,...,n_r)
    ! where r is the rank of source array a and n_i is the size of dimension i;
    ! noting that a new dimension of size c is added before dimension d.
    c ---------------------------------------------------------------------------
    ! Initialize scalar xran with a pseudo random number
          call random_number(harvest=xran)
          call random_number(uniran)
    ! xran and uniran contain uniformly distributed random numbers
          truniran = transpose(uniran)
          write(6,65) xran, truniran
    65    format(' f90 uniform random_number(): xran =',f14.10/ 
         &   ' and f90 subroutine random_number() uniform random array:'
         &         /(4f14.10))
    ! standard UNICOS random number generator ranf:
          do i = 1, m
             do j = 1, n
                uniranf(i,j) = ranf()
               enddo
          enddo
          truniranf = transpose(uniranf)
          write(6,651) truniranf
    651   format(' UNICOS function ranf() uniform random array:'/(4f14.10))
          stop
          end
    %%%%%%%%%%% end pgm.f=t90f90test.f %%%%%%%%%
    

    Here is the output t90f90test.output:
    %%%%%%%%%%% begin pgm.output = t90f90test.output %%%%%%%%%
     Note: constructors like "(/1,2/)" allowed in fc9.5
     b(2,3)
      1  3  5
      2  4  6
                             isum=sum(b)=  21
                isum = sum("b(:,1:3:2)")=  14
                isum = sum("b(:,1:3:2)")=  14
     CAUTION: "dim=", etc., markers= NOT allowed in intrinsics
                       s2 = sum(b,2)=  9 12
                       s3 = sum(b,1)=  3  7 11
     CAUTION:  "mask=" marker= STILL not allowed either.
             s3 = sum(b,1,"b.gt.3") =  0  4 11
             s3 = sum(b,1,"b.gt.3") =  0  4 11
               s2 = sum(b,2,b.gt.3) =  5 10
      CAUTION:  If "sum(array[dim[,mask]])", CANT use zero (0) for [dim] for whole array when there is a mask.
                        imax = maxval(b)=   6
                    s3 = maxval(b,1)=  2  4  6
                    s2 = maxval(b,2)=  5  6
                        imin = minval(b)=   1
                   s2 = product(b,2)= 15 48
       idot = dot_product(b(1,:),b(2,:))=  44
      CAUTION:  Array syntax not allowed in actual arguments.
             c=matmul(b(:,1:2),b(:,2:3))=
     15 23
     22 34
                       ct = transpose(b)=
      1  2
      3  4
      5  6
                       a = cshift(a,1,2)=
      3  5  1
      4  6  2
                      a = cshift(b,-1,2)=
      5  1  3
      6  2  4
                 a = cshift(b,(/1,2/),2)=
      3  5  1
      6  2  4
     CAUTION: array sections not allowed in print
            u = laplace-shift(u)= ; iter= 4730; av-diff = 0.499E-05
      2.0000  2.0000  2.0000  2.0000  2.0000  2.0000  2.0000  1.0000
      2.0000  1.9762  1.9479  1.9090  1.8491  1.7440  1.5208  1.0000
      2.0000  1.9573  1.9068  1.8387  1.7387  1.5836  1.3402  1.0000
      2.0000  1.9469  1.8844  1.8014  1.6836  1.5141  1.2817  1.0000
      2.0000  1.9469  1.8844  1.8014  1.6836  1.5141  1.2817  1.0000
      2.0000  1.9573  1.9068  1.8387  1.7387  1.5836  1.3402  1.0000
      2.0000  1.9762  1.9479  1.9090  1.8491  1.7440  1.5208  1.0000
      2.0000  2.0000  2.0000  2.0000  2.0000  2.0000  2.0000  1.0000
                   a = eoshift(b,-1,0,1)=
      0  0  0
      1  3  5
       a = eoshift(b,(/-1,0/),(/7,8/),2)=
      7  1  3
      2  4  6
                      a = eoshift(b,2,2)=
      5  0
      0  6
      0  0
                                 as =  2  3  4  5
                     cs = spread(as,1,3)=
      2  3  4  5
      2  3  4  5
      2  3  4  5
                                 at =  2  3  4
                     cs = spread(at,2,4)=
      2  2  2  2
      3  3  3  3
      4  4  4  4
     f90 uniform random_number(): xran =  0.5801136486
     and f90 subroutine random_number() uniform random array:
      0.9505127350  0.3056509439  0.0986253383  0.6938844384
      0.7863714253  0.6891007107  0.2765484551  0.9344770142
      0.2976202640  0.3826622387  0.6204460278  0.2120929553
      0.4536999003  0.1329027055  0.0835029668  0.1306527482
      0.0062619416  0.8318579032  0.9903771206  0.8625969805
      0.2757364264  0.5829797958  0.9793469434  0.8189092940
     UNICOS function ranf() uniform random array:
      0.5407187129  0.0187994091  0.3141160167  0.7651821004
      0.9415271082  0.2893071356  0.5849975196  0.9030257778
      0.8866798463  0.4966670053  0.3964840582  0.8718218141
      0.9311052262  0.5954839343  0.2096123584  0.8881281192
      0.4641396487  0.6280308383  0.4467249313  0.4578495774
      0.2349011311  0.7635970977  0.5911920675  0.4438340178
     STOP   executed at line 222 in Fortran routine 'F90TEST'
     CPU: 1.827s,  Wallclock: 0.533s,  24.5% of 14-CPU Machine
     Memory HWM: 308988, Stack HWM: 37805, Stack segment expansions: 0
    %%%%%%%%%%% end pgm.output = t90f90test.output %%%%%%%%%
    

    Cray T3E f90 Differences:

    Here is a sample code with many examples, heavily commented and followed by the actual output run on t3e.npaci.edu using the commands

       f90 -O3 -r3 -Xm -o fpgm fpgm.f &
       mpprun -n1 fpgm  >& fpgm.output &
    
    %%%%%%%%%%% pgm.f=cf97test.f %%%%%%%%%
    


    f90 Library Functions


      [variable] = ssum (n,a(m),k) {The optimized scientific library SCILIB function `ssum' computes the sum of `n' elements of array `a' starting from element `m' in steps of `k'; the equivalent, but not optimal, do loop is {sum=0; do 1 i=m,m+(n-1)*k,k; 1 sum=sum+a(i)}; e.g. `sum=ssum(n,a,1)' returns the sum of the first `n' elements of the array `a'; if `a' is an m X n 2-subscript array, use `t = ssum(m*n,a(1,1),1)'; use `man ssum' for more information; the `-l libsci.a' option in the `segldr' should be optional. UPDATE: In cf77 version 6.0, `ssum' has been replaced by the Fortran90 `sum' function.}
      [variable] = sdot (n,a,1,b,1) {The optimized SCILIB function `sdot' returns the calculated value of the dot product of `n' elements of the vectors `a' and `b' in steps of 1; the `-l libsci.a' option of the `segldr' should be optional. UPDATE: In cf77 version 6.0, `sdot' has been replaced by the Fortran90 `DOT_PRODUCT' function.}
      call mxm (a,m,b,kmax,c,n) {The optimized SCILIB subroutine returns the calculated value of the full matrix by matrix product of the `m X kmax' array `a' and the `kmax X n' array `b' into the `m X n' output array `c'; use `mxma' for multiplication of sub-matrices when the matrices are not full; use `man mxm' (ignore UNICOS function of the same name) or 'man mxma' for more information; the `-l libsci.a' option of the `segldr' should be optional. UPDATE: In cf77 version 6.0, `mxm' has been replaced by the Fortran90 `MATMUL' function.}
      call mxv (a,m,b,n,c) {The optimized SCILIB subroutine returns the calculated value of the full matrix by vector product of the `m X n' array `a' and the `n' vector `b' into the `m' output vector `c', by rolling up the `j' loop; use `man mxv'; the `-l libsci.a' option of the `segldr' should be optional. UPDATE: In the T90 cf77 version 6.0, `mxv' has been replaced by the Fortran90 `MATMUL' function.}
      call random_number([HARVEST=][variable]) {F90 Pseudo-random number generator on [0,1], as intrinsic subroutine rather than intrinsic function, that gets the first or next random number or array and stores it in the user output variable or array `[variable]'. For example:
        real s, r(100,100)
        call random_number(harvest=s)
        call random_number(r)
        
      See `man random_number' for more information, or `man rand_seed' for changing the random sequence. }
      [random-variable] = ranf() {UNICOS Pseudo-random number generator on [0,1] that gets the first or next random number, e.g.,
        real s, r(100,100)
        s = ranf()
        do i = 1,100
           do j = 1,100
              r(i,j) = ranf()
           enddo
        enddo
        
      or use `r(1:n) = ranf()' in Fortran90 notation; use `x(1:n) = -log(r(1:n))' to convert to an exponential distribution; change the random generator seed using `call ranset([new-seed])', but it is not necessary to start with a seed; `ranf' is a great random number generator since it properly vectorizes in loops; use `man ranf' for more information, including use for C/C++ as `_ranf()' which requires the following include and declaration statements:
        #include 
        double _ranf(void);
        
      }
      call wheni[reln] ([nfind],[iarray],[inc],[itarget],[index],[nval]) (CR) {Finds all integer array (`[iarray]') elements in relation (`[reln]') to the integer target (`[itarget]'); `[reln]' = `lt', `le', `gt' or `ge'; `[nfind]' is the number of elements to be searched in increments of `[inc]'; `[index]' is the integer array of the indices of the output; and `[nval]' is the number of indices found.}
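    The UPDATE notes above say that `ssum' and `sdot' give way to the Fortran90 `sum' and `DOT_PRODUCT' intrinsics. The following sketch, which assumes only standard Fortran 90 and uses made-up names and data, checks that a strided array section reproduces the do loop form of the strided sum.

```fortran
      program libequiv
! Sketch only: intrinsic stand-ins for strided sum and dot.
      integer, parameter :: n = 4, m = 2, k = 3
      integer :: i
      real, dimension(12) :: a
      real, dimension(n) :: x, y
      real :: s1, s2, d
      a = (/ (real(i), i = 1, 12) /)
! Loop form: n elements starting at a(m) in steps of k.
      s1 = 0.0
      do i = m, m + (n-1)*k, k
         s1 = s1 + a(i)
      end do
! Same sum written with an array section.
      s2 = sum(a(m:m+(n-1)*k:k))
      x = (/1.0, 2.0, 3.0, 4.0/)
      y = (/4.0, 3.0, 2.0, 1.0/)
      d = dot_product(x, y)
      print *, 's1 =', s1, ' s2 =', s2, ' d =', d
      end program libequiv
```

    Both sums add a(2), a(5), a(8) and a(11), giving 26, and the dot product is 4+6+6+4 = 20.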

    T90 Fortran90 (f90) Compiler Vector Toggling Directives

    These directives are placed in the Fortran source just before the loop or other entity they are to affect, and they stay in effect until the opposite directive is given. For every toggling compiler directive that turns some action on, there is another directive with a `NO' prefix that turns that action off. The leading `!' (or `C' in the older `CDIR$' form) must be in column 1 and a blank must be in column 6. For more information, see F90 Vol. 1: Fortran Reference Manual, Sect. 1.6 Compiler Directives.


      !DIR$ VECTOR {Compiler directive causes all following inner DO loops to be vectorized unless the loop is known to have only one iteration and until superseded by another directive that alters vectorization.}
      !DIR$ NOVECTOR {Directive turns off vectorization at next DO loop until turned back on.}
      !DIR$ VSEARCH {Directive permits optimization of loops that can have a premature exit, as with convergence of an iteration. !DIR$ NOVSEARCH directive turns it off.}
      !DIR$ INLINE {Directive turns on inlining, inline code generation, of subprograms if `-I' or `-o inline' f90 options are used; !DIR$ NOINLINE turns inlining off.}

    T90 Fortran90 (f90) Compiler Scalar Optimization Directives

    These directives affect scalar optimization at the point at which the directive appears and affect only the local program unit, such as the loop it appears in.


      !DIR$ BL {Directive turns on bottom loading for loops, pre-fetching data for the next loop iteration; !DIR$ NOBL turns BL off.}
      !DIR$ NOSIDEEFFECTS [subprogram-name] {Allows keeping data in registers across subprograms, if no global data (i.e., arguments or common blocks) are changed.}
      !DIR$ SUPPRESS [variable-list] {Directive temporarily suppresses scalar optimization on variables in loops containing the directive.}

    T90 Fortran90 (f90) Compiler Loop Directives

    These compiler directives hold only for the loop immediately following the directive.


      !DIR$ IVDEP {Directive causes compiler to Ignore Vector DEPendencies in only the next innermost DO loop. Disabled by the NOVECTOR directive.}
      !DIR$ NEXTSCALAR {Directive causes only the very NEXT DO loop to be executed in SCALAR Mode with vectorization resuming if on. Disabled by the NOVECTOR directive.}
      !DIR$ SHORTLOOP {Directive reduces vectorization overhead for the very next loop, which is presumed SHORT, i.e., to have less than 64 iterations. Disabled by the NOVECTOR directive.}
      !DIR$ RECURRENCE {Directive turns on vectorization of reduction loops (e.g., sum loops); !DIR$ NORECURRENCE turns it off. Disabled by the NOVECTOR directive.}

    T90 Fortran90 (f90) Compiler Storage Directives

    These compiler directives alter the way memory is handled.


      !DIR$ VFUNCTION [external-function-list] {Directive declares Vector version of an external CAL FUNCTION, where CAL is the Cray Assembler Language, but the function can not be declared in an External statement; works with list of CAL functions separated by commas. See the f90 Vol. 1 for other restrictions. }
      !DIR$ AUXILIARY [array-list] {Storage directive allows assignment to the secondary disk storage for the Cray Y-MP only.}

    T90 Fortran90 (f90) Compiler Diagnostic Directives

    These are used the same way as the vector directives.


      !DIR$ BOUNDS [array-names] {Allows the checking of array subscript bounds, but inhibits vectorization; applies to all arrays unless particular ones are listed as arguments.}
      !DIR$ NOBOUNDS {Prevents checking of subscript bounds.}
      !DIR$ FLOW {Turns on Flow-trace and `!DIR$ NOFLOW' turns it off.}

    For more information on compiler directives and other f90 statements, refer to the `Cray Fortran (CFT) REFERENCE MANUAL'. Additional information on SCILIB functions can be found in the Cray Library Reference Manual, a copy of which is found in the UIC Supercomputing Support Office along with many other manuals.


    Return to TABLE OF CONTENTS?

    T90 Fortran (f90) Multitasking Options

    The Cray supercomputers now have parallelization or tasking features in addition to vectorization features. However, the cost of running multitasked Cray Fortran is extremely large, because the user is charged for time on all processors utilized. In contrast, the user is not charged for each vector element with vectorization. HENCE THE USER SHOULD ONLY USE THE MULTITASKING FEATURES WHEN ABSOLUTELY NECESSARY. Macrotasking refers to large grain or subroutine level parallelization. Microtasking refers to parallel loop optimization through compiler directives. Autotasking refers to automatic microtasking by the Fortran Preprocessor `fpp', i.e., through automatic code generation for multitasking. Compiler f90, preprocessor fpp and mid-processor fmp are currently version 4.0 at NPACI. More information is found on the NPACI Cray T90, in the directory `/usr/local/doc' files or subdirectories such as `cf77_50.rn' release notes or `unicos.7.0' sections.

      A typical job accounting execution sequence might be
      ja (CR)
      ${TMP}/[fn] < [data] >& [output] & (CR)

      with the job accounting information appearing in a file of the form `.jacct[jobid]'. Including the pass through option `-Wd"-l [fn].ml" ' will also produce an fpp summary listing in `[fn].ml' (but no executable) with the markers `P' for autotasked, `V' for vectorized, `N' for not chosen or not optimized, and `D' for data dependent.
      f90 -O full -M [fn].f > [fn].m & (CR) : The `-M' option results in the intermediate Fortran file `[fn].m' with microtasking directives automatically inserted into the `[fn].f' source using the dependence analysis of the `fpp' preprocessor; no object or executable file is produced; the user can insert additional compiler directives into `[fn].m' and compile it with the Cray Fortran multitasking processor `fmp', the translator of the directives, by `fmp [fn].m > [fn].j (CR)'; the intermediate expanded file `[fn].j' is further assembled, linked and loaded by `sld -o [exec] [fn].j (CR)'.


    Return to TABLE OF CONTENTS?


    Cray T90 f90 and cc Timing Utility Functions.


    T90 Fortran90 (f90) Timing Utility Functions

      [time-variable] = second() : The standard UNIX Fortran seconds timer utility, whose output value is user cpu time in seconds, as opposed to system ``cpu time'' and wall clock time (the sum of user and system times); also exists in the format `call second([time-variable])'; for timing large loops, `second' overhead should be negligible; for most sizes of loops, the timing part of the code, with `!' marking comments on the statement line, might look like:

                 real tv(100),cputim(100)
                 character*24 tchar(100)                                       
                 kt = 1                                                        
                 tv(kt) = second()      ! first 2 calls get the overhead       
                 kt = kt + 1                                                   
                 tv(kt) = second()      ! initial time                         
        code-continues
                  ...  more code ...
        code-continues
                 kt = kt + 1
                 tchar(kt) = 'loop [999]'
                 tv(kt) = second()      
                   do [999] i = 1, [1000]
        code-continues
             ... rest of do .... 
        code-continues
        999      continue
                 kt = kt + 1
                 tv(kt) = second()      !tv(kt) - tv(kt-1) = do-cputime
        code-continues
           ... more do loops and more timing step pairs .... 
        code-continues
                 kt = kt +1
                 tv(kt) = second()            !final time
                 overhd = tv(2) - tv(1)      !timer second overhead
                   do [99999] ks =3, kt - 2      !cpu-time for each timed loop
                 cputim(ks) = tv(ks+1) - tv(ks) - overhd
                 write(6,[99998]) ks, cputim(ks), tchar(ks)
        Comment:  writes hinder vector optimization, so save writes until last
        99999    continue
        99998    format(1x,i3,' time =',f12.7,' for ',a)
                 cputot = tv(kt) - tv(2) - (kt-2)*overhd
        Caution: due to overhead variability, total can be off for small job
                 write(6,*) 'total cpu-time =',cputot
        
      For timing small loops, put the small loop inside another loop that just does a large number of repetitions of the small loop, say N, then divide the time difference by N; use `man second' for other information.
      [flag] = gettimeofday(&tp,&tzp); : C/C++ microsecond wall clock timer and timezone utility. Requires the special header include statement: `#include <sys/time.h>' and that the following structures be declared:
        struct timeval tp ;
        /* timeval is a structure type and tp a variable of it, having */
        /* unsigned long  tp.tv_sec giving  seconds since Jan. 1, 1970 */
        /* long  tp.tv_usec giving microseconds                        */
        struct timezone tzp;         /* needed only for time zone data */
        
      See `man gettimeofday' for more information and Cray T90 C Starter example: `t90startcc.c'.

      [time-variable] = tsecnd() : f90 task timer utility giving the cpu time for a task during multitasking.
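
    The advice given above for timing small loops (repeat the small loop N times inside an outer loop, then divide the time difference by N) can be sketched in C. This is a minimal sketch, not the guide's own code: it assumes the standard C `clock()' function as a stand-in cpu timer (it reports processor time, typically user plus system, rather than the user time of `second'), and the names `time_per_rep', `small_loop' and `sink' are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: run the small kernel nrep times inside an outer   */
/* loop and return the average cpu time per repetition in seconds.        */
double time_per_rep(void (*kernel)(void), long nrep)
{
   clock_t t0, t1;
   long k;
   assert(nrep > 0);
   t0 = clock();                 /* cpu time before the repetitions */
   for (k = 0; k < nrep; k++)
      kernel();
   t1 = clock();                 /* cpu time after the repetitions  */
   return ((double) (t1 - t0) / CLOCKS_PER_SEC) / (double) nrep;
}

/* Example small loop to be timed; the volatile result keeps the compiler */
/* from optimizing the work away.                                         */
volatile double sink;

void small_loop(void)
{
   int i;
   double s = 0.0;
   for (i = 0; i < 100; i++)
      s += (double) i;
   sink = s;
}
```

    A call such as `time_per_rep(small_loop, 100000L)' then gives an average time per pass; the repetition count should be large enough that the total time is well above the timer resolution.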


    cc Timing Utility Function


      gettimeofday
      Example C program using the gettimeofday function:
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/time.h>
        #define NTime 20
        
        int main(void)
        {
        /* Time variables */
           struct timeval tp ;
        /* timeval is a structure type and tp a variable of it, having */
        /* unsigned long  tp.tv_sec giving  seconds since Jan. 1, 1970 */
        /* long  tp.tv_usec giving microseconds */
           struct timezone tzp; /* needed only for time zone data */
           int gtod, kt;
           long int tsecs[NTime], tmicrosecs[NTime];
           long int ttot[NTime], ttotmoh[NTime];
           float ts1, tt1, tu1, tu2, ts2, tt2, tu3, ts3, tt3;
           double ttotf;
        
        /* begin main code */
           if (gettimeofday(&tp,&tzp) == -1) { perror("gettimeofday failed"); exit(1);}
           kt = 1;
        /* gettimeofday = Microsecond Wall Timer C function;                         */
        /* WallTime = UserTime + SystemTime, Undecomposed;                           */
        /* gettimeofday returns gtod = 0 if successful;                              */
        /* tv_sec in secs since 1/1/70;                                              */
        /* tv_usec in added microseconds;                                            */
        /* tzp gives the timezone;                                                   */
           gtod=gettimeofday(&tp,&tzp);
           tsecs[0] = tp.tv_sec;
           tmicrosecs[0] = tp.tv_usec;
           ++kt;
           gtod=gettimeofday(&tp,&tzp);
           tsecs[1] = tp.tv_sec;
           tmicrosecs[1] = tp.tv_usec;
        /*   ...... MUCH DELETED CODE ........ */
        
        /*   ...... MUCH DELETED CODE ........ */
        /* Clock: Elapsed Total Time: */
           ++kt;
           gtod=gettimeofday(&tp,&tzp);
           tsecs[kt] = tp.tv_sec;
           tmicrosecs[kt] = tp.tv_usec;
        /* Total Elapsed Time Including Clock Overhead*/
           ttot[kt] = (tsecs[kt]-tsecs[1])*1000000+(tmicrosecs[kt]-tmicrosecs[1]);
        /* Total Elapsed Time Minus Clock Overhead */
           ttotmoh[kt] = ttot[kt] - ((tsecs[1]-tsecs[0])*1000000+(tmicrosecs[1]-tmicrosecs[0]));
           printf("\nIntermediate Raw Timing Output:");
           printf("\ntmicrosecs[(0,1,kt)]=(%12ld,%12ld,%12ld), in microseconds",
              tmicrosecs[0],tmicrosecs[1],tmicrosecs[kt]);
           printf("\ntsecs[(0,1,kt)]=(%12ld,%12ld,%12ld), in seconds",
              tsecs[0],tsecs[1],tsecs[kt]);
           printf("\n(ttot[kt],ttotmoh[kt])=(%12ld,%12ld), in microseconds",
              ttot[kt],ttotmoh[kt]);
           if (ttot[kt] < 0){printf("\n  Error:Negative Times:Bad Clock:Rerun Job\n");}
           ttotf = ttotmoh[kt]/1.e6;
           printf("\n T90 Starter C Problem Output");
           printf("\n  Timing Output:");
           printf("\n   final total time=%12.4e, in seconds\n",ttotf);
        /* Change:  Extra output statements: */
        }
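
    The elapsed time arithmetic in the program above can be collected into a small helper. The following is a hedged sketch rather than part of the original starter code: the names `elapsed_usec' and `timer_overhead_usec' are hypothetical, and a `long' wide enough to hold the microsecond count is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: elapsed wall clock microseconds from t0 to t1,    */
/* combining the tv_sec and tv_usec fields so that a rollover of tv_usec  */
/* into the next second does not produce a wrong difference.              */
long elapsed_usec(const struct timeval *t0, const struct timeval *t1)
{
   return (long) (t1->tv_sec - t0->tv_sec) * 1000000L
        + (long) (t1->tv_usec - t0->tv_usec);
}

/* Hypothetical helper: estimate the timer overhead from two back-to-back */
/* gettimeofday calls, in the spirit of the first two calls above.        */
long timer_overhead_usec(void)
{
   struct timeval a, b;
   int rc1 = gettimeofday(&a, NULL);   /* time zone argument not needed */
   int rc2 = gettimeofday(&b, NULL);
   assert(rc1 == 0 && rc2 == 0);
   return elapsed_usec(&a, &b);
}
```

    For example, going from 10 s + 900000 microseconds to 11 s + 100000 microseconds gives 200000 microseconds, even though the microsecond field alone decreased.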
        

    Table of T90/T3E Timers

      T90 (perhaps MPP) Timer Summary ... MCS572 F95/FBH
      Timer          TimeMeasured      Units          Comments
      -----          ------------      -----          --------
      clock          System&User       Microseconds
      cpused         User              ClockTicks     RTC ticks
      gettimeofday   WallTime          Microseconds   plus many other things from TOD; C fn
      ja             ElapsedUserSys    Seconds        plus more; on T90; only mppexec for T3E
      rtclock        User              ClockTicks     current RTC ticks
      RTC            RealTimeClock     ClockTicks     float version
      IRTC           RealTimeClock     ClockTicks     int version
      second         User              Seconds        Coarse, not useful for small timings
      secondr        ElapsedWall       Seconds        Coarse, not useful for small timings
      sysclock       RealTimeClock     ClockTicks     plus #wraps (overflows)
      timef          ElapsedWall       Milliseconds   Fn. gives elapsed time since 1st call
      times          Process&Child     ClockTicks     needs include <sys/times.h>
      timex          ElapsedUserSys    Seconds        depends on opts in timex [opts] [cmd]
      tsecnd         ElapsedTask       Seconds        for current multithreaded task
      
    Notes: There are several other timers, but they are not appropriate for scientific computing. For actual use, consult the timer man page. Ideally, a timer should give user time in intervals as small as microseconds. Hence, an ideal timer for the T3E would have to be designed from an rtc clock. Job accounting ja is done on T90, but gives mppexec time (must be T3E time). Using the C routine `gettimeofday' would be a rough approximation, as was suggested on the now extinct Thinking Machines Corp. CM-5.
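
    Several of the timers in the table report clock ticks rather than seconds, so a conversion step is needed before times can be compared. A minimal sketch for the POSIX `times' call follows; it assumes a POSIX system where `sysconf(_SC_CLK_TCK)' gives the ticks per second for `times' (the tick rates of the Cray RTC based timers are different and are not covered here), and the names `ticks_to_seconds' and `user_cpu_seconds' are hypothetical.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: convert a tick count, as found in the struct tms  */
/* fields, into seconds using the system ticks-per-second rate.           */
double ticks_to_seconds(clock_t ticks)
{
   long ticks_per_sec = sysconf(_SC_CLK_TCK);
   assert(ticks_per_sec > 0);
   return (double) ticks / (double) ticks_per_sec;
}

/* Hypothetical helper: user cpu seconds consumed so far by this process, */
/* or -1.0 if the times call fails.                                       */
double user_cpu_seconds(void)
{
   struct tms buf;
   if (times(&buf) == (clock_t) -1)
      return -1.0;
   return ticks_to_seconds(buf.tms_utime);
}
```

    The difference of two `user_cpu_seconds()' values taken around a code section gives its user time, but only to the coarse resolution of one tick.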

    T3E MPI Wall Timer

    The Cray T3E at NPACI has a wall timer MPI_Wtime in seconds that works with MPI parallel programming codes in both f90 and cc. See the following information on MPI_Wtime and related functions:


    The best way to learn these commands is to use and test them in an actual computer session on the TCS Cluster.

    Good luck.

    Return to TABLE OF CONTENTS?

    Please report to Professor Hanson any problems or inaccuracies:


    Web Source: http://www.math.uic.edu/~hanson/tcs03guide.html