MCS 471 --- Computer Problem 0 --- Spring 1996

Floating Point Arithmetic


Single program with computer output is due Monday 05 Feb 1996 in class.

Write a single program that combines the following two finite precision problems. You may use WATFOR77, regular Fortran, Pascal or C for this problem. However, you must hand in copies of both your program source code and your computer output.
  1. Determine the single precision machine epsilon,

                  MIN(EPS | EPS>0 & 1+EPS>1)

         using a modification of the following "bisection approximation" code fragment (TEPS is the temporary approximate machine epsilon):
                      TEPS=1
                1     PRINT,TEPS,TEPS+1
                      IF(TEPS+1.LE.1.) GOTO 2
                      TEPS=TEPS/2
                      GOTO 1
                2     EPS=2*TEPS
                      PREC=1-LOG(EPS)/LOG(2.)
    
         Modify this code to print out the following results:
    1. Name of Your Computer System with Processor if Known
    2. Intermediate values of "TEPS" and "TEPS+1"
    3. Final Approximation of the Machine Epsilon in "EPS"
    4. Final Approximation of the Precision in "PREC"
    5. The Same Three (3) Items (2 through 4 above) from a copy of the above code fragment modified for Double Precision (64bit=8bytes)
         Note that you need results in both single (32bit) and double (64bit) precision floating point arithmetic. Note that your answer is computed in statement 2: the final TEPS failed the "TEPS+1>1" test, so it is doubled to recover the last value that passed. The variable "PREC" gets the approximate number of binary digits in the machine precision of the floating point fraction. Does this value correspond to the theoretical precision derived in class?

  2. Demonstrate a technique for avoiding catastrophic cancellation with the QUADRATIC FORMULA by printing out a TABLE of C, X11,X21,X12,X22 and the RELATIVE ERRORs in the XIJ's for A=SQUAREROOT(3.), B=LOG(10.), C=EXP(1.)/10.**M with M=1 to 12 (Caution: if you are using more than 32bit precision, then this "12" value needs to be changed in proportion) and with I and J = 1 to 2 for the formulas for the XIJ below:
          X11 = (-B+SQUAREROOT(B*B-4AC))/2A,
          X21 = (-B-SQUAREROOT(B*B-4AC))/2A,
          X12 = (-B-SIGN(B)SQUAREROOT(B*B-4AC))/2A,
          X22 = C/(A*X12),
    
         which you must code in error-free syntax in whichever language you are using. For instance, you must find the correct functions for "SIGN" and "SQUAREROOT" in Fortran/WatFor, C or Pascal. The RELATIVE ERRORs are to be computed in the form (DXIJ-XIJ)/DXIJ, where DXIJ is the corresponding formula for XIJ converted completely to "DOUBLE PRECISION" and is used as an approximation for the EXACT (INFINITE PRECISION) result. Use 7 digit E-FORMAT in Fortran for the XIJ's and 3 digit E-FORMAT for C and the RELATIVE ERRORs in the XIJ. In other languages use the corresponding exponential format. Note that the "XIJ" are just notation here for X11, X21, X12, or X22; they must be replaced by 4 legal variables. Also note that ERRORs should always be PRINTed in exponential format. In the DXIJ computations, the result of each binary (pair) operation should be double precision, and any built-in functions used should be the double versions for DOUBLE PRECISION variables. Why are the first two formulas (XI1's) not as accurate as the second two (XI2's) for I=1,2?



Web Source: http://www.math.uic.edu/~hanson/M471/mcs471cp0.html

Email Comments or Questions to hanson@math.uic.edu