Appendix 5



Floating-point representation is the term used to describe a certain method of representing numbers within a computer. This method is akin to the notation by which such a number as .000000000135 is represented by writing 1.35 x 10^-10.

The purpose of such a notation is to enable very small or very large numbers to be represented without the need to write long strings of figures. In the same way, the purpose of floating-point representation within the computer is to enable very small and very large numbers to be represented by a limited number of digits - by one computer word, in fact.

This representation can be used in any 803, but in a basic machine it is necessary to use subroutines to perform even such simple operations as addition, subtraction etc. However, if an automatic floating-point unit is fitted, it is possible to carry out arithmetic on floating-point numbers by means of special functions provided for the purpose.


2.1 General

Any given number A can be represented in many ways by a pair of numbers (a, b) which satisfy the equality

A = a x 2^b

in which a is called the mantissa or argument and b the (binary) exponent. Such a representation is termed (binary) floating-point, by contrast with the forms discussed in the main part of this manual, e.g. integers, fractions, etc, all of which are fixed-point.

For example, the number 6 can be represented by (.75, 3) or by (12, -1).

The usefulness of the notation lies, however, in the fact that any given non-zero number can be represented as a pair of numbers in which the magnitude of the mantissa lies between ½ and 1 and the exponent is integral. In practice when using this notation in the computer there are upper and lower limits on the value of the exponent, and so there must be upper and lower limits on the magnitude of the non-zero numbers which can be represented.

Zero can, of course, be represented by any pair in which the mantissa is 0.
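
The normalised form described above can be sketched in Python. The standard library function math.frexp returns a pair (a, b) with ½ ≤ |a| < 1 for any non-zero input; the name normalise is an illustrative choice and does not come from the 803 documentation.

```python
import math

def normalise(value):
    """Return (a, b) with value == a * 2**b and 0.5 <= |a| < 1 (a == 0 for zero)."""
    a, b = math.frexp(value)
    return a, b

# 6 can be written as (.75, 3) or as (12, -1); only the first is normalised.
print(normalise(6))               # (0.75, 3)
print(0.75 * 2**3, 12 * 2**-1)    # 6.0 6.0
```

Note that frexp places a negative mantissa in the interval -1 < a ≤ -½, which differs at the end points from the convention the 803 adopts for negative numbers.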

2.2 Standard Floating-Point Representation with the 803

In the 803 computers fitted with automatic floating-point units the details of the representation used are as follows:

  1. b is an integer, -256 ≤ b ≤ 255
  2. if A > 0, ½ ≤ a < 1
    if A = 0, a=0, b=-256, always
    if A < 0, -1 ≤ a < -½
  3. 30 digits representing the mantissa, a, and 9 digits representing the exponent, b, are "packed" together in the same way as a normal fixed-point fraction: that is to say, the left hand digit is the sign digit, the second is the 2^-1 digit, the third is the 2^-2 digit, and so on down to the thirtieth, which is the 2^-29 digit; the remaining 9 digits represent the integer (b+256) directly; this must satisfy the relation 0 ≤ (b+256) ≤ 511.
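
As an illustration of the packing just described, the following Python sketch builds the 39-digit word for a given number. The names pack and show are hypothetical, and the use of Python's round() is an assumption: it is not necessarily the rounding rule of the 803 hardware.

```python
import math

MANTISSA_BITS = 30   # sign digit plus 29 fractional digits
EXPONENT_BITS = 9    # holds the integer b + 256

def pack(value):
    """Pack a number into a 39-bit word: mantissa field above exponent field."""
    if value == 0:
        return 0                       # a = 0, b = -256: every digit is 0
    a, b = math.frexp(value)           # 0.5 <= |a| < 1
    q = round(a * 2**29)               # mantissa as a signed multiple of 2^-29
    if q == 2**29:                     # rounded up to +1: renormalise to +1/2
        q, b = 2**28, b + 1
    if q == -(2**28):                  # exactly -1/2: standard form is -1 x 2^(b-1)
        q, b = -(2**29), b - 1
    if not -256 <= b <= 255:
        raise ValueError("exponent outside -256..255")
    # The sign digit carries weight -1, so a negative q is masked to 30 bits.
    return ((q & (2**MANTISSA_BITS - 1)) << EXPONENT_BITS) | (b + 256)

def show(word):
    """Format a word as sign digit, 29 fraction digits and 9 exponent digits."""
    bits = format(word, "039b")
    return f"{bits[0]} {bits[1:30]} {bits[30:]}"

print(show(pack(1)))
print(show(pack(-0.625)))
```

The sketch simply raises an error when the exponent falls outside the permitted range; it makes no attempt to imitate the machine's behaviour in that case.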

    2.3 Examples of floating-point numbers

    For ease of reading, the numbers in the examples below have been printed with the sign digit, the fractional digits of the mantissa and the digits of the exponent in separate groups.

    0 (zero)
    0 00000000000000000000000000000 000000000
    -1, i.e. (-1) x 2^0
    1 00000000000000000000000000000 100000000
    +1, i.e. ½ x 2^1
    0 10000000000000000000000000000 100000001
    +240, i.e. 15/16 x 2^8
    0 11110000000000000000000000000 100001000
    -.078125, i.e. (-5/8) x 2^-3
    1 01100000000000000000000000000 011111101
    +(1 - 2^-29) x 2^255, the largest possible positive number, about 5.8 x 10^76
    0 11111111111111111111111111111 111111111
    -(½ + 2^-29) x 2^-256, the smallest possible negative number, about -4.3 x 10^-78
    1 01111111111111111111111111111 000000000

    Observe that zero has the same form in floating-point representation as in fixed-point, and that the sign digit of any positive number or zero is 0, while that of any negative number is 1.
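
The examples above can be checked with a short Python sketch that decodes a 39-digit pattern into an exact fraction. The function name unpack is hypothetical; the sign digit is treated as carrying weight -1, which is consistent with the limits -1 ≤ a < 1 and with the patterns shown.

```python
from fractions import Fraction

def unpack(bits):
    """Decode a 39-digit binary string (spaces ignored) into an exact Fraction."""
    digits = bits.replace(" ", "")
    if len(digits) != 39:
        raise ValueError("expected 39 binary digits")
    word = int(digits, 2)
    q = word >> 9                      # 30-digit mantissa field
    if q >= 2**29:                     # sign digit set: mantissa is negative
        q -= 2**30
    b = (word & 0b111111111) - 256     # the stored exponent is b + 256
    return Fraction(q, 2**29) * Fraction(2)**b

examples = {
    "zero":              "0 " + "0" * 29 + " 000000000",
    "15/16 x 2^8":       "0 1111" + "0" * 25 + " 100001000",
    "(-5/8) x 2^-3":     "1 011" + "0" * 26 + " 011111101",
    "largest positive":  "0 " + "1" * 29 + " 111111111",
    "smallest negative": "1 0" + "1" * 28 + " 000000000",
}

for name, pattern in examples.items():
    print(name, float(unpack(pattern)))
```

The patterns are built from repeated digits rather than typed out in full, purely to make the digit counts easy to verify.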

    2.4 Accuracy, Range and Round-Off

    The accuracy of any representation is determined by the number of significant figures employed. Here this is 29 binary digits, so the accuracy is slightly less than that which would be expected from 9 significant decimal figures.

    The last two examples given above show extreme values of numbers which can be represented. The actual range is defined by:

    Zero:                      represented exactly and unambiguously.
    Largest positive number:   (1 - 2^-29) x 2^255
    Smallest positive number:  ½ x 2^-256
    Largest negative number:   -1 x 2^255
    Smallest negative number:  -(½ + 2^-29) x 2^-256

    For each of the four non-zero limits, the representation is accurate to 29 significant binary digits.

    This may be summarised approximately by saying that zero is represented exactly and that any number satisfying

    4.3 x 10^-78 ≤ |A| ≤ 5.8 x 10^76

    can be represented to an accuracy approximately equal to that obtained by 9 significant decimal figures.

    Certain rational numbers in this range can, of course, be represented exactly. These include, in particular, all integers in the range -536 870 912 ≤ n ≤ 536 870 911.

    In each floating-point operation executed by the computer the result is rounded off without bias in such a way that its mantissa will not differ by more than 2^-29 from the correct result. If the true result of any floating-point operation upon two numbers, which are represented exactly, is itself capable of exact representation, then the result actually produced by the computer will be exactly correct.

    It will be appreciated that the value of the exponent determines the absolute magnitude of the smallest increase or decrease in any number which can be represented within the computer. Thus, in general, the greater the magnitude of any particular number, the greater the step to the next representable number.
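
The limits of the range, and the dependence of the step size on the exponent, can be reproduced with exact arithmetic. This is only a sketch: the names EPS and step are illustrative, and the step is taken to be one unit of the last mantissa digit, 2^-29 x 2^b.

```python
from fractions import Fraction

EPS = Fraction(1, 2**29)                 # weight of the last mantissa digit

largest_positive = (1 - EPS) * Fraction(2)**255
smallest_positive = Fraction(1, 2) * Fraction(2)**-256
largest_negative = -1 * Fraction(2)**255
smallest_negative = -(Fraction(1, 2) + EPS) * Fraction(2)**-256

def step(b):
    """Spacing between adjacent representable numbers whose exponent is b."""
    return EPS * Fraction(2)**b

print(f"{float(largest_positive):.2e}")
print(f"{float(smallest_positive):.2e}")
print(float(step(8)))    # spacing near 240, whose exponent is 8
print(float(step(29)))   # spacing of 1.0: the largest exponent at which every integer is exact
```

With b = 29 the spacing is exactly 1, which is why every integer up to 2^29 - 1 = 536 870 911 can be held without error.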

    2.5 Floating-Point Underflow and Overflow

    The Floating-Point Overflow Lamp

    1. If the computer is called upon to execute any floating-point arithmetic operation, the correct result of which is of such small magnitude that it cannot be represented, the actual result will be zero. This effect is termed floating-point underflow, and does not affect the running of the computer, or any indicator lamp.
    2. If the computer is called upon to execute any floating-point arithmetic operation, the correct result of which is of such great magnitude that it cannot be represented, it will stop, and the floating-point overflow lamp on the keyboard will be lit. The computer can be restarted by depressing the operate bar.
    3. If and only if the computer is called upon to carry out floating-point division by zero, then, in addition to the effect described in 2 above, the (fixed-point) overflow indicator will be set, and will remain set until a 43 or 47 instruction is obeyed.
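
The three rules above may be modelled as follows. This is a sketch under stated assumptions: rounding is ignored, the exception FloatingPointOverflow merely stands in for the machine stopping with the lamp lit, and the dictionary called state is a hypothetical record of the fixed-point overflow indicator.

```python
from fractions import Fraction

EPS = Fraction(1, 2**29)
MAX_POS = (1 - EPS) * Fraction(2)**255
MIN_POS = Fraction(1, 2) * Fraction(2)**-256
MAX_NEG = -(Fraction(2)**255)
MIN_NEG = -(Fraction(1, 2) + EPS) * Fraction(2)**-256

class FloatingPointOverflow(Exception):
    """Stands in for the computer stopping with the overflow lamp lit."""

def deliver(true_result):
    """Apply the underflow and overflow rules to an exact result."""
    if true_result > MAX_POS or true_result < MAX_NEG:
        raise FloatingPointOverflow("result too large to represent")
    if MIN_NEG < true_result < MIN_POS:
        return Fraction(0)             # underflow: the result becomes zero
    return true_result

def divide(a, n, state):
    """Model of floating-point division, recording division by zero in 'state'."""
    if n == 0:
        state["fixed_point_overflow"] = True   # set in addition to the stop
        raise FloatingPointOverflow("division by zero")
    return deliver(Fraction(a) / Fraction(n))

state = {"fixed_point_overflow": False}
print(deliver(Fraction(1, 2**300)))    # 0
print(divide(3, 4, state))             # 3/4
```

Underflow returns zero silently, whereas overflow interrupts the calculation; only division by zero touches the fixed-point indicator.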

    2.6 Difference between automatic and programmed floating-point representation. Cautionary note

    It should be observed that, before the advent of the 803 automatic floating-point unit, many subroutines were written by means of which floating-point operations could be performed, and that these remain in use on 803's not fitted with automatic floating-point units. In most of these programmes, the representation employed differs to some extent from the representation described above.

    This description applies to the automatic floating-point unit only.


    3.1 Group 6 Functions : Automatic Floating-Point Arithmetic

    In functions 60 to 64 inclusive the computer treats both a and n as standard floating-point numbers and produces a standard floating-point result, which replaces the old content of the accumulator.

    The content of the auxiliary register is cleared by any of these operations, but the store is not affected.

    Instruction   Effect                 a'       Time Taken
    60 N          Add n to a             a + n    3 x 288 μsec.
    61 N          Subtract n from a      a - n    3 x 288 μsec.
    62 N          Negate a and add n     n - a    3 x 288 μsec.
    63 N          Multiply a by n        an       17 x 288 μsec.
    64 N          Divide a by n          a/n      34 x 288 μsec.
    65 4096       Convert the fixed-point integer in the accumulator
                  to floating-point form          2 x 288 μsec.


    In obeying the instruction 65 4096 the computer treats the existing content of the accumulator as a fixed-point integer to scale x 2^-38, and converts this to standard floating-point form.

    For example:

    0 00000000000000000000000000000000001111
    which is the fixed-point representation of the integer 15, is converted to
    0 11110000000000000000000000000 100000100
    in which a = 15/16 and (b+256) = 260,
    so that a x 2^b = 15/16 x 2^4 = 15.
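
The conversion can be sketched in Python for integers that are exactly representable. The function name integer_to_floating is hypothetical, and the sketch deliberately refuses integers outside -2^29 ≤ n < 2^29, for which the mantissa would have to be rounded.

```python
from fractions import Fraction

def integer_to_floating(n):
    """Return (a, b) in standard form for an integer with -2**29 <= n < 2**29."""
    if not -2**29 <= n < 2**29:
        raise ValueError("outside the range of exactly representable integers")
    if n == 0:
        return Fraction(0), -256
    # Smallest b that brings the mantissa into [1/2, 1) or [-1, -1/2).
    b = n.bit_length() if n > 0 else (-n - 1).bit_length()
    return Fraction(n, 2**b), b

print(integer_to_floating(15))   # (Fraction(15, 16), 4)
```

For 15 this yields a = 15/16 and b = 4, so the stored exponent is 4 + 256 = 260, in agreement with the example.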

    It should be noted that function 65, with values of N ≠ 4096, is used for other purposes in certain special versions of 803, and that 65 4096 is the only permitted form of the instruction to convert integer to floating-point representation.

    Negating: Observe that if C(Acc) is a floating-point number, it may be negated by means of the instruction 62 0.
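
The effects and timings tabulated above can be modelled in a few lines of Python. This is an illustration only: the arithmetic is exact rather than rounded to 29 binary digits, the names GROUP_6 and run are hypothetical, and the negation example assumes that the operand supplied by 62 0 is zero, which the text implies but does not state.

```python
from fractions import Fraction

UNIT_US = 288   # the unit in which the table quotes its timings, in microseconds

# function code -> (effect on accumulator a given operand n, multiple of 288 microseconds)
GROUP_6 = {
    60: (lambda a, n: a + n, 3),
    61: (lambda a, n: a - n, 3),
    62: (lambda a, n: n - a, 3),
    63: (lambda a, n: a * n, 17),
    64: (lambda a, n: a / n, 34),
}

def run(a, steps):
    """Apply (function, operand) pairs to a; return the result and the time in microseconds."""
    elapsed = 0
    for function, n in steps:
        effect, units = GROUP_6[function]
        a = effect(a, n)
        elapsed += units * UNIT_US
    return a, elapsed

# Negation as function 62 with a zero operand (assumed).
print(run(Fraction(5, 8), [(62, Fraction(0))]))   # (Fraction(-5, 8), 864)
```

A multiplication followed by a division would, on these figures, take (17 + 34) x 288 = 14 688 microseconds.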

    3.2 Other Functions which can be used straightforwardly with Floating-Point Numbers

    Functions      Remarks
    00 10 20 30    The effect is normal
    03 13 23 33    Can be used to separate the exponent and mantissa: otherwise not of much practical value.
    06 16 26 36    The effect is normal
    41 42 45 46    Will work equally well when C(A) is a floating-point number

    3.3 Other Functions which will not, in general, produce sensible results in floating-point form

    These functions are:

    01, 11, 21, 31,
    02, 12, 22, 32,
    04, 14, 24, 34,
    05, 15, 25, 35,
    07, 17, 27, 37,
    43, 47,
    All Group 5

    Where the effect of the above instructions is to perform arithmetic on one or more words which represent floating-point numbers, they will not, in general, produce sensible results. The functions 43 and 47 refer to the fixed-point overflow indicator, which is not usually affected by floating-point overflow: see 2.5, item 3.

    It is, however, possible to carry out such actions as using 22 N to double a floating-point number or using a fixed-point add or subtract instruction to double or halve several times by adding or subtracting an integer (thereby changing A = a x 2^b to a x 2^(b+n) = A x 2^n). It should be noted that these operations would not be subject to the normal rules of floating-point overflow and underflow.
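
Because the exponent occupies the least significant 9 digits of the word, fixed-point addition of a small integer alters only the exponent. The Python sketch below illustrates this; value_of is a hypothetical helper that decodes a 39-bit word, and no check is made for a carry out of the exponent field.

```python
from fractions import Fraction

def value_of(word):
    """Exact value of a 39-bit word: 30-bit mantissa field above a 9-bit exponent field."""
    q = word >> 9
    if q >= 2**29:                     # sign digit set: mantissa is negative
        q -= 2**30
    return Fraction(q, 2**29) * Fraction(2)**((word & 511) - 256)

# +240 = 15/16 x 2^8: mantissa field 15 * 2^25, exponent field 8 + 256.
word = ((15 * 2**25) << 9) | (8 + 256)

doubled = word + 1     # fixed-point addition of 1 raises the exponent by one
eighth = word - 3      # fixed-point subtraction of 3 lowers it by three

print(value_of(word), value_of(doubled), value_of(eighth))   # 240 480 30
```

If the exponent field were already 511, adding 1 would carry into the mantissa and corrupt the number, which is one reason these operations fall outside the normal overflow rules.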


    4.1 Floating-Point Decimal Notation

    Just as with fixed-point programming, constants and data which are to be operated on in binary are expressed in decimal on programme sheets and when writing out data.

    While the actual details of the conventions used will vary from one subroutine or programme to another, the general written notation for a floating-point number is ± a/b, representing ± a x 10^b. b must be an integer; a may be a fraction, integer or mixed number.

    Thus we could express the floating-point number 123.45 by any of these:
    1. +.12345/3
    2. +12345/-2
    3. +123.45/

    Of these, the first is sometimes called the standard (decimal) floating-point form, which is roughly defined by saying that the mantissa, a, obeys the decimal inequality

    .1 ≤ |a| < 1

    In the second example, where there is no decimal point, the mantissa is assumed to be an integer. In the third case, where there are no figures after the /, the exponent is assumed to be zero.
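
A reader for this written notation might be sketched as below. The exponent is read as a power of ten, as the three equivalent forms listed above imply; the function name parse_decimal_floating is hypothetical and the sketch makes no claim about how the actual input routine works.

```python
from decimal import Decimal

def parse_decimal_floating(text):
    """Parse a number written as ±a/b and return a x 10^b as an exact Decimal."""
    mantissa, _, exponent = text.strip().partition("/")
    b = int(exponent) if exponent else 0   # no figures after the / means exponent zero
    return Decimal(mantissa).scaleb(b)

for form in ("+.12345/3", "+12345/-2", "+123.45/"):
    print(form, "->", parse_decimal_floating(form))
```

All three forms yield the same value, 123.45.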

    4.2 Floating-Point Translation Input Routine

    An augmented translation input routine is available, which will read and store floating-point constants written in any of the above notations. This routine occupies more storage space than the standard form, so programmes written for automatic floating-point computers can conveniently commence in location 256.

    4.3 Floating-Point Subroutines

    Subroutines which will perform the functions of reading and printing floating-point numbers, including if desired the conversion from or to fixed-point form, are available. So also are subroutines to evaluate mathematical functions of floating-point numbers.



    At the time (May 1961) of writing this text, the subroutines to which reference is made exist only in a provisional form. It should therefore be noted that final details may differ from those assumed here.

    5.1 We assume that the following are available:

    1. An augmented translation input routine, of about 220 words length, which will read and store floating-point constants.

      This may also be entered from any programme by the instruction pair 73 170   40 52 whereupon it will read one number (fixed or floating-point) from the input tape and exit with this in the accumulator.

    2. A print routine which will, among other things, print a floating-point number in standard decimal floating-point form. The entry is standard, and the parameter 00 0 00 n must be written in the location immediately following that holding the entry instructions, to specify that n decimal digits are required in the mantissa of the printed number.

      Storage requirements: less than 200 locations.

    3. A floating-point square-root subroutine, which will extract the square root of the floating-point number in the accumulator. Entry is standard, and not more than 25 locations are required.

    5.2 The problem to be Programmed

    Given: the integer r and the surface areas of r spheres.

    Calculate and print, to five significant figures, the volume of each sphere and the mean volume per sphere.

    5.3 Notation, Formulae, etc

    The integer r is the number of spheres.

    r' is the standard floating-point form of r.

    If the area of a sphere is A, then its volume V is determined by

    V = A √A / (6 √π)
    and we note that 6 √π = 10.6347231

    If T is the total volume, then
    T = ΣV = 0 + V_1 + V_2 + ... + V_r

    and if U is the mean volume, then
    U = T / r'
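
The formula for V and the quoted constant can be checked numerically. In this sketch the names SIX_ROOT_PI and volume_from_area are illustrative; the check uses a sphere of radius 2, for which A = 4πρ² and V = (4/3)πρ³, ρ being the radius.

```python
import math

SIX_ROOT_PI = 6 * math.sqrt(math.pi)     # the constant quoted as 10.6347231

def volume_from_area(area):
    """V = A * sqrt(A) / (6 * sqrt(pi))."""
    return area * math.sqrt(area) / SIX_ROOT_PI

radius = 2.0
area = 4 * math.pi * radius**2

print(round(SIX_ROOT_PI, 7))             # 10.6347231
print(volume_from_area(area), 4 / 3 * math.pi * radius**3)
```

The two printed volumes agree to within floating-point rounding.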

    5.4 Preliminary Work

    Blocks
    1 Main Programme
    2 Constants
    3 Workspace
    4 Print subroutine
    5 Square-root subroutine

    Data Sequence

    The following numbers will be punched on the data tape, in this sequence:

    The integer r

    The floating-point numbers, representing the surface areas, one after another.

    Output Format

    The individual volumes will be printed in a column, and the mean volume will be printed to the right of the last individual volume.

    5.5 Scheme, or Flow Diagram

    Read r
    Set count -(r-1)
    Form r' and store it
    Set T to zero
    Read an area A  <-------------------------+
    Form V                                    |
    Add V to T                                |
    Print V on a new line                     |
    Count and test : if r spheres have not    |
        yet been dealt with, go back ---------+
    Otherwise: divide T by r' to form U
    Print U
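
For comparison, the same scheme may be written in a modern language. The Python sketch below follows the steps of the flow diagram one for one; the name sphere_report is hypothetical, r is assumed to be at least 1, and the format specification .5g is merely a convenient way of obtaining five significant figures, not the layout produced by the 803 print routine.

```python
import math

def sphere_report(data):
    """Follow the scheme above: 'data' is the sequence r, A1, A2, ..., Ar."""
    values = iter(data)
    r = int(next(values))                 # Read r
    total = 0.0                           # Set T to zero
    lines = []
    for _ in range(r):                    # one pass per sphere
        area = float(next(values))        # Read an area A
        volume = area * math.sqrt(area) / (6 * math.sqrt(math.pi))   # Form V
        total += volume                   # Add V to T
        lines.append(f"{volume:.5g}")     # Print V on a new line
    mean = total / r                      # Divide T by r' to form U
    lines[-1] += "  " + f"{mean:.5g}"     # U is printed to the right of the last V
    return "\n".join(lines)

print(sphere_report([2, 4 * math.pi, 16 * math.pi]))
```

The data sequence here describes two spheres, of radius 1 and radius 2.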

    5.6 Programme

    BLOCK   @      DIRECTORY
    1       +256   Main Programme
    2       +280   Constants
    3       +285   Workspace
    4       +290   Print subroutine
    5       +490   Square root subroutine
    * BLOCK 1
    0,1     73 170     40 53      Read r
    1,1     21 0,3     22 0,3     -(r-1) in 0,3
    2,1     65 4096    20 1,3     r' in 1,3
    3,1     26 2,3     74 27      T to zero, punch fs (Note 1)
    4,1     73 170     40 53      Read an area A
    5,1     20 3,3     00 0       Copy of A in 3,3
    6,1     73 0,5     40 1,5     Form √A
    7,1     63 3,3     64 0,2     Form V
    8,1     10 2,3     60 2,3     Add to T
    9,1     10 2,3     00 0       (V back in accumulator)
    10,1    74 29      74 30      Punch cr lf (Note 1)
    11,1    73 0,4     40 1,4     Punch V to
    12,1    00 0       00 5         5 significant figures
    13,1    32 0,3     41 4,1     Count spheres and test
    14,1    30 2,3     64 1,3     Form U
    15,1    74 28      74 28      Punch sp sp (Note 1)
    16,1    73 0,4     40 1,4     Punch U to
    17,1    00 0       00 5         5 significant figures
    18,1    40 18,1    00 0       Stop
    * BLOCK 2
    0,2 +10.6347231/
    * BLOCK 3
    0,3 +0 Count of number of spheres
    1,3 +0 r'
    2,3 +0 T
    3,3 +0 A
    * BLOCK 4
    Copy Floating-Point Print subroutine, without )
    * BLOCK 5
    Copy Floating-Point Square Root Subroutine, with )

    Note 1: The first character to be punched is fs, to ensure that the results will be printed as figures. Before each V, the characters cr lf are punched to make the teleprinter "start a new line" before printing V. Before U we have sp sp: the student may work out why for himself.
