Floating-point representation is the term used to describe a certain method of representing numbers within a computer. This method is akin to the notation by which such a number as 0.000000000135 is represented by writing 1.35 x 10^-10.
The purpose of such a notation is to enable very small or very large numbers to be represented without the need to write long strings of figures. In the same way, the purpose of floating-point representation within the computer is to enable very small and very large numbers to be represented by a limited number of digits - by one computer word, in fact.
This representation can be used in any 803, but in a basic machine it is necessary to use subroutines to perform even such simple operations as addition, subtraction etc. However, if an automatic floating-point unit is fitted, it is possible to carry out arithmetic on floating-point numbers by means of special functions provided for the purpose.
Any given number A can be represented in many ways by a pair of numbers (a, b) which satisfies the equality
A = a x 2^b
in which a is called the mantissa or argument and b the (binary) exponent. Such a representation is termed (binary) floating-point, by contrast with the forms discussed in the main part of this manual, e.g. integers, fractions, etc., all of which are fixed-point.
For example, the number 6 can be represented by (.75, 3) or by (12, -1).
The usefulness of the notation lies, however, in the fact that any given non-zero number can be represented as a pair of numbers in which the magnitude of the mantissa lies between ½ and 1 and the exponent is integral. In practice when using this notation in the computer there are upper and lower limits on the value of the exponent, and so there must be upper and lower limits on the magnitude of the non-zero numbers which can be represented.
Zero can, of course, be represented by any pair in which the mantissa is 0.
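As a modern illustration of the normalised form described above (not 803 code), the following Python sketch uses the standard library function `math.frexp`, which returns a pair (a, b) with ½ ≤ |a| < 1 and b integral. The helper name `normalise` is hypothetical and is introduced only for this example.

```python
import math

def normalise(value):
    """Return (a, b) with value == a * 2**b and 0.5 <= |a| < 1, or a == 0."""
    a, b = math.frexp(value)
    return a, b

# The same number has many (a, b) representations ...
assert 0.75 * 2**3 == 6 and 12 * 2**-1 == 6
# ... but only one in which the mantissa is normalised.
print(normalise(6.0))        # (0.75, 3)
print(normalise(-0.078125))  # (-0.625, -3)
print(normalise(0.0))        # (0.0, 0)
```

Note that `math.frexp` places a negative mantissa in the range -1 < a ≤ -½, which may differ at the end points from the convention used inside the computer.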
In the 803 computers fitted with automatic floating-point units the details of the representation used are as follows:
For ease of reading, the numbers in the examples below have been printed with the sign digit, the fractional digits of the mantissa and the digits of the exponent in separate groups.
|+120, i.e. 15/16 x 2^8 [ed: should be +240]|
|-.078125, i.e. (-5/8) x 2^-3|
|+(1 - 2^-29) x 2^255, the largest possible positive number, about 5.8 x 10^76|
|-(½ + 2^-29) x 2^-256, the smallest possible negative number, about -4.3 x 10^-78|
Observe that zero has the same form in floating-point representation as in fixed-point, and that the sign digit of any positive number or zero is 0, while that of any negative number is 1.
The accuracy of any representation is determined by the number of significant figures employed. Here this is 29 binary digits, so the accuracy is slightly less than that which would be expected from 9 significant decimal figures.
The last two examples given above show extreme values of numbers which can be represented. The actual range is defined by:
|Zero||represented exactly and unambiguously.|
|Largest positive number:||(1 - 2^-29)||x 2^255||)|
|Smallest positive number:||½||x 2^-256||) representation accurate to|
|Largest negative number:||-1||x 2^255||) 29 significant binary digits|
|Smallest negative number:||-(½ + 2^-29)||x 2^-256||)|
This may be summarised approximately by saying that zero is represented exactly and that any number A satisfying
½ x 2^-256 ≤ |A| < 2^255
can be represented to an accuracy approximately equal to that obtained by 9 significant decimal figures.
Certain rational numbers in this range can, of course, be represented exactly. These include, in particular, all integers in the range -536 870 912 ≤ n ≤ 536 870 911.
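The limits quoted above can be checked with a few lines of arithmetic. The sketch below is a modern Python illustration, assuming 29 fractional binary digits and an exponent running from -256 to 255 as stated in the table; the constant names are hypothetical.

```python
import math

FRACTION_BITS = 29            # fractional digits of the mantissa
MIN_EXP, MAX_EXP = -256, 255  # limits on the exponent b

largest_positive = math.ldexp(1 - 2.0**-FRACTION_BITS, MAX_EXP)
smallest_positive = math.ldexp(0.5, MIN_EXP)
decimal_digits = FRACTION_BITS * math.log10(2)

print(f"largest positive  ~ {largest_positive:.2e}")
print(f"smallest positive ~ {smallest_positive:.2e}")
print(f"29 binary digits  ~ {decimal_digits:.2f} decimal digits")
print(f"exact integers from {-2**FRACTION_BITS} to {2**FRACTION_BITS - 1}")
```

The third line of output shows why the accuracy is described as slightly less than 9 significant decimal figures.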
In each floating-point operation executed by the computer the result is rounded off without bias in such a way that its mantissa will not differ by more than 2^-29 from the correct result. If the true result of any floating-point operation upon two numbers, which are represented exactly, is itself capable of exact representation, then the result actually produced by the computer will be exactly correct.
It will be appreciated that the value of the exponent determines the absolute magnitude of the smallest increase or decrease in any number which can be represented within the computer. Thus, in general, the greater the magnitude of any particular number, the greater the step to the next representable number.
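The growth of the step with magnitude can be sketched as follows. This is an illustrative Python fragment, assuming a mantissa of 29 fractional binary digits normalised so that ½ ≤ |a| < 1; the function name `step` is hypothetical.

```python
import math

FRACTION_BITS = 29

def step(value):
    """Spacing between neighbouring representable numbers near value."""
    _, b = math.frexp(value)          # value = a * 2**b with 0.5 <= |a| < 1
    return math.ldexp(1.0, b - FRACTION_BITS)

for x in (0.75, 15.0, 1.0e6, 1.0e12):
    print(f"{x:>10.4g}  step {step(x):.3g}")
```

Under these assumptions the step near 15 is 2^-25, while near 10^12 it is already greater than 1000, so integers of that size can no longer all be distinguished.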
The Floating-Point Overflow Lamp
It should be observed that, before the advent of the 803 automatic floating-point unit, many subroutines were written by means of which floating-point operations could be performed, and that these remain in use on 803's not fitted with automatic floating-point units. In most of these programmes, the representation employed differs to some extent from the representation described above.
In functions 60 to 64 inclusive the computer treats both a and n as standard floating-point numbers and produces a standard floating-point result, which replaces the old content of the accumulator.
The content of the auxiliary register is cleared by any of these operations, but the store is not affected.
|60 N||Add n to a||a + n||3 x 288 μsec.|
|61 N||Subtract n from a||a - n||3 x 288 μsec.|
|62 N||Negate a and add n||n - a||3 x 288 μsec.|
|63 N||Multiply a by n||an||17 x 288 μsec.|
|64 N||Divide a by n||a/n||34 x 288 μsec.|
|65 4096||Convert the fixed-point integer in the accumulator to floating-point form||2 x 288 μsec.|
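The effect and timing of functions 60 to 64 can be summarised in a small Python sketch. It is an illustration only: it uses ordinary Python floats rather than the 29-digit mantissa, it does not model the auxiliary register, and the names `FUNCTIONS` and `obey` are hypothetical.

```python
UNIT_US = 288  # the time unit used in the table, in microseconds

# function code -> (effect on accumulator a with operand n, number of time units)
FUNCTIONS = {
    60: (lambda a, n: a + n, 3),
    61: (lambda a, n: a - n, 3),
    62: (lambda a, n: n - a, 3),
    63: (lambda a, n: a * n, 17),
    64: (lambda a, n: a / n, 34),
}

def obey(function, a, n):
    """Return (new accumulator, time in microseconds) for functions 60 to 64."""
    effect, units = FUNCTIONS[function]
    return effect(a, n), units * UNIT_US

print(obey(62, 1.5, 4.0))   # negate a and add n
print(obey(64, 1.0, 4.0))   # divide a by n
```

On these figures a division takes a little under 10 milliseconds, roughly eleven times as long as an addition.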
In obeying the instruction 65 4096 the computer treats the existing content of the accumulator as a fixed-point integer to scale x 2^-38, and converts this to standard floating-point form.
|which is the fixed-point representation of the integer 15 is converted to|
|0 11110000000000000000000000000 100000100|
|in which a=15/16 and (b+256) = 260|
|so that a x 2^b = 15/16 x 2^4 = 15|
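The packing of a number into a word can be sketched in Python. The layout assumed here is inferred from the example above: one sign digit and 29 fractional digits forming a two's complement mantissa, followed by 9 digits holding b + 256. The treatment of a mantissa of exactly -½ is inferred from the table of extreme values. The names `encode`, `decode` and `show` are hypothetical, and this is not a description of how the hardware performs the conversion.

```python
import math

FRACTION_BITS = 29   # fractional digits of the mantissa
EXP_BITS = 9         # digits of the exponent field
EXP_OFFSET = 256     # the field holds b + 256

def encode(value):
    """Pack value into a 39-bit word: sign, 29 fraction digits, 9 exponent digits."""
    if value == 0:
        return 0
    a, b = math.frexp(value)                  # 0.5 <= |a| < 1
    m = round(a * 2**FRACTION_BITS)           # mantissa as a signed integer
    if m == 2**FRACTION_BITS:                 # rounding carried the mantissa up to +1
        m, b = m // 2, b + 1
    if m == -(2**(FRACTION_BITS - 1)):        # assumed: -1/2 is held as -1 x 2**(b-1)
        m, b = 2 * m, b - 1
    if not -EXP_OFFSET <= b < EXP_OFFSET:
        raise OverflowError("exponent out of range")
    mantissa_field = m & (2**(FRACTION_BITS + 1) - 1)   # 30-bit two's complement
    return (mantissa_field << EXP_BITS) | (b + EXP_OFFSET)

def decode(word):
    """Recover the value held in a 39-bit floating-point word."""
    m = word >> EXP_BITS
    if m >= 2**FRACTION_BITS:                 # sign digit set: negative mantissa
        m -= 2**(FRACTION_BITS + 1)
    b = (word & (2**EXP_BITS - 1)) - EXP_OFFSET
    return math.ldexp(m, b - FRACTION_BITS)

def show(word):
    """Print the word in three groups: sign, fraction, exponent."""
    bits = f"{word:039b}"
    return f"{bits[0]} {bits[1:30]} {bits[30:]}"

print(show(encode(15)))
print(decode(encode(-0.078125)))
```

The first line printed gives the three groups for the integer 15, so that they can be compared with the word shown in the example.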
It should be noted that function 65, with values of N ≠ 4096, is used for other purposes in certain special versions of 803, and that 65 4096 is the only permitted form of the instruction to convert integer to floating-point representation.
Negating
Observe that if C(Acc) is a floating-point number, it may be negated by means of the instruction 62 0.
|00 10 20 30||The effect is normal|
|03 13 23 33||Can be used to separate the exponent and mantissa: otherwise not of much practical value.|
|06 16 26 36||The effect is normal|
|41 42 45 46||Will work equally well when C(A) is a floating-point number|
These functions are:
01, 11, 21, 31,
02, 12, 22, 32,
04, 14, 24, 34,
05, 15, 25, 35,
07, 17, 27, 37,
All Group 5
Where the effect of the above instructions is to perform arithmetic on one or more words which represent floating-point numbers, they will not, in general, produce sensible results. The functions 43 and 47 refer to the fixed-point overflow indicator, which is not usually affected by floating-point overflow: see 2.6(iii).
It is, however, possible to carry out such actions as using 22 N to double a floating-point number or using a fixed-point add or subtract instruction to double or halve several times by adding or subtracting an integer (thereby changing A = a x 2^b to a x 2^(b+n) = A x 2^n). It should be noted that these operations would not be subject to the normal rules of floating-point overflow and underflow.
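The effect of adding an integer to a floating-point word can be sketched in Python. The sketch assumes that the exponent occupies the 9 least significant digits of the word (as the printed grouping of sign, fraction and exponent suggests), so that a fixed-point addition of n simply adds n to the exponent. The name `decode` is hypothetical.

```python
import math

EXP_BITS, EXP_OFFSET, FRACTION_BITS = 9, 256, 29

def decode(word):
    """Value of a 39-bit word laid out as sign, 29 fraction digits, 9 exponent digits."""
    m = word >> EXP_BITS
    if m >= 2**FRACTION_BITS:                 # sign digit set: negative mantissa
        m -= 2**(FRACTION_BITS + 1)
    b = (word & (2**EXP_BITS - 1)) - EXP_OFFSET
    return math.ldexp(m, b - FRACTION_BITS)

# Word for 15 = 15/16 x 2**4: sign 0, fraction 1111 then 25 zeros, exponent 4 + 256
fifteen = int("0" + "1111" + "0" * 25 + "100000100", 2)

for n in (1, 3, -2):
    # a fixed-point add of n scales the floating-point value by 2**n
    print(n, decode(fifteen + n))
```

Nothing in this arithmetic checks the exponent limits: if the exponent field overflowed, the carry would pass into the mantissa digits, which is one way of seeing why the normal overflow and underflow rules do not apply.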
Just as with fixed-point programming, constants and data which are to be operated on in binary are expressed in decimal on programme sheets and when writing out data.
While the actual details of the conventions used will vary from one subroutine or programme to another, the general written notation for a floating-point number is ± a/b, representing ± a x 2^b. b must be an integer; a may be a fraction, integer or mixed number.
Thus we could express the floating-point number
Of these, the first is sometimes called the standard (decimal) floating-point form, which is roughly defined by saying that the mantissa, a, obeys the decimal inequality
An augmented translation input routine is available, which will read and store floating-point constants written in any of the above notations. This routine occupies more storage space than the standard form, so programmes written for automatic floating-point computers can conveniently commence in location 256.
Subroutines which will perform the functions of reading and printing floating-point numbers, including if desired the conversion from or to fixed-point form, are available. So also are subroutines to evaluate mathematical functions of floating-point numbers.
At the time (May 1961) of writing this text, the subroutines to which reference is made exist only in a provisional form. It should therefore be noted that final details may differ from those assumed here.
This may also be entered from any programme by the instruction pair 73 170 40 52 whereupon it will read one number (fixed or floating-point) from the input tape and exit with this in the accumulator.
Storage requirements: less than 200 locations.
Given: The integer r, the surface areas of r spheres.
Calculate and print, to five significant figures, the volume of each sphere and the mean volume per sphere.
r' is the standard floating-point form of r.
If the area of a sphere is A, then its volume V is determined by
V = A √A / (6 √π)
and we note that 6 √π = 10.6347231
If T is the total volume, then
T = ΣV = 0 + V_1 + V_2 + ... + V_r
and if U is the mean volume, then
U = T / r'
The following numbers will be punched on the data tape, in this sequence:
The integer r
The floating-point numbers, representing the surface areas, one after another.
The individual volumes will be printed in a column, and the mean volume will be printed to the right of the last individual volume.
|Set count -(r-1)|
|Form r' and store it|
|Set T to zero|
|Read an area A||<-----------------------+|
|Add V to T||||
|Print V on a new line||||
|Count and test : if r spheres have not||||
|yet been dealt with, go back||--------------------------+|
|Otherwise: divide T by r' to form U|
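The calculation in the flow diagram can be sketched in a modern language as a cross-check on the arithmetic. The Python fragment below is not a translation of the 803 programme: it uses ordinary Python floats, the function names `volume` and `report` are hypothetical, and the sample areas in the last line are invented for illustration. It assumes at least one sphere.

```python
import math

SIX_ROOT_PI = 10.6347231      # the constant 6 * sqrt(pi) quoted in the text

def volume(area):
    """Volume of a sphere from its surface area: V = A * sqrt(A) / (6 * sqrt(pi))."""
    return area * math.sqrt(area) / SIX_ROOT_PI

def report(areas):
    """Return the printed lines: one V per line, with U beside the last V."""
    total = 0.0                               # T, set to zero
    lines = []
    for area in areas:                        # one pass per sphere
        v = volume(area)
        total += v                            # add V to T
        lines.append(f"{v:.5g}")              # V to five significant figures
    mean = total / len(areas)                 # U = T / r'
    lines[-1] += f"  {mean:.5g}"              # U to the right of the last V
    return lines

print("\n".join(report([12.566, 50.265, 113.1])))
```

A sphere of unit radius has area 4π and volume 4π/3, which gives a convenient check on `volume`.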
|5||+490||Square root subroutine|
|1,1||21||0,3||22||0,3||-(r-1) in 0,3|
|2,1||65||4096||20||1,3||r' in 1,3|
|3,1||26||2,3||74||27||T to zero, punch fs1|
|4,1||73||170||40||53||Read an area A|
|5,1||20||3,3||00||0||Copy of A in 3,3|
|8,1||10||2,3||60||2,3||Add to T|
|9,1||10||2,3||00||0||(V back in accumulator)|
|10,1||74||29||74||30||Punch cr lf1|
|11,1||73||0,4||40||1,4||Punch V to|
|12,1||00||0||00||5||) 5 significant figures|
|13,1||32||0,3||41||1,4||Count spheres and test|
|15,1||74||28||74||28||Punch sp sp1|
|16,1||73||0,4||40||1,4||Punch U to|
|17,1||00||0||00||5||) 5 significant figures|
|0,3||+0||Count of number of spheres|
|Square Root Subroutine|
Note 1: The first character to be punched is fs, to ensure that the results will be printed as figures. Before each V, the characters cr lf are punched to make the teleprinter "start a new line" before printing V. Before U we have sp sp: the student may work out why for himself.