Chapter 6. Data representation: Floating point
Representing numbers as fixed-length integers has some notable limitations. It isn't suitable for very large numbers that don't fit into 32 bits, like 6.02 × 1023, nor can it handle numbers that have a fraction, like 3.14. We'll now turn to systems for handling a wider range of numbers.
One possibility for handling numbers with fractional parts is to add bits after the decimal point: The first bit after the decimal point is the halves place, the next bit the quarters place, the next bit the eighths place, and so on.
Suppose that we want to represent 1.625(10). We would want 1 in the ones place, leaving us with 0.625. Then we want 1 in the halves place, leaving us with 0.625 − 0.5 = 0.125. No quarters will fit, so put a 0 there. We want a 1 in the eighths place, and we subtract 0.125 from 0.125 to get 0.
So the binary representation of 1.625 would be 1.101(2).
The idea of fixed-point representation is to split the bits of the representation between the places to the left of the decimal point and places to the right of the decimal point. For example, a 32-bit fixed-point representation might allocate 24 bits for the integer part and 8 bits for the fractional part.
To represent 1.625, we would then write
The first three bytes give the 1, and the last byte gives the representation of 0.625.
Fixed-point representation works well as long as you work with numbers within the given range. The 32-bit fixed-point representation described above can represent any multiple of 1/256 from 0 up to 224 ≈ 16.7 million. But programs frequently need to work with numbers from a much broader range. For this reason, fixed-point representation isn't used very often in today's computing world. [Financial software is a notable exception. Here, designers often want all computations to be precise to the penny, and in fact they should always be rounded to the nearest penny. There is no reason to deal with very large amounts (like trillions of dollars) or fractions of a penny. Such programs use a variant of fixed-point representation that represents each amount as an integer multiple of 1/100, just as the fixed-point representation represents numbers as an integer multiple of 1/256.]
6.1. Floating-point representation
Floating-point representation is an alternative technique based on scientific notation. Because we're working with computers, we'll base our scientific notation on powers of 2, not 10 as is traditional. For example, the binary representation of 5.5(10) is 101.1(2). When we convert this to binary scientific notation, we move the decimal point to the left two places, giving us 1.011(2) × 22. (This is just like converting 101.1(10) to scientific notation: It would be 1.011(10) × 102.)
To represent a number written in scientific notation in bits, we'll decide how to split up the representation to fit it into a fixed number of bits. First, let us define the two parts of scientific representation: In 1.011(2) × 102, we call 1.011(2) the mantissa (or the significand), and we call 2 the exponent. In this section we'll use 8 bits to store such a number, divided as follows.
We use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents), and the last three bits for the fraction of the mantissa. Note that we omit the digit to the left of the decimal point: Since the mantissa has only one nonzero bit to the left of the decimal point, and the only nonzero bit is 1, we know that the bit to the left of the decimal point must be a 1. There's no point in wasting space in inserting this 1 into our bit pattern. We include only the bits of the mantissa to the right of the decimal point.
We call this a floating-point representation because the values of
the mantissa bits float
along with the decimal point, based on
the exponent's given value.
This is in contrast to fixed-point representation, where the
decimal point is always in the same place among the bits
given.
Continuing our example of 5.5(10) = 1.011(2) × 22, we add 7 to 2 to arrive at 9(10) = 1001(2) for the exponent bits. Into the mantissa bits we place the bits following the decimal point of the scientific notation, 011. This gives us 0 1001 011 as the 8-bit floating-point representation of 5.5(10).
Suppose we want to represent −96(10).
- First we convert our desired number to binary: −1100000(2).
- Then we convert this to binary scientific notation: −1.100000(2) × 26.
- Then we fit this into the bits.
- We choose 1 for the sign bit if the number is negative. (It is, in this case.)
- We add 7 to the exponent and place the result into the four exponent bits. (In this case, we arrive at 6 + 7 = 13(10) = =1101(2).)
- The three mantissa bits are the first three bits following the leading 1: 100. (If there are more than three bits, then rounding will be necessary.)
Conversely, suppose we want to decode the number 0 0101 100.
- We observe that the number is positive, and the exponent bits represent 0101(2) = 5(10). This is 7 more than the actual exponent, and so the actual exponent must be −2. Thus, in binary scientific notation, we have 1.100(2) × 2−2.
- We convert this to binary: 1.100(2) × 2−2 = 0.011(2).
- We convert the binary into decimal: 0.011(2) = 1/4 + 1/8 = 3/8 = 0.375(10).
6.1.1. Alternative conversion algorithm
The process described above for converting from decimal to binary representation relies implicitly on the repeated subtraction algorithm for converting from decimal to binary. For example, we arrive at 101.1(2) for 5.5(10) by subtracting 4, then 1, then 0.5. If we wanted to convert 2.375(10), we would choose a 1 for the 2's place (leaving 0.375), the 1/4's place (leaving 0.125), and the 1/8's place (leaving 0), giving us 10.011(2).
Alternatively, we can use a process inspired by the repeated division algorithm. Here, we convert to binary in two steps.
We take the portion to the left of the decimal point and use the repeated division algorithm for converting from decimal to binary. In the case of 2.375(10), we would take the 2 to the left and repeatedly divide until we reach 0, reading the remainders to arrive at 10(2).
We take the portion to the right of the decimal point and repeatedly multiply it by 2, each time extracting the bit to the right of the decimal point, until we reach 0 (or until we have plenty of bits to fill out the mantissa bits). In the example of 2.375(10), the fractional part is .375(10). We repeatedly multiply this by 2, each time taking out the integer part as the next bit of the binary fraction, arriving at 011.

These bits are the bits following the decimal point in the binary representation. Placed with the bits before the decimal point determined in the previous step, we would conclude with 10.011(2).
6.1.2. Representable numbers
This 8-bit floating-point format can represent a wide range of both small numbers and large numbers. To find the smallest possible positive number we can represent, we would want the sign bit to be 0, we would place 0 in all the exponent bits to get the smallest exponent possible, and we would put 0 in all the mantissa bits. This gives us 0 0000 000, which represents
To determine the largest positive number, we would want the sign bit still to be 0, we would place 1 in all the exponent bits to get the largest exponent possible, and we would put 1 in all the mantissa bits. This gives us 0 1111 111, which represents
Thus, our 8-bit floating-point format can represent positive numbers from about 0.0078(10) to 480(10). In contrast, the 8-bit two's-complement representation can only represent positive numbers between 1 and 127.
But notice that the floating-point representation can't represent all of the numbers in its range — this would be impossible, since eight bits can represent only 28 = 256 distinct values, and there are infinitely many real numbers in the range to represent. Suppose we tried to represent 17(10) in this scheme. In binary, this is 10001(2) = 1.0001(2) × 24. When we try to fit the mantissa into the mantissa portion of the 8-bit representation, we find that the final 1 won't fit: We would be forced to round. In this case, computers would ignore the final 1 and use the pattern 0 1011 000. [Computers generally round to the nearest possibility. But, when we are exactly between two possibilities, as in this case, most computers follow the policy of rounding so that the final mantissa bit is 0. This detail of exactly how the rounding occurs is not important to our discussion, however.] That rounding means that we're not representing the number precisely. In fact, 1 1011 000 translates to
Thus, in our 8-bit floating-point representation, 17 equals 16! That's pretty irritating, but it's a price we have to pay if we want to be able to handle a large range of numbers with such a small number of bits.
While a floating-point representation can't represent all numbers precisely, it does give us a guaranteed number of significant digits. For this 8-bit representation, we get a single digit of precision, which is pretty limited. To get more precision, we need more mantissa bits. Suppose we defined a similar 16-bit representation with 1 bit for the sign bit, 6 bits for the exponent plus 31, and 9 bits for the mantissa.
This representation, with its 9 mantissa bits, happens to provide three significant digits. Given a limited length for a floating-point representation, we have to compromise between more mantissa bits (to get more precision) and more exponent bits (to get a wider range of numbers to represent). For 16-bit floating-point numbers, the 6-and-9 split is a reasonable tradeoff of range versus precision.
6.2. IEEE standard
Nearly all computers today follow the the IEEE 754 standard, published in 1980, for representing floating-point numbers. This standard is similar to the 8-bit and 16-bit formats we've explored already, but the standard deals with longer lengths to gain more precision and range. There are three major varieties of the standard, for 32 bits, 64 bits, and 79 bits.
sign exponent mantissa exponent significant format bits bits bits excess digits Our 8-bit 1 4 3 7 1 Our 16-bit 1 6 9 31 3 IEEE 32-bit 1 8 23 127 6 IEEE 64-bit 1 11 52 1,023 15 IEEE 79-bit 1 15 63 16,383 19
All of these formats use an offset for the exponent, called the excess. In all of these formats, the excess is halfway up the range of numbers that can fit into the exponent bits. For the 8-bit format, we had 4 exponent bits; the largest number that can fit into 4 bits is 24 − 1 = 15, and so the excess is 7 ≈ 15/2. The IEEE 32-bit format has 8 exponent bits, and so the largest number that fits is 255, and the excess is 127 ≈ 255/2.
In programming, the 32-bit and 64-bit versions are most common.
Java represents these with float and double.
(C has these types too, but as with the integer types, it doesn't
specify their length or representation. Most commonly, however,
a float uses 32-bit IEEE format, and a double uses
64-bit IEEE format.)
The IEEE standard formats generally follow the rules we've outlined so far, but there are two exceptions: the denormalized numbers and the nonnumeric values. We'll look at these next.
6.2.1. Denormalized numbers
The first special case is for dealing with very small values. Let's go back to the 8-bit representation we've been studying. If we plot the small numbers that can be represented exactly on the number line, we get the distribution illustrated below.
The smallest representable positive number is 2−7 = 1/128 (bit pattern 00000000), and the largest representable negative number is −2−7 = −1/128 (bit pattern 10000000). These are small numbers, but when we look at the above number line, we see an anomaly: There is a relatively large gap between them. And — notice — there is no exact representation of one of the most important numbers of all: zero!
To deal with this, the IEEE standard defines the denormalized numbers. The idea is to take the most closely clustered numbers illustrated in the number line above (illustrated in blue) and spread them more evenly across 0. This will give us the below diagram instead.
Each of these blue
numbers have exponent bits that are
all 0.
To arrive at the red
numbers instead, we'll change the meanings
of these numbers as follows: When all exponent bits are 0, then the
exponent is −6, and the mantissa has an implied 0 before
it. [The word denormalized comes from the fact that the
mantissa is not in its normal form, where a nonzero
digit is to the left of the decimal point.]
Consider the bit pattern 0 0000 010: In a floating-point format
incorporating the denormalized case, this represents
0.010(2) × 2−6 = 1 × 2−8 = 1/256.
(Without the denormalized case, this would
represent 1.010(2) × 2−7.
The changes are in the bit before the mantissa's
decimal point and in the exponent of −7.)
Suppose we want to represent 0.005(10) in our 8-bit floating-point format with a denormalized case. We first convert our number into the form x × 2−6. In this case, we would get 0.320(10) × 2−6. Converting 0.320(10) to binary, we get approximately 0.0101001(2). In the 8-bit format, however, we have only three mantissa bits, and so we would round this to 0.011(2). Thus, we have 0.011(2) × 2−6, and so our bit representation would be 0 0000 011. This is just an approximation to the original number of 0.005(10): It is about 0.00586(10). Without the denormalized case, the best approximation would be much further off (0.00781(10)).
How would we represent 0? We go through the same process: Converting this into the form x × 2−6, we get 0.0 × 2−6. This translates into the bit representation 0 0000 000.
Why −6 for the exponent? It would make more intuitive sense to use −7, since this is what the all-zeroes exponent is normally. We use −6, however, because we want a smooth transition between the normalized values and the denormalized values. The smallest positive normalized value is 1 × 2−6 (bit pattern 0 0001 000). If we used −7 for the denormalized exponent, then the largest denormalized value would be 0.111(2) × 2−7, which is roughly half of the smallest positive normalized value. By using the same exponent as for the smallest normalized case, the standard spreads the denormalized numbers evenly from the smallest positive normalized number to 0. The second number line diagrammed above illustrates this: The open circles, representing values handled by the denormalized case, are spread evenly between the solid circles, representing the numbers handled by the normalized case.
The denormalized case works the same for the IEEE standard floating-point formats, except that the exponent varies based on the format's excess. In the 32-bit standard, for example, the denormalized case is still the case when all exponent bits are zero, but the exponent it represents is −126 (since the normalized case involves an excess-127 exponent, and so the lowest exponent for normalized numbers is 1 − 127 = −126).
6.2.2. Nonnumeric values
The IEEE standard's designers were concerned with some special cases — particularly, computations where the answer doesn't fit into the range of defined numbers. To address such possibilities, they reserved the all-ones exponent for the nonnumeric values. They designed two types of nonnumeric values into the IEEE standard.
- If the exponent is all ones and the mantissa is all zeroes, then the number represents infinity or negative infinity, depending on the sign bit. Essentially, these two values are to represent numbers that have gone out of bounds. This value results from an overflow; for example, if you doubled the largest positive value, you would get infinity. Or if you divide 1 by a tiny number, you would get either infinity or negative infinity.
- If the exponent is all ones, and the mantissa has some non-zero
bits,
then the number represents
not a number,
written as NaN. This represents an error condition; some situations where this occurs include finding the square root of −1, computing the tangent of π/2, and dividing 0 by 0.
6.3. Analysis
6.3.1. Laws of arithmetic
You might wish that the standard laws of arithmetic would apply to numeric representations. For example, it would be convenient if the commutative law (x + y = y + x) applied to addition.
With two's-complement arithmetic, all the standard laws apply except those that involve division. Identities involving division fail, of course, because division often results in non-integers, which can only be approximated in a two's-complement representation.
With IEEE floating-point numbers, however, many laws fall apart. The commutative law still holds. But the associative law of addition (x + (y + z) = (x + y) + z) does not: Suppose x were the largest possible numeric value, y were 1, and z were −1. Then the value of x + (y + z) would be x, since y + z is 0 and x + 0 is x. But the value of (x + y) + z would be positive infinity, since x + y results in overflow, which is equivalent to infinity in the IEEE format, and any finite number added to infinity results in infinity again.
The associative law fails for addition even without resorting to such special cases. Consider x = 1, y = 230, z = −230 in our 16-bit representation. If you evaluate x + (y + z), you get 1, since y + z = 0. But (x + y) + z is 0, since the value 1 + 230 is 230 exactly because the added 1 can't be represented within the mantissa bits of 230.
Another example of something that should be true but isn't in IEEE floating-point arithmetic is the equation
The reason this fails is that 1/6 has no precise representation with a fixed number of mantissa bits, and so it must be approximated with the IEEE format. As it happens, it underestimates slightly. When we add this underestimation to itself 6 times, the total ends up being less than 1.
6.3.2. Conclusion
We have seen again how design principles can influence languages:
Whereas C leaves the details of representation to the architecture,
while Java chooses to specify the exact representation of its numbers.
Thus, you can safely treat each number in Java as a two's-complement
or IEEE format, with a known length, and be sure that it will work on
any architecture. Java chooses this because of its emphasis on
portability across computers. With C, however, efficiency and accurate
representation of the architecture is a higher priority, and so C's
definition intentionally omits defining the representation of an
int or a float.
In any fixed-length representational system, there will be a limited number of representable values, and this leads to the possibility of overflow and roundoff errors. These are not abstract phenomena — they cause real and unexpected errors in many situations. Computer designers have worked hard to create representations in which such problems are rare, but they are still common enough to demand consideration, especially when a program works with floating-point computation.


