Sunday, February 8, 2015

Floating-point is finite

The amazing thing about the human mind is that it can conceive of infinite things, including numbers. In this area the human brain far outstrips our computers.

First, a short review of numbers, as seen from the computer's point of view. We have two important points:

Modern computers are binary, which means that they use only the digits one and zero when calculating. While they can display numbers in decimal for us humans, they store numbers and perform operations on binary values.

Computers use numbers with a finite number of digits. We humans can think of numbers with any number of digits, starting with '1' and going up as high as we like. Computers are different -- they can go up to a certain number, and then they stop. (The exact number depends on the computer design and the representation used. More on that later.)

For example, a computer designed for 8-bit numbers can handle numbers, in the computer's internal format, from 00000000 to 11111111. That is only 256 possible combinations, and we typically assign the values 0 to 255 to the internal ones:

00000000 = 0
00000001 = 1
00000010 = 2
00000011 = 3
...
11111110 = 254
11111111 = 255
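As a quick illustration (a small Python sketch of my own, not part of the original post), the built-in conversions show the same pairing of bit patterns and values:

# Each 8-bit pattern corresponds to exactly one value from 0 to 255.
for value in (0, 1, 2, 3, 254, 255):
    print(format(value, '08b'), '=', value)

print('combinations:', 2 ** 8)   # 256 patterns in all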

That's not a lot of numbers, and even the early PCs allowed for more. How do we get more numbers? By increasing the number of bits (binary digits), just as we humans get more numbers by adding digits to our decimal numbers.

Sixteen bits allows for 65,536 combinations:

00000000 00000000 = 0
00000000 00000001 = 1
00000000 00000010 = 2
00000000 00000011 = 3
...
11111111 11111110 = 65534
11111111 11111111 = 65535

Observant readers will note that the highest value (255 for 8-bit and 65535 for 16-bit) is one less than the number of combinations. That's because in each case, zero takes one combination. This is common in computer numeric systems. (There are a few systems that don't allow a slot for zero, but they are rare. Zero is rather useful.)

Modern computer systems allow for 32-bit and 64-bit numbers, which allow for greater numbers of numbers. Thirty-two bit systems can hold values in the range 0 to 4,294,967,295 and 64-bit systems can range from 0 to 18,446,744,073,709,551,615. Surely those are enough?
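The pattern behind these figures is simple: n bits give 2^n combinations, so the largest unsigned value is 2^n - 1. A quick check (a sketch of my own, not from the post):

# Largest unsigned value for each common bit size: 2**bits - 1.
for bits in (8, 16, 32, 64):
    print(bits, 'bits: 0 to', 2 ** bits - 1)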

It turns out that they are not.

One obvious deficiency is the lack of negative numbers. We need negative numbers, because sometimes we deal with quantities that are less than zero.

Computer architects long ago decided on a hack for negative numbers, a hack that was so good that we still use it today. We treat the topmost bit as the sign: when it is set, the number is negative, with all ones representing -1. (This scheme is called two's complement.) Thus, for 8-bit numbers:

10000000 = -128
10000001 = -127
...
11111110 = -2
11111111 = -1
00000000 = 0
00000001 = 1
00000010 = 2
00000011 = 3
...
01111110 = 126
01111111 = 127

(The same applies to 16-bit, 32-bit, and 64-bit numbers.)
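Decoding is simple; here is a sketch of my own in Python (the helper name signed8 is just for illustration): read the pattern as an unsigned value, and if the top bit is set, subtract 256.

# Interpret an 8-bit pattern as a signed (two's complement) value.
def signed8(bits):
    value = int(bits, 2)                      # unsigned value, 0 to 255
    return value - 256 if value >= 128 else value

for pattern in ('10000000', '11111111', '00000000', '01111111'):
    print(pattern, '=', signed8(pattern))     # -128, -1, 0, 127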

Now we have negative numbers! Whee!

Yet before we celebrate for too long, we notice that our maximum value has decreased. Instead of 255 (for 8-bit numbers), our maximum value is 127, about half of our previous maximum. (There are similar reductions for other bit sizes.)

We gained some negative numbers, at the cost of some positive numbers. The "number of numbers" is still 256, but now half are used for negative values and half for zero and the positive values.

Computing is full of trade-offs like this.

So we have the following:

bit size    unsigned range                         signed range
8-bit       0 to 255                               -128 to 127
16-bit      0 to 65,535                            -32,768 to 32,767
32-bit      0 to 4,294,967,295                     -2,147,483,648 to 2,147,483,647
64-bit      0 to 18,446,744,073,709,551,615        -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
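The signed limits follow the same pattern as the unsigned ones: for n bits, the range runs from -2^(n-1) to 2^(n-1) - 1. A quick check (again my own sketch):

# Two's complement range for each bit size: -2**(bits-1) to 2**(bits-1) - 1.
for bits in (8, 16, 32, 64):
    print(bits, 'bits:', -(2 ** (bits - 1)), 'to', 2 ** (bits - 1) - 1)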

Well, perhaps the loss of values isn't so bad. If we need a value greater than 127, we can use a 16-bit number, or a 32-bit number.

Do we have enough numbers? Yes, for most computations, with one caveat: we have no fractional values here. We have no 1/2, no 2/3, no 1/10. Integers are sufficient for many calculations, but not for every calculation.

How can we represent fractions? Or if not fractions, decimals?

One solution is to use an implied decimal point. It's implied because we do not store it, but imply its existence. I'll write that as '[.]'. (Strictly speaking it is a binary point, but I'll stick with the familiar term.)

Let's start with our friends, the 8-bit numbers, and give them a single decimal place:

1000000[.]0 = -64.0
1000000[.]1 = -63.5
...
1111111[.]0 = -1.0
1111111[.]1 = -0.5
0000000[.]0 = 0.0
0000000[.]1 = 0.5
0000001[.]0 = 1.0
0000001[.]1 = 1.5
...
0111111[.]0 = 63.0
0111111[.]1 = 63.5
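In other words, the stored pattern is read as a signed number and then divided by 2, because the single implied place is worth one half. A sketch of that rule (my own, with a made-up helper name):

# 8-bit fixed point with one implied binary place: signed value divided by 2.
def fixed8_1(bits):
    value = int(bits, 2)
    if value >= 128:                  # top bit set: negative
        value -= 256
    return value / 2

for pattern in ('10000000', '11111111', '00000000', '00000001', '01111111'):
    print(pattern, '=', fixed8_1(pattern))   # -64.0, -0.5, 0.0, 0.5, 63.5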

The first observation is that our fractions are not the decimal fractions (0.1, 0.2, 0.3...) we use with decimal numbers but are limited to either '.0' or '.5'. That is due to the nature of the binary system, which uses powers of 2, not powers of 10. In the decimal system, our digits to the left of the decimal point are the units (or 'ones'), tens, hundreds, thousands, ... but in binary our digits are units, twos, fours, eights, ... etc. To the right of the decimal point the decimal system uses tenths (10^-1), hundredths (10^-2), and thousandths (10^-3), or tens to negative powers. The binary system uses twos to negative powers: 2^-1 (or one half), 2^-2 (one quarter), 2^-3 (one eighth).
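Those negative powers of two are the only place values we get to the right of the point; a quick look at the first few (my sketch):

# The first few binary places after the point: 2 to negative powers.
for power in (1, 2, 3, 4):
    print('2^-%d =' % power, 2 ** -power)    # 0.5, 0.25, 0.125, 0.0625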

One decimal place in the binary system gets us halves, but nothing more precise. If we want more precision, we must use more decimal places.

But wait! you cry. Using one digit for decimals has cost us numbers! Yes, using one digit for a decimal has reduced our range from [-128, 127] to [-64.0, 63.5]. We still have the same number of numbers (256), but now we are assigning some of them to fractional values.

You may console yourself, thinking that we can use larger bit sizes for numbers, and indeed we can. But the "roughness" of our fractions (either .0 or .5, nothing in between) may bother you. What if we want a fraction of 0.25? The answer, of course, is more bits for the decimal.

100000[.]00 = -32.00
100000[.]01 = -31.75
...
111111[.]10 = -0.50
111111[.]11 = -0.25
000000[.]00 = 0.0
000000[.]01 = 0.25
000000[.]10 = 0.50
000000[.]11 = 0.75
000001[.]00 = 1.00
...
011111[.]10 = 31.50
011111[.]11 = 31.75

Now we have fractional parts of 0.25, 0.50, and 0.75. (At the cost of our range, which has shrunk to [-32.00, 31.75]. Larger bit sizes can still help with that.)
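The decoding rule is the same as before, except we now divide by 4 (that is, 2^2). A generalized sketch (mine, not from the post), with the number of implied places as a parameter:

# 8-bit fixed point with 'places' implied binary places after the point.
def fixed8(bits, places):
    value = int(bits, 2)
    if value >= 128:                  # top bit set: negative
        value -= 256
    return value / (2 ** places)

print(fixed8('10000000', 2))    # -32.0, the most negative value
print(fixed8('00000011', 2))    #  0.75
print(fixed8('01111111', 2))    #  31.75, the largest value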

The interesting aspect of our fractions is this: a fraction is the sum of inverted powers of two. The value 0.25 is 2^-2, or 1/4. The value 0.5 is 2^-1, or 1/2. The value 0.75 is 2^-1 + 2^-2, or 1/2 + 1/4.

Were we to increase the number of digits in our implied decimal, we would also use 1/8, 1/16, 1/32, etc. We would not use values such as 1/10 or 1/3, because they cannot be written as a finite sum of inverse powers of two. (Only powers of two, mind you!)
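We can see the trouble with 1/10 directly. Even Python's 64-bit floating point, with far more binary places than our 8-bit toy format, can only approximate it (this check is my own, not part of the original post):

# 1/10 is not a finite sum of inverse powers of two, so it can only be
# stored approximately, even with many more bits than our 8-bit examples.
print(format(0.1, '.20f'))     # 0.10000000000000000555...
print(0.5 + 0.25 + 0.125)      # 0.875: sums of inverse powers of two are exact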

Also, when representing a number, we can use a power of two only once. For plain integers, the value 10 must be represented as 2 + 8; it cannot be represented by 6 + 4 (6 is not a power of two) nor by 2 + 2 + 2 + 2 + 2 (we can use a value only once).

For fractions, the same rules apply. The fraction 3/4 must be represented by 1/2 + 1/4, not 1/4 + 1/4 + 1/4.
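Here is a small sketch (my own, with a made-up helper name) of that decomposition: repeatedly peel off the largest power of two that still fits, never using a power twice.

from fractions import Fraction

# Greedy decomposition into distinct powers of two, from 2^7 down to 2^-2.
def powers_of_two(value):
    value = Fraction(value)
    parts = []
    for power in range(7, -3, -1):
        term = Fraction(2) ** power
        if term <= value:
            parts.append(term)
            value -= term
    return parts

print([str(p) for p in powers_of_two(10)])              # ['8', '2']
print([str(p) for p in powers_of_two(Fraction(3, 4))])  # ['1/2', '1/4']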

These rules impose limits on the values we can represent. In our example of 8-bit numbers and two implied decimal places, we cannot represent the values 100, -2000, or 1.125. The first two are outside of our allowed range, and the third, while within the range, cannot be represented. We cannot add any combination of 1/2 and 1/4 to get 1/8 (the .125 portion of the value).
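One way to check this (again a sketch of my own): scale the value by 4, which shifts the two implied places into the integer part, and see whether the result is a whole number inside the signed 8-bit range.

# Is 'value' exactly representable as 8 bits with two implied binary places?
def representable(value):
    scaled = value * 4               # shift the two binary places away
    return scaled == int(scaled) and -128 <= scaled <= 127

print(representable(1.125))   # False: needs an eighth, which two places cannot hold
print(representable(1.25))    # True
print(representable(100))     # False: 100 * 4 = 400, outside -128..127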

This is a hard lesson. Our representations of numbers, integer and floating point, are not infinite. We humans can always think of more numbers.

Astute readers will note that I have described not floating-point arithmetic, but fixed-point arithmetic. The observation is correct: I have limited my discussion to fixed-point arithmetic. Yet the concepts are the same for floating-point arithmetic. It, too, has limits on the numbers it can represent.
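Floating point lets the point move (hence the name) and adds an exponent, but the number of significant bits stays fixed, so the same surprises appear. Two classic demonstrations in Python (my examples, not the post's):

# 64-bit floating point has the same limitation, just with more bits.
print(0.1 + 0.2)                     # 0.30000000000000004
print(0.1 + 0.2 == 0.3)              # False: both sides are approximations
print(2.0 ** 53 + 1 == 2.0 ** 53)    # True: even whole numbers run out of digits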
