
Thursday, February 12, 2015

Floating-point arithmetic will die

Floating-point arithmetic is popular. Yet despite its popularity, I foresee its eventual demise. It will be replaced by arbitrary-precision arithmetic, which offers more accuracy.

Here is a list of popular languages (today) and the availability of calculations with something other than floating-point arithmetic:

COBOL       COMP-3 type (fixed-precision decimal)
FORTRAN     none
C           GMP library (arbitrary precision, via its C API)
C++         GMP package (arbitrary precision)
Java        BigDecimal class (arbitrary precision)
C#          decimal type (a wider, base-10 floating-point)
Perl        Rat type in Perl 6 (arbitrary-precision rationals; 'Rat' for 'rational')
Ruby        BigDecimal class (arbitrary precision)
Python      decimal module (arbitrary precision)
JavaScript  big.js library (arbitrary precision)
Swift       none native; can use GMP
Go          'big' package (arbitrary precision)

Almost all of the major languages support something 'better' than floating-point arithmetic.

I put the word 'better' in quotes because the change from floating-point to something else (arbitrary-precision or fixed-precision decimal) is a trade-off. Floating-point arithmetic is fast, at the expense of precision. The IEEE standard for floating-point is good: it allows for a wide range of numbers in a small set of bits, and the math is fast. Most computer systems have hardware floating-point units, which make those operations faster still.

Arbitrary-precision arithmetic, in contrast, is slow. There are no co-processors to handle it (at least none in mainstream hardware) and a software solution for arbitrary precision is slower than even a software-only solution for floating-point.

Despite the costs, I'm fairly confident that we, as an industry, will switch from floating-point arithmetic to arbitrary-precision arithmetic. Such a change is merely one in a long line of changes, each trading computing performance for programmer convenience.

Consider previous changes of convenience:

  • Moving from assembly language to higher-level languages such as COBOL and FORTRAN
  • Structured programming, which avoided GOTO statements and used IF/ELSE and DO/WHILE statements
  • Object-oriented programming (OOP) which enabled encapsulation and composition
  • Run-time checks on memory access
  • Virtual machines (Java's JVM and .NET's CLR) which allowed more run-time checks and enhanced debugging

Each of these changes was adopted over objections about performance. And while the older technologies remained, they became niche technologies. We still have assembly language, procedural (non-OOP) programming, and systems without virtual machines. But those technologies are used in a small minority of projects. The technologies that offer convenience for the programmer became mainstream.

Floating-point arithmetic costs programmer time. Code that uses floating-point types must be carefully designed for proper operation, and then carefully reviewed and tested. Any changes to floating-point code must be carefully reviewed. All of these reviews must be done by people familiar with the limitations of floating-point arithmetic.

Not only must the designers, programmers, and reviewers be familiar with the limitations of floating-point arithmetic, they must be able to explain them to other folks involved on the project, people who may be unfamiliar with floating-point arithmetic.

When working with floating-point arithmetic, programmers are put in the position of apologizing for the failings of the computer. These failings are not easily understood; any school-age child knows that 0.1 + 0.2 - 0.3 is equal to zero, not some small amount close to zero.
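
The child's sum takes only a short session to check. Here is a minimal Python sketch, using the standard decimal module as the arbitrary-precision stand-in:

```python
from decimal import Decimal

# Binary floating-point: 0.1, 0.2, and 0.3 are all stored inexactly
float_result = 0.1 + 0.2 - 0.3
print(float_result)  # a tiny nonzero value, not 0

# Arbitrary-precision decimal arithmetic gets the schoolbook answer
decimal_result = Decimal("0.1") + Decimal("0.2") - Decimal("0.3")
print(decimal_result)  # 0.0
```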

I believe that it is this constant need to explain the failings of floating-point arithmetic that will be its undoing. Programmers will eventually start using arbitrary-precision arithmetic, if for no other reason than to get them out of the explanations of rounding errors. And for most applications, the extra computing time will be insignificant.

Floating-point, like other fallen technologies, will remain a niche skill. But it will be out of the mainstream. The only question is when.


Tuesday, February 10, 2015

The Egyptians were better at fractions than our floating-point standard

When it comes to fractions, the Egyptians were just as good, and possibly better, than our computing standards for floating-point math.

Our floating-point arithmetic has a lot in common with the Egyptian methods. Both build numbers as a series of fractions. Not just any fractions, but fractions in the form "1/n". Both limit the series to unique values; you cannot repeat a value in the series. When writing the value 3/5, the Egyptians would write not "1/5 + 1/5 + 1/5" but instead "1/2 + 1/10".

Our floating-point algorithms enforce unique values by the nature of binary representations. A number stored in floating-point format has a number of binary places, values to the right of the binary point. Those values are 1/2, 1/4, 1/8, 1/16, ... the negative powers of two, and one bit is allocated to each. The bit for a value may be one (on, indicating that the value is part of the series) or zero (off; the value is not part of the series) but it may be used only once.

The Egyptian system was more flexible than our floating-point system. The values for our floating-point system are restricted to the negative powers of two; the Egyptian system allowed any fraction in the form "1/n". Thus the Egyptians could have values 1/5, 1/7, and 1/10.

Our floating-point system does not have values for 1/5, 1/7, or 1/10. Instead, a sum of the permitted values must be used in their place. And here is where things get difficult for the floating-point system.

As odd as it may appear, the value 1/10 cannot be represented exactly in our floating-point representation. Let's look at how a value is constructed:

We must build the value for 0.10 with a set of binary fractions, with at most one of each. The possible values are:

1/2     0.5
1/4     0.25
1/8     0.125
1/16    0.0625
1/32    0.03125
1/64    0.015625
1/128   0.0078125
1/256   0.00390625

The first three values are larger than 0.10, so those bits must be zero. Therefore, the first part of our binary representation must be "0.000" (an initial zero and then three zero bits for 1/2, 1/4, and 1/8).

The value 0.0625 can be used, and in fact must be used, so our value becomes "0.0001". But we're not at 1/10, so we must add more values.

The value 0.03125 can be added without pushing us over the desired value, so we add it. Now our series is 1/16 + 1/32, or 0.00011.

But 1/16 + 1/32 has the value 0.09375, which is not 1/10. We must add more values.

The value 1/64 would push us over 1/10, as would 1/128, so we do not add either. We can add 1/256, giving us 1/16 + 1/32 + 1/256 (or 0.00011001 in binary, or 0.09765625).
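
The steps above are a greedy algorithm, and they can be sketched in Python (a hypothetical helper; the standard fractions module keeps the arithmetic exact):

```python
from fractions import Fraction

def greedy_binary_fractions(x, max_terms):
    """Greedily collect distinct fractions 1/2, 1/4, 1/8, ... whose
    sum approaches x, as in the walkthrough above."""
    terms = []
    remainder = Fraction(x)
    power = Fraction(1, 2)
    while remainder and len(terms) < max_terms:
        if power <= remainder:
            terms.append(power)
            remainder -= power
        power /= 2
    return terms, remainder

terms, remainder = greedy_binary_fractions(Fraction(1, 10), max_terms=8)
print(terms)      # begins 1/16, 1/32, 1/256, ... as in the text
print(remainder)  # still not zero, and it never will be
```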

You may be getting a bit tired at this point. How many more iterations of this algorithm must we endure? The answer is shocking and daunting: an infinite number!

It turns out that the value 1/10 cannot be represented, exactly, in our floating-point system of binary fractions. (Just as the value 1/3 cannot be represented in our decimal system of fractions.)
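
Python can confirm this directly: the fractions module recovers the exact rational value a float actually stores, and for the literal 0.1 that value is not 1/10:

```python
from fractions import Fraction

stored = Fraction(0.1)  # the exact value kept in the float
print(stored)                     # a ratio of two large integers
print(stored == Fraction(1, 10))  # False
```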

The popular programming languages C, C++, Java, and C# all use floating-point numbers with binary fractions. They all represent the value 1/10 inexactly. In any of those languages, the equivalent of the following sequence (shown here as a complete C program) will show the error:

#include <stdio.h>

int main(void)
{
    double a = 0.1;
    double b = 0.2;
    double c = 0.3;
    double d = a + b - c;
    printf("%.17g\n", d);  /* prints a small nonzero value, not 0 */
    return 0;
}

Our modern systems cannot handle 1/10 exactly.

Yet the Egyptians had the value 1/10.
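
Modern languages can match them when asked. Python's fractions module, to pick one example, stores 1/10 as an exact ratio rather than a sum of binary fractions:

```python
from fractions import Fraction

one_tenth = Fraction(1, 10)
print(one_tenth + Fraction(1, 5))  # 3/10, exactly
print(one_tenth * 10 == 1)         # True
```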

Sunday, February 8, 2015

Floating-point is finite

The amazing thing about the human mind is that it can conceive of infinite things, including numbers. In this area the human brain far outstrips our computers.

First, a short review of numbers, as seen from the computer's point of view. We have two important points:

Modern computers are binary, which means that they use only the digits one and zero when calculating. While they can display numbers in decimal for us humans, they store numbers and perform operations on binary values.

Computers use numbers with a finite number of digits. We humans can think of numbers with any number of digits, starting with '1' and going up as high as we like. Computers are different -- they can go up to a certain number, and then they stop. (The exact number depends on the computer design and the representation used. More on that later.)

For example, a computer designed for 8-bit numbers can handle numbers, in the computer's internal format, from 00000000 to 11111111. That is only 256 possible combinations, and we typically assign the values 0 to 255 to the internal ones:

00000000 = 0
00000001 = 1
00000010 = 2
00000011 = 3
...
11111110 = 254
11111111 = 255

That's not a lot of numbers, and even the early PCs allowed for more. How do we get more numbers? By increasing the number of bits (binary digits), just as we humans get more numbers by adding digits to our decimal numbers.
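
A quick Python sketch of that finiteness, masking results to eight bits the way an 8-bit register would:

```python
MASK = 0xFF  # keep only the low eight bits, as an 8-bit register does

print(255 & MASK)        # 255, the largest 8-bit value
print((255 + 1) & MASK)  # 0: one more than the maximum wraps to zero
```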

Sixteen bits allows for 65,536 combinations:

00000000 00000000 = 0
00000000 00000001 = 1
00000000 00000010 = 2
00000000 00000011 = 3
...
11111111 11111110 = 65534
11111111 11111111 = 65535

Observant readers will note that the highest value (255 for 8-bit and 65535 for 16-bit) is one less than the number of combinations. That's because in each case, zero takes one combination. This is common in computer numeric systems. (There are a few systems that don't allow a slot for zero, but they are rare. Zero is rather useful.)

Modern computer systems allow for 32-bit and 64-bit numbers, which allow for greater numbers of numbers. Thirty-two bit systems can hold values in the range 0 to 4,294,967,295 and 64-bit systems can range from 0 to 18,446,744,073,709,551,615. Surely those are enough?

It turns out that they are not.

One obvious deficiency is the lack of negative numbers. We need negative numbers, because sometimes we deal with quantities that are less than zero.

Computer architects long ago decided on a hack for negative numbers, a hack that was so good that we still use it today. The scheme, called two's complement, uses the topmost bit to indicate the sign and assigns the remaining bit patterns so that ordinary addition still works. Thus, for 8-bit numbers:

10000000 = -128
10000001 = -127
...
11111110 = -2
11111111 = -1
00000000 = 0
00000001 = 1
00000010 = 2
00000011 = 3
...
01111110 = 126
01111111 = 127

(The same applies to 16-bit, 32-bit, and 64-bit numbers.)
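
The table is easy to check in Python: the struct module can write a bit pattern as an unsigned byte and read it back as a signed one (a small hypothetical helper):

```python
import struct

def as_signed_8bit(raw):
    """Reinterpret an 8-bit pattern (0..255) as a two's-complement
    signed value."""
    return struct.unpack("b", struct.pack("B", raw))[0]

print(as_signed_8bit(0b10000000))  # -128
print(as_signed_8bit(0b11111111))  # -1
print(as_signed_8bit(0b01111111))  # 127
```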

Now we have negative numbers! Whee!

Yet before we celebrate for too long, we notice that our maximum value has decreased. Instead of 255 (for 8-bit numbers), our maximum value is 127, roughly half of our previous maximum. (There are similar reductions for other bit sizes.)

We gained some negative numbers, at the cost of some positive numbers. The "number of numbers" is still 256, but now half are used for positive values and half for negative values.

Computing is full of trade-offs like this.

So we have the following:

bit size    unsigned range                          signed range
8-bit       0 to 255                                -128 to 127
16-bit      0 to 65,535                             -32,768 to 32,767
32-bit      0 to 4,294,967,295                      -2,147,483,648 to 2,147,483,647
64-bit      0 to 18,446,744,073,709,551,615         -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Well, perhaps the loss of values isn't so bad. If we need a value greater than 127, we can use a 16-bit number, or a 32-bit number.

Do we have enough numbers? Yes, for most computations, with one condition. We have no fractional values here. We have no 1/2, no 2/3, no 1/10. Integers are sufficient for many calculations, but not every calculation.

How can we represent fractions? Or if not fractions, decimals?

One solution is to use an implied decimal point. It's implied because we do not store it, but imply its existence. I'll write that as '[.]'

Let's start with our friend the 8-bit numbers, and give them a single decimal place:

1000000[.]0 = -64.0
1000000[.]1 = -63.5
...
1111111[.]0 = -1.0
1111111[.]1 = -0.5
0000000[.]0 = 0.0
0000000[.]1 = 0.5
0000001[.]0 = 1.0
0000001[.]1 = 1.5
...
0111111[.]0 = 63.0
0111111[.]1 = 63.5

The first observation is that our fractions are not the decimal fractions (0.1, 0.2, 0.3...) we use with decimal numbers but are limited to either '.0' or '.5'. That is due to the nature of the binary system, which uses powers of 2, not powers of 10. In the decimal system, our digits to the left of the decimal point are the units (or 'ones'), tens, hundreds, thousands, ... but in binary our digits are units, twos, fours, eights, ... etc. To the right of the decimal point, the decimal system uses tenths (10^-1), hundredths (10^-2), and thousandths (10^-3), that is, tens to negative powers. The binary system uses twos to negative powers: 2^-1 (or one half), 2^-2 (one quarter), 2^-3 (one eighth).

One decimal place in the binary system gets us halves, but nothing more precise. If we want more precision, we must use more decimal places.

But wait! you cry. Using a digit for decimals cost us numbers! Yes, using one digit for a decimal has reduced our range from [-128, 127] to [-64.0, 63.5]. We still have the same number of numbers (256) but now we are assigning some to fractional values.

You may console yourself, thinking that we can use larger bit sizes for numbers, and indeed we can. But the "roughness" of our fractions (either .0 or .5, nothing in between) may bother you. What if we want a fraction of 0.25? The answer, of course, is more bits for the decimal.

100000[.]00 = -32.00
100000[.]01 = -31.75
...
111111[.]10 = -0.50
111111[.]11 = -0.25
000000[.]00 = 0.00
000000[.]01 = 0.25
000000[.]10 = 0.50
000000[.]11 = 0.75
000001[.]00 = 1.00
...
011111[.]10 = 31.50
011111[.]11 = 31.75

Now we have fractional parts of 0.25, 0.50, and 0.75. (At the cost of our range, which has shrunk to [-32.00, 31.75]. Larger bit sizes can still help with that.)

The interesting aspect of our fractions is this: a fraction is the sum of inverted powers of two. The value 0.25 is 2^-2, or 1/4. The value 0.5 is 2^-1, or 1/2. The value 0.75 is 2^-1 + 2^-2, or 1/2 + 1/4.

Were we to increase the number of digits in our implied decimal, we would also use 1/8, 1/16, 1/32, etc. We would not use values such as 1/10 or 1/3, because they are not inverse powers of two. (Only powers of two, mind you!)

Also, when representing a number, we can use a power of two only once. For plain integers, the value 10 must be represented as 2 + 8; it cannot be represented by 6 + 4 (6 is not a power of two) nor by 2 + 2 + 2 + 2 + 2 (we can use a value only once).
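
Python's bin() makes the rule visible; each set bit of an integer names one distinct power of two:

```python
n = 10
print(bin(n))  # 0b1010: bits set at the 2 and 8 positions

# Collect the powers of two named by the set bits
powers = [1 << i for i in range(n.bit_length()) if n & (1 << i)]
print(powers)  # [2, 8], each power used at most once
```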

For fractions, the same rules apply. The fraction 3/4 must be represented by 1/2 + 1/4, not 1/4 + 1/4 + 1/4.

These rules impose limits on the values we can represent. In our example of 8-bit numbers and two implied decimal places, we cannot represent the values 100, -2000, or 1.125. The first two are outside of our allowed range, and the third, while within the range, cannot be represented. We cannot add any combination of 1/2 and 1/4 to get 1/8 (the .125 portion of the value).
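
A hypothetical Python sketch of this 8-bit, two-place format shows both limits: out-of-range values are rejected, and 1.125 silently lands on a neighboring representable value:

```python
FRAC_BITS = 2
SCALE = 1 << FRAC_BITS  # 4: two implied binary places

def encode(x):
    """Round x to the nearest representable value and store it
    as a scaled 8-bit signed integer."""
    raw = round(x * SCALE)
    if not -128 <= raw <= 127:
        raise OverflowError(f"{x} is outside the 8-bit fixed-point range")
    return raw

def decode(raw):
    return raw / SCALE

print(decode(encode(0.75)))   # 0.75: representable, round-trips exactly
print(decode(encode(1.125)))  # not 1.125: a nearby representable value
```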

This is a hard lesson. Our representations of numbers, integer and floating point, are not infinite. We humans can always think of more numbers.

Astute readers will note that I have described not floating-point arithmetic, but fixed-point arithmetic. The observation is correct: I have limited my discussion to fixed-point arithmetic. Yet the concepts are the same for floating-point arithmetic. It, too, has limits on the numbers it can represent.