Wednesday, November 11, 2020

Flipping bits

Apple has introduced a new line of Mac computers that use Apple's new M1 system-on-a-chip. Apple made several claims about performance, but those claims were vague -- at least in terms of measured performance. (The claims were in the form of "up to two times faster", "up to five times faster", and "the fastest MacBook yet".)

Apple's M1 chip contains a processor based on the ARM architecture, which is quite different from Intel's x86 architecture. How does one compare them?

One of the oldest comparisons of processors is MIPS (millions of instructions per second). Dating back to the mainframe age, it is simple in concept: count the number of instructions executed in each second of processing. But when comparing different architectures, an instruction-to-instruction comparison is not useful. The instruction sets are different, and a computation that takes one instruction on one processor might take three or four instructions on another.

A later metric, FLOPS (floating-point operations per second), is somewhat better when the two systems are performing tasks that involve a lot of numeric processing. This metric focuses on the instructions used for processing numeric values, specifically floating-point values, and it counts operations rather than discrete instructions. But its focus is its weakness: tasks that use integer arithmetic and fixed-point data representations aren't usefully measured with FLOPS.

Thinking about the differences between Intel and ARM processors, perhaps we need a different metric, one that is robust for different architectures and different types of processing. And I have thought of one, although it may not be practical.

My idea for measuring performance is to execute a task or function and count the number of bits that are changed (or "flipped") during the process.

The advantage of this metric is that it is neutral to architectures and also neutral to the type of processing.  MIPS doesn't measure different instruction sets well and FLOPS omits integer operations. Bit-flipping, in contrast, works with all architectures and all data types. (I'll get to the downside of the bit-flipping metric later.)

The idea of bit flipping is simple: count each bit that changes state, either from 0 to 1 or from 1 to 0. If a register holds 0x0001 and is cleared, then one bit is changed (from 1 to 0). If a register holds 0x001f and is cleared, then five bits are changed. (If a register holds 0x0000 and it is cleared, then no bits are changed.)
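
(For a concrete illustration, here is a minimal sketch of that counting rule in C, using a hypothetical 16-bit register value; the function names are mine, for illustration only, not anything a processor provides.)

    #include <stdint.h>
    #include <stdio.h>

    /* Count how many bits differ between the old and new value of a
       (hypothetical) 16-bit register: XOR them, then count the 1 bits. */
    static unsigned bits_flipped(uint16_t before, uint16_t after)
    {
        uint16_t diff = (uint16_t)(before ^ after); /* 1 wherever a bit changed */
        unsigned count = 0;
        while (diff) {
            count += diff & 1u;
            diff >>= 1;
        }
        return count;
    }

    int main(void)
    {
        printf("%u\n", bits_flipped(0x0001, 0x0000)); /* clearing 0x0001: 1 */
        printf("%u\n", bits_flipped(0x001f, 0x0000)); /* clearing 0x001f: 5 */
        printf("%u\n", bits_flipped(0x0000, 0x0000)); /* clearing 0x0000: 0 */
        return 0;
    }

Running it prints 1, 5, and 0, matching the three cases above.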

Such a measure has the advantage of working with any type of data. We can use this measure with floating-point data and with integer data. We can even use it with BCD (binary-coded decimal) data. We can use it with text data and with objects.
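
(To show that the measure cares only about bit patterns, here is a small sketch that applies the same rule to the change from 1 to 2, once as a 32-bit integer and once as a 32-bit float. The values and helper name are illustrative assumptions on my part.)

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* XOR two 32-bit patterns and count the 1 bits in the result. */
    static unsigned flips32(uint32_t a, uint32_t b)
    {
        uint32_t d = a ^ b;
        unsigned n = 0;
        while (d) { n += d & 1u; d >>= 1; }
        return n;
    }

    int main(void)
    {
        uint32_t i_before = 1, i_after = 2;

        float f_before = 1.0f, f_after = 2.0f;
        uint32_t fb, fa;
        memcpy(&fb, &f_before, sizeof fb);  /* reinterpret the float's bits */
        memcpy(&fa, &f_after, sizeof fa);

        printf("int   1 -> 2: %u bits flipped\n", flips32(i_before, i_after));
        printf("float 1 -> 2: %u bits flipped\n", flips32(fb, fa));
        return 0;
    }

The point is only that nothing about the metric depends on how the bits are interpreted.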

Notice that this measure does not measure time. A comparison of the number of bits changed during a process is a count, not a time. One processor might be flipping more bits but running faster. A higher flip count does not necessarily mean slower performance.

One might think that such a limitation makes the bit-flip metric useless for performance comparisons. If one is interested in speed, and only speed, then the point is valid. But bit-flipping measures something else: the efficiency of the processor. The same task with a lower bit-flip count is more efficient -- and therefore uses less energy. That may be useful to system designers.

One challenge with the bit-flip metric is deciding which bits count. Even though bits are all the same (a single value of either 1 or 0), different bits are stored in different locations and used for different purposes.

One simple distinction is between memory and registers. Processors store values in memory ("RAM") and also in registers, a small set of locations that are part of the processor. Registers are used for computations and change often; memory is used for storage and changes less frequently. Should bit-flips in memory count the same as bit-flips in registers?

If we agree to count registers and memory differently, then we open the door to other distinctions in bits. Modern processors have caches, usually three levels, which store values too. Caches sit between registers and memory. If registers and memory have different "weights" for bit-flipping, caches should probably have different weights too.

We've been looking at data, but processors also use addresses. In order to read or write a value in memory, the processor must put an address on the bus to tell memory which cell it wants. Those bits change; should we count those changes too?

Addresses have more complications. Modern processors use an advanced form of memory management, which allows for multiple processes. Memory management changes the address that the process requests (a "logical" address) to the real address that holds the value (the "physical" address). Should those bit-flips count? (Keep in mind that physical memory pages are assigned at random, so two separate runs could have different physical pages and therefore different memory addresses and therefore different bit-flip counts.)

After careful consideration, I think the best solution is to count all of the bits. ("All the bits!") Categorize them if you want, report them in groups if you want, assign different weights to the different groups if you want, but count all of them.
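
(One possible shape for that bookkeeping, sketched in C: keep a raw tally per location type and apply weights only when reporting. The categories and weights are my own illustrative assumptions, not anything a real processor exposes.)

    #include <stdint.h>

    /* Count all the bits, but group them by where they flipped. */
    enum location { LOC_REGISTER, LOC_CACHE, LOC_MEMORY, LOC_ADDRESS_BUS, LOC_COUNT };

    static uint64_t flip_tally[LOC_COUNT];

    static void record_flips(enum location where, unsigned flipped_bits)
    {
        flip_tally[where] += flipped_bits;
    }

    /* Weighted summary: e.g. registers weighted more heavily than memory. */
    static double weighted_total(const double weights[LOC_COUNT])
    {
        double total = 0.0;
        for (int i = 0; i < LOC_COUNT; i++)
            total += weights[i] * (double)flip_tally[i];
        return total;
    }

Reporting the raw tallies alongside any weighted summary keeps the "count all the bits" principle intact while still letting different groups be emphasized.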

We could compare similar tasks on Windows and Mac, on Intel and ARM, even Microsoft Word and LibreOffice.

Which brings me to the one drawback to bit-flip counts.

It's hard to do.

Counting bit-flips requires access to memory, registers, and all sorts of innards of the processor. It's not something an application can do, or even a device driver or kernel extension. This kind of observation must occur in the processor; nothing else will do.

To effectively use bit-flipping, we need the cooperation of Intel and ARM (and Apple, which designs the M1 based on the ARM architecture). We also need clear and precise descriptions of the data we want to collect, so that counts on Intel processors measure the same things as counts on ARM processors.

We also need an answer to the meta-question about data collection: counting bit-flips itself changes data values (the counters for the bit-flip operations). Do we count those bit-flips too? (For the record, I am against counting those bit flips. Because if we count the flips in the bit-flip counters, then we have to count the counters for those flips, and then count the flips in those counters, and on and on. Enough is enough!)

Bit-flipping is a nice thought experiment. Implementation requires changes to processors and the cooperation of manufacturers, which is a tall order. I suspect it will remain a thought experiment.

Unless!

Unless... we limit bit-flips to something simple (such as only the CPU registers, and nothing else) and we leverage virtual modes, letting the "real" processor count the bit flips in the "virtual" processor. (Bit-flipping is a simple operation: XOR the "before" and "after" values, and then count the bits in the result.)
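
(As a sketch of what that scaled-down, registers-only version might look like, here is a toy "virtual" register file in C where every write is instrumented to count flips. The structure, register count, and names are assumptions for illustration, not any real hypervisor or emulator interface.)

    #include <stdint.h>

    #define NUM_REGS 16

    /* A hypothetical instrumented register file for an emulated processor. */
    typedef struct {
        uint64_t regs[NUM_REGS];
        uint64_t flips;            /* running bit-flip total */
    } vcpu_t;

    static unsigned popcount64(uint64_t x)
    {
        unsigned n = 0;
        while (x) { n += (unsigned)(x & 1u); x >>= 1; }
        return n;
    }

    /* Every register write passes through here: XOR old and new, count, store. */
    static void set_reg(vcpu_t *cpu, unsigned r, uint64_t value)
    {
        cpu->flips += popcount64(cpu->regs[r] ^ value);
        cpu->regs[r] = value;
    }

An emulator that routed every register update through set_reg() would accumulate the flip count as a side effect of normal execution -- which is exactly the "real processor counts for the virtual one" arrangement described above.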

Such an arrangement might just be possible. It still needs cooperation from the processor designers, but a scaled-down version like this is at least within reach.
