The Value of Fixed-Point Calculations

Back when I was working on my thesis, there was a very good reason I used fixed-point calculations in the DSP core that was the primary technology at work in it: so I didn’t have to explain floating point to my committee.

But there are a lot of things to recommend them, the most obvious one being speed.  Modern computers create the illusion of doing everything instantaneously, until they don’t.  In CS classes, they tell you “integer math is faster than floating-point, so try to do things with integers or fixed-point” and often leave it at that.  If you’re like me, you wonder “so… what’s fixed-point?”

The short answer is, it’s integers, but you divide everything by some power of two.

Something like 10 years ago, I discovered that binary could have a decimal point in it.  For some reason, up until then, I had assumed other number bases like hexadecimal and binary didn’t come with decimal points, and could only be integers.  I was also fuzzy on how they could be negative, because this two’s complement business was something I wouldn’t really nail down firmly in my head until I had to teach it, but that’s a whole other blog post.

But of course, binary decimal points work just the same way as decimal decimal points, which is to say for each step to the right of the decimal point, you shrink in size by a factor equal to the base.  So just as in base ten you go 1/10, 1/100, 1/1000 and so on, in base two you go 1/2, 1/4, 1/8, etc.

So the number 1010.1010 is 8 + 2 + 1/2 + 1/8, or 10.625.  If I’m specifying unsigned 8-bit numbers, I could “fix” the decimal point at the midpoint of the byte and have a number range from 0 to 15.9375, with a granularity of 1/16, or 0.0625.
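To make that concrete, here’s a minimal C sketch of that unsigned Q4.4 format (the type and helper names are mine, just for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* Unsigned Q4.4: 4 integer bits, 4 fractional bits, stored in one byte. */
#define FRAC_BITS 4
#define ONE       (1 << FRAC_BITS)   /* 1.0 in Q4.4 is the integer 16 */

typedef uint8_t q4_4_t;

/* Convert to a double, purely for printing and checking our work. */
static double q4_4_to_double(q4_4_t x) {
    return (double)x / ONE;
}

int main(void) {
    q4_4_t x = 0xAA;                      /* 1010.1010 in binary        */
    printf("%f\n", q4_4_to_double(x));    /* prints 10.625000           */
    printf("%f\n", q4_4_to_double(0xFF)); /* largest value: 15.937500   */
    return 0;
}
```

The stored byte is just an integer; the only thing that makes it “fixed point” is our agreement to divide by 16 whenever we interpret it.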

(Incidentally, binary pi is a thing: 11.001001000011111101…)

Adding together fixed-point numbers works exactly like adding integers, and multiplying is only a little different: if I multiply 4 * .75 in our fixed-point format above, that’s 0100.0000 * 0000.1100, which, if you multiply them as though they were plain integers, gives you 64*12 = 768.  That result now carries the scale factor twice, so it isn’t in our format anymore, but you don’t need to go looking for a hardware fix: you can just shift the output back down.  Integer ALUs have readily available bit-shifting facilities, and in ARM they’re free, thanks to a neat trick in the hardware (the barrel shifter).  So if I shift the output of every multiplication right by 4 bits, the number of fractional bits, I land back in the original format and get correct multiplications.

In our example above this works out nicely: 768 >> 4 = 48, which is 0011.0000 in our format, or 3.  As a sanity check, the product carried a scale factor of 16 * 16 = 256, and 768/256 = 3 as well.
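Continuing the hypothetical Q4.4 sketch from above, the whole multiply-and-shift dance fits in a few lines of C:

```c
#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 4               /* fractional bits in our Q4.4 format */

typedef uint8_t q4_4_t;

/* Multiply two Q4.4 values. Treated as plain integers, the product carries
   the scale factor twice (it's effectively Q8.8), so shift right by
   FRAC_BITS to land back in Q4.4. Widen to 16 bits first so the
   intermediate product can't overflow the byte. */
static q4_4_t q4_4_mul(q4_4_t a, q4_4_t b) {
    uint16_t product = (uint16_t)a * (uint16_t)b;
    return (q4_4_t)(product >> FRAC_BITS);
}

int main(void) {
    q4_4_t four           = 0x40;  /* 0100.0000 = 4.0  */
    q4_4_t three_quarters = 0x0C;  /* 0000.1100 = 0.75 */
    q4_4_t result = q4_4_mul(four, three_quarters);
    printf("0x%02X = %f\n", result, result / 16.0);  /* 0x30 = 3.000000 */
    return 0;
}
```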

That sounds like a lot of work, and it’s not trivial, but remember that the alternative is keeping track of and correctly manipulating what amounts to binary scientific notation.  Scientific notation isn’t enormously complex, but the hardware to do it is usually going to be slower than an equivalent integer calculation.  It will also be larger and use more power.

It will also have this weird, glitchy behavior where big numbers are less accurate, in absolute terms, than small numbers.  This is because while the exponent can change, the number of significant figures in floating point is fixed.  In decimal, you can think of it like this: with three significant figures, I know the number 154 to an accuracy of +/- 1, but the number 1.54 to +/- 0.01.  If I have lots of numbers of different sizes, this can cause a lot of weird, mathematically complex behaviors.
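You can watch this happen with single-precision floats in C: the same 0.5 survives being added to a small number but vanishes entirely into a large one, because the spacing between representable floats near 100,000,000 is 8.

```c
#include <stdio.h>

int main(void) {
    /* Floats have a fixed number of significand bits, so the absolute
       spacing between representable values grows with magnitude. Near
       100.0 the spacing is far smaller than 0.5; near 100,000,000.0 it
       is 8.0, so the 0.5 disappears entirely. */
    float small = (100.0f       + 0.5f) - 100.0f;        /* 0.5 */
    float big   = (100000000.0f + 0.5f) - 100000000.0f;  /* 0.0 */
    printf("%f %f\n", small, big);
    return 0;
}
```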

The above is the real reason I didn’t want to explain floating point to my committee.  Floating-point processing isn’t super hard, and it’s not super hard to justify building a floating-point unit into a processor, as they’re very useful.  But if you’re doing DSP mathematics with them, you want to be very sure you have a good grip on how your numbers work.  And it is honestly pretty tricky to navigate the signal processing implications of floating point.

But fixed point is pretty easy, or at least, well understood.  In DSP, there is a large, deep body of work on the impact of quantization noise, which is exactly what happens to a number when it’s mapped from a continuous range onto a fixed number of intervals.  If your calculations are done using fixed point, you know that every operation exists within the constraints of quantization error and digital truncation, so the normal rules of DSP apply.
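As a rough sketch of what that quantization step looks like (reusing the hypothetical Q4.4 format from earlier, with names of my own), rounding a continuous value onto the grid costs you at most half a step:

```c
#include <math.h>
#include <stdint.h>

#define FRAC_BITS 4
#define ONE       (1 << FRAC_BITS)

typedef uint8_t q4_4_t;

/* Quantize an arbitrary real value into unsigned Q4.4 by rounding to the
   nearest representable step. For any in-range input, the error is bounded
   by half a step (+/- 1/32), which is what makes quantization noise so
   tractable to analyze. */
static q4_4_t q4_4_from_double(double v) {
    long n = lround(v * ONE);
    if (n < 0)   n = 0;        /* clamp to the unsigned Q4.4 range */
    if (n > 255) n = 255;
    return (q4_4_t)n;
}
```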

So not only are fixed-point calculations faster, but they use physically smaller hardware, and they keep you grounded in a world where your numbers stay equally accurate through all the transformations you might do to them.

There are definitely some weird things that can happen, and later I might show them off, but at least with fixed-point calculations you know you’re dealing with one of the best-understood sources of nonlinearity in all of DSP, rather than something that is generally avoided more or less exactly because it’s complex and poorly understood.