# Let's Design and Build a (mostly) Digital Theremin!

(^^^: yeah, one of my co-workers used to call me "perfessor" - though I don't feel my mental meanderings are really at that level...)

**Floats...**

Still thinking about floats, though it's winding down. Had to really look into this feeling I was getting that there was some optimal representation for both storage and calculation, but I now believe that feeling is groundless. In fact, it is the differing needs of storage and calculation that make a single optimally efficient float representation for both an unlikely prospect.

If you implement floats in software you see a lot of the same stuff going on almost regardless of the function. Zeros are denorms, so there's the checking for input zeros and the handling of those special cases. If there are two input values there may be some comparisons between them to pick the right path. Then there's the central function itself performed via iteration, polynomial, etc. The output often needs to be normalized, and if output zero is possible it must be dealt with somehow (conceptually quite a tough nut in a fundamentally exponential representation). Float packing and unpacking overhead can take as much time as the algorithm itself, really slowing things down, so you either have special hardware do pretty much everything or nothing. But even though the steps for the various functions are very similar, they aren't completely similar, and this really complicates the floating point hardware.
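
A minimal sketch of those recurring steps, using multiplication as the example. Everything here is an assumption for illustration rather than the actual Hive code: the names `fmul`, `to_real`, and `MAG_BITS`, the 32 bit magnitude, the convention that a normalized magnitude has its top bit set, and zero stored as a zero magnitude. Exponent overflow handling is left out.

```python
MAG_BITS = 32                 # assumed magnitude width (hypothetical)
MSB = 1 << (MAG_BITS - 1)     # top bit, set in every normalized magnitude

def to_real(f):
    """Convert an unpacked (sign, exp, mag) triple to a Python float for checking.

    Assumed convention: value = (-1)**sign * (mag / 2**MAG_BITS) * 2**exp.
    """
    sign, exp, mag = f
    return (-1.0) ** sign * mag * 2.0 ** (exp - MAG_BITS)

def fmul(a, b):
    """Multiply two unpacked floats; the result magnitude is truncated."""
    sa, ea, ma = a
    sb, eb, mb = b

    # Step 1: input zero check (zero is a special case, not a normalized value).
    if ma == 0 or mb == 0:
        return (0, 0, 0)

    # Step 2: the central function, done on sign and magnitude separately.
    sign = sa ^ sb
    exp = ea + eb
    prod = ma * mb            # up to 2 * MAG_BITS bits wide

    # Step 3: normalize. The product's top bit is at position 2N-1 or 2N-2.
    if prod >> (2 * MAG_BITS - 1):
        mag = prod >> MAG_BITS
    else:
        mag = prod >> (MAG_BITS - 1)
        exp -= 1
    return (sign, exp, mag)
```

With this convention 1.0 is `(0, 1, MSB)`, so `fmul((0, 1, MSB), (0, 1, MSB))` gives back the same triple.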

Floats are nasty but almost impossible to avoid. And I'm in the position where I could actually add the hardware if I wanted to. The fact that an unpacked float is composed of three values (sign, exponent, magnitude) makes any ALU or other specialized hardware that operates on them slow and problematic from an I/O standpoint. For example, adding two unpacked floats via hardware would require 6 writes and 3 reads, or 9 cycles minimum - whereas the software approach takes ~19 cycles maximum.

So I believe the best software solution is to employ a float representation that is primarily efficient when it comes to calculation, and pack/unpack them to/from memory if space is a problem.
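
As a rough illustration of the pack/unpack idea, here is a sketch that uses IEEE 754 single precision as the packed storage format and a (sign, exponent, magnitude) triple as the calculation format. The storage format actually used on Hive isn't stated here, so the IEEE choice, the function names `unpack_f32` and `pack_f32`, and the 32 bit magnitude are all assumptions. Denormals are flushed to zero, and infinities and NaNs are not handled.

```python
import struct

def unpack_f32(x):
    """Unpack a float into (sign, exp, mag) with value = (mag / 2**32) * 2**exp."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased == 0:
        # Zero (and denormals, flushed in this sketch) becomes a zero magnitude.
        return (0, 0, 0)
    mag = (1 << 31) | (frac << 8)   # make the hidden bit explicit at the top
    exp = biased - 126              # 1.f * 2**(biased-127) == 0.1f * 2**(biased-126)
    return (sign, exp, mag)

def pack_f32(f):
    """Pack (sign, exp, mag) back to a float; low magnitude bits are truncated."""
    sign, exp, mag = f
    if mag == 0:
        return 0.0
    bits = (sign << 31) | ((exp + 126) << 23) | ((mag >> 8) & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

The calculation routines then only ever see the unpacked triple, and the packing cost is paid only at the memory boundary.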

Almost all algorithms operate on magnitude & sign separately, and this includes those which you might imagine would work better with natural 2's complement I/O, such as addition and multiplication. For instance, the addition of two non-negative numbers can overflow but not underflow, but subtraction is the opposite. At the start of all this I recall spending weeks looking for a division algorithm that would natively work on 2's complement numbers, only to come up short. I thought those who recommended the sign & magnitude approach (convert to unsigned, divide, restore the sign) were possibly just being lazy, but that approach is actually most efficient and consistent with most (possibly all) other functional algorithms.
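
The sign & magnitude division approach can be sketched as follows. The unsigned core here is a plain restoring shift-and-subtract loop, picked only for illustration (it isn't necessarily the algorithm used on Hive), and the names `udiv` and `sdiv` are hypothetical.

```python
def udiv(n, d, bits=32):
    """Unsigned restoring division: one quotient bit per iteration."""
    q = 0
    r = 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        if r >= d:                      # trial subtract succeeds
            r -= d
            q |= 1 << i
    return q

def sdiv(a, b, bits=32):
    """Signed division: convert to unsigned, divide, restore the sign."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    negative = (a < 0) != (b < 0)       # result sign from the input signs
    q = udiv(abs(a), abs(b), bits)
    return -q if negative else q
```

Note that this truncates toward zero, so `sdiv(-7, 2)` is `-3`, whereas Python's own `-7 // 2` floors to `-4`.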

Anyway, one thing I believe I've resolved (for my stuff) is the representation of zero. Doing it via a large negative exponent is the most correct mathematically, but I'm going with a zero magnitude here. The zero sign is out because it's redundant, and redundancy just means one more thing to check and assign special case values to. Several days of work for a minor tweak; I believe that's deep in diminishing returns territory. I must say though that this exercise has really expanded my fundamental understanding of numerical representation. E.g.: one's complement is a very natural way to pack the sign and magnitude, but it's a poor basis on which to approach the calculation. (Kind of hard to believe early computers actually used it for integer math; the "wrap around carry" must be rather slow in hardware.)

=========

To more efficiently support float exponential calculations, I looked into adding a variable width signed saturation opcode to the Hive ALU. A fixed width saturation is doable in two clocks, but variable would probably need three clocks and would consume a fair amount of hardware. Fixed could either be 6 bit or 16 bit. Saturating 6 bit values with fixed 16 bit hardware would require a left shift of 10, saturate, right signed shift of 10. I'm kind of leaning toward the 6 bit (to limit shift distance). I also looked into removing the variable width sign extension to make room for an immediate saturation, but realized it can be used to detect saturation. And I also looked into hardware limiting the shift distances, but that's slow and not a panacea for all cases. Hmmm...
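
The shift, saturate, shift sequence for getting a 6 bit saturation out of fixed 16 bit hardware can be checked with a quick sketch. The function names are hypothetical, and it assumes the value sits in a register wider than 16 bits so the left shift by 10 loses nothing before the saturate step.

```python
def sat_signed(x, width):
    """Clamp x to the signed two's complement range of the given width."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, x))

def sat6_via_sat16(x):
    """6 bit signed saturation built from a fixed 16 bit saturate."""
    shifted = x << 10                 # move the 6 bit field to the top of 16 bits
    clamped = sat_signed(shifted, 16) # the fixed width hardware operation
    return clamped >> 10              # Python's >> on ints is a signed shift
```

For example, `sat6_via_sat16(40)` clamps to `31` and `sat6_via_sat16(-40)` clamps to `-32`, while in-range values pass through unchanged.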

=========

Ran across an interesting 2014 article yesterday by Walter Bright (the author of D) on writing your own language: LINK.

Oooh, Mr. Bright is an interesting guy! *"I'll note here that working on stuff that I needed has fared quite a bit better than working on stuff that I was told others need."* (link) Almost nothing I've worked on for employers has even made it to market - not the way to spend one's life.

Dew said: *“Almost nothing I've worked on for employers has even made it to market - not the way to spend one's life.”*

You are doing it again, and I am not saying this to be mean. If ever I pick on you it is as my little brother. You have great knowledge, and what frustrates me is that I was rarely able to tap into it.

I did assembly in Z80 and 8080 processors back in the day because processors were slow and memory was expensive. Someone can know how to use every opcode available but if they have poor logic their skills will not amount to much.

As you may remember I have been writing VBA code in Excel the last 18 months. You kind of put it down like some people put down Walmart. Everything has its purpose.

I am going to send you a private IM.

Christopher

*"I did assembly in Z80 and 8080 processors back in the day because processors were slow and memory was expensive. Someone can know how to use every opcode available but if they have poor logic their skills will not amount to much." - Christopher*

I agree completely. Digital/logical/arithmetic design is a deceptively deep art. It's even trickier when you get to specify the opcodes.

*"As you may remember I have been writing VBA code in Excel the last 18 months. You kind of put it down like some people put down Walmart. Everything has its purpose."*

VBA "works" but is clunky, verbose, and not fun at all (for me) to code in. There are much "better" scripting languages out there, but MS picked VBA for whatever dull corporate reasons they do everything. The suits are always killing any fun there is to be had.

And Walmart sucks! ;-)

So I'm farting around with the floating point functions generated by the "MegaWizard Plug-In Manager" in the FPGA tool (Quartus 9.1sp2). This spits out canned, pre-optimized Verilog components, saving us the hard work of reinventing the wheel. One can adjust the depth of the pipelining registers to trade latency for speed. To get the floating point add up to the processor core frequency (~200MHz) the logic has to be 13 clocks deep, which consumes 994 LEs (logic elements) and runs at 212.68MHz. The floating point multiply must be at least 10 clocks deep, which consumes 364 LEs and runs at 224.57MHz.

The add needing ~3x the LEs of the multiply is partially a bookkeeping thing, as most of the "action" in the multiply is going on in the hardware multipliers, which are attached to the block RAM elements rather than the LEs. But it's also partially the reality of addition being more involved in terms of float setup and teardown (one of the inputs must be shifted pre-add, and the add itself may actually be a subtract). The core is currently < 3k LEs, so adding just the floating point add would increase this by 33%. And the floating point multiply in software only takes 8 clocks max, which isn't terribly onerous, so I'm not too inclined to add that either.
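
To make the setup and teardown in addition concrete, here is a sketch of an add on unpacked (sign, exponent, magnitude) triples. It is not the MegaWizard logic or the Hive software routine; the name `fadd`, the 32 bit magnitude, the top-bit-set normalization, and zero as a zero magnitude are assumptions for illustration. Bits shifted out during alignment are simply truncated.

```python
MAG_BITS = 32   # assumed magnitude width (hypothetical)

def fadd(a, b):
    """Add unpacked floats where value = (-1)**sign * (mag / 2**32) * 2**exp."""
    sa, ea, ma = a
    sb, eb, mb = b

    # Input zero special cases.
    if ma == 0:
        return b
    if mb == 0:
        return a

    # Setup: order the inputs so a has the larger absolute value.
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma

    # Setup: align the smaller input with a right shift before the add.
    mb >>= ea - eb
    exp = ea

    if sa == sb:
        # Same signs: a true add, which can carry out of the top bit.
        mag = ma + mb
        if mag >> MAG_BITS:
            mag >>= 1
            exp += 1
    else:
        # Differing signs: the "add" is really a subtract, which can cancel.
        mag = ma - mb
        if mag == 0:
            return (0, 0, 0)
        # Teardown: renormalize by shifting the top set bit back into place.
        shift = MAG_BITS - mag.bit_length()
        mag <<= shift
        exp -= shift
    return (sa, exp, mag)
```

Compared with a multiply, there is an extra compare and swap, a variable distance shift on the way in, and a second variable distance shift on the way out.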

I don't think I'm at the point of increasing the Hive core pipeline depth to accommodate this kind of floating point hardware. One could, I suppose, supply the floating point inputs via the stacks and read the results via the register set interface. This would give a bit of wrap-around in the core pipe and another clock or two in which to perform the calculations. No easy answers here, you can see why the floating point units in modern processor ALUs are independent of the integer units.
