Let's Design and Build a (mostly) Digital Theremin!

Posted: 9/18/2017 2:28:13 AM 1251

From: Northern NJ, USA

Joined: 2/17/2012

32/16/8 Float

I'm going through my various floating point math subroutines and optimizing them for the unpacked float type under consideration. The float consists of a 32 bit unsigned normalized (MSb=1) magnitude with the binary point to the left of the MSb, a 16 bit signed non-offset exponent (power of 2), and an 8 bit sign (1 or -1), all stored in separate processor registers / memory slots. Forcing the output of these subroutines to produce normalized results, plus a very well defined non-norm zero (MAG=0, EXP=-0x8FFF, SGN=1) means a lot of the up-front processing can be removed. I'm hoping that allowing the exponent to "roam free" in the 32 bit space (in calculations between memory storage and retrieval) will reduce the cycles needed to implement the math I need to do, and I have a specific "lim_f" subroutine that reins it back in to 16 bits when desired (7 cycles max). Floating point multiplication now takes 8 cycles max, and floating point addition takes 16 cycles max. I wrote functions that convert floats to / from signed ints, and they take 9 and 8 cycles max, respectively. Onward and upward!
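To make the format concrete, here is a minimal C++ sketch of the unpacked float and a multiply on it. This is not the Hive subroutine: the struct name UFloat, the functions mul_f and to_double, and the choice to truncate rather than round are all assumptions for illustration, and the special zero case is left out.

```cpp
#include <cmath>
#include <cstdint>

// Unpacked float as described above: value = sgn * (mag / 2^32) * 2^exp,
// with mag normalized so its MSb is set (0.5 <= mag / 2^32 < 1).
struct UFloat {
    uint32_t mag;  // unsigned magnitude, binary point left of the MSb
    int32_t  exp;  // power of 2, allowed to exceed 16 bits between stores
    int32_t  sgn;  // +1 or -1
};

// Multiply two normalized, non-zero values (zero handling omitted here).
// The product of two magnitudes in [0.5, 1) lies in [0.25, 1), so at most
// one left shift is needed to renormalize. Low bits are truncated.
UFloat mul_f(UFloat a, UFloat b) {
    uint64_t prod = static_cast<uint64_t>(a.mag) * b.mag;  // 64 bit product
    int32_t exp = a.exp + b.exp;
    if ((prod >> 63) == 0) {  // MSb clear: shift left once, adjust exponent
        prod <<= 1;
        exp -= 1;
    }
    return UFloat{static_cast<uint32_t>(prod >> 32), exp, a.sgn * b.sgn};
}

// Hypothetical helper for checking results against ordinary doubles.
double to_double(UFloat x) {
    return x.sgn * std::ldexp(static_cast<double>(x.mag), x.exp - 32);
}
```

For example, 1.5 is {0xC0000000, 1, 1} and -2.0 is {0x80000000, 2, -1}; their product comes out as {0xC0000000, 2, -1}, which is -3.0. Because inputs are guaranteed normalized, the only fix-up needed is that single conditional shift.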

Posted: 9/24/2017 3:33:26 PM 1252

dewster


Strong Typing

One way programming languages are characterized is by how strongly typed they are. Typing is a way of assigning a label to a constant or variable so that operations on it (arithmetic, logical, storage, printing, etc.) automatically behave a certain way. Common types are (signed) ints, unsigned ints, floats, characters, etc. (One horrible thing about early C is the vagueness of the bit width specified by the common types of int, long, etc. as well as modulo behavior. You often desire a certain modulo along with the natural behavior of it under 2's complement operations, and indeed newer revisions of C and C++ (C99's <stdint.h>, C++11's <cstdint>) have explicit width types such as int32_t, uint32_t, etc.)

Typing is an interesting way to automate things, but you often need to override the default action, and this is done through casting, which syntactically uses the type as a sort of function. For example, int32_t(x) returns the signed 32 bit int representation of x. Often the value returned isn't different, only the type label has been changed so that it gets operated on as expected for that type. (Another horrible thing about C is the way it handles signed operations, where all operands have to be signed for the signed operation to be employed; otherwise, for operands of the same width, the operation is unsigned, which leads to all kinds of unexpected error situations.)
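A small C++ sketch of that signed / unsigned trap. The function names are made up for illustration. The first function spells out with an explicit cast what the language does implicitly when an int32_t meets a uint32_t on typical platforms where int is 32 bits wide.

```cpp
#include <cstdint>

// What a mixed compare "a < b" effectively does: the signed operand is
// converted to unsigned first, so -1 becomes 0xFFFFFFFF.
bool less_mixed(int32_t a, uint32_t b) {
    return static_cast<uint32_t>(a) < b;
}

// Widening both operands to a signed 64 bit type forces a signed compare
// that is correct for every pair of inputs.
bool less_signed(int32_t a, uint32_t b) {
    return static_cast<int64_t>(a) < static_cast<int64_t>(b);
}
```

With a = -1 and b = 1, less_mixed returns false while less_signed returns true, which is the kind of surprise that typing the operation rather than the data avoids.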

I get the typing thing for bit width, but not so much for signed behavior. It seems to me that the operation rather than data should be typed, and this is what I do in my assembly language. Less than comparisons between variables are also typed (using signed or unsigned subtract, equality is sign neutral, less than comparisons to zero are always signed).

==========

Still working on the basic math subroutines. EXP2_F is done (35 cycles max) and I thought LOG2_F was wrapped up but now I'm seeing some wasted headroom opportunities in the polynomial coefficients. I could improve the error by 1/2 bit or so worst case, but it might take days to tune the thing. Same issue with the float version of COS. There seems to be more headroom available in the COS polynomial vs. the SIN polynomial, another reason to use COS as the basis for both.

This stuff can be fairly complex at the algorithmic level, and incredibly nuanced at the numeric level. One thing that's really paying off is the hardware limiting of shifts; that alone massively cleaned up the EXP2_F code.

==========

In the processor I removed all the mixed signed / unsigned ops and installed a "reverse subtract" op. Add, multiply, and, or, xor - none of these care about operand order, but subtract does. The reverse version swaps (commutes) the operands. This shaves off an instruction or two here and there, particularly for the immediate ops, not sure if it will stay.

Posted: 9/30/2017 3:02:18 PM 1253

dewster


Algorithms

If anyone is interested in my latest Hive math algorithms, I put a spreadsheet describing them here:

http://www.mediafire.com/file/9o1e1r879g9vtpw/HIVE_algorithms_2017-09-28.xls

I've test-benched each one, they're all polished up and ready to go. It really helps to work out the algorithms in something interactive like Excel, you get a lot of insight by actually seeing what's going on at each step. Coding them up in HAL and watching them run in the simulator is also quite informative. And of course the results of both should match exactly.

==========

# include <filename>

The best way to maintain these algorithms is to have a single copy in a file somewhere, and pull it into your program via a pre-processor directive. C uses the "# include" syntax and that's what I've decided to do in the HAL assembler. It's kind of thorny adding this feature because:

1. Errors can now happen in the other files that get pulled in, so the file name needs to track here for meaningful error reporting.

2. With multiple files the line number isn't unique anymore, so this also needs to track for proper error reporting.

3. There can be files within files within files etc. so we need a system to keep track of the read position in any files we will be returning to to read from some more. This is very similar to pushing the subroutine return address.

4. The commenting syntax should also operate on these directives so they can be commented out, which means the discard of block comments, and the separation of live code and end-of-line comments, must happen at the same time we are evaluating the directives and pulling any other file contents in. Looking at the current code, the separation state machine can be simplified and modified to function on a per-line basis, with only a bool state (in_block_comment flag) to be preserved and passed from one line to the next.

I've been punting forever on line numbers, using blank lines in the intermediate files to preserve them and such, but it seems that now is the time to face this issue head on. For as useful as they are from a diagnostic standpoint, I'm going to move away from intermediate files and use an indexed struct or object that has the source file name, line number, code text, and end of line comment for each source line. This will nail down any error reporting issues for good, and will allow more flexible pre-processing as the addition of lines won't throw off the line counting and labeling system.
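The per-line record and the stack of read positions might look something like the C++ sketch below. It is only an illustration under stated assumptions: files are held in an in-memory map instead of real file streams, the names SrcLine, Frame, FileSet and read_all are hypothetical, directive lines are simply dropped, and block comments and include-cycle checks are left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One record per source line, as described above.
struct SrcLine {
    std::string file;     // source file name
    int         line;     // 1-based line number within that file
    std::string code;     // live code text
    std::string comment;  // end-of-line comment text, if any
};

// Read position in a file we will return to, much like a return address.
struct Frame {
    std::string file;
    std::size_t next;  // index of the next line to read
};

// Stand-in for the file system: file name -> lines of text.
using FileSet = std::map<std::string, std::vector<std::string>>;

// Expand "#include <name>" directives starting from file `top`.
std::vector<SrcLine> read_all(const FileSet& files, const std::string& top) {
    std::vector<SrcLine> out;
    std::vector<Frame> stack;  // LIFO of open files
    stack.push_back(Frame{top, 0});
    const std::string key = "#include <";
    while (!stack.empty()) {
        Frame& f = stack.back();
        const std::vector<std::string>& text = files.at(f.file);
        if (f.next >= text.size()) { stack.pop_back(); continue; }
        const std::string cur_file = f.file;
        const std::string raw = text[f.next];
        const int line_no = static_cast<int>(++f.next);  // 1-based
        // Split live code from an end-of-line comment before looking for
        // directives, so a commented-out #include is ignored.
        const std::size_t cpos = raw.find("//");
        const std::string code = raw.substr(0, cpos);
        const std::string comment =
            (cpos == std::string::npos) ? "" : raw.substr(cpos);
        const std::size_t ipos = code.find(key);
        const std::size_t end =
            (ipos == std::string::npos) ? ipos : code.find('>', ipos);
        if (end != std::string::npos) {
            const std::size_t start = ipos + key.size();
            // Push the included file; `f` is not used after this point.
            stack.push_back(Frame{code.substr(start, end - start), 0});
            continue;
        }
        out.push_back(SrcLine{cur_file, line_no, code, comment});
    }
    return out;
}
```

Because every record carries its own file name and line number, later passes can add or delete records freely without disturbing error reporting.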

Use of the newer containers in C++ (vectors, hashed storage) has really streamlined the code and taken it to the next level.

==========

Pre-processor directives aren't something you want to use in a higher level language, though C++ uses them extensively to include code, which leads to clunky directive guard tests that attempt to include the code only once. What you want is a package system that allows you to pick and choose what functionality you want to include in and expose to your code. A package can be a single file with multiple sections, or multiple files, that can be nested. SV has a really nice packaging system where you can stick all your magic numbers and other stuff in one place, and then refer to them anywhere as many times as you like without conflict, and limiting it to defined subsections of that shared data if desired.

==========

In revisiting the math subroutines, I found the new byte-addressed processor refreshingly easy to program, mostly due to the optional operand move now available with all immediate operations. Moves are inherently wasteful, so in the past I'd find myself spending hours trying to remove them, particularly from critical sections of code. This is a huge problem with pure stack machines (Hive is a hybrid register/stack machine) and stack languages, the programmer knows in his heart that moves, swaps, dupes, drops, etc. - any stack manipulation that doesn't involve a functional change to the data itself - is an inefficiency to be minimized, but this effort isn't in any way related to the problem at hand (writing a program to do something) so it's unwelcome mental overhead. I like puzzles as much as the next person, but not so much when they seriously impede the solving of a bigger puzzle, nor when the sub puzzle solving is an exercise in the minimization of something bad rather than the elimination of it.

The unsigned comparisons were really needed but never fit in the old opcode encoding, so it's nice to have those too.

Posted: 10/4/2017 9:08:05 PM 1254

dewster


Got the new #include system going several days ago. Using a vector of structs in a LIFO configuration (C++ vectors have LIFO like controls, so I didn't need to add much code to them in order to do this) to hold the state info as it traverses the various input files. Had to open the files in binary mode for the "tellg" and "seekg" stream pointer functions to work correctly, but other than that it ran shockingly well right off the bat. The only real wrinkle to iron out was what to do with the #include statement lines - I was attempting to discard them, but didn't want yet another thing going on to keep track of. It's actually much simpler to just stick them in the store and ignore / delete them later. The data from the input files is stored in a vector of structs, with each indexed struct holding one line of info separated out (file name, file line number, code, end-of-line comment). I keep them around until the end so the error system has data to pull from. Now that I am explicitly tracking file and line, I was able to throw in a processing step that discards empty lines. There are 7 steps in the assembly process, 3 to pull in the data and massage it (pre-processing), 3 label processes, and a final process that interprets the result to a model of Hive memory.

An #include system is an amazing thing! All my math subroutines are in separate int and float files, and now I'm going through the CLI (command line interface) code and cleaning it up (byte addressing makes all kinds of things much easier). Files can be #included inside of other files, which can be #included inside other files, ad infinitum/nauseam. Sticking code in separate files can really clean up the view of the thing you're working on, and the separation process helps to define the overall modularity of the project. I've got a math package (which contains int and float packages), a char package (ha ha), a string package, a CLI tokenizer file, a CLI parser file, and a CLI top file so far. I'd forgotten how much code I'd written! It could all use a second look and a polish anyway.

What would be really useful would be something that somehow determines what can be safely left out when pulling things in, which would automatically minimize the memory image footprint. I suppose functions could be enclosed in brackets, and the labeling system could look for orphaned references? Something to think about.

I changed my string format a bit. These are small strings for display and such so they don't need to hold a whole encyclopedia or anything. The chars are obviously 8 bit, but at the head of the store the first byte is reserved for the fullness index (post inc write pointer), and the second byte is the fullness limit. Having a separate limit means you don't have to allocate powers of 2 string buffers, which is nice for shorter fixed messages, and for nutty stuff like the 20 line LCD display. Having a byte index means you can only store 255 chars max (there are 256 states associated with 255 chars, you need one more for "empty"). So this string type is "str8". Whenever I need longer strings I can easily make a new type with indices of 16 or 32 bits and call it "str16" or "str32".
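In C++ terms the "str8" layout could be sketched roughly as follows. The struct and method names are hypothetical, and a std::vector stands in for the raw memory block the assembly code would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the "str8" layout: byte 0 is the write index (fullness),
// byte 1 is the fullness limit, and the chars follow from byte 2 onward.
struct Str8 {
    std::vector<uint8_t> store;

    explicit Str8(uint8_t limit)
        : store(2 + static_cast<std::size_t>(limit), 0) {
        store[1] = limit;  // any limit from 0 to 255, not just powers of 2
    }

    uint8_t size() const { return store[0]; }
    uint8_t limit() const { return store[1]; }

    // Append one char at the write index, then increment the index.
    // Returns false when the string is already at its limit.
    bool push(char c) {
        if (store[0] >= store[1]) return false;
        store[2u + store[0]] = static_cast<uint8_t>(c);
        ++store[0];
        return true;
    }
};
```

A Str8 built with a limit of 3 accepts three chars and rejects the fourth, and the whole thing occupies limit + 2 bytes.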

Posted: 10/9/2017 10:56:07 PM 1255

dewster


{Fight Club}

Programming language scoping is namespace containment. You're free to re-use variable and subroutine names when in differing scopes and you can expect those names to remain local to the scope. Kind of like Fight Club, what happens in there stays in there. Namespace pollution can happen quite easily in larger coding projects, and can often be tricky to debug, so you see things in C++ code like "std::cout << blah" where the cout name is qualified with the explicit std namespace of the standard library (though I think peppering your code with "std::" pretty much ruins readability).

Scoping in C isn't super sophisticated, rather it's quite mechanically controlled by the {} braces. An open brace { takes you into another scope, and a close brace } takes you back to the one you were in before the { came along. So if I write:

{ int tst = 5; }

cout << tst;

I'll get an error if tst wasn't defined somewhere previously in the root scope:

int tst = 3;

{ int tst = 5; }

cout << tst;

which prints 3. And this also works just fine:

int tst = 5;

{ cout << tst; }

The behavior then is that scopes can "see" names from their own scope all the way back to the root scope. The determination of which one to use is simply the one in the scope closest to where it is being referenced for use:

int tst = 5; cout << tst;

int tst = 4; { int tst = 5; cout << tst; }

int tst = 3; { int tst = 4; { int tst = 5; cout << tst; }}

will all print '5'

int tst = 4; { cout << tst; }

int tst = 3; { int tst = 4; { cout << tst; }}

will all print '4', etc.
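The same lookups can be packed into a few compilable C++ functions. This is just a restatement of the one-liners above; the function names are invented, and return values stand in for the cout output.

```cpp
int tst = 3;  // root (global) scope

// The inner block sees its own tst; the outer tst is untouched.
int nested() {
    int tst = 4;
    int seen_inside = 0;
    { int tst = 5; seen_inside = tst; }  // innermost definition: 5
    return seen_inside * 10 + tst;       // 54: inner saw 5, outer is still 4
}

// No tst in the inner block, so the nearest enclosing one is used.
int sees_enclosing() {
    int tst = 4;
    { return tst; }  // 4
}

// The block's tst is gone at the closing brace, leaving the root one.
int sees_root() {
    { int tst = 5; (void)tst; }
    return tst;  // 3
}
```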

Every time an opening brace is encountered, a new scope is created and the names within it must be unique among themselves, but not so among the names in all the scopes looking back towards the root. All scopes share the root scope, which makes it a global resource, but other than that scoping depth isn't necessarily an indication of shared name spaces, because two scopes can have the same depth but only share the root scope:

Here scope 2 and scope 4 have the same depth but only share the root:

scope 0 { scope 1 { scope 2 }}

scope 0 { scope 3 { scope 4 }}

Here scope 2 and scope 3 share scope 1 and the root:

scope 0 { scope 1 { scope 2 }}

scope 0 { scope 1 { scope 3 }}

Naming is one of the hardest things programmers have to do. Names should be descriptive, yet terse, and locally unique and consistent. Most of my programming time is spent coming up with names for things, coming up with better names for things and renaming them, etc. Language features that reduce the amount of naming can really facilitate the coding process, and scoping is a simple and powerful way to do this. With scoping we can use common iterator names like 'i' over and over again, and we don't have to make the names in similar subroutines different so they don't clash.

Scoping also naturally contains things like subroutine guts:

return_type sub_name (port_type port_name, ...) { sub_guts...; return(return_value); }

The port can be seen as a bridge between the inside and outside scopes. And the return is necessary so as not to "fall through" the closing brace (all of these mechanisms are more mechanical than you might initially think).

=============

Since my last post I've implemented scoping support in the assembler, and WOW is it useful! Now I don't have to worry about variable names so much (because there really aren't any) but label names are a different matter. Not having the local labels "fighting" globally with all the others anymore is an immense relief. The labels can also now be more descriptive, so I don't need to explain as much regarding program flow control in the end-of-line comments, and anything that cuts down on the need for micro documentation is generally a step in the right direction. These are the assembler steps:

1. Read in files, split code & comments, implement the #include directive.

2. Implement #define (forward text-based search and replace) and #undef (removal from table).

3. Convert to lower case & insert spaces around certain text groupings.

4. Pre-process certain HAL constructs to make them easier to decode later, kill blank code lines.

5. Replace all explicit labels with addresses, kill their address assignment statements.

6. Flatten label name space by traversing {scope} hierarchy and assigning globally unique aliases.

7. Kill scopes anchored by orphaned LH labels (recursively), kill all braces.

8. Process implicitly assigned label tokens, replace with physical addresses.

9. Parse final assembly to memory array.

If you notice in step 7 the process is now removing dead code, which is code that is written like this:

@some_label { some code... }

where the label isn't referenced by anything in-scope. You have to do this recursively, or in a loop, because removing some dead code can remove the only reference to some other code, which is now dead, etc. What's great about this is I can now have packages of subroutines that get pulled in with an include statement, but only the ones that actually get called will make it all the way through the assembly process.
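The looping removal can be sketched in C++ as below. The data model is an assumption made for illustration: each labeled scope is reduced to the list of labels its code references, and root_refs holds the references made by code outside any labeled scope. The real assembler works on its line store and scope hierarchy, not on a map like this.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: label of a scope -> labels referenced inside it.
using Scopes = std::map<std::string, std::vector<std::string>>;

// Repeatedly remove scopes whose label nothing else references, until a
// full pass removes nothing.
Scopes strip_dead(Scopes scopes, const std::set<std::string>& root_refs) {
    bool changed = true;
    while (changed) {
        changed = false;
        // Collect every label referenced by the root or a surviving scope.
        std::set<std::string> used = root_refs;
        for (const auto& s : scopes)
            for (const auto& ref : s.second)
                if (ref != s.first) used.insert(ref);  // ignore self-references
        // Drop orphaned scopes; this may orphan others for the next pass.
        for (auto it = scopes.begin(); it != scopes.end();) {
            if (used.count(it->first) == 0) {
                it = scopes.erase(it);
                changed = true;
            } else {
                ++it;
            }
        }
    }
    return scopes;
}
```

If only mul_f is called from the root, and mul_f uses norm_f while log2_f uses poly, the first pass removes log2_f and the second removes poly, which by then has lost its only reference. Two dead scopes that reference each other would survive this simple pass.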

============

I also decided not to go too crazy on pre-processor directives, and I've changed from the C/C++ syntax of #<directive name> to the SystemVerilog syntax of `<directive name>. The SV syntax is actually safer because directive names when used must have the prefix ` so the namespace is kept separate from everything else.

Posted: 10/19/2017 6:28:41 PM 1256

dewster


JAL

Last week I was revisiting the CLI (command line interface) HAL assembly code, puzzling over what to stick where in terms of the register stacks while doing the tokenizing and token interpretation. And it struck me that I could easily incorporate JAL or "jump and link" opcodes - relative subroutine commands - in the processor, which would avoid the use of a register for the subroutine start address. The subroutine start address would then be a signed offset to the current thread PC, given by immediate 8 or 16 bit value, or by the value at the top of stack B. And I could do the same for memory accesses, which otherwise requires a register to specify the memory base address. The inclusion of these opcodes would really help clean up the CLI assembly code, and would slightly compact it and make it run slightly faster as well. I've read both the Hennessy & Patterson and Patterson & Hennessy texts so I have no excuse for not thinking to include JAL earlier! :-) (Though I named my JAL "JSB" for "jump to subroutine" to match my "GSB" or "go to subroutine". I briefly thought of renaming all my "JMP" opcodes "JTO" to match my "GTO" opcodes, but "JMP" has a clearer meaning to my eyes.)

In doing this I also re-addressed the HAL syntax for jumps and subroutines. I removed any explicit invocation of immediates here, as calculating the PC offset manually is dangerous and prone to error. All jumps automatically calculate the opcode immediate value given the current PC and an address (or label converted to an address), so the JMP syntax looks more like a GTO, and JSB more like a GSB. And I decided to remove my primary JSB / GSB syntax:

pc := @lbl, sa := pc // jsb_8 or jsb_16

pc := @lbl, sa := pc, pb // jsb_8 or jsb_16, w/ pop b

pc += sb, sa := pc // jsb

pc := sb, sa := pc // gsb

and promote the secondary syntax which is looking cleaner to me now:

sa := pc := @lbl // jsb_8 or jsb_16

sa := pc := @lbl, pb // jsb_8 or jsb_16, w/ pop b

sa := pc += sb // jsb

sa := pc := sb // gsb
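The automatic offset calculation behind "jsb_8 or jsb_16" might be sketched like this in C++. Which PC value the hardware adds the offset to is not stated above, so pc_base is an assumption, as are the names JsbForm, RelJump and encode_jsb.

```cpp
#include <cstdint>

enum class JsbForm { Imm8, Imm16, TooFar };

struct RelJump {
    JsbForm form;    // which immediate width is needed
    int32_t offset;  // signed offset to encode (0 if TooFar)
};

// Pick the smallest signed immediate that reaches `target`.
// Assumption: `pc_base` is whatever PC value the processor adds the
// immediate to; the assembler would supply it for the jump being encoded.
RelJump encode_jsb(uint32_t pc_base, uint32_t target) {
    const int64_t off =
        static_cast<int64_t>(target) - static_cast<int64_t>(pc_base);
    if (off >= INT8_MIN && off <= INT8_MAX)
        return RelJump{JsbForm::Imm8, static_cast<int32_t>(off)};
    if (off >= INT16_MIN && off <= INT16_MAX)
        return RelJump{JsbForm::Imm16, static_cast<int32_t>(off)};
    return RelJump{JsbForm::TooFar, 0};  // would need the stack B form
}
```

Leaving this arithmetic to the assembler is the point of dropping explicit immediates from the jump syntax: the programmer writes a label and never sees the offset.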

For the memory access opcodes, I changed the immediate offset to a strictly byte offset. The offset used to be an increment that was sized to track and index the access width (in terms of bytes) but that hems one in too much for wider accesses relative to the PC, where one has no real control over the byte "phase" of the PC. This actually frees things up generally in terms of grouping multiple width reserved data blocks together.

The target FPGA has limited block RAM resources, which limits the main memory address to only 12 bits wide. This is easily reachable in a direct sense via a 16 bit immediate, but I've been studiously avoiding the inclusion of any opcode that directly addresses some subset of memory. Offset or relative addressing is totally fair game as it operates locally and doesn't rely on anything weird (like paging mechanisms, etc.). I don't want to paint the design of HIVE into a corner that doesn't treat all 32 bits of address space as equally as possible.

This morning I added a second highlight in the simulator listing that isn't necessarily "glued to" the current thread's PC, so I can observe a section of memory while the threads are off doing their thing. The keys up / pg-up / dn / pg-dn and the 'g' command "unglue" the highlight, hitting the home key "glues" it back. Minor coding change but pretty handy.

Posted: 10/31/2017 3:24:41 PM 1257

dewster


Worked on the `include logic a bit more in the assembler. Files can be plopped directly into code by including them, but what if one file includes another, and vice versa? This would lead to an endless recursive loop that could crash the assembler. Initially I had the logic go back through the files that were opened to get to the current one, examining the file names, and erroring out on a match to the new one. Then I thought I could get tricky and do a poor-man's package system by doing this globally over all the include file names, and simply excluding those that were already included. But the include system should be fairly brain dead, so I've returned to the original method, but I've retained the global accounting to note how many files have been repeatedly (and non-recursively, obviously) included.
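A rough C++ sketch of that bookkeeping follows. The IncludeTracker type and its member names are hypothetical; it only shows the two checks described, the walk back through the currently open files and the global tally.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping for include handling.
struct IncludeTracker {
    std::vector<std::string> open;      // chain of files currently open
    std::map<std::string, int> counts;  // global tally of every include

    // Returns false if `name` is already in the open chain, which would
    // be a recursive include. Otherwise records it and returns true.
    bool enter(const std::string& name) {
        if (std::find(open.begin(), open.end(), name) != open.end())
            return false;
        open.push_back(name);
        ++counts[name];
        return true;
    }

    // Call when the end of the most recently opened file is reached.
    void leave() { open.pop_back(); }

    // Number of distinct files that were included more than once.
    int repeats() const {
        int n = 0;
        for (const auto& c : counts)
            if (c.second > 1) ++n;
        return n;
    }
};
```

Including the same file twice in sequence is allowed and merely counted; only a file that appears in its own chain of open files is rejected.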

It took me a bit to reorient my thoughts regarding the include logic, particularly now that scoping has been implemented. Include takes place very early on in the gauntlet (pre-tokenization, actually) so it is unaware of scoping. With scoping, the same code can legitimately be included multiple times without label namespace conflicts. This is good and bad, sometimes you really want multiple copies of some function / data, sometimes you really don't. The scoping mechanism finds and removes most dead code, but it can't deal with all situations. A person could retire working on these kinds of code optimization issues. I'd still like some sort of packaging system though. I can see why C++ doesn't have one yet.

For all the good that scoping brings - and it's a lot of good! - it often confounds assembly error reporting. I see this in my C++ coding as well. Leave off a brace or include one too many and all you usually get is a cryptic error message way off in la la land.

I've got the command line HAL assembly code polished and back up and running on the FPGA processor, and am working on the remaining LCD, encoder, SPDIF, and DPLL interface HAL code. The addition of scoping, include, jump and link, and relative memory access to the hardware and language have brought this recoding process to a higher, somewhat more abstract level. Before when I had a pile of code it felt like the bigger it got the less manageable it became. I'm not feeling that so much now, which is a relief.

Unsuccessfully staring down a couple of weeks of heavy volunteering, so it's dead slow ahead for the USS Theremin.

[EDIT] Forgot to mention that I finally got around to incorporating ASCII escape characters as used in C into the assembly process. I can now write things like:

s0 := '\n'

and not have to refer to an ASCII to hex table. This really helps with things like my "token" type, which is a tiny (up to 4 character) ASCII string held in a 32 bit int:

s7 := '\nfab'
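For illustration, packing such a token in C++ could look like the sketch below. The byte order is an assumption (first char in the least significant byte), since the order HAL uses is not stated above, and make_token is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Pack up to 4 ASCII chars into a 32 bit token.
// Assumption: the first char lands in the least significant byte.
// Chars past the fourth are ignored.
uint32_t make_token(const std::string& s) {
    uint32_t tok = 0;
    for (std::size_t i = 0; i < s.size() && i < 4; ++i)
        tok |= static_cast<uint32_t>(static_cast<uint8_t>(s[i])) << (8 * i);
    return tok;
}
```

Under that byte order, make_token("\nfab") gives 0x6261660A, with the newline (0x0A) in the low byte.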

Posted: 11/3/2017 6:03:50 PM 1258

dewster


Polished on the actual pitch and volume acquisition assembly code yesterday and today. The prototype is now displaying pitch and volume numbers as 32 bit HEX on the LCD. As before, the data is differentiated twice to form the second order downsampling CIC filter with the hardware DPLL, and then run through a 60Hz first order CIC filter to kill hum. The acquisition and filtering takes place at the PCM rate of 48kHz via interrupt, and the LCD update rate is set to roughly 10Hz. The assembler (built into the sim) tells me the hum filters are consuming roughly 50% of the available RAM. Rather expensive, but worth it IMO.
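A first order CIC is a moving sum, so the hum filter can be sketched in C++ as below. This is a generic boxcar, not the Hive interrupt code; the class name is invented, and the length of 48000 / 60 = 800 samples is inferred from the rates quoted above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// First order CIC (boxcar / moving sum) filter. With a length of
// 48000 / 60 = 800 samples the response has nulls at 60 Hz and at its
// harmonics.
class HumFilter {
public:
    explicit HumFilter(std::size_t length) : delay_(length, 0) {}

    // Integrator plus comb in one step: add the new sample and subtract
    // the one from `length` samples ago. Unsigned arithmetic wraps
    // modulo 2^32, which is the behavior CIC filters rely on.
    uint32_t step(uint32_t in) {
        sum_ += in - delay_[pos_];
        delay_[pos_] = in;
        pos_ = (pos_ + 1) % delay_.size();
        return sum_;
    }

private:
    std::vector<uint32_t> delay_;  // circular buffer of past inputs
    std::size_t pos_ = 0;          // slot holding the oldest sample
    uint32_t sum_ = 0;             // running sum of the last `length` inputs
};
```

The delay line is what costs memory: one stored sample per tap, so a 60 Hz null at a 48 kHz sample rate needs 800 words per filter.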

These are just raw operating numbers that need the linearization treatment. Thread 0 is handling pitch, thread 1 is handling volume, and thread 7 is handling the I/O (LCD, encoders, pushbuttons) via interrupt and the CLI via normal processing. I'm thinking of having the float-based linearization math for the pitch and volume happen via normal processing on their respective threads. 8 threads, each with their own interrupt, is a fairly powerful thing.

Posted: 11/5/2017 3:24:03 PM 1259

dewster


1. The simplest handling of the pitch (and volume) number is subtraction from a constant, which is numeric heterodyning if you think about it (analog Theremins subtract a changing pitch from a fixed one). This gives you a value that increases as you approach the antenna, zero at null, and negative values out past null. Exactly like analog heterodyning, the nature of the value obtained is roughly exponential with hand position, so you can feed it directly to a linear oscillator (or NCO) to generate your tone. But to view it with a linear response pitch display you need to take the log2 of the number first.

2. Somewhere around the second simplest is my d = K/C linearization scheme, where we work backwards through the LC resonance equation, perform some constant aggregations, and obtain:

d ~= K3 * ( F^2 ) / ( K2 - F^2 )
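One way to arrive at this form, as a sketch: assume the tank is an inductance L with a fixed capacitance C_0 in parallel with a hand capacitance K/d. The symbols L, C_0 and F_0 (the frequency with the hand infinitely far away) are introduced here for illustration and are not named above.

```latex
\begin{aligned}
F &= \frac{1}{2\pi\sqrt{L\,(C_0 + K/d)}}
  &&\Rightarrow\quad C_0 + \frac{K}{d} = \frac{1}{4\pi^2 L F^2} \\
F_0^2 &= \frac{1}{4\pi^2 L C_0}
  &&\Rightarrow\quad \frac{K}{d} = C_0\,\frac{F_0^2 - F^2}{F^2} \\
d &= \frac{K}{C_0}\cdot\frac{F^2}{F_0^2 - F^2}
  &&\Rightarrow\quad K_3 = \frac{K}{C_0},\qquad K_2 = F_0^2
\end{aligned}
```

Under this model K2 is the square of the far-field frequency and K3 lumps the hand and fixed capacitance constants together.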

Compared to the denominator, F (and F^2) in the numerator doesn't vary much, so we can safely remove it from the numerator without significant effect:

d ~= K4 / ( K2 - F^2 )

Note that d increases with increasing hand distance. What we want is an operating point number that increases with decreasing hand distance, or as the hand approaches the pitch antenna. We could subtract d from a number that is the value of d when we are standing at the playing position with our pitch hand retracted. Or we could swap the denominator factors to make d negative, and add the offset:

Pitch (linear) = K5 + [ K4 / ( F^2 - K2 ) ]

On top of all this we add offset and gain controls to translate or move the entire response up and down and to "rotate" the response about some middle point without translation. These give independent control over playing range and sensitivity. K5 already functions as an offset, so we just multiply the whole thing by gain factor K6. Translating an additional time by K5 allows us to control the rotation axis point. So the simplest final form is:

Pitch (linear) = K5 + [ K6 * ( K5 + [ K4 / ( F^2 - K2 ) ] ) ]

What is our real-time budget? Ideally we want to do this calculation at the 48 kHz PCM rate. There are 8 threads, and if the core is clocked at 160MHz this gives us 160M / 8 / 48k = 416 cycles.

Working our equation from the inside out, the ( F^2 - K2 ) term can be done as integers. Conversion to float, float inverse, and conversion back to int takes around 33 cycles. K4 is a bulk fixed large gain term that can be implemented by adding a constant to the float exponent pre conversion to int. K5 and K6 can be done in the int domain as well. This gives us ints as parameters, which is handy, and consumes around 60 cycles max. Doing the CC_U acquisition and CIC hum filtering takes approximately 38 cycles per axis (volume, pitch). So it seems we can acquire, hum filter, and linearize both axes with a single thread interrupt routine:

2 * ( 60 + 38 ) = 196 cycles

with plenty of room to spare in our 416 cycles!
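As a quick arithmetic check, the budget can be written out as C++ compile-time constants. The constant names are made up; the figures are the ones quoted in the text.

```cpp
// Figures quoted in the text above.
constexpr int kCoreClockHz   = 160000000;  // 160 MHz core clock
constexpr int kThreads       = 8;          // hardware threads
constexpr int kSampleRateHz  = 48000;      // PCM rate
constexpr int kLinearizeMax  = 60;         // linearization, cycles per axis
constexpr int kAcquireFilter = 38;         // acquisition + hum filter, per axis
constexpr int kAxes          = 2;          // pitch and volume

// Cycles available to one thread per sample (integer division truncates).
constexpr int kBudget = kCoreClockHz / kThreads / kSampleRateHz;

// Cycles needed to acquire, filter and linearize both axes.
constexpr int kUsed = kAxes * (kLinearizeMax + kAcquireFilter);

static_assert(kUsed < kBudget, "both axes must fit in one sample period");
```

Here kBudget evaluates to 416 and kUsed to 196, a bit under half of the per-thread budget.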

The nature of the value obtained with this second approach is roughly linear with hand position, so you can view it directly with a linear response pitch display, but to convert it to a tone via linear oscillator (or NCO) you need to take the exp2 of the number first. The float and int versions of EXP2 take 35 and 27 cycles, respectively.

=======

Of the two methods described above:

1. The first has only one adjustment, null, which strongly influences linearity. Multiplication of the pitch number can give pitch offset, but to change the sensitivity one needs to resort to power manipulation of the pitch number (log2, add/mult, exp2 is the most generic) which can be expensive (in terms of real-time).

2. The second is naturally linear, so simple and inexpensive (add/mult) manipulations alter pitch offset and sensitivity, and the exp2 at the end is less expensive than log2, though the initial manipulations are more expensive than the simple subtract of the first method. There are three user adjustments: null, offset, and gain. The null adjustment here also strongly influences linearity, though setting it is perhaps more difficult with the other two parameters influencing the actual pitch null. In simulation, it seems once null is set correctly then offset and gain track (i.e. one can use null to correct for stray C in various environments and the other two settings should more or less fall in line).

=======

Yesterday, on top of (hopefully) finalizing linearization, I developed a COS16 assembly subroutine. It takes a signed 16 bit integer input and produces a signed 16 bit cosine, with a period of 2^16. Getting +/- 1 bit accuracy takes three polynomial terms, two of which can be immediate values, so the subroutine takes 21 cycles max.

It's pretty trivial to generate ramps and triangle waves from the accumulator of an NCO, but sine waves require a function like COS16 (or an interpolated look-up table, or a tracking filter, or etc.) so 21 lines of code / cycles is fairly cheap as these things go.
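For comparison, the ramp and triangle cases can be sketched in C++ like this. The Nco struct and the function names are hypothetical, and the 32 bit accumulator width and 16 bit output scaling are choices made for the example rather than the Hive implementation.

```cpp
#include <cstdint>

// Numerically controlled oscillator: a phase accumulator that wraps
// modulo 2^32. step = frequency * 2^32 / sample_rate.
struct Nco {
    uint32_t phase = 0;
    uint32_t step  = 0;

    // Advance one sample and return the new phase.
    uint32_t tick() { phase += step; return phase; }
};

// Ramp (sawtooth): top 16 bits of the phase, offset to be signed.
// Runs from -32768 at phase 0 up to 32767 just before the wrap.
int16_t ramp16(uint32_t phase) {
    const int32_t r = static_cast<int32_t>(phase >> 16);  // 0..65535
    return static_cast<int16_t>(r - 32768);
}

// Triangle: fold the second half of the ramp back down, then rescale.
// Output spans -32767..32767.
int16_t tri16(uint32_t phase) {
    const int32_t r = static_cast<int32_t>(phase >> 16);    // 0..65535
    const int32_t folded = (r < 32768) ? r : 65535 - r;     // 0..32767..0
    return static_cast<int16_t>(2 * folded - 32767);
}
```

Both waveforms are a shift, a compare and an add away from the accumulator, whereas a sine needs a polynomial, a table or a filter on top of the same phase value.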

My ultimate goal is to do much more than a tired old DSP emulation of a tired old analog synth, but I need a sound generator of some sort in the interim.

Posted: 11/7/2017 12:43:09 AM 1260

dewster


I know everyone (including me) is sick to death of linearity at this point, but this is kind of weird. Surprisingly, all three of these give essentially identical linearity in simulation:

Pitch = K4 + [ K3 * ( K2 - [ ( F^2 * K1 ) / ( K0 - F^2 ) ] ) ]

Pitch = K4 + [ K3 * ( K2 - [ K1 / ( K0 - F^2 ) ] ) ]

Pitch = K4 + [ K3 * ( K2 - [ K1 / ( K0 - F ) ] ) ]

(I'm playing a bit fast and loose with the constant names, they wouldn't have the same values for the same response across all three equations, but I think you get the general idea.)

As described in the post previous to this, the first equation is working backwards through the LC resonance equation, assuming C = K/d, and with sensitivity and offset controls K3 and K4 added. F^2 in the numerator changes so little compared to the denominator it can be safely removed from the numerator, giving us the second equation. But we can actually not square F in the denominator and get essentially the same response (in terms of linearity) with the third equation! Again, I suppose it's because F changes so little that there isn't much difference between F and F^2? Still, kind of a shock.

The third equation is actually functionally analogous to how the open.theremin pitch side works: two frequencies are analog heterodyned to obtain the difference frequency ( K0 - F ) and the period of that is measured [ K1 / ( K0 - F ) ]. So now we know why this approach works so well, it is a highly efficient simplification of the first equation, which is entirely grounded in physics.

And another piece of the Theremin design puzzle falls into place. [EDIT] Not exactly, see my next post.