Building Teensy 3.6 based digital theremin

Posted: 4/5/2019 7:46:37 AM
Buggins

From: Theremin Motherland

Joined: 3/16/2017


Yes, the LUT I/O can vary, making the LE count an apples-to-oranges comparison.  LUT I/O count "tunes" the logic for certain applications, some will go faster, but if all the I/O aren't utilized then LUT resources are wasted -  it's a waste / speed tradeoff.

Historically, Xilinx used to have no block RAM resources but instead used LUTs for RAM, which gobbled them up like crazy and made them rather slow due to all the interconnect (interconnect wire delays usually dominate total delay in a path).  Xilinx was also slow to adopt true PLLs for clock management, instead relying on DLLs which were less feature-rich.  Though Altera's had their problems too.


Of course, using a slower LUT6 as a LUT2, with slower interconnect, may slow a design down.
But a well-designed Xilinx-based implementation could be interesting.
When writing Verilog with the hardware's capabilities in mind, it's possible to achieve high-performance solutions.
Spartan 6 is already good enough and cheap, although Vivado does not support S6. Series 7 looks too expensive at first sight.

Interesting features of Xilinx FPGAs:
A LUT6 can be used as 2 LUT5s sharing the same inputs.
There are additional MUXes in each slice that allow muxing LUT outputs w/o adding interconnect delays.
A 16:1 MUX can be implemented in a single slice (4 LUTs).
When used as distributed RAM, a LUT6 has 4x the capacity of a LUT4.
E.g. one slice (4 LUTs) can work as a 64x6 simple dual-port RAM - a register bank of 32 36-bit registers can be built using only 6 slices.
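As a quick sanity check on the slice arithmetic above (a back-of-the-envelope sketch in Python; the 64x6 per-SLICEM figure is taken from the post itself, not re-verified against the datasheet):

```python
# One SLICEM (4 LUT6s) as distributed RAM gives a 64 x 6 simple
# dual-port RAM (figure quoted in the post above).
slice_depth, slice_width = 64, 6

slices = 6
bank_width = slices * slice_width   # bits per word when slices are ganged
bank_depth = slice_depth            # depth is unchanged by ganging

# 6 slices -> a 64 x 36 RAM, which indeed holds 32 registers of 36 bits
assert bank_width == 36
assert bank_depth >= 32
```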
The DSP is interesting - you can make an ALU almost w/o additional external resources. The series 7 DSP supports logical operations.
I believe it's possible to design a really usable 32/36-bit soft core with 16-32 registers and multiplication support (although 32x32 multiplication will require 4 clock cycles), with program and data stored in dual-port BRAM. A low-end FPGA could contain 32 soft cores while leaving most of the logic resources for other usage.


Posted: 4/5/2019 1:01:02 PM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

"When writing verilog keeping hardware capabilities in mind, it's possible to achieve high performance solutions."  - Buggins

Yes, the longer you do FPGA work the more you are aware of the abilities and limitations of the target hardware.  I use a lot more intermediate registering now.

"Spartan 6 already is good enough and cheap. Although, Vivado does not support S6. Series 7 looks too expensive at first sight."

I agree, the low end offerings are really adequate now and fairly inexpensive.  The high end is always crazy.

"Interesting features of Xilinx FPGAs:
A LUT6 can be used as 2 LUT5s sharing the same inputs.
There are additional MUXes in each slice that allow muxing LUT outputs w/o adding interconnect delays.
A 16:1 MUX can be implemented in a single slice (4 LUTs).
When used as distributed RAM, a LUT6 has 4x the capacity of a LUT4."

Believe it or not, the SW tools have much more impact on the design than the HW architecture, and both Altera and Xilinx offer much the same thing at different levels.  IIRC they stopped suing each other over every little thing and entered a non-compete situation.  I switched to Altera for my own stuff a while back because, for comparably featured FPGAs given a real design, Xilinx top speed was significantly lower (~75%) and priced higher.

"E.g. one slice (4LUTs) can work as 64x6 simple dual port RAM - register bank for 32x36bit registers can be built using only 6 slices."

A processor register bank usually needs a triple port memory, two reads and one write.  I've seen this implemented as a double bank therefore needing 2x the resources.
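The duplicated-bank trick can be sketched as a toy Python model (illustrative only; the class and method names are mine, not from any real tool flow):

```python
class DualBankRegFile:
    """2-read/1-write register file built from two 1R1W memories.

    Every write lands in both banks, so each bank holds a complete
    copy and can serve one read port independently - 2x the storage
    buys the extra read port, as noted above.
    """

    def __init__(self, n_regs=32, width=36):
        self.mask = (1 << width) - 1
        self.bank_a = [0] * n_regs   # serves read port A
        self.bank_b = [0] * n_regs   # serves read port B

    def write(self, addr, value):
        v = value & self.mask
        self.bank_a[addr] = v        # identical data goes to both banks
        self.bank_b[addr] = v

    def read2(self, addr_a, addr_b):
        # two independent reads per cycle, one from each copy
        return self.bank_a[addr_a], self.bank_b[addr_b]
```

In an FPGA the two "banks" would be two LUT-RAM or BRAM instances sharing the write address and data ports.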

"DSP is interesting - you can make ALU almost w/o additional external resources. Series 7 DSP supports logical operations."

I've looked into them a bit, but otherwise am ignoring the DSP blocks, waiting for them to make it to the very low end.

"I believe it's possible to design really usable 32/36bit softcore with 16-32 registers and multiplication support (although 32x32 multiplication will require 4 clock cycles), with program and data stored in dual port BRAM. Low end FPGA could contain 32 soft cores while leaving most of logic resources for other usage."

I imagine sufficient program and data store BRAM for those 32 soft cores would be the bottleneck.  And then you run into interconnect issues: should it be a star, a mesh, a shared bus, etc.  The 32x32 mult taking 4x clocks means you need several more to mux and register, which is why Hive has an 8 deep pipeline. And if you're pipelining you may as well make a barrel processor IMO.  Because much of the state resides in the pipeline, barrel processors aren't that much more complicated than conventional processors.  They avoid many of the hazards, they fully utilize the ALU, they naturally intercommunicate via the memory, they share a single set of peripherals, and their timing model is completely deterministic.  I think they're quite well suited to embedded applications.
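The barrel scheduling described above can be shown with a toy Python model (the 8-thread/8-stage pairing mirrors Hive as described in the post; the code is just a sketch of the issue pattern):

```python
N_THREADS = 8   # equal to the pipeline depth, as in Hive

def issue_schedule(cycles):
    """Round-robin issue: cycle c belongs to thread c mod N_THREADS."""
    return [c % N_THREADS for c in range(cycles)]

# A given thread issues exactly once every N_THREADS cycles, so its
# previous instruction has fully retired before its next one enters
# the pipe: no data hazards, no forwarding, no branch prediction.
sched = issue_schedule(24)
for t in range(N_THREADS):
    slots = [c for c, th in enumerate(sched) if th == t]
    assert all(b - a == N_THREADS for a, b in zip(slots, slots[1:]))
```

This is why the timing model is deterministic: each thread sees a fixed 1/N clock no matter what the others are doing.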

Posted: 4/9/2019 6:47:24 AM
Buggins

From: Theremin Motherland

Joined: 3/16/2017

Teensy Theremin PCB is soldered.


PCB Bottom view - theremin sensor

PCB Top view - connecting inductors and antennas

Assembled, bottom view from rear side:

Assembled, bottom view from theremin sensor side:

Jacks: Line In/Out, Phones Out, Teensy 3.6 USB, Power supply +5V.


MCU parts: Teensy 3.6, Audio Board, LCD Touch Screen.


Next steps:
    Build an acrylic cabinet from laser-cut parts.
    H/W testing, get the first working version of the software - it should be possible to calibrate and play using a simple synthesizer.

Posted: 4/11/2019 8:25:59 AM
Buggins

From: Theremin Motherland

Joined: 3/16/2017

"I imagine sufficient program and data store BRAM for those 32 soft cores would be the bottleneck.  And then you run into interconnect issues: should it be a star, a mesh, a shared bus, etc."  - dewster

One or two BRAM instances per CPU core for program + fast-access data would be enough for most audio DSP applications.
If the core does its calculations once per sample, working at 100MHz with a 48kHz sample rate it could execute 100000000/48000 ≈ 2083 instructions per sample.

The register file write data input is a 4:1 16-bit mux:
00: ALU output low 16 bits
01: ALU output high 16 bits
10: external data bus
11: output of one of register file channels
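The 2-bit select above can be written out as a small Python model (the function name and argument order are mine):

```python
def writeback_mux(sel, alu_out, ext_bus, reg_out, width=16):
    """Register-file write-data mux per the table above (sel is 2 bits)."""
    mask = (1 << width) - 1
    if sel == 0b00:
        return alu_out & mask              # ALU output, low 16 bits
    if sel == 0b01:
        return (alu_out >> width) & mask   # ALU output, high 16 bits
    if sel == 0b10:
        return ext_bus & mask              # external data bus
    return reg_out & mask                  # register file channel output
```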

Total resources used:
    60 LUTs (0.34%)
    1 DSP (1.25%)

A minimal CPU will additionally require:
    Barrel shift logic, or at least a one-bit signed/unsigned right shift
    Divider emulation (will take 16-32 cycles for division)
Result clamping support is very useful for DSP applications: an optional ALU output limiter, (v > max) ? max : ((v < min) ? min : v).

Longest data path: 3 register values read -> sign extension of ALU ports A, B -> ALU -> reg file input multiplexer -> register file write data.
According to timing simulation, this path may work at 100MHz w/o additional pipelining registers.
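The limiter is just a compare-and-select pair; a minimal Python model of the signed 16-bit case (defaults are mine, chosen to match the common DSP saturation range):

```python
def clamp(v, lo=-32768, hi=32767):
    """ALU output limiter: (v > hi) ? hi : ((v < lo) ? lo : v).
    Defaults model signed 16-bit saturation, the common DSP case."""
    return hi if v > hi else (lo if v < lo else v)
```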

Without additional resource usage, pipelining registers may be added at the reg file output, the ALU inputs, the ALU multiplier, and the ALU output (in that case, a free accumulator + right shifter (>> 17) is available).
Although, for instruction sequences where the next instruction depends on the result of the previous operation, wait cycles will be added.
The write port may work as read/write, e.g. first half cycle - read + latch result, second half - write.

16-bit operations will take 1 clock cycle
32-bit add/sub will take 2 clock cycles
32x32-bit multiplication will take 4 clock cycles
E.g. for 32-bit fixed point 7.24 operations:
  additions will take 2 cycles and multiplications - 4 cycles.
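The 7.24 format can be sketched in Python (the helper names are mine; on the real 16-bit ALU the multiply would be split into the partial products counted above):

```python
FRAC = 24  # 7.24 fixed point: sign + 7 integer bits, 24 fractional bits

def to_fix(x):
    return int(round(x * (1 << FRAC)))

def to_float(f):
    return f / (1 << FRAC)

def fix_add(a, b):
    return a + b                 # 2 cycles as 16-bit halves with carry

def fix_mul(a, b):
    return (a * b) >> FRAC       # 32x32 multiply, 4 cycles on the 16-bit ALU

a, b = to_fix(1.5), to_fix(-2.25)
# 1.5 + (-2.25) = -0.75, 1.5 * (-2.25) = -3.375, both exact in 7.24
```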

Adding one more DSP could reduce 32x32 multiplication to 2 clock cycles.
100MHz for 16-bit operations and 50MHz for 32-bit operations emulated on a 16(18)-bit CPU with low FPGA resource usage is probably enough for real-world applications.


Fast interrupt handling.

If the register bank is based on Xilinx series 7 distributed RAM blocks, the minimal number of registers is 32. Probably, it's too much, especially if the ISA only needs 16 of them.

Some thoughts about the ISA (for both 32- and 16-bit soft cores).

It makes sense to try fitting the ISA into 18 bits, for better utilization of BRAM resources and denser code.
3-address (register) arithmetic might speed up execution and avoid unnecessary register-register transfers.
E.g. R1 = R2 + R3 + carry
    R1 = R1 + (R2 * R3) + carry

For a 16-bit core, it makes sense to support 32-bit data instructions (e.g. a sequential pair of registers is used when a 32-bit value is needed).

3 register fields (for a 16-register bank) require 3*4 = 12 bits in the instruction. For a 16-bit instruction, only 4 bits are left - obviously not enough to encode all the 3-address operations we want (e.g. only 8..15 may fit there).
But since BRAM has a native width of 18 bits (16+2), it makes sense to utilize all 18 bits for the instruction.

Since we have twice as many registers as the ISA requires (32 vs 16), we can divide them into two banks and use the second half for interrupt handling, by simply setting bit 4 of the register number going to the register bank.
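The bank-select trick is a single OR on the register address; a Python sketch (the function name is mine):

```python
def phys_reg(r, in_irq):
    """Map a 4-bit ISA register number onto the 32-entry bank.
    Bit 4 selects the second half, giving the interrupt handler a
    shadow register set with zero save/restore cost."""
    return (r & 0xF) | (0x10 if in_irq else 0x00)
```

So R3 in normal code and R3 inside the interrupt handler are physically different words of the same LUT-RAM bank.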

In the 16-bit architecture, the 32x32 multiplier may use the register bank address to multiplex the high/low 16 bits of a 32-bit value.
For the 32-bit architecture, 32->16 multiplexing would require an additional mux, which may add delay to the data processing path. But it's possible to utilize the in-slice hardware muxes after the LUT outputs - a free (no additional slices / delays) mux on the low half of the register bank outputs puts the high or low part of a register value in the low 16 bits of the register bank output(s).

Posted: 4/11/2019 6:13:05 PM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

"Minimal CPU will require additionally
Barrel shift logic or at least one bit right shift signed/unsigned"  - Buggins

If you have a high speed multiplier then you just use a shifted one as the other input and a mux at the output to get a bi-directional barrel shifter.  An OR gate at the output instead of a mux gives you a uni-directional rotator (which you can make bi-directional in assembly).  A high speed multiplier is absolutely essential for real computing IMO as it facilitates polynomial use, and everything except for inverse (Newton's method) boils down to a polynomial.
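The multiplier-as-shifter trick can be modeled in Python for a 32-bit word (a sketch of the idea, valid for shift counts 1..31; a real datapath would feed 2**n as the second multiplier operand):

```python
W = 32
MASK = (1 << W) - 1

def mul_shift(x, n, right=False):
    """Barrel shift via one W x W -> 2W multiply: multiply by 2**n
    (left) or 2**(W-n) (right), then mux the low or high product word."""
    k = W - n if right else n                # valid for 1 <= n <= W-1
    prod = (x & MASK) * (1 << k)             # the double-width multiply
    return (prod >> W) & MASK if right else prod & MASK

def rotate_left(x, n):
    """OR the two product halves instead of muxing -> a rotator."""
    prod = (x & MASK) * (1 << n)
    return (prod & MASK) | (prod >> W)
```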

"Divider emulation (will take 16-32 cycles for division)"

If you have a high speed multiplier you can do a floating point inverse (1/x) in 25 cycles.  I do my best to avoid inverse / division, particularly integer as it's a precision killer.
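Newton's method for the reciprocal, x' = x*(2 - d*x), sketched in floating point (the seed is the textbook 48/17 - 32/17*d linear approximation, an assumption on my part; cycle counts will of course differ from the 25 quoted):

```python
def recip(d, iters=5):
    """1/d with no divide: Newton's method, quadratic convergence.
    d must be pre-normalized to [0.5, 1), as in floating point."""
    assert 0.5 <= d < 1.0
    x = 48.0 / 17.0 - 32.0 / 17.0 * d   # linear seed, a few bits correct
    for _ in range(iters):
        x = x * (2.0 - d * x)           # each step doubles the good bits
    return x
```

Each iteration is two multiplies and a subtract, which is why a fast multiplier makes this practical.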

"Result clamping support is very useful for DSP applications: an optional ALU output limiter, (v > max) ? max : ((v < min) ? min : v).
Longest data path: 3 register values read -> sign extension of ALU ports A, B -> ALU -> reg file input multiplexer -> register file write data"

Yes, limiting (unsigned) and saturation (signed) should be opcodes for DSP work as they get used a lot.

"According to timing simulation, this path w/o additional pipelining registers may work at 100MHz"

IMO you want to get the logic clocking as fast as the slowest element will allow (probably the 33 x 33 = 65 multiply), which means pipelining.  The FPGA fabric registers are there anyway, might as well use them.  For single threaded processors, pipelining means branch prediction, stalls, the whole crazy nine yards.

"16 bit operations will take 1 clock cycle
32 bit add/sub will take 2 clock cycles
32x32 bit multiplication will take 4 clock cycles
E.g. 32 bit fixed point +7.24 operations:
  additions will take 2 cycles and multiplications - 4 cycles."

IMO hardware support for 16 bit operations is a waste when you're concerned about DSP throughput, and it eats up your opcode space.  A full 32 bit processor is a thing of beauty.  And what seems medium-sized today will seem tiny in a decade or less.

"Adding one more DSP could reduce 32x32 multiplication to 2 clock cycles.
100MHz for 16 bit operations and 50MHz for 32bit operations emulated on 16(18) bit CPU with low FPGA resources is probably enough for real world applications."

Kinda slow IMO.  You want to fully utilize the bandwidth of your memory, ALU, etc. otherwise they're just sitting there doing nothing a lot of the time.

"It makes sense to try fitting ISA to 18 bits for better utilization of BRAM resources and have dense code.
...
But since BRAM has native width 18 bits (16+2), it makes sense to utilize all 18 bits for instruction."

I'd avoid non-powers-of-two data, address, or opcode widths.  It can really tie your hands.  Though I do use the extra width on the stacks for overflow flags.

Processor design is a lot of fun, and particularly so if you can find good reasons to deviate from the norm.

Posted: 4/15/2019 3:51:40 PM
Buggins

From: Theremin Motherland

Joined: 3/16/2017

Testing teensy theremin firmware on real hardware:

uGUI library is working well, including touch support.

I've implemented support for input from 4 pots, with debouncing.

Sound generation and output are working OK.

The theremin sensor part (oscillators, reference frequency generation, mixer, frequency measurement) is not tested yet.

Posted: 4/17/2019 6:47:49 AM
Buggins

From: Theremin Motherland

Joined: 3/16/2017


"IMO hardware support for 16 bit operations is a waste when you're concerned about DSP throughput, and it eats up your opcode space.  A full 32 bit processor is a thing of beauty.  And what seems medium-sized today will seem tiny in a decade or less."  - dewster


Makes sense.
Adding 32 more LUTs turns the 16-bit quad-port register file into a 32-bit one.
The ALU itself supports 48-bit operations (except multiplication).
So, adding a few resources gives a 32-bit CPU, allowing most instructions to be executed in a single cycle.

I've performed some tests on the register file.
I found a way to get a 32->16 mux on the register file output w/o additional slices (utilizing the embedded mux that multiplexes the two outputs of a single LUT).
In simulation it works well at 333MHz; a 200MHz pipeline looks achievable.

Trying to figure out how to achieve better performance for multiplication.
Goal is to have at least one multiply-and-add 7.24 fixed point (R1 = R1 + R2*R3) per 100MHz cycle.

Xilinx series 7 DSP blocks, with their cascading and pipelining registers and internal 17-bit right shift, make it possible to eliminate external muxes.
Using two DSP blocks connected in a cascade avoids input multiplexing if the 16-bit multiplication args are passed to one of them while the long 32-bit operands go to the other.
2 cycles will be required to put both high and low half-operands into the pipeline registers. The multiplications may be done in parallel on both DSPs.

A non-trivial task here is how to deal with register dependencies - when the next operation depends on the result of the previous one, the pipeline will wait until the previous result is ready. Due to the long pipeline, a 200MHz CPU would turn into a 50MHz one.


Posted: 4/17/2019 12:38:01 PM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

"Trying to figure out how to achieve better performance for multiplication.
Goal is to have at least one multiply-and-add 7.24 fixed point (R1 = R1 + R2*R3) per 100MHz cycle." - Buggins

Fused operations like MAC speed up FIR filters, and can be used to do Goldschmidt division in a pipeline.  I haven't found any use for anything but the simplest FIR (comb filter - FIR filters need a lot of memory and many operations to work at low frequencies, so if you can use a recursive filter you should) and Newton's method works fine for inverse / division (IMO, Goldschmidt is partially a way to justify the existence of the MAC hardware).  But any operation fused with the already slowest multiplication will slow down everything unless you have a fairly deep pipeline.  Which leads to...
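For reference, the Goldschmidt iteration mentioned here, as a floating-point Python sketch (operands assumed pre-scaled so D is in [0.5, 1); iteration count is my choice):

```python
def goldschmidt_div(n, d, iters=5):
    """N/D by scaling both by F = 2 - D each step, driving D -> 1 and
    N -> N/D. The two multiplies per step are independent of each
    other, which is what lets the algorithm live in a MAC pipeline."""
    assert 0.5 <= d < 1.0
    for _ in range(iters):
        f = 2.0 - d
        n, d = n * f, d * f
    return n
```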

"Non-trivial task here is how to deal with register dependencies - when next operation depends on results of previous operation - pipeline will wait until results of previous got ready. Due to long pipeline 200MHz CPU would turn into 50MHz one."

This to me is the #1 reason NOT to make single-threaded processors.  They're often quite terrible at keeping the pipeline busy, and designers add hack after hack to ameliorate the situation somewhat, leading to a really complex model with indeterminate timing.  Add a cache to the whole mess and you have no real guarantee of meeting real-time unless you have gobs of overhead.  Go down this road waaaaaay too far and you end up doubling a billion transistor design just to get a 25% throughput improvement.  IMO it's madness.  Verification becomes a total nightmare (Pentium bug).

If you can get your individual operations really fast then there is no need to fuse them.  Fusing slows or deepens the pipe for all operations.

Also, processor details can depend on what it will be working on.  DSP calculations, which just need "sufficient" resolution for the answer, can be narrower than exact "scientific calculator" type answers.  FPGA ALUs are made more for DSP than calculator type stuff.  But sometimes you need a full resolution answer, which quite often requires a 33 x 33 = 65 operation.  And you can use that unit to do fast shifts and rotates too.

I know I extol the virtues of barrel processors too much, but multiple threads "live" in the pipeline pretty much for free, and all you have to do is replicate the register set for each thread.  You only need one ALU, memory, peripheral register set, opcode decoder, etc. all of which can be 100% utilized.  The timing model is as simple as these things get, which is a huge plus for embedded work, and having one interrupt per thread keeps you from having to implement complicated and confusing hierarchical hardware (interrupts need to be really simple).

[EDIT] Regarding the FIR: A few years ago an engineer contacted me about Hive.  He was doing synthetic aperture radar work in FPGAs and needed a simple brain for side calculations and control and such.  He was using shit tons of really high end FPGAs and pushing them to the max, with tons of amps flowing into his setup - really scary work!  To me, FPGAs and their ALUs are made exactly for that kind of work, where the whole device is doing FIRs because the data is highly phase sensitive.  Audio work is much less phase sensitive because the human ear is basically deaf to it, so recursive filters work fine most of the time.

Posted: 4/23/2019 6:07:27 AM
Buggins

From: Theremin Motherland

Joined: 3/16/2017


"This to me is the #1 reason NOT to make single-threaded processors.  They're often quite terrible at keeping the pipeline busy, and designers add hack after hack to ameliorate the situation somewhat, leading to a really complex model with indeterminate timing.  Add a cache to the whole mess and you have no real guarantee of meeting real-time unless you have gobs of overhead.  Go down this road waaaaaay too far and you end up doubling a billion transistor design just to get a 25% throughput improvement.  IMO it's madness.  Verification becomes a total nightmare (Pentium bug)."  - dewster


I like this idea. It's a kind of free "hyperthreading".
If a usual instruction fits a 4-stage pipeline, there are 4 threads for free, w/o argument-dependency and branch-prediction pain.
So, for a 200MHz pipeline, there are four 50MHz threads.
Multiplication will take an additional 4 stages, so multiplications will work at a 25MHz rate. Pipelined multiplication will take 4 more DSP blocks, in addition to the main ALU DSP block. A 32x32 register bank gives 8 registers per thread; 64x32 - 16 registers.
A shared BRAM per 4 threads allows easy communication between threads.
Having several such 4-thread cores will require a kind of NUMA for shared memory or inter-core memory access.
I believe 4 cores * 4 threads can fit into low-end FPGAs (10-15K LUTs).

I'm still thinking about a direct additive synthesis design on the FPGA.
1024 oscillators (f, 2f, 3f, ... 1024f), with phase and/or amplitude modulation using an additional 2048 oscillators, and a direct filter based on frequency->amplitude and frequency->phase-shift tables, should be achievable with 2-3K LUTs + a lot of BRAMs.
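A scaled-down Python model of the direct additive scheme (16 harmonics instead of 1024; one phase accumulator per harmonic, with table-driven amplitude and phase offset standing in for the frequency->amplitude / frequency->phase tables):

```python
import math

def additive_sample(phase, amps, phase_ofs, f0, fs):
    """One output sample: harmonics at f0, 2*f0, 3*f0, ..., each with
    its own phase accumulator, amplitude, and phase offset."""
    out = 0.0
    for k in range(len(phase)):
        phase[k] = (phase[k] + (k + 1) * f0 / fs) % 1.0   # per-harmonic accumulator
        out += amps[k] * math.sin(2.0 * math.pi * (phase[k] + phase_ofs[k]))
    return out

# sawtooth-like spectrum: amplitude 1/k over 16 harmonics
N = 16
phase = [0.0] * N
amps = [1.0 / (k + 1) for k in range(N)]
sig = [additive_sample(phase, amps, [0.0] * N, 110.0, 48000.0) for _ in range(480)]
```

In hardware each harmonic's accumulate/lookup/multiply would be one pass through a shared pipeline per sample, with state held in BRAM.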

Posted: 4/23/2019 11:43:41 AM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

"I like this idea. It's a kind of free 'hyperthreading'."  - Buggins

Yes, exactly.  It's extremely elegant and solves all sorts of problems without significantly increasing complexity.  Even the separate stack memory areas on mine are just pointer offsets into BRAM - the individual stack pointers "live" in the pipeline as well. 

I believe that barrel processing isn't more popular due to historical reasons.  Until very recently, software hasn't been able to really fully utilize multiple threads, and as a result has relied on single threaded performance too much, so processor designers have been pounding on that.  In that scenario, multi-threading is just another complexity multiplying hack, and we've seen the security holes that has opened up which no one seems to be able to get a good handle on.  The single thread model for memory hierarchy is also largely set in stone at this point - IMO, a processor should be placed in the middle of its own sea of memory, with no caching or management going on.  If you want security, make one thread supervisor, with a portion of memory space protected just for it (R/W for thread 0 say, and ROM for the other threads, with the region size some power of 2 settable by thread 0).

"If usual instruction fits 4-stage pipeline, there are x4 threads for free w/o argument dependency and branch prediction pain.
So, for 200MHz pipeline, there are 4 50MHz threads.
Multiplication will take additional 4 stages, so multiplications will work at 25MHz rate. Pipelined multiplication will take 4 more DSP blocks, in addition to the main"

I haven't worked through all possible scenarios, but by the time you load from RAM (several cycles through a byte barrel shifter), decode the opcode (2 cycles, lots of muxing!), select the operands (1 cycle), select the result (lots of muxing!), store to RAM, etc. there aren't all that many cycles difference between the various ALU operations, and (conveniently a power of 2) 8 stages seems like a bare minimum.  And you can use the time it takes to mux the results of operations that don't take as long as multiplication to break up, pipeline, and speed up the muxing operations, with multiplication / shift / rotate muxed in at the very last stage.  It can be a lot of design work shifting stuff around in the pipe, trying to keep the slowest case as fast as possible.
