"When writing verilog keeping hardware capabilities in mind, it's possible to achieve high performance solutions." - Buggins
Yes, the longer you do FPGA work the more you are aware of the abilities and limitations of the target hardware. I use a lot more intermediate registering now.
"Spartan 6 already is good enough and cheap. Although, Vivado does not support S6. Series 7 looks too expensive at first sight."
I agree, the low end offerings are really adequate now and fairly inexpensive. The high end is always crazy.
"Interesting features of Xilinx FPGAs:
LUT6 can be used as 2 LUT5 sharing the same inputs.
There are additional MUXes in slice - allow to mux LUT outputs w/o adding interconnect delays.
MUX 16x1 can be implemented in single slice (4 LUTs).
When using LUT6 as distributed ram, it has 4x bigger capacity than LUT4."
Believe it or not, the SW tools have much more impact on the design than the HW architecture, and both Altera and Xilinx offer much the same thing at different levels. IIRC they stopped suing each other over every little thing and entered a non-compete situation. I switched to Altera for my own stuff a while back because, for comparably featured FPGAs given a real design, Xilinx top speed was significantly lower (~75%) and priced higher.
"E.g. one slice (4LUTs) can work as 64x6 simple dual port RAM - register bank for 32x36bit registers can be built using only 6 slices."
A processor register bank usually needs a triple port memory, two reads and one write. I've seen this implemented as a double bank therefore needing 2x the resources.
"DSP is interesting - you can make ALU almost w/o additional external resources. Series 7 DSP supports logical operations."
I've looked into them a bit, but otherwise am ignoring the DSP blocks, waiting for them to make it to the very low end.
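The double-bank register file mentioned above (two reads, one write) can be modelled in a few lines. This is only a behavioural Python sketch, not HDL; the class name `DualBankRegisterFile` and the 32x36 default size are assumptions made for illustration.

```python
class DualBankRegisterFile:
    """2-read / 1-write register file built from two 1-read / 1-write RAMs."""

    def __init__(self, depth=32, width=36):
        self.mask = (1 << width) - 1
        # Each bank stands in for one simple dual-port RAM (one read, one write port).
        self.bank_a = [0] * depth
        self.bank_b = [0] * depth

    def write(self, addr, value):
        # Every write goes to both banks, so they always hold identical data.
        value &= self.mask
        self.bank_a[addr] = value
        self.bank_b[addr] = value

    def read2(self, addr_a, addr_b):
        # Each bank serves one of the two independent read ports.
        return self.bank_a[addr_a], self.bank_b[addr_b]


rf = DualBankRegisterFile()
rf.write(2, 100)
rf.write(3, 23)
print(rf.read2(2, 3))
```

The storage is duplicated purely to get the second read port, which is where the 2x resource cost comes from.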
"I believe it's possible to design really usable 32/36bit softcore with 16-32 registers and multiplication support (although 32x32 multiplication will require 4 clock cycles), with program and data stored in dual port BRAM. Low end FPGA could contain 32 soft cores while leaving most of logic resources for other usage."
I imagine sufficient program and data store BRAM for those 32 soft cores would be the bottleneck. And then you run into interconnect issues: should it be a star, a mesh, a shared bus, etc. The 32x32 mult taking 4x clocks means you need several more to mux and register, which is why Hive has an 8 deep pipeline. And if you're pipelining you may as well make a barrel processor IMO. Because much of the state resides in the pipeline, barrel processors aren't that much more complicated than conventional processors. They avoid many of the hazards, they fully utilize the ALU, they naturally intercommunicate via the memory, they share a single set of peripherals, and their timing model is completely deterministic. I think they're quite well suited to embedded applications.
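A rough way to see why a barrel processor avoids data hazards: with strict round-robin issue and at least as many threads as pipeline stages, a thread's previous instruction has always left the pipeline before its next one issues. The Python sketch below only models that scheduling argument; the function names and the thread counts tried are assumptions, not a description of Hive itself.

```python
def issue_schedule(num_cycles, num_threads):
    """Return (cycle, thread) pairs for strict round-robin issue, one issue per cycle."""
    return [(cycle, cycle % num_threads) for cycle in range(num_cycles)]


def has_hazard(schedule, pipeline_depth):
    """True if any thread issues while its previous instruction is still in the pipeline."""
    busy_until = {}  # thread -> last cycle its previous instruction occupies a stage
    for cycle, thread in schedule:
        if cycle <= busy_until.get(thread, -1):
            return True
        busy_until[thread] = cycle + pipeline_depth - 1
    return False


print(has_hazard(issue_schedule(64, 8), 8))  # 8 threads on an 8-deep pipeline
print(has_hazard(issue_schedule(64, 4), 8))  # only 4 threads on an 8-deep pipeline
```

With fewer threads than stages the same thread re-enters the pipeline too early, which is the case a conventional processor has to handle with forwarding or stalls.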
One or two BRAM instances per CPU core for program + fast accessible data would be enough for most audio DSP applications.
If a core does some calculations once per sample, then working at 100MHz with a 48KHz sample rate it gets 100000000/48000 ≈ 2083 clock cycles per sample. [...]
[...] 1 16 bit mux
00: ALU output low 16 bits
01: ALU output high 16 bits
10: external data bus
11: output of one of register file channels
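The four select codes listed above behave like a 4-to-1 multiplexer on 16 bit values. A minimal Python model follows, assuming a 32 bit ALU result whose halves feed the first two inputs; the function and argument names are made up for illustration.

```python
def mux4_16(sel, alu_out, ext_bus, reg_out):
    """Select one 16 bit source according to the 2 bit code `sel`."""
    sources = {
        0b00: alu_out & 0xFFFF,          # ALU output, low 16 bits
        0b01: (alu_out >> 16) & 0xFFFF,  # ALU output, high 16 bits
        0b10: ext_bus & 0xFFFF,          # external data bus
        0b11: reg_out & 0xFFFF,          # one of the register file read channels
    }
    return sources[sel]


print(hex(mux4_16(0b01, 0x12345678, 0xAAAA, 0xBBBB)))
```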
Total resources used:
60 LUTs (0.34%)
1 DSP (1.25%)
A minimal CPU will additionally require:
Barrel shift logic or at least one bit right shift signed/unsigned
Divider emulation (will take 16-32 cycles for division)
Result clamping support is very useful for DSP applications: optional ALU output values limiter: (v > max) ? max : (v [...]
[...] 3 register values read -> sign extension of ALU ports A, B -> ALU -> reg file input multiplexer -> register file write data
According to timing simulation, this path w/o additional pipelining registers may work at 100MHz.
W/o additional resource usage, pipelining registers may be added at reg file output, ALU inputs, ALU multiplier, ALU output (in this case, free accumulator + right shifter >> 17 is available).
However, for an instruction sequence where the next instruction depends on the result of the previous operation, wait cycles will be added.
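For the optional output limiter mentioned earlier, here is a small Python sketch of result clamping. The lower-bound branch is an assumption (the usual symmetric counterpart of the upper-bound check), and the signed 16 bit range used in `saturate16` is just one example of limits.

```python
def clamp(v, lo, hi):
    """Limit v to the inclusive range [lo, hi]."""
    if v > hi:
        return hi
    if v < lo:  # assumed counterpart of the upper-bound check
        return lo
    return v


def saturate16(v):
    """Clamp to the signed 16 bit range, as commonly done for audio samples."""
    return clamp(v, -32768, 32767)


print(saturate16(40000), saturate16(-40000), saturate16(1234))
```

Without saturation an overflowing sample wraps around to the opposite sign, which is far more audible than a clipped peak.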
Write port may work as read/write, e.g. first half cycle - read + latch result, second half -> write
16 bit operations will take 1 clock cycle
32 bit add/sub will take 2 clock cycles
32x32 bit multiplication will take 4 clock cycles
E.g. 32 bit fixed point +7.24 operations:
additions will take 2 cycles and multiplications - 4 cycles.
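The 2 cycle figure for 32 bit add follows from doing the low halves first and feeding the carry into the high halves. A Python sketch of that sequence, treating values as unsigned and using made-up helper names:

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """One 16 bit add step: returns (16 bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32_two_cycles(a, b):
    """32 bit add done as two 16 bit adds chained through the carry."""
    lo, carry = add16(a, b)                     # cycle 1: low halves
    hi, carry = add16(a >> 16, b >> 16, carry)  # cycle 2: high halves + carry
    return (hi << 16) | lo, carry


print(add32_two_cycles(0x0001FFFF, 0x00000001))
```

The same bit pattern arithmetic applies to two's complement values, so a 7.24 fixed point add needs nothing extra.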
Adding one more DSP could reduce 32x32 multiplication to 2 clock cycles.
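The 4 cycle count for 32x32 multiplication corresponds to four 16x16 partial products accumulated with shifts. The Python sketch below shows the unsigned case only; signed handling and the exact DSP port widths are left out, and the function name is an assumption.

```python
MASK16 = 0xFFFF


def mul32_four_steps(a, b):
    """Unsigned 32x32 -> 64 bit multiply from four 16x16 partial products."""
    al, ah = a & MASK16, (a >> 16) & MASK16
    bl, bh = b & MASK16, (b >> 16) & MASK16
    acc = 0
    # One partial product per step when a single multiplier is available.
    for x, y, shift in ((al, bl, 0), (ah, bl, 16), (al, bh, 16), (ah, bh, 32)):
        acc += (x * y) << shift
    return acc


a, b = 0xDEADBEEF, 0x12345678
print(mul32_four_steps(a, b) == a * b)
```

With a second multiplier two partial products can be computed per step, which is how the count drops to 2 cycles.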
100MHz for 16 bit operations and 50MHz for 32bit operations emulated on 16(18) bit CPU with low FPGA resources is probably enough for real world applications.
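To put those rates in terms of audio work, here is a quick per-sample budget in Python, assuming a 100MHz clock, a 48KHz sample rate and the cycle counts listed above, with no wait cycles (an optimistic assumption).

```python
CLOCK_HZ = 100_000_000
SAMPLE_RATE_HZ = 48_000

# Cycle costs taken from the list above.
CYCLES_PER_OP = {"16-bit op": 1, "32-bit add/sub": 2, "32x32 multiply": 4}

cycles_per_sample = CLOCK_HZ // SAMPLE_RATE_HZ
budget = {name: cycles_per_sample // cost for name, cost in CYCLES_PER_OP.items()}

print(cycles_per_sample)
for name, count in budget.items():
    print(f"{name}: up to {count} per sample")
```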
Fast interrupt handling.
If register bank is based on Xilinx series 7 distributed ram blocks, minimal number of registers is 32. Probably, it's too much, especially if we try
Some thoughts about ISA (for both 32 or 16 bit soft cores).
It makes sense to try fitting ISA to 18 bits for better utilization of BRAM resources and have dense code.
3 address (registers) arithmetic might speed up execution and avoid unnecessary register-register transfers.
E.g. R1 = R2 + R3 + carry
R1 = R1 + (R2 * R3) + carry
For a 16 bit core, it makes sense to support 32 bit data instructions (e.g. a sequential pair of registers is used when a 32bit value is needed).
3 registers (for 16 registers bank) require 3*4 = 12 bits in instruction. For 16bit instruction, only 4 bits are left - it's obviously not enough to encode all 3 address operations we want (e.g. only 8..15 may fit there).
But since BRAM has native width 18 bits (16+2), it makes sense to utilize all 18 bits for instruction.
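With 18 bits per instruction and 12 bits taken by three 4 bit register numbers, 6 bits (64 codes) remain for the opcode. A Python sketch of one possible packing follows; the field order (opcode, rd, ra, rb) is purely hypothetical and not something fixed by the discussion above.

```python
def encode(opcode, rd, ra, rb):
    """Pack a 3-address instruction into 18 bits: [17:12] opcode, [11:8] rd, [7:4] ra, [3:0] rb."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    for reg in (rd, ra, rb):
        if not 0 <= reg < 16:
            raise ValueError("register number must fit in 4 bits")
    return (opcode << 12) | (rd << 8) | (ra << 4) | rb


def decode(word):
    """Unpack an 18 bit instruction word into (opcode, rd, ra, rb)."""
    return (word >> 12) & 0x3F, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


word = encode(5, 1, 2, 3)  # e.g. a hypothetical "R1 = R2 + R3" with opcode 5
print(hex(word), decode(word))
```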
Since we have twice as many registers as the ISA requires (32 vs 16), we can divide them into two banks and use the second half for interrupt handling, simply by setting bit 4 of the register number fed to the register bank.
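The bank switch amounts to using an interrupt flag as the top address bit of the 32 entry register bank. A minimal Python sketch, with a hypothetical function name:

```python
def physical_reg(reg, in_interrupt):
    """Map a 4 bit ISA register number to a 5 bit register bank address."""
    # Bit 4 selects the bank: 0 = normal code, 1 = interrupt handler.
    return (int(in_interrupt) << 4) | (reg & 0xF)


print(physical_reg(3, False), physical_reg(3, True))
```

Since the handler gets its own 16 registers, nothing needs to be saved or restored on entry, which is what makes the interrupt handling fast.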
In 16bit architecture 32x32 multipliers may use register bank address to do multiplexing of high / low 16 bits of 32bit value.
For 32bit architecture 32->16 multiplexing would require additional mux which may cause delay in data processing path. But it's possible to utilize in-slice hardware muxes after LUT outputs - and have free (no additional slices / delays) mux for low half of register bank outputs to have high or low part of register value in low 16 bits of register bank output(s).