"Do you remember bit-slice archetecture?"
After looking into it a bit :) it seems to be a way for the designer to build data paths of varying widths with a single family of products, and a way for manufacturers to deal with the very limited pin counts of earlier IC packages (when you work with FPGAs you find out very quickly that pin management is something of a job in itself).
I'm always on the lookout for simple processors, because often much of the hardware in an FPGA design is underutilized due to low clock speed. If latency isn't an issue, then why not multiplex the hardware with a processor construct? But I'd like it to be really simple (stateless would be the ideal), not consume much in the way of logic (LUTs, block RAMs, multipliers), have compact op codes (internal block RAM isn't cheap), have high utilization of the ALU (the whole point), and probably most important: be easy to program at the machine code level so I don't need an assembler.
Ever since my first HP calculator I've been fascinated with stack machines. With no explicit operands, a data stack, a return stack, and almost no internal state, one can have incredibly compact op codes - often 5 bits will do. They can be very interruptible, and code factoring with subroutines is more natural due to the stacked registers. I've studied many of these, coded my own, and had them running on a demo board. They are easy to implement but surprisingly cumbersome to program - one has to stick loop indices under the operands on the stack or in memory somewhere, and a lot of operations and time get wasted on simple stack manipulation. The tiny non-standard op code lengths produce a natural instruction caching mechanism, but packing multiple op codes per word is awkward when you want to manually change the code in any way.
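To make the stack-juggling complaint concrete, here's a toy sketch (my own made-up example, not from any particular machine): summing 1..n with both the running sum and the loop index kept on the data stack. Only the adds do useful work; the dup/rot/swap traffic is pure bookkeeping.

```python
# Toy sketch (illustrative only): sum the integers 1..n on a stack machine
# where both the running sum and the loop index live on the data stack.
# Assumes n >= 1. Note how much of each iteration is dup/rot/swap
# bookkeeping just to get the right operands on top.
def sum_on_stack_machine(n):
    stack = []

    def push(x): stack.append(x)
    def dup():   stack.append(stack[-1])
    def swap():  stack[-1], stack[-2] = stack[-2], stack[-1]
    def rot():   stack[-3], stack[-2], stack[-1] = stack[-2], stack[-1], stack[-3]
    def add():   b = stack.pop(); stack[-1] += b

    push(0)          # sum
    push(n)          # index (sits on top of the sum)
    while True:
        dup()        # sum index index
        rot()        # index index sum
        add()        # index sum'        <- the only "real" work
        swap()       # sum' index
        push(-1)
        add()        # sum' index-1
        if stack[-1] == 0:
            break
    stack.pop()      # drop the spent index
    return stack.pop()

print(sum_on_stack_machine(10))  # 55
```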
If you are interested in processors and you've never looked at the Xilinx PicoBlaze, you should - it is a marvel of tiny engineering. Kind of an overgrown register-based state machine. But the op codes are not a power of 2 in length, and addressing the registers bloats them out.
I'm coming to the conclusion that any serious processor has to have multiple registers with operand addresses in the op code - this allows a move along with an ALU operation in a single cycle. Self-evident to many I suppose; I'm so slow when it comes to the fundamental stuff. After looking at the J processor I had a thought - why not do a 2-operand machine (source and source/destination addresses in the op code) with 4 generic registers, and have stacks under those registers with bits in the op code to enable/disable them? A 16-bit machine with 4 bits for operand selection, 4 bits for stack enables, and 8 bits for the operation itself. Large literals would be in-line; small ones could fit in the op code field.
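Just to pin the idea down, here's a rough sketch of how such a 16-bit op code might be carved up and decoded (the field order and names are my own guesses, not a settled design):

```python
# Hypothetical field layout for the 16-bit op code described above
# (one possible ordering, purely illustrative):
#   [15:8] operation        8 bits
#   [ 7:4] stack enables    1 bit per register (push/pop under it)
#   [ 3:2] source register  one of 4
#   [ 1:0] source/dest reg  one of 4
def decode(word):
    return {
        "op":     (word >> 8) & 0xFF,
        "stk_en": (word >> 4) & 0x0F,
        "src":    (word >> 2) & 0x03,
        "dst":     word       & 0x03,
    }

print(decode(0x12B6))  # {'op': 18, 'stk_en': 11, 'src': 1, 'dst': 2}
```

With only 2 bits per operand the register addressing stays cheap, which is the whole point of limiting it to 4 registers.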
Anyway, it's this kind of massive side diversion that is (I hope not pointlessly) extending the design of my (mostly) digital Theremin, but I'm hoping to parlay much of the foundation established here into other designs. And it's nice to revisit things I've spent enormous amounts of time pondering and tinkering with, with little to show for it in the end. Though it does feel a bit scarily like answer C of that humorous Engineer Identification Test, where:
You walk into a room and notice that a picture is hanging crooked. You...
A. Straighten it.
B. Ignore it.
C. Buy a CAD system and spend the next six months designing a solar-powered, self-adjusting picture frame while often stating aloud your belief that the inventor of the nail was a total moron.