Let's Design and Build a (mostly) Digital Theremin!

Posted: 3/21/2018 9:07:09 PM

From: Germany

Joined: 8/30/2014

 Jacking my jaw and lips wide open and sticking out my tongue seems to minimize many of the confounding resonances

Sticking your tongue out also raises your larynx (depending on how far out, and whether it's already raised, of course) it may get a little "constipated" in the throat around it, probably doesn't sound natural. That's especially so because pulling your jaw down all the way tends to engage the muscles that lower the larynx (the two longer ones, below it and to the left and right, running down the throat). If you pull in the other direction with the tongue, the larynx tends not to appreciate that (excess tension), and it limits what the vocal apparatus can do.
Okay, there are people who use their voice like that, but I wouldn't say it sounds pleasant :-D

Posted: 3/22/2018 2:47:03 PM

From: Northern NJ, USA

Joined: 2/17/2012

"Sticking your tongue out also raises your larynx (depending on how far out, and whether it's already raised, of course) it may get a little "constipated" in the throat around it, probably doesn't sound natural."  - tinkeringdude

Oh, I agree.  But every time I record my own vocal sounds I find that the details are lost in a sea of resonance, and I don't want to go the route of every other researcher and spend the rest of my life doing LPC and other forms of inverse filtering just to get a clue as to what the larynx is up to.  If I can practice a bit and generate pretty much the same sounds without too much resonance then maybe I can sidestep most of the pain.  I'm really just trying to get a sanity check after reading too many papers and coming up empty.  Sometimes it's tough seeing a way forward.

Yesterday's flailing around didn't get me anywhere either, and I realize today that I should be working on making the basics sound more realistic, as I've found that adding obvious little features can sometimes open new avenues to explore.  Adding weird pulses and stuff to the fry feels like a wrong move.

I've complained before that the vocal volume control sounds just like that: someone adjusting the volume control on an audio amplifier.  It's not all that hard to control the strength of a high pass filter to accentuate harmonics with volume.  It's likely even more straightforward just to pipe the volume number over to the sine phase power harmonics control.  I've limited the harmonic level control to the range [0.5:1) in order to limit aliasing, but there are plenty more harmonics in the range [0:0.5) to be mined if necessary.

And I need to implement some kind of threshold control for the onset of vocalization.  My own voice jumps from breathy aspiration to low level vocalization. I'm hopeful that this threshold can be manipulated with noise to give a more realistic turbulent onset.  If I can achieve that then I think I'll finally be OK with the vocal sim.

Posted: 3/27/2018 6:29:45 PM

From: Northern NJ, USA

Joined: 2/17/2012

Volume Envelope Knee

Sometimes I'm stymied by what seems like a simple problem.  When that happens I often find that utilizing the visual centers of my brain will get me through it much faster.  I always do this with digital design: I draw the logic pipeline, and then some waveforms associated with it - sort of a paper sim of critical functionality before I implement the real thing in SystemVerilog.  

With the Theremin assembly code I'm up against modulo math a lot, and drawing graphs of the various steps makes it much clearer.  For example, I wanted to add a variable "knee" or downward slope to the volume envelope to make the vocal transition more realistic.  At first I looked at the obvious geometric ways to do this (x intercept, slope, etc.) but in the end it was much easier to generate a separate sloped signal and subtract this from the volume response:

Above we see the volume response 'X', which is linear, on the left.  The point 'T' is where we desire the break or knee.  If we subtract 'X' from 'T' we get the center response, and if we multiply this by 'K' we get something that hinges about 'T' at the x axis.  We can then subtract this from the original input to get our knee response.  All of these operations are full-scale 32 bit, and saturating math is used throughout.  I have many of these functions in the integer library I wrote, and it would be great if I could include them in the Hive op codes as they seem to get used a lot, but I have doubts that the Hive pipeline would allow it.
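
For anyone who wants to play with the idea, the subtract / multiply / subtract sequence described above can be sketched in a few lines of Python. This is only an illustration, not the Hive assembly: it assumes unsigned 32-bit values and a plain integer gain for 'K', and the helper names (sat_sub, sat_mul, knee) are made up for the example.

```python
U32_MAX = 0xFFFFFFFF  # full-scale unsigned 32-bit value (assumed number format)

def sat_sub(a, b):
    """Unsigned saturating subtract: floors at 0 instead of wrapping."""
    return a - b if a > b else 0

def sat_mul(a, k):
    """Unsigned saturating multiply by an integer gain: clamps at full scale."""
    return min(a * k, U32_MAX)

def knee(x, t, k):
    """Bend the linear response x downward below threshold t with gain k."""
    ramp = sat_sub(t, x)      # T - X: falls to 0 at X = T and stays 0 above it
    ramp = sat_mul(ramp, k)   # K * (T - X): steeper, but still hinged at T
    return sat_sub(x, ramp)   # X - K * (T - X), floored at 0
```

With t = 1000 and k = 3, inputs at or above 1000 pass through unchanged, an input of 900 comes out as 600, and anything at or below 750 is floored to zero, so the response hinges at 'T' as described.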

So the volume processing side is now:

  operating point acquisition => filtering => linearization => velocity envelope => knee => attack / decay via slew limiting => EXP2 => to stuff

I find that placing the knee before the attack / decay lets me soften the decay knee a bit in time, which sounds more realistic from a vocal perspective.  Voice often jump-starts, but then later can taper off below the starting level.  I wound up sharing the variable threshold 'T' knob between both the velocity envelope and knee functions, as the double-duty seems to make sense here.
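
As a rough illustration of the attack / decay stage that follows the knee, here is a minimal slew limiter sketch in Python. The function names and the per-sample step sizes are assumptions for the example, not the actual implementation.

```python
def slew_limit(current, target, attack_step, decay_step):
    """Move current toward target by at most one step per sample."""
    if target > current:
        return min(current + attack_step, target)  # rising: limited by attack rate
    return max(current - decay_step, target)       # falling: limited by decay rate

def run_envelope(samples, attack_step, decay_step):
    """Feed a sequence of (post-knee) volume values through the slew limiter."""
    out, level = [], 0
    for target in samples:
        level = slew_limit(level, target, attack_step, decay_step)
        out.append(level)
    return out
```

Because the limiter sees the output of the knee, any abrupt drop the knee produces gets spread out over the decay time, which is the softening mentioned above.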


DSP Architecture

I'm still pursuing the goal of designing the signal path for vocal sim first, and anything else that's easy after that.  I've developed all the modules for the following architecture, and have had most of it up and running for a while now:

On the left: I need to add volume axis and possibly pitch axis modulation to the oscillator harmonic content adjustment.  The noise has a dedicated filter because it usually can't be used raw, even if it is being processed by the formant bank.  In the center: there are two formant banks; the top one is for nose and throat radiation, and the bottom one is for mouth radiation.  The filter following the lower one is for lip modulation.  On the right: the formants are summed and we can do final stuff like baffle simulation.  Then we ship it to the SPDIF module for D/A conversion.

I've currently got 6 formant bandpass filters, though voice sounds fine to me with just 4 (2 + 2 in the above plan).  They might benefit from a bit of volume axis / pitch axis modulation to their center frequencies.  It's easy to go crazy on this stuff and provide an adjustment for everything, but then you end up lost in a sea of screen menus.  Looking for a sweet spot of maximal flexibility with minimal control.

Posted: 3/28/2018 11:19:19 PM

From: Northern NJ, USA

Joined: 2/17/2012

Getting Closer

I piped the volume axis over to the oscillator harmonic brightness (which is accomplished via phase modulation as outlined in an earlier post) and gave the strength a knob.  In academic papers they describe the harmonic level as "openness" or "effort", and making it dynamically variable is somewhat subtle sounding, but ultimately makes a massive difference in the level of vocal realism.  The new volume knee and a bit of decay time also help.  See what you think: (link).

Laughter is pretty easy to do with a Theremin controller.  Read some papers on it (here's one: link) and there's a web site (link) where they automate / articulate it via a laptop (they can certainly do a wide variety of laugh types, but I think the underlying voice sounds somewhat artificial).

Posted: 3/29/2018 9:42:15 PM

From: Northern NJ, USA

Joined: 2/17/2012

Envelope Generator Reorg

While installing the knee function in the envelope generator, I took the opportunity to rewrite the peak hold and velocity sensing code.  The velocity sense & injection has always struck me as a "bag on the side" but today I believe I've formulated a much more unified approach.

The old way:
  volume number => (+) velocity sense & trigger => knee => peak hold => attack & decay (slew limit) => EXP2 =>

The new way:
  volume number => knee => (+) velocity sense => peak hold => attack & decay (slew limit) => EXP2 =>

The old way used a window, stored peak positive velocity inside the window, spit this out when the hand crossed the closer window edge, multiplied this by a gain factor and took LOG2 to give it some dynamic range.

The new way relies on the knee to naturally gain up the velocity and provide a small window, so the velocity sense is just a floored subtraction.  There's almost nothing to it and, while I haven't spent a lot of time polishing on it, retriggering seems a bit more reliable.  It also gives a natural emphasis to attack (and so de-emphasis to decay if adjusted correctly) so vocal disengagement seems a shade more realistic now, and doesn't rely entirely on fixed decay timing (which quickly gets dull and repetitive sounding).
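
A minimal sketch of what a floored-subtraction velocity sense might look like, in Python. It assumes the subtraction is between successive post-knee samples and that the result is simply added back in with saturation; both of those, along with the class and variable names, are guesses for illustration rather than the real code.

```python
U32_MAX = 0xFFFFFFFF  # full-scale unsigned 32-bit value (assumed number format)

class VelocitySense:
    """Adds positive-going velocity to the signal; ignores retreating motion."""

    def __init__(self):
        self.prev = 0  # previous post-knee sample

    def step(self, x):
        velocity = max(x - self.prev, 0)   # floored subtraction: rises only
        self.prev = x
        return min(x + velocity, U32_MAX)  # saturating add of the velocity
```

Since only rising input contributes, the attack gets emphasized while the decay passes through untouched, which matches the behavior described above.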

[EDIT] It does a marching band snare pretty well (link).

And I'll say this about development: you get a lot of fairly painless practice time in while demoing and polishing new features.  I get maybe 1/2 hour per day or more of practice this way, which is sufficient to progress fairly quickly.

Posted: 3/29/2018 11:21:07 PM

From: Tucson, AZ USA

Joined: 2/26/2011

I guess if you listen to any digital noise long enough it begins to sound good. I admired Stephen Hawking (RIP) and thought his voice after a while sounded rather natural.

Posted: 3/30/2018 2:50:03 PM

From: Northern NJ, USA

Joined: 2/17/2012

"I guess if you listen to any digital noise long enough it begins to sound good."  - Touchless

So digital = noise?  Literally all audio you've heard in the last 20 years or so has been digitized (including LP recording, engineering, and mastering) so it must be hell on earth for you.

"I admired Stephen Hawking (RIP) and thought his voice after a while sounded rather natural."

You seem to be implying that my Theremin sounds robotic.  There's no accounting for taste I suppose.

Look, I'm all ears when it comes to constructive criticism, but you're just blindly tossing bombs.

Posted: 4/3/2018 8:46:06 PM

From: Northern NJ, USA

Joined: 2/17/2012

Presets & Menus

Just a quick video to give you a better view of what I'm staring at when playing the prototype:

Tried to adjust the exposure so the LEDs wouldn't blow out / bloom, but then the LCD is kinda dim and the whole video is rather dingy looking.  And the LEDs are flickering in a rasterized way that they don't in real life, but I hope you can get the general idea.  First a major scale, then a walk through the menus, then some playing with the few presets that I've got stored.

Jimmied with the volume axis modulation of the oscillator harmonic level - it now has more range and is limited to only increasing the brightness, not decreasing it.
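
Roughly what a brighten-only mapping could look like, sketched in Python. It assumes normalized floating point values (volume and strength in 0 to 1, harmonic level in the [0.5:1) range mentioned earlier in the thread); the names and the linear mapping itself are assumptions for illustration, not the real fixed-point code.

```python
LEVEL_MIN = 0.5               # lower bound of the harmonic level range
LEVEL_MAX = 1.0 - 2.0 ** -24  # just under 1.0, since the range is half-open

def harmonic_level(base, volume, strength):
    """Raise the harmonic level with volume; never drop below the base setting."""
    base = min(max(base, LEVEL_MIN), LEVEL_MAX)
    # Negative inputs are floored so the modulation can only add brightness.
    boost = max(volume, 0.0) * max(strength, 0.0) * (LEVEL_MAX - base)
    return min(base + boost, LEVEL_MAX)
```

At zero volume the output sits at the base setting, and at full volume and full strength it reaches the top of the range.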

Also added a "drop" function to the noise envelope, which gives control over what the noise does as you cross the threshold.  Leveling it out or decreasing it at the onset of vocalization seems to add realism, though I'm slowly coming to the conclusion that I'll probably never be 100% satisfied with the addition of noise to the vocals.

I'm using pitch correction here; you can maybe catch it in action around 2:00 or so.  There's a fine line between too much and too little, and it can mess you up sometimes, but after using it for a while now it seems to be a net positive.


Found and ordered this RAM brand ball joint (part number RAP-B-103U-A) on eBay ($12 USD, free shipping):

It arrived yesterday and it seems like something that would work quite well for the next prototype or even the finished product.  Unlike the CCTV mounts I'm using now, the 1" diameter rubber ball really grips without having to massively torque the tightener down.  I wish it were available with a ball flange on both ends (gotta order that separately for $6.50 - part number RAP-B-202U).  Except for the tightening bolt, it's all durable seeming (reinforced?) plastic (and rubber of course).

Posted: 4/5/2018 3:34:52 PM

From: germany, kiel

Joined: 5/10/2007

Is the normal volume behaviour (the closer the quieter) also an option?
And then I am interested in the very low frequencies. How deep can it be?

Posted: 4/5/2018 10:59:03 PM

From: Northern NJ, USA

Joined: 2/17/2012

"Is the normal volume behaviour (the closer the quieter) also an option?"  - Dominik

Yes, it is a switchable option.  Here's my video from several months ago demonstrating the inverse volume response (as well as user adjustable volume field size and location):

"And then I am interested in the very low frequencies. How deep can it be?"

On the pitch screen I've an octave bank switch that goes from -4 octaves to +3 octaves.  Another earlier video showing some low (and high) end with a sine wave:

The -4/+3 bank limits are arbitrary.  The high end is absolutely limited to C9 (~8.372 kHz), but the low end could go as low as you like (deep sub-Hz).
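
To put numbers on the bank switch, here is a small Python sketch. It assumes equal temperament with A4 = 440 Hz and MIDI-style note numbering (A4 = 69, C9 = 120), and it treats the C9 limit as a simple clamp; the function and constant names are made up for the example.

```python
A4_HZ = 440.0
C9_HZ = A4_HZ * 2.0 ** ((120 - 69) / 12)  # C9 is 51 semitones above A4
BANK_MIN, BANK_MAX = -4, 3                # octave bank switch range

def apply_octave_bank(freq_hz, bank):
    """Shift a frequency by whole octaves, clamped to the bank range and to C9."""
    bank = max(BANK_MIN, min(bank, BANK_MAX))
    return min(freq_hz * 2.0 ** bank, C9_HZ)
```

So an A4 played in the -4 bank comes out at 27.5 Hz, and anything that would land above C9 is held at C9.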

I think it's nearing the time to finalize the DSP path, finalize better cabinetry and hardware, finalize the circuitry and get some PWBs made and populated, and ship at least one demo around to folks to get their feedback.
