Training Neural Audio Networks

Available as a blog post or jupyter notebook on Github.

I'm starting a new job at an AI lab soon, working on engineering for RL systems, and I want to do a bit of hands-on NN work to get a sense for what a notebook research workflow might look like, since I imagine a bunch of my coworkers will be using workflows like this.

I've recently been working on some music production plugins, so I have music on my mind. Let's figure out how to train a model to emulate various audio effects, like distortion effects or reverbs. Maybe you have some incredible-sounding guitar pedal that you want a digital simulation of. Can you automatically create such a simulation using machine learning?

I've not read much neural network literature at this point. Basically everything I know about NNs is just from discussions with my friend who works on neural compression codecs. I'll probably do some stupid stuff here, call things by the wrong name, etc., but I find it's often more educational to go in blind and try to figure things out yourself. I'm also only vaguely familiar with pytorch; my mental model is that it's a DAGgy expression evaluator with autodiff. Hopefully that's enough to do something useful here.

I figure an architecture that might work pretty well is to have the previous $m$ input samples, $[i_{n+1}, i_{n+2}, ..., i_{n+m}]$, and the previous $m-1$ output samples, $[o_{n+1}, o_{n+2}, ..., o_{n+m-1}]$, and train the network to generate the next output sample, $o_{n+m}$:

```
[ m prev inputs | m-1 prev outputs ]
                 |
                 v
             .-------.
             |  The  |
             | Model |
             '-------'
                 |
                 v
           [next output]
```

This architecture seems good to me because:

  1. It's easy to generate examples of input data where we can train the model 1 sample at a time
  2. It gives the model enough context to (theoretically) recover information like "what is the current phase of this oscillator"
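To make this a bit more concrete, here's a rough PyTorch sketch of what I have in mind. The plain fully-connected stack, the layer sizes, and the context length `M` are all placeholders I'm making up for illustration, not anything I've tested:

```python
import torch
import torch.nn as nn

M = 64  # hypothetical context length, not tuned at all

class EffectModel(nn.Module):
    """Maps [m previous inputs, m-1 previous outputs] -> the next output sample."""

    def __init__(self, m: int = M, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * m - 1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2m - 1) -> prediction: (batch, 1)
        return self.net(context)

def make_example(inputs: torch.Tensor, outputs: torch.Tensor, n: int, m: int = M):
    """Build one (context, target) training pair from aligned dry/wet recordings.

    `inputs` and `outputs` are 1-D float tensors in [-1, 1]. Using the indexing
    from the text: context = [i_{n+1}..i_{n+m}, o_{n+1}..o_{n+m-1}], target = o_{n+m}.
    """
    context = torch.cat([inputs[n + 1 : n + m + 1], outputs[n + 1 : n + m]])
    target = outputs[n + m : n + m + 1]
    return context, target
```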

As for how to actually represent the audio, some initial thoughts:

  1. You could just pass straight-up $[-1,1]$ float audio into the model and pull straight-up float audio out of it
    1. The model will have to grow hardware to implement a threshold detector ADC or something, which seems kind of wasteful
    2. This fails to accurately capture entropy density in human audio perception. A signal with peak amplitude 0.01 is often just as clear to humans as a signal with peak amplitude 1.0. Which brings us to a possible improvement
  2. Pre- and post-process the audio with some sort of compander, like a $\mu$-law
    1. This will probably help the model maintain perceptual accuracy across wide volume ranges (see the sketch after this list)
  3. Encode a binary representation of the audio. This feels to me like it will probably be difficult for the model to reason about
  4. Use some sort of one-hot encoding for the amplitude, probably in combination with a companding algorithm. This might be too expensive on the input side (since we have a lot of inputs), but could work well on the output side. We could have 256 different outputs for 8-bit audio, for example
    1. We could also possibly allow the network to output a continuous "residual" value for adding additional accuracy on top of the one-hot value
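Option 2 is concrete enough to sketch. This is just the standard $\mu$-law formula wrapped in tensor ops; $\mu = 255$ is the usual telephony value, picked here purely as an example:

```python
import math
import torch

MU = 255.0  # mu = 255 is the telephony standard; any mu > 0 works

def mu_law_encode(x: torch.Tensor, mu: float = MU) -> torch.Tensor:
    """Compand [-1, 1] audio so quiet signals get more of the range."""
    return torch.sign(x) * torch.log1p(mu * torch.abs(x)) / math.log1p(mu)

def mu_law_decode(y: torch.Tensor, mu: float = MU) -> torch.Tensor:
    """Invert mu_law_encode."""
    return torch.sign(y) * torch.expm1(torch.abs(y) * math.log1p(mu)) / mu
```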

For now, let's just try the first thing and see how it goes. I expect it will work OK for well-normalized (loud) audio and poorly for quiet audio.

As for our loss function: probably some sort of perceptual similarity metric would be best, followed by a super simple perceptual-ish metric like squared error in $\mu$-law space. However, to start with, let's just do linear mean squared error. Should be OK for loud audio.
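For reference, a plain-MSE training step might look roughly like this (reusing the hypothetical `EffectModel` from the sketch above; batching and data loading are hand-waved, and the previous-output part of each context is ground truth during training, i.e. teacher forcing):

```python
import torch

model = EffectModel()  # hypothetical model from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def training_step(contexts: torch.Tensor, targets: torch.Tensor) -> float:
    """One gradient step on a batch of (context, target) pairs.

    contexts: (batch, 2m - 1), targets: (batch, 1). At inference time the
    model's own predictions would be fed back into the context instead of
    ground-truth previous outputs.
    """
    optimizer.zero_grad()
    prediction = model(contexts)         # (batch, 1)
    loss = loss_fn(prediction, targets)  # plain linear MSE, as discussed
    loss.backward()
    optimizer.step()
    return loss.item()
```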
