This program reads audio from an analog microphone, distinguishes between the spoken digits 'zero' to 'nine', and logs the results over a serial link.
It is mostly just a proof-of-concept audio-recording and feature-extraction front-end for the Arduino Nano (Atmel ATmega328P microcontroller).
It is a follow-up to a previous project:
"Simple Speech-To-Text on the '10 cents' CH32V003 Microcontroller
https://github.com/brian-smith-github/ch32v003_stt
except this time, instead of going for 'lowest cost', I was aiming for the lowest-power microcontroller feasible (the thinking being: fewer transistors = more power-efficient).
It is based on surprising results from experiments into how computationally lightweight an alternative feature-extractor front-end for the open-source 'Whisper' speech-to-text application could be.
(See https://github.com/brian-smith-github/mel80_64_int_top5)
The results suggested that it should be possible to produce accurate transcriptions using only integer maths at very low resolutions (<16-bit) and an (unrolled, optimized) FFT algorithm, which prompted this test implementation.
(see https://github.com/brian-smith-github/intfft128_unrolled and https://github.com/brian-smith-github/intfft128_unrolled_2bit)
Currently the accuracy seems to be about 95%, i.e. a 5% word error rate. Obviously, the more training examples of each number are provided, the better the results (each trained word uses 129 bytes, and there's 20K of the 32K flash free beyond the main code, so plenty of scope for this), but I'm not going to waste any more time on this. (I've got other projects to waste time on.)
Compared to the CH32V003 used in the previous project:
- 8-bit vs 32-bit: lower-resolution math, but it still seems to be feasible.
- both have an equally awful, noisy 10-bit ADC, so recorded audio quality is not great either way.
- neither has an FPU, obviously, so integer fixed-point math is needed throughout.
- both have 2K of RAM, but since the FFT (the largest memory consumer) now works with 16-bit real+imaginary components instead of 32-bit, the memory can be used more efficiently.
- 32K of flash on the 328 vs 16K on the CH32V003, along with smaller instruction sizes, potentially allows for more code unrolling and more training examples.
- 16MHz vs 48MHz, so only about 1/3 of the instruction cycles are available; timing is tight.
- the 328 at least has an 8-bit hardware multiplier with a 16-bit result, so things like windowing the input frame data and speedy matrix multiplies become more viable.
The processing pipeline works as follows (illustrative code sketches for several of these steps follow the list):
- a timer is set up to generate an interrupt 6400 times/second. On each interrupt the ADC result is read and the next conversion started. The samples get a 1.0 pre-emphasis (essentially delta-coding) to flatten the spectrum and remove any DC level (sketched below).
- every 64 samples (10ms), the buffer of the last 128 samples is scaled to 4-bit range and a 5-bit Hann-like window is applied (based on a window function from the LPCNet codec; sketched below).
- a 128-wide FFT is then applied to the data to generate a 65-bin spectrum (64 bins excluding 0Hz), with the real/imaginary components maintained within a 16-bit range.
- rather than go through the slow process of calculating the 32-bit magnitude of each of the 65 16-bit complex frequencies of the FFT output (i.e. real*real + imag*imag), a lightweight magnitude-estimation heuristic is used to determine the 'peaks' in the spectrum (based on https://dspguru.com/dsp/tricks/magnitude-estimator/; sketched below).
- the FFT output is roughly converted to log2 scale at 3-bit resolution (8 levels); it appears the ear can only really distinguish between under a dozen distinct log-scale volume levels (sketched below).
- 22 approximately mel-scale bins are calculated from the 65 log2 FFT bins.
- an 8-bin cepstrum is calculated from the 22 log-mel bins (the mel and cepstrum steps are sketched together below).
- when the 'energy' of the frame (the overall scaling required to drop the frame sample data to 4-bit range) is above a certain level, the frame is added to a buffer; otherwise a count of 'silence' frames is incremented. When enough silence frames have passed to signify 'end of word', the utterance's length is warped to exactly 16 frames and compared to entries in a 'cepstrum-tensors-to-spoken-digits' codebook, and the closest match is reported (sketched below).
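To make the pipeline concrete, the sketches below flesh out several of the steps in C. They are rough reconstructions under stated assumptions, not the actual firmware: every function name, variable name, and constant is made up unless mentioned above. First, the 6400/sec sampling interrupt with pre-emphasis, assuming a stock 16MHz ATmega328P with the microphone output on A7:

    #include <avr/io.h>
    #include <avr/interrupt.h>

    volatile int16_t sample_buf[128];   // ring buffer of pre-emphasized samples
    volatile uint8_t sample_pos = 0;
    volatile uint8_t frame_ready = 0;   // set every 64 samples (one 10ms hop)
    static int16_t prev_raw = 0;

    void setup_sampling(void) {
      TCCR1A = 0;
      TCCR1B = (1 << WGM12) | (1 << CS10);  // Timer1 CTC mode, no prescaler
      OCR1A  = 2499;                        // 16000000 / (2499+1) = 6400Hz
      TIMSK1 = (1 << OCIE1A);               // enable compare-match interrupt
      ADMUX  = (1 << REFS0) | 7;            // AVcc reference, channel A7
      ADCSRA = (1 << ADEN) | (1 << ADSC)    // enable ADC, start first conversion
             | (1 << ADPS2) | (1 << ADPS1) | (1 << ADPS0); // /128 = 125kHz ADC clock
      sei();
    }

    ISR(TIMER1_COMPA_vect) {
      int16_t raw = ADC;                    // read the completed 10-bit result
      ADCSRA |= (1 << ADSC);                // kick off the next conversion
      sample_buf[sample_pos & 127] = raw - prev_raw; // 1.0 pre-emphasis (delta-coding)
      prev_raw = raw;
      if ((++sample_pos & 63) == 0)
        frame_ready = 1;                    // main loop windows/FFTs the frame
    }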
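The scale-to-4-bit-and-window step might look roughly like this (the window table contents are a placeholder; the real firmware uses an LPCNet-derived window). The 8x8->16 hardware multiply mentioned earlier is what makes the per-sample multiply cheap:

    #include <avr/pgmspace.h>

    // 5-bit Hann-like window coefficients (0..31). Placeholder values only:
    // the real table comes from an LPCNet window function.
    const int8_t window[128] PROGMEM = { 0 /* fill with real coefficients */ };

    // Scale a 128-sample frame to 4-bit range and apply the 5-bit window.
    // The shift required doubles as the frame's 'energy' measure for gating.
    uint8_t prepare_frame(const int16_t *in, int8_t *out) {
      int16_t peak = 1;
      for (uint8_t i = 0; i < 128; i++) {     // find the frame's peak magnitude
        int16_t a = (in[i] < 0) ? -in[i] : in[i];
        if (a > peak) peak = a;
      }
      uint8_t shift = 0;                      // shift needed to fit -7..7 (4-bit)
      while ((peak >> shift) > 7) shift++;
      for (uint8_t i = 0; i < 128; i++) {
        int8_t s = in[i] >> shift;            // 4-bit sample
        int8_t w = pgm_read_byte(&window[i]); // 5-bit window coefficient
        out[i] = (s * w) >> 5;                // one 8x8->16 hardware multiply each
      }
      return shift;
    }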
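The magnitude shortcut is the classic 'alpha max plus beta min' estimate from the dspguru link; one common integer variant (alpha=1, beta=1/4 here, the firmware's exact constants may differ) avoids both the 32-bit squares and the square root:

    #include <stdint.h>

    // Estimate |re + j*im| as max(|re|,|im|) + min(|re|,|im|)/4,
    // staying entirely within 16-bit integer arithmetic.
    static inline uint16_t mag_est(int16_t re, int16_t im) {
      uint16_t a = (re < 0) ? -re : re;
      uint16_t b = (im < 0) ? -im : im;
      if (a < b) { uint16_t t = a; a = b; b = t; }  // a = max, b = min
      return a + (b >> 2);
    }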
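The 3-bit log2 conversion can then be as simple as counting octaves above a noise floor (the floor constant here is an arbitrary guess):

    #include <stdint.h>

    // Map a 16-bit magnitude estimate to a 3-bit log2 level (0..7).
    static inline uint8_t log2_3bit(uint16_t m) {
      uint8_t level = 0;
      m >>= 6;                      // assumed noise floor: anything below maps to 0
      while (m > 1 && level < 7) {  // one level per octave above the floor
        m >>= 1;
        level++;
      }
      return level;
    }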
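The mel and cepstrum steps are both small integer matrix reductions. The band edges below are a rough mel-like spacing invented for illustration, and the DCT table would be generated offline on the desktop (left as a placeholder):

    #include <avr/pgmspace.h>

    // 23 edges defining 22 roughly mel-spaced bands over FFT bins 1..64
    // (illustrative spacing, not the firmware's actual edges).
    const uint8_t mel_edge[23] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
                                   16, 18, 21, 24, 28, 32, 37, 43, 49, 57, 65 };
    // 6-bit signed DCT-II coefficients, generated offline - placeholder here.
    const int8_t dct_tab[8][22] PROGMEM = {{ 0 }};

    // 65 3-bit log2 bins -> 22 mel bins -> 8 cepstral values.
    void mel_cepstrum(const uint8_t *logspec, int8_t *ceps) {
      uint8_t mel[22];
      for (uint8_t m = 0; m < 22; m++) {    // average the log2 bins in each band
        uint16_t acc = 0;
        uint8_t n = mel_edge[m + 1] - mel_edge[m];
        for (uint8_t k = mel_edge[m]; k < mel_edge[m + 1]; k++)
          acc += logspec[k];
        mel[m] = acc / n;
      }
      for (uint8_t c = 0; c < 8; c++) {     // 8x22 integer matrix multiply
        int16_t acc = 0;
        for (uint8_t m = 0; m < 22; m++)
          acc += (int16_t)mel[m] * (int8_t)pgm_read_byte(&dct_tab[c][m]);
        ceps[c] = acc >> 7;                 // rescale into 8-bit range
      }
    }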
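Finally, the end-of-word match can be a brute-force nearest-neighbour search. The 129-byte entry layout (16 frames x 8 cepstral values plus 1 digit label) is my reading of the entry size mentioned above, and the L1 distance metric is an assumption; the codebook is shown in RAM for simplicity, though on the 328 it would live in flash:

    #include <stdint.h>

    // Warp an utterance of nframes 8-value cepstral vectors to exactly 16
    // frames, then return the digit of the closest codebook entry.
    int8_t best_digit(const int8_t frames[][8], uint8_t nframes,
                      const int8_t *codebook, uint8_t entries) {
      int8_t warped[16][8];
      for (uint8_t t = 0; t < 16; t++) {
        uint8_t src = (uint8_t)(((uint16_t)t * nframes) / 16); // nearest-frame warp
        for (uint8_t c = 0; c < 8; c++)
          warped[t][c] = frames[src][c];
      }
      uint16_t best = 0xFFFF;
      int8_t digit = -1;
      for (uint8_t e = 0; e < entries; e++) {
        const int8_t *p = codebook + (uint16_t)e * 129; // 128 values + 1 label
        uint16_t d = 0;
        for (uint8_t i = 0; i < 128; i++) {
          int16_t diff = (int16_t)warped[i >> 3][i & 7] - p[i];
          d += (uint16_t)((diff < 0) ? -diff : diff);   // accumulate L1 distance
        }
        if (d < best) { best = d; digit = p[128]; }
      }
      return digit;   // 0..9, or -1 if the codebook is empty
    }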
serial_read/ contains a simple Linux C program that reads the frame data coming from /dev/ttyUSB0 and writes it to a file (/tmp/a.raw). This can then be converted to a WAV file via the 'sox' application for listening back, as shown below.
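For reference, assuming the dump is raw 8-bit signed mono at 6400 samples/sec (the signedness is my assumption; swap in '-e unsigned' if playback sounds wrong), the conversion is along these lines:

    sox -r 6400 -e signed -b 8 -c 1 /tmp/a.raw /tmp/a.wav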
nano_train/ contains the Linux-desktop training code used to generate the 'codebook.h' file here. It just reads the 8-bit, 6400 samples/second data dumped by the Nano when in 'TRAINING=1' mode and runs the same feature-extraction code, but it allows entries to be created in / added to the 'codebook.h' spoken-digits file via a key-press for a second or so after a wrong guess is generated. (Best to delete this file and start fresh for new training.)
Unfortunately the microphone setup I'm currently using (a MAX4466 board with its analog output going straight into pin A7, 3V3 and GND for power, and no external filtering components at all) produces very poor-quality, noisy audio, which hinders the results.
Running the same 16-bit integer code on my Linux desktop with good-quality 6400 samples/sec audio clips produces superior results. I'm also playing with an 'Arduino Nano RP2040 Connect' board, which has a built-in PDM microphone and gives much cleaner, low-noise recordings.
Mostly, this is just a test to see whether the Arduino Nano has enough horsepower to do traditional-style feature extraction (i.e. a 128-wide/20ms FFT plus the mel and cepstrum matrix work at 100fps) in real time - which seems to be true.
This project is inspired by work by Peter Balch: "Speech Recognition With an Arduino Nano"
https://www.instructables.com/Speech-Recognition-With-an-Arduino-Nano/
Also uses examples on Arduino interrupts by amandaghassaei: