Published on 04.01.2019 in [software]
Whisp - An Environmental Sound Classifier
Whisp is an environmental sound classifier that can be used to identify sounds around you. In its current form, Whisp classifies 5-second sounds with 87.25% accuracy across 50 categories, based on the ESC-50 dataset. You can also record sounds in the field to get another perspective on what is happening in your sonic environment.
You can try the app here!
Trying the "Record your sound" feature on your computer might not get very satisfying results because, well, most of us are on a computer in pretty sonically uninteresting places. Definitely give it a shot on your mobile device when you're out and about, surrounded by more interesting environmental sounds :)
As someone who has spent a lot of time recording and listening to sounds, I've always dreamed of building a generalized sound classifier.
My interests have been moving toward research in audio event recognition, and Whisp is a first attempt to dive into that world.
Some applications I'd like to use it for include:
A tag suggester for field recordings
An augmented reality app that identifies sounds around you for the hard-of-hearing community
A tool that helps sound artists analyze audio events in their surroundings
In this write-up I will walk through the steps to create the classifier, and drop hints and insights along the way that I picked up from the fastai course on deep learning.
If you want to skip ahead, feel free to check out the Whisp repo on Github.
The data I'm using comes from the ESC-50 (Environmental Sound Classification) Dataset.
This dataset provides a labeled collection of 2000 environmental audio recordings. Each recording is 5 seconds long, and the recordings are organized into 50 categories with 40 examples per category.
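If you want to poke around yourself, the dataset ships with a metadata CSV that maps each clip to its category and cross-validation fold. Here's a minimal sketch with pandas (the path assumes the layout of the ESC-50 repo):

```python
import pandas as pd

# meta/esc50.csv maps each clip to its label, category name, and fold.
meta = pd.read_csv('ESC-50/meta/esc50.csv')

print(len(meta))                               # 2000 clips
print(meta['category'].nunique())              # 50 categories
print(meta['category'].value_counts().head())  # 40 clips per category
```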
Before training the model, it's useful to spend some time getting familiar with the data to see what we are working with.
In particular, we are going to train our model not on the audio files themselves, but on images generated from them. Specifically, we will generate spectrograms from the audio files and train a deep neural net that has been pre-trained on images.
For the details of how I generated the spectrograms from the audio files, check out my spectrogram generator notebook.
One thing to note: I was able to get better accuracy by creating square spectrogram images rather than rectangular ones, so that training takes the entire spectrogram into account rather than just parts of it.
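The full pipeline lives in the notebook, but the core of it looks something like the sketch below, assuming librosa and matplotlib (the mel-spectrogram parameters here are illustrative, not necessarily the exact ones I used):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def save_spectrogram(wav_path, out_path):
    # Load the 5-second clip at its native sample rate.
    y, sr = librosa.load(wav_path, sr=None)

    # Mel-scaled spectrogram, converted to decibels so the image
    # reflects perceived loudness.
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)

    # Render a square image with no axes or margins, so the network
    # trains on the entire spectrogram rather than a crop of it.
    fig = plt.figure(figsize=(5, 5))
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)
    librosa.display.specshow(S_db, sr=sr)
    fig.savefig(out_path)
    plt.close(fig)
```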
To train the model, we are going to use a resnet34, run the learning rate finder, and train twice over 10 epochs.
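In fastai v1 that recipe looks roughly like this (paths and hyperparameters are placeholders, and older releases name the factory create_cnn instead of cnn_learner):

```python
from fastai.vision import *

# Spectrogram images organized into one folder per ESC-50 category.
data = ImageDataBunch.from_folder('data/spectrograms', valid_pct=0.2,
                                  size=224, bs=32).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=error_rate)

# Pick a learning rate from the finder's plot, train the head,
# then unfreeze the whole network and fine-tune.
learn.lr_find()
learn.recorder.plot()

learn.fit_one_cycle(10)
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(1e-5, 1e-3))
```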
From the fastai forums, I was able to get a general sense of when I'm overfitting or underfitting:
Training loss > valid loss = underfitting
Training loss < valid loss = overfitting
Training loss ~ valid loss = just about right
Nice! That gives us an error rate of 0.1275, or 87.25% accuracy!
There is a bit of overfitting going on (Jeremy Howard would think it's OK), but still, really great results!
Here is our confusion matrix, which looks pretty good.
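fastai's interpretation tools make this a couple of lines; something like:

```python
# Plot the 50x50 confusion matrix and list the most-confused pairs.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12, 12), dpi=60)
interp.most_confused(min_val=2)
```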
Future Paths Forward
I'd like to train this model on Google's AudioSet data.
Explore more data augmentation methods as described in Salamon and Bello's paper on Environmental Sound Classification.
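For reference, a couple of the deformations from that paper (time stretching and pitch shifting) are easy to prototype with librosa; a sketch with made-up parameters:

```python
import librosa

# Hypothetical clip; each augmented copy keeps the original label.
y, sr = librosa.load('dog_bark.wav', sr=None)

stretched = librosa.effects.time_stretch(y, rate=1.2)       # 20% faster
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up 2 semitones
```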