Published on 05.29.2019 in [software]

Cat Spectrograms
Fireworks Spectrograms
Sea Waves Spectrograms
Sirens Spectrograms

Whisp - An Environmental Sound Classifier

Whisp is an environmental sound classifer that can be used to identify sounds around you. In its current form, Whisp will classify 5 second sounds with an 87.25% accuracy across a range of 50 categories, based on the ESC-50 dataset. You can also record sounds in the field to get a another perspective of what is happening in your sonic environment.

You can try the app here! It works on desktop Firefox and mobile only Safari on iOS (Chrome has some issues that don't let using the microphone for recording work right now, sorry!).

Trying the "Record your sound" feature on your computer might not get very satisfying results because, well, most of us are on a computer in pretty sonically uninteresting places. Definitely give it a shot on your mobile device when you're out and about, surrounded by more interesting environmental sounds :)


As someone who has spent a lot of time recording and listening to sounds, the idea of a generalized sound classifier has always been a dream of mine to build.

I'm interested in creating technologies that change our relationship to the sounds in our environment. Or another way, I like creating sound technologies that change our relationship to our environment and the world at large.

I'm finding my interests moving more towards research in audio event recognition, so Whips is a first attempt to dive into that world.

Some applications that I've wanted to use one for include:

To those ends, I built a environmental sound classifier using the ESC-50 dataset and fastai library.

In this write up I will walk through the steps to create the classifier, as well as drop hints and insights along the way that I picked up from the fastai course on deep learning.

If you want to skip ahead, feel free to check out the Whisp repo on Github.


The data I'm using comes from the ESC-50 (Environmental Sound Classification) Dataset.

This dataset provides a labeled collection of 2000 environmental audio recordings. Each recording is 5 seconds long, and is organized into 50 categories, with 40 examples per category.

Before training the model, its useful to spend some time getting familiar with the data in order to see what we are working with.

In particular, we are going to train our model not with the audio files, but with images generated from the audio files. Specifically, we will be geneating spectrograms from the audio files and train them with a deep learning neural net that has been pre-trained on images.

For more information on how I generated the spectrograms from the audio files, check out my spectrogram generator notebook on how I did this.

One thing to note is that with spectrogram images, I was able to get better accuracy by creating square images rather than rectangles, so that the training would take into account the entire spectrogram rather than just parts.


To train the model, we are going to use a resnet34, use our learning rate finder, and train twice over 10 epochs.

From the fastai forms, I was able to get a general sense of when I'm overfitting or underfitting.

Training loss > valid loss = underfitting
Training loss < valid loss = overfitting
Training loss ~ valid loss = just about right

epoch train_loss valid_loss error_rate
1 1.063904 1.055990 0.325000
2 1.036396 2.332567 0.562500
3 1.049258 1.470638 0.387500
4 1.032500 1.107848 0.337500
5 0.924266 1.392631 0.417500
6 0.768478 0.623403 0.212500
7 0.596911 0.535597 0.165000
8 0.446205 0.462682 0.160000
9 0.325181 0.419656 0.135000
10 0.251277 0.402070 0.127500

Nice! That gets us an error rate of 0.127500, or 87.25%!.

There is a bit of overfitting going on (Jeremy Howard would think its ok), but still, really great results!

Here is our confusion matrix which looks pretty good.

Whisp Confusion Matrix
Whisp Confusion Matrix

Testing in the Field

I've been taking Whisp with me out on field recording expeditions around the Newtown Creek.

Dutch Kills

One night with Mitch Waxman, I took an early version of Whisp and made field recordings around the Dutch Kills area of the Newtown Creek, and down near Blissville. I extracted 3 sounds from the recordings that I knew would show up in the ESC-50 dataset categories.

Train Engine

Whisp classified this sound as a washing machine with 69% confidence, which... isn't exactly correct. But hey, a washing machine does sound a lot like an engine when its running right? I can understand the ambiguity. Whisp had 18% confidence that it was a helicopter, and 5% confidence that it was an engine (of some sorts).


Whisp classified this sound as a thunderstorm with 97% confidence, which are usually pretty windy! The next highest confidence score was wind, with 7% accuracy.

Train horn

Finally, Whisp classified this sound as a car horn with about 99% accuracy. Given that the dataset doesn't have "train horn" as a category, we can live with this being close enough ;)

Hunter's Point Park (Hunter's Point South Park Extension) - mouth of Newtown Creek

I recently took Whisp out into the field with Taiwanese sound artist Ping Sheng Wu to test Whisp in the field.

We saw a group of birds off into the distance.

Whisp was able to hear and classify their chirping!

We tried getting some water sounds, but most of it came back as wind, as that was the dominant sound out there. Sea waves did come back though, but with a low 3% confidence rating.

On our walk back to the train station, we found a fan and decided to try Whisp'ing it.

Whisp thought it was a vaccum cleaner, which, like the example above of the engine that sounded like a washing machine, isn't too far off. It also thought it could have been a washing machine and plain old wind.

Testing in the wild

Since releasing Whisp I've taken it out with me to try to classify sounds around me, which it does a really good job at!

Here are some examples of it classyfing sounds like:


Sea Waves

Ambulance Siren



Future Paths Forward

I'd like to train this model on Google's AudioSet data.

I'm also interested in Exploring more data augmentation methods as described in Salamon and Bello's paper on Environmental Sound Classification.

Some ideas that I'd love to explore are the idea of a "sound homonym". For example, there are a lot of sounds that sound similar to each other, and that the classifier gets wrong but is pretty close (washing machine vs. engine, for example). I wonder what it would look like to play around with sound homonyms for performance.

The other thing that I'm interested in is the "distance" between sounds. For example, the classifier gives you the "closest" prediction it thinks the sound is. You could imagine that the prediction that is the least close is the furthest away. It would be interesting to push this idea further and think about how different sounds are more or less distant from each other. What would it mean for a sound to be the opposite of another sound? Or the most different sound?


ESC-50 Dataset

ESC: Dataset for Environmental Sound Classification

Audio Classification using FastAI and On-the-Fly Frequency Transforms

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

Environmental Sound Classification with Convolutional Neural Networks

Music Genre Classification