Peter Gao
April 25, 2021
13 min

Deep Learning For Audio With The Speech Commands Dataset

Here, we train a very simple model on the Speech Commands audio dataset and analyze its failure cases to see how best to improve it!

In the last decade, deep learning has become a popular and effective technique for learning on a variety of different data types. While most people know of applications of deep learning to images or text, deep learning can also be useful for a variety of tasks on audio data!

At my previous job working on self driving cars, I spent a lot of time working on deep learning models for imagery and 3D pointclouds. While we had some cool applications of audio processing, I only dabbled in it and never seriously worked on it.

However, here at Aquarium, we have some customers working on some pretty interesting applications with audio data. We recently held a 3-day hackathon, so I decided to try doing deep learning on audio for myself.

In this blog post, we’re going to train a very simple model on the Speech Commands audio dataset. We’ll then use Aquarium to analyze its failure cases and determine how best to improve the model.

If you’d like to follow along, the full code I used for this project is available here. The dataset and model inferences can be accessed here.

Choosing A Dataset

I looked for a dataset that wasn’t too small to do anything interesting, but small enough that I could train a model on it using my local machine instead of requiring a fancier distributed setup.

After a bit of searching, I found the Speech Commands dataset, which consists of approximately 1 second long audio recordings of people saying single words as well as segments containing background noise. The task is a classification task, where a model listens to segments of audio and classifies whether a certain trigger word is spoken or not.

Even better, there’s also a nice PyTorch example that loads the dataset and trains a model on it. For this project, I used this code with some relatively minor modifications. I wanted to take something relatively out of the box, train it, and then understand how and why it fails.

Becoming One With The Data

First, I downloaded the dataset to get a feel for what type of data it contains. It’s important at this stage to gut-check that the raw data is not corrupt or malformed, and also to make sure that the labels are reasonably accurate. It also gives you a sense of how difficult the task is: if it’s hard for you as a human to pick out what word is being spoken, then it’ll be even harder for the ML model.

At this stage, the best dataset inspection tool is your file manager! I clicked through the various folders in Nautilus and listened to random samples of the data to get a sense of what they “sound” like. From a cursory inspection, there is a folder for each class that contains audio files of people speaking that class’s word. I listened to these audio files for ~15 minutes and confirmed that the audio files seemed to match up with the label folders that they belonged in, so the basic quality of the dataset seems pretty good.

Pro-tip: you can shift click a series of files in Nautilus, open them in VLC player, and VLC player will play them all sequentially, which makes it easy to listen to many different audio clips very quickly.

Believe it or not, Nautilus and VLC player are critical parts of the ML toolchain.

Now to load the dataset programmatically. The good news is that there’s already an example of loading the data and training a model on it. The bad news is that it’s Colab notebook code, and I don’t really like notebook code. After I stripped the notebook-specific parts out of the example, I started taking a look at how the data is loaded.

But first, what does an audio file actually contain? To understand this, we have to brush up on some of our physics and audio processing knowledge. Sound is fundamentally a vibration (variations in pressure) that travels through space as a wave. To record audio, computer microphones measure the changes in pressure over time as amplitude, do some processing/compression, and write them to disk.

However, sound waves in reality are analog signals while computer microphones record and save sound digitally. This means if one were to plot the actual sound wave as a chart of amplitude over time, one would get a smooth and continuous curve. Digital microphones sample this wave over time: instead of recording and saving a continuous analog wave, the digital microphone measures amplitude periodically (how often it measures is known as the sample rate). Additionally, the recorded amplitude doesn’t exactly match the true amplitude: it’s rounded to the nearest amplitude value at a certain resolution, which is determined by the bit depth. After this is recorded, the computer can then compress and save the quantized amplitudes it measured over time.

The actual analog sound wave, in red, is a smooth and continuous curve. Digital microphones sample this curve at various points in time (represented in blue) and save these. Source.
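
To make sampling and quantization concrete, here’s a minimal sketch in NumPy. It is only an illustration, not part of the project code: the function name `sample_and_quantize` and the 440 Hz test tone are my own choices, and a real recording pipeline would also involve filtering and compression.

```python
import numpy as np

def sample_and_quantize(freq_hz, duration_s, sample_rate, bit_depth):
    """Sample a sine wave, then round each sample to a signed integer level."""
    # Sampling: one measurement every 1/sample_rate seconds
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate
    analog = np.sin(2 * np.pi * freq_hz * t)  # "true" amplitude in [-1, 1]

    # Quantization: map [-1, 1] onto the integer levels this bit depth allows
    max_level = 2 ** (bit_depth - 1) - 1
    quantized = np.round(analog * max_level).astype(np.int64)
    return t, analog, quantized

t, analog, quantized = sample_and_quantize(
    freq_hz=440.0, duration_s=1.0, sample_rate=16000, bit_depth=16)

# Largest rounding error, expressed back on the [-1, 1] scale
max_error = np.max(np.abs(quantized / (2 ** 15 - 1) - analog))
```

Lowering `bit_depth` makes the rounding error larger, and lowering `sample_rate` gives you fewer points along the curve.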

When loading the dataset using the sample code, each row of data contains:

  • An audio waveform represented as a one dimensional array of numbers representing quantized samples (like the blue points in the diagram above)
  • The sampling rate (in Hz) as an attached piece of metadata.
  • The label of the word spoken in the clip (i.e. the class the model will be trained to predict).
  • The speaker id.
  • The utterance number from that speaker to distinguish between multiple recordings from the same speaker.

This data is then split across a train set (~85k rows), validation set (~10k rows), and test set (~5k rows). Each audio segment was sampled at 16kHz and the audio arrays for most segments were 16,000 long, which translates into 1 second long audio clips. However, some of the audio segments were shorter than 16,000 long, which can be a problem down the line when a neural network requires a fixed-length input.
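
The relationship between array length and clip duration is just a division: 16,000 samples at 16,000 samples per second is 1 second. Here’s a small sketch of how one might flag the shorter clips. The helper names and the list of lengths are made up for illustration; in practice the lengths would come from the loaded waveforms.

```python
def clip_duration_seconds(num_samples, sample_rate):
    """Duration of a clip given its sample count and sample rate in Hz."""
    return num_samples / sample_rate

def find_short_clips(clip_lengths, expected_length=16000):
    """Return the indices of clips that have fewer samples than expected."""
    return [i for i, n in enumerate(clip_lengths) if n < expected_length]

# Made-up sample counts standing in for real waveform lengths
clip_lengths = [16000, 16000, 14336, 16000, 9558]

short = find_short_clips(clip_lengths)
durations = [clip_duration_seconds(n, 16000) for n in clip_lengths]
```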

Training A Model, And Some Pro-Tips

Now that we understand the data, it’s time to train a model.

The PyTorch example code does some simple preprocessing before feeding data to a model. First, if there are any examples that are less than 1 second long, it pads them with zeros to ensure they are the same length as the other examples. This ensures that the model trains on a fixed-size input.

import torch

def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(
        batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

Second, it resamples the audio to 8kHz to reduce the size of the input.

import torchaudio

def get_transform(sample_rate):
    # Downsample from the original rate to 8 kHz
    new_sample_rate = 8000
    transform = torchaudio.transforms.Resample(
        orig_freq=sample_rate, new_freq=new_sample_rate)
    return transform

After that, we can define a model architecture to train on this data. The example code comes with a fairly simple network built on top of 1D convolutions on the waveforms. I’ll note that this model architecture is quite similar to the architectures used for image tasks, which is part of the reason why deep learning has proven useful across a lot of different domains.

I’ve found a few practices to be helpful when doing deep learning that weren’t included in the sample code. I’d generally recommend them to most people getting started with a deep learning project.

  • Make sure there’s no leakage between the train / val / test sets. Otherwise your model will be unfairly advantaged at evaluation time and will not give you an accurate assessment of its generalization performance. When I checked this with the sample code, I discovered that the train set was improperly including the data from the validation and test sets! This bug is probably related to me running this code on an Ubuntu system instead of in a Colab notebook like the original example does.
  • Inspect/visualize the preprocessed data right before it gets into the network and confirm it’s what you expect. In my experience, a lot of bugs can pop up in the data preprocessing that will mess up your model training. Plotting some datapoints in Matplotlib / playing some audio with simpleaudio are easy ways to sanity-check the results of your data preprocessing.
  • Set up Tensorboard or an equivalent to monitor the progress of your training. This is very important because it lets you eyeball your training and validation loss curves, track whether your model is converging, and identify when it starts to overfit or underfit.
  • Save checkpoints of your model weights to disk every few epochs. If your computer crashes, you can restore progress from a checkpoint instead of losing a day’s worth of work. You can also use this to pick the model checkpoint that had the best validation loss according to your Tensorboard logs.
  • Reduce your cycle time by looking for drop-in speedups to the model training process. If you have a GPU available, set up GPU acceleration for your preprocessing and model forward + backward passes. CUDA installation can be very frustrating, but it’s well worth it when your model trains 10x faster. If you have multiple GPUs available (usually in a beefy cloud machine), try implementing data parallelism so your model will burn through minibatches even faster.
Eyeballing loss curves is important!
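
As an example of the first tip, a leakage check can be as simple as intersecting the sets of file identifiers in each split. This is a sketch under my own assumptions rather than the check I actually ran: the function name `find_split_leakage` is hypothetical and the file paths below are invented.

```python
def find_split_leakage(splits):
    """Return {(split_a, split_b): shared_ids} for every pair of splits that overlap."""
    names = sorted(splits)
    leaks = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]) & set(splits[b])
            if shared:
                leaks[(a, b)] = shared
    return leaks

# Invented file paths; in practice these would be the files behind each split
splits = {
    "train": ["bed/0a7c2a8d_nohash_0.wav",
              "cat/1b4c9b89_nohash_0.wav",
              "dog/2c6d3924_nohash_1.wav"],
    "val": ["cat/1b4c9b89_nohash_0.wav"],
    "test": ["up/3d794813_nohash_0.wav"],
}

leaks = find_split_leakage(splits)
```

An empty result means no file appears in more than one split. The same function works on speaker IDs if you also want to make sure no speaker is shared across splits.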

What Next?

After 50 epochs of training, the model got to ~93% F1 score on the train set and ~85% F1 score on the test set. That’s not so bad, but I’m sure we can do better! There are a few things I wanted to try to improve the model performance:

  • Implement a model that uses 2D convolutions on Mel spectrograms (instead of the current model’s 1D convolutions on the raw waveform), as this tends to achieve state of the art performance on other audio tasks.
  • Do some data augmentation to get more variation in the dataset (adding noise, shifting the timing of the audio clips, etc.) to increase the generalization capability of the final model.
  • Try using input audio with the original (higher) sample rate so the model can operate on a more informative input.
  • Collect more data of the classes that aren’t performing well to hopefully improve the model’s accuracy on those classes.
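
To give a flavor of the data augmentation idea from the list above, here’s a rough NumPy sketch of adding noise and shifting a clip in time. I did not implement this for the project; the function names, the noise level, and the shift amount are all placeholder choices.

```python
import numpy as np

def add_noise(waveform, noise_std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return waveform + rng.normal(0.0, noise_std, size=waveform.shape)

def time_shift(waveform, shift):
    """Shift samples right (shift > 0) or left (shift < 0), zero-filling the gap."""
    shifted = np.zeros_like(waveform)
    if shift > 0:
        shifted[shift:] = waveform[:-shift]
    elif shift < 0:
        shifted[:shift] = waveform[-shift:]
    else:
        shifted[:] = waveform
    return shifted

rng = np.random.default_rng(0)
sample_rate = 16000
clip = np.sin(2 * np.pi * 440.0 * np.arange(sample_rate) / sample_rate)

# Delay the clip by 50 ms (800 samples), then add a little noise
augmented = add_noise(time_shift(clip, 800), noise_std=0.005, rng=rng)
```

Note that a large shift pushes part of the word out of the clip, so the shift range needs to stay small relative to the clip length.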

This is typically where most tutorial blog posts end. However, let’s go a few steps further and actually analyze the failure cases so we can intelligently work on the stuff that would most improve the model performance.

Using Aquarium For Failure Case Analysis

First, I wanted to inspect the differences between the labeled dataset and the inferences from my model in order to analyze some of its failure cases. After a brief detour to generate some spectrograms for visualization, I uploaded the dataset to an Aquarium GCS bucket and then used the Aquarium client library to upload my labels, metadata, and inferences.

# Add label data
al_labeled_frame = al.LabeledFrame(frame_id=frame_id)
# Add arbitrary metadata, such as the train vs test split
al_labeled_frame.add_user_metadata('speaker_id', speaker_id)
al_labeled_frame.add_user_metadata(
    'utterance_number', utterance_number)
al_labeled_frame.add_user_metadata(
    'waveform_shape', waveform.shape[1])
al_labeled_frame.add_user_metadata('split', dataset_split)
# Add a spectrogram image to the frame for visualization
# NOTE: `add_image` is an assumed method name; check the Aquarium client docs
al_labeled_frame.add_image(
    sensor_id='spectrogram', image_url=image_url)
# Add the audio file to the frame for audio-lization
# NOTE: `add_audio` is an assumed method name; check the Aquarium client docs
al_labeled_frame.add_audio(
    sensor_id='waveform', audio_url=audio_url)
# Add the ground truth classification label to the frame
label_id = frame_id + '_gt'

# Add inference data
al_inf_frame = al.InferencesFrame(frame_id=frame_id)
inf_label_id = frame_id + "_inf"

Once the upload completed, I went into the Aquarium UI and was able to inspect my dataset and play some audio clips.

Eyeballing and… ear-holing (?) some examples

I also plotted out the class distribution and saw that there were some classes that were relatively underrepresented in the dataset compared to others.

Some classes have half as many examples as other classes!
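
If you want the same numbers outside of a UI, counting labels takes a few lines of Python. This is just a sketch with a made-up label list; `class_distribution` is a hypothetical helper name.

```python
from collections import Counter

def class_distribution(labels):
    """Return (label, count) pairs sorted from most to least common."""
    return Counter(labels).most_common()

# Made-up labels standing in for the real dataset
labels = ["yes", "no", "yes", "up", "yes", "no"]

dist = class_distribution(labels)
# Ratio between the largest and smallest class, as a rough imbalance measure
imbalance = dist[0][1] / dist[-1][1]
```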

Aquarium also automatically surfaces high-loss examples in the dataset that are potentially labeling errors or egregious model failures. Using this and by looking around the interactive model metrics view, I found:

  • Some audio clips actually contain multiple words instead of one! In this case, it seems like the speaker accidentally spoke the wrong word and then corrected himself, and the model correctly classifies the first word.
  • There are some words that sound quite similar that the model confuses — up and off, tree and three, forward and four. A human can tell some of these examples apart, while other cases are much harder to distinguish.
  • Across the various failure cases, there seemed to be some instances where the speaker’s voice is cut off. In many of these clips, the speaker has barely started speaking before the clip ends, making it pretty much impossible to tell apart words like “four” and “forward” that sound the same at the beginning.
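
Confusions like these can also be tallied directly from the labels and inferences. Here’s a minimal sketch, assuming you have two aligned lists of label and predicted class names; the function name `top_confusions` and the example lists are invented for illustration.

```python
from collections import Counter

def top_confusions(labels, predictions, k=3):
    """Count (label, prediction) pairs where the model disagrees with the label."""
    pairs = Counter(
        (y, p) for y, p in zip(labels, predictions) if y != p)
    return pairs.most_common(k)

# Invented examples echoing the confusions described above
labels = ["tree", "tree", "three", "up", "four", "up"]
predictions = ["three", "three", "three", "off", "forward", "up"]

confusions = top_confusions(labels, predictions)
```

This tells you which pairs are confused most often, but not why, which is where listening to the clips comes in.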

The last failure mode seems to happen across multiple classes in the limited set of data I examined, but I don’t have a good sense of how big this problem is and I didn’t want to spend hours listening to all of the audio clips in the dataset to gauge that. However, it’s important to understand which problem is the biggest before deciding what to try next.

Luckily, a unique aspect of Aquarium is its ability to leverage neural network embeddings to cluster together similar datapoints. This functionality allows us to find clusters of similar datapoints as well as outliers in the dataset. This is particularly useful when it comes to identifying patterns of model failures.

It’s not too hard to extract embeddings from a model: simply run your model on a datapoint and then grab the activations from the second-to-last layer (before the class-wise FC layer at the end). The embedding vector for a datapoint here is simply a 1D vector of floats that represents a learned feature vector. I ran my model on the dataset, extracted embeddings and inferences, and pickled them to disk:

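# Embedding extraction
activation = {}

def get_activation(name):
    # Build a forward hook that stores a layer's output under `name`.
    # The hook has to be registered on the layer of interest (here, the
    # one stored under "pool4") before predict() is called.
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

def predict(tensor):
    # Use the model to predict the label of the waveform
    # (`device` is assumed to be defined as in the PyTorch example)
    tensor = tensor.to(device)
    tensor = transform(tensor)
    tensor = audio_model(tensor.unsqueeze(0))
    likely_ind = model.get_likely_index(tensor)
    out_label = data.index_to_label(likely_ind.squeeze(), labels)
    # The model outputs log-probabilities, so exponentiate to get probabilities
    probs = torch.exp(tensor).squeeze()
    confidence = probs[likely_ind]
    embedding = activation["pool4"].flatten()
    return out_label, confidence.item(), embedding.cpu().tolist()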
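
For completeness, here’s a self-contained sketch of the forward hook pattern on a toy network, including the hook registration step. `TinyAudioNet` and its layer names are invented stand-ins, not the model from this project, so treat this as an illustration of the mechanism only.

```python
import torch
import torch.nn as nn

class TinyAudioNet(nn.Module):
    """Toy stand-in for a real model: conv -> pool -> linear classifier."""
    def __init__(self, n_classes=35):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=80, stride=16)
        self.pool = nn.AdaptiveAvgPool1d(1)  # output of this layer is our embedding
        self.fc = nn.Linear(8, n_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        return torch.log_softmax(self.fc(x.flatten(1)), dim=-1)

activation = {}

def get_activation(name):
    def hook(module, inputs, output):
        activation[name] = output.detach()
    return hook

model = TinyAudioNet().eval()
# Register the hook on the layer just before the final classifier
handle = model.pool.register_forward_hook(get_activation("pool"))

with torch.no_grad():
    log_probs = model(torch.randn(1, 1, 8000))  # one fake 1-second clip at 8 kHz

embedding = activation["pool"].flatten().tolist()
handle.remove()  # detach the hook once we are done
```

The hook fires on every forward pass, so the entry in `activation` always reflects the most recent input.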

I then loaded the embeddings and inferences back from disk and uploaded them to Aquarium:

import pickle

with open("embeddings.pickle", "rb") as f:
    embeddings = pickle.load(f)

# `inferences` is assumed to have been loaded from disk the same way,
# as a list aligned index-for-index with `embeddings`
inferences_map = {}
embeddings_map = {}
for inference, embedding in zip(inferences, embeddings):
    frame_id, pred_class, confidence, dataset_split = inference
    inferences_map[frame_id] = inference
    embeddings_map[frame_id] = embedding

Once the embeddings are uploaded and processed, we can get a sense of the dataset’s distribution in embedding space. Feel free to try it for yourself here.

We can then color the embedding cloud by the agreement/disagreement between the model inferences and the labels. We clearly see a huge cluster of disagreements in one section of the dataset that is way bigger than any other error pattern!

Coloring the embedding cloud by agreement / disagreement between the model and the labels

Only showing disagreements between the model and the labels.

Looking closer, all of these datapoints are audio clips where the speaker is cut off, making it impossible to tell what word is being said.

Looking at the waveforms of these examples, you can tell that all these clips are pretty much cut off at the end.
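
I found these clips through the embedding view, but once you know what to look for, a crude heuristic on the raw waveform can give a rough count too. The sketch below is a hypothetical check I did not run on the dataset: it flags a clip when the last few milliseconds are loud relative to the clip as a whole. The function name, the 50 ms window, and the 0.5 threshold are all arbitrary assumptions that would need tuning.

```python
import numpy as np

def looks_cut_off(waveform, sample_rate, tail_ms=50, ratio_threshold=0.5):
    """Flag a clip whose final `tail_ms` is loud relative to the whole clip."""
    tail_len = max(1, int(sample_rate * tail_ms / 1000))
    tail_rms = np.sqrt(np.mean(waveform[-tail_len:] ** 2))
    full_rms = np.sqrt(np.mean(waveform ** 2))
    if full_rms == 0:
        return False  # a silent clip has nothing to cut off
    return bool(tail_rms / full_rms > ratio_threshold)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

complete = np.where((t >= 0.3) & (t < 0.6), tone, 0.0)  # sound ends well before the clip does
cut_off = np.where(t >= 0.9, tone, 0.0)                 # sound starts just before the clip ends

flags = (looks_cut_off(complete, sample_rate), looks_cut_off(cut_off, sample_rate))
```

This would need to run on the clips before zero padding, and it will also flag clips of steady background noise, so it is a way to shortlist candidates for listening rather than a replacement for it.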

This issue affects thousands of examples in the dataset and sticks out as the single largest failure mode. However, this is less a problem with the model than a problem with what we expect of it: there’s no way a human could listen to those audio clips and reasonably conclude what word was being said. They are only labeled the way they are because the original data collection setup for Speech Commands involved instructing users to say a certain word and saving those audio clips labeled with the words the speakers were instructed to say.

We should probably fix this by creating a new class of “unknown” or “cut off” and retraining the model. This way it won’t be confused by examples that are labeled as legitimate classes but don’t actually contain enough information to be accurately classified. However, this is the point where I ran out of time with the hackathon, so I’ll leave that as an exercise for the reader.


There’s always a point in a machine learning project where you have a model that does alright but not great, and there are lots of things you could try to improve its performance. Andrew Ng has great practical advice for these types of situations: instead of trying random things and seeing what improves performance, spend some time to understand what’s wrong and then fix the things that are messed up.

Here we’ve taken some example code pretty much off-the-shelf, trained a very simple model on the dataset, and analyzed where it goes wrong. As a result, we have discovered that the biggest problem lies not in the model itself but in basic and easily fixable issues with the data. By taking a bit of time to do error analysis, we’ve found the highest ROI way to improve our model performance out of the many things we could have tried.

I used our company’s product, Aquarium, to do a lot of the visualization and error analysis for this post. If you’re interested in trying Aquarium for yourself, let us know!
