How To Learn and Do Speech Recognition? How Ok Google Implemented It...

By Kimberly Cook | Nov 18, 2018 | 6312 Views

Speech recognition is the task of detecting spoken words. There are many techniques to do Speech Recognition. In this post, we will go through some background required for Speech Recognition and use a basic technique to build a speech recognition model. The code is available on GitHub. For the techniques mentioned in this post, check this Jupyter Notebook.

Some background for Audio Processing
Let's take a step back and understand what audio actually is. We all listen to music on our computers/phones. Usually, the files are in mp3 format. But an .mp3 file is not the audio itself. It is a way to represent audio on our computers. We do not open .mp3 files directly and read them (like we read .txt files in Notepad). We use applications that understand what an .mp3 file is and how to play it. These mp3 files encode (represent) audio.

Audio is represented as waves. Generally, these waves have 2 axes. Time is represented on the x-axis and Amplitude on the y-axis. So at every instant of time t, we have a value for amplitude.

You can listen to a simple sine wave here. Great! Now we just need to figure out how to use these audio files in our code to perform recognition.

Using Audio Files
We will be using the Waveform Audio File Format, or .wav files. So how do we read these .wav files? Enter librosa, a Python package that allows us to read .wav files. What do we get after reading these .wav files? We get a huge array of numbers. This is the output I got after reading one of the audio files, which was 1 second long.

array([ 0.0007143 , 0.00551732, 0.01469251, ..., -0.00261393, -0.00326245, -0.00220675], dtype=float32)

What do these numbers mean? Remember that I told you that audio is represented as a wave with two axes. These values represent the y-axis of that wave, aka the amplitude. So how is the x-axis, aka time, represented? That is the length of the array! So for a 1-second audio, you might expect the length to be 1000 (for 1000 milliseconds). But the length of this array is actually 22050. Where did this come from?

Sampling Rate
Consider a 5-second audio clip. If it is analog, then it has some amplitude value at every instant of time, aka it has some value for every nanosecond, or maybe every picosecond. So a 5-second audio clip has some value for every picosecond. Those are 5e+12 or 5000000000000 values. Consider storing that on a computer. It takes 4 bytes in C to store a float value. So it is 5e+12 * 4 bytes = 2e+13 bytes. That's around 18 terabytes of data (counting a terabyte as 2^40 bytes) only for a 5-second audio clip!
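The arithmetic above can be checked in a few lines. This is only a back-of-the-envelope sketch; the constant names are made up for illustration, and it assumes one 4-byte float per picosecond as in the text.

```python
# Storage cost of keeping one amplitude value per picosecond.
DURATION_S = 5
VALUES_PER_SECOND = 10**12   # one value per picosecond
BYTES_PER_FLOAT = 4          # 32-bit float

n_values = DURATION_S * VALUES_PER_SECOND
total_bytes = n_values * BYTES_PER_FLOAT

print(f"values: {n_values:.0e}")
print(f"bytes:  {total_bytes:.0e}")
# 2**40 bytes is a binary terabyte (TiB); 10**12 bytes is a decimal terabyte.
print(f"size:   {total_bytes / 2**40:.1f} TiB, {total_bytes / 10**12:.0f} TB")
```

Depending on whether a terabyte is counted as 2^40 or 10^12 bytes, the result is roughly 18 or 20 terabytes. Either way it is far too much for 5 seconds of sound.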

We don't want to use 18 TB just to store a 5-second audio clip. So we convert it into discrete form. To convert it into discrete form, we record samples (aka the amplitude values) at regular time steps. So for a 5-second audio, we could record one sample every second. That's just 5 values (samples)! The number of samples we record per second is called the Sampling Rate.

Formally, the sampling rate is the number of samples collected per second. These samples are spaced at equal intervals in time. For the above example, the sampling rate is 1, aka 1 sample per second. You may have noticed that there is a lot of loss of information. This is the tradeoff in converting from continuous (analog) to discrete (digital). A higher sampling rate reduces the loss of information, at the cost of more storage.
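To see the loss of information concretely, here is a minimal sketch that samples the same 3 Hz sine wave at 1 sample per second and at 100 samples per second. The function name `sample_sine` and the chosen frequency are assumptions made for this illustration.

```python
import math

def sample_sine(freq_hz, duration_s, sampling_rate):
    """Return amplitude samples of sin(2*pi*f*t) taken at equally spaced instants."""
    n_samples = int(duration_s * sampling_rate)
    return [
        math.sin(2 * math.pi * freq_hz * n / sampling_rate)
        for n in range(n_samples)
    ]

coarse = sample_sine(freq_hz=3, duration_s=5, sampling_rate=1)
fine = sample_sine(freq_hz=3, duration_s=5, sampling_rate=100)

# At 1 sample per second, every sample falls where the wave crosses zero.
print(len(coarse), [round(abs(v), 6) for v in coarse])
# At 100 samples per second, the peaks of the wave are captured.
print(len(fine), round(max(fine), 3))
```

With only 5 samples the wave looks like silence, while 500 samples preserve its shape.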

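So why did we get the array of length 22050? librosa uses a default sampling rate of 22050 Hz if nothing is specified, so 1 second of audio becomes 22050 samples. You may be wondering, why 22050? It is half of 44100 Hz (44.1 kHz), which is the more common sampling rate. Humans can hear frequencies ranging from about 20 Hz to 20 kHz, and a signal sampled at a given rate can only represent frequencies up to half of that rate. So 44.1 kHz covers the whole hearing range (up to 22050 Hz), while 22050 Hz covers frequencies up to 11025 Hz, which is enough for most speech and halves the amount of data.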
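The relation between duration, sampling rate and array length can be reproduced without any audio file at hand. The sketch below is not the librosa call used in this post. It swaps in Python's built-in `wave` module so that it is self-contained: it writes a 1-second tone to a temporary .wav file and reads it back. The tone frequency and file name are arbitrary choices.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLING_RATE = 22050   # samples per second
DURATION_S = 1

# Synthesize a 440 Hz tone as 16-bit integers.
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLING_RATE))
    for n in range(SAMPLING_RATE * DURATION_S)
]

path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(path, "wb") as f:
    f.setnchannels(1)               # mono
    f.setsampwidth(2)               # 2 bytes = 16 bits per sample
    f.setframerate(SAMPLING_RATE)
    f.writeframes(struct.pack(f"<{len(samples)}h", *samples))

with wave.open(path, "rb") as f:
    rate = f.getframerate()
    n_frames = f.getnframes()
    raw = f.readframes(n_frames)

# Scale back to floats in [-1, 1], similar to the array shown earlier.
amplitudes = [v / 32768 for v in struct.unpack(f"<{n_frames}h", raw)]
print(rate, len(amplitudes), len(amplitudes) / rate)
```

The number of samples divided by the sampling rate gives back the duration in seconds. With librosa, `librosa.load(path, sr=None)` keeps the file's own sampling rate instead of resampling to the default.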

Also, note that we got a 1D array and not a 2D array. This is because the .wav file that I used was mono audio and not stereo. What's the difference? Mono audio has only a single channel whereas stereo has 2. What's a channel? In simple terms, it is one separately recorded (or stored) signal. Consider you use 1 microphone to record 2 of your friends talking to each other. This recording has 1 channel: both voices, and a dog barking in the background if there is one, are mixed into the same signal. If you record with 2 microphones instead, say one on the left and one on the right, you get 2 channels, and the array becomes 2D with one set of samples per channel.

We usually convert stereo audio to mono before using it in audio processing. Again, librosa helps us do this. Its load function takes a parameter mono, which is True by default, so any stereo audio is converted to mono for us while loading the .wav file.
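A common way to do this conversion is to average the channels sample by sample. The sketch below shows the idea in plain Python on a few made-up stereo frames. The helper `to_mono` is a hypothetical name for this example, not librosa's function.

```python
def to_mono(stereo_frames):
    """Average the left and right sample of every frame into one value."""
    return [(left + right) / 2 for left, right in stereo_frames]

# Each tuple is one instant in time: (left channel, right channel).
stereo = [(0.25, 0.75), (-0.5, 0.5), (1.0, 0.0)]
mono = to_mono(stereo)
print(mono)
```

The result has one value per instant, which is the 1D array shape we saw earlier.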

Features for Audio Recognition
We can use the above time domain signal as features. But it still requires a lot of computational space because the sampling rate should be quite high. Another way to represent these audio signals is in the frequency domain, using the Fourier transform. In simple terms, the Fourier transform is a tool which allows us to convert our time domain signal into the frequency domain. A signal that is made up of a few frequencies can be described much more compactly in the frequency domain. From Wikipedia,

In mathematics, a Fourier series is a way to represent a function as the sum of simple sine waves. More formally, it decomposes any periodic function or periodic signal into the sum of a (possibly infinite) set of simple oscillating functions, namely sines and cosines
In simple terms, any audio signal can be represented as the sum of sine and cosine waves.

A Time Domain Signal represented as the sum of 3 sine waves. (Source)

In the above figure, the time domain signal is represented as the sum of 3 sine waves. How does that reduce the storage space? Consider how a sine wave is represented.

The mathematical representation of sine wave. (Source)

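Since the signal is represented as 3 sine waves, we only need the parameters of those 3 sine waves (amplitude, frequency and phase for each) to represent the signal, instead of thousands of samples.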
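The idea can be tried on a toy signal. The sketch below builds a signal from 3 sine waves and recovers their frequencies and amplitudes with a naive discrete Fourier transform. The frequencies, amplitudes and the tiny sampling rate are made-up values chosen so that each frequency falls exactly on one bin. Real code would use an FFT routine from a library instead of this O(N^2) loop.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform, O(N^2); fine for a tiny example."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]

SAMPLING_RATE = 64  # with 1 second of audio, bin k corresponds to k Hz
components = {2: 1.0, 5: 0.5, 9: 0.25}  # frequency in Hz -> amplitude

signal = [
    sum(a * math.sin(2 * math.pi * f * i / SAMPLING_RATE) for f, a in components.items())
    for i in range(SAMPLING_RATE)
]

spectrum = dft(signal)
# Keep bins below the Nyquist frequency and rescale magnitude to amplitude.
recovered = {
    k: round(2 * abs(spectrum[k]) / len(signal), 3)
    for k in range(len(signal) // 2)
    if 2 * abs(spectrum[k]) / len(signal) > 0.01
}
print(recovered)
```

The 64 time domain samples collapse into 3 frequency and amplitude pairs, which match the components the signal was built from.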

Mel-frequency cepstral coefficients (MFCCs)
Our voice/sound is dependent on the shape of our vocal tract, including the tongue, teeth, etc. If we can determine this shape accurately, we can recognize the word/character being said. MFCCs are a representation of the short-term power spectrum of a sound, which in simple terms captures the shape of the vocal tract. You can read more about MFCCs here.

Spectrograms are another way of representing the audio signal. Spectrograms convey 3-dimensional information in 2 dimensions (2D spectrograms). On the x-axis is time and on the y-axis is frequency. The amplitude of a particular frequency at a particular time is represented as the color intensity at that point.

Waveform and corresponding Spectrogram for a spoken word "yes". (Source)
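A spectrogram can be computed by cutting the signal into short frames and taking the Fourier transform of each frame. The following is a minimal sketch of that idea in plain Python, not the script used for this project. The test signal (16 Hz for one second, then 48 Hz), the frame length and the Hann window are all assumptions made for the illustration.

```python
import cmath
import math

SAMPLING_RATE = 256
FRAME_LEN = 64   # samples per frame, so each frequency bin is 256 / 64 = 4 Hz wide

def tone(freq_hz, n_samples):
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLING_RATE) for n in range(n_samples)]

# One second at 16 Hz followed by one second at 48 Hz.
signal = tone(16, SAMPLING_RATE) + tone(48, SAMPLING_RATE)

def frame_magnitudes(frame):
    """Magnitude of each frequency bin (0 .. Nyquist) of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]

# spectrogram[t][k]: magnitude of frequency bin k in time frame t.
spectrogram = [
    frame_magnitudes(signal[start:start + FRAME_LEN])
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN)
]

bin_hz = SAMPLING_RATE / FRAME_LEN
dominant_hz = [row.index(max(row)) * bin_hz for row in spectrogram]
print(dominant_hz)
```

Each row of `spectrogram` is one column of the picture: time runs along the rows, frequency along the bins, and the magnitude is what gets drawn as color intensity. The strongest frequency switches from 16 Hz to 48 Hz halfway through, which a plain waveform plot would not show as clearly.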

Overview of the approach
For the .wav files, I used a subset of the training data from the Kaggle competition TensorFlow Speech Recognition Challenge. Google Colaboratory is used for training. It provides free GPU usage for 12 hours. It is not very fast but quite good for this project.

Audio files are sampled at a 16000 Hz sampling rate. Spectrograms are used to do Speech Commands Recognition. I wrote a small script to convert the .wav files to spectrograms. The spectrogram images are the input to a Convolutional Neural Network. Transfer learning is done on a ResNet34 which is pretrained on ImageNet. PyTorch is used for coding this project.

Stochastic Gradient Descent with Restarts (SGDR)
SGDR uses cosine annealing as the learning rate annealing technique to train the model. The learning rate is reduced at every iteration (not epoch) of gradient descent and, after completion of a cycle, it is reset, i.e., set back to the initial learning rate. This helps in achieving better generalization.

The idea is, if the model is at a local minimum where a slight change in parameters changes the loss a lot, then it is not a good local minimum. By resetting the learning rate, we allow the model to find a better local minimum in the search space.

In the above image, a cycle consists of 100 iterations. The learning rate is reset after every cycle. In every iteration, we gradually decrease the learning rate, which allows us to settle into a local minimum. Then, by resetting the learning rate at the end of a cycle, we check if the local minimum is good or bad. If it is good, then at the end of the next cycle, the model will settle into the same local minimum. But if it is bad, then the model will converge to a different local minimum. We can even change the length of the cycle, for example by making later cycles longer. This allows the model to dive deeper into the local minimum, reducing the loss.
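The schedule described above can be written down as a small function. This is a sketch of cosine annealing with restarts in plain Python, not the training code of this project. The function name, the default learning rates and the `cycle_mult` parameter (how much longer each cycle is than the previous one) are assumptions for illustration.

```python
import math

def sgdr_learning_rate(iteration, lr_max=0.1, lr_min=0.0, cycle_len=100, cycle_mult=1):
    """Cosine-annealed learning rate with restarts.

    cycle_len is the number of iterations in the first cycle; each later
    cycle is cycle_mult times longer than the previous one.
    """
    pos, length = iteration, cycle_len
    while pos >= length:        # skip over completed cycles
        pos -= length
        length *= cycle_mult
    cosine = 0.5 * (1 + math.cos(math.pi * pos / length))
    return lr_min + (lr_max - lr_min) * cosine

# Start of cycle, middle, last iteration, restart, middle of second cycle.
for it in (0, 50, 99, 100, 150):
    print(it, round(sgdr_learning_rate(it), 5))
```

Within a cycle the learning rate falls smoothly from `lr_max` towards `lr_min`, and at iteration 100 it jumps back to `lr_max`. In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` provides a schedule of this kind.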

Snapshot Ensembling
It is a technique used along with SGDR. The basic idea of ensembling is to train more than one model for a specific task and average out their predictions. Different models often give different predictions for the same input. So if one model gives the wrong prediction, another model may give the correct one.

In SGDR, we do ensembling with the help of cycles. Basically, every local minimum has a different loss value and gives different predictions for the data. When doing SGDR, we jump from one local minimum to another to find the best minimum in the end. But predictions from the other local minima can be useful too. So, we checkpoint the model parameters at the end of every cycle. At prediction time, we give the input data to every checkpointed model and average their predictions.
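The averaging step itself is simple. Here is a minimal sketch with made-up class probabilities from three snapshots for a single audio clip. The numbers, class names and the function `ensemble_predict` are hypothetical and only illustrate the mechanics.

```python
def ensemble_predict(snapshot_probs):
    """Average per-class probabilities from several snapshots and pick the best class."""
    n_models = len(snapshot_probs)
    n_classes = len(snapshot_probs[0])
    averaged = [sum(p[c] for p in snapshot_probs) / n_models for c in range(n_classes)]
    return averaged, averaged.index(max(averaged))

# Hypothetical class probabilities for one clip from three snapshots.
# Classes: 0 = "yes", 1 = "no", 2 = "stop"
snapshots = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],   # this snapshot alone would predict "no"
    [0.75, 0.25, 0.00],
]
averaged, predicted = ensemble_predict(snapshots)
print(averaged, predicted)
```

Even though the second snapshot votes for a different class, the averaged probabilities still favor class 0.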

Setup tweak to reduce training time
Training is being done on Google Colab. It provides a Tesla K80 GPU which is quite good for this task. One iteration of gradient descent takes around 1.5 to 2 seconds on this GPU. But when training is actually run, it takes around 80 minutes to train for a single epoch! This is because, with Colab's default setup, you cannot use more than 1 worker in PyTorch data loaders. If you try, PyTorch throws an error, interrupting the training abruptly.

But why does it take 80 minutes? It's because the task of getting the next batch ready is done on the CPU whereas only the gradient descent and weight updates are done on the GPU. When the weight updates are done, the GPU is idle, waiting for the next batch. So in this case, the CPU is busy most of the time and the GPU is idle.

When we specify the num_workers parameter in the data loader, PyTorch uses multiprocessing to generate the batches in parallel. This removes the bottleneck and ensures that the GPU is utilized properly.

How do we do this on Google Colab? Google Colab is based on a Linux system, and most Linux systems have a temporary partition named /dev/shm. This partition is used by processes as shared memory. It is backed by RAM, which means that it does not reside on the hard disk. PyTorch's data loader workers use this partition to hand over the batches they prepare.

Google Colab, by default, assigns a size of 64 MB to this partition. This is too small for a reasonable number of workers. It means that if we try to use num_workers, at some point during training this partition will overflow and PyTorch will throw an error. The solution is to increase the size of this partition. After increasing the size, we can use many workers for data loading. But how many num_workers should we use?
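Before and after resizing, it helps to check how large the partition actually is. The sketch below only reads the size using Python's standard library. It does not change anything, and the helper name `shared_memory_mb` is made up for this example.

```python
import os
import shutil

def shared_memory_mb(path="/dev/shm"):
    """Return (total, free) size of the shared-memory partition in MB, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total / 2**20, usage.free / 2**20

sizes = shared_memory_mb()
if sizes is None:
    print("/dev/shm not found on this system")
else:
    print(f"/dev/shm: {sizes[0]:.0f} MB total, {sizes[1]:.0f} MB free")
```

If the reported total is only a few dozen MB, a data loader with several workers is likely to run out of shared memory.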

It seems like using as many num_workers as possible is good. I did quite a few experiments with different sizes of /dev/shm and different num_workers. Here are the results.

It looks like using 64 workers is not the best option. Why do we get these results? When we specify a value for num_workers in our data loader, PyTorch tries to fill that many workers with batches before starting the training. So, when we specify num_workers=64, PyTorch fills 64 workers with batches. This process alone takes 2.5 to 3 minutes. These batches are then requested by our model. The model updates its weights based on these batches and waits for the next set of batches. This process only takes around 3 to 5 seconds. In parallel, the CPU is preparing the next set of batches. In Google Colab, there is only one CPU. So after updating the weights, the GPU is again idle, waiting for the CPU. Again, there is a wait of around 2 minutes. This process continues. That's why it took around 10 minutes for training when using a very large number of workers.

So, in choosing the number of workers, there is a tradeoff between the time required by the model to update weights and the time required by the CPU to generate the next set of batches. We have to choose num_workers by taking into consideration these times. By choosing 8 workers, we are able to reduce training time by 96%. You can check this tweak in this Jupyter Notebook.

After all this hassle, I was finally able to train my model. The model achieved an accuracy of 90.4%. This result can be improved by different techniques. Some of them are:

  • Data Augmentation - I have not used any data augmentation in my data. There are many data augmentations for audio data like time shift, speed tune etc. You can find more about data augmentation here.

  • Combining Mel Spectrograms + MFCC - The current model gives predictions based only on spectrograms. The CNN does feature extraction and the classifier (fully connected layer) does the job of finding the optimal hyperplane from the output features of the CNN. Along with those features, we can also give the classifier the MFCC coefficients. These will increase the number of features a bit, but MFCCs will give extra information about the audio file to the classifier, which may help in improving the accuracy. Appropriate regularization will be needed to avoid overfitting.

  • Use a different type of network - As we have seen with audio data, it has a time dimension. For such cases, we can use RNNs. In fact, for audio recognition tasks, there are approaches which combine CNN and RNN which yield better results than using only CNN.

Source: HOB