In this video, we're going to create a convolutional neural network to distinguish between different sounds: air conditioners, car horns, and so on. We can do this in about 50 lines of Keras code; it isn't too difficult. But before thinking about modeling, we need to understand our data.

The sounds we're going to distinguish come from the UrbanSound8K dataset. It consists of over 8,000 sound clips across 10 categories: children playing, dog barks, jackhammers, idling engines, air conditioners, street music, gunshots, sirens, drilling, and car horns. To visualize the audio signals, we're going to convert them into spectrograms: 2D visual representations of how frequency content varies over time. Once we have a grid-like topology for our data, it can be processed by a convolutional neural network.

So here's a question: why CNNs for this? Well, convolutional neural networks can discriminate spectrotemporal patterns. In other words, they are able to capture patterns across both time and frequency in the input spectrograms, which is important for distinguishing between noise-like sounds, like the sounds in our dataset. Furthermore, they have an edge over traditional mel-frequency cepstral coefficients (MFCCs): convolutional neural networks can still make distinctions even when the sound is masked in time and frequency by other noise. However, there hasn't been much advancement in audio classification with CNNs because of one main drawback: the sheer amount of data required to train the model. We don't have enough labeled sound samples, but there's a solution to this problem: data augmentation. Data augmentation involves generating new sounds by introducing slight distortions into the original data, small enough that the original label is still valid. This makes the model invariant to small changes in the input and hence lets it generalize better, which is exactly what deep neural networks try to achieve.

Let's take a look at the code and I'll explain the details along the way. We start with the following library imports. Keras is our deep learning framework, built on a TensorFlow backend, which we use to create and train our convolutional neural network. Librosa lets us read, write, and play around with audio files. NumPy is a math library for expressing inputs, feature maps, and outputs as matrices. Pandas reads and manipulates tabular data using DataFrames; we use it for reading the metadata. And Random is used to shuffle our dataset.

We begin by reading the metadata of the UrbanSound8K dataset. Each row has information about an audio file: the name, the label, and the start and end times. Most of these clips are about 3-4 seconds long. For uniform input to the model, I don't want stray sound clips that are too short to derive any useful information from, so I get rid of those by computing the duration. I then convert the files to log-scaled mel spectrograms using Librosa, considering only the first 3 seconds of each clip to keep the input size uniform.

Here's an example of a mel spectrogram of a siren clip. Note that the color indicates loudness: the brighter the color, the louder the sound. And here's a spectrogram for an air conditioner, here's one of children playing outside, and here's one for a drilling machine. Now we iterate over all valid data samples, which we conveniently have in a list from the metadata, and create a mel spectrogram for each.
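To make that extraction step concrete, here's a minimal sketch of what the Librosa side might look like. The function name and exact parameters (sample rate, hop length) are my own assumptions for illustration, not necessarily what the notebook uses.

```python
import librosa
import numpy as np

def clip_to_log_mel(path, duration=3.0, sr=22050, n_mels=128):
    """Load the first `duration` seconds of a clip and return a
    log-scaled mel spectrogram (roughly n_mels x time_frames)."""
    # Load the audio at 22.05 kHz, keeping only the first 3 seconds
    y, sr = librosa.load(path, sr=sr, duration=duration)
    # 128 mel bands; a 512-sample hop gives ~23 ms frames, so ~128 frames for 3 s
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=512)
    # Convert power to decibels to get the log-scaled spectrogram
    return librosa.power_to_db(mel, ref=np.max)
```

In the notebook, something like this would be applied to every valid row of the metadata, stacking the resulting 128 x 128 arrays into one big input array.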
This is done by dividing the frequency range, 0 to 22.05 kilohertz, into 128 mel components. That roughly covers the range of audible sound; well, technically hearing starts at 20 hertz, but you get the idea. For time, the 3 seconds of input are divided into frames of about 23 milliseconds, which gives 128 components across time as well. So the overall input is a 128 x 128 matrix of real numbers. An input sound may be longer than 3 seconds, but we take a contiguous random 3-second window of the waveform to build the mel spectrogram. We end up with 7,467 valid samples, which are fed into our convolutional neural network. Next, we shuffle the dataset, split it into training and test sets, reshape the inputs to 128 x 128 x 1 for CNN input, and one-hot encode the class labels for the 10 classes. I'll now describe the model, which I reconstructed from a recent research paper; I'll link the paper in the description below this video.

Okay, so now let's build the model. Like I said before, we have 128 x 128 log-scaled mel spectrograms as input. We start with a convolution layer of 24 5 x 5 kernels with a stride of 1. Using these small 5 x 5 kernels during convolution, we can discover time-frequency signatures in the input; these are distinct for different sounds and hence help us distinguish them despite noise interference. The output of this convolution is a 124 x 124 feature map, and since we use 24 such filters, the output volume is 124 x 124 x 24. To this output we apply max pooling with 4 x 2 pools. The output for one slice of the input volume is 31 x 62, and since pooling is applied to all 24 feature maps of the convolved input, the output volume is 31 x 62 x 24. Then we apply the ReLU activation, which doesn't change the dimensionality.

Once this is done, we apply another sequence of convolution, activation, and pooling. In the next convolution layer, the input is convolved with 48 5 x 5 filters. The convolution with each filter, without padding, gives a 27 x 58 feature map; since we have 48 such filters, the output is a 3D volume of shape 27 x 58 x 48. Once again, we downsample the features with 4 x 2 max pooling, with strides of 4 along the height and 2 along the width, just like the last pooling layer. This gives a 6 x 29 output for a single slice, and since pooling is applied to every one of the 48 slices, the output is a 3D volume with the same depth of 48. To this 3D volume, we apply another ReLU activation.

Next we apply one more round of convolution and activation, with no pooling. The convolution is similar to before: the input volume is convolved with 48 5 x 5 filters without padding, leading to a 3D volume of shape 2 x 25 x 48. The following ReLU activation doesn't change its shape.

We're now at the final part of our model, the fully connected layers. The 2 x 25 x 48 volume is flattened into a 2,400-dimensional vector, which is an admissible form for the fully connected layers. We then apply a hidden layer of 64 neurons with an activation, followed by dropout. The final layer is a softmax layer with 10 neurons, since we have 10 classes of output sounds. And so we have defined our convolutional neural network. If you want to know exactly what each layer does and how these dimensions are computed, check out my video on convolutional neural networks; I explain everything there, and I'll link it in the description below.
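Here's roughly how that architecture translates into Keras. The layer sizes follow the description above, but the hidden-layer activation (I've assumed ReLU) and the dropout rate are my own guesses rather than values confirmed in the video.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout

model = Sequential()

# Block 1: 24 filters of 5x5, then 4x2 max pooling -> 31 x 62 x 24
model.add(Conv2D(24, (5, 5), strides=(1, 1), input_shape=(128, 128, 1)))
model.add(MaxPooling2D(pool_size=(4, 2)))
model.add(Activation('relu'))

# Block 2: 48 filters of 5x5, then 4x2 max pooling -> 6 x 29 x 48
model.add(Conv2D(48, (5, 5)))
model.add(MaxPooling2D(pool_size=(4, 2)))
model.add(Activation('relu'))

# Block 3: 48 filters of 5x5, no pooling -> 2 x 25 x 48
model.add(Conv2D(48, (5, 5)))
model.add(Activation('relu'))

# Fully connected part: flatten to 2,400, hidden layer of 64, softmax over 10 classes
model.add(Flatten())
model.add(Dense(64, activation='relu'))   # activation assumed to be ReLU
model.add(Dropout(0.5))                   # dropout rate assumed
model.add(Dense(10, activation='softmax'))
```

With no padding and pooling strides equal to the pool size (the Keras defaults), model.summary() should report the same shapes worked out above: 124 x 124 x 24, 31 x 62 x 24, 27 x 58 x 48, 6 x 29 x 48, 2 x 25 x 48, and finally the 2,400-dimensional flattened vector.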
The code you see here is just what I explained, plain and simple in Keras. Now we compile the model using the Adam optimizer and measure loss using categorical cross-entropy, since this is a classification problem; I report the performance of the system using simple accuracy. The model is then trained on about 7,000 training samples over 12 epochs with a batch size of 128, which just means the weights are updated after every batch of 128 samples. We then evaluate the performance of our model with simple accuracy and print it on screen. The current convolutional neural network gets us 70-75% accuracy. Not bad. But let's see if we can improve on that using data augmentation.

Like I said before, data augmentation involves distorting the training data only to an extent that the original label still holds for the distorted data. To augment means to add, so we're basically adding this new data to our dataset to create a larger one. Let's look at two ways of distorting data, at least with respect to audio. The first is varying the speed of the audio signal: the clip can either be slowed down or sped up. The second is varying the pitch of the input without changing the duration of the clip. For speed, one distortion is 0.81 times the original speed, which slows the clip down, and the other is 1.07 times, which speeds it up. For pitch, each sample is pitch-shifted by 2 semitones and by 2.5 semitones. We could try other transformations, but this already generates a lot of data, and my computer just doesn't have enough space: the original dataset alone is 6 gigabytes, and since every type of distortion creates a new set of samples, even the 4 distortions I use come out to about 30 gigabytes. So clear some space on your computer before you do this. You can try other distortions on audio data as well, as long as they don't change the label.

Our augmented dataset has about 35,000 training samples, as opposed to the previous 7,000. We pass this augmented dataset through our model and see that we can distinguish these sounds with an overall accuracy of around 82%. That's better than the 75% we obtained earlier, but whether it's actually worth performing data augmentation is subjective: accuracy went up, but training took much longer, and there was also the time spent generating the distorted samples. In any case, data augmentation is still a decent way to get your hands on more data. The link to this notebook is in the description below.

Perhaps an even more fun method, and probably a more valid one, would be to use GANs, Generative Adversarial Networks, to generate completely new audio data. That way we'd get more variety than simple distortion provides. But that's a topic for another video. Thanks for stopping by today, and if you liked the video, click that like button and subscribe for more awesome content, and I will see you in the next one. Bye!
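For reference, here's a minimal sketch of the training call and the two audio distortions described above. The variable names (x_train, y_train, y, sr) and the use of Librosa's time_stretch and pitch_shift are my assumptions about how this would be done, not a copy of the notebook.

```python
import librosa

# Compile and train the model defined earlier (data variable names are placeholders)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=12, validation_data=(x_test, y_test))
loss, acc = model.evaluate(x_test, y_test)
print('Test accuracy:', acc)

# The two kinds of distortion from the video, applied to a raw waveform y at sample rate sr
def change_speed(y, rate):
    """Speed the clip up (rate > 1) or slow it down (rate < 1)."""
    return librosa.effects.time_stretch(y, rate=rate)

def change_pitch(y, sr, semitones):
    """Shift the pitch by a number of semitones without changing the duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)

# The four distortions used in the video
augmented_clips = [
    change_speed(y, 0.81),      # slowed down
    change_speed(y, 1.07),      # sped up
    change_pitch(y, sr, 2.0),   # shifted by 2 semitones
    change_pitch(y, sr, 2.5),   # shifted by 2.5 semitones
]
```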