Let's go through the model-creation-with-Keras part. This part of the activity covers six steps. Please do not confuse these six steps with the five generic steps mentioned in the previous part; these six steps relate only to model creation in Keras. So what should we do? First, we gather the data, we collect it: we build the learning dataset. Today we will build the model of an audio scene classification neural network, and to do that we will use a public dataset. Having the raw audio learning dataset, we need to prepare the data; we need to do some digital signal processing. I will explain the details later on, but in general the second step is to frame the audio signal, and the third preprocessing step is the log-mel spectrogram; I will also explain those details later on. Then, having the learning dataset ready, we can build the model: this is the fourth step, building the model in the Keras library. The fifth step is to train the model, in fact the core activity of this part. And the sixth step is to evaluate the model; by evaluation I mean testing the model's accuracy against new, unknown data, the so-called test dataset. We are building an audio scene classification application. As input we have a time-domain signal, just the raw audio signal. What can we do? We can transform the time-domain signal into a frequency-domain signal: based on the input time-domain raw audio signal, we build spectrograms. What is a spectrogram? A spectrogram is just a picture. So what is the conclusion? We can use the well-known neural network structure for picture recognition, the convolutional neural network, to recognize the audio scene. Why? Because we have transformed the audio, the time-domain signal, into a picture, a frequency-domain signal. So this is our signal transformation flow: time domain, frequency domain, then a picture as input data for the convolutional neural network.
And then three classes to be classified: indoor, outdoor and in-vehicle. What about the learning dataset? As I said before, we will use a public dataset, collected by Helsinki University. It is 20 gigabytes of raw data composed of 30-second-long, let's say atomic, recordings. They recorded a larger number of audio scenes, acoustic scenes: 15 scenes in total. We need to keep in mind that our microcontroller is limited in terms of computational power, so we decided to decrease the number of acoustic scenes down to three classes: indoor, outdoor, in-vehicle. We also decided to decrease the sampling rate from 44.1 kHz down to 16 kHz, to change the recording mode from stereo to mono, and to convert the samples from 24-bit integers to single-precision floating point. Okay, our tool to build the model is Python source code; Python is the basic stuff in this case. And I have a general remark: please do not analyze the Python code line by line now, it doesn't make sense; you can do it later on. My goal, my idea, is to show you the flow of the development chain. So, as a first step of building the learning dataset, we download the raw audio dataset from its public location here, using Python code. As the result of this download process, we have 1,170 thirty-second-long segments as development samples and 390 thirty-second-long segments as evaluation samples. A lot of data. Okay, the data preparation. What should we do now? To perform the data preparation, at least basic, I would say medium, digital signal processing knowledge is needed. This practice, the framing of the signal, is well known to DSP experts. So the first step of the data preparation is to frame the signal: we need to slice each 30-second-long atomic recording into overlapping 64-millisecond-long frames. This is the basic frame here, the yellow one, 64 milliseconds long.
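As a minimal sketch of that dataset reduction, assuming NumPy and an illustrative function name (the naive linear-interpolation resampling below is only a stand-in; a real pipeline would low-pass filter first, e.g. with a polyphase resampler):

```python
import numpy as np

FS_IN, FS_OUT = 44100, 16000  # original and target sampling rates

def preprocess_recording(stereo: np.ndarray) -> np.ndarray:
    """Downmix 24-bit stereo samples to mono float32 at 16 kHz."""
    mono = stereo.astype(np.float64).mean(axis=1)   # stereo -> mono
    mono /= 2 ** 23                                 # 24-bit ints -> [-1, 1] floats
    t_in = np.arange(len(mono)) / FS_IN
    t_out = np.arange(int(len(mono) * FS_OUT / FS_IN)) / FS_OUT
    # Naive linear-interpolation resampling (illustration only).
    return np.interp(t_out, t_in, mono).astype(np.float32)

# A 30-second stereo recording becomes 30 * 16000 = 480000 mono samples.
stereo = np.random.randint(-2 ** 23, 2 ** 23, size=(30 * FS_IN, 2))
mono16k = preprocess_recording(stereo)
print(mono16k.shape)  # (480000,)
```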
Why 64 milliseconds long? It was an arbitrary decision of the DSP expert, based on his experience. And here is the second, overlapping window. The overlap ratio is 50%, so half of each frame overlaps the previous frame. Let me show you the numbers. As you can see here, we have a 30-second-long atomic recording, and we decided to slice this atomic recording into 64-millisecond-long frames. So 30 seconds divided by 0.064 seconds is 468, taking only the integer part of the result. So we have 468 frames of 64 milliseconds. But the frames overlap with a ratio of 50%, so we need to multiply this integer part by 2, because of the 50% overlap ratio. So 468 multiplied by 2 gives, as the final result, 936 frames. Each frame consists of 1,024 ADC samples; just to remind you, our sampling frequency is 16 kHz. The goal of the overlapping is to avoid boundary discontinuities during the FFT transformation. And this is a more detailed explanation of the framing: here we have 32 frames, each 64 milliseconds long. What does it mean? As you know from the previous slide, the frame length is 64 milliseconds and the overlap ratio is 50%, so the stride is 64 milliseconds divided by 2, that is, 32 milliseconds. And again, by an arbitrary decision of the DSP expert, we decided to slice the 30-second-long atomic recording into roughly 1-second-long parts, where 1 second is the equivalent of one picture. Why 1 second? Because we take 32 frames, and the stride time is 0.032 seconds, so 0.032 multiplied by 32 is 1.024 seconds. So here we have time and frequency: this is the spectrogram of a 1-second-long, or to be more precise 1.024-second-long, audio recording. The x-axis is time, the y-axis is frequency, and the color of this FFT spectrogram corresponds to the magnitude of the audio signal. So this is the framing, and as a result we can perform the FFT. This picture has one disadvantage: the useful signal is present only in a small area of the picture.
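The framing arithmetic above (30 s at 16 kHz, 1,024-sample frames, 50% overlap, giving 936 frames) can be sketched in a few lines of NumPy; the function name is illustrative, not taken from the course scripts:

```python
import numpy as np

FRAME_LEN = 1024   # 64 ms at 16 kHz
HOP = 512          # 50% overlap -> 32 ms stride

def frame_signal(x: np.ndarray) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames of FRAME_LEN samples."""
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    idx = np.arange(FRAME_LEN)[None, :] + HOP * np.arange(n_frames)[:, None]
    return x[idx]

x = np.zeros(30 * 16000)   # a 30-second recording at 16 kHz
frames = frame_signal(x)
print(frames.shape)        # (936, 1024)
```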
So let's think about how to magnify this area, which will for sure increase the accuracy of the neural network in terms of picture recognition. Just to remind you, we are building an audio scene classifier, and we want to mimic the human perception of audio scenes; this is the goal. So we need to reproduce the mechanisms of the human auditory system, the biological mechanisms. Our perception of the magnitude of an audio signal is logarithmic, but our human perception of the pitch of the signal is also logarithmic: we are much better at differentiating between frequency differences at lower frequencies than at higher frequencies. This chart shows this feature. The equation has been developed by scientists in the laboratories; it is taken from practice, a kind of mathematical approximation of the phenomenon. And here you can see: let's take the frequency difference between 300 Hz and 500 Hz. Our perception of this difference is much better than our perception of the same 200 Hz difference between 3.2 kHz and 3.4 kHz. We can reuse this relation, this chart, as a kind of magnifying glass to magnify that area of the picture. We can approximate the curve by a set of triangle-shaped filters, the so-called mel filter bank: one filter here, the second filter here, and so on, just to approximate this curve. So again: we have the raw signal; then we frame the signal to perform a good-quality FFT and avoid frame-boundary discontinuity problems; then we apply the mel filter bank to approximate the logarithmic characteristic of pitch perception; and then we present the result on a logarithmic scale. As a result we have quite a nice picture. This picture corresponds to this area, and it contains much more useful information than the raw FFT.
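One common form of the mel-scale equation described above (an empirical approximation; the lecture's chart may use a slightly different variant) makes the 300 Hz vs. 3.2 kHz comparison concrete:

```python
import numpy as np

def hz_to_mel(f):
    # A widely used empirical mel-scale formula (pitch perception approximation).
    return 2595.0 * np.log10(1.0 + f / 700.0)

# The same 200 Hz gap is perceptually much wider at low frequencies:
low = hz_to_mel(500) - hz_to_mel(300)      # roughly 205 mel
high = hz_to_mel(3400) - hz_to_mel(3200)   # roughly 56 mel
print(round(low, 1), round(high, 1))
```

So the 300-500 Hz gap spans several times more of the mel axis than the 3.2-3.4 kHz gap, which is exactly why the mel filter bank acts as a magnifying glass on the low-frequency area of the spectrogram.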
So as a result we have a set of 1.024-second-long pictures, and this is the Python code to prepare those pictures. Just to remind you: we have a set of 30-second-long atomic recordings, and each 30-second-long raw audio recording is transformed into the frequency domain; as a result of this transformation we get 29 pictures. So each 30-second-long atomic recording is represented as 29 spectrograms. Why 29? Each spectrogram covers a 1.024-second-long part of the recording, so we need to divide the 30-second atomic recording by 1.024; taking only the integer part, we have 29 pictures. And as the final result of this preprocessing code we have 33k, almost 34k, pictures at 32-pixel resolution. Okay, the next step after the preparation of the learning dataset input is quite technical. The picture is the input to our neural network, but we also know what each picture represents; for example, this particular picture represents the outdoor audio scene. This is our ground truth data: let's say picture number n represents class number m. We have three classes, and we can assign a number to each class starting from 0: let's say 0 means indoor, 1 means outdoor, 2 means in-vehicle. And because we have a lot of pictures, let's consider the development dataset, 34k pictures: this is our input data, and we also have the output data, the expected result of the neural network, the ground truth. This is a vector, 34k items long. And here is the technical part: because Keras expects us to provide a binary class matrix instead of a vector, we need to transform the vector into a matrix. This is a purely technical operation. The next technical step is standardization. This is a mathematical operation whose goal is to remove the mean and scale the input data to unit variance. Why? Because we want to avoid saturation during MCU data processing. So: standardizing the data.
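Both technical operations, the vector-to-binary-matrix transformation and the standardization, can be sketched in plain NumPy (the label values here are made up for illustration; in Keras the first step is what `keras.utils.to_categorical` does):

```python
import numpy as np

# Hypothetical labels for five spectrograms: 0=indoor, 1=outdoor, 2=in-vehicle.
labels = np.array([0, 2, 1, 1, 0])

# Label vector -> binary class matrix (one-hot encoding).
one_hot = np.eye(3)[labels]
print(one_hot[1])  # [0. 0. 1.] -> picture 1 belongs to class 2, in-vehicle

# Standardization: remove the mean and scale to unit variance.
x = np.random.randn(100, 32, 32) * 7.0 + 3.0   # fake spectrogram batch
x_std = (x - x.mean()) / x.std()
print(round(x_std.mean(), 6), round(x_std.std(), 6))  # 0.0 1.0
```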
Okay, this slide is important to catch the general idea of the neural network learning dataset. To train the neural network, we need a dataset. But as you have seen, for example on this slide, we have two datasets: the development dataset, which is bigger, and the evaluation dataset, or test dataset, whose size is smaller than the development dataset. What is behind this? The general learning dataset has the development dataset as subset number one and the evaluation dataset as subset number two. Taking a school example, the development dataset is the material for the lecture: we have the lecture, and after the, let's say, school term, we have a kind of validation test. The main feature of this end-of-term test is that the material is known, because the teacher told the pupils that the test will cover the already-learned material. The evaluation dataset, subset number two, is used for testing the neural network, and its main feature is that it has not been presented to the neural network during development, during learning. The evaluation dataset is the equivalent of the final university exam: the questions are, or should be, not known to the pupil before the final exam. This subset is split into two parts: the test data itself and the partial test. The size of the partial test is much lower than the full test, for practical reasons. For example, consider testing the neural network running on top of the microcontroller, with a virtual COM port as the channel providing input data to the neural network. The virtual COM port runs, of course, over USB, and the bandwidth of USB is limited; it would not be very practical to send gigabytes of data over USB. That's why we can define the partial test, say a few megabytes, several megabytes, and use USB to test the neural network.
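The splitting described above (development into train and validation, evaluation into full test and a tiny partial test) could be sketched like this, with picture counts matching the lecture's numbers but the function name and index arrays being illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

def split(data, n_first):
    """Randomly split an array into two disjoint parts."""
    idx = rng.permutation(len(data))
    return data[idx[:n_first]], data[idx[n_first:]]

# Hypothetical picture indices: 1170 * 29 development, 390 * 29 evaluation.
dev_pictures = np.arange(1170 * 29)        # 33930 spectrograms
eval_pictures = np.arange(390 * 29)        # 11310 spectrograms

train, val = split(dev_pictures, 25000)    # lecture material vs. end-of-term test
partial, _ = split(eval_pictures, 114)     # tiny subset that fits over USB
print(len(train), len(val), len(partial))  # 25000 8930 114
```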
Okay, that was the general explanation, and this slide shows the Python code which splits the dataset into development, validation, test and partial test datasets. For training, the development dataset, we have 25k pictures; for validation, and just to remind you, validation means the end-of-term exam, about 8k pictures. The final exam, the evaluation dataset of the neural network, consists of 11k pictures, and the partial test sample consists of 114 pictures, because we will use USB as the communication channel and for practical reasons we need to limit the amount of data. And what is very nice about Python and Keras: this quite complex operation is done using only one line of source code here. So now we have preprocessed the data and prepared the development and evaluation datasets, and it's time for the, I would say, core activity: building the neural network model. Because we are in fact recognizing pictures, we will use the well-known neural network structure for picture recognition, the convolutional neural network. And again you can see the big advantage of the Keras library, because we need only one line to build, or to define, each layer of the neural network. In the first line we define the type of the model; it will be sequential, which means we define the neural network layer by layer. The first layer is a 2D convolutional layer. The second layer is max pooling. Then again convolutional 2D and max pooling. Then we need to flatten the data to fit the dense layer, and the next point is the definition of the dense layer, the brain of our neural network. The output layer consists of three elements, because we have three classes: indoor, outdoor and in-vehicle. This is a different, graphical representation of the neural network, and this is another representation of the same network.
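A sketch of that sequential model in Keras might look like the following; the filter counts and kernel sizes here are illustrative assumptions, not the exact values from the slides, but the layer sequence (conv, pool, conv, pool, flatten, dense, three-way output) matches the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),               # one 32x32 log-mel spectrogram
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolutional 2D layer
    layers.MaxPooling2D((2, 2)),                   # max pooling
    layers.Conv2D(32, (3, 3), activation="relu"),  # again conv 2D
    layers.MaxPooling2D((2, 2)),                   # and max pooling
    layers.Flatten(),                              # 2-D maps -> 1-D vector
    layers.Dense(64, activation="relu"),           # the "brain"
    layers.Dense(3, activation="softmax"),         # indoor / outdoor / in-vehicle
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 3)
```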
This is the representation of the neural network generated by a Python script. And here is a basic explanation of the convolutional layer. A convolutional layer is just a kind of digital filter, a very basic one, as you can see in this picture. Let's consider a 2D input to the neural network, a 2D matrix of numbers. For the convolution filter we need to define a window of the kernel size, in our case 2 by 2, and a stride value, the step by which the filter window moves, in our case one by one, so we move one field at a time across the matrix. And what is the operation behind it? Let's consider this green area. We take each input value multiplied by the weight of the corresponding neuron input: 8 multiplied by 0.5, 10 multiplied by 0.1, 3 multiplied by 0.2, 15 multiplied by 0.8. Then we sum, accumulate, all the values, and as a result we have 17.6. Here is another, I would say analog, representation of the operation, and here a more digital one. Just to simplify the picture, I assumed that all the weights are the same and equal to 0.5. As a result we decrease the amount of data; this is in fact the main feature of the convolutional layer filter, to decrease the amount of data. For the max pooling layer the explanation is similar, and the idea is even simpler. Again we have a moving window with a stride value, and we just take the maximum value within the moving window. For example, for this window we have 15, for the magenta one 9, for the green one 8. Again the amount of data is decreased. This kind of filter is, for example, very good at detecting the edges of a shape. And the fully connected layer, the dense layer, is the brain of the neural network. Okay, the training process. Having the learning dataset and the neural network structure, we can start the training process, and again this is only one line using Keras and Python.
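The convolution and max-pooling operations above can be reproduced in a few lines of NumPy; the input and weight values are exactly the worked example from the slide (8, 10, 3, 15 against weights 0.5, 0.1, 0.2, 0.8):

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid 2-D convolution (technically cross-correlation, as in CNNs)."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * w)  # multiply-accumulate
    return out

def maxpool2d(x, size=2):
    """Non-overlapping max pooling: keep the maximum of each window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# The worked example: 2x2 kernel over the green area.
x = np.array([[8.0, 10.0], [3.0, 15.0]])
w = np.array([[0.5, 0.1], [0.2, 0.8]])
print(conv2d(x, w))   # [[17.6]]  = 8*0.5 + 10*0.1 + 3*0.2 + 15*0.8
print(maxpool2d(x))   # [[15.]]   max pooling keeps the largest value
```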
Before that we need to select the optimizer and compile the model. As a result of training we get the loss function values. I think I should explain the loss function, which shows us the quality of the learning process; of course, the lower the loss, the better the result of the neural network learning process. We need to go back to school, at least for a minute, because the basic idea behind minimizing the loss function is the gradient. What is a gradient? The gradient points in the direction of the fastest increase of a function, so using the gradient we can find the maximum of the function. But to be clear: the gradient does not point at the maximum of the function; it points in the direction of the maximum. Let's consider a very simple function of three variables, like here. The gradient of this function is the set of partial derivatives of the function: df/dx, df/dy, df/dz. Let's evaluate the partial derivatives: df/dx is 2x, df/dy is 2y, df/dz is 3z squared. And let's pick the starting coordinates from which to look for the maximum of the function; assume we start from x = 1, y = 2, z = 3. This is our starting point in 3D space. We can plug the current coordinates into the gradient, as you can see here, and evaluate another point in the space: 2, 4, 27. So the maximum of the function, starting from this point in space, lies in this direction; we draw the line between the starting point and that end point, and this is the direction. Again, just to highlight: this is not the point of the maximum of the function, this is the direction towards the maximum. And now the really basic question is: how far should we go in this direction to find the maximum of the function? We can take a small step and then evaluate the next direction using the gradient again.
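Working backwards from the partial derivatives quoted above (df/dx = 2x, df/dy = 2y, df/dz = 3z²), the example function is f(x, y, z) = x² + y² + z³, and the gradient at the starting point (1, 2, 3) is indeed (2, 4, 27). A quick check, with a numerical central-difference verification:

```python
import numpy as np

def f(p):
    x, y, z = p
    return x ** 2 + y ** 2 + z ** 3

def grad(p):
    x, y, z = p
    return np.array([2 * x, 2 * y, 3 * z ** 2])  # (df/dx, df/dy, df/dz)

p0 = np.array([1.0, 2.0, 3.0])
print(grad(p0))  # [ 2.  4. 27.] -> direction of fastest increase at p0

# Sanity check against numerical (central-difference) derivatives.
eps = 1e-6
numeric = np.array([(f(p0 + eps * e) - f(p0 - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(p0), atol=1e-4))  # True
```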
We can take a medium step and evaluate the gradient, the direction, again; or we can take a big step and then evaluate the direction. Why is this so important? Because with too big a step we can simply overshoot the maximum of the function. So that was the gradient explanation; now let's come back to the loss function. What is a loss function? A loss function maps the values of one or more variables, in our case the neural network weights and biases, so really a lot of variables, sometimes billions, onto a single real number which represents some cost associated with those values. And our goal is to minimize the loss function. On the previous slide we discussed a very simple function of three variables; in our case we have a function of the neural network weights and biases, possibly billions of variables, so it is really demanding in terms of computational power. Okay, so the gradient points in the direction of the fastest increase of the function and can be used to find the maximum. What about the minimum of the function? We need to use the negative of the gradient: the gradient descent approach to find the minimum of the function, exactly the opposite of the basic rule. We should go not in the direction of the maximum, but in the opposite direction, to find the minimum. And again, the absolutely basic question is about the step size when trying to find the minimum: if the gradient descent step is too big, it is possible that we will never converge, while for a small gradient descent step we will for sure converge, we will find the minimum, but the time needed for the computation will be really huge, like days or weeks. Okay, how do we evaluate the model? How do we build the metrics of the model? The, I would say, most intuitive metric is accuracy.
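The step-size trade-off above can be demonstrated on the simplest possible loss, f(x) = x² with gradient 2x (a toy stand-in for the billions-of-weights case; the learning rates are illustrative):

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x**2 by repeatedly stepping against the gradient 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # move in the negative gradient direction
    return x

print(abs(gradient_descent(lr=0.1)))    # reasonable step: converges close to 0
print(abs(gradient_descent(lr=0.001)))  # tiny step: converging, but very slowly
print(abs(gradient_descent(lr=1.1)))    # too big: overshoots and diverges
```

With lr = 0.1 the iterate shrinks by a factor 0.8 each step; with lr = 1.1 each step multiplies it by -1.2, so it bounces across the minimum with growing amplitude, the "never converge" case.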
The most basic metric is just accuracy; you can see here the example value of 0.89. What does it mean? It means that for 89% of inferences the result was correct, so for 11% of the queries to the neural network the result of the inference was wrong. The test loss represents the error of the trained network on the test data, in fact the quality of the learning process: the lower the test loss, the better the quality of the neural network learning and, in fact, the better the accuracy. The test accuracy is quite a generic number; it is a kind of overview of the neural network's behavior, and we cannot see the class-wise errors. But there is another metric, the so-called confusion matrix, with which we can see the accuracy of the neural network class by class. As you remember, we have three classes to classify, indoor, outdoor and in-vehicle, and the size of the matrix you can see here is 3x3, because of the number of classes. Here you can see that for class 0, the indoor class, the neural network inference result is 0.99. It means that when the neural network was fed a picture representing an indoor spectrogram, for 99% of the cases the inference result was correct; for 1% of the cases the neural network confused the indoor class with the in-vehicle class. The same for the outdoor class: 98% proper inferences, with outdoor confused with indoor for 1% of cases and with in-vehicle for 1% of cases. And for in-vehicle, the result was correct in 100% of cases. The accuracy shown here is a little bit different from the overall accuracy, because it is just the mean of the diagonal of the matrix: 0.99. And how is the confusion matrix evaluation process done with X-CUBE-AI? We have custom data, for example in a CSV file. Then we feed the generated C model with this custom data. Then we have the expected output, the ground truth, so we know what we should expect.
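A confusion matrix and its diagonal-mean accuracy can be sketched in a few lines; the ground-truth and prediction values below are made up for illustration (in practice one would typically use `sklearn.metrics.confusion_matrix`):

```python
import numpy as np

CLASSES = ["indoor", "outdoor", "in-vehicle"]  # class 0, 1, 2

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows = true class, columns = predicted class."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# Hypothetical ground truth and inference results for ten queries.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 2, 1, 1, 0, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Class-wise accuracy: diagonal divided by each row's total
# (here 3/4, 2/3 and 3/3 correct); the lecture's "mean of the
# diagonal" accuracy averages those ratios.
per_class = cm.diagonal() / cm.sum(axis=1)
print(per_class, per_class.mean())
```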
And there is a dedicated Python script to evaluate and process the confusion matrix. I mentioned the CSV file here; this is a more technical slide showing that, using Python code, we can generate all the datasets, both the input dataset and the output dataset, in CSV format.
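A hypothetical sketch of that CSV generation (file names, shapes and the temporary directory are illustrative): the 114-picture partial test set is flattened and written as CSV so it can later be streamed to the C model on the MCU over the virtual COM port.

```python
import os
import tempfile
import numpy as np

# Fake partial test set: 114 spectrograms of 32x32 pixels plus their labels.
x_test = np.random.rand(114, 32, 32).astype(np.float32)
y_test = np.random.randint(0, 3, size=114)

out_dir = tempfile.mkdtemp()
x_path = os.path.join(out_dir, "partial_test_x.csv")
y_path = os.path.join(out_dir, "partial_test_y.csv")

# One CSV row per picture: 32*32 = 1024 values, and one label per row.
np.savetxt(x_path, x_test.reshape(114, -1), delimiter=",", fmt="%.6f")
np.savetxt(y_path, y_test[:, None], delimiter=",", fmt="%d")

loaded = np.loadtxt(x_path, delimiter=",")
print(loaded.shape)  # (114, 1024)
```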