And he's now embarked on an IoT startup called Sound Sensing. Today, he'll talk to us about a topic related to his thesis: audio classification with machine learning. Hi, thank you. So audio classification is not as popular a topic as, for instance, image classification or natural language processing, so I'm happy to see that there are still people in the room interested in it. First, about me: I'm an Internet of Things specialist. I have a background in electronics, and for the last nine years I've worked a lot as a software engineer, because electronics is mostly software these days, or at least a lot of software. Then I went on to do a master's in data science, because IoT, to me, is the combination of electronics (sensors especially), software (you need to process the data), and the data itself: you need to somehow convert sensor data into information that is useful. Now I'm consulting on IoT and machine learning, and I'm also the CTO of Sound Sensing. We deliver sensor units for noise monitoring. My goal in this talk, and we'll see if we get there, is that you, as a machine learning practitioner without necessarily any prior experience in sound processing, can solve a basic audio classification problem. We'll have a very brief introduction to sound, then go through a basic audio classification pipeline, then some tips and tricks for going a little bit beyond that basic pipeline, and then I'll give some pointers to more information. The slides, and a lot of my notes on machine hearing in general, which is a little broader than audio classification, are on this GitHub. So, applications. There are some very well-recognized sub-fields of audio. Speech recognition is one of them, and there you have, as a classification task, keyword spotting: "Hey Siri" or "Hey Google". In music analysis you also have many tasks; genre classification, for instance, can be seen as a simple audio classification task.
But we're going to keep it mostly on the general level. We're not going to use a lot of speech- or music-specific domain knowledge, and we still have examples across a wide range of things. I mean, anything you can do with hearing as a human, we can get close to with machines today, at least for many classification tasks. So in ecoacoustics, you might want to analyze bird migrations using sensor data to see their patterns, or you might want to detect poachers in protected areas, to make sure that no one is actually going around shooting where there should be no guns fired, and so on. It's used in quality control in manufacturing, especially because you don't have to go into the equipment or the product with your tests; you can listen to it from the outside. For instance, it's used for testing electric car seats, that the motors run correctly. In security, it's used to help monitor large numbers of CCTVs by also analyzing the audio. And in medical applications, for instance, you could detect heart murmurs, which can be indicative of a heart condition. So these are some motivating examples. Now, digital sound; I'll just go very briefly through this. The first thing that is important is that sound is almost always, basically always, a mixture. Sound will move around a corner, unlike an image, for instance, so you will always have sound coming from all around. It will also be transported through the ground and be reflected by walls. All these things mean you always have multiple sound sources: the source of interest, and then always other sound sources. In audio acquisition: sound is air pressure, so we have a microphone to convert it to an electrical voltage, then an ADC, and then we have a digital waveform, which is what we will deal with. It's quantized in time, with the sampling rate, and in amplitude. And we usually deal primarily with mono when we do audio classification, still.
There are some methods around stereo, but they're not widely adopted, and you could also have multi-channel. We typically use uncompressed formats; it's just the safest. Although in a real-life situation you might also have compressed data, which can have artifacts and so on that might influence your model. Once we have a waveform, we can convert it into a spectrogram, and this, in practice, is a very useful representation, both for a human understanding what the sound is and for the machines doing the detection. This one is a frog croaking, very periodically; you can see a little gap. And then it's hard to see, but at the top, in the higher frequencies, there are some cicadas going as well. This allows us to see both the frequency content and the patterns across time, and together this often allows you to separate the different sound sources in your mixture. So we'll go through a practical example, just to keep it hands-on: environmental sound classification. Given an audio signal of environmental sounds, these are everyday sounds that are around in the environment; outdoors, for instance, it can be cars, children playing, and so on. It's very widely researched, and we have several open data sets that are quite good. AudioSet is, I think, tens of thousands or even hundreds of thousands of samples. And in 2017 we reached roughly human-level performance; only one of these data sets has an estimate of human-level performance, but we seem to have surpassed that now. One nice data set is UrbanSound8K, which has 10 classes, about 8,000 samples, each roughly four seconds long, and nine hours in total. State of the art here is around 80%, so 79 to 82% accuracy. So now we have the spectrograms, and these are the easy samples. This data set has many challenging samples where the sound of interest is very far away and hard to detect, but these ones are easy. You see the siren goes very clearly up and down.
And jackhammers and drilling have very periodic patterns. So how can we detect this using a machine learning algorithm, in order to output these classes? I'll go through a basic audio pipeline, skipping over 30 to 100 years of history of audio processing and going straight to what is now the typical way of doing things. It looks something like this. First, as input, we have our audio stream. It's important to realize that audio has time attached to it, so it's more similar to a video than to an image. And in a practical scenario you might do real-time classification, so this might be an infinite stream that just goes on and on. For that reason, and also for the machine learning algorithm, it's important to divide the stream into relatively small analysis windows that you actually process. However, you often have a mismatch between how often you have labels for your data versus how often you actually want a prediction; this is known as weak labeling. I won't go much into it. In Urban Sound, there are four-second sound snippets, so that's what we're given in this curated data set. However, it's usually beneficial to use smaller analysis windows, to reduce the dimensionality for the machine learning algorithm. So the process goes like this: we divide the audio into these segments, often with overlap, and then we convert each into a spectrogram. We'll use a particular type of spectrogram called a mel spectrogram, which has been shown to work well. Then we pass that frame of features from the spectrogram into our classifier, and it outputs the classification for that small time window. And then, because we have labels per four seconds, we need to do an aggregation in order to come up with the final prediction for those four seconds, not just for one little window. We'll go through these steps now. So first, analysis windows. As I mentioned, we often use overlap.
The overlap is specified in two different ways. One is an overlap percentage; here we have 50% overlap, which means we're essentially classifying every piece of the audio stream twice. If we have even more overlap, say a 90% overlap, then we're classifying it 10 times. That gives the algorithm multiple viewpoints on the audio stream, and it makes it easier to catch the sounds of interest, because the model, in training, may have learned to prefer a certain sound appearing in a certain position inside the analysis window. So overlap is a good way of dealing with that. I mentioned we use a specific type of spectrogram: the spectrogram is processed with so-called mel scale filters. These are inspired by human hearing. Our ability to differentiate sounds of different frequencies is reduced as the frequency gets higher: for low sounds we're able to detect small frequency variations, whereas for high-pitched sounds we need large frequency variations to notice a difference. By using these filters on the spectrogram, we obtain a representation more similar to our ears. But more importantly, it's a smaller representation for a machine learning model, and it merges related data from consecutive frequency bins, for instance. When you've done that, it looks something like this. The top is a normal spectrogram; you see a lot of small details. In the bottom one, we've started the mel filters at 1,000 hertz. This is bird audio, which is quite high-pitched, a lot of chirps going up and down. And in the third one, we've normalized the data. We usually use log-scale compression, because sound has a very large dynamic range: sounds that are faint versus sounds that are very loud, for the human ear, differ by a factor of 1,000 or 10,000 in energy.
So when you've applied the mel filters, log-scaled, and normalized, it looks something like the image below there. In Python, this feature processing looks something like this; I'm not going to go through all the code in detail. We have an excellent library called librosa, which is great for loading the data and doing basic feature pre-processing. Some of the deep learning frameworks also have their own mel spectrogram implementations that you can use. But one general thing about streaming: when people analyze audio, they often apply normalization learned from the mean across the whole sample, four seconds in this case, or across the whole data set. That can be hard to apply when you have a continuous audio stream with, for instance, changing volume. So what we usually do is normalize per analysis window, and the hope is that roughly one second of data contains enough information to do a decent normalization. Doing normalization like this has an interesting consequence when there is no input: if there's no input to your feature processing, you're going to blow up all the noise. So you'll sometimes need to exclude very low-energy signals from being classified. That's just a little practical tip. So, convolutional neural networks; they're hot. Who here has at least a basic familiarity, has gone through a tutorial or read a blog post about image classification and CNNs? Yeah, that's quite a few. So CNNs are best in class for image classification, and spectrograms are image-like: they're a 2D representation, though with some differences. So the question is: will CNNs work well on spectrograms? Because that would be interesting. And the answer is yes. This has been researched quite a lot, and it's great, because there are a lot of tools, knowledge, experience, and pre-trained models for image classification.
Being able to reuse those in the audio domain, which is not such a big field, is a major win. You'll see this in a lot of the research lately; audio classification research can be a little bit boring, because a lot of it is taking last year's image classification tools, applying them, and seeing whether they work. It is, however, a little bit surprising that this works, because the spectrogram has frequency on the y-axis, typically, and time on the other axis. So movement or scaling in this space doesn't mean the same as in an image. If my face is in an image, it doesn't matter where in the image it appears. But if you have a certain sound in a spectrogram, maybe a chirp going up and down, and you move it up or down in frequency, at least if you move it a lot, it's probably not the same sound anymore; it might go from a human talking to a bird. The shape might be similar, but the position matters. So it's a little bit surprising that this works, but it does seem to do really well in practice. This is one model that does well on Urban Sound, and one thing you'll note, compared to a lot of image models, is that it's quite simple; relatively few layers, smaller than or about the same size as LeNet. There are three convolutional blocks, with max pooling between the first two blocks, and that's the standard kind of architecture. This one uses 5x5 instead of 3x3 kernels, which doesn't make much of a difference; you could stack another layer and do the same thing. Then we flatten, and use a fully connected end. This is from 2016, and it's still close to the state of the art on the Urban Sound data set. OK. So if you are training a CNN from scratch on audio data, do start with a simple model. There is usually no reason to start with, say, VGG16, with 16 layers and millions of parameters, or even MobileNet or something like that.
You can usually go quite far with this kind of simple architecture, a couple of convolutional layers. In Keras, for example, it could look something like this: the individual blocks, convolution, max pooling, ReLU nonlinearity, the same for the second block, and classification at the end with fully connected layers. Yes. So this is our classifier. We pass each analysis window through it, and it gives us a prediction for which class it was, out of the 10 classes in Urban Sound, for each window. Then we need to aggregate these individual windows, and there are multiple ways of doing this. The simplest to think about is majority voting: if we have 10 windows over our four-second spectrogram, we run the prediction on each and just say, OK, the majority class wins. That works rather well. But it's not differentiable, so you need to do it as post-processing, and you're making very rough decisions at each step. So mean pooling, global average pooling across those analysis windows, usually does a little bit better. And what's nice with deep learning frameworks is that you can have this as a layer. For instance, in Keras you have the TimeDistributed layer, of which there are sadly extremely few examples online; it's not that hard to use, but it took me a little bit to figure out how to do it. We apply a base model, which in this case is an input to this function, and pass it to the TimeDistributed layer. Essentially, it will use a single instance of your model, sharing the weights across all the analysis windows, and just run it multiple times in the prediction step. And then we do global average pooling over these predictions. So here we're averaging the predictions.
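A sketch of what such a simple Keras model could look like; the filter counts, dense size, and input shape here are my own guesses for illustration, not the exact model from the slide.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: one mel-spectrogram analysis window, e.g. 64 mel bands x 44 frames
def build_model(bands=64, frames=44, n_classes=10):
    return keras.Sequential([
        keras.Input(shape=(bands, frames, 1)),
        layers.Conv2D(24, (5, 5), activation="relu"),  # convolution + ReLU
        layers.MaxPooling2D((2, 2)),                   # max pooling
        layers.Conv2D(48, (5, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),           # fully connected end
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_model()
print(model.output_shape)  # (None, 10)
```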
You can also do more advanced things, where you would, for instance, average your feature representation and then put a more advanced classifier on top. But this mean pooling of predictions is often called probabilistic voting in the literature. So this gives us a new model, which takes not a single analysis window but a set of analysis windows, typically corresponding to our four seconds, with, for example, 10 windows. If you do this, plus a couple more tricks from my thesis, you can have a system working like this. In addition to building the model and so on, which I've gone through, we're also deploying to a small microcontroller, using vendor-provided tools that convert the Keras model and so on. That's roughly standard stuff, so I didn't want to go into it here. So, a little demo video; let's see if we have sound. Here are the 10 classes, and we have various sound samples. This is children playing; I think it's in Spain, from what they said. What we also do here is threshold the prediction: if no prediction is good enough, we consider it unknown. This is also important in practice, because sometimes you have out-of-class data. This one is drilling. Or, actually, the sample I found said jackhammer, and jackhammer is also a class, like drilling; they are, to my ear, hard to distinguish sometimes, and the model can also struggle with that. There's a dog barking. In this case, all the classification happens on this small sensor unit, which is what I focused on in my thesis. There's a siren, a little bit louder. And it actually didn't get the first part of the siren, the "do, do, do, do", only the undulating sound later. But these samples are not from the Urban Sound data set, which I trained on, so they're out-of-domain samples, which is generally a much more challenging task. Yes, so that's it for the demo.
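The TimeDistributed pattern described above can be sketched like this; the stand-in base model and the shapes and window count are illustrative, and in practice you would plug in the convolutional classifier instead.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the per-window classifier (10 classes)
base_model = keras.Sequential([
    keras.Input(shape=(64, 44, 1)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Clip-level model: 10 analysis windows in, one prediction out
clip_input = keras.Input(shape=(10, 64, 44, 1))
# TimeDistributed shares a single instance (same weights) of base_model
# across all the analysis windows
per_window = layers.TimeDistributed(base_model)(clip_input)
# Mean pooling over the window predictions ("probabilistic voting")
clip_output = layers.GlobalAveragePooling1D()(per_window)
clip_model = keras.Model(clip_input, clip_output)
print(clip_model.output_shape)  # (None, 10)
```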
If you want to know more about doing sound classification on this sensor unit, which is very small, you can get my full thesis; both the report and the code are on GitHub, and it's also linked from the machine hearing repository. Yes. So I won't go into too much detail there. Now some tips and tricks. We've covered the basic, modern audio processing pipeline, and that will generally give you quite good results with a modern CNN architecture. But there are some tricks, especially in practice, when you have a new problem rather than researching an existing data set: your data sets are usually much smaller, and it's quite costly and tedious to annotate all the data. So there are some tricks for that. The first one is data augmentation. This is well known from other deep learning applications, especially image processing. Data augmentation on audio can be done either in the time domain or in the spectrogram domain, and in practice both seem to work fine. Here are some examples of common augmentations. The most common, and possibly most important, is time shifting. Remember that when you classify an analysis window of maybe one second, the sounds of interest, and what the individual convolution kernel sees, might be very short. Bird chirps are maybe 100 milliseconds max, or maybe even 10 milliseconds, so they occupy very little space in the image that the classifier sees. But it's desirable that the model is able to classify the sound no matter where inside the analysis window it appears. Time shifting simply means that you shift your samples in time, forward and backward, left and right. That way, the algorithm sees that bird chirps can appear at any place in time, and that this doesn't make a difference to the classification.
So this is by far the most important one, and you can usually go quite far with just time shifting. If you do want a precise location for your event, so you want a classifier that can tell when a chirp appeared, in the 100-millisecond range, rather than just that there were birds in this 4- or 10-second audio, then you might not want to do time shifting, because you might want the sound to always occur in the middle of the window; but then your labeling needs to respect that. Time stretching: for many sounds, if I speak slowly or I speak very fast, it's the same meaning, certainly the same class; it's both speech. So time stretching is also very effective at capturing such variations, and likewise pitch shifting. If I'm speaking with a low voice or a high-pitched voice, it still carries the same kind of information, and the same holds for general sounds, at least a little bit. So a little bit of time shift and pitch shift you can expect. But a lot of pitch shift might bring you into a new class; the difference between human speech and bird chirps, for instance, might be a big pitch shift, so you may want to limit how much you pitch shift. Typical data augmentation settings here are maybe 10% to 20% for time shift and pitch shift. You can also add noise. This is also quite effective, especially if you know that you have a variable amount of noise. Random noise works OK, and you can also sample from one of the many repositories of basically just noise, mix that in with your signal, and classify that. Mixup is an interesting data augmentation strategy that mixes two samples as a linear combination of the two, and actually adjusts the labels accordingly. That has been shown to work really well on audio, also in combination with other augmentation techniques. So yes, we can basically apply CNNs with the standard kind of image-type architecture. This means that we can do transfer learning from image data.
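Time shifting, noise addition, and mixup can be sketched in a few lines of numpy; the parameter values are illustrative choices of mine. Time stretching and pitch shifting are available in librosa as `librosa.effects.time_stretch` and `librosa.effects.pitch_shift`.

```python
import numpy as np

rng = np.random.default_rng(42)

def time_shift(samples, max_shift=0.2):
    """Shift the audio left/right by up to max_shift of its length (wrapping)."""
    limit = int(len(samples) * max_shift)
    return np.roll(samples, rng.integers(-limit, limit + 1))

def add_noise(samples, noise_level=0.01):
    """Mix random Gaussian noise into the signal."""
    return samples + noise_level * rng.standard_normal(len(samples))

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup: linear combination of two samples, labels adjusted accordingly."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Example: mix two samples with one-hot labels; the mixed label
# reflects the mixing proportions and still sums to 1
x_mix, y_mix = mixup(np.ones(100), np.array([1.0, 0.0]),
                     np.zeros(100), np.array([0.0, 1.0]))
```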
Of course, image data is significantly different from spectrograms; I mentioned the frequency axis and so on. However, some of the base operations that are needed are the same: you need to detect edges, diagonals, patterns of edges and diagonals, blobs of area. Those are common functionality needed by both. So if you do want to use a bigger model, definitely try a pre-trained model and fine-tune it. Most deep learning frameworks, including Keras, ship models pre-trained on ImageNet. The thing is, most of these models take RGB color images as input. It can work to use just one of those channels and zero-fill the other ones, but you can also simply copy the data across the three channels. There are also papers showing that you can do multi-scale input: for instance, one spectrogram with a very fine time resolution and one with a very coarse time resolution, put into different channels; this can be beneficial. But because image data and sound data are quite different, you usually do need to fine-tune. It's usually not enough to just apply a pre-trained model and only tune the classifier at the end. You need to tune a couple of layers at the end, and typically also at least the first layer; sometimes you fine-tune the whole thing. But it is generally very beneficial. So definitely, if you have a smaller data set, and you need high performance, and you can't get it with a small model, go with a pre-trained model, for instance MobileNet or something like that, and fine-tune it. Audio embeddings are another strategy, inspired by text embeddings, where you create, for instance, a 128-dimensional vector from your text data. You can do the same with sound. With Look, Listen, Learn (L3), you can convert a one-second audio spectrogram into a 512-dimensional vector, which has been trained on millions of YouTube videos.
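A sketch of the channel trick and fine-tuning in Keras. Note that `weights=None` is used here only to keep the example self-contained; in practice you would pass `weights="imagenet"`. The input size, head, and number of unfrozen layers are illustrative choices of mine.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Pretrained ImageNet models expect RGB input, so copy the single
# spectrogram channel across all three channels
spec = np.zeros((128, 128), dtype="float32")        # a log-mel window
rgb = np.repeat(spec[..., np.newaxis], 3, axis=-1)  # shape (128, 128, 3)

# weights=None keeps this offline; use weights="imagenet" in practice
base = keras.applications.MobileNet(input_shape=(128, 128, 3),
                                    include_top=False, weights=None)
base.trainable = False  # first train only the new classification head
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
# ...then unfreeze a few layers and continue with a low learning rate
for layer in base.layers[-4:]:
    layer.trainable = True
```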
So it has seen a very large number of different sounds. It uses a CNN under the hood and basically gives you a very compressed vector representation. I didn't include a code sample here, but there is very nice recent work: OpenL3, from the paper "Look, Listen and Learn More". They have a Python package which makes it super simple: you just import it, there's one function to pre-process, and then you can classify audio with basically just a linear classifier from scikit-learn or similar. So if you don't have any deep learning experience and you want to try an audio classification problem, definitely go this route first, because it basically handles the audio part for you, and you can apply a simple classifier after that. One little tip: you might want to build your own data set, right? Audacity is a nice editor for audio, and it has good support for adding a label track and annotating. There are keyboard shortcuts for all the functions you need, so it's quite quick to use. Here I'm annotating some custom data where we did event recognition. The nice thing is that their format is basically a CSV file; it has no header and so on, but one pandas line will give you a nice data frame with all your annotations for the sound. Yes, so it's time to summarize. We went through the basic audio pipeline: we split the audio into fixed-length analysis windows; we used the log-mel spectrogram as the feature representation, because it's been shown to work very well; we then applied a machine learning model, typically a convolutional neural network; and then we aggregated the predictions from the individual windows, merging them together using global mean pooling. Models I would recommend trying first on new data: try audio embeddings with OpenL3 and a simple model like a linear classifier or a random forest; try a convolutional neural network using transfer learning.
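That pandas one-liner for Audacity label tracks looks roughly like this; the export is tab-separated with no header (start time, end time, label), and the column names are my own choice.

```python
import io
import pandas as pd

# An Audacity label-track export: tab-separated, no header row
exported = "0.50\t1.20\tdog_bark\n3.10\t3.90\tsiren\n"

# One line to get a tidy data frame of all annotations
labels = pd.read_csv(io.StringIO(exported), sep="\t",
                     names=["start", "end", "annotation"])
print(labels)
```

In practice you would pass the path to the exported `.txt` file instead of the `io.StringIO` stand-in used here.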
It's quite powerful, and there are usually examples that will get you pretty far. If you, for instance, preprocess your spectrograms and save them as PNG files, you can basically take any image classification pipeline you already have and use that, if you're willing to ignore this merging of different analysis windows. Data augmentation is very effective: time shift, time stretch, pitch shift, and noise addition are all basically recommended. Sadly, there are no nice go-to implementations of these in, for instance, Keras generators, but it's not that hard to do yourself. Yes, some more learning resources for you: the slides, and a lot of my notes in general, are on this GitHub. If you want hands-on experience, TensorFlow has a pretty nice tutorial called Simple Audio Recognition. It's about recognizing speech commands, which could be interesting in itself, but it takes a general approach, not a speech-specific one, so you can use it for other things as well. There's one book recommendation: Computational Analysis of Sound Scenes and Events. It's quite thorough on general audio classification, and a very modern book, from 2018. So that's a nice one as well. So I think we have questions, maybe? We have 10 minutes for questions, so please go to the microphones in the aisles to ask them. I think our first is there. Yeah, thanks, Jon. Very interesting application of machine learning. I have two small questions. There's obviously a time-series component to your data. I'm not so familiar with this audio classification problem, but can you tell us a bit about time-series methods, maybe LSTMs and so on, and how successful they are? Yes. Time series: intuitively, one would really want to apply that, because there is definitely a time component. Convolutional recurrent neural networks do quite well when you're looking at longer timescales. For instance, there's a classification task called acoustic scene recognition.
For instance, I have a 10- or maybe 30-second clip: is this from a restaurant, or from a city, and so on? There you see that the recurrent networks, which have much stronger time modeling, do better. But for small, short tasks, CNNs do just fine, surprisingly. OK, and the other small question I had was just to understand your label, the target that it's learning. You said that the sound is a very mixed data set. So are the labels just one category of sound when you're learning, or would it be beneficial to have maybe a weighted set of two categories when learning? Yeah. So in the audio classification task, the typical definition is to have a single label on some window of time. You can of course have multi-label data sets, and that's a more realistic modeling of the world, because you basically always have multiple sounds. I think AudioSet has multi-labeling, and there's a new urban sound data set now that also has multiple labels, and then you apply more tagging-style approaches, but using classification as a base. With tagging, you can either use a separate classifier per track or sound of interest, or you can have a joint model that does multi-label classification. So definitely this is something you would want to do, but it does complicate the problem. We have one person at the mic over there. You mentioned, about data augmentation, that we can take two separate cases and mix them up; should the label of that mixup be weighted as well? It kind of connects with the previous question. Should it be, like, 0.5 for one and 0.5 for the other, and how does that work? Yes. So mixup was proposed, I think, two or three years ago, as a general method. You basically take your sound with its target class and say, OK, let's take 80% of that, not 100%, and then take 20% of some other sound with a non-target class, mix them together, and then update the labels accordingly.
So it's basically telling the model: there is this predominant sound, but there's also this other sound in the background. OK, thank you. Yes, we have a question. You mentioned the mel frequency ranges, but usually when you record audio with microphones you get up to 20,000 hertz. Yes. So do you have any experience, or could you comment on, whether the added information in the higher frequency ranges affects the machine learning algorithm, or other features one could use? Yeah, that's a good question. Typically, recordings are done at 44 or 48 kilohertz for general audio. Machine learning is often applied at a lower sample rate, so 22 kilohertz, or sometimes 16, and in rare cases even 8. It depends on the sounds of interest. If you're doing birds, you definitely want to keep those high-frequency parts. If you're doing speech, you can usually do just fine at 8 kilohertz. Another thing is that noise tends to be in the lower areas of the spectrum; there's more energy in the lower end. So if you are doing birds, you might want to just ignore everything below one kilohertz, for instance, and that definitely simplifies your model, especially if you have a small data set. We have more questions; you need to go to the mic, either here or there. Quick question: you mentioned the editor that has support for annotating audio, could you please repeat the name? Yes, Audacity. Audacity, OK. And a more general question: do you have any tips if, for example, you don't have an existing data set and you're just starting with a bunch of audio that you want to annotate first? Any advice on strategies, like maybe semi-supervised learning or something like that? Yeah, semi-supervised is very interesting. There are a lot of papers, but I haven't seen a very good practical methodology for it. And I think annotating a data set is, in general, a whole other talk.
But I'm very interested to have a chat about this later. Thanks. So yeah, we have two more, and then I think we're done. Very nice talk. My question would be: do you have to deal with any pre-processing, like white noise filtering? You mean to remove white noise? Exactly; you just mentioned removing or ignoring a certain range of frequencies. Yes, you can. I mean, scoping your frequency range is definitely very easy, so just do it if you know where your things of interest are. For denoising, you can apply a separate denoising step beforehand and then do machine learning. If you don't have a lot of data, that can be very beneficial; for instance, you can use a standard denoising algorithm trained on thousands of hours of material, or just a general DSP method. If you have a lot of data, then in practice the machine learning algorithm itself learns to suppress the noise, but that only works if you have a lot of data. Thanks. So, thank you for your talk. Is it possible to train a deep convolutional neural net directly on the time-domain data, using 1D convolutions and dilated convolutions and things like that? It is possible, and it is very actively researched. It's only within the last year or two that these models are getting to the same level of performance as spectrogram-based models, but some models are now actually showing better performance with end-to-end training. So I expect that in a couple of years, maybe, that will be the go-to for practical applications. Can I do speech recognition with this? This was only like six classes, and you have many more classes if you want to classify words. If you want to do automatic speech recognition, so the complete vocabulary of English, for instance, then theoretically you can, but there are specific models for automatic speech recognition that will in general do better. So if you want full speech recognition, you should look at speech-specific methods.
And there are many available. But if you're doing a simple task like commands, like yes, no, up, down, one, two, three, four, five, where you can limit your vocabulary to, say, under 100 classes or so, then it becomes much more realistic to apply a speech-unaware model like this. OK, thanks for an interesting presentation. From the thesis, it looks like you deployed this model to a microcontroller. Can you tell us a little bit about the framework you used to transfer it from Python? Yes. So we used the vendor-provided library from ST Microelectronics for the STM32, called X-CUBE-AI. You'll find links in the GitHub. It's a proprietary solution that only works on their microcontrollers, but it's very simple: you throw in your Keras model, it gives you a C model out, and they have examples of the pre-processing, with some bugs, but it does work. The firmware code is also in the GitHub repository, not just the model, so you can basically download that and start going. Do join me here if you want to talk more about the specifics of audio classification; I'll also be around the rest of the day. Thank you. Thank you, Jon Nordby.