All right. Good morning again. That was a fantastic talk. Now we're back again with Jon, and he's going to be talking about sound event detection with machine learning. Hey, Jon. Am I pronouncing your name right? Hey, yeah, that's great. All right, perfect. So a little bit about Jon. He's an engineer with several years of experience in software development, specifically around embedded systems and digital signal processing. These days he's focusing on machine learning for audio and the Internet of Things. So we're really psyched to see what you've got in store for us, and all the best.

Thank you so much, and welcome, everyone. I hope you enjoyed the session on AutoKeras. That's quite relevant here, as we will also be building on Keras. My name is Jon Nordby. I'm the head of data science and machine learning at Soundsensing. Soundsensing is a company that focuses on audio and machine learning. We provide easy-to-use IoT sensors that continuously measure sound and use machine learning to extract interesting information. The information is presented in an online dashboard and is also available through an API for integration with other systems. This technology is used in products for noise monitoring and condition monitoring of equipment. Today we will be talking about sound event detection with machine learning, in Python of course, and these techniques can be used with our tools or with any other tools, including completely open source solutions.

Sound event detection is one of many common task formulations in machine learning on audio. Some other examples are audio classification and audio tagging, and these are illustrated on the screen. In classification, you have an input audio clip and your model produces only one output, the class label. For example: does it contain speech, music, and so on — but it's a single choice. In tagging, you allow the same clip to get multiple labels. For example, a clip might contain both speech and mouse clicks, and also keyboard typing if it's in an office environment. And in detection, which is what we will talk about today, you have precise time information. That's what separates detection from the other tasks: not just knowing that somewhere in these 10 seconds of audio these classes are present, but knowing exactly at what times the different events occur. So the task definition is: given input audio, return the timestamps — the start and the end — of each event, for each event class. As a note, this task is also known as acoustic event detection or audio event detection; these are all synonyms, so sometimes you see the SED acronym and sometimes the AED acronym.

So what are events, and when do you use event detection versus classification? In order for something to be an event, it needs to have at least a clear start, and ideally also a clear duration. It's something that starts and stops, something discrete that you can, for example, count. Here's an example of modeling sounds as events versus modeling them as classes. Say you have a sensor system near a road and you take in audio, say 10 seconds. On a busy road, the sound of cars is nearly constant in such a clip, so you could model this as a classification problem, and then the class would be car traffic.
If you instead model this as an event problem, because you want more detailed time-based information, then your class names — or event names — might be, for example, car passing: a single car passing your sensor. Or a honk: a single honk. You have the same distinction in speech: you could have speech as a class, present more or less continuously over time, whereas an event representation in that domain might be single words, where you timestamp each individual word. So it's important to think about your problem and whether you want events or classes. Whenever you want to count something — count discrete things — there's a good chance it's an event problem.

To make things concrete, today we'll have an example application: fermentation tracking when making alcoholic beverages such as beer, cider, wine, et cetera. I tried to pick a fun example that some might also be interested in from other perspectives. In alcohol production — making a nice beverage — you prepare a mixture called the wort, which contains yeast, some source of sugar, water and maybe additional flavorings, and you put it all in a vessel. This vessel is placed somewhere with an appropriate temperature, and after some time it will naturally start to ferment, or at least that's what you hope. During fermentation, the yeast eats the sugar, which is the process that produces alcohol, and as a byproduct it also produces CO2 gas. Of course, there are many things that can go wrong. The fermentation can fail to start — maybe the yeast is dead or struggling. It can also be way too intense: you can get so much foaming that, while it won't quite explode, it can make a real mess in your house. Or sometimes the fermentation starts okay but then stops abruptly for some reason, and you might need to restart it. So as a brewer, you have to monitor this process, and that's what we will help brewers do here.

At the top of the vessel in this picture you see an airlock. This is the device that lets the CO2 out while making sure that oxygen — and also bugs, dirt and so on — cannot get into your brew. It sounds like this. Here you hear the CO2 being pushed through the airlock and escaping out of the top. As you can hear, this makes a characteristic sound: a plop for each bubble of gas that escapes. This example has a nice and clear sound; it's not always so easy to pick up. This is something we can track using machine learning. We can have a microphone that picks up the sound, pass it through some software, and use a machine learning model to detect each individual plop, which we can then process further. So this is an example of events: clear, time-limited sounds that we want to count.

If you count these plops, you can estimate how much fermentation is going on, and you can also use it to estimate the alcohol content being produced. It's not a very precise method, but you can get a status that tells you whether things are normal or not. If you had such a system, you could track this over time. Primary fermentation usually happens over several days. At the start nothing happens, then it starts to ferment and usually reaches a peak relatively early, and then, as the yeast has eaten most of the sugars, the activity gradually goes down.
So this is the kind of curve you want to see. Of course, there is a lot of variation depending on what you have put in the brew and on the ambient temperature; even with the same settings and the same input into the wort, you might still see variation. But you want to see something like this: no abrupt drop-offs, it should start within a reasonable time, and so on. If it doesn't, that's something interesting to check out — so you might want a notification on your phone if there are anomalies, for example. Our goal is to do this with sound, event detection and machine learning. Of course, there are also dedicated devices for this task, such as the Plaato Airlock. That's a professional device and it will be a better solution, so if you just want to track your brew, go buy something that is battle tested. But this is a fun and interesting problem to solve with sound, and it can make a practical solution as well. So we will build a system that tracks fermentation activity and outputs a number of bubbles per minute — how many plops we heard — by capturing the airlock sound with a microphone and using machine learning to detect the plops. That's our goal.

When people say machine learning, many mainly think about the algorithms and the code, but just as important — in many cases more important — is the data. Without data, you don't have machine learning: there's nothing to learn from, and you will not get a good model out, no matter how great your architecture or training process is. You need good data to get a good machine learning model, and you need a good machine learning model to get a good machine-learning-powered system.

The technique we are going to use, which is the most common way of learning a classifier or detector, is supervised learning. In supervised learning, you have labeled examples: we have input audio, and in the training set we also have labels describing the expected output — in this case, was it a plop or not? There are several ways of labeling this data, or formatting the labels, and we will focus on what's called strongly labeled data. This is a concept in sound event detection and also in other time series tasks. With strongly labeled data, which is shown at the top here, you have precise annotations for each event that happens: the event type, the start and the end. So it's the same kind of output that we want to produce. Strongly labeled data is quite time intensive to produce, but once you have these good labels, it gives the most straightforward learning process, and also the most straightforward evaluation. So it's the best thing to start with: invest in the data and you will have an easier time later. It's also possible to use weakly labeled data, but that's an advanced topic for another time.

When you have this labeled data set, you put it into a training process, and the training process spits out a model that can now act as a sound event detector for your specific classes. That sound event detector can then take input audio and spit out the timestamped events in that audio. Important to consider are the requirements for your data: you don't need just any data, you need good data. One aspect is quantity — you need to have enough data.
How much varies a lot based on the complexity of the task, but here are some rough guidelines. If you have a hundred events — instances per class; in this case we only have one class — that's the minimum level. You must be at least there, otherwise there's no point in starting on the modeling part; you should focus on the data instead. With that amount, you would split it into a training set and a test set, per standard machine learning practice, so you might have only 30 events in your test set. That's quite little: if you wanted 99% accuracy, or a 1% error rate, you wouldn't really be able to estimate that on such a small test set, and it will be very noisy. Your performance will seem to vary, but it might just be statistical variation in your cross-validation, for example, and not an actual significant change. At 1,000 instances per class you're starting to get somewhere: you can have a couple of hundred events in the test set, which gives you better, more stable statistics. And if you have 10,000 events or more, that's a very good range. For a professional system, that's where you want to be: you get robust statistics and also a significant amount of training data — not just enough to validate, but enough to learn a strong model.

Of course, quantity is not everything; you also need the appropriate quality. The most important mindset here is that you need realistic data: data that is relevant to your task and that captures all the natural variation in the process, so that the model has seen data in training that is very close to what it will see in production. For example, there will be variation in the event sound itself. In our case: different airlock designs, where each mechanism creates a different sound; different vessels, where the size and material influence the sound; and different brews, where, for example, how intense the fermentation is matters — and this also changes over time, since each phase of a brew sounds slightly different. Your recording device changes the sound too; no matter how good a tool you use to record, it's not going to be perfect. And if you deploy this in a way where users use their own devices, there will be a lot of variation in microphones, in the noise floor, maybe even in the program settings they use. This is a source of variation that needs to be captured, otherwise you will create a system that seems to work okay locally but fails out in the field.

Also consider the recording environment. If this is for home brewing, the brew is often just in a bathroom or a kitchen, which is used for other purposes too. So there are many other sounds, many other events and activities going on that become background noise, and some of these might be rather easy to confuse with a plop. If you don't have much variation at all, you could end up with a model that effectively just looks at the sound level — and that can be fooled; it will fail in many ways when, for example, someone simply walks past. So you need to capture background noise data that is relevant for this task. In an uncontrolled environment this is potentially the space of all possible sounds, but you need to at least cover the typical ones so that you can handle them.
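Before we move on to the model, one concrete note on the strongly labeled data format: in practice it is just a list of start time, end time and class per event. Here is a minimal sketch of loading such a file with pandas — the file name and column layout are only an illustration, not a fixed standard:

```python
import pandas as pd

# Hypothetical annotation file (e.g. exported from a labeling tool),
# tab-separated with one row per event: start time, end time, class
#   12.402    12.531    plop
#   17.870    18.011    plop
#   25.113    26.904    other
labels = pd.read_csv("labels.txt", sep="\t", names=["start", "end", "label"])

# Keep only the event class we care about and sanity-check the labels
plops = labels[labels["label"] == "plop"]
print(len(plops), "plop events, mean duration",
      round((plops["end"] - plops["start"]).mean(), 3), "seconds")
```

From a table like this you can count events per class and check durations, which is a quick way to sanity-check both the quantity and the quality of your labels.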
So now that we have talked about data, we can go on and talk about the model. Many talks focus mostly on this part; here we're splitting it maybe 50/50. It's important to understand the general audio machine learning pipeline, which I've illustrated here. The start is common to most audio ML tasks, and towards the end it gets more specific to our particular use case.

It starts with the audio coming in. This can be stored audio clips or a live stream from a microphone. The audio is split into what are called analysis windows — fixed-length time windows — and each window is then processed independently of the others. From each time window we compute some sort of feature representation, and a very common and effective one is the spectrogram. This is a time-frequency representation: frequency along the y-axis and time along the x-axis. You can often visually see the sounds you're interested in, so it's easy to understand as a human, and it also lets us apply techniques from image classification. Each time window goes through the classifier, which spits out the probability of an event occurring in that small time section — a number between zero and one for each class; with binary classification, only one output. Then, in our case, we feed this into what we call an event tracker, which converts it into discrete start and end times of events, so we get a timeline of events. And in our task, what we're actually interested in is not so much the individual events — they're mostly a tool. What we're interested in is the bubbles per minute, the estimate of fermentation activity, which we can plot over time so we can detect anomalies and so on. That's the overview, and we'll look at a couple of the steps in more detail.

Spectrograms — I talked a bit about them — are very easy to extract in Python, and one excellent tool for this is librosa. It also provides many other audio feature extraction tools, so it's definitely a library to get familiar with if you're working in audio ML. Spectrograms are also implemented in torchaudio and TensorFlow and so on, so they're quite accessible. You often want to convert to a log-scaled spectrogram, or decibel representation; this compresses the numeric range and makes it a bit more in line with how we hear. And often we also use what's called a mel spectrogram representation, which is a way of dividing up the frequency axis that is a bit closer to our perceptual system — how we hear. More importantly, it lets you compress the number of bins on the frequency axis, which is good for the machine learning model, while still keeping most of the relevant information. Librosa also provides plotting tools, which is very handy.

The classifier can be many different types of model in this framework, but a very popular and powerful choice is a convolutional neural network — basically the type of network you would use for an image problem. The spectrogram representation is 2D, essentially an image, so a CNN is very suitable and well known to many. One difference from image models is that we typically use a much smaller model, with fewer convolutional layers.
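As a rough sketch — illustrative, not the exact code from this project — computing a log-scaled mel spectrogram with librosa and defining a small Keras CNN of the kind described here could look like this. The file name, sample rate, mel band count and layer sizes are just reasonable starting points, not tuned values:

```python
import librosa
import numpy as np
from tensorflow import keras

# Load one analysis window of audio (the file name is just an example;
# assumes the window is on the order of a second or longer)
y, sr = librosa.load("window.wav", sr=16000)

# Mel spectrogram, then convert to a log (decibel) scale
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=32,
                                   n_fft=1024, hop_length=512)
S_db = librosa.power_to_db(S, ref=np.max)  # shape: (n_mels, n_frames)

# A small CNN classifier: a few convolutional layers is usually enough
n_mels, n_frames = S_db.shape
model = keras.Sequential([
    keras.layers.Input(shape=(n_mels, n_frames, 1)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of "plop"
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Add batch and channel axes before feeding the spectrogram to the model
X = S_db[np.newaxis, ..., np.newaxis]
prob = model.predict(X)  # untrained here, so the output is meaningless
```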
A typical sound event detection model might have two to five layers, and you can do really well, even on quite tricky tasks, with three convolutional layers. So don't go and put in the biggest, fattest model you have, especially if you don't have much data — compared to image processing, these tasks are often relatively simple, with relatively simple patterns. If you're unfamiliar with deep learning, you can also use a simple scikit-learn model as an alternative, for example logistic regression. In that case it might be preferable to use not just a spectrogram but what's called the MFCC representation. If you search for MFCC in audio, you'll find quite a lot of information about it, and it actually does well for many tasks — I've tested it on this task and it works quite okay.

Also important, of course, is performance evaluation: how do we know that our model is doing okay, and how do we measure it? A typical characteristic of sound event detection is imbalanced data. There is a lot of background — a lot of the time nothing is happening, for example before the fermentation has started, or even between plops, since in the early or late phases there can be many seconds between each plop. So most of the analysis windows — the instances given to our model — will be negative, with no event, and a relatively small number will have something happening. It's then very important not to use accuracy, because accuracy would score very well even if you just always guessed the majority class. Much more relevant are the false positive rate and false negative rate, or what's called precision and recall, and it's very useful to plot curves like this — a precision/recall trade-off curve. You can compare different models on it, because sometimes you're interested in working in a high-precision regime: you want to make sure that if the model says something is happening, it's very likely to be true. Other times you might be okay with it being a little trigger-happy, as long as you capture as many events as possible. So there's a trade-off, and you can choose the operating point. You can compare models in the regime you're interested in, or, if you want to see whether one is always better, compare across the entire range.

You can also compare under different conditions, for example the signal-to-noise ratio: how loud are your event sounds relative to the background? That's something you can synthesize data around, for example, or you might annotate or calculate it. Then you can look at how it affects performance — because it definitely will. And if it's important that the system works well in high-noise or low-signal conditions, you have to do these tests and make sure it doesn't only work on perfect, laboratory-condition audio, but also when there is noise. You can do this evaluation per instance, so per analysis window, or alternatively on an event basis, and there's a Python library called sed_eval that can be useful for this.

The other component in our system is what we call the event tracker, which converts the continuous probabilities into a discrete list of events. It's a very simple module.
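Before getting into the tracker details, here is a small sketch of that precision/recall evaluation with scikit-learn. The two arrays are placeholders standing in for your own held-out analysis windows and the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: 1 if the analysis window overlaps an event, else 0
# y_prob: the classifier's predicted probability for each window
# (placeholder values -- substitute your own test-set results)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.1, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for p, r, t in zip(precision, recall, np.append(thresholds, 1.0)):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Plotting precision against recall from these arrays gives exactly the kind of trade-off curve described above, and the point you pick on it becomes your operating threshold.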
For the tracker, you just need thresholds: you decide that if the probability is above a certain value, then we say yes, we think this is a plop, an event. And once you are already inside such an event, it's useful to have a different threshold for detecting when it stopped. Using two different thresholds is useful because, if there are small variations around a single threshold, it avoids the system flipping on, off, on, off all the time. This is called hysteresis, and it comes from process control.

The other component — maybe not strictly sound event detection, but something we build on top — is the statistics estimator that computes the bubbles per minute. You could simply count for one minute and report the number of events detected as the BPM, the bubbles per minute. However, our model will always be somewhat wrong: you will have false positives (you said it was a bubble and it was not) and false negatives (you missed something). And we have a process that we expect to be very periodic, very regular. If we assume that, then instead of just counting we can estimate the interval between plops and try to find a typical value for it. If you use the median, you are robust against outliers, and this is expected to perform better in real-life conditions, where you are certainly going to miss some events. Also, you would compute this for every one-minute window, and you can do it with a bit of overlap as well, maybe every 10 seconds or so. Each of those gives you a different estimate, so you can average them again to get a smooth curve over time, because one-minute resolution is more than you need. What you might actually be interested in is, for example, one value per hour. I don't think any brewer wants to check on their brew every few minutes — nothing really happens in a few minutes — so an hourly resolution might be very good. You can do multiple levels of aggregation here, including some uncertainty estimation.

Now, when you have that part, you might want to present it to the user. In this project — I'll link the GitHub later — I found that there are some established tracking solutions that brewers already use to plan their brewing and to track it over time with sensors. One called Brewfather had a very nice API that was super easy to integrate; here is actually all the Python code needed for that. When you push data into their system — at most every 15 minutes — you get this graph over time. Then you can visually check it, for example once per day: how is my brew doing? Or, if you're in an interesting phase, you might check it several times per day from your laptop, and you know whether you should go and actually do something with your brew, or whether everything is fine and you can just let it sit and it will give you a nice beer in the end.

So those are the parts we had for today; I have a couple of bonus slides that we might have time for. For more resources, I've published the code for this project. It's at an early stage — there is some data there, and some Python notebooks that implement the basic model, but it's not yet a finished solution by any means. Still, it's a good place to have a look and follow if you're interested in this project and in a practical application of sound event detection.
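A minimal sketch of those two pieces — the hysteresis-style event tracker and the median-based bubbles-per-minute estimate — could look like the following. The threshold values and the window hop are example numbers, not tuned settings:

```python
import numpy as np

def track_events(probabilities, window_hop=0.5, on=0.7, off=0.3):
    """Turn per-window probabilities into (start, end) event times,
    using a higher threshold to start an event than to end it."""
    events, start, active = [], None, False
    for i, p in enumerate(probabilities):
        t = i * window_hop
        if not active and p >= on:
            active, start = True, t
        elif active and p <= off:
            active = False
            events.append((start, t))
    if active:  # close an event still open at the end of the stream
        events.append((start, len(probabilities) * window_hop))
    return events

def bubbles_per_minute(events):
    """Estimate activity from the median interval between event starts,
    which tolerates missed or spurious detections better than counting."""
    starts = np.array([start for start, _ in events])
    if len(starts) < 2:
        return 0.0
    median_interval = np.median(np.diff(starts))  # seconds per bubble
    return 60.0 / median_interval

probs = np.array([0.1, 0.2, 0.9, 0.8, 0.2, 0.1, 0.95, 0.4, 0.2])
events = track_events(probs)
print(events, bubbles_per_minute(events))
```

The two thresholds give the hysteresis: small wiggles around a single threshold no longer toggle an event on and off, and the median interval keeps one missed plop from dragging the whole estimate down.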
So that's the brewing audio event detection repository. I also provide some resources for general audio ML on GitHub, in the machine hearing repository. For this particular task, if you've learned something now and want to learn more, there's a good tutorial paper by Tuomas Virtanen and others, "Sound Event Detection: A Tutorial", which is really excellent — very practically oriented, though published as an engineering paper. I also did a presentation at EuroPython in 2019 about audio classification with machine learning, and that talk covers more of the basics. So if you felt that some things went a bit fast around audio, input representations and so on, check out that one. Or, if you feel this is all easy and you want to focus on optimization, I did a presentation at tinyML this year. The title is about environmental noise on microcontrollers, but it focuses a lot on how to make computationally efficient models, including ones that can run on a small sensor device like our own; that talk is a good introduction to that topic. And if you're interested in machine learning on audio in general, I would encourage you to join the Sound of AI community, which is a Slack group. It's very active, with around 3,000 participants now, and it's a great place to ask questions and connect with others interested in the same topics. So, that's all from me. Thank you.

I have just one thing I would like to ask the audience before they ask us questions: think about what you want to make. Now that you've learned about audio event detection with machine learning, what are you interested in doing? Popcorn popping, birds, coughing, COVID detection, drum hits in music, car traffic and so on — there are many, many possibilities, and I'm happy to discuss particular cases in the questions and also in the breakouts, of course. And if you want to make a professional system, consider using the Soundsensing tools that we have. We are also looking for people, so if you want to work on this, definitely contact us — we have full-time positions, freelance work, internships, engineering thesis projects and partnerships. Thank you.

Fantastic talk. I specifically love that you covered the end-to-end journey, from capturing the data to actually showing how you can work with it — I really loved it, so thanks again for that. And I'm sure everyone watching would love to join your Slack community as well; it seems like a nice place to learn more and to get their hands dirty. So we have a couple of questions. I'm just going to put them on the screen and we can take them one by one, if that's okay. Great.

All right. The first one is: how do you pick your spectrogram parameters and window length? Ah, it's a great question. If you have a task that someone else has done before, search around and see what others have tried — that's the best: try to find a paper and try their parameters first if they worked on a similar task. In general, you can do well with around 40 to 50 milliseconds frame length for the spectrogram, and the number of frequency bins might be 30 to 128. The analysis window length is what really depends on your task: it needs to be long enough to cover your event, because you need to see the entire event, or at least the start of it, in one go.
So that's the most task-dependent parameter, and you should definitely try different things out there.

Perfect. This is probably one of my personal questions as well: what software tools can you use for labeling audio data, especially when you have a lot of data? Ah, yes — I actually have a couple of slides on that. Will people still see them? Yeah, sure, I think we have time — we have 10 minutes. You can just share your screen and we can add it to the stream. Okay, yeah. All right.

Yes, I actually skipped one slide. For manual labeling, and also for verifying your data, Audacity is a great tool — it's open source. Here you can manually select the times that are your events and label them with "true", for example, and you might also want to select things that are definitely not what you're interested in and label them with "false" or "no", as in this case. Then you can export this as a CSV and load it with pandas into a single data frame. So this is the way to start, and really dig into your data. If you want to do a lot of labels, this can be quite a slow process, so you can also do semi-automatic labeling. One popular approach is called a Gaussian Mixture Model Hidden Markov Model, GMM-HMM. This is an unsupervised process, so it's not going to be perfect at detecting your events, but if you have an okay signal-to-noise ratio it will do a pretty good job. You basically specify: I have two kinds of things in here, some windows with background and some windows with my events — please try to separate them, and I won't give you any more information. This is output from exactly this task, and it does a really good job; it made only one mistake, at the end. You still need to go over it and quality-assure it afterwards. I hope that answers the question.

Yeah, I think it does. And the Audacity tool looks quite nice — I was just googling it on the side; it's open source and looks like a good way to get started. I was also looking at the chat: people are already talking about building keystroke recognition models and so on. You know a talk is successful when people are already discussing how they can apply it. Yes, that's fantastic — that's what I want to see people do with the data.

There's actually one more question: could we just compare the detected beats per minute — I'm assuming BPM is beats per minute — to actual BPMs for model evaluation? Yes. In this case it's bubbles per minute, which is a special flavor of BPM, but it's the same idea. And yes, this is not a standard sound event detection evaluation, but it is the task we're actually interested in, so I would encourage doing both: do an evaluation of the classifier, and also do a BPM evaluation, because in the end that's what we want. And you should be able to do better on that task, because it can tolerate some errors.

Right, yeah, that makes sense — bubbles per minute makes more sense than beats per minute. But I think those were all the questions. Again, thank you so much for the lovely talk. I'm sure people would want you to stay around, so maybe you can head over to the breakout room on Parrot and you can all have a quick chat about the use cases you could build.
I'm happy to go to the Parrot breakout, and if someone is not able to join now but wants to talk later, send me an email or join the Sound of AI Slack community — I'm quite active there, you'll find me. Right, fantastic, thank you so much. Thank you so much.