Thanks, Javier, for having me back here. I recognize a lot of the faces here, and it's great to be back. For those of you who don't know me: I was here a few years ago, and I'm going to talk about the work I've been doing at NYU across a number of different projects. There's actually a lot I want to cover today, which means I won't always be able to go into detail about specific techniques, but I'm more than happy to take questions. If at any point you have one, please feel free to stop me, and I'm also happy to have conversations offline after the talk. This will be a little overview-ish, in the sense that there's a lot of ground to cover, but I'm always happy to discuss things in greater detail.

The other thing I'll mention before I start is that even though my name is on the slide, all of this work was done in collaboration with a lot of great people, including people here at the MTG. I started listing everyone and the list just scrolled off the bottom of the slide, so in the end I decided not to include it. But this is work done in collaboration with many great people.

There are two main themes I want to talk about today: machine listening in urban and bioacoustic environments, which has perhaps been the main theme of my work for the last few years, and also another thread we've been working on a lot at the Music and Audio Research Lab, which is tools for sustainable MIR research; music information retrieval, but also audio analysis research more generally. I'm going to assume a certain familiarity with the domain and the techniques, but if at any point something is unfamiliar, or you have questions or want further explanation, by all means stop me and ask.

So, starting with the first half of the talk. But maybe very quickly: who is this person here? Okay, I'll skip this slide; it's basically a bit of my own trajectory. I'm originally from Israel, did my undergrad in the UK, did my master's and PhD here, and I'm now at NYU. The other thing I'm going to skip is the list of things I'm not going to talk about today. As Javier mentioned, I spent a lot of my time working on melody extraction, where the basic idea is: you have a melody in polyphonic music, and you want to extract the f0 trajectory of that melody. Towards the end we started doing some fun things with those trajectories, like replacing the melody with a Vocaloid synthesis of it. And we worked on a whole bunch of other fun things while I was here, like cover ID, genre classification, pattern discovery and tonic identification, so I'm not going to talk about any of that.

What I am going to talk about sits mainly in the context of two projects: SONYC, which stands for Sounds of New York City, very much inspired by Sons de Barcelona; and BirdVox, a similar project in the sense of analyzing sound with machine learning techniques, except that in SONYC the focus is on urban sound and noise, while BirdVox
is about bioacoustics: studying birds. It's a collaboration with Cornell University, with the goal of studying bird migration patterns. These two projects will fill the first half of my talk, or maybe the first two thirds, and then the rest will be about the tools for audio research that we've been developing: mir_eval, JAMS, Scaper, the Audio Annotator, and more.

So, SONYC is a big project with a lot of moving parts: acoustic sensing, machine listening, data science working with the output of the machine listening component. Today I will very much focus on the part in the middle, the machine listening part. By machine listening I generally mean extracting semantically meaningful information from audio signals, and as you will see, in our case it's really about identifying the presence or absence of different types of sound sources in a signal.

I'll present this as the historical arc of the different steps we went through in our research, but to make sure nobody gets impatient along the way, spoiler alert: deep convolutional neural networks with a lot of data win the day. The second spoiler, which I just added, is that we have some very recent results that haven't been published yet, so I'm going to give you a sneak peek of those too.

Throughout this talk there will be four overarching topics that cut across working with birds, jackhammers and cars. One is signal representations: how do we process the audio signal to get it to a place where it's amenable to machine learning? The second is the models themselves, the specific architectures. The third is data scarcity, the problem of not having enough data; we started working on this well before Google released AudioSet, so the picture has changed slightly, but I think a lot of it is still relevant and applicable. And then there are the real-world challenges of actually getting these systems to work out there, deployed in the real world, which I won't focus on too much.

Okay, so this is my "why do we even care about urban noise?" part. It's been estimated that in New York City alone, nine out of ten adults, so almost everyone, are exposed to harmful levels of noise. If you've been to New York before, you'll probably have noticed that the city is super loud, just noisy all the time, and there have been over three million noise complaints since 2003. New York City has a service called 311 where you can call and complain about stuff: your neighbors making noise, garbage not removed from your street, a hole in the road. But it can also be about noise, and it turns out that noise is the number one thing people complain about through 311, and the number of complaints keeps growing year over year. Now, many people say: okay, so people complain about it, but noise isn't really a big deal.
Well, it turns out that noise is a big deal. There are scientific studies that link noise to serious health problems like sleep loss and hearing loss, and when you suffer from those, you are obviously less productive at work. There are also studies linking noise pollution to drops in productivity, increased stress, and learning impairment in children: when children attend schools surrounded by very noisy environments, it can actually influence how well they do in school. And here's my own little personal anecdote; I'm a data point in my own data set. This is the truck full of all my stuff from when I moved apartments because I couldn't take it anymore. My first apartment in Brooklyn was in a very, very noisy area, and it got to the point where it became unlivable, so we had to pack up and move. So I've experienced noise firsthand as well.

So we said: noise is a big problem in New York, and we want to tackle it, but in a data-driven, scientific fashion. Before we start making up methods and techniques, we probably have to understand how the city deals with noise right now. So we looked into it. Say you're in the city and you have these things; let's call them noise sources: AC units, dogs, garbage trucks. It's important to remember that these things, perhaps with the exception of the dog, don't make noise on their own; they're all operated by humans. So the humans are the critical part of the loop: humans operate these things that make noise, and that in turn contaminates our acoustic environment. Right now, as a citizen, your only option is basically to file a complaint through 311. Some types of complaints go to the NYPD, but most go to the DEP, the Department of Environmental Protection. If they have time, they'll send an inspector, and the inspector will try to issue some sort of ticket.

As you might imagine, there are a lot of inefficiencies in this process. First of all, the reporting system is sparse and biased: sparse because not everybody complains about noise when there is noise, and biased because some studies show that people living in richer neighborhoods tend to complain more than people living in poorer neighborhoods. The higher your expectations for quality of life, the more likely you are to complain about noise, which means complaints are by no means a mirror of how much noise there actually is in the city. The other problem is that the scheduling of inspections is ad hoc: they get complaints, and if someone is free they send them to a location, but there's no strategy behind scheduling these inspections. Consequently, they end up with a very poor hit rate, because most of the time, by the time they get there, the noise is gone. They can't issue a ticket or a fine because there's nothing to observe. So our question was: can we do better? How can we try to fix this?
What we basically add is a layer on top of the existing system. One thing that was important for us was not to come in and say we're going to replace everyone and everything with technology, but rather to enhance the existing process with it. So if before you only had citizens reporting, now we're adding a network of acoustic sensors that constantly monitor the environment and transmit that information to our main data center, where we can perform analysis and pattern recognition and identify patterns of noise over time and space. Using that information, we can then inform the city's inspectors so as to optimize their chances of actually observing a noise violation. If our system measures that every Thursday between 8:00 and 8:30 there's a jackhammer at location X, we can start building predictive models and say: your greatest likelihood of observing a noise violation will be next Thursday, between 8:00 and 8:30, at this location.

Another way to look at SONYC is as a cyber-physical system, a loop with three main components: sensing, analysis, and actuation. Actuation is important: if this stays a pie-in-the-sky research project and we never work on getting our information into the hands of the people who can actually go out, issue tickets, and act on it, the whole project remains hypothetical. By ensuring from the get-go that we're working in collaboration with city agencies like the DEP, we can really try to get to the point where the system has real impact.

Over the last few years, one of the two main components we've been developing is the sensing side, the physical hardware sensor, which looks something like this. It was mainly developed by my colleague, a postdoc named Charlie Mydlarz. There's a Raspberry Pi in there, and the microphone component looks like this: the green board here carries the microphone, which is a MEMS microphone. They're tiny; they're printed like chips. The board adds some post-processing on the signal to get rid of the microphone's inherent frequency response, so that we get a nice flat response, and it also does some tricks to avoid electromagnetic interference and that kind of stuff. It's mounted inside this 3D-printed gray thing (I never really learned the name for it), which serves as a mount and also protects it a little from rain. The holes in the sides are there to ensure we don't get any funky resonances inside the mount that would affect the signal we get from the microphone. So that's our sensor, and here you can see a picture of one mounted in New York City. We have close to 50 of these in different locations around the city now, and the hope is to get to a hundred as part of our pilot project.

The point is, these sensors can already measure decibel levels; they can measure how loud something is. But if you only measure how loud something is, and you don't know what that thing actually is, how can you reliably start developing strategies to mitigate it?
You don't know whether the loud thing was a dog barking, a person, a car passing by, or construction. And the whole point is to collect this data at a large scale across the entire city, to build statistical models of these noise patterns and develop broad strategies for mitigating noise. You can think of it this way: only measuring dB levels would be the equivalent, in the image world, of only giving you the average color of an entire image as one pixel, when what you actually care about is what's in the image.

That's why the problem I've been focusing on for the last few years is source identification: given a signal, an audio recording, automatically identify the presence of a jackhammer, a siren, a person, a dog barking; basically any type of sound that might be interesting for us to study as part of this work. And it's a challenging problem. First of all, there are tons of different types of sources we might want to recognize, so our vocabulary is potentially huge. There's a lot of background noise; we're working outdoors. This is unlike music where, unless you're working with live recordings, everything is recorded in studio conditions, clean and quiet; here a single microphone captures an entire environment. And sometimes the noise is actually what you're trying to identify: you hear drilling in the background, and that's the very source you're trying to pinpoint. So even the notion of what counts as background and what you actually want to detect is something we have to work to define. And unlike signals like speech or music, it doesn't really have any large-scale structure we can exploit. In language you can build language models and use them to constrain the detections of your NLP system; in our case we don't really have that, other than very broad fluctuations like day versus night, or weekend versus weekday. There's no guarantee that if we heard a jackhammer, five minutes later we'll hear a siren; there's no link between these things.

So we started tackling this problem in multiple steps. The first was building a taxonomy: if you want to classify things, those things first need to be grouped into categories. So we built a taxonomy of urban sounds. You can't really see it here, but basically, starting with the entire acoustic environment, we divided it into human sounds, nature sounds, mechanical sounds, and musical sounds. The mechanical sounds are the broadest category we have, because they include both construction and traffic. Here's a zoomed-in version: you can see motorized transport divides into marine, rail, road, and airborne transportation, and within those we split further down. This was the first step for us: understanding how we distinguish between different concepts, and how we should organize those concepts as humans, before trying to build a machine to do it for us. Since then, for those of you familiar with the AudioSet data set Google has released: they coupled that data set with a giant taxonomy as well, much, much larger than this, and inspired in a very small way by ours.
So that was nice, but moving forward we'll probably try to merge the two so that they speak to each other nicely.

We then built a data set from Freesound. We crawled Freesound for different tags and sounds we wanted to find, and ended up with a data set of a little under 9,000 recordings, where every recording is a short snippet containing one of ten different types of sound. It's called the UrbanSound8K data set, it's publicly available, and you can download it and play around with it; just Google "UrbanSound8K" and you'll find it.

The first thing we did was try a feature learning approach, but a simple one. Nowadays everything is deep learning; first we asked, what about shallow learning? As opposed to the standard MFCCs approach, let's see if we can learn a representation, but keep it shallow. In this case we used a clustering approach: you extract a time-frequency representation, you cluster it, and you take the centers of those clusters as codewords in a dictionary. Once you have this dictionary, you take any incoming sound, compute the dot product of its representation with the dictionary, and that gives you your feature representation, with fixed dimensionality because we then summarize it over time. And once you have this representation, you can feed it into your supervised classifier of choice (a random forest, an SVM, what have you) and see how well it works. There are more details here, but I'll skip those.
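To make that pipeline concrete, here is a minimal sketch of the idea in Python. It's not our actual code (the real system used PCA whitening and spherical k-means, and `train_files`/`train_labels` are placeholders), but the structure is the same: cluster log-mel patches into a codebook, encode clips against it, and classify.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def logmel_patches(path, n_mels=40, patch_len=8):
    # Log-mel spectrogram, sliced into overlapping time-frequency patches.
    y, sr = librosa.load(path, sr=22050)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return np.stack([S[:, t:t + patch_len].ravel()
                     for t in range(S.shape[1] - patch_len)])

# 1) Learn the dictionary: cluster training patches, keep the centers as codewords.
train_patches = np.vstack([logmel_patches(f) for f in train_files])
codebook = KMeans(n_clusters=500).fit(train_patches).cluster_centers_

def encode(path):
    # 2) Encode a clip: dot product of every patch with every codeword,
    #    then summarize over time (max-pooling) to get a fixed-length vector.
    return (logmel_patches(path) @ codebook.T).max(axis=0)

# 3) Feed the encodings to a supervised classifier of choice.
clf = SVC().fit(np.stack([encode(f) for f in train_files]), train_labels)
```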
So we tried this and compared it to a baseline, which was just MFCCs and an SVM (nowadays that's really old-school stuff), and we tried dictionaries of different sizes and flavors. When our clustering clustered a single frame, that is, we took all the individual frames in the data set, clustered them, and took the centroids as our dictionary, what we found, disappointingly, was that feature learning wasn't giving us anything: we got the same performance as MFCCs. But what happens if we move from clustering individual frames to clustering time-frequency patches? As we move from one frame to four frames, we get a big bump; at eight frames, a further bump; at sixteen, not so much. The main takeaway is that when you're working with environmental sounds, the instantaneous timbre of a single frame is not enough to distinguish between sounds; characterizing sounds over time and frequency, looking at these patterns, is what really lets us start separating different sounds. I should mention that the sounds in this data set include things like car horns, jackhammers, idling engines, air conditioning units, dog barks, drilling, and so on.

Then we looked at the centroids, this representation that we learned, and what it looks like for different classes, and indeed you can see some patterns. For the siren, you can roughly see the lines going up and down, representing its undulating signal. It's a little hard to see, but for children and for dogs you get signals that look speech-like; you can see clear harmonics. And for the jackhammer you can see this pattern here.

I want to show you a quick demo of the latest version of this. We can plug the sensor into an interface: this is exactly the same microphone used in our sensors, plugged into my laptop, so the input to the system is coming from the sensor. (There are actually more classes, but the resolution here is very small.) And I have my favorite prop here. Now, because this is trained as a multi-class model, not a multi-label model, it's always going to try to identify something, and since we didn't have human speech among the classes, when I speak it mostly can't decide whether I'm a dog or a child. In our more recent models we've switched to multi-label models, meaning multiple things can be active at the same time, and nothing at all is active if the model isn't convinced about anything. If I raise the pitch of my voice a little, I can try to make it sound like a child. So, you get the idea; it's pretty straightforward. The point is that if you have something like this running in real time on the sensors, the sensors can start identifying what's happening in the street and send that information back to home base, generating actionable information via these analysis algorithms.

All right, so that model was based on mel spectrograms. We said: what if we take a different representation? Nowadays it's pretty standard to use mel spectrograms as your default representation for feature learning and classification. So we looked at the scattering representation where, if x is your time-domain signal, the representation is a cascade of wavelet transforms and smoothing: wavelet transform, smoothing, then again wavelet transform and smoothing. To give you some intuition for what this looks like: this is what's referred to as a scalogram.
You can think of it as a certain kind of spectrogram. There are three sounds here: the first is smooth, the second has a sharp attack, the third is intermittent. After we apply the smoothing, they look like this. So what's the advantage of smoothing? Smoothing gives you shift invariance in time: if I take the signal and move it a little forward or backward, it looks the same. The problem is that we've now lost the temporal characteristics of the signal. That's why, given this, we apply the filter bank again, and this time it's applied to each frequency band of the signal independently. I'll show you what this transformation looks like for just one frequency band, and it looks like this: suddenly you can see the onset and the offset here, you can see the sharp attack here, and here, since the sound is intermittent at a certain rate, that rate shows up in the scattering representation as a frequency. So now we have these two representations which, combined, tell us both about the average timbre of the sound and about its temporal evolution.

We took exactly the same framework of shallow feature learning with a dictionary, but swapped out the mel spectrogram for this scattering representation. What we found is that the results weren't significantly better, but, interestingly, we could use a much smaller dictionary. With the mel spectrogram, bigger dictionaries gave better results, and that makes sense, because the representation is not shift invariant: you need to learn a centroid for each of many shifts to be able to represent all the sounds. Moving to scattering, we were able to reduce the size of the dictionary by roughly an order of magnitude and still get the same performance. So basically we got a much more efficient representation, which lets us build smaller models that are faster to train and faster to run inference on, while performing the same. I'll skip this.
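For reference, in the standard notation (my summary, not a slide from the talk), the first- and second-order scattering coefficients of a signal $x$ are built from a wavelet filter bank $\psi_\lambda$, a modulus, and a low-pass window $\phi$:

$$S_1 x(t, \lambda_1) = \left( |x \ast \psi_{\lambda_1}| \right) \ast \phi(t)$$

$$S_2 x(t, \lambda_1, \lambda_2) = \left( \big| |x \ast \psi_{\lambda_1}| \ast \psi_{\lambda_2} \big| \right) \ast \phi(t)$$

$S_1$ is the smoothed scalogram, the shift-invariant "average timbre", while $S_2$ re-captures, per frequency band, the temporal modulations that the smoothing by $\phi$ discarded.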
Okay, so the next obvious step: we had observed two things. One, time-frequency is important; you can't characterize environmental sounds over time alone or over frequency alone, you need time-frequency. And two, shift invariance gives us better performance. A convolutional network does exactly that: it learns time-frequency patterns, two-dimensional patterns as a function of your input representation, and it's shift invariant in time and in frequency. With a linear-frequency spectrogram, shift invariance in frequency doesn't quite make sense, but with a logarithmically scaled frequency axis like a constant-Q or mel spectrogram (except for the bottom range, which is linear), you have a representation where these convolutional filters make sense.

So we tried what I think it's fair to say is a super vanilla convolutional network; really nothing complicated: three convolutional layers followed by two dense layers, trained using the standard bag of tricks, nothing special. And, somewhat disappointingly: here's the shallow feature learning, here's another deep model someone else proposed, and here's the deep model we were trying, and it didn't actually work better. It failed to outperform the shallow model, and the same happened with the other deep model. So the question is: is it that the model is simply not useful for this problem, that it's not better? We thought maybe the problem is that we don't have enough data. As we know, these models tend to be very data hungry; they require a lot of data to learn meaningful, discriminative representations.

So we turned to MUDA, a tool developed by Brian McFee, who is also currently at NYU (he was at Columbia before that). It's an augmentation framework for music; he originally designed it with MIR research in mind, but we've been happily applying it to environmental sound for a long time now. It's open source, so again, I encourage you to try it and play around with it. Basically, given a single audio signal, you can apply a variety of transformations, like adding noise, pitch shifting, and time stretching, to generate transformed versions of that signal. As long as we don't transform a recording so much that it's no longer a valid representation of the sound, we can use this to generate more data from our existing data, add variance to our training set, and hopefully improve the generalizability of our models, convolutional models in particular.
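Generating the augmentations with MUDA looks something like this. A minimal sketch with placeholder file names; treat the deformation values as an example:

```python
import muda

# Load an audio file paired with its JAMS annotation.
jam = muda.load_jam_audio('clip.jams', 'clip.wav')

# A pipeline of deformers; each parameter list spawns one output per value.
pipeline = muda.Pipeline(steps=[
    ('pitch', muda.deformers.PitchShift(n_semitones=[-2, -1, 1, 2])),
    ('stretch', muda.deformers.TimeStretch(rate=[0.81, 0.93, 1.07, 1.23])),
])

# Each output jam carries the deformed audio plus annotations, with the
# full deformation history recorded in its metadata.
for i, jam_out in enumerate(pipeline.transform(jam)):
    muda.save('clip_aug_{:02d}.wav'.format(i),
              'clip_aug_{:02d}.jams'.format(i), jam_out)
```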
So, going back to these results: our shallow approach and the CNN. What happens when we add the augmented data? We used MUDA to build a data set 20 times the size of the original, and trained both the shallow model and the deep model on it. What we found was that both improved, but the convolutional network improved significantly more than the shallow learning approach. So here we start to see that yes, we simply didn't have enough data to train a convolutional network, and once we do have more data, not only does it improve, it significantly outperforms the shallower representation. I'll skip this.

One last anecdote about data augmentation. We broke down the different classes in our data set, and for each one we checked how each of the augmentations affects classification. If the bars go this way, classification for that class got better; if they go this way, it got worse. You can see there are some classes for which applying certain transformations actually reduced accuracy. So the conclusion is that maybe we don't want to wholesale apply all of our transformations to all of our data; maybe we want to apply different transformations to different classes.

This is again the demo I showed you before. Here there are some recorded sounds; of course, none of them were used for training. I'll quickly play this for you; it runs in real time with the CNN as the back end performing classification. You'll notice it takes a while before it settles on a classification: that's because we're using a context window of four seconds for this model, so it needs about four seconds of incoming audio before it locks onto something.

From these experiments, we basically saw that feature learning helps, compared to a standard MFCC representation; that for environmental sound, models that build in shift invariance generalize better; that for large models we just need more data; that data augmentation helps but is class dependent; and that data really seems to be the bottleneck.

Then, recently, Google released AudioSet, a data set of, I think, two million YouTube videos, including tags annotating which sounds occur in each video. They also released a paper where they trained a bunch of very deep convolutional models on that data. Now, the nice thing about a convolutional network is that at any point in the network you can just cut it with scissors and compute the intermediate representation the network produces up to there. The hope is that if the network has learned a meaningful, discriminative audio representation, that representation might also be useful for solving other problems; this is sometimes referred to as transfer learning, or just embeddings. They released their model, so you can use it to compute the embedding Google learned by training on two million YouTube videos, and try to leverage it for your own problems.

So we said, okay: let's take UrbanSound8K and extract features using Google's model, trained on two million videos (which would have taken us a very, very long time to do ourselves), and then just throw that feature representation into a simple classifier like an SVM. For context, the best result we'd gotten before, using our CNN with data augmentation and every trick we could throw at it, was 79 percent classification accuracy. Using this deep embedding, with no data augmentation, directly on our limited data set, we got 79 percent: the same result, with none of the work. The moral of the story, perhaps a little sadly, is that the more data you have, the better you're going to do. There was a recent paper where they tried data sets of increasing orders of magnitude, and they kept seeing improvements in model performance. If you have a small data set, you have to try all these little tricks, augmentation and so on; but if you have a very, very large data set, you can use relatively simple architectures, and if you just throw enough data at them, it seems you're going to learn a representation that's highly discriminative.
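The transfer-learning recipe itself is tiny. In this sketch, `embed` is a placeholder for a forward pass through the pretrained network, truncated at an intermediate layer and averaged over time (e.g. Google's released AudioSet model); everything else is stock scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# embed(path) -> fixed-length vector from the pretrained network
# (placeholder; train_files/test_files and the labels are yours).
X_train = np.stack([embed(f) for f in train_files])
X_test = np.stack([embed(f) for f in test_files])

scaler = StandardScaler().fit(X_train)
clf = SVC().fit(scaler.transform(X_train), y_train)
print('accuracy:', clf.score(scaler.transform(X_test), y_test))
```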
But should that mean we throw all of our DSP knowledge, and the specific observations we've made over three years of studying this problem, out the window, and just take these uber-trained models and see what happens? It turns out the answer is: not necessarily. The Google model is trained on mel spectrograms; our CNN was trained on mel spectrograms; but we saw before, with the dictionary learning, that a scattering representation actually worked better. So we said: now that we're working with convolutional networks, let's revisit the scattering idea, except now using a time-frequency scattering representation. Once you have something that looks like your spectrogram, you perform a two-dimensional wavelet transform on that representation, so the wavelets now represent information in time and frequency in a joint fashion.

Remember that the scattering representation is hierarchical: you do a wavelet transform and smooth, then another wavelet transform and smooth, so the representation goes from as fine as possible to as coarse as possible, very smooth and averaged over time. At the same time, a convolutional network also goes deeper and deeper. So we asked: how can we combine these two things in a way that makes sense? We came up with a model that, for the moment, we're nicknaming the snowball model, because it looks a bit like a snowball rolling: as it rolls and rolls, it collects more and more information. Blue represents the convolutional architecture, which goes deeper and deeper; yellow represents the scattering representation, which becomes coarser and coarser; and at every depth of the architecture we combine the intermediate representation with the scattering representation at the matching scale. If I zoom into one of these modules, it looks like this: your standard convolution, nonlinearity, and pooling operators, which form one building block of a convolutional network, alongside one building block of a scattering transform: a wavelet transform, then a modulus, then the low-pass filter that smooths the representation for the next level. That's one block, and we concatenate a whole bunch of these together and train it.

This is fresh out of the oven; we literally got this result last week. But suddenly, from 79 percent accuracy, we jumped to 91 percent. I don't want to draw conclusions on this just yet, because it's still very fresh for us, but I guess the moral of the story is that the jury is still out on which representations are the most useful, or the most amenable, for training audio classification models. You have models working from the raw signal, models working on spectrograms or mel spectrograms, and for this problem, where time-frequency patterns are really important, a scattering representation, which specifically tries to extract that information from the signal, seems to give a very significant bump in performance. All right, so that's sneak peek number two, because this isn't published yet either.
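To make the block structure concrete, here's a hypothetical sketch of one "snowball" block in PyTorch. This is my own illustration of the idea described above, not our implementation; it assumes the scattering path has been precomputed offline and downsampled to match the pooled feature map:

```python
import torch
import torch.nn as nn

class SnowballBlock(nn.Module):
    """One depth level: a standard conv block on the learned path,
    concatenated with scattering coefficients of matching coarseness."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x, scat):
        # x: learned features from the previous block.
        # scat: precomputed scattering coefficients, already at the same
        # time-frequency resolution as the pooled conv output.
        h = self.pool(torch.relu(self.conv(x)))
        return torch.cat([h, scat], dim=1)  # channels grow at each depth
```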
And the third thing that's not published yet: all of the problems I've mentioned so far were on the UrbanSound8K data set, and in that data set every audio recording contains only one class. As you saw in the demo, if you train a model on such a data set you get a multi-class model, which always tries to classify something, and that's not really what you want to deploy in a sensor out in the street, because sometimes there's nothing of interest, and you don't want to force the model to say "this is something" when it's nothing. That's one limitation of multi-class models. Another is that you can't have overlapping sounds, because the model is always forced to predict exactly one class. And a third limitation of this whole setup is that we need to tell the model exactly when the sound starts and when it ends. If we had a 30-minute recording, for the models I've shown you we'd basically have to say "between seconds five and ten there is a siren", cut out the siren, and train the model just on that. I'll refer to labels like these as strong labels: labels that tell you exactly where in time different sounds happen. Weak labels, on the other hand: say you get a 10-second or 30-second recording and you're just told, "I don't know where in this recording, but somewhere in here there's a siren, there's a truck horn, and there's some screaming. That's all I know."

The AudioSet data that Google released is like that: two million ten-second clips, and all they give you are weak labels, the sounds that are somewhere in each clip. So for us the question was: given audio and weak labels, can we train a model that gives us strong labels? Can we train a model that goes from not knowing where the sounds are to telling us exactly where they are? It so happens that this was also one of the challenges chosen for DCASE. For those of you who aren't familiar with DCASE, it's kind of like MIREX, but for environmental sound; and for those not familiar with MIREX, it's like Kaggle for music analysis problems. We participated, and our solution is based on a convolutional-recurrent architecture with a softmax pooling layer; the novelty is in the pooling. If you want to chat more about this, I'm happy to go into details, but I feel that's a little too in-the-weeds for this talk. All I'll mention is that the challenge results are going to be published this Friday, so we'll see how we did.
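As a flavor of what softmax pooling means here, one common formulation (my sketch, not necessarily exactly what we submitted): the clip-level prediction is an average of frame-level probabilities, weighted by a softmax over those same probabilities, so confident frames dominate and the weak clip label can supervise frame-level outputs.

```python
import numpy as np

def softmax_pool(p, alpha=1.0):
    # p: (n_frames, n_classes) frame-level probabilities.
    # alpha interpolates between mean pooling (alpha=0) and
    # max pooling (alpha -> infinity).
    w = np.exp(alpha * p)
    w /= w.sum(axis=0, keepdims=True)
    return (p * w).sum(axis=0)  # (n_classes,) clip-level prediction
```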
All right, so that's a very quick overview of what we've been doing over the last four years on environmental and urban noise. But I mentioned that in parallel we have this bird project, where we've basically been taking the same techniques and translating them across domains: from urban noise to bioacoustic species recognition. I'm going to whiz through this, because you've already seen all of the techniques.

Why do we care about classifying the sounds made by birds? Well, the folks at Cornell University really like building models that look something like this: for a specific species of bird, here the white-throated sparrow, where its populations live over time. Birds are fantastic indicators of ecological health: if you start noticing big differences in bird populations, it probably means something is messed up with the environment as well. You also just don't want birds flying through airports, getting sucked into plane engines, and crashing the planes. So there are a whole bunch of reasons why we might want to study the migration of birds, or at least why the folks at Cornell want to: it leads to new biological insights, and it can help in developing conservation applications.

They've been attacking this problem from a number of angles. In particular, they have a very successful citizen science project called eBird. There are a lot of people, let's call them birders or bird watchers, who like to go out, observe birds, and make lists of what they see, and Cornell managed to tap into that existing community and convince them to upload their lists online. If you're curious how they did it: initially they said "help the scientists, share your observations with us", and nobody participated. Then they added scores, top-10 contributor lists, "compare your lists with those of your peers", and suddenly the thing just exploded. Sometimes a bit of healthy competition can get you a long way. The other source of information they have is radar. This is radar designed to measure weather, but if you have large flocks of birds flying in the same direction, that actually shows up; it lights up in certain types of radar.

So those are the two sources of information they have right now, and here are the problems. When people go out to watch birds, they do it during the day, so you get no nocturnal observations; but that's when most of the species we're interested in migrate: during the night. And the radar does tell you something about the quantity of birds and their direction, but nothing about the specific species; it's just a cloud flying a certain way. So the idea is to add audio as the missing piece of the puzzle. You deploy a sensor out in the field, and as birds migrate overhead they emit a sound called a flight call, which is very, very short. This is not birdsong, which is long and elaborate; it's very short, but it turns out different species emit different flight calls, so you can go back from the sound they emit to the species of the bird.

Here's an ovenbird, and here's a spectral representation of what the flight call of an ovenbird looks like. As I mentioned, it's much, much shorter than birdsong: literally between 50 and 200 milliseconds, typically, and it's produced primarily during migration. I'm going to play one for you now, but pay close attention, otherwise you'll just miss it. Okay, that's a flight call. So we looked at the data and compiled a data set covering 43 species.
Here are their flight calls, all concatenated together. To the naked ear, it all sounds the same; if we slow them down, you can actually start to appreciate the differences between these calls. So again, we have a very similar problem: some phenomenon in the environment generates an acoustic signature, and our goal is to classify that signature into a certain tag or label. Why not try exactly the same family of techniques? Now, in the general problem of migration monitoring there are many moving parts, many problems to solve. One is dealing with things that are not birds at all, like frogs and alarms and people; another is classifying the sound to a specific species once you know it is a bird; and another is what happens when you have overlapping calls. Right now we're focusing on the middle one, so again, we're not claiming to solve everything. I'll skip all the whiz-bang claims.

We started, again, with the dictionary learning approach, which by now you're familiar with, and compared it to a convolutional network. We took exactly the techniques we had for urban sound and asked how they compare and contrast when applied to birds. And as with urban sounds, we applied data augmentation. For those of you still wondering what we mean by these transformations: think about images. Say you have an image of a stop sign; you can transform it in many different ways, and all of those transformed images are still images of a stop sign. As long as you don't transform the image to the point where it no longer looks like a stop sign, you can use the transformed images to train your model.

So again we used MUDA, and we applied a whole bunch of transformations. To give you an idea of what these sound like: [tweet tweet] Let's say that's our common bird, because if I demonstrate on the actual bird sounds, they're way too short to recognize. So, our common bird once again: [tweet tweet]. Now pitch shifted: [tweet tweet]. Time stretched: [tweet tweet]. Compressed: [tweet tweet]. And with background noise. Those are the transformations we've been applying throughout this talk. If I play them on the actual signal of a bird, it's very hard to notice the differences: original, compressed, with noise, pitch shifted, and time stretched. The only thing we have to be careful about when working with biological sounds is not to transform the signal to the point where it's no longer representative of the species. So we worked closely with our ornithologist friends, playing them the transformations at different ranges of pitch shifting to make sure we weren't invalidating the signal and getting something like this, which would no longer be a valid call.

We compared against an MFCC baseline, as before, and we worked on this data set, which is again freely available online; feel free to download it and play around with it. Everything I'm mentioning can be found on my website, so if you want a one-stop place for data sets, code, and libraries: justinsalamon.com. Now, the distribution of the data set is very unbalanced.
There's one species in particular, the magnolia warbler, of which there are many more examples in the data set. So if you just predict the majority class, your ZeroR classifier in Weka (for those who remember Weka), the baseline accuracy is 23 percent. The question is: can we do better than 23 percent? What we found was that feature learning kicks ass; it does much better than MFCCs. But the convolutional network, disappointingly, doesn't do better than the shallow model. And that's with augmentation; if we train the CNN without augmentation, it does even worse. So we said, okay, that's kind of a bummer.

But do the two learn the same model? Do they make the same mistakes, the CNN and the shallow dictionary model? We took the confusion matrix of the CNN and subtracted from it the confusion matrix of the dictionary learning approach, giving what I like to call a delta confusion matrix. If the two models made exactly the same mistakes, the delta matrix would be zero everywhere; wherever it's not zero, one model made more mistakes than the other. And the interesting thing is that along the diagonal, some cells are blue, meaning negative, and some are red, which means one model does better on some species and the other model does better on other species. They are not making the same mistakes. So we have two models, each slightly better at something: can we combine them to improve the overall performance of the system? This is commonly referred to as fusion, or ensembling.

In our case we looked at late fusion, meaning each model already generates predictions, likelihoods for each class, and you combine the likelihoods. (We had to do a bit of wizardry to get likelihoods out of an SVM.) The simplest thing you can do is take the geometric mean: imagine you have the likelihoods for five classes from one model and the likelihoods for five classes from the other; just multiply them, or equivalently sum them in log space, and that's your new set of likelihoods. You can also treat each likelihood vector as a feature vector and use those to train yet another model to predict the actual class; that's learned fusion, and we tried a whole bunch of different machine learning models on these feature vectors. So what did we find when we applied fusion? Yes, we get a statistically significant bump in performance. If you have models that learn different things and are good at different things, combining their predictions will pretty much always buy you a few extra points. Nowadays it's common knowledge that whoever wins a Kaggle competition is using an ensemble of a large number of models; any model by itself will never generalize as well as a family of models. And which fusion worked best? Simple fusion, in our case: literally just taking the geometric mean of the likelihoods.
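That kind of late fusion is nearly a one-liner. A minimal sketch, assuming `p_cnn` and `p_dict` are the per-clip class likelihoods produced by the two models:

```python
import numpy as np

def geometric_mean_fusion(p_cnn, p_dict, eps=1e-12):
    # p_cnn, p_dict: (n_clips, n_classes) likelihoods from each model.
    # Sum in log space (equivalent to multiplying), then renormalize per clip.
    log_p = np.log(p_cnn + eps) + np.log(p_dict + eps)
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

# Predicted species = argmax over the fused likelihoods:
# y_hat = geometric_mean_fusion(p_cnn, p_dict).argmax(axis=1)
```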
And in the same way that we had a demo for urban sounds, we have a demo for birds. The problem is that since none of us can actually recognize the species, you're going to have to just believe me that it's doing the right thing. The idea is the same: as different flight calls sound, it identifies the species of the bird and visualizes it. At the top here are the quintessential, clean spectral representations of these calls, and at the bottom are examples of the kinds of calls I used in the demo. You can really see that the model generalizes across time stretching, the addition of noise, echoes, reverberation; it's robust to all of these transformations of the signal. So: we tried shallow and deep; deep did comparably; but combining them gave us an extra bump. I'll skip this.

All right, so for the last seven minutes or so I want to talk about something completely different: the various tools we've been developing. As we've been doing all of this research, we keep hitting roadblocks where something makes our lives difficult; we try to solve it, and then we make the solution open source in the hope that it will also solve somebody else's problem. So this is where I'm really going to try to sell you stuff, or at least convince you that these things might be helpful in your research. I'm just going to start listing problems.

How can I guarantee that my evaluation code is correct? And if I'm participating in MIREX, the competition for comparing different music information retrieval algorithms, how can I guarantee that my code matches the code used in MIREX?

How can I store multiple annotations for the same audio file together? Say I have a file and I've annotated the beats, but I've also annotated the chords, and I also have the notes of the melody. Traditionally we just store all of these in text files, lab files. There's no way to keep them in one place, no way to indicate that they all belong to the same file. Furthermore, each of these representations of music has a different vocabulary: chords are one vocabulary, pitch is another, beats are times. So for every one of these representations, we have to write a new parser to load the data. I don't know about you, but in my experience, every time I download a new data set I have to write a new parser before I can start working with it, and that's time-consuming and error-prone.

If I'm working on sound event detection, I can take real recordings of scenes from outdoors, but those are very hard to annotate and I have no control over them. What if I want to specifically test the performance of my model as a function of SNR, the signal-to-noise ratio between a foreground sound and the background? I have no way to control that in a real environment. But what if I could synthesize soundscapes where I control all the audio parameters, and automatically get a ground truth to use for benchmarking different algorithms?

How can I crowdsource audio annotations? Annotation is so time-consuming and costly; can I crowdsource it, and what tools are there for doing that kind of thing?
And finally: for those of you who've worked on melody extraction, or multi-pitch, or anything to do with pitch, you know that annotating continuous pitch is a horrible, horrible task. It takes forever: you run a monophonic pitch tracker, it makes loads of mistakes, and then you have to correct them manually. As a result, the data sets we've been using for melody extraction to date are all tiny: compare a one-million-video YouTube data set to the data sets used in MIREX of 20 melodies, 40 melodies, 100 melodies in the best case. The problem with such tiny data sets is that, more likely than not, the results you observe are not significant; they won't generalize once you apply the algorithms to massive data sets.

So we have all these problems; now let me show you the solutions we've come up with for them. When it comes to evaluation (this was led by Colin Raffel at Columbia, and I was just one of a number of people contributing), we built mir_eval. The idea is a standardized, open-source library for evaluating different MIR problems: beat detection, melody extraction, source separation, chord recognition; I think there are now close to ten MIR problems for which mir_eval offers an evaluation framework. And it's very easy to use: you can call it from inside your Python code, but if you want to use it from the command line, it's just as easy. We provide ready-made evaluators: you tell it where to store the results, the name of the reference (ground truth) file, and the name of your estimate file, and it runs the evaluation and saves the results to disk.

Now, you might say: this is yet another implementation of evaluation code; why should I use this one in particular? Because we tested the hell out of it. Basically everything in mir_eval is 100% unit tested, and we regression tested it against MIREX results to ensure they're consistent; where the results differ slightly, due to library differences or implementation details, those differences are thoroughly documented. All of this is just to say: if you're working on an MIR problem and there's an evaluator for it in mir_eval, save yourself the hassle of writing your own eval code and just use mir_eval. And if your problem isn't covered, I very much encourage you to make a pull request and contribute to mir_eval, rather than writing code that will never be used by other people.
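For instance, evaluating a melody estimate from Python takes a few lines (file names here are placeholders; each file is the usual two-column time/frequency format):

```python
import mir_eval

# Load (time, frequency) series for the reference and the estimate.
ref_time, ref_freq = mir_eval.io.load_time_series('reference.txt')
est_time, est_freq = mir_eval.io.load_time_series('estimate.txt')

# Returns a dict of all the standard melody metrics.
scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
print(scores['Overall Accuracy'])
```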
All right, the next problem I mentioned was this issue of annotations: at the moment we almost always store music annotations as text files, lab files, sometimes without even column headers. For melody extraction, for years the de facto standard was just one column of timestamps and one column of f0, that's it, sometimes space-separated, sometimes tab-separated. It was a mess. So we said: what if we came up with a single, consistent format for storing any type of MIR annotation? Moreover, one where you can store multiple annotations together, store metadata alongside them, and describe how each annotation was generated, what tools you used, which file it relates to, how long the file is, and so on. We ended up with this thing called JAMS, which stands for JSON Annotated Music Specification. The nice thing about it is that the structure is JSON, which means you can easily use it over web services, and there are loads of existing frameworks that let you work with JSON. A JAMS file basically holds a list of annotations, each with its annotation metadata and the actual annotation data, plus file-level metadata such as the artist, the title, the version of the JAMS library you used, whether the file is part of a certain corpus, et cetera.

In principle, JAMS is just a specification, and you can use any software on any operating system as long as it adheres to the spec. But that's a lot of work, so we built a Python API that lets you very easily load a JAMS file and explore the annotations inside. The API automatically validates your annotations, making sure the format is correct, and it even gives you, for free, evaluation and visualization via mir_eval.

Very quickly (demos are always a bad idea, but I can't help myself), this is an example in an IPython notebook. Here I created a new JAMS file and added some metadata; then I created a new annotation and told it it's a MIDI note annotation; then I added a few note events (we even have parsers to load a MIDI file directly into JAMS); then I added some metadata about this specific annotation. And if I just type `jam` and hit enter in the notebook, I get this: I can easily browse the metadata and the annotations. I see there's one annotation inside, `note_midi`, and here, in the form of a pandas data frame, the start time, duration, value, and confidence of each note, plus the metadata about the annotation. If I print it, I get the raw JSON; that's what it looks like under the hood. We also have a display module, with namespaces that tell you whether an annotation is notes or chords or beats or sources, so if you run the annotation through display, for free you get a visualization that makes sense for that annotation type: here time on one axis and MIDI notes on the other. So JAMS is a library for manipulating annotations, but it's also very good for prototyping: you can very quickly look at annotations and see what they contain. So, yay JAMS; sorry, almost done. It's available at github.com/marl.
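In code, the notebook example above boils down to something like this (values are made up for illustration):

```python
import jams

jam = jams.JAMS()
jam.file_metadata.duration = 8.0  # file-level metadata

# A MIDI-note annotation with a few note events.
ann = jams.Annotation(namespace='note_midi', duration=8.0)
ann.append(time=0.5, duration=1.0, value=60, confidence=1.0)  # C4
ann.append(time=1.5, duration=1.0, value=64, confidence=1.0)  # E4
ann.annotation_metadata.data_source = 'manual annotation'

jam.annotations.append(ann)
jam.save('example.jams')  # validates against the spec on save
```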
The next tool is Scaper. As I mentioned, if you're working on sound event detection, the problem is that natural soundscapes are very uncontrolled, which makes systematic evaluation on them very hard. So we said: what if we could simulate soundscapes and control every aspect of them, when the sounds happen, how loud they are, what sounds they are, the SNR with respect to the background? So we built Scaper. It's open source, and we're actually going to be presenting it at WASPAA in a month. What you do is build a soundscape specification, and Scaper returns both the audio, the mixed soundscape, and an annotation in JAMS format, so you get both.

The way Scaper works is that you define your sound bank of foreground sounds and background sounds, so you choose your library, and then you define a specification. The nice thing is that the specification is probabilistic: you can say, give me a sound with an SNR drawn from this distribution of values and a duration drawn from this distribution of values. Basically everything you can define about the scene you can define probabilistically, which means that from a single specification you can generate infinitely many different soundscapes. Furthermore, we've integrated sound transformations such as pitch shifting and time stretching, so even if your sound bank is not very big you can say: randomly apply a pitch shift drawn from a normal distribution with a mean of zero and a standard deviation of three semitones, and every sound you plug in is transformed too. The result is that from one specification and one library you can build a whole bunch of soundscape files with matching annotations. This is on GitHub as well.

And the nice thing is that once you have something like this, you can start doing studies like: take exactly the same dataset, but vary the signal-to-noise ratio of the foreground sounds with respect to the background, and re-evaluate exactly the same model. Here you can see the SNR going up, with precision in blue, recall in green and F-measure in red. If I only looked at the F-measure, it would look pretty stable, but looking closer there's actually a trade-off between precision and recall. As the sounds become louder our recall improves, so we detect more sounds, but our precision goes down, so we make more mistakes, more false positives. This type of analysis, such a detailed breakdown of a classification model, would be impossible if you only worked with natural soundscapes over which you have no control. So, Scaper.
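As a rough sketch of what a specification looks like in code, assuming a hypothetical sound bank laid out as one sub-folder per label (the labels, paths and distribution parameters here are all invented):

```python
import scaper

# A 10-second soundscape drawn from a hypothetical sound bank.
sc = scaper.Scaper(duration=10.0,
                   fg_path='soundbank/foreground',
                   bg_path='soundbank/background')

# Distribution tuples make the specification probabilistic: every call to
# generate() samples fresh values, so one spec yields many soundscapes.
sc.add_background(label=('const', 'street'),
                  source_file=('choose', []),   # any file with that label
                  source_time=('const', 0))
sc.add_event(label=('choose', []),              # any foreground label
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),      # onset anywhere in [0, 8) s
             event_duration=('truncnorm', 2, 1, 0.5, 4),
             snr=('normal', 10, 3),             # dB above the background
             pitch_shift=('normal', 0, 3),      # semitones, as in the talk
             time_stretch=None)

# One call returns the mixed audio plus its JAMS annotation.
sc.generate('soundscape.wav', 'soundscape.jams')
```

Calling `sc.generate()` in a loop with different output names is all it takes to turn one specification into a whole dataset.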
Then, finally, another tool we've released is the Audio Annotator, which is basically a web-based interface for labeling sound. We're still recording; we have over seven terabytes of audio data collected from our sensors in New York City, and there's no way we're going to be able to label all of that ourselves. So we asked: can we crowdsource this? For that we needed an interface, so we built one where you have a spectrogram representation, you can draw selection regions on it, and you have a fixed set of labels. Again, I want to very quickly show this to you. Okay, so I can play my sound, I can skip to different places, and then I'm like, oh, okay, this sounds like a whistle to me, so I select it, adjust the boundaries slightly if needed, and label it. You can have overlapping annotations as well (I'm not very good at manipulating this, but overlapping annotations are supported).

We were about to use this to crowdsource data, but then we said: wait a minute. How do we know this interface is the one that makes the most sense? How do we know we're not biasing the quality of our labels because of our interface? How do we know how reliable the labels are as the audio scenes we give people become more and more complex? So we ran a bunch of experiments where we used Scaper to generate the soundscapes, which meant we had perfect annotations, we knew exactly what the labels should be, and we generated soundscapes with different levels of complexity and gave them to people to annotate. We had 30 people annotate each one of 60 different soundscapes, always 10 seconds long, and we varied both the complexity of the soundscapes and different features of the interface. In particular, we compared a spectrogram representation against a waveform representation, which you'll be familiar with from Freesound for example, the standard representation of a signal with some added coloring, and against no visual representation at all, because maybe a visual representation biases the quality of the data people give you. Those were our manipulations in the visual domain.

In the auditory domain we defined two characteristics. One is the maximum polyphony, meaning the maximum number of overlapping sounds anywhere in the scene, and we bucketed our soundscapes into three groups: low polyphony (no overlap), medium polyphony (maybe two overlapping sounds) and high polyphony (three or four overlapping sounds). The other parameter is the Gini polyphony, which is a little hard to explain quickly, but generally it captures whether things overlap for short amounts of time or for long periods, because that changes the complexity of the scene too: a lot happening all the time versus a lot happening only sporadically, with only one sound most of the time.

So we had a whole bunch of people use our interface to label the sounds, and then we evaluated their annotations against the ground truth references generated by Scaper. For those of you not familiar with how these things are evaluated, it's called segment-based evaluation. Say the soundscape is 10 seconds long: you slice it into segments of equal length, say 100 milliseconds (it can also be one second; it depends what temporal resolution you care about, and we used 100 milliseconds), and you round your annotations to those slices. Then, for each slice separately, you compare the set of labels added by the human against the set of labels in the machine-generated ground truth annotation, and from that you count your true positives, false positives, true negatives and false negatives. Once you have these quantities you sum them over all the segments, and you use the summed quantities to compute your overall precision, recall and F-measure.
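The bookkeeping is simple enough to sketch from scratch (I believe libraries such as sed_eval implement exactly this kind of metric, but a toy version fits in a few lines):

```python
import numpy as np

def segment_based_prf(ref_events, est_events, duration, seg=0.1):
    """Segment-based precision/recall/F-measure for sound event labels.

    ref_events, est_events: lists of (onset, offset, label) tuples.
    duration: soundscape length in seconds; seg: segment length (0.1 = 100 ms).
    """
    n_seg = int(np.ceil(duration / seg))
    tp = fp = fn = 0
    for i in range(n_seg):
        t0, t1 = i * seg, (i + 1) * seg
        # Labels active anywhere in this segment, for reference and estimate.
        ref = {lab for (on, off, lab) in ref_events if on < t1 and off > t0}
        est = {lab for (on, off, lab) in est_events if on < t1 and off > t0}
        tp += len(ref & est)   # true negatives aren't needed for P/R/F
        fp += len(est - ref)
        fn += len(ref - est)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, `segment_based_prf([(0.0, 3.0, 'siren')], [(0.5, 2.5, 'siren'), (4.0, 5.0, 'dog')], duration=10.0)` credits the overlapping siren segments and counts the spurious dog segments as false positives.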
This segment-based view, I think, is also a good idea for evaluating note transcription, because one of the big problems with transcription evaluation is that if you have one note in the reference and the person annotated two notes, you're heavily penalized for the second one: only one note will match the reference and the other is considered a mistake. There are evaluation frameworks out there that try to handle what happens if notes are fused or segmented, but I just like the idea of asking, at every instant in time, how many things are really present and how many things were annotated, and comparing those; you get rid of the whole concept of individual notes and the trouble it introduces.

Okay, and what we found was that using the spectrogram you actually get slightly better quality annotations, which maybe we might have expected, but we never had validation for it. Then, in terms of complexity, and I think this is very interesting: we found that as the sound becomes more complex, as the polyphony, the number of overlapping sounds, increases, the recall goes down but the precision stays more or less stable. What does that tell us? It tells us that when there are many overlapping sounds, people will still correctly label the sounds they do label; they just won't label all of the sounds, they will miss some of them. And that has direct implications if you want to use those annotations for training machine learning algorithms, because it tells you that you must treat the labels people give you as weak or missing: they are correct, but they don't contain everything that's in the signal, and that's something you'll want to factor in when you train your model. The other cool result: every person annotated ten files, and when we ordered people's performance by when they annotated each file and averaged, we saw that people actually improve over time. So if you want people to annotate stuff, give them a few training examples, because even going from the first annotated example to the tenth, people did better. And they improved more when they used the spectrogram representation, which tells you that complete non-experts, given spectrograms and some training, will actually learn how to take advantage of that representation, which I think is a cool result. This is also coming to ISMIR 2017 in China.

Very quickly: for melody extraction, generating annotations is a pain in the ass, it takes forever, so can we somehow automate the process? The answer is yes, and the trick is using multi-track recordings. Traditionally, people take the isolated melody track, run it through a monophonic pitch tracker, and then manually fix all the mistakes made by the tracker, and that's what takes tons of time. In this work, a collaboration with Emilia, Jordi and Juanjo here at the MTG, we said: well, let's just clean the signal automatically. Estimates that look really crazy we simply remove, and we apply some smoothing, and then we get something that looks like this.
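Just to make the cleaning step concrete, here's a toy sketch of the kind of outlier rejection and smoothing involved; the thresholds and filter length are invented, and the actual pipeline is considerably more careful:

```python
import numpy as np
from scipy.signal import medfilt

def clean_f0(f0, kernel=11, max_dev_cents=200):
    """Toy f0 cleaner: f0 is an array of frame-wise estimates in Hz,
    with 0 meaning unvoiced. Frames that stray far from a local median
    trajectory are treated as tracker errors and marked unvoiced; the
    remaining frames are replaced by the smoothed trajectory."""
    f0 = np.asarray(f0, dtype=float)
    smooth = medfilt(f0, kernel_size=kernel)   # local median trajectory
    voiced = (f0 > 0) & (smooth > 0)
    dev = np.full_like(f0, np.inf)              # deviation in cents
    dev[voiced] = np.abs(1200.0 * np.log2(f0[voiced] / smooth[voiced]))
    return np.where(dev < max_dev_cents, smooth, 0.0)
```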
The problem is that we can no longer use this cleaned curve as a reference, because now that we've cleaned and smoothed it, it doesn't correspond to the audio signal anymore; after the automatic cleaning and smoothing we probably have something like this. So how can we turn it into a valid annotation? By using it as the basis for re-synthesizing the track. We take the cleaned f0 estimate and use it to guide a sinusoidal modeling algorithm that estimates the amplitude, frequency and phase of every harmonic of the f0 track, and then we use that to drive a synthesis algorithm, which gives us a new, synthesized melody. This synthesized melody is a little different from the original melody, but it perfectly matches the cleaned curve, and that curve was generated purely automatically. So now your datasets can be as large as the number of multi-track files you have access to. That's it in a nutshell.

Then the other paper that will be hitting ISMIR 2017 is about a deep salience function. For those of you working on pitch estimation, you know that often what we do is start with a time-frequency representation, derive a salience function to highlight which pitches are most likely to be active, and then from the salience function trace the different instruments or the melody. The problem is that the salience function is usually very noisy and contains loads of ghost notes, and that's why any downstream processing we perform on it still ends up pretty noisy. This is a study led by Rachel, who has pretty much finished her PhD (she hasn't defended yet), where we use a deep convolutional network operating on a multi-octave constant-Q representation, which means the network can actually learn filters across octaves, so it can learn timbre patterns, which I think is pretty cool. You start with your time-frequency representation, go through this deep network, and in the end you get a salience function that's very, very clean. This is the target: literally only the pitches that are active, in the piano-roll sense of the word, and this is the matching representation learned by the system. And because you've learned such a nice clean representation, downstream processing can be very simple: with simple Viterbi tracking you can get very nice results, for example for melody extraction, and there are experiments in the paper that show that.
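To give a flavor of that kind of stacked input: one way to build it, and if I recall the paper correctly close to what it uses, is to stack several constant-Q transforms whose minimum frequencies are harmonic multiples of a base frequency, so that energy at integer multiples of a pitch lines up across channels. A rough sketch with librosa, with illustrative rather than authoritative parameter values:

```python
import numpy as np
import librosa

def harmonic_cqt(y, sr, fmin=32.7, harmonics=(0.5, 1, 2, 3, 4, 5),
                 bins_per_octave=60, n_octaves=6):
    """Stack CQTs whose fmin is a harmonic multiple of the base fmin,
    so that a 2-D convolution over the stack sees a note's harmonics
    aligned across channels."""
    n_bins = bins_per_octave * n_octaves
    channels = [np.abs(librosa.cqt(y, sr=sr, fmin=fmin * h, n_bins=n_bins,
                                   bins_per_octave=bins_per_octave))
                for h in harmonics]
    return np.stack(channels)  # shape: (n_harmonics, n_bins, n_frames)
```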
So, thank you for bearing with me even though we started a little late, and I went a few minutes over. I appreciate your time and I appreciate you coming to listen. If you have any further questions, there's a lot of information on my website, and I'm also going to be around until Friday; I'm more than happy to chat and get your feedback, and if you disagree with any of this strongly, then come and tell me. But yeah, thank you for your time. Does anyone have one burning question? Otherwise we can just chat offline and let everybody get back to their day-to-day. Okay, let's take it offline.

[Question from the audience] Which system, the urban sound one? Yes, right. Well, yes, in the sense that smart cities basically means cities instrumented with sensors in order to make data-driven, efficient decisions about how to run the system. You can measure energy usage in buildings, air pollution, traffic flow, and in our case what we measure is noise and sound. So this is a smart cities project because we're using sensors to try to make smarter decisions about how to fight noise in the city, and it's an Internet of Things project because we're using sensors that communicate over Wi-Fi to transmit information, with small on-board computing. I think that's basically what makes it a smart city project: rather than trying to guess what's happening in terms of noise, we want to actually quantify it by deploying a large number of sensors across the city. Right now we only have 50, the pilot will have a hundred, but if the pilot proves successful, the idea of course would be to expand to thousands of sensors all over the city.

[Question from the audience] What people are saying? Okay, so your question is what happens with conversations. Yeah, that's a great question. Privacy is a huge issue, and it's something we've been dealing with from day one. We don't want to be an extension of the CIA, that's definitely not our goal, and it's a big concern. We actually had to go through a lot of legal hoops, because in New York, if you're recording something and you're not physically there when the recording happens, that falls under an eavesdropping law, and you can actually go to jail for doing that. So everywhere we have a sensor, there's a big sign saying we are recording sound in this location. Our sensors don't record continuously; they only record 10-second snippets spaced randomly. And we hired a third-party consulting company to stand under the sensors and speak at different levels, and then we took the recordings and had people listen to see whether the conversations could actually be understood. Since the sensors are mounted at around the first story of buildings, so three or four meters up, sometimes five or six, the result was that unless you're shouting literally under the sensor, the content of the conversations is not intelligible. That's already a pretty strong safeguard for people's privacy, because if you're shouting your social security number, other people are going to hear it anyway; it's probably not a good idea to do. So we know we can't really capture conversations, and we're not developing any technology to transcribe speech.
It's not part of our agenda. And on top of all of this, we're only collecting audio right now to build a large dataset so we can keep training and improving our machine learning models. We're only going to be recording for roughly a year; after that we turn the audio switch off. By then we'll have enough audio data to build machine learning algorithms that can run in real time on the sensor, so the sensors only stream the results of the analysis back to home base. That way we completely resolve the privacy concern, because the sensor is never recording audio and never transmitting audio: the signal goes straight into an analysis pipeline running on the Raspberry Pi, and only, say, the posterior likelihoods of the presence of different sounds are transmitted to our servers. As a nice side effect, that also dramatically reduces the bandwidth requirements, because raw audio signals are data-heavy, and if you switch from transmitting raw data to transmitting the results of analyzing the data, everything drops. And then hopefully in the future we can even use solar-powered sensors transmitting over cellular channels, as opposed to the sensors plugged into the wall and using Wi-Fi that we have at the moment. All right, thanks.