Thank you. Thank you, and welcome again to this session. Thank you for making space in your lunch schedule; for the locals it should be fairly early, but the others may be starting to feel hungry.

My name is Gabriele Bunker. I work at MathWorks, the makers of the MATLAB programming language; I've been there for almost 12 years, and I have a worldwide role as product manager for our DSP and audio processing products. This session is going to be about exactly that: the application of deep learning techniques to signal processing and audio problems — audio, speech, and acoustics.

How many of you have some experience of signal processing, or something to do with audio? A rough show of hands. Okay, not many. But if you have, and if you've been in my position over the past years, you must have seen a big trend where traditional techniques have been replaced by new techniques based on machine learning and deep learning. It starts from the most popular examples, like voice assistants, and goes to things we tend to see less: mood classification systems in call centers, speaker identity verification in phone banking systems, and, for example, microphones used to complement the action of vision and radar systems in self-driving cars, and a lot more.

If you look at software for professional audio — the typical tools used to produce content for TV and cinema — you find that even that community, which used to be deeply anchored in traditional signal processing methods, not only has been transitioning to new methods based on deep learning, but also makes a big deal of it in its technical marketing content. They make statements calling out deep learning techniques to make themselves visible and to
differentiate a bit, because that's also a market, and it's very competitive. Here are some examples. Audionamix, on the top left, have software that easily allows you to extract voice from premixed content; they're based in France. Accusonus, in Greece, are another example: they build audio restoration software to take away noise or reverberation, again from this kind of content. And this is iZotope — a much bigger company, though still not a huge one — and they have several products that do tough things, like taking away the rustle noise of microphones like these from recordings.

These are all things that were not possible not long ago. Even five or ten years ago, if they were possible at all, the quality was very low, and these new techniques have completely revolutionized the way these things work.

So there's a fundamental question I'd like us to answer during this session, and it's something like this: what does it take to develop an effective, real-world machine learning system for audio and speech applications? And when I say real world, I mean something that works for real — something that you can sell.

I want to do it using an example that's close to our daily life, because even the people who are more traditional in the way they use electronics have started to use voice to communicate — maybe with their phone, maybe with other devices. And I want to be specific: I want to use the example of trigger word detection, which is a pretty important component in those systems. Let's agree on what that is.
I just recorded myself here. Trigger word detectors are those algorithms that allow phones — every voice-enabled device — to continuously listen to everything that happens around them, 24/7, and to wake the system up when they hit a keyword or key phrase. I like to think of them as the embedded gateway to your cloud-based voice assistant, because they have to run on the device: without that, your phone could not afford to listen continuously and stream audio to the cloud. That also means that everybody who wants to develop a voice interface needs one. Well, you can leverage Alexa if you build your own device, but this part you need to implement: either you buy it — and they tend to be very expensive — or you develop it in house.

And what better way to investigate how something works, and how difficult it is, than to try to make it? What you're seeing right now is a prototype of a trigger word detector, built entirely in MATLAB, trying to wake up when it hears the word "yes". I'd like my system to wake up whenever someone says "yes" — and obviously not when they speak any other random word that's not "yes". This is now working as a prototype: it's an audio plug-in running inside a very standard audio recording and editing application. So what I'd like to do during this session is show you how you put together something like that from the ground up.

And if you ask that question about this application — what does it take? — what could the answer be? Rest assured that if you ask not me but the majority of people involved with deep learning, you'll get this as an answer: a deep neural network.
That's the first thing you might get. You might even look it up and get this kind of response back, which perhaps makes sense to some of you — show of hands: who is able to take some useful information from this second type of answer? Well, that's good; if you have some experience it makes some sense, otherwise it's just blurb. And this is fair — it's definitely not wrong.

But what I'd like you to take away in the next half hour is that there's a lot more to it. If you asked me, I would mention at least these: a lot of data, a good dose of signal processing expertise, and tools built around speech, audio, and the application. You might wonder why: if it's the neural network doing the work, why do you need those other things? Well, let's find out, and at the end of the talk let's re-evaluate what's more important in the balance, what counts, and where you can really make the difference in these applications.

This is really important to me, because where the importance lies has an influence on where the investment goes and where the roles go — should you hire a signal processing person or a computer scientist? — and also, a lot of people complain that things are changing: "my work is perhaps losing relevance." So let's find out.

This is taken from an independent piece of research that I came across a few months ago. It describes the amount of investment, in terms of time and hours, that goes into developing the actual models — the networks: finding possible architectures, training, and so forth — versus working on the data used to train those networks. If you look at research, somebody doing a PhD, this is the picture: very little effort on data — everybody uses the same datasets — and it all goes into exploring the architecture of the system. But if you go to industry — and I'm not just referring to audio and speech here, but in general to people developing products sold as "powered by AI" with
deep learning inside — you will find that a lot of the investment is driven by data, and that's a requirement for shipping something that works.

What does that mean? Let me represent it using another picture that we tend to use. If you think about the development of a deep learning system as a workflow, there's definitely a stage in that workflow where you need to design the network and train it. But there's a lot happening before and after. The first thing to think about is: after that, what do I do with it? I need to implement it somewhere — for the kind of problem we have in hand here, I need to make it run on a phone, for example. But then, how do I train it? There's a lot going on inside. To schematize: there's a first stage where the data used to train the network needs to be processed in some way to make it digestible, understandable by the network, and there's a lot to know about collecting the data, annotating it, augmenting it, and so forth. I'll try to cover, here and there, the various stages of this process.

So this is the agenda. I'll start by taking the elephant in the room out of the equation, and discuss what you need to do about designing the network — choosing which network to use, and how to go about it. Then I'll talk a bit about the data you need to start with, and in particular annotation, or labeling. I'll talk about synthesizing, or augmenting, data, and about extracting features from the data. And then we'll wrap up.

Right, so let's start from the basics. What's the idea behind training a network?
Most of you might know, but let's recap. A neural network is a kind of deep learning system — a system that, just like any type of machine learning, learns from data and produces an answer from data. It's fed with some inputs and some expected outputs, and after adjusting its behavior based on the accuracy, or not, of its responses, it's optimized in order to produce a trained model that, once new data is available, predicts the right kind of output from data it hasn't seen before. The first phase we call training; the second we might call prediction, or inference — they're pretty much equivalent terms.

When you think about networks, one way to think about them, especially these days, is in terms of layers. I'll use very few snippets of code during the presentation — MATLAB code in this particular case. Here you can find the network I used in the system you saw. Don't lose time reading the details, but take away that it's a vector of layers. If you know which layers you need, and you know how to call them in MATLAB, you can go ahead, look them up, and form a vector.

Why layers? It's a way of representing connections of neurons. They were originally biologically inspired, and a convenient way of building networks was to think about stacks of these neurons connected together. By doing that, you end up with a number of parameters that you need to optimize, using training, to get the right answer given an input.
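To get a feel for how quickly those parameters add up, here is a back-of-the-envelope sketch in Python. The layer sizes are hypothetical (the talk only mentions a couple of LSTM layers with on the order of 150 units each), and the formulas are the standard parameter counts for an LSTM layer — four gates, each with input weights, recurrent weights, and a bias — and for a fully connected layer:

```python
def lstm_params(n_input, n_hidden):
    # 4 gates, each with input weights (n_hidden x n_input),
    # recurrent weights (n_hidden x n_hidden), and a bias (n_hidden)
    return 4 * n_hidden * (n_input + n_hidden + 1)

def dense_params(n_input, n_output):
    # weights plus one bias per output unit
    return n_output * (n_input + 1)

# hypothetical trigger-word network: 39 features in,
# two stacked LSTM layers of 150 units, 2-class output
total = (lstm_params(39, 150)      # 114000
         + lstm_params(150, 150)   # 180600
         + dense_params(150, 2))   # 302
print(total)  # 294902
```

Even this small two-layer recipe comes out at roughly 300,000 weights to optimize — "not deep" by modern standards, yet far beyond anything you would ever fit by hand.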
These days, layers might appear in the literature in different ways, because the math behind them evolved, and the way you optimize them evolved. There are various ways of representing the same things: this slide shows a typical way of representing convolutional neural networks, for example, and this other one is a common way of representing LSTM — long short-term memory — layers, which are the ones used in the example we're looking at. These belong to the larger class of recurrent neural networks.

So let's get back to our example: how do I choose what kind of layers to use? Chances are, if you're developing a product, that's a problem that's been discussed and examined in the literature. This is just a snapshot of examples taken from Google — I'm not endorsing any of these in particular, it's just to give you the idea. If you're after an example, especially in this area and in computer vision, the literature abounds with examples of networks that could work well. The general consensus is: grab the first one that catches your eye, provided you're able to implement it, and then you've got margin to improve it in every respect. What you can see here, for speech applications, is a lot of mentions of LSTM layers — between three and four layers — the kind of thing we were using, the kind of thing we had there. Obviously, diving into the language right away when you're new to the problem can be a bit nasty: you don't know what those functions are.
You don't know how to find them. So there are other approaches: there are drag-and-drop apps that may be useful at entry level, where you just look at a catalog of the layers available in the framework, drag and drop them onto a canvas, connect them, and have your model.

Similarly, once you've designed the network, it's important to see how many parameters there are that you'll need to optimize when you do training — how complex your model is. The more complex the model, the more data you will need, and you see that in this case. People new to deep learning tend to see that couple of layers of LSTM and think: come on, this is not deep — back in the '80s when I first started with neural networks we had networks with two internal layers, one in this case. But look at the thousands of weights that you're bound to have to optimize here.

So, you have your network. Optimizing it, these days, definitely does not mean writing out the equations for gradient descent and finding the weights yourself; it's about using a framework, whatever it is. Any framework will give you a way to specify the network, and to specify some parameters you want to use to adjust the way the network trains — the way it optimizes the values of the weights. Once you have the network design and the training options, you go ahead and call a command like this one, trainNetwork, which takes the architecture and the specifications — inputs and expected outputs — and goes ahead, probably using some hardware like a GPU, and optimizes the network. And you'll see something like this: a common type of representation of the evolution of the optimization, showing the accuracy not over time but over training cycles. You're definitely happy when those lines go up and to the right, because it means you're coming close to one hundred percent accuracy — that the network predicts the right response for
every input. At the end, you've got your trained network. Look at this example: it's telling us that, using a single GPU for training, it took five minutes to train the system, and we got an accuracy of about 90 percent — nine times out of ten, the network gave the right answer for this problem. And this is exactly the plot you would see if you trained the network that I'm presenting here.

Right — validation accuracy. Let me use this as an important point to segue. There are two lines here: a blue one and a dashed one. They're telling us two different bits of information. The blue one tells us the accuracy on the data we're actually using to optimize the weights of the network — data that actually goes into the computations. The other is a bit of data that does not go into the computations, but is always kept handy as a means of checking: data that's not used to optimize the weights, but is used to check how the weights are doing. That's data the network has never seen before — it's not used for training — and so it tells us, usually, how the network behaves on a real problem. Most likely, that validation data will be very close to the data the final system will be seeing: realistic data that takes into account the defects of whatever microphone we'll be using, the acoustic response of the room. That's the data that tells us how our network is behaving on the final problem we're giving it. The training set is something else. So let's use this distinction.
Let's keep it in mind, because it will accompany us through quite a few parts of this talk, and let's move on and use it as a segue to think about how we go about annotating data. We talk about data and labels, but data isn't born with labels: either you take labeled data from somebody else, or you need to record data and put labels on it. And labeling data, to me, has a different meaning these days for training data versus validation data, so I want to give you some bits of information on how to think about it.

Back in the day — if you're between 30 and 40 now, and you were in school in your 20s — if you studied any machine learning in school, you were told the following. You've got this much data; that's all you've got, and you have to train a machine learning system with it. So make sure you chop your data into three parts. The largest part you use for training — as we said, to actually optimize the weights of the network. The second part you use for validation — the dashed line we've just seen. And then there's another part, used for testing. Why another part, if validation was already telling us how the network is behaving? Well, because if the network is not behaving well, we're likely to change something — some options in the training, for example, or some other parameters. So although that validation data doesn't go directly into the optimization, we're using it to inform our choices: we're likely to adjust parameters so the network behaves well on that data. Once we're happy, we reach the stage where we need to verify the network on data that really hasn't had any impact on how it behaves — and that's the role of the test data. The first two are used during training, the last one afterwards. At the time: 60/20/20, or 70/15/15.
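As a toy illustration of that textbook recipe — the fractions are the classic ones just mentioned, not a recommendation — a split might look like this in Python:

```python
def split_dataset(items, fractions=(0.6, 0.2, 0.2)):
    """Chop a dataset into train/validation/test parts by fraction."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    n = len(items)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return (items[:n_train],                      # optimizes the weights
            items[n_train:n_train + n_val],       # steers choices during training
            items[n_train + n_val:])              # touched only at the very end

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

In practice you would shuffle (and often stratify by label) before slicing; the point here is only the three roles the parts play.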
This was the kind of ballpark. But this is largely a textbook recipe, and it worked well when dataset sizes were measured in kilobytes and megabytes. Now, with modern systems — so many weights, and so much data needed to train them — the picture is changing. Data is exploding in amount, but if you shrink it down to the same proportions, you find that the training part will be over 90 percent — typically 98 percent, or something close, 95 — and just a tiny bit of the data is left for validation and test. Hold on: I said tiny in proportion. It will probably be a lot more than you used to have in your old machine learning problems, but in comparison with the whole budget of data, it will be tiny.

What does that mean? The kinds of things you can actively do on your data, in terms of investment, vary a lot. It's very difficult to intervene on the training data, because it's huge — where are the resources to go and correct the training data? But you have a lot more margin to work on and correct the validation data. So let's start from that, and let me give you some examples of what you can do.

What's a good type of validation data, with labels, for our problem? Remember, our problem is to wake up whenever there's the word "yes". This is an example — let me play it first. It's a recording: "First you said yes, then no, then you said yes, then no, then yes, then no — driving me crazy." I like this signal, because there is a voice, there is speech, there's a recording — but there's a lot more: there's some washing machine noise in the background, there's a little reverberation. Let's assume that this is a realistic recorded signal; it's something good.
It's representative of the actual working environment of my future product, so I like that. The other thing I like is the mask, let's call it, that highlights only those areas of the recording where I have the word "yes". I feed that red region to the network as a label, because I need the network to wake up whenever there's a transition from zero to one. Let's listen just to those portions, to make sure this is well labeled: "Yes. Yes. Yes." Okay — no extra, no less; it sounds about right. So how do we achieve this?

This, to me, is a good validation data sample, because it's a realistic recording and it's accurately labeled. The realistic recording is okay: you are developing the system, you might have a prototype, you know what microphones you're using — you can take it and go record where you need to. Raw data is easy to record. What about labeling?

There are various approaches. How to label, in general: use an intelligent system trained to carry out a similar task with proven accuracy. You are developing a system, but that doesn't mean a system that does the task better doesn't already exist. Maybe it's a cloud system — or maybe it's a human. That's the first choice, and that's still how many people label data: the people who sell you data most of the time have warehouses of workers, remote workers, who label the data. How do they work? They use apps made for labeling data. This is just an example of an app that recently shipped in MATLAB; it's called Audio Labeler. "Testing a prototype of a wake-up word detector, so it's going to wake up when it hears the word yes." Whenever you encounter a "yes":
Stop, go back, zoom in, and make sure you label it as the thing you're looking for — it can be "yes", it can be anything. You select only the area where the "yes" is, you validate it, you click down into the label area, type whatever you want that corresponds to it, adjust the section, and then you move on. That's the first way to do it. It tends to be onerous. It's important to have it, because sometimes you need to go in and correct the labels, but you can't really bet on delivering a fully labeled dataset with a manual approach.

So the second option is to use an automated system — an intelligent system, a pre-trained machine learning model. What does that look like? For speech, for example, there are various services available out there: Google Speech, used in this case from that same app, and similar ones from Amazon, IBM, you name it. The ability to just call out to a system like that is not exactly labeling data according to your needs — in this case it labels all the words — but it gives you the time intervals for all the words. So you can label your whole collection of files just by sending them away and getting them back; you can see that every word is highlighted in isolation. And because you have some programming experience, when things aren't exactly as you need them, you can post-process by writing some code. In this case, those labels — the values and the time intervals — can be exported into MATLAB, so you have them represented as a table. There you go: you can select only the rows where there's "yes", keeping only the intervals where there's "yes", and wherever you encounter a
"yes", start from a mask of zeros and put a one. Pretty easily, using a combination of intelligent services and a language that you can control, you can get to a situation like this: "Yes. Yes. Yes. Yes. Yes..." I didn't completely plan that this would be so funny once played back in public, but the idea is, as we understood: playing back only the selected areas is just a good way to check that this is well labeled. Okay, so these are the general rules — this is a good way to go about labeling validation data.

What about training? Well, for training, normally you tend to accumulate datasets — multiple datasets — that are already labeled. So the idea is usually still to use a programmatic approach: use the information that you have, but convert it into the form that you need. We have limited bandwidth in terms of time, so I won't go too deep, but I'll use training data as another segue. Because the problem with training data is not only that the labels need to be accurate: training data inherently won't be specific to your problem, even in the way it's been recorded. So there's a new problem with training, which is that the data is different from the data your system will digest in the end. Let me go directly to the slide: "First you said yes and no, then you said yes..." Recorded at a desk, with a little mic, a headset — or who knows where, with what
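The interval-to-mask conversion just described — take the word/time table returned by a speech-to-text service, keep only the "yes" rows, and paint ones over a vector of zeros — can be sketched in a few lines of Python. The word list and the label rate below are made up for illustration:

```python
def intervals_to_mask(intervals, n_frames, rate):
    """Turn labeled time intervals (seconds) into a per-frame 0/1 mask."""
    mask = [0] * n_frames
    for start, stop in intervals:
        for i in range(int(start * rate), min(int(stop * rate), n_frames)):
            mask[i] = 1
    return mask

# hypothetical speech-to-text output: (word, start_sec, stop_sec)
words = [("testing", 0.5, 0.75), ("yes", 1.25, 1.5),
         ("no", 2.0, 2.25), ("yes", 3.0, 3.5)]
rate = 100          # label frames per second (illustrative)
yes_only = [(a, b) for w, a, b in words if w == "yes"]
mask = intervals_to_mask(yes_only, 400, rate)
print(sum(mask))    # 75 frames flagged as "yes"
```

In practice the mask would be at the audio sample rate, or at the network's feature frame rate; the principle is the same.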
microphones, at what point in time, by what person, in what mood — you don't know. You have this data, and you need to make some use of it. Let's assume that on the labels you're in a good place: it's labeled data, and you're able to convert the labels into the format you need — like, in this case, a mask over time. How do you make your training data a little bit more similar to your validation data? When you train your network, you can't just randomly expect that, magically, your network behaves well on your validation data — but you can help it. The only way to do it is to take your data and try to bring it closer to the validation data: make it more real. This is an example of "more realistic" — the kind of thing we had before: reverberation, noise in the background, all the variations that you want.

How do you do that? Some people call this synthesis; others call it augmentation. "Synthesis" supposes you don't have anything, and you come up with — you invent — stuff. "Augmentation" is more suitable for audio, speech, and acoustics, because normally you want to start with something — you don't just synthesize data at random — and then you have ways to augment it.

So let's take a look at the possible things you can do with this recorded signal: very muffled, very fast speech, but also very dry in terms of reverberation, and with a lot of electrical noise. One of the things you can do, for example, is add reverberation. How do you add reverberation? Well, you can measure the behavior of acoustical systems — their impulse responses — and combine them with your signal; you can have many. Another: you can record noise
that's relevant to your situations of use. And you may want to avoid applying these effects to background noise that's specific to that original recording — reverberating the electrical noise, for example, would be a pretty nasty thing to do, because it wouldn't happen in reality. For that you can use cleanup algorithms: ones that you write yourself, ones available in the MATLAB toolboxes or any other system, or external tools. MATLAB these days, for example, can easily consume audio plug-ins, and one of those companies from the beginning that builds audio restoration tools — I was able to easily use their plug-ins to clean up the audio in preparation for processing like reverberation.

Then there are things done specifically for speech. For example — "yes" — this is the original "yes", and you can do time stretching: same pitch, taking more or less time to play. It's a bit tricky to implement, but it's a common, standard effect, made to account for the fact that people speak faster or slower, starting from a single sample. Similar thing: pitch shifting — "yes" — accounting for people being more relaxed, or more depressed, or happier, or in a hurry, or the opposite. Things like time stretching don't change the sound as such, but they allow you to make the network more robust to different durations, based on how it scans the signal.

It's clear that there's some domain expertise required. Things like pitch shifting you can do in different ways: there's a brutal way of doing pitch shifting and a non-brutal way. This is the original "yes" again. "Yes" — this is a pretty good way. "Yes" — this is a bad way.
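Two of the augmentations mentioned — adding background noise at a controlled signal-to-noise ratio, and adding reverberation by convolving with a measured impulse response — are easy to sketch. This is a toy NumPy version; the "speech" and the impulse response here are synthetic stand-ins, not real recordings:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix noise into a clean signal at a target signal-to-noise ratio (dB)."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

def add_reverb(dry, impulse_response):
    """Apply a measured room response by convolution, keeping the length."""
    return np.convolve(dry, impulse_response)[:len(dry)]

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(8000) / 8000)   # stand-in for speech
noisy = add_noise(speech, rng.standard_normal(8000), snr_db=10)
ir = np.exp(-np.arange(800) / 100) * rng.standard_normal(800)  # toy impulse response
wet = add_reverb(speech, ir)
```

Time stretching and formant-preserving pitch shifting are the tricky ones the talk warns about: naive resampling changes pitch and duration together, which is exactly the "brutal way".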
So, the last two: "yes", "yes" — same pitch, strictly the same pitch. But the first one, done the right way, is the same person talking with a lower pitch and the same shape of the vocal tract; in the other case, it's the equivalent of a giant replica of the same person with a big vocal tract. That second one is trivial; the first is not. So you need to be in the know to do things the right way, or you're just messing about and training your network on rubbish.

Okay, so we covered that. Let's take a look at the last topic, which is creating the input for the networks. There's a misconception around deep learning, which is that you have a network and you just need to feed it raw data — because it will be so intelligent and so self-sufficient that it will make sense of your data regardless of how many samples per second it is, how complex, how abstract it is. That's a misconception. What you'd call feeding raw data to a network has a name, and it's called end-to-end learning. It wouldn't be right to say it has nothing to do with deep learning — it has something to do with it — but deep learning is not only that. Historically, deep learning with some types of signals has been able to do end-to-end learning, but it doesn't happen so often with audio and speech. The only example that I have in mind that works well doing end-to-end learning is Google's WaveNet — no more.
That's not what people do. Everybody, regardless of the type of layers the network is made of, will do some kind of pre-processing of the data first. So you have various, very large datasets composed of data samples and labels. For convolutional networks — typically, because they were born in the first place to work with images — you transform the audio, the signal, into some form of spectrum: frequency over time, computed on slices with a frequency transform. This is also called the short-time Fourier transform, and you feed that to the network. That's one possible way of pre-processing the data.

The other way is specific features. There are some that are very useful for speech, or for other domains, and they basically take a measurement, an estimation, of the data that we know makes sense. Typically there's an invariance problem behind extracting features, in that you want to get the same features when you apply modifications to the data that don't change its substance. There's a lot of literature on those, so that's the way to go — and again, deep learning doesn't mean end-to-end learning.

For the few of you with some signal processing experience this will make sense; otherwise I'll go quickly. I'll just give you an example of typical things that are used. In this particular example, they represent different stages of one single chain, and you can feed any of these stages to the network. Starting from the initial signal over time, the first thing you do is buffer, or window, it: you slice it, then you do the same thing on all the slices and stack them together. The first thing you can get is a spectrogram, or simply a short-time Fourier transform.
Those are FFTs stacked one next to the other. Another, more refined step you can take is to weight that FFT. The FFT is an artificial construction: it doesn't reflect how the human ear hears, and how voices have evolved to be well perceived by the ear. So there are perceptually accurate versions of the spectrogram that take various names — for example, the mel spectrogram; these are perceptually scaled transforms of that representation. And from this, another very useful thing is computed: the famous MFCCs, mel-frequency cepstral coefficients. I just meant to put this slide up so that some keywords might resonate with some of you.

Right. Now remember the slide at the beginning — we were talking about the two phases: training, and inference, or prediction. How do things look when you put them all together? How do you do training? You will have a collection of recordings with the right labels. And we've just seen that we typically use some kind of feature extraction to transform, or reduce, the amount of information in some way — which, to talk bluntly, has the effect of allowing you to use less heavy networks, so that they train in a reasonable amount of time and with reasonable resources. Even Google will advise against using that WaveNet for any kind of practical problem, because the training times are completely out of scale compared to the usage time — it takes a day to train it for one minute of audio; it's ridiculous. Not even to mention embedded systems.

Then you use the features, along with the labels, to train the network: you feed the features over time to the network, and that's how the network gets optimized. And because, as we saw, we almost always also need augmentation, the features will be extracted on the augmented data.
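The slicing-and-stacking pipeline just described — window the signal, FFT each slice, stack the magnitudes, and optionally re-weight the frequency axis perceptually — can be sketched like this in NumPy. The frame and hop sizes are arbitrary illustrative choices:

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Slice the signal into windows, FFT each slice, stack the magnitudes."""
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    frames = np.stack([x[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (num_frames, frame_len//2 + 1)

def hz_to_mel(f_hz):
    """Perceptual mel scale used to re-weight the FFT bins."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (61, 129)
```

A mel spectrogram applies a bank of triangular filters, spaced evenly on the `hz_to_mel` scale, to each column; MFCCs then take a cosine transform of the log of that.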
So what you end up with may be larger or smaller than the original data, depending on what you do, but this is the general workflow for training. There's a lot going on. Things look a lot easier for inference: once the network is trained, you don't need to augment. (I forgot to update the arrows in the diagram; those are meant to be the fixed, trained weights.) The idea is that you just take the input signal, extract the features out of it, feed them to the network, and expect to get the right inference back at the end.

When you put it into code, this looks quite simple. This is sample MATLAB-like code, and there's obviously more code than this, but you can see the bit of code that extracts MFCCs and re-partitions them into a digestible shape, one line of code to load a pre-trained network and have it predict on the incoming signal, and then, in this case, some logic to insert the trigger sound, things like that. As I was saying, there's a lot more to it: there are parameters governing how you accumulate the data, what latency you tolerate, and so on, but I'm not going into those details here.

So by doing that, I think I've covered all these aspects. There's the broader context, which is how we ended up building that kind of application, the demo you saw at the beginning; it was done in this way. We covered pretty much all of this, not everything, but you get a good sampling: data collection, labeling, transformation, feature extraction, and so on. But what about the last slice there? How do you take a bit of code that you used as an investigation, as part of
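That "accumulate the data, what latency" trade-off can be sketched as a small decision loop. This is a hedged illustration in Python, not the demo's MATLAB code: `classify` stands in for the pre-trained network, and the voting window and threshold are hypothetical parameters of the kind the speaker alludes to:

```python
from collections import deque

def run_trigger_detector(frames, classify, vote_len=5, threshold=0.6):
    """Sketch of the inference loop: for each incoming feature frame,
    ask the (pre-trained) network for a score, then smooth the
    per-frame decisions over a short voting window so a single noisy
    frame doesn't fire the trigger. vote_len and threshold embody the
    accumulation-vs-latency trade-off mentioned in the talk."""
    recent = deque(maxlen=vote_len)
    detections = []
    for i, frame in enumerate(frames):
        recent.append(classify(frame))
        # Fire only once most recent frames agree the word is present.
        if len(recent) == vote_len and sum(recent) / vote_len >= threshold:
            detections.append(i)
            recent.clear()  # restart voting after a detection
    return detections

# Toy stand-in "network": score is 1.0 whenever frame energy is high.
frames = [0.0] * 10 + [1.0] * 4 + [0.0] * 10
hits = run_trigger_detector(frames, classify=lambda f: 1.0 if f > 0.5 else 0.0)
```

A longer voting window means fewer false triggers but more latency before the word is declared, which is exactly the tuning the speaker says the real system requires.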
research and development, and change it into a product, or at least a prototype that can be tested in the real world? These are just a few slides offered as a curiosity about the demo. If you started with that piece of code, imagine that it, and everything that sits around it, is saved to a file, a MATLAB file in this case. This is one of the many ways MATLAB can transform a piece of MATLAB code into C/C++; there are other avenues as well. In this particular case it allowed us to take that code and transform it into a plug-in, so what was running underneath that plug-in was compiled C++, including the whole network. This was the generated C++ source in a JUCE project, in case you're familiar with that, and that is basically what was running in there. It ran pretty well in real time, without too many issues, and these networks can also be scaled down very easily: instead of having 150 units per layer, cut it to 50, and it worked just about fine.

Okay, so what's the answer to that question, in summarized form? What do I need to develop such a system? Obviously a simple and proven deep network recipe is one part of the answer; that covers the deep network itself.
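The "150 units down to 50" remark is worth a quick back-of-the-envelope check. The layer shapes below are hypothetical, purely for illustration; the point is that fully connected (and recurrent) layers scale roughly quadratically with the hidden size, so trimming units per layer shrinks the model fast:

```python
def dense_params(layer_sizes):
    """Rough parameter count for a stack of fully connected layers:
    each layer contributes in*out weights plus out biases. Recurrent
    layers scale similarly (quadratically in the hidden size)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical shapes just to illustrate: 39 MFCC inputs, two hidden
# layers, and a handful of output classes.
big   = dense_params([39, 150, 150, 4])   # 150 units per hidden layer
small = dense_params([39, 50, 50, 4])     # scaled down to 50
ratio = big / small                       # roughly a 6x reduction
```

A ~6x smaller weight set is significant when the compiled C++ network has to run in real time inside an audio plug-in.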
That's fine, but I think the bulk of the answer, to me, is that you need a lot of awareness and expertise about how things work. That means all the competencies, the skills, the tools, the expertise that signal processing engineers, in this case, have accumulated over the years are still valuable, and they're still valuable even for training deep networks. This is what happens in companies that do these things for real, not just for research. Everything you find in your phone, you don't even know who's developing it, because your phone manufacturer doesn't want you to know, but what they will be doing is exactly this kind of thing. And the bottom line, to me, is that a deep learning system can only be as good as the data you train it on. There's no way to get away from that. And that should be the end for me.