So, good morning and a warm welcome to our webinar. It's now a minute or two past nine o'clock Central European Time, and it's time for us to kick off. My name is Lars Dahlberg and I'm acting as the moderator and administrator of this webinar, which has the title "How to Build a Keyword Spotting AI Application and Deploy It on an Edge Device". So this is going to be an interesting webinar for you all.

This webinar is arranged by Imagimob, and before I introduce today's speakers I just want to inform you that you can ask questions whenever you want during the webinar. Just check out the GoToWebinar control panel and the question mark icon; there you can type in your questions at any time. The rest of the audience will not see your questions, and we will come back and answer them at the end of the webinar. So please take the opportunity, we really do have the experts here with us today, and the webinar will run for 60 minutes.

With that, it's time for me to introduce today's speakers, because we really have a lot to deliver to you today. With us as speakers we have Johan Malm, head of product at Imagimob, and Alexander Samuelsson, CTO and co-founder of Imagimob. So welcome, guys.

Yeah, there you are. Thank you very much. Thank you very much, Lars. It's great to be here. We're broadcasting from Stockholm today. This cold morning when I woke up it was minus 13.7 degrees outside, so we're trying to warm up here.

I'm Johan Malm, head of product at Imagimob, and ever since I started working for Imagimob I've been deeply involved in research, development and strategic product decisions. I have worked with big customers in many different projects, and we have delivered advanced solutions to complex problems. One such project was the in-ear headphones that I was involved in from beginning to end, the first gesture-controlled earphones in the world.

Exactly, it was a really cool product. And it's a great honor to work at Imagimob with all our talented engineers, putting the knowledge from all these projects into our main product, which is Imagimob AI.

I can only agree, Johan. It's an honor to work with such talented people. My name is Alexander Samuelsson. I am CTO and one of the co-founders of Imagimob. At Imagimob we have been building Edge AI applications since 2015, so we were really early on, and we've been part of shaping and growing this very promising market. Through my time at Imagimob I've been part of running, directly or indirectly, 30 Edge AI applications from start to proof of concept, and in some cases deploying these solutions into actual products on the market. Johan gave a great example with the gesture-controlled earphones. Another application we've built and released into the market is a fall detection system inside a watch for the elderly, another great example of what this technology can do: a watch that protects the elderly, warns if someone falls over, and has six months of battery life thanks to Edge AI technology from Imagimob.

Already in 2015 we had the belief that AI will move more and more out of the cloud and into the devices where the data is actually generated, because this gives a lot of benefits, and the application that we will show and build together today is a great example of this. Right, Johan?
So let me switch over here. Why should one do audio classification on the edge, Johan?

That's a great question, and I actually got it from my daughter yesterday evening: why do you do this? Today we will build an audio classifier that can run on the edge, which means on a small processor, a small MCU, a microcontroller. What we will do is train it on the words "up" and "down" so that it recognizes those two words. For instance, IKEA makes a light bulb with a small processor in it, and we could actually run this application on that MCU and control the light by saying "up" and "down". So that's one use case we could have here. And the benefits? Of course, you wouldn't like to have this bulb connected to the cloud all the time, listening to what we say at home. So privacy is one thing. Reliability is another: many parts of our world, in fact most parts of our world, are not connected to the cloud or to the internet, so reliability and autonomy are very important. You can build an application that always works. It might be connected to a machine that needs to be stopped when something happens, and then it will...

Exactly, always work, always be up.

Yeah, and that's also about real time: this machine should be able to listen to sounds and react in real time, maybe shut off if something dangerous happens, or just react really quickly. The fourth thing is cost, and I think many companies are starting to realize this now: all that data, and transmitting all that data to the cloud, is actually a significant cost for the company, and maybe this wasn't so obvious in the beginning.

Yes, imagine having a million of these devices out there, always streaming all of their audio data into the cloud. That would carry huge costs.

Exactly, because we're talking about sensors now, and sensors generate a lot of data. Audio is in the kilohertz range, thousands of samples per second, and other sensors can go even higher, into the megahertz range.

So we're going to show you today how to build this application using our software service, Imagimob AI, which is an end-to-end service: we will cover everything from data collection to deploying the final application on the device. The opportunity we give you through the software is to be able to build these applications, which are actually really hard to build and deploy on the edge. A lot of organizations and companies are building audio classification applications and putting them in the cloud, and have been doing so for several years, but taking these applications and deploying them on an MCU is a completely different level. Until now you have needed a large cross-functional team of data scientists, machine learning engineers, firmware engineers and even compiler experts to make this work, so up until now it's only the really big players that have been doing this. We're talking Google and Amazon, who have this in their Alexa and Google audio devices, and Facebook and the like. But with Imagimob AI you can do this with smaller teams, even single developers, and after this webinar you will have the opportunity to do this yourself with our software.

So what exactly is it that we're going to do today, Johan?

Right, so we have taken one of these SensorTile.box kits from STMicroelectronics, and we actually have it here.
It comes with this blue cover, but we stripped that off, so it looks like this. It has a lot of different sensors on it: accelerometer, gyro, pressure sensor, but also a microphone, and it can run standalone. It has a battery, but we will connect it to the computer today. And we will use a data set of sound data, an open data set from Google, which we will train on, and we will also use some of our own data that was actually recorded using this device.

Yeah, and these are the steps, Alex.

Yeah, so the steps we will cover today: first we will show you how to import data, in this case this audio data set, into Imagimob AI, and, very importantly, how to annotate it and make sure it's well annotated, which is essential for building good AI models. Next we're going to process this data, and show you how to do that so that it's easier to visualize, both for us and for the AI models to learn from. Then we're going to train it through our training service, and then we're going to evaluate the models in a very detailed way in order to pick the best performing ones. Then, still in Imagimob AI, we're going to optimize and translate the trained model that we select into C code. After that we're going to move out of Imagimob AI into STM32Cube, which is a tool from ST, and we're going to show you how to integrate this C code very quickly into the firmware of this device. And finally we're going to show you a live demo of this application running in real time, detecting our "up"s and "down"s here.

At the end there will be some time for you to ask us questions, and we will give you some answers. And please stay until the end, because there are some goodies for you: at the end of the webinar you will get a link where you can sign up to get the software for free to try it out, and ten lucky ones of you will also win this device. So you will all be able to replicate what we're showing today, and some of you will even get the hardware so you can test it live by yourselves. So, should we jump into it, Johan?
Yeah, so what you see here is Imagimob AI.

You're connected here, right? Good.

Yeah, so this is Imagimob AI; we call it the studio, Imagimob Studio, and this is where you do these projects from beginning to end. The first thing I will show you is that we have a keyword spotter starter project. This is a relatively new feature that we launched in the last release: starter projects, meaning projects intended to make it easier for customers to get started. Each of these projects has essentially everything you need to get to a ready model: some example data, pre-trained models, pre-processors and so on. Here we have a keyword spotter project, and if I press OK here it will download the database and the pre-trained models.

And we've already done that.

Yeah, exactly. So here I have started up a new project, and we will follow the process through these tabs, from the top one to the bottom one. The first is the data tab, where we will import some data into the project. Before that, I will run this auto-label script, because in this database we have wave files recorded with one word per file, and what we would like to do is connect a few instances of a word together on a timeline, so that when we sweep over this data with a sliding window we see not only the word in its full content; we also slide over it and see a fourth of it, half of it, and so on. This script does exactly that: it connects the words together, and it also removes some bad recordings, outliers and things like that. This is a script that we have created, and you will get access to it, the source code, exactly.

So I did this for the word "down" here. The end result looks like this: I can just double-click the data file and play through the data. What you see is that this script puts coarse labels around the word "down", but what I can do now in this tool is fine-tune these labels and then press Ctrl+S to save the file. This is one of the concepts in Imagimob AI: you can open up all of your data, put it on a timeline like this and visualize it, look at annotations, play it back, and edit everything easily until it looks right.

Right, so when I'm done with all my labels, I go back to the tab where I add the data, and here I can add it: it's "up", "down" and some background noise. So now I add the data to the project, and this view is really nice, I think, because you see the status here, so I can immediately see that the data looks good. Here you see a little warning sign, and that means that this data is unlabeled, no label in this file, and that's okay because this is background noise: I have one label for "down", one label for "up", but no label for other sounds. So that's fully correct. This view is a real time saver if you have any inconsistencies in your data, which is very common in audio: you use a data set and some of the data will be 16 kilohertz mono and some might be 48 kilohertz stereo. You would not be able to train an AI model on that, because the input needs to be consistent.
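As a concrete illustration of the kind of format check the studio automates, here is a minimal sketch, assuming canonically laid-out WAV files, of how one might verify that every file is 16 kHz, mono, 16-bit PCM before training. This is not Imagimob's code, just the underlying idea:

```c
/*
 * Minimal sketch (not Imagimob's implementation): read the fmt fields
 * of a canonical WAV file and verify 16 kHz mono 16-bit PCM.
 * Assumes the standard RIFF layout with the fmt chunk at byte 12.
 */
#include <stdio.h>
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p) { return (uint16_t)(p[0] | p[1] << 8); }
static uint32_t read_u32(const uint8_t *p)
{
    return (uint32_t)(p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24);
}

int check_wav(const char *path)
{
    uint8_t hdr[44];
    FILE *f = fopen(path, "rb");
    if (!f || fread(hdr, 1, sizeof hdr, f) != sizeof hdr) {
        if (f) fclose(f);
        return -1;                              /* unreadable file */
    }
    fclose(f);

    uint16_t channels    = read_u16(hdr + 22);  /* 1 = mono, 2 = stereo   */
    uint32_t sample_rate = read_u32(hdr + 24);  /* e.g. 16000 or 48000 Hz */
    uint16_t bits        = read_u16(hdr + 34);  /* bits per sample        */

    if (channels != 1 || sample_rate != 16000 || bits != 16) {
        fprintf(stderr, "%s: %u ch, %u Hz, %u bit -- inconsistent with 16 kHz mono PCM\n",
                path, channels, sample_rate, bits);
        return 1;   /* flag for conversion or exclusion */
    }
    return 0;       /* consistent */
}
```

Running such a check over a whole data set is exactly the kind of tedious, error-prone work the status view in the studio takes off your hands.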
Yeah, and here you immediately get notified of inconsistencies and errors. You would also see if you had put the wrong label somewhere, because it would show up here on the right. We've built this because we have encountered all of this in our own projects and made all of the painful mistakes ourselves.

Yeah, so that you don't have to.

Exactly. So this looks fine; I can see the frequency is 16 kilohertz. And one other thing: here you see that each piece of data goes into one of three sets, the train, validation and test sets, and I can change the target sizes here.

What's the purpose of the different sets?

Right, so the model is trained on the training data, then it's tuned on the validation data, and finally it's tested on unseen data, the test set. That's the fundamentals of machine learning. And what I can do is put individual data files into different sets, like this. So there's a lot you can do to manage your data in this step.

Then, if we move to the next tab, I can again see that I have two classes and one unlabeled, implicit class. This tab gives you an overview of the distribution of all your data. And I have a weight here: if we have an unbalanced data set, which is usually the case when you collect data for certain events like this and you have an unlabeled background class, I can set the sensitivity a bit higher on the classes that have fewer samples. I usually do something like that, because otherwise the AI model might try to optimize for getting most of the data correct while missing the important words that we are actually looking for.
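To make the idea of weighting rare classes concrete, here is a small sketch of one common scheme, inverse-frequency weighting. The frame counts are made up, and the exact weighting scheme inside Imagimob AI is not specified in the webinar; this is just one standard choice:

```c
/*
 * Sketch: inverse-frequency class weights for an unbalanced data set.
 * With far more background frames than "up"/"down" frames, an unweighted
 * loss lets the model score well by mostly predicting background;
 * weighting the rare classes up counteracts that.
 */
#include <stdio.h>

#define NUM_CLASSES 3

int main(void)
{
    /* hypothetical frame counts: unlabeled/background, "up", "down" */
    const double counts[NUM_CLASSES] = { 90000.0, 4000.0, 6000.0 };
    double total = 0.0;
    for (int c = 0; c < NUM_CLASSES; c++) total += counts[c];

    for (int c = 0; c < NUM_CLASSES; c++) {
        /* weight_c = N_total / (K * N_c): rare classes get weights > 1 */
        double w = total / (NUM_CLASSES * counts[c]);
        printf("class %d: weight %.2f\n", c, w);
    }
    return 0;
}
```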
So we've already been through data collection, annotation and managing the data. The next step is processing the data, and this is a step where we put in a lot of effort, because when building AI models for edge devices it's very important. You could train an AI model without processing the data, you could feed in the raw data...

Yeah, and that's actually the era of deep learning: we let the model look at all the data, because what we realized a couple of years ago, or actually more than that now, is that if you build a deep model with many neural network layers, it can figure out a lot of things that we sometimes cannot see. We follow that approach, but we also have to think about getting a small model that can run on an edge device. So I think of this as presenting the data slightly differently, so that it's easier for the model to digest, and this can make a massive difference.

Yeah, with some clever preprocessing, in this case for audio data, you can reduce the size of the model and the inference time, how fast it runs on the device, by a factor of 10 or 50.

Yeah, it wouldn't be possible to feed in the raw audio data and classify it on this device that we're showing you today.

Exactly. What we're looking for is relatively general preprocessors. In this case we have oscillatory data, it's composed of waves, so a good preprocessor in such cases is a Fourier transform, and that's what we have here. We start with a sliding window that collects a fraction of the data, then we filter it, or smooth it a little bit, perform the Fourier transform, and take the norm of it. So up until here we basically have a power spectrum of the data, which could be used as a preprocessor on its own, but what we did in this case was to apply a mel filter to that data. That's a typical preprocessor for audio applications, and it actually tries to mimic how our hearing works: when we try to hear the difference between tones, we tend to have a higher resolution in the lower part of the spectrum than in the higher part. So this is a somewhat logarithmic filter bank.

And this is very important, because it means we remove everything that the AI model doesn't have to care about, reducing the amount of data and focusing on the important things that distinguish words from each other. And don't worry, we will soon show you visualizations so you can see what this actually looks like.

That's actually a really nice thing, because you see down here we can create a track from the preprocessor. What this function does is go through all the data in the database, process it through all these steps, and write it out to file, and I already did that, of course. Then we can take a look at how "down" looks. What you see here now is the raw data on top, and below that the spectrogram that visualizes how we as humans hear this.

Yeah, this is really nice. I've almost become an audio nerd in this project. You see that the words really resemble each other: you have this little diagonal thing here, and then the big chunk at the bottom. What we're looking for in this step is that the words within one class actually look like each other, because they should. And this is something we've learned and proven over and over: if we as humans can see the similarities between instances of the same word, or the same event, the model will be able to figure it out.

Yeah, so if you've reached that point and your data is well annotated, then you're extremely well set up for a successful project.

And here I can show how "up" looks, and it's quite different. That's also a very good thing to look for: that your different classes are easy to separate from each other.

Exactly. Right, so that's the preprocessor. Let's go back. You see that finally we put all these 40 features into a new window composed of 50 such feature vectors, and those will then be fed into the neural network.

So why did you select that window size of 50?

Right, that's a good question. What we're looking for is a window that, in the end, covers your longest word, which in this case is "down"; it's roughly half a second long. That's why I've chosen that size here.

So that's the rule of thumb: capture the longest instance of the longest word with the window.
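For readers who want to see that chain end to end, here is a condensed sketch of such a frontend for a single frame: window, Fourier transform, power spectrum, mel filter bank, log. The 40-band count and the mel idea match the discussion above, but the frame length, frequency range and other parameters are assumptions, and a real implementation would use an FFT library (e.g. CMSIS-DSP on a Cortex-M) rather than the naive DFT used here to keep the sketch self-contained:

```c
/*
 * Sketch of a log-mel frontend for one audio frame:
 * Hann window -> DFT -> power spectrum -> mel filter bank -> log.
 * Parameters are illustrative, not Imagimob's exact settings.
 */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FRAME_LEN 512               /* 32 ms at 16 kHz                    */
#define NUM_BINS  (FRAME_LEN / 2 + 1)
#define NUM_MELS  40                /* features per frame, as in the demo */

static double hz_to_mel(double hz)  { return 2595.0 * log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (pow(10.0, mel / 2595.0) - 1.0); }

/* One frame of samples in, NUM_MELS log-mel energies out. */
void melspec_frame(const float samples[FRAME_LEN], float out[NUM_MELS])
{
    double power[NUM_BINS];

    /* Hann window + naive DFT magnitude-squared (power spectrum).
     * Deliberately unoptimized for clarity. */
    for (int k = 0; k < NUM_BINS; k++) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < FRAME_LEN; n++) {
            double w = 0.5 - 0.5 * cos(2.0 * M_PI * n / (FRAME_LEN - 1));
            double x = w * samples[n];
            re += x * cos(2.0 * M_PI * k * n / FRAME_LEN);
            im -= x * sin(2.0 * M_PI * k * n / FRAME_LEN);
        }
        power[k] = re * re + im * im;
    }

    /* Triangular mel filters, evenly spaced on the mel scale (0-8 kHz).
     * Lower frequencies get more filters -- the logarithmic resolution
     * that mimics human hearing, as discussed above. */
    double mel_lo = hz_to_mel(0.0), mel_hi = hz_to_mel(8000.0);
    for (int m = 0; m < NUM_MELS; m++) {
        double f0 = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m    ) / (NUM_MELS + 1));
        double f1 = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 1) / (NUM_MELS + 1));
        double f2 = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 2) / (NUM_MELS + 1));
        double acc = 0.0;
        for (int k = 0; k < NUM_BINS; k++) {
            double f = k * 16000.0 / FRAME_LEN;
            if (f > f0 && f < f2)   /* triangle peaking at f1 */
                acc += power[k] * (f <= f1 ? (f - f0) / (f1 - f0)
                                           : (f2 - f) / (f2 - f1));
        }
        out[m] = (float)log(acc + 1e-10);   /* log compress, avoid log(0) */
    }
}
/* The model input is then a sliding stack of 50 such frames (50 x 40). */
```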
Okay, so now we've gone through processing the data; it's time to start generating some models, and that's this step. If I press "generate model list", we generate what is usually referred to as AutoML. We're not sure in the beginning exactly which layers are good in this case, what kind of model is best. Experience will tell you what direction to look in, but we make it easy for users to generate a list of models and then train all of them.

So you don't have to be a machine learning expert.

Exactly, you can generate really good models without being one. And if you are, we'll show you soon that you're able to go into all the details. We've put a lot of effort into this AutoML functionality so that you get models that give good accuracy while being well suited for edge devices: small and fast.

Exactly. You can see that the models here have different numbers of parameters, and those parameters are basically the weights in the model that are updated during training, so they adapt to the data. Here is a relatively small one; I could actually get quite high accuracy with this one, and 2,000 parameters will be roughly 12 kilobytes of memory, of flash. So that's something you can check already in this step, to know whether it will fit your target device.

Exactly, because if it doesn't fit the target device, then you don't need to go further.

Exactly. And here we can go in and look at exactly which layers this model has. You can see the input shape up here; it's the output shape of the preprocessor, so the preprocessed data goes in here and flows through the different layers, 1D convolution and max pooling, and finally some dense layers and three output neurons. Those represent the three classes: unlabeled, up or down.

So here the software configured all of this for you.

Yeah. And if we now press "start new training job", this data blob will go off to our cloud training service, with all the models and all the data.

Exactly, so let's log in. Did I log into the right one... maybe we can take a look. Is it the right cloud or the left one? Can you see my jobs?

Yeah, I think so. Okay, there. It helps to log into the right cloud.

Okay, so this is a training job that you ran earlier, Johan?

Yeah, exactly. And here we immediately see some statistics: the accuracy on the different data sets, train, validation and test, but also the F1 score, which is a slightly more advanced accuracy measure from machine learning. This is quite good, actually. I usually pay attention to these pie charts, to check that the number of predicted instances resembles the actual ones, and you can see they look quite alike here.

So we find the right number of ups and downs.

Exactly. And then you can see it more closely in this view, where we normalize each column, like this.

So what does this tell you? What are you looking for in this matrix?

Well, as we see here it's a quite unbalanced data set, so I look at the ups and downs, and it's quite good to see that among the ups, 98 percent were captured correctly, and similarly for the downs. The F1 score is more about the data overall, and 96 is quite good here. You can sort the models based on F1 test score, for instance, and then if you select one of them, model 12 here, which is actually the one we're going to run on the device, we can download the actual trained model, the .h5 file, which is the TensorFlow trained model, and we can download the predictions on the data. And we've already done that in advance.
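As a side note, the per-class numbers in that view can be reproduced from a confusion matrix with a few lines of code. A small sketch with made-up counts (these are not the webinar's actual numbers):

```c
/*
 * Sketch: per-class precision, recall and F1 from a confusion matrix.
 * Rows = actual class, columns = predicted class. Counts are invented
 * for illustration; "up" is set up so that 980/1000 = 98% of the actual
 * ups are captured, echoing the figure mentioned in the webinar.
 */
#include <stdio.h>

#define K 3  /* unlabeled, up, down */

int main(void)
{
    /* cm[actual][predicted] */
    const int cm[K][K] = {
        { 9400,  40,  60 },   /* unlabeled */
        {   15, 980,   5 },   /* up        */
        {   20,  10, 970 },   /* down      */
    };
    const char *names[K] = { "unlabeled", "up", "down" };

    for (int c = 0; c < K; c++) {
        int tp = cm[c][c], fn = 0, fp = 0;
        for (int j = 0; j < K; j++)
            if (j != c) { fn += cm[c][j]; fp += cm[j][c]; }

        double precision = tp / (double)(tp + fp);  /* of predicted c, how many right */
        double recall    = tp / (double)(tp + fn);  /* of actual c, how many captured */
        double f1        = 2.0 * precision * recall / (precision + recall);
        printf("%-9s precision %.3f  recall %.3f  F1 %.3f\n",
               names[c], precision, recall, f1);
    }
    return 0;
}
```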
Yes, and this is functionality that we've been working on a lot: the ability to download the predictions, or the evaluation of the model as we call it, basically running the model on top of all the data in your project and visualizing the result. This is a very nice view, which I really like. Up here is the model prediction running on the test data, and the yellow line here is "up". It's the probability, between zero and one, so the probability that the model detects the word goes up to one and then comes down afterwards. This is the raw data, the wave file. So with this you can see, and you can even play back, let's show that, you can even play back how the model behaves in real time, without having to deploy on any device.

Exactly. This is an extremely powerful way of essentially running a field test in front of your computer before going on to test it live, which keeps the development cycle short and fast.

And down here, this line is the output translated to a label, so by playing here we can see... these are actually my words, my "down"s, that we're trying out here.

Exactly, and we can see it did a good job there. Right, so that's the model evaluation. Let's say we're happy with this; then we would like to translate this trained model to C code that we can put on the device, because at the moment it's a TensorFlow model: it needs to run in a Python environment, which is suitable for a PC or the cloud, but not possible in an embedded system.

In embedded systems, C is still the language of choice. C is king.

C is king. So we open up the .h5 model and go to the Edge tab; you see this little MCU icon, and here we have a lot of settings. I will not talk about all of them right now, but here you can see an input file and an output file. That means we can perform a test to make sure that the generated C code and the Python code produce exactly the same output. I will skip that for now, because I know that's the case.

So what happened when you pushed the "build edge" button?

Now we have generated C code, which is actually very quick, and we can see some statistics. I took a slightly bigger model this time, with 11,000 parameters. It has 26.6 kilobytes of RAM and 25 kilobytes of... oh, actually those two numbers are about fifty-fifty: roughly 50 kilobytes of RAM and almost 50 kilobytes of flash. And this C code, which we'll have a quick look at, is completely self-contained: it contains the model, the preprocessor, and a simple API for executing the model.

Yeah, and when you generate C code with our service you completely own that code, so there are no royalties; you can deploy it however you like and modify it however you like. That's a very good reason for generating C code rather than some intermediate representation or machine code: with C code you can dig into it if you have to, or want to.

Exactly, and change it or optimize it any way you want. So this is what it looks like. It's written in such a way that it's actually very good C code, we have put a lot of effort into doing this really well, and it's also easy to interact with. There are three functions we interact with: first we initialize the code, then we enqueue data, which means that we take sensor data and push it into that function, and then we dequeue, meaning that we take out the predictions. That's it.
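The webinar doesn't show the generated header on screen, so purely as an illustration, here is what a self-contained API following that init/enqueue/dequeue pattern could look like. All names, signatures and sizes below are hypothetical, not the literal contents of Imagimob's generated model.h:

```c
/*
 * Hypothetical sketch of a generated model header -- the three-call
 * pattern (init, enqueue, dequeue) is from the webinar; everything
 * else is assumed for illustration.
 */
#ifndef MODEL_H
#define MODEL_H

#define MODEL_DATA_IN_COUNT   1   /* one mono audio sample per tick      */
#define MODEL_DATA_OUT_COUNT  3   /* three class probabilities, sum to 1 */

/* Reset internal buffers and state. Call once at startup. */
void model_init(void);

/* Push raw sensor samples into the internal sliding window. */
void model_enqueue(const float *samples, int count);

/* Pop the latest class probabilities.
 * Returns 0 on success, nonzero if no new prediction is ready yet. */
int model_dequeue(float out[MODEL_DATA_OUT_COUNT]);

#endif /* MODEL_H */
```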
And this code has no external dependencies, so you only need the C file and the header file. That's it, and that's what we'll show now.

Yeah, so we've now been through all of the steps inside Imagimob AI; we've gone from data collection to C code, and now we're going to show you how to integrate this into the firmware of the device.

Exactly, so we'll move over from Imagimob Studio to STM32Cube from ST.

Yes, and that's here. So what have you done to set this up?

What I've done is download a project, a firmware suited for this ST device, from STMicroelectronics' web page, one of the ST starter projects. So that's our scene: we just need to add two files to this project. One is the header file, model.h, which I just added here, and then in the source directory I add model.c. Then in main.c, the main program, I go to the function where the microphone data is actually collected. You see here: this function is called when one millisecond of microphone data is available. So when that's available, I enqueue the data into our model, and then down here, in this while loop, I dequeue the model predictions, and I output the prediction when it has a probability above 0.9.

Yeah, so from the model, at any moment in time, you get the confidence values, or probabilities according to the model, that it's either up, down or background. The output here is three numbers between zero and one, and the sum of them is always one.

Exactly, the sum is one, if the microphone is on. And here I just turn on either the blue LED for "up" or the green one for "down", and I also print to the terminal here. So that's about it. When I'm ready, I compile this code, which generates a .bin file, and then I use STM32CubeProgrammer to flash it. We don't have time to flash it right now, I already did that, but it only takes a couple of minutes.

So what we will do now: we've already pre-flashed this board with the model, and we are now plugging in the board just to give it power; everything is running inside the board. Johan will start up the program on the board and activate the microphone, and then we will test it live and see how well it detects our keywords. What you see here is, of course, the board; it has a blue and a green LED on it. And you have a terminal here where you see zeroes, and zero is the background noise, and two is "up".

Up. Up. Down. Down. Up. Down. Up. Up. Down.

So what you see here is that the blue LED turns on when you say "up" and the green one when you say "down". It's very real time, it reacts instantly, and you could power this off a battery.

Exactly. It's a very compact application on a tiny device: very low power, very cheap to produce and manufacture.
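For reference, the main.c integration just described might look roughly like the sketch below, reusing the hypothetical API from the earlier header sketch. The callback and LED helper names are stand-ins, not the actual functions in ST's SensorTile.box firmware:

```c
/*
 * Sketch of the firmware integration described above. Everything except
 * the enqueue/dequeue pattern and the 0.9 threshold is assumed.
 */
#include "model.h"   /* the hypothetical generated API sketched earlier */

/* Board-support stand-ins; in the real project these would be HAL/BSP calls. */
extern void led_blue_on(void);
extern void led_blue_off(void);
extern void led_green_on(void);
extern void led_green_off(void);

/* Class indices assumed here; the webinar's terminal printed 2 for "up". */
#define CLASS_DOWN  1
#define CLASS_UP    2
#define THRESHOLD   0.9f   /* only react to confident predictions */

/* Hypothetical hook, called by the firmware when 1 ms of microphone
 * data (16 samples at 16 kHz) is available. */
void on_mic_data_ready(const float *samples, int count)
{
    model_enqueue(samples, count);
}

int main(void)
{
    model_init();
    /* board and microphone bring-up elided */

    for (;;) {
        float probs[MODEL_DATA_OUT_COUNT];
        if (model_dequeue(probs) == 0) {
            if (probs[CLASS_UP] > THRESHOLD)        led_blue_on();   /* "up"   */
            else if (probs[CLASS_DOWN] > THRESHOLD) led_green_on();  /* "down" */
            else { led_blue_off(); led_green_off(); }
        }
    }
}
```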
Yeah, well, we actually have a device-tester tool that we haven't shown you today, but with the model using 2,000 parameters I could see a couple of days ago that one model prediction took 15 milliseconds, so it's definitely real time, it's quick. And it's quite complex data: 16,000 audio readings per second being processed on that little device, through all these preprocessor layers and neural network layers.

Okay, so that's it. We've gone from the data to the deployed model, and we've shown it to you live, so we're ready for the Q&A. Let's bring that up.

Yes, but maybe just a few things before the Q&A, Alex, so people know how to proceed.

Absolutely. Thank you for staying with us all the way to the end; now it's time for the goodies. Visit www.imagimob.com and click on the free trial button. In there you fill in your email and your name, and in the comment or description box you type "audio webinar" to also have a chance to win this ST device; ten of you will. So hurry up and register in time.

Take screenshots, yeah. And here is where we took the original data set from, which we cleaned up and added some data to; if you're interested you can find it here. I think we will send this information out to you as well.

Yes, you will get an email tomorrow with more information as a follow-up. Okay, so now we have five minutes left for Q&A. I just want to inform everybody that we will not have time to answer all your questions, but we can get back to you with answers afterwards, because the system saves all your questions. So please continue to ask questions, even if we don't have time to answer all of them, using the functionality in GoToWebinar. Let's start with some of the questions and work through as many as we can.

All right, let me see now, a lot of questions here. The first one: what is the name of the ST EVB? I guess that means the evaluation board. We've mentioned it, but let's repeat it.

Yeah, if you have it at the top of your mind: it's the SensorTile.box, and it has a longer part number, STEVAL-MKSBOX1V1 I believe. We'll make sure you get it afterwards, so we can supply everyone with the actual part number of the device.

Next: is the keyword spotter engine developed by Imagimob? Yes. And just to be very clear here: we showed a keyword spotter now, but we could have detected any kinds of sounds and created any kind of sound detection application.

That would have been an interesting webinar. We could have been breaking some windows and recording that.

Yeah, then we would have been able to detect when a window breaks. Or it could be a baby crying that we detect. So it's completely our own engine, and it's very flexible: depending on what data you feed in and how you annotate it, you can create your own kind of application. What you should take away from this webinar is: have a good think about how you can use this yourself, on your own or in your organization. What kind of applications could you build if you were able to understand audio in real time on a small device like this?

Yeah, and also: is there any data that you can use, and how would you go about actually collecting that data? If you have any questions on that, we're here to help with tips and guidelines on how to think.

Yes. So let's continue with another question: in the data tab, what does the dimensionality mean?

Dimensionality, very good question. That's the dimensionality of the data; you can see it's the number of signals.
So in this case it was one, because we have mono data, one audio stream. If we had stereo data it would say two in the dimensionality. Or if we had some kind of radar sensor, the dimensionality might be much higher, say 256, because we're looking at many data points at the same time. If you think about CSV files, comma-separated values files, you might have many columns in such a file; the dimensionality is basically the number of columns.

All right, next question: is the data processing running in mini-batches rather than purely incrementally? I'm not sure I fully understand the question, it came relatively early, but it might be that the person is referring to the sliding window we're using. We are processing the data over 0.5 seconds, and the model classifies whether it finds a word within that section; then the window moves slightly into the future, collects some more data, and classifies again. So that's one way of classifying audio. Another way would be to have a model with the state built into the model instead of into the preprocessor, and that's also something we're looking into and will support in future versions.

Okay. Time flies when you're having fun, and now it's unfortunately 10 o'clock Central European Time, so we have to close down this webinar for today. But we will get back to you with answers to all your questions, and we still have a lot of questions to go, so we'll get back to you, and you will get an email tomorrow with more information and the ability to contact us, of course. With that, I would like to say thanks to everyone who took the time today to spend an hour with us on this interesting subject, and thanks to the speakers. We're closing down the webinar for now. Thanks a lot for listening.

Thank you very much. Thank you, Lars, for hosting, and we're looking forward to hearing from you guys. So take care, and be careful out there. We'll close down the webinar now.