So, the topic that I have for this session is minimizing CPU utilization for deep networks. It is not just about CPUs; it is really about how we can minimize the resources we have while training a network. To understand why I picked this topic, we have to go back and see where I come from. I come from a middle-class family in Uttar Pradesh, and middle-class Uttar Pradesh is about as middle class as it gets. How many of you here are from Uttar Pradesh? None? But you have been to Uttar Pradesh. Okay, fine, good luck, I am happy to see you here. So just memorize these two words, middle class and Uttar Pradesh, because they are going to come up often in this session. Whenever we want something or want to buy something, we go to our parents, and we all know what happens then; I am pretty sure you have an idea. I went to my dad and asked him for a GPU because I wanted to train machine learning models. My dad is pretty cool, like all of your dads who you feel are cool, but you can guess his answer: it was no. So, out of my own necessity, I still wanted to work on interesting, complicated machine learning models, but I did not have a GPU. What did I have? I thought, let me just experiment and see whether I can train these very resource-heavy models on my CPU, or at least use as few resources as possible. This is how I came up with this topic.

Before that, I just want to thank ODAC for having me. I think this is the best conference content that happens in India. I was here last year as well; it was a great experience, I got to learn a lot, and I am very thankful to be giving a session again this year. A bit about myself: I am a software engineer at Symantec, working in the cybersecurity domain, currently on the big data analytics team. I have a great team that supports my work and helps me present it at different conferences. I am also an Intel Software Innovator, which means I have access to Intel technologies, the DevCloud and other things, and I am a machine learning expert with Google Developers. Basically, all three of these mean that I can brag about them. Sorry, this slide was from last year. The agenda is this, but looking at the agenda I think this other slide suits better, because I am not entirely sure what all of these items are; let us just go ahead and figure them out one agenda item at a time.

Before I start, there is a video I tend to show at every conference I go to, because it motivated me and gave me the essence of why I wanted to get into machine learning. I will show the video without the sound and give a voice-over for it. It is the story of a lady named Tanya, who does not have any motor functions: below her neck she cannot move any of her muscles.
If you can see, she has this headgear attached to her wheelchair, and using Morse code she is able to communicate. And with all these disabilities she was able to help Google: with Tanya's help, Google was able to integrate Morse code into the Google keyboard. I think that is what technology really stands for. I watched this video a couple of years back, and before that I wanted to work on interesting, cutting-edge technologies, maybe autonomous driving. But the thing I forgot was that every innovation, every technology, is there to help humankind. The video is on YouTube; you should go and watch it. It motivated me, and I hope it motivates you as well. So this is Tanya's story: this is her going to Google's office and helping Google integrate Morse code so that more and more people can use it. You can see that with the help of the new Google keyboard, people with special abilities like Tanya were now able to communicate, and it made life easier for them. Somebody working with this notion of helping humankind is what technology stood for, to me.

So, the first technique I would like to mention is very basic; I think every one of you must have seen it. Can you tell me what normalization is? Does anybody have an idea? Okay, fine. That is perfectly cool, because now I can tell you anything and you have to believe me; this is the perfect audience, I tell you. Normalization basically means I want to set a range for my values; I want to make sure my values stay within a certain boundary. And why do we do that? Let us have a look. We are working on the MNIST dataset, which is like the hello world when it comes to image data for machine learning. This is how the data looks: this is the label 3 and this is the image of a 3. If you look at the data, the values range from 0 to 255, which is the pixel intensity for every pixel. A simple normalization is to divide the entire dataset by the maximum value, so the range that was earlier 0 to 255 is now between 0 and 1, and the mean is around 0.5. Another thing we could do is make the mean 0, so that the values range between minus 1 and plus 1, and we do that because with zero mean the model converges better. Let us have a look at how that happens. This is the basic model we are using: we have this many examples and a simple sequential model, no CNN, just to give it a try. When we fit the raw data with the 0 to 255 range, the accuracy is 26 percent after 2 epochs and the loss is actually increasing, which means the model is not learning at all. With the normalized data, the 0.5-mean version and the zero-mean version, the accuracy is around 85 percent in one case and around 93 percent in the other by the second epoch. So what does that basically mean? With the same parameters, the difference comes entirely from the data.
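This was not spelled out as code on the slides, but as a minimal sketch, assuming tf.keras and its built-in MNIST loader, the two normalization variants described above look something like this:

```python
import tensorflow as tf

# Load MNIST: pixel intensities are integers in the range [0, 255].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Variant 1: divide by the maximum value -> values in [0, 1], mean around 0.5.
x_train_01 = x_train.astype("float32") / 255.0

# Variant 2: zero-centre the data -> values in [-1, 1], mean around 0,
# which is the version that converged best in the experiments above.
x_train_zero_mean = (x_train.astype("float32") - 127.5) / 127.5
```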
So, if you look at the model, the model is the same; the only thing we have done is normalize the data, and that has given us roughly four times the accuracy we had before. If we compare the graphs, in the first case the accuracy sits around 25 to 27 percent and is actually decreasing, which means the model is not learning, while in the second case the accuracy starts at around 74 percent and goes up to 85 percent. So with just a simple technique, a single line of code, we have improved our model accuracy roughly fourfold, and thereby reduced the resources we will use: we only need to train for 3 or 4 epochs, whereas with an unnormalized dataset we would have trained for 10 or 15 epochs to get the same accuracy.

Then there are a couple of other things, like optimizers. I simply pass the optimizer into the model, so let us see what happens. We are using the same model; the only thing we are changing is the optimizer we use. There are different optimizers you could choose. First is SGD, whose main purpose is to make sure the model trains fast, learning with every epoch, and for that we need to check the time taken per step. If you look at this, it takes only 59 microseconds per step while training, whereas RMSprop, although accurate, takes more time, around 64 microseconds. So it depends on your use case whether you want to go for SGD or RMSprop. I usually go for Adam, which is on the next slide and which is often described as Adagrad and RMSprop combined, and then we have Adadelta as well. You can compare the accuracies here as well; they are almost similar. Since we are working with MNIST there is not a lot of scope for benchmarking, but still, if you want to go faster you should definitely go for SGD, and if you want a more accurate one, Adam and Adadelta are more preferred.

Then we have different activations. We provide activations because we want our model to learn non-linear functions; we do not want a model that only learns linear functions. We first went for sigmoid, which basically squashes every value towards 0 or 1, and the accuracy is around 79 or 80 percent. The problem that happens in a neural network is that the gradients diminish over time, and ReLU prevents that. If you look here, the accuracy for the same data, with everything else the same, is 86 percent if you use ReLU. These are very simple techniques, but you can still use them to increase accuracy and thereby save resources with the same model. Then we have certain advanced activations as well: ThresholdedReLU and LeakyReLU, both of which are refinements of ReLU, and depending on your data some will work for you and some will not. ThresholdedReLU is actually considered to be a better alternative, but here, if you look, the accuracy is actually decreasing, so it is not good for our dataset; what you could do is maybe use a bigger network or get more data, and then it will probably work. LeakyReLU, however, performs better: with ReLU we were getting around 86.2 percent, and here we get 86.7 percent, which is only slightly better, but still better.
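Again only as a rough sketch, assuming tf.keras (the layer sizes are illustrative, not the exact ones from the slides), swapping optimizers and activations on the same small sequential model looks roughly like this:

```python
import tensorflow as tf

def build_model(activation="relu", optimizer="sgd"):
    # The same simple sequential network; only the activation and optimizer change.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation=activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Compare optimizers on identical architectures (SGD is fastest per step,
# Adam and Adadelta tend to be a bit more accurate here).
for opt in ["sgd", "rmsprop", "adam", "adadelta"]:
    model = build_model(optimizer=opt)
    # model.fit(x_train_01, y_train, epochs=2)

# Advanced activations such as LeakyReLU (or ThresholdedReLU) are added as layers.
leaky_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.LeakyReLU(alpha=0.3),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```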
So, these are the kinds of techniques we could follow, and then we have focal loss, which is basically an alternative to categorical cross-entropy. Categorical cross-entropy is used when we have multi-class classification, and focal loss is something similar, but on the benchmark datasets used to evaluate it, it has proven to perform better than categorical cross-entropy. This is the actual function, and we get around the same accuracy as previously on the same data. Basically, what happens in focal loss is that the easy, well-classified examples are down-weighted so that the hard examples are not overlooked, whereas with plain categorical cross-entropy the easy examples can dominate the gradients; that is the important point of focal loss, and we can discuss it in more detail later if you like.

Then we have a technique called LR decay, learning rate decay. We do not want the model to overshoot; we do not want it to keep learning with big steps and then overfit or miss the minima. Therefore, we go for learning rate decay. If you look, after every epoch I call a callback function that decreases the learning rate, and the accuracy is around 85 percent, which is still good. There is a small sketch of both the focal loss and the decay callback a little further down.

And if everything works well, you will have something that looks like this; I think you have probably seen this before. In this use case I combined computer vision with machine learning, and I used three networks: a logistic regression network, a shallow network, and a deep network. In most cases the deep network will outperform the others, because in a deep network certain features get learned automatically based on the network. A deep network used with a CNN is very effective, and it can easily be combined with computer vision to develop an application with a better interface. And then this one is actually an encoder and decoder. You can think of it as the model trying to draw these digits from memory: I write the digit 5, it has to recognize it, and then it has to draw it from its own memory. So I have encoded that digit and then decoded it using the same kind of network, again with a logistic regression network, a shallow network and a deep network. Mostly the deep network has a better memory, therefore it draws a better digit, and this encoder-decoder idea is a building block that leads you towards the concept of GANs.
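As a small sketch of the focal loss and learning-rate decay pieces mentioned above, assuming tf.keras; this is one common multi-class focal loss formulation, and the gamma, alpha and decay constants are illustrative assumptions, not the exact values from the slides:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def focal_loss(gamma=2.0, alpha=0.25):
    # Down-weights easy, well-classified examples so the hard ones
    # contribute more to the gradient than with plain cross-entropy.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        cross_entropy = -y_true * K.log(y_pred)
        weight = alpha * K.pow(1.0 - y_pred, gamma)
        return K.sum(weight * cross_entropy, axis=-1)
    return loss

def decay_schedule(epoch, lr):
    # Shrink the learning rate a little after every epoch so the model
    # does not overshoot or skip past the minimum late in training.
    return lr * 0.9

lr_callback = tf.keras.callbacks.LearningRateScheduler(decay_schedule)

# model.compile(optimizer="adam", loss=focal_loss(), metrics=["accuracy"])
# model.fit(x_train, y_train_onehot, epochs=5, callbacks=[lr_callback])
```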
Just before we go on to the next topic, I wanted to ask you something, because it is around 5.30 and nobody sits this long at a conference; you must have been here since 10 o'clock. So why are you here? It is actually very funny; if I were in Europe I would probably have left and taken a nap, it being Thursday. I think you are here to learn something new. So let me tell you why I am here, or rather why I was here last year. Last year, the only reason I came to this conference, and thankfully I did, was because I was given a free stay at the Sheraton, and not a lot of people can afford that, certainly not me. So those two words, middle class Uttar Pradesh, came into the picture again: the only reason I came here was that I was getting free food, free stay and free travel. But I met a lot of good people here. Last year I met people like Panjan, Fabio and Thaddy, all of my idols, and I got to learn a lot from them, and this year I am meeting, for the first time, someone whose projects I have actually used. So I thank him for coming here and helping us understand these things; he has a session tomorrow, I will be there in the audience, and I think you should attend it as well.

Coming back to the topic: besides optimizing the data, we can also optimize the network itself. What does that mean? Let us look at this dataset. There are of course any number of techniques; I have just pointed out a couple of them, and there are more advanced ones like model quantization and model pruning, but we can discuss those later. With one simple change we can definitely get a lot of improvement. Here we are working with the CIFAR-10 dataset, which has around 50,000 examples divided into 10 categories, and the interesting thing is that we now have three channels: we are working with colour data. The first architecture we have is the same one we used in our previous examples, just a sequential, fully connected model, and if you train it you get a flat line like this: the validation accuracy stays constant at around 10 percent, and the training accuracy just goes up and down, which means our model is not performing at all; it is not able to learn any features, and therefore the accuracy is very low. With one simple change, using a convolutional network, let us see how we can improve the accuracy: the accuracy that was around 10 percent is now around 36 percent with the same number of epochs. And the important thing to notice is that the fully connected model has around 3 to 4 million parameters, while the convolutional one has only about 300,000, so we are using roughly 10 percent of the parameters of the previous model and still learning quite a lot more. It is also important to understand your use cases and which model would be better for you; you do not even need a machine learning model in place if a simple if-else could do the trick.
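To make the CIFAR-10 comparison concrete, here is a rough sketch of the two kinds of architecture being contrasted, assuming tf.keras; the layer sizes are illustrative, so the parameter counts will not match the slides exactly:

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0  # 50,000 colour images, 32x32x3

# Fully connected baseline: millions of parameters, stuck near 10% accuracy.
dense_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Small convolutional network: a fraction of the parameters, but it
# learns spatial features and therefore actually improves with training.
cnn_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

for m in (dense_model, cnn_model):
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    print(m.count_params())
```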
And this is an important project that I did. It is based on a research paper by NVIDIA; they call it their autopilot work, and it covers one module of autonomous driving, behavioural cloning. Behavioural cloning means that given an image of the road, I have to predict the steering angle. There are other modules as well, such as pedestrian detection and path planning, and all of these combine to form an autonomous vehicle. The important thing to notice here is the per-image size of the dataset, and the whole dataset is around 2.9 GB zipped; this is how it looks. It is one dataset of about 2.9 GB, open-sourced by NVIDIA itself.

So the task is: given the road, you have to predict the steering angle; everything else is given, and we will get to the details. This is how the dataset looks, and this is the actual model that was presented in the paper. The paper suggests that we can go with this model and get good accuracy. If you look at the actual model, the number of parameters is around 132,000 and the total number of pixel values per image is around 40,000; keep these two numbers in mind. But the problem with this dataset is that if I wanted to load the whole thing into memory, just load it into my RAM, I was using around 8.7 GB while loading everything. Somebody who has less RAM would simply not be able to do that.

So there are a couple of things we could do. The first is scaling, which is a very simple concept: we just scale the images. Earlier the images were 66 by 200 by 3; we scale them down to 50 by 50 by 3, and let us see how the model performs. The important thing to notice is that the RAM usage has decreased significantly: it is now around 1.1 GB with the entire dataset loaded into memory, and that is an important fact, because somebody who has less RAM can still work with and train this model. This is not the model from the paper; it is a different model that starts with the 50 by 50 by 3 input and goes down to the final output layer, and the number of parameters stays roughly the same. If we train it, the loss is 0.13. We are not going for accuracy here, because there is nothing categorical to compare against; we only use mean squared error, which is basically taking the actual steering angle and the predicted one, subtracting them and squaring the difference to get the loss. If you look, the loss is decreasing, which means the model is learning, which is the good thing.

Another technique we could use is filtering, and there are a couple of filters we could define ourselves, or we can just follow the OpenCV documentation and get to know them. This one is a conversion to HSV, which stands for hue, saturation, value, and it lets us look at the gradients differently. If you look at this image, you can clearly see that the road has a darker shade than the rest of the image, and this played a particular role when I was exploring the dataset. What I could then do is pre-process the images and eliminate the three channels, using only one channel. So I got the data down from 50 by 50 by 3 to only 50 by 50; I was working with a single channel, and the RAM usage decreased again, from 1.1 GB to only 757 MB, so anybody could easily work with this data. And yes, you could use plain grayscale, but this is actually more effective when you start from colour, because then you have the different gradients and you actually get to see the difference between the road and the surroundings; that is what I am saying. So let us have a look at that. Of course, if we keep the same model and everything else the same, the accuracy would decrease, so we have to come up with a methodology in which the accuracy stays intact, or if it does decrease, it should still be almost the same.
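As a minimal sketch of the two preprocessing steps just described, using OpenCV; which HSV channel to keep is an assumption here rather than a detail from the slides:

```python
import cv2
import numpy as np

def preprocess(frame):
    # Resize the original 66x200x3 frame down to 50x50x3 to cut the
    # in-memory footprint of the dataset dramatically.
    small = cv2.resize(frame, (50, 50))

    # Convert to HSV (hue, saturation, value); the value channel keeps the
    # contrast between the dark road surface and the brighter surroundings.
    hsv = cv2.cvtColor(small, cv2.COLOR_BGR2HSV)
    single_channel = hsv[:, :, 2]  # 50x50, one channel instead of three

    return single_channel.astype(np.float32) / 255.0
```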
So, now, because I have only one channel, I declared another model, and if you look at the parameters, I am only going with around 30,000. Starting from the paper model, which has around 132,000 parameters, we were able to get it down to around 31,000. Let us see how it actually works: if you look at the loss, it is around 0.14 or 0.15, and we can see that the model is still learning. So even though we have decreased the image size and everything else, we have still kept the performance intact.

And one more thing: we could use another function, the fit generator function, which comes out of the box with TensorFlow and Keras and with which you do not need to load the entire dataset into memory. The problem we were facing was that we had to load the entire dataset into memory; what if we did not have to load everything? What the fit generator does is load only the batch of images that is required and then fetch the next batch; there is a small sketch of this idea a little further down. With that, I went back to the full image size and the parameters are around 10 million, so I am just going crazy with the model there, and the total pixel count per image is back to the original. If you look, the loss drops almost exponentially and I am getting a loss of 0.0074. So if you want to make sure your model is very accurate and you still want to be able to work with the data, this is of course a technique you could use, but this bigger model does consume quite a lot of resources while training. This is one functionality you could test; I found it very useful, and maybe you could give it a try as well.

So I changed the model to use the HSV preprocessing that I described, and I actually got good results. Sometimes it fails, of course: sometimes, when the predictions come, it tends to drift towards the left side, which would mean hitting the pedestrians, and that should not happen. But on clear roads it works very well. This is a very clear road, and you can see that the model is able to predict the steering angle, that I have to take a right turn now or a left turn now. So we have reduced the number of parameters that are required and we have also reduced our image size, but at the same time the accuracy is still intact, and that is the whole point. I just got this video from YouTube, ran my model on it, and it was working well, so basically you could use this with any dash cam. Not in Indian conditions, though; you can see this would probably not work in Indian conditions, and definitely not when you are in Uttar Pradesh. Yeah, in Bangalore the car will mostly be stationary because you are stuck in traffic all the time. But this is something you could probably work with.
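Here is a rough sketch of the batch generator idea mentioned above, assuming tf.keras and reusing the hypothetical preprocess helper from the earlier sketch; the names and batch size are illustrative, and newer tf.keras versions accept the same generator directly in model.fit:

```python
import cv2
import numpy as np

def batch_generator(image_paths, steering_angles, batch_size=32):
    # Yield one batch at a time so the full dataset never has to sit in RAM.
    num_samples = len(image_paths)
    while True:
        idx = np.random.choice(num_samples, batch_size, replace=False)
        images = np.array([preprocess(cv2.imread(image_paths[i])) for i in idx])
        angles = np.array([steering_angles[i] for i in idx])
        yield images, angles

# steps = len(image_paths) // 32
# model.fit_generator(batch_generator(image_paths, steering_angles),
#                     steps_per_epoch=steps, epochs=10)
```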
Then another project that I did is an emoji one, in which, given a hand gesture, I predict which emoji I am trying to say. The problem here is that I had to create the entire dataset myself, and when that happens you overfit, because people have different hand sizes, and even if I collect data from the 5 or 10 people I know, that still will not cover every hand in India, or probably even in Bangalore. So we could use image augmentation for that. There is an out-of-the-box functionality called ImageDataGenerator, and it helps you rescale images, set shift and zoom ranges and so on; I will show you what exactly happens, and there is a small sketch of this setup a little further down. This is how it augments the data. This is the same image, but you can see that it has been magnified as well as mirrored, so even somebody who is left-handed can still use the model that I have trained. Similarly, this one is shifted downwards, and you can see that it actually adds some noise as well, which prevents overfitting, and this one is again mirroring, and so on. If you look at the accuracy, it trains very well: it goes up to around 98 percent, which is quite good, and the loss decreases to around 0.0-something.

If everything goes well, one thing to notice is that we have not used an object detection API. With object detection, each inference takes a lot of time, so the FPS is around 22 or 23; with these filtering techniques I was able to get around 40 or 45 FPS on my local machine without using a GPU. My model was trained well, and the dataset was universal enough that I was able to use the same dataset for a different problem: here I am playing rock, paper, scissors, lizard, Spock against the CPU. The model is the same; I have just added more data, more images, and retrained the model, and it is still able to pick up all of these gestures correctly.
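As a rough sketch of the ImageDataGenerator setup described above, assuming tf.keras; the specific augmentation ranges are assumptions, not the exact values used for the project:

```python
import tensorflow as tf

# Apply random shifts, zooms and flips on the fly, so a small hand-gesture
# dataset covers more hand sizes and orientations and overfits less.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,  # lets left-handed users benefit from the same model
)

# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)
```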
And this is something that I am currently working on: malaria detection. If you look on the left, the presence of these pigmentations means that the cell is probably infected with malaria, and similarly, on the right-hand side, you see that it is clear, so it does not have malaria. You can train a model on this directly, but the problem is that it will use a lot of resources. So we can come up with a couple of filtering techniques. One is an RGB-to-HLS conversion, which is available out of the box, but the problem with this is that while it does show that the pigments are present, it also finds certain gradient changes in the uninfected cell. So we used a LAB filter instead, which is also available out of the box, and if you see, it is able to pick out just the pigments, while the uninfected cell still looks the same. So even if I now reduce the dimensions of these two images, I will still get very good accuracy. I am still working on it, it is not finished, and I am going to open-source this as well.

And this is something I feel very passionate about: Indian sign language. It is different from American sign language, and this is something that I realized when I was volunteering at a school for deaf and mute children. The problem with Indian sign language is that it differs between different regions in India. Within Maharashtra, Pune and Mumbai had different sign languages because their teachers were different, and I am pretty sure that Bangalore, Pune and Delhi will all have different sign languages, because there are certain words which are only present in one language and not in the other.

So I am going to need your help in this project. I am going to open-source it very soon, and I want you to maybe come up with techniques to help me out, but let me first show you what I have done. Basically, I have again gone with a lot of filtering techniques, so instead of working with the entire image I am only working with this black-and-white image. This means I am not using a GPU, I am not using object detection, and I am able to get this on a simple CPU at runtime with the FPS intact. It also means this recognition could take place offline as well: even if a system does not have internet, or does not have a lot of resources or computing power, it can still run this. But currently it is only going to recognize static gestures; there is no capturing of the entire context or the entire sentence, and that is something I am currently working on. I am going to open-source this, and maybe you can have a look at it and try to improve it, so that we can have one universal sign language, at least within India.

So, if you have any questions, feel free to reach out. There are a couple of techniques that I did not cover because I did not get the time to prepare them, such as quantization and pruning, which are very popular but not very widely used, and there is also distributed deep learning, which I am currently exploring and which really helps when we are working with big data. Feel free to reach out to me, maybe offline; I will be available tomorrow as well, so if you have any questions I will be happy to take them. All right, sounds good. Thank you, thanks guys.