Hello everyone. Our next talk, on on-device machine learning, will be by Shravya. Over to you, Shravya.

So hello everyone, I am Shravya. I am working as a software engineer at Qualcomm, and I have around four years of experience in the field of natural language processing and machine learning. Today we are going to learn about deep learning on mobile devices.

We have had many recent advancements in the field of machine learning and deep learning, where we come across CNNs, GANs, RNNs, LSTMs, and other such complex architectures. So now the question is how to deploy them on mobile devices, because edge devices are constrained in both memory and compute. In this talk we are bringing the power of those techniques down to run on mobile devices, and we are also going to see some cool real-world examples where this is being implemented. So let's jump into it.

First, the introduction. The very first question that pops into our mind is: what is the need for on-device AI? I'd say because of these five factors. First comes privacy: a user will be reluctant to share private data while using any app, whether it is powered by deep learning or not, because security is paramount here. Then reliability: regardless of factors like phone storage or internet connectivity, the user wants apps to run seamlessly and smoothly. Latency is one of the key factors. Take a scenario where you want to buy something online and you go to a website for it. If the website takes time to load, what is our immediate step? We just leave that website and go to another one. So we can infer that an increase in latency means a decrease in usability, and thereby a decrease in profits as well. Then comes cost: however complex the algorithm or the app being run, the user wants a mobile device at a budget-friendly price. And lightweight: the models that run on mobile devices are relatively lighter in weight than the ones that run on laptops or other devices.

These are some industries where DL applications can be powered on edge devices: the mobile computing industry; extended reality, which is a blend of augmented and virtual reality; smart homes and smart cities, where we will see how deep learning can be implemented at the microprocessor scale; wearables; IoT; networking; automotive; and healthcare.

And these are some of the cool applications that have been developed and are already in use in some smartphones: autofocus, auto capture, low-light photography, earthquake detection, face detection. Snapchat has really cool DL applications like face swap, upscaling an image to HD, and many more you can explore.

So we have understood why we need deep learning on mobile devices and where it is being used; now we will see how it can be done on a mobile SoC. This is a typical SoC architecture, and all of these are important peripherals: we have encoders, direct memory access, and the CPU, GPU, and DSP, which are the hardware accelerators, along with other peripherals like audio, USB, video, and the NIC. We will not go into the details of all of these, but we will go through the hardware acceleration section, since it plays a key role in performing all the computations used by deep learning architectures. As we know, a CPU is present in every chipset, every SoC, and it supports floating-point and integer operations.
The GPU supports floating-point inference, in FP16 and FP32 precision. What do FP16 and FP32 mean? In FP16 mode we convert our native tensor data types to FP16 and store them in FP16 accumulators, and in FP32 mode we convert the tensor data types to FP32 and store them in FP32 accumulators. In FP32 mode we get more accuracy, but we compromise on performance; it is the other way around for FP16, where we get lower accuracy but higher performance. So in order to maintain a trade-off between the two, we go for mixed-precision mode, where during computation itself we balance accuracy and performance inside a single operator. (A small code sketch of these precision modes follows at the end of this part.)

Then the DSP: the DSP is a digital signal processor that supports quantized inference, running all the INT8 operations. Similar to the DSP, we have another accelerator called the TPU, which also supports INT8 operations.

So what are the key takeaways from hardware acceleration? We saw earlier the typical flow when creating any ML experiment: an end user creates and runs experiments on these devices and gets the results; while analyzing the data, they understand the bottlenecks and deploy the allowlist. Here are the key takeaway points. Hardware acceleration plays a key role in performance gain: we saw that the GPU supports floating-point inference, so it can be up to 10 times faster than the CPU, and the DSP is even faster than the GPU because it carries out INT8 operations only. Device spec plays a major role in gaining either performance or accuracy, so we should take the device spec into consideration while running any ML experiment. Automatic acceleration also plays a key role: the user has the capability to control the workloads and can offload operations to particular accelerators while creating the experiment itself; that way there are no fallbacks, and all the workloads go straight to the selected accelerators.

In order to develop a good deep learning application, these are some of the prerequisites: a high-quality dataset; the hardware on which training happens; an efficient inference engine, which is the hardware spec we were just talking about; and an efficient model that performs well with high accuracy.

Then we come to the end-to-end ML workflow. Right from problem definition to deployment, these are the various stages any machine learning engineer deals with. In this diagram, the boxes highlighted in yellow are off-device activities and the boxes highlighted in green are supported on device. Whenever we run an ML experiment, we should check whether these processes are supported on device or not. As you go from the top down, from data acquisition to model deployment, the number of green boxes keeps increasing; at the deployment stage, all the boxes are green. At data acquisition you have data versioning; at model development you can select your network architecture, tune it, and add your development environment support; and at deployment you deploy to mobile and edge devices, with model versioning and so on.
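Before we move on, here is the FP16 idea from the hardware discussion as a minimal sketch of my own (not from the talk), assuming a model exported as a SavedModel under a hypothetical saved_model/ directory: the TensorFlow Lite converter can store the weights in FP16, which is exactly the accuracy-versus-performance trade-off mentioned above.

```python
import tensorflow as tf

# Minimal sketch: convert a SavedModel to a TFLite flatbuffer whose weights
# are stored in FP16, trading a little accuracy for speed and size.
# "saved_model" is a placeholder path for a model you have exported.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(fp16_model)
```

The INT8 path that DSP-style accelerators expect works the same way, except that you also supply a converter.representative_dataset generator so the converter can calibrate the quantization ranges.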
System-wide features: this comes down to the acceleration support we talked about, and to security and governance, like access controls and so on.

This is the training-versus-inference flow. At the top level, you split your assets and build your model. Once model building and training are done, you test your model. If you are satisfied with the accuracy and other metrics, you freeze the model into a static weights-and-biases file; if not, you backpropagate and reiterate on training your model.

These are some of the pre-trained models available for performing common tasks. In order to deploy on device, you first convert the model into an intermediate model. This intermediate model should be recognizable by the accelerator end. You can optimize this model further, I mean, to reduce its weight, and once the optimized model is available, you assign it to a particular runtime and you can run inference. And these are some of the common tasks performed by the pre-trained models we just saw.

Next, the Qualcomm Neural Processing SDK. This is a framework used on Snapdragon chipsets to run deep neural networks faster. At the application code layer, we take third-party applications or third-party frozen ML models built in TensorFlow, Caffe, or ONNX. Once those models are available, the SDK makes an API call to the NPE, which loads our model into a runtime. Once the model is loaded, we convert it into an intermediate model, as we saw on the previous slide; here, the intermediate model is in the form of a DLC file. Once the DLC model is converted, we validate the model: we split it into a series of graphs and validate whether all the operations mentioned in the graphs are supported by the runtimes we have available. Once we get a green flag from there, we can add user-defined layers in order to boost the performance of the graphs; if you want to replace any op with your own custom op, you can do it here. And logging: while inference is being carried out, this part takes care of all the logging and related bookkeeping. Once this is done, the engine offloads the work to the corresponding accelerators. As we saw earlier, these are the precision modes handled by each accelerator, and the work is handed down to the hardware level for execution.

As an example, we will take the Inception v3 model available in the TensorFlow model zoo and deploy it on device. There are four steps: first the setup stage, second the model conversion stage, third the inference, and fourth mapping, that is, obtaining the results. In the setup stage you download the SDK: you just download and unzip it to your SNPE root, download the Android SDK, NDK, and TensorFlow, and run all the environment-path and dependency scripts; once they have run, you are ready to go. Then we convert that Inception v3 model to an NPE model; this is the intermediate model we saw earlier, a DLC file. Then we run the DLC file on Android with the respective accelerator, be it CPU, GPU, or DSP, and obtain the results.

This is the code snapshot of all four stages we saw on the earlier slide. At the setup stage, we unzip the SDK we downloaded and run the dependency scripts. Then we set up the Android NDK root path; here I'm using r22b.
You can use the version of your choice, or the latest one. For setting up the TensorFlow environment, you download TensorFlow and run its dependency script. For some frequently used models, like Inception, GoogLeNet, and AlexNet, there are predefined scripts available to set up the environment for them; if not, you can just download the model and convert it straight away.

Then we do the part where we convert the TensorFlow model to an NPE model. For that, we use a tool called snpe-tensorflow-to-dlc, which takes the input model downloaded from the TensorFlow zoo; we specify the model input dimensions here and the output node, and this is the output path where the final model file, model.dlc, will be delivered. Once this model.dlc is obtained, we are ready to run it on Android. (A rough script of the conversion and run steps appears after this walkthrough.)

Here I'm picking the DSP runtime for faster inference; you can choose the accelerator of your choice. While running, we set up the target architecture; this is the architecture of the device I'm running on. You need to check the ARM version of your device and change it accordingly, to something like aarch64-android-clang6.0. The LD_LIBRARY_PATH is used so that the tool can recognize where your model is placed and where the input files are available; it is like the system variables in our environment path. The ADSP_LIBRARY_PATH is used for the same purpose: as we are running on the DSP runtime, we need to set this so the tool understands where the DSP libraries are located on the device.

Once all this path setting is done, I take the help of the snpe-net-run tool: I pass the converted DLC file to the container parameter, pass the list of input images, and select the DSP runtime in order to offload the work to the DSP accelerator. With the perf profile set to burst mode, you can get high performance; for a detailed explanation of all the available profile modes and of this setup, you can have a look at the official NPE documentation at this link.

Once the model has run, we get a metric summary like this. These are all the metrics available once the model has run on Android. Load is the load time for the model, and Init is the Snapdragon NPE runtime initialization; RPC Init gets initialized while making the DSP call, along with the Accelerator Init for the DSP we selected. All the boxes highlighted in blue are the ones activated during execution: inference time, forward propagation, and so on. If you look at the runtimes here, until the DSP is activated it uses the ARM CPU; once model execution starts, it makes use of the DSP runtime. And these are all the graph layer structures: for the Inception model we used, from layer 0 to layer 16, you have the sequence of operations and the time taken by each layer to compute.

This is a side-by-side comparison of running the same model on the DSP runtime and the CPU runtime. You can clearly see the difference: while we get around 12K here, the figure is more than ten times that in the CPU runtime, so we can infer that the DSP is more than 10 times faster than the CPU.
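As a rough sketch of how the conversion and run steps could be scripted, here is an illustration of my own, not the exact commands from the slides: flag spellings vary across SNPE releases, and the frozen-graph tensor names shown are the usual ones for the TF-Slim Inception v3, so verify both against the documentation for your SDK version.

```python
import subprocess

# Step 2: convert the frozen TensorFlow graph to a DLC file.
# Input/output tensor names are the conventional ones for the TF-Slim
# Inception v3 frozen graph; check them for the file you downloaded.
subprocess.run([
    "snpe-tensorflow-to-dlc",
    "--input_network", "inception_v3_2016_08_28_frozen.pb",
    "--input_dim", "input", "1,299,299,3",
    "--out_node", "InceptionV3/Predictions/Reshape_1",
    "--output_path", "inception_v3.dlc",
], check=True)

# Step 3: run the DLC on the DSP runtime in burst mode, as in the talk.
# In practice this runs on the Android target (e.g. via adb shell); it is
# shown as a direct invocation here for brevity. input_list.txt holds the
# paths of the preprocessed input image tensors.
subprocess.run([
    "snpe-net-run",
    "--container", "inception_v3.dlc",
    "--input_list", "input_list.txt",
    "--use_dsp",
    "--perf_profile", "burst",
], check=True)
```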
A similar framework to the Qualcomm Neural Processing SDK is the Android NN API, developed by Google. This is a C API used for running deep learning networks faster on Android devices. The process flow is something like this: we have an application, and we have written a model for it in our framework of choice, be it TensorFlow or ONNX. Once this is available, we pass it to our Android device for forward inference. The Android NN API makes a call to the Android NN runtime, which calls the Android NN HAL, the hardware abstraction layer. This layer takes care of splitting the work across the accelerators we specified or that our device has, and if no hardware accelerator supports a particular operation, it falls back to the CPU for execution.

The four main abstractions here are model, compilation, memory, and execution. Model: as we saw earlier, it is a graph of all the operations; say 2 + 3 = 5, where 2 and 3 are my operands and plus is my operator; a model is a similar graph of operations and operands. Compilation: this compiles our NN API model down to a low level that is understandable at the hardware level. Memory: while executing any model on Android NN or any of the available frameworks, a lot of memory exchange is needed in order to share the intermediate results and store them in cache; we split the model into various subgraphs, and each subgraph relies on the previous subgraph to continue execution, so these memory buffers help with that part. Execution: we take our NN API model and do a forward inference pass.

This is the workflow where we have an application developed with TensorFlow, and we convert it into a TensorFlow Lite model to deploy on device. Ever wondered what a TensorFlow Lite model contains? It is a FlatBuffer file holding a series of subgraphs. In subgraph one the original model is written out; we pass golden inputs through subgraph one and it gets processed, then the set of outputs from subgraph one is passed to subgraph two, and the process goes on like this. At the end we have static files, the weights of the original model, along with golden inputs and outputs: inputs being the images that need to be classified, and outputs being the expected probabilities. (A minimal interpreter sketch for such a file follows this part.) Once that model is available, we pass it to the NNAPI, which makes a series of API calls to hand it to the NN HAL layer, as we saw on the earlier slide. From there it calls the compilation part and ships our model onto the device.

Once the model is passed on device, it performs three steps: get supported operations, prepare the model, and execute. In get supported operations, it lists all the operations used in the graph and checks whether each of them is supported. Once they are, it prepares a graph with all the operations and assigns a runtime to each of the ops. Then it executes. The top box is the offline workflow and the bottom box is the online workflow, meaning it happens on device.
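To make the flatbuffer story concrete, here is a minimal sketch of my own (not from the talk, with a placeholder model path and input shape) of one forward pass through a .tflite file using the TFLite Interpreter; in an Android app, the same interpreter can be handed the NNAPI delegate so that the HAL dispatches supported ops to an accelerator.

```python
import numpy as np
import tensorflow as tf

# Minimal sketch: load a .tflite flatbuffer and run one forward pass.
# "model.tflite" and the 224x224x3 input shape are placeholders; use the
# shapes reported by get_input_details() for your actual model.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# A dummy image tensor standing in for the "golden inputs" mentioned above.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()

# Output: the class probabilities the frozen model produces.
probabilities = interpreter.get_tensor(output_details[0]["index"])
print(probabilities.shape)
```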
We have just seen the Android NN API and the Qualcomm Neural Processing Engine SDK. Along with these, there are other on-device deep learning solutions like PyTorch Mobile, Core ML, Caffe2, Google ML Kit, and Paddle Lite; you can get your hands on them and see which one performs better for you.

We discussed earlier that a model is a directed acyclic graph of all the operations, and this is how an example graph looks. It does not resemble any particular pre-trained model; it is just for representational purposes, listing, from the input until we get the output, the series of operations being executed: convolution, pooling, normalization, dense, softmax. A typical pre-trained model or deep learning network architecture looks like this, only still more complex; for simplicity I have used this one. In order to view the network architecture of your own model, you can take the assistance of these open-source tools. Along with the model graph, you will also be able to look into your loss functions, input images, the list of hyperparameters being used and how they vary over time, and so on; all of this can be done using TensorBoard. You can check out the other tools as well; they are all quite friendly. You just pass in your model and they give out a graph, you can validate which operations are supported, and they will show the runtime for them.

Next, state-of-the-art applications. Till now we have seen the pre-trained models available from TensorFlow. But what if we want to develop a model ourselves? There comes model fine-tuning: we perform transfer learning. We prepare the dataset for our task, find a pre-trained model and fine-tune it, and then convert and deploy it to the end device we are targeting. (A small sketch of that fine-tuning step follows below.)

This is a snapshot from Google's Teachable Machine, where we upload a dataset, click on train, and can straight away export the model to the target device. And this is a sample app I developed in the browser for demonstration purposes, whose code you can find here; this is the URL where we uploaded it. I uploaded some 20 random images of Tom and 20 random images of Jerry, and clicked on train. These are the hyperparameters I chose; you can vary them based on your task. Once you click on train, in less than a minute it completes your model training and gives out the result. You can check your results by uploading some random images; as you can see, it classified Tom with 98% probability, even though we provided just 20 images. You can play around with it, and once your model is done, you can export it to Android, iOS, Coral, Arduino, or Raspberry Pi, any of those targets.

The result looks something like this when deployed on my Android phone. You can download this from the GitHub page and open it straight away with Android Studio; you can either see it on the emulator or connect your mobile phone over USB and see the app launched on your device, just like this. It asks you to upload an image and shows the probability of that image for the given labels. For simplicity I have explained a cartoon classification app, but you can really do many more complex things: you can design driverless engines, develop security systems, integrate it with a Raspberry Pi or an Arduino and build security systems, face-based locks, and so on. You can visit that website and see the experiments shared by the other users there.
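Here is that fine-tuning step as a minimal Keras sketch (my own illustration, not Teachable Machine's actual code): a frozen MobileNetV2 base with a small new head for the two-class Tom-versus-Jerry task. The result could then be converted with the TFLite converter shown earlier and dropped into the Android sample app.

```python
import tensorflow as tf

# Minimal transfer-learning sketch: reuse a pre-trained MobileNetV2 as a
# frozen feature extractor and train only a small classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False  # keep the pre-trained weights fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2, activation="softmax"),  # Tom vs. Jerry
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=10)  # train_ds: your ~20 images per class
```

With the base frozen, only the small Dense head is trained, which is why a few dozen images and under a minute of training can already give a usable classifier.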
Next comes benchmarking. While running all these experiments on mobile devices, there are certain factors that stress the hardware, and measuring them is important in order to assess the performance of our device. Some such metrics are accuracy, throughput, latency, energy and power, hardware cost, flexibility, and scalability. By accuracy, we mean the accuracy of the model we are running; latency is the time taken by the model to show results; energy and power cover how much memory is used and how much power is consumed; hardware cost is the cost of the device; flexibility asks whether, beyond the current application we are running, the device can handle parallel execution and similar workloads; and scalability is how performance scales as the number of devices increases. These are the factors we need to measure while assessing performance.

AI Benchmark V4: this is an APK used to assess the functional performance of a device. You can download it from the Play Store or from this link and run it on your mobile device. On the device, it performs a whole series of deep learning tasks: image classification, deblurring, upscaling an image to HD, bokeh effects, photo enhancement, and text completion, where, given a paragraph with blanks, it fills in all the blanks by itself using an LSTM model. It performs all these tasks with the help of models spanning CNNs, RNNs, GANs, LSTMs, all the complex architectures. For running each of these models, it uses the inference modes we saw earlier, FP16, INT8, and so on, and how each mode is used by each model is measured here. Once all the tests are done, it infers an overall AI score based on all these tasks. There is a complexity weighting associated: floating-point models get a higher score compared to the INT8 ones, so it assigns a weighted average to each task and, computing the geometric mean, gives out the overall score. And these are the hardware accelerators used for each task, like how much CPU and how much quantized DSP runtime were used; it accounts for all these accelerator times.

[Moderator: Two minutes remaining.] Yeah. So once this functional assessment is done, there come battery and thermal. Battery is also a key factor while performing any DL task, and PCMark 10 is a benchmarking application to assess battery life. Running PCMark 10, we can come to the conclusion that runtime and frames per second are inversely proportional to battery life. And DeepMind came up with a technique called Adaptive Battery: based on the hardware acceleration we have on Android, it automatically tunes the battery to optimize our tasks. Thermal HAL: this is supported on all our Android devices, where the device skin temperature triggers events based on the temperature sensors used by the HAL, and the phone is operated according to the status code triggered; for example, if a shutdown is triggered, the phone will immediately go to sleep mode.

Optimization: as we have understood, our models need to be optimized in order to run on mobile devices. Two such optimization techniques that are popularly used are pruning and quantization. In pruning, we can do two types: we can either remove neurons or remove connections, based on threshold values we apply depending on the architecture. In quantization, we convert our floating-point numbers to INT8 numbers, thereby reducing the storage and bit width; this increases execution speed and decreases storage space, and that is how we gain performance. A small sketch of both follows.
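Here is a minimal sketch of both techniques, my own illustration using the TensorFlow Model Optimization Toolkit (one popular implementation, not necessarily the tool on the slide): magnitude pruning wraps a layer so its lowest-magnitude connections are zeroed out during fine-tuning, and post-training quantization converts the float weights to INT8 at export time.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Pruning: wrap a dense layer so 50% of its connections (the lowest
# magnitude weights) are zeroed out during fine-tuning. Fine-tune with the
# tfmot.sparsity.keras.UpdatePruningStep callback before exporting.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0)
}
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tfmot.sparsity.keras.prune_low_magnitude(
        tf.keras.layers.Dense(128, activation="relu"), **pruning_params),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Quantization: strip the pruning wrappers, then convert float weights to
# INT8 with the TFLite converter (dynamic-range post-training quantization).
final_model = tfmot.sparsity.keras.strip_pruning(model)
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_model = converter.convert()
```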
And NetAdapt is one such technique that automatically optimizes our network to a resource budget, based on the hardware acceleration we have. These are some other popular optimization techniques; you can take a look at them afterwards. This one is an open-source tool provided by Qualcomm, at this link, for doing compression and quantization: with it you can optimize your model without much reduction in accuracy.

[Moderator: You have to stop, time is up.] Okay, so I'll just go over the slide headings. TVM is a compiler-level optimization where you can optimize both the graph and the operators. AutoML for mobile gives out a neural network architecture based on the hardware spec of the Android device: it employs a reinforcement learning model and outputs the neural network architecture for our target. And these are some of the open challenges, like on-device training and battery and thermal mitigation. You can find the slides at this ID, and please reach out to me on Twitter in case of any queries. Thank you.

Thank you for the wonderful talk.