Hi, welcome everyone to this talk on the AI Model Efficiency Toolkit. My name is Abhijit Khobare, and I work for Qualcomm AI Research. I focus on the field of model efficiency, and in this talk I want to motivate why model efficiency is important and introduce this toolkit to you. While my team and I focus on the model efficiency space, we also build tools that help engineers within Qualcomm, and we want to impact the wider community as well.

As we all know, deep learning and deep neural networks are all the rage currently. Over time, networks have grown more and more complex: they have grown in size, and they have grown in computational complexity. If you take all of this together and put energy consumption on the y-axis, the energy consumption of these networks keeps growing roughly exponentially, and there is no limit in sight on when this will end. While this is all good for us, we need to solve this in some better fashion, because clearly we can't let the energy consumption of deep neural network solutions keep growing at the current pace.

So how do we go about this? Traditionally, a lot of neural networks run on the cloud, and that will continue to be the case; a lot of applications have been enabled by the cloud, and that's not going away. However, look at the devices around you right now: cell phones, tablets, computers, smart watches, cars, TVs, refrigerators, AR and VR headsets. There's a whole bunch of devices around us, and each of them is an AI device, if you will. Even today they are probably running some machine learning, some deep neural networks, and in the future this is only going to keep growing. And this is actually a good thing: on-device AI will complement AI running on the cloud. Neither one is the winner; they complement each other.

On-device AI has certain characteristics that make it appealing. One is privacy. Say something is running on your phone, listening to your voice and doing something for you, like writing an email or typing something up. The interactions we have with the devices we touch are generally private, so privacy is a big concern, and keeping that communication local is appealing to users. Similarly, in areas like automotive you require a lot of reliability: you don't want your car to crash somewhere just because your network connection went down, so the device has to keep operating irrespective of what kind of network connection you have. You also need low latency, which may be impacted by a round trip to the cloud and back. And these devices keep producing an ever-growing, enormous amount of information.
Even with 5G networks coming up, where we are going to see humongous growth in network capability, keeping unneeded information off the network is still good: you reduce network bandwidth, save energy, and help the planet.

So what's the problem? If you look at deep learning and machine learning workloads today, they are very compute intensive. And while they're compute intensive, depending on the application they may also have real-time constraints: you may want to look at a video stream and do segmentation on the fly, or change the lighting conditions as you look at yourself in the phone. Some applications are always on, for example when our watch is listening for a wake phrase like "Hey Google". So there are all these challenging AI workloads, but on the right-hand side, these mobile and edge devices have a very constrained environment. Whether or not they are battery-powered, you want them to be very efficient. They are generally sleek, lightweight designs; if they are battery-powered, they require long battery life. They also have limited storage and memory, maybe limited bandwidth, and maybe limited compute as well. As Moore's law keeps evolving, the devices become more and more powerful, but they will always be constrained compared to the cloud. So we can't simply do on the device what we do on the cloud; that is not going to scale.

So what do we do? We at Qualcomm believe we have to tackle this in multiple ways. In the middle bar there, you see some of the ways we want to make these models efficient. One is model quantization; we will look a little at what that means. At a high level, instead of running at high precision, say 32-bit floating-point models as you would during training or even inference on the cloud, on edge devices you can scale down and use lower-precision math, and we'll look at why that is useful. Then there is model compression, which you see on the left: can we take these models that we developed after a lot of sweat and tears and, given a particular use case, compress them down? A model may be designed to detect a whole lot of things, but you specifically want to detect faces. Can we compress the model for that particular use case so that it is smaller and less compute intensive? That's called model compression, and we'll look at that a little bit. And there are other techniques, like compilation: can a machine learning compiler optimize this, just like a compiler looks at our code, takes away redundancy, and makes it more efficient at runtime?
Can we do the same thing with machine learning models? We are exploring all these angles. And at the bottom are different hardware architectures and hardware accelerators that are being researched, including things like compute-in-memory: do we need a dedicated compute unit, or can some of the compute just be done in memory? So there are all these exciting areas for making models efficient. For this particular talk, we are going to focus on the ones in the middle: model quantization and model compression. These are the techniques built into the AI Model Efficiency Toolkit. So let's move to the next slide.

Okay, so what is network quantization? At a very high level, the picture on the right tries to motivate it through a completely unrelated example: if you take an image and reduce its pixel bit width, you see artifacts going from 24 to 8 bits, and if you go down to one bit you basically lose the color, and so on. You can apply a similar analogy to a neural network model. When we train models, we have 32-bit floating-point numbers for the model parameters, and each layer, say a convolutional layer, convolves using 32-bit floating-point multiply-accumulates. Can we change that to integer? Going from floating-point compute to integer compute is itself a big leap. But then, do we need 32 bits? Can we scale down to 16 bits? Can we scale down to 8 bits, maybe beyond? Scaling down from floating point to integer, and then scaling the bit width down from 32 to, say, 8, is generally what is called network quantization.

The way this works is that not only do the parameters get quantized, the compute gets quantized as well. In this picture, I don't have a pointer, but look at the first blue layer; that's a convolutional layer. Then you have a second layer in yellow, also a convolutional layer. Say the input comes in as 8-bit integers, and you have 8-bit parameters feeding into the blue convolutional layer. The output of that layer would be at an accumulator bit width: as you multiply and add, you soon grow from 8 bits back to, say, 32 bits, because depending on how much you're convolving, you need to accumulate a lot of values. So you convert those 32-bit integers back into 8 bits, and the way you do that is by finding a scale factor for every layer. We'll take a look at that.

But before we get there: why do all this? What do we gain? Of course, we can do the math and say going from 32 bits to 8 bits is a 4x reduction in memory usage, as you can see in the left picture. But that's not the whole story. If you look at the latency, it's not only that we have reduced the memory; we have actually reduced the compute needed, going from floating-point math to integer math.
And so that gives a speedup, if you will, in how many inferences per second you can do. But now look at other things, where you see more dramatic gains. If you look at the power consumption numbers for add and multiply in the tables on the left, you see dramatic gains: we are talking about factors of 20 or 30 for 8-bit integer versus 32-bit floating point. Don't take these exact numbers too literally; it all depends on how your hardware is designed. But the takeaway is that you get dramatic gains. In the same way, using less memory is not only good in itself; it also reduces your memory accesses. A lot of the time when you run AI workloads, you are taking the model parameters, or even the activations from a layer, saving them to memory and reading them back, so you go back and forth to memory a lot. If you have less to fetch from memory, that's better for you: less power consumption, faster, and so on. And on the right, simply because integer arithmetic is much simpler than floating-point arithmetic, you see a dramatic reduction in silicon area as well. Silicon area can translate to cost, and it can also translate to more capability: if something requires less area, you can do more of it. All of these are trade-offs we can play with, but across the board there are significant benefits to doing things in 8-bit integer.

So all that is good; let's do it. Of course, there's no free lunch, so everything comes at a cost. There are lots of ways to do these 8-bit integer computations, but generally it is done on a per-layer basis. Say you have a convolutional layer: you find a scale factor for that layer. The table on the left is a little dense, but bear with me. Look at that A-transpose-looking matrix on the left and pretend it's a weight matrix with some floating-point numbers; they are shown with only two decimal places, but think of them as floating-point numbers. The way you convert them into integers is to find a scale factor, in this case, say, 1/255. You keep that scale factor, remembering it for this layer. Now if you apply that scale factor to these floating-point numbers, you get integers, and they happen to fall nicely between 0 and 255. And 0 to 255 is 2^8 = 256 values, so we can represent these numbers in 8 bits. That's how things are done.

But what is the meaning of these numbers? If you take them and start multiplying and adding them, you get some other numbers, but eventually you have to convert back into something you can understand, back into the float domain, if you will. And when you apply the scale and then remove the scale and look at the output, you will see that there is an error. There is an error because a 32-bit floating-point number has a lot of precision, while with 8-bit integers you have only 256 values.
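Here is a minimal sketch of that per-layer scale-factor idea, with made-up weights and scale rather than anything out of AIMET; it just shows how the quantize/dequantize round trip introduces a rounding error:

```python
import numpy as np

def quantize(x, scale, num_bits=8):
    """Map float values to integers: round(x / scale), clamped to the 8-bit range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)

def dequantize(q, scale):
    """Map the integers back to the float domain by reapplying the scale."""
    return q.astype(np.float32) * scale

# Hypothetical per-layer weights and scale factor (not from a real model).
weights = np.array([0.13, 0.49, 0.31, 1.00], dtype=np.float32)
scale = 1.0 / 255.0            # one scale factor, remembered for this layer

q = quantize(weights, scale)   # [ 33 125  79 255] -- representable in 8 bits
w_hat = dequantize(q, scale)   # back to floats, with rounding error
print(q)
print(w_hat - weights)         # the quantization (rounding) error
```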
So clearly you can represent a lot more values with floating-point numbers than with 8-bit integers. What we are doing is mapping whole buckets of floating-point numbers onto one number represented by an integer, so there is a rounding error, if you will, and all of this translates into errors at the output of the model. We want to reduce this. We want to have our cake and eat it too: we want all of those benefits we saw earlier with integers, but without losing any accuracy. That's the whole trick here.

We at Qualcomm AI Research have been doing a whole bunch of research on quantization techniques. You'll see some papers at the top, including papers on novel techniques called AdaRound and Bayesian Bits that are coming up at ICML. We have been doing this research and publishing what we find, not just to help us but to help everybody, the whole industry. But what we are also trying to do is take these techniques and make them available through tools, so that instead of reading a paper and redoing the math themselves, users can just use a tool. So it's a two-pronged strategy.

Having said that, we created this toolkit called the AI Model Efficiency Toolkit, AIMET for short. This toolkit provides model quantization and model compression techniques. The way we have designed AIMET, it takes in a trained model, which could be a PyTorch or TensorFlow model. You feed that model into the tool and out comes a more optimized model. The toolkit itself is not meant to create a quantized model, for example; it's an efficiency toolkit, so it optimizes. It makes the model better for something: better for running on quantized hardware, or smaller by compressing it, and you can apply both together.

As we will see, for some of the techniques, both quantization and compression, you may want to train the model a little further, and this gives dramatic improvements in accuracy. So AIMET has been designed such that the optimized model you get back can still be trained; it's still a training-time model, not a model that just runs on target. You can train it a little, improve accuracy, and then take it to target. The figure in the middle shows how you inject AIMET into your workflow; it's designed to be a plug-in to your existing workflow.

And just in May, we made this an open-source project. We are very, very happy that it's available as open source; you can access it at github.com/quic/aimet. QuIC stands for Qualcomm Innovation Center; that's the entity that makes AIMET available. There are a lot of other open-source projects in the model efficiency field, and certainly in the machine learning field, so what we have tried to do with this project, in addition to making the source available, is make it user friendly. It includes documentation, code examples, API documentation, and documentation on the techniques. We also created video tutorials that we have uploaded to YouTube; we'll have a link to those later on.
So we've tried to make this not only something that people can contribute to and look at the source of, but also something that can be easily used.

Why did we make it open source; what are our goals? The overall goal is to enable the whole ecosystem, academia and industry, to leverage these low-power edge devices and push more and more AI workloads to the edge. One of the ways to impact the ecosystem is by releasing open-source tools, because that's a much better way of collaborating with others; there are a whole bunch of other tools being developed in this space as well. And, to the second point, we don't want to just repeat what others are doing; we want to do something additive. So we have designed these tools to plug in with PyTorch or TensorFlow: we build on top of other tools, and users may have other tools of their own. AIMET is designed so you can layer it on; it's not meant to replace anything, it's meant to be added on. Overall, if at the end of this we can drive the community towards low-precision inference, that's the main goal. Generally there's a belief out there: models run fine with 32-bit floats, maybe 16-bit floats is good enough, but 16-bit integers are going to be a problem, and 8-bit integers are definitely not going to work. Some folks have this perception, and we want to remove it and make it easily possible for users to move towards integer inference.

So here's a very high-level, thousand-foot view of the architecture. What I want you to take away is that there are a bunch of techniques built into this toolkit. I won't go into detail on every one of them here, but the documentation and videos are there, so you can browse those at your leisure; I will motivate a few of the techniques a little. The way we have designed the toolkit, the model optimization part of the code is separate from the extensions for TensorFlow and PyTorch. PyTorch and TensorFlow are the more common training frameworks, so we wanted to make it easy for people to plug their models into AIMET, and we have these extensions built in. But we didn't want the optimizations to be tied to the extensions: if tomorrow a collaborator wants to use them with another framework, maybe a homegrown one, they can use the optimizations directly. If you use the TensorFlow or PyTorch extensions, you get higher-level APIs; if you use the model optimization library directly, you have somewhat lower-level APIs, but it is possible.

So here's a one-slide introduction to a couple of features on the quantization side. Like I said, there are a bunch of features already in, and we are continuing to add more, like the AdaRound feature, with the paper coming out at ICML next month.
We are going to bring that into AIMET as well. But here are two features. The one on the left is called data-free quantization. This was a paper we released last year, and the technique applies very well to traditionally harder-to-quantize models like the MobileNet family from Google. The MobileNet architecture is designed with depthwise convolutional layers, or depthwise-separable layers, and they are very efficient for vision-style machine learning use cases. But when we take these models, for example the MobileNet v2 model, to a quantized target, you see a sharp drop in accuracy. Depending on how the model was trained, the drop will be more or less sharp, but it is sharp enough that the model is not usable.

The technique on the left, data-free quantization, is a post-training technique, meaning we don't require the user to do any further training. You simply apply the technique and get back a model that is better for quantization, in a somewhat magical fashion. There are different parts to it. The first parts, cross-layer equalization and bias absorption, look at the layers in the model and equalize the weights of adjacent layers, so that across the channels in those layers you have more uniformity. This helps with quantization because when we try to find that scale factor across channels, disparate ranges in the channels lead to non-optimal scale factors; if the channels are more or less homogeneous, it's easier to find better scale factors. The last part, bias correction, addresses an interesting artifact: with these depthwise-separable layers you have fewer parameters, so there's more of a chance that simply rounding to the nearest value produces a shift in the output of a layer. We can observe that shift by passing some data through, and then we can correct it.

So the one on the left, data-free quantization, is a post-training technique. The one on the right is quantization-aware training. This is a technique you'll find in other tools as well, but we have added a few tweaks to it. Overall, you have a model; here you see a snippet of it, a convolutional layer followed by a bias add followed by a ReLU. We add those green bubbles, called quantization simulation nodes, into the model, and they simulate the quantization noise. Specifically, we saw a few slides back that you apply a scale factor: you can divide by the scale and then multiply by the scale, and you get exactly the noise that comes from rounding to the nearest integer. Those green nodes simulate that noise. And when you simulate that noise in your forward pass, doing inference with the simulation nodes built in, the accuracy starts mimicking the accuracy you will see on a quantized target. That's useful in itself.
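As a rough illustration of what such a simulation node computes, here is a minimal sketch of quantize-dequantize noise injection, not AIMET's actual op:

```python
import torch

def simulate_quantization(x: torch.Tensor, scale: float, num_bits: int = 8) -> torch.Tensor:
    """Inject quantization noise: scale down, round to the integer grid, scale back up.

    The output stays in float but only takes values an 8-bit target could
    represent, so a forward pass through the model mimics on-target accuracy.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    q = torch.clamp(torch.round(x / scale), qmin, qmax)
    return q * scale

# Hypothetical activation tensor and scale factor, just for illustration.
activations = torch.rand(1, 16, 8, 8)
noisy = simulate_quantization(activations, scale=1.0 / 255.0)
print((noisy - activations).abs().max())   # bounded by roughly scale / 2
```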
So you get a simulated score for roughly how much accuracy you're going to get on target. But now you can also train with this. What happens during this training is interesting: the model knows there is this noise in the forward path, and it learns to counteract that noise.

There are a couple of tweaks we've added with those green ops. First, we have a way of inserting those nodes in the right places. It turns out you need to insert them where they match how things will run on target; inserting them all over the place, say between the convolution and the bias add, or between the bias add and the ReLU, is not the right thing to do. So we insert them in the right places, and we have a configurable way of doing this, so you can change the configuration and adapt it to a particular runtime. Second, those green nodes also figure out what scale factors to use, and we have an advanced technique for finding them, which we call signal-to-quantization-noise ratio (SQNR). We try to find optimal scale factors, which may exclude certain outlier values, so that we have more resolution over the more probable values we see in a particular activation or weight.

Here are some results. I can understand that from one slide you're not going to grasp everything I said, but if you go to github.com/quic/aimet you will find user guides that explain these techniques in more detail, and we also have those videos on YouTube, so please look at those. We applied these data-free quantization techniques, and you see different models here, but let's look at the MobileNet v2 model. What I want you to take away is that if you compare the 8-bit integer inference against the 32-bit floating-point inference, there is not a whole lot of difference between them; they come very close. You get all those benefits we saw, but with the accuracy you would have gotten in floating point. And you can actually combine the data-free quantization and quantization-aware training techniques: apply data-free quantization first and then do quantization-aware training on top, and that helps quite a bit, further closing the gap. Specifically for MobileNet v2, one interesting data point: you start at 71.72% floating-point accuracy, and depending on which model you use, taking it to an 8-bit integer target can show close to 0% accuracy, a very sharp drop. But even with those sharply dropped models, we have recovered accuracy back almost to where it was.

I mentioned that we've tried to make this as user-friendly as possible, and this is one slide that tries to show that. The data-free quantization techniques, as you saw, are actually multiple techniques, and if you peel the onion there are other things happening too, like batch norm layers getting folded, and so on. All of that gets wrapped up in the first call you see there, equalize_model: make one call, give it a model, in this case a PyTorch model from torchvision.
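Pieced together from the slide as described here and just below, the flow looks roughly like this; exact module paths and signatures vary across AIMET versions, and evaluate_model stands in for the user's own evaluation pipeline:

```python
import torch
from torchvision import models

# AIMET's PyTorch extensions; module paths follow the AIMET docs and may
# differ between releases.
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

def evaluate_model(model, args=None):
    """Placeholder for the user's own pipeline: run the model over (part of)
    the validation set and return accuracy."""
    ...

model = models.mobilenet_v2(pretrained=True).eval()

# One call wraps batch-norm folding, cross-layer equalization and bias
# absorption, modifying the model in place.
equalize_model(model, input_shapes=(1, 3, 224, 224))

# Insert quantization simulation nodes, then compute the encodings
# (scale factors) by passing data through via the callback.
sim = QuantizationSimModel(model, dummy_input=torch.rand(1, 3, 224, 224))
sim.compute_encodings(forward_pass_callback=evaluate_model,
                      forward_pass_callback_args=None)

float_accuracy = evaluate_model(model)      # floating-point accuracy
int8_accuracy = evaluate_model(sim.model)   # simulated 8-bit accuracy
```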
Give it that model, tell it the input shape, and that's it: the model is modified in place and all those techniques get applied. The second half shows how you apply the quantization simulation nodes. Again, you give it a model and make one call to compute encodings, which finds those scale factors (we call them encodings as well). Once you've done that, you have a simulated model, sim.model, with the simulation nodes inserted into it. Now you can invoke your existing pipeline; here evaluate_model is the user's evaluation pipeline. If you call evaluate_model and pass the original model, you get the floating-point accuracy; if you call evaluate_model and pass sim.model, you get a simulation of the 8-bit integer accuracy. It's fairly simple, and it plugs into existing pipelines; I think that's the main takeaway from the slide.

Going on to compression: there are a bunch of compression features. The one on the left, spatial SVD, is a tensor-decomposition kind of feature. What happens here is you take a layer, say a convolutional layer, and split it into two layers. You might say that's not compressing, that's inflating, but the way you split it gives you two smaller layers, and even combined they are much smaller than the layer you started out with. The way this works: you have a weight matrix in the convolutional layer you started with; we flatten it into two-dimensional space and apply the singular value decomposition technique from your math classes, and you get reduced matrices. Generally, singular value decomposition gives you three matrices, the middle one being a diagonal matrix of singular values, some of them small numbers. You can throw away some of those small values, then multiply the remaining diagonal back into the matrices on the left and right, and you get two smaller layers.

The technique on the right, channel pruning, takes convolutional layers and throws away some of their input channels, because maybe not all of the features are equally important to the final accuracy of the model, given your task. It changes the model architecture: since you have changed the dimensions of this particular layer, it goes to the previous layers and changes their dimensions, and so on. One thing I want you to take away: while there are nitty-gritty details and math in how these techniques are applied, the selection of how much to compress each layer is done in an automatic fashion. That automatic per-layer compression selection is one of the key takeaways. So it's easy to use, and you don't have to use only the easy APIs; you can go to the more underlying APIs if you want to try something different. These techniques, spatial SVD and channel pruning, can also be applied back to back.
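Here is a minimal sketch of that SVD idea on a flattened weight matrix, in plain NumPy with an arbitrary rank rather than AIMET's automatic selection:

```python
import numpy as np

# Hypothetical flattened conv weight matrix: 256 output x 1152 input dims.
W = np.random.randn(256, 1152).astype(np.float32)

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U @ diag(s) @ Vt

rank = 64                       # keep the 64 largest singular values, drop the rest
A = U[:, :rank] * s[:rank]      # first, smaller layer:  256 x 64
B = Vt[:rank, :]                # second, smaller layer:  64 x 1152

# Two layers with 256*64 + 64*1152 = 90,112 parameters approximate one layer
# with 256*1152 = 294,912 parameters: roughly a 3.3x reduction.
# (For a random matrix this error is large; trained layers typically have
# fast-decaying singular values, so the low-rank split is much tighter.)
approx_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative approximation error: {approx_error:.3f}")
```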
And here are some results, on ResNet-50 and ResNet-18, some of the more popular models. You can see we achieve quite a good reduction, a 50% MAC (multiply-accumulate) reduction, while the accuracy stays very close to where we started. Very interesting and heartening results.

And I have a demo to show you, so I'm going to share my screen. It's a real-time pose estimation model; let me share my screen so it's easier for you to see. What you're going to see in this video: here is a computer monitor playing a video of some people dancing near the beach, and we have two phones here. These are actually commercial phones; no funky business going on. The one on the top is using an uncompressed model; the one on the bottom is using a compressed model. This pose estimation model is a fairly heavy model, so if you run it uncompressed, you get only about five frames per second; it's all gated by how fast you can run the inferences. We compressed that model down more than four times, and you see not just on-paper gains but real gains on device, where the inferences are sped up by close to a factor of four. So I'm going to play this video. The way we've tried to show this is we run the inferences and superimpose the results on top of the video, which is just the video stream from the camera here. On the top you see that the skeletons lag behind the people; that's just our way of showing that the inferences came in late. But on the bottom, with the compressed model, you can see the skeletons track tightly on top of the actual people in the video. So hopefully that was interesting for you. We also have built-in visualizations, tools to help you see how your model is doing: how the compression is going, whether layers are good or bad for quantization. So there are a number of built-in visualizations.

I wanted to end with some insights: as we did this open-source project, what were some of the learnings? One thing we realized is that as people make commits, and as contributors come in, we want those commits to be regressed; even since the time we launched, there have been close to 100 commits within the last month or so. We want to make sure that as the commits come in, they don't break existing functionality. So we are using Jenkins CI, set up on an AWS instance, with Docker images so that we have a reproducible environment, and we drive these jobs from the GitHub pull requests. That has helped a lot; it was one thing we had to think through and decide we would need. Similarly, we had to expand our unit tests considerably. We used to test some things through unit tests and some through full-blown tests with big models, and it just doesn't work to have those big models be part of PR regression; you want the PR regressions to be fairly tight, small, and short. So we expanded our unit tests to cover a whole lot of scenarios, so we don't need to run the full-blown models. And one thing we are still working on: we realized that people have disparate development environments.
We had assumed that most people would have CUDA-enabled workstations with GPUs, and it does help to have CUDA because we accelerate some of these techniques with it. But for developers who don't have that environment, we should make the toolkit available such that they can still run, a little more slowly, but still run. That last item is something we learned and are still working on.

So, in summary: we are very happy to have launched this AI Model Efficiency Toolkit. It has these state-of-the-art quantization and compression techniques, with more in the works, and we hope to work along with the community and keep adding more. We would really love to see contributors come in and help us expand this toolkit. We currently support TensorFlow and PyTorch models; we will perhaps add support for Keras going forward, and with more contributions we could do more. The main thing is that we are trying to offer a user-friendly way of doing these optimizations. So please come visit us and collaborate with us at github.com/quic/aimet. And here's a link: if you go to YouTube and search for Qualcomm Innovation Center, or "QuIC AIMET", you will find a set of about six videos that go into much more detail on these features, including code examples. So that's it; thanks for paying attention and for letting me explain these things to you.

I see that some questions have come in. One question is: if we quantize the parameters, won't we lose model accuracy? I guess the question is what happens to accuracy when you quantize. As we saw through the deck, it all depends. For some models, ResNet-18 is a good example, just basic quantization techniques let you run on 8-bit integers without a huge drop, less than a percent of accuracy. Other models, like the MobileNet v2 example we saw, show a sharp drop in accuracy. So it all depends on the model. We believe these techniques help with many models, but they are not magic bullets for all models. We continue to research other techniques, like AdaRound and Bayesian Bits, and more is coming. We think that the vast majority of models out there can run on 8-bit integer, and for certain models where that is super hard, some layers, not all, could run at 16-bit integer. With 16-bit integer, our experience has been that practically all models run with almost no drop in accuracy. So yes, quantizing does have an impact, but the techniques recover the accuracy, and that's the reason for tools like AIMET.

Another question was about REST APIs; I assume the question is whether this can be invoked using REST APIs. While there is nothing built into this tool for REST APIs, you could host the tool anywhere and build REST APIs on top; that should be fairly simple to do. So nothing precludes REST API support from being added; it just isn't there today. Another question: does Qualcomm have any specific chip or edge device for AI or deep learning?
Yes. If you look at cell phones, the most common devices you have access to, a lot of them have Snapdragon chipsets in them, and Snapdragon chipsets have dedicated AI inference engines, which also run integer quantized models and get all those benefits we looked at up front. So yes, Qualcomm is definitely in the edge AI business, if you will, and we have different solutions for this.

Can I paste the link to GitHub? Actually, I don't think I need to paste it; you can just go to github.com/quic/aimet: QUIC, then AIMET.

Another question: will this work without NVIDIA GPU cards, say with OpenCL? The installation procedure says CUDA is a requirement. Thanks for that question, Alexandra. Like I mentioned, one of the key learnings we've had is that people's development environments differ. At the very moment, yes, the way it is built it assumes CUDA, but we are working on a version that would not have CUDA requirements. And if you wanted to use other accelerators like OpenCL or whatnot, you can definitely do that; you'll have to make a few code changes, but please come work with us. There is a forum for Q&A, reachable from that same GitHub location, and we would love to chat with you to see if that's something we can collaborate on.

Maybe one last question: how do we select features which do not create an impact? Do we independently train the model for each feature? I'm not sure I follow the question exactly, so let me answer in a more generic fashion. Take ResNet-50 as an architecture. It's a very robust architecture for extracting features in vision use cases: give it images and it will find features in them. Now, people generally put some sort of head network on top of the ResNet when they are interested in learning something specific; maybe you want to do face detection, and that's your task. My point is that the base model can do a lot more than your immediate task requires, so for a use case like this, that base ResNet-50 model is over-parameterized. That is a very good use case for applying the model compression techniques we looked at: that's where you can get those gains, reducing the complexity of the model in an automatic fashion without taking a hit on accuracy.

And I see we're right about at time. So thanks a lot for attending this talk; I had real fun. Please continue the questions on the Slack channel, and I'll stick around and answer some of them. Thanks for giving me this opportunity, and thanks to the Linux Foundation as well.