Welcome to our talk, Living on the Edge. My name is Aaron. I'm a software engineer at Collabora, and I'll be joined by Marcus, who is a machine learning scientist, also at Collabora. Today we're going to talk about an R&D project we launched at the beginning of this year to see if we could build a purely open source stack to do AI on the edge. I will give an overview of the problem that we're trying to solve, then Marcus will dig into some of the technical details, and at the end we'll give a demo of our results.

To begin, I'm going to give a brief overview of the artificial intelligence field. This is a very broad area, but I'm going to focus today on object detection. Here in the slide we have four different tasks, moving from left to right, from easier to more difficult. On the left we have classification, where we simply want to find the most important object in the image and classify that object; in this case it's a cat. Moving to the right we add localization, so we put a bounding box around the object that we have identified. Moving to the right again we have object detection. Here we have multiple objects in the scene, and we want to detect each of those objects and put a bounding box around them. Finally, on the right we have the hardest task of all, which is segmentation, where we segment the boundary of each of the objects in the image.

What's interesting about object detection is that up until 15 or 20 years ago it was generally believed that it would take hundreds of years to create a computer system that could achieve parity with humans at object detection, simply because our visual system has been evolving for millions of years and we're very good at detecting objects. And yet in the past five years we've developed systems that can beat humans at object detection. So let's take a look at how we got to this stage so quickly.

The current approach to AI is based on a philosophy called connectionism, where we look at the mammalian nervous system and create systems that are analogous to it on a computer. Here we have two neurons from the nervous system. The neuron on the left is connected to the neuron on the right through a synapse, shown in yellow. On the left of that neuron we have inputs from other neurons, which are voltages coming in through the dendrites, and when the total voltage reaches the activation potential, that neuron fires its voltage along the synapse to the neuron on the right. That neuron also has inputs from other neurons, and when its total input reaches the activation threshold, it fires as well. So each neuron is a relatively simple mechanism: it's either firing or it's not firing. But when you put a lot of these neurons together, connect them, and modify the connections based on what you learn about the environment you're in, then you get intelligence, and that's how our brains actually work. We have around 85 billion neurons and many trillions of connections between them, and we modify those connections as we grow and as we learn about our environment.

Here is an analogous system that we run on a computer, called a deep learning network. We have layers of neurons: the neurons in the brain have become those round nodes, and instead of voltages we have numbers going into and coming out of those nodes. There's an input layer on the left, then three hidden layers of neurons, and then an output layer on the right.
And because this network is many layers deep, we call it a deep neural net, and that's where the term deep learning comes from. Let's take a look at the connections between the neurons in our model. When you see those arrows between the neurons, there are two things going on in each arrow. First of all, when a neuron outputs a number, it gets passed through what's called an activation function, which transforms the output in a nonlinear way. The reason is that most interesting problems in the world are nonlinear; if we didn't have the activation function, we would end up with just a linear system of equations, which is quite easy to solve but doesn't really solve interesting problems. So we need a nonlinearity in our network, and that's provided by the activation function. Once the output of a neuron has passed through the activation function, we apply a weight to it, which simply means multiplying it by a certain number. The weights are crucial for training the network, because as we look at different types of data we're going to be adjusting those weights so that our model makes accurate predictions.

The first phase is the training phase. Here we have training data for which we know what we want the system to output. We feed the training data into the system and look at what it actually gives us, and if it's different from what we expect, we move backwards through the network and adjust the weights. The weights usually start in a random state, because we don't know what they should be. Once we see what the output is on the training data, we propagate the changes backwards from the output towards the input (this is called back propagation), and then we pass the training data through the network again. Hopefully, as we iterate and adjust the weights, the network eventually converges: each time we run it the weight changes get smaller and smaller, and the output gets close to what we want on the training data. This is a very compute-intensive process, so we would typically do it on many discrete GPUs.

Once we've trained the network, the weights are fixed, and the connections between the neurons are fixed too, because during the training process we can also add and remove connections between neurons. Once the weights and the connections are set, we move to the inference phase. This is where we take data that we've never seen before and pass it into the network, and if we've trained it well, the patterns the model learned from the training data will show up in the output for the new data as well. Inference is a lot less computationally intensive, so it's suitable for running on a low-power edge device with a low-power chip. That's what we're going to look at next: running AI on an edge device.

An edge device is typically a resource-constrained device: limited memory, limited compute power, and a limited power envelope; perhaps it's running on a battery. So the primary concern we have is efficiency. We have to be as accurate as possible with the minimum compute we possibly can, so that we get as much performance as possible out of these chips.
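To make the weights-plus-activation idea concrete, here is a minimal NumPy sketch of a forward pass through a tiny network. The layer sizes and the ReLU activation are illustrative assumptions, not the model from the talk.

```python
import numpy as np

def relu(x):
    # The nonlinear activation: without it, stacking layers would
    # collapse into a single linear transformation.
    return np.maximum(0.0, x)

def forward(x, layers):
    # Each layer is a (weights, bias) pair; a neuron's output is the
    # activation applied to the weighted sum of its inputs.
    for weights, bias in layers:
        x = relu(weights @ x + bias)
    return x

# Toy network: 4 inputs -> 3 hidden units -> 2 outputs.
# The random weights stand in for what training would adjust.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 4)), np.zeros(3)),
    (rng.normal(size=(2, 3)), np.zeros(2)),
]
print(forward(rng.normal(size=4), layers))
```

Training is then a matter of comparing the output to the expected answer and nudging those weight matrices, which is exactly what back propagation automates.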
There are two things we do when we're trying to become more efficient on the edge, and the first one is pruning. If you look back at the diagram of the deep network, you see that there are lots of connections between the neurons and there are many layers of neurons. It turns out that we can sometimes remove some of those connections, and even remove some of those layers, and still maintain the accuracy of the network. That's what pruning involves. The second step is called quantization. When we did the training step, we would typically have weights stored as 32-bit numbers. But it turns out that it's possible to reduce the precision of those weights and still get an accurate network, so instead of 32-bit weights we can drop down to 16-bit or even to 8-bit. The advantage of dropping precision is, first of all, that our memory requirements go down, because it takes less memory to store the weights. Second, we can make use of vector operations on the chips: instead of doing a single operation on one 32-bit weight, we can do a single operation on four 8-bit weights if we go down to 8-bit precision. Those are two very important strategies for maintaining efficiency on the edge.

Now, what types of edge devices are people using to do AI? This is the NVIDIA Tegra, the Xavier version, which is a popular solution. It has an 8-core ARM CPU and a GPU with 512 CUDA cores. We have a number of issues with the Tegra solution. First of all, it's a closed source solution, so we don't get the advantages of an open system. Second, we have lock-in from CUDA, which is the language that NVIDIA uses to talk to its hardware. CUDA only runs on NVIDIA chips, so if you start to use CUDA in your system it's very hard to port it to other hardware, and you're locked in. And partly because of that lock-in, they charge a premium on their hardware, so these are very expensive chips.

The objective of our project was to see if we could build an alternative system that was open source, that wouldn't tie us to particular hardware, and that we could modify, change, and extend at will. To do this, the critical piece of the puzzle is the driver: the driver that talks to the hardware and schedules the compute on the GPU. This is where Panfrost enters the picture. Panfrost is an open source driver for ARM Mali GPUs. Of course, ARM has their own proprietary drivers, but Panfrost is an effort to reverse engineer the instruction set and the architecture of these GPUs so that we have an open solution. It's fully upstreamed into Mesa, which is the Linux 3D graphics layer, and the team lead is Alyssa Rosenzweig, who is also working with us at Collabora.

Now let's take a look at some of the features that Panfrost has. Panfrost is a relatively young project, and just as we were launching our open source stack project, Panfrost introduced a new feature: support for GLES 3.2. GLES is the embedded version of the OpenGL graphics API, and GLES 3.2 supports compute shaders; a shader is a small program that you can tell the driver to run on the GPU. This is really important for us, because the neural network requires computation, and we want to run it on the GPU because GPUs are inherently parallel, more efficient, and faster than the CPU for this kind of work. So now, through compute shaders and Panfrost, we have a way of running compute on the GPU. This is very important.
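As a concrete example of what quantization looks like in practice, here is roughly how post-training quantization is done with the TensorFlow Lite converter. The paths are placeholders and this is a generic sketch, not the exact recipe used for the model in the talk.

```python
import tensorflow as tf

# Convert a trained model to TFLite with post-training quantization.
# "./saved_model" is a placeholder for a TensorFlow SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model")

# Ask the converter to quantize: weights drop from float32 to a lower
# precision, which shrinks the file and speeds up inference.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # 16-bit weights

tflite_model = converter.convert()
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```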
Now let's look at the hardware that we would use to run the system. We quickly settled on the Rock Pi 4, which is a single-board computer with a quad-core CPU, a 4-core ARM Mali GPU, 4 GB of memory, and a video decoder. It has a lot of other features, but these are the ones that are most relevant for doing object detection on video, and it's only around 90 US dollars, so it's a very nice and very economical system. The one thing about the Rock Pi is that it does run a little hot, and here's a picture of Marcus's cooling solution, which is a little extreme; alternatively you can just put a big heat sink on the chip and it works fine that way as well.

So we have the hardware, and we have an open driver that talks to the hardware, and now we have to move up the stack to the library that manages the neural network itself. We looked at a bunch of different solutions and settled on TensorFlow Lite, which is a lite version of Google's TensorFlow library. It's designed to run on mobile and IoT devices on the edge. Its focus is on Android and iOS, but we were able to build TensorFlow Lite for a stock Debian ARM distro running on the Rock Pi, so we got TensorFlow Lite to work on the Rock Pi instead of on an Android device.

Now we'll look at some of the features of TensorFlow Lite. Here's a feature that was introduced just last year, called delegates. Normally TensorFlow Lite would run on the edge device's CPU, but with a delegate we can delegate some of the compute to another device, like a DSP or a GPU. We also found that the TensorFlow Lite GPU delegate supports GLES 3.2. So TensorFlow Lite can offload compute to the GPU, and Panfrost supports that path through its GLES 3.2 support, which means we now have all the pieces we need to put the complete solution together. Next, Marcus is going to tell you a bit more about the technical details of how we did it. Marcus.

Thanks, Aaron. Before we go further into the details, I'd like to go over some of the basics of what delegation is. Typically a user starts with an already trained model, be it in TensorFlow, PyTorch, mlpack, or your favorite network library of choice, and uses a converter to convert the model into the TFLite format. This TFLite file is then handed to the interpreter, which runs the model on the device itself. By default the model runs on the CPU, so the interpreter calls out to the CPU op kernels.
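For reference, this is roughly what that default CPU path looks like from Python; the model path and the zeroed input are placeholders, not the actual pipeline from the talk.

```python
import numpy as np
import tensorflow as tf

# Load a converted .tflite file and run one inference on the CPU.
# "model.tflite" is a placeholder path.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# A dummy input with the right shape and dtype stands in for a video frame.
frame = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])
print(predictions.shape)
```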
However, most devices these days, especially mobile phones and other resource-constrained devices, have a bunch of extra chips, like a mobile GPU or a DSP, that we can leverage to accelerate inference or even training, and this is where delegates come in. A delegate acts as a bridge between the TensorFlow Lite runtime and lower-level accelerated APIs. For example, the GPU delegate uses OpenCL and OpenGL to run inference on mobile GPUs on various devices, which now includes the Rock Pi as well.

The natural question here is: why would you use delegates at all? The most obvious benefit is faster inference. The perfect example here is the GPU delegate: because of the highly parallel nature of GPUs, they are usually very good at performing matrix math, such as convolutions or calculating the output of a fully connected layer. As a result, when we used the GPU delegate in our experiments, we observed up to 30x speedups with our optimized model. Another great benefit is lower power consumption. A good example here is a DSP, which is often meant for applications such as multimedia and communications that inherently require lower power consumption. When we used a DSP for inference, the Qualcomm Hexagon DSP in our tests, we saw up to 50% less power.

For running object detection in real time on the Rock Pi we use the GPU delegate, which, as I mentioned before, gives up to 30x speedups on the models we tested, which involve a lot of convolutions. This is how you would use the GPU delegate: the main idea is that you initialize the delegate instance and pass it to the interpreter, and the rest of the logic you would normally write remains pretty much the same. In fact, there's not much else you have to do for delegates except those couple of lines of initialization and cleanup at the end.

So now we know how we can leverage the TensorFlow Lite GPU delegate to accelerate inference. But if you look at recent years, there are several production trends that shape neural network model design as well. On the data center side, the share of neural network inference keeps growing, and it often requires special hardware tweaks to keep the cost down. On the other side, the number of mobile and embedded deployments is going up too, and those platforms usually have very constrained compute resources. All of that requires model design that really takes the hardware into account. In each case, what we are talking about is doing deep learning in an efficient way: instead of very large neural networks, we want models that are as small, as efficient, and as power-effective as possible, so that they can run quickly on, for example, a cell phone or an embedded, resource-constrained device. There are a couple of ways you can do this: you can make the network architecture more efficient, you can optimize the kernels that execute it, or you can develop specific hardware for running neural networks more efficiently. At Collabora we are looking into all of those things; specifically, my team is looking into compression and quantization. The idea of compression is to remove individual weights, or perhaps even better, remove complete feature maps, complete neurons, complete convolutional channels from your network to make it more efficient. In parallel to that, another way to make your network more efficient is quantization.
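In Python, delegating to the GPU really is only a couple of lines, as sketched below. The slide in the talk may have shown the C++ API instead, and the delegate library name here is an assumption that depends on how TensorFlow Lite was built for the board.

```python
import tensorflow as tf

# Load the GPU delegate and hand it to the interpreter; everything
# else (setting inputs, invoke(), reading outputs) stays the same.
# The .so name is an assumption about the local TensorFlow Lite build.
gpu_delegate = tf.lite.experimental.load_delegate(
    "libtensorflowlite_gpu_delegate.so")

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",  # placeholder path
    experimental_delegates=[gpu_delegate])
interpreter.allocate_tensors()
```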
Quantization is based on a simple idea: can a neural network run at lower precision? Normally, when you train a neural network, it's done in 32-bit floating point, which means every number is represented in 32 bits. You can calculate a lot more effectively and efficiently, and you have a lot less memory transfer and energy consumption, if you use fewer bits for each of the weights and activations in the network. So the idea is that you do your calculations in 8 bits: you quantize your whole network, both the weights and all the operations in between, to 8-bit operations, with the result that you are a lot more efficient. However, since 8 bits only give you 256 values, you have to do some special tricks to remap the floating point values onto the few available integers and preserve the precision of the model. The simplest approach is applying an affine transformation, where the parameters, a scale and a zero point, are shared for the entire tensor (there's a small sketch of this below). In general, quantization with minimal accuracy drop has been an area of active research over the past few years, but some of the approaches have already proven to work quite well. Compression and quantization work together to scale the model down, sometimes by a factor of something like 80, so your model becomes a lot smaller than the original model you started with.

So now we have these techniques, namely quantization and compression, but what is the takeaway? The current consensus is to start with a bigger model and then make it smaller, instead of starting with an already small model like a MobileNet or some version of Tiny YOLO. There's a theory from a recently published paper called the lottery ticket hypothesis, which posits, and to a certain extent proves, that you are better off training a larger, over-parameterized network first and then making it smaller afterwards; that's usually better than training a small network from scratch. The paper suggested that you basically want to find a lottery ticket, where a lottery ticket is a network architecture that's very good for your problem. Every feature has a certain probability of being a good feature to use, so the more features you have, the higher the probability of finding those optimal features, and because of that you are better off training a larger, over-parameterized network, since you have a lot more shots at finding the right architecture.

To get an object detection model that runs in real time on the Rock Pi, we used all of those techniques. As a basis we used the YOLOv3 model, and to get an optimized model we halved the number of filters of all convolutional layers with a three-by-three kernel, and also replaced convolutional layers number six and eight with convolutional layers of size one by one, as both layers contain a massive amount of parameters. The optimized model achieves 0.83 accuracy on our data set, which is very close to the original model at about 0.91. The main reason our compression method does not cause a noticeable accuracy drop is that redundant connections in the remaining layers still exist. If you want to go deeper into the details, and you also have a Rock Pi, you can take a look at our GitLab repository; there are instructions to set everything up as well as how to run the model itself. Thanks, and back to Aaron.
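To make the scale and zero-point idea from Marcus's section concrete, here is a minimal per-tensor affine quantization sketch in NumPy. It is illustrative only, not the exact scheme TensorFlow Lite implements internally.

```python
import numpy as np

def quantize(x, num_bits=8):
    # Map float values onto the integer grid [0, 255] using one scale
    # and one zero point shared by the whole tensor.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.clip(np.round(-x.min() / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate float values from the stored integers.
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(6).astype(np.float32)
q, scale, zp = quantize(x)
print(x)
print(dequantize(q, scale, zp))  # close to x, but not exact
```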
So now we're ready to show you the results of our project. But before we do that, here is a brief comparison of the unoptimized and the optimized YOLO model that we used for object detection. You can see in terms of accuracy that the optimized version is not that different from the unoptimized one, just slightly less accurate, but when you go to the frame rate and performance you get a big jump, particularly on the GPU side, where we're getting 27 frames per second. That's essentially a real-time object detector, which is really nice, and we were very happy with those results.

In terms of future directions: first of all, as I mentioned, Panfrost is a relatively young project and GLES 3.2 is a very new feature for Panfrost, so we're hoping that as the driver matures we will get some compute improvements on the GLES side of things. There's another feature on the horizon for Panfrost called mediump, medium precision. If you recall, we were talking about quantization of the model; mediump supports 16-bit precision rather than 32-bit precision in the driver, so if we were able to use mediump we could halve the precision of our models, which should give us a big boost in performance and also reduce our memory requirements and our power usage. Third, on some of the Rock Pi versions there is something called a neural processing unit, or NPU, which is a fancy name for another processor on the chip that is devoted to doing neural network calculations. Panfrost is an effort to reverse engineer the Mali GPU; if we were able to reverse engineer the NPU as well, we would have a second device that we can offload compute to, and the challenge would then be to load-balance the computation between two devices instead of one. That should also give us a boost in performance. And finally, GStreamer. Unfortunately we didn't get to that layer of the stack on this project, but GStreamer, as you may know, is a popular pipeline-based framework for multimedia, for playing audio and video, and currently there's no upstream solution for adding object detection, or any type of neural network processing, into the pipeline. That's a natural thing you'd want to do when you're streaming video: facial recognition, object detection, or any other kind of processing. There's nothing upstream at the moment for GStreamer, so we're hoping to integrate what we've done with Panfrost on the Rock Pi into GStreamer, so that it would be possible to have new elements that support object detection on this device.

And now, finally, we're going to show you our object detector running on the Rock Pi using Panfrost. Here we have a scene from the pre-COVID days that you may remember: a crowded street scene. As you can see, the object detector is performing quite well on people; it's very accurate. It does have some trouble with the baby carriage, which it thinks is a motorcycle, but other than that the accuracy is very good at real-time frame rates, and we were very happy with the results.

In conclusion, we want to thank you all very much for attending our session today, and we both hope you enjoyed it. We will be available to take your questions now through the chat tool, or feel free to find us online at any time if you're interested in any part of this presentation. Thanks again.