It's great to have you here in your free time in the evening. My name is Andrea Pizzoni. I work for Accenture here in Singapore. I know Accenture is usually not a company that is famous for doing deep learning, or machine learning in general, but if I may say so myself, here in Singapore we actually have quite a strong deep learning and innovation group. And really to put the point across, since consultancy companies are usually famous for slides, I have no slides today. Zero slides. I have a Python notebook, and we're going to walk through some code together.

So the idea here is that you want to deploy some models on your mobile phone, and that's what I do for most of my day job. When I tell this to people, they usually tell me it sounds quite crazy: everyone knows mobile cannot really support deep learning. If you have Inception, if you have big models, they're going to take hundreds of megabytes of space and gigabytes of RAM, so you need a big, powerful cloud instance to run them. And that's actually not really true. As Martin said, you have a lot of weapons now at your disposal. First of all, architectures have evolved to make things smaller and faster. And even if you want to go really crazy and use a very big architecture, there are methods such as quantization and weight rounding that allow you to actually do it.

So we're going to take it from the top, and we're going to take it very slow. I know that in this meetup we usually use Keras, so let's say we want to deploy our Keras model on mobile. The catch here is that, if you know how Keras works, after training is done Keras saves your model in its own format (HDF5). We need to convert this into something the mobile side is already equipped to read, which is the TensorFlow Protobuf format; that is the vanilla TensorFlow way of saving models. Fortunately we can do that: Keras supports the TensorFlow backend, so we can do some magic and get Keras models saved in TensorFlow format.

So what we do is build a very, very simple network in Keras, just for show. We're stacking some convolutional layers, some ReLUs. If you have a good eye, you can see something weird happening here with a Lambda layer; don't worry about it for now, we'll get back to that later. Again, a very simple model. We're going to train it, though not right now on this very small MacBook, as the CPU would take a while to train a good model. The point is: while you train your model, you want to be able to save whatever parameters you're creating in TensorFlow format. How to do that? Keras has a very powerful tool for this: Keras callbacks. Keras callbacks, which I think Sam talked about last time, are things that get called automatically whenever a special event happens in Keras. In this case, for example, we can create a callback that runs every time an epoch of learning is done. What we want this callback to do is take the session out of the Keras backend, which is the TensorFlow backend, and have that TensorFlow session save whatever the Keras front end has trained so far. This is very, very simple to do. We create a class for the callback.
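Something along these lines; this is a minimal sketch rather than the exact notebook code, so the class name and checkpoint path are just illustrative:

```python
import tensorflow as tf
from keras import backend as K
from keras.callbacks import Callback

class TFCheckpointSaver(Callback):
    """Save the TensorFlow session behind Keras as checkpoints, every epoch."""

    def __init__(self, checkpoint_dir):
        super(TFCheckpointSaver, self).__init__()
        self.checkpoint_dir = checkpoint_dir
        # Build the saver after the model exists, so it sees all variables
        self.saver = tf.train.Saver()
        # The session Keras is secretly using for its TensorFlow backend
        self.session = K.get_session()

    def on_epoch_end(self, epoch, logs=None):
        # Emits TensorFlow checkpoint files (model.ckpt-<epoch>.*) plus
        # a 'checkpoint' index file in checkpoint_dir
        self.saver.save(self.session,
                        self.checkpoint_dir + '/model.ckpt',
                        global_step=epoch)
```

You construct this after defining the model and pass it to training, something like model.fit(x, y, callbacks=[TFCheckpointSaver('./checkpoints')]).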
We initialize this class with a saver and a session, and on epoch end, so every epoch, we save whatever Keras has trained. This save function is simply a TensorFlow function; you can go to the TensorFlow documentation online and find it. So what happens is that after you run training with this callback, you end up with these checkpoint files. Checkpoint files are the way TensorFlow saves a model in progress. You're not finished yet, and you cannot just use these checkpoint files on mobile, but it's something: you've basically already converted your Keras model to TensorFlow.

So now that we have checkpoint files, let's say we've finished our training. We have a lot of checkpoint files, one saved at every epoch. We want to take the last checkpoint and really freeze the model in TensorFlow format so that we can deploy it on our mobile platform. There are a few ways we can do that, and it's actually not too difficult. One way is to write out the definition of our TensorFlow graph as a Protobuf file just by calling tf.train.write_graph. That's it, very simple. Since we already extracted the session from the Keras front end, that is, from the TensorFlow backend, we can use this session to print whatever is in it to a PB file, or a PBTXT file. This is very powerful, and also quite handy: whatever names you chose for your layers in Keras, for your own convenience, will still be there if you save your model as a PBTXT file. So you can actually open that PBTXT file and really see all the vanilla TensorFlow layers that your Keras model created under the hood.

But say we want it even simpler than that: we don't want to save the TensorFlow graph definition separately. We just want to create our Keras model, train it, and freeze it in TensorFlow format, all in Python, without any special Google tools. You can use Google tools; there is a very well-known utility called freeze_graph that ships with the TensorFlow source, and the link is over here if you want to check it (this notebook will be put on GitHub). But here we don't want to lean on too many tools when we don't know exactly what they do; we want to try to do things ourselves.

So let's say we trained our Keras model and we now have a lot of checkpoints. What we want to do is prepare our model for freezing, and then actually freeze it. When I say prepare the model for freezing, there are a few things we might want to do. We want to clear out whatever training leftovers remain, and if specific devices are assigned to nodes, we want to remove those assignments. For example, if we trained the model on the GPU and explicitly pinned nodes to the GPU, we don't want those assignments to stick around when we deploy on mobile, because we're going to use the CPU for inference there. This is very simple again: you get your model definition, you get the checkpoints (finding the latest checkpoint is again just another TensorFlow call, and the checkpoint folder is simply the folder where you saved them), and you create a saver.
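Putting the whole restore-clean-freeze sequence together, here is a minimal sketch; the checkpoint directory and the output node name are illustrative, assuming the toy model from before:

```python
import tensorflow as tf
from keras import backend as K

checkpoint_dir = './checkpoints'          # wherever the callback wrote them
output_nodes = ['class_result/Softmax']   # illustrative output node name

session = K.get_session()                 # the session shared with Keras
saver = tf.train.Saver()

# Restore the last checkpoint into memory
saver.restore(session, tf.train.latest_checkpoint(checkpoint_dir))

# Clear any device pinning, e.g. '/gpu:0', before freezing
graph_def = session.graph.as_graph_def()
for node in graph_def.node:
    node.device = ''

# Fold the trained variables into constants and write the frozen .pb
frozen = tf.graph_util.convert_variables_to_constants(
    session, graph_def, output_nodes)
with tf.gfile.GFile('frozen_model.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```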
And again, what you do is set up a TensorFlow session, link it to Keras, restore your checkpoint into memory after you finish the training, and write it out as a PB file. So basically you load the checkpoint, remove all the hardware assignments and such, and then save it back.

Once you have done all your pre-processing, you can freeze the model. In order to freeze the model, you have to specify a few things: you have to tell TensorFlow which layer is your input layer and which layer is your output layer. There can be more than one output layer. For example, object detection models sometimes have three output layers, to say where the objects are, which objects they are, and how certain the model is that those objects are there. That's not a problem; you just specify multiple output layers. But in this case we created a very simple model, and the only output layer we have is a simple softmax node, so nothing weird is happening here. So, again, we have our checkpoint that we saved through our Keras callback. We load the checkpoint, we specify what the input is and what the output is, and at this point we can simply take the weights we trained through Keras and save them in Protobuf format. To do that, we use the plain TensorFlow method convert_variables_to_constants, which really freezes whatever you have created. Once everything is converted into constants, it can be written into a .pb file by serializing the graph with SerializeToString. Again, all these methods have documentation, and we'll link it on GitHub so you can go and check it out. As you can see, it's very, very simple to take a Keras model, save it as a Protobuf file, and deploy it just like that.

So at this point what we have is a standard, vanilla TensorFlow .pb model file. What we want to do now is understand exactly what happened; we're not just happy to save it and hope that everything is going to be okay. Fortunately, Google provides a lot of tools for that. The first one we're going to use is a tool called summarize_graph. This is a C++ tool that comes integrated with TensorFlow, and when I say that, it means you actually have to go and download the TensorFlow source; just doing pip install tensorflow in your Python is unfortunately not enough. So you go to GitHub, you clone TensorFlow, and you build it. It's not difficult, there are very detailed instructions for it, and then you have access to all these nice utilities such as summarize_graph. summarize_graph lets you take a .pb file and really inspect it: what operations are in the file, which ones are the likely input layers, which ones are the output layers, even an estimate of the computational power necessary to run the graph, based on the input size and the operations it flows through.

So, for example, let's try to summarize the graph we just created. As you can see, we get one possible input, the image input layer, which is no surprise, and one possible output, the class-result softmax layer that we created before. And we have just a few operations: some constant operations, some ReLUs, some convolutions. Everything seems fine.
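summarize_graph itself is a compiled binary, but if you just want a quick look from Python, a rough equivalent is something like this (a sketch, not the actual tool, assuming the frozen file from before):

```python
import tensorflow as tf
from collections import Counter

# Load the frozen GraphDef back from disk
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Tally the op types; things like RandomUniform or leftover device
# strings in here are red flags for a mobile deployment
print(Counter(node.op for node in graph_def.node))

# Placeholders are the likely input layers
print([node.name for node in graph_def.node if node.op == 'Placeholder'])
```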
Actually, there are some catches, but we'll get to those shortly. Again, this works for every kind of model. If you have, for example, three output layers instead of one, it's still going to tell you. This, for example, is the standard summarize_graph output for an object detector. You still have one input, the image input: you're building an object detection model, so you pass an image to it. And you have three outputs: the bounding box, which is where the object is; the probability score, which in this case means how certain the model is that the object is there; and the class index, which says which object the network thinks it is. And again, you have your operations, and everything seems fine.

But actually, if we tried to deploy either of these models on mobile, it would not work. We would hit some bugs and the application would crash. So what I would like you to understand is that you can use this C++ tool to diagnose what's going on in your model and figure out whether you're going to be able to deploy it on mobile. For example, in the first model we have right here, we can look at the operations and see there is a RandomUniform operation. (And I just made a big mess here.) So there is a RandomUniform operation, and we think: okay, why do we need RandomUniform? RandomUniform is an operation that generates random numbers with a uniform distribution, so it doesn't seem like something we need for inference. And in fact, TensorFlow doesn't ship with this operation on mobile: if you compile TensorFlow for mobile, by default it will not include this operation, precisely because it's not used during inference. So when you go and deploy your model on mobile, it will not work, because the operation is not there. Either we recompile TensorFlow with this operation included, or we figure out why we have a RandomUniform in our model if we're not going to use it for inference.

Turns out RandomUniform is used for dropout. Dropout means you have a layer in which, with a uniform distribution, you turn some of the nodes on and off. For example, to avoid overfitting, you might want to turn off 20% of the nodes each time, but not always the same nodes, so you have a random number generator that tells you which nodes to turn off. Of course, dropout is only used in training; it's not useful in inference, and that's why TensorFlow mobile doesn't ship with it.

So what we would like to do now is remove the dropout layer from our network. Let's think about it: is that safe to do? Naively, it would seem not. You were taking out 20% of the nodes every time you passed through the dropout layer, and now, at inference, you're not taking those nodes out anymore. So whatever comes after that layer is going to see inputs at a different scale: before, 20% of the input was zeroed out, and now 100% of it comes through. What has historically been done when you remove dropout is to substitute it with a multiplication layer: if you dropped 20% of your nodes in training, you multiply all your inputs by 0.8 at inference, so that they are still at the right scale.
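To make that scaling argument concrete, here is a toy numpy sketch of the classic (non-inverted) dropout bookkeeping; the numbers are just for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1000000)   # pretend activations
p_drop = 0.2

# Classic dropout at training time: zero out 20% of units at random
mask = rng.rand(x.size) >= p_drop
print((x * mask).mean())          # ~0.4, i.e. 0.8 * mean(x)

# So at inference, with no mask, you rescale by (1 - p_drop) to match
print(x.mean() * (1 - p_drop))    # ~0.4 as well
```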
However, this is not really convenient. So, fortunately, TensorFlow doesn't actually do classic dropout; it does something called inverted dropout. Without you knowing, TensorFlow is already rescaling things at training time: if you drop, for example, 20% of your nodes, the surviving inputs are already scaled up by 1/0.8 = 1.25. That means at inference time we can just safely remove dropout, without any issue, because everything was already at the right scale during training.

So: we know we're on TensorFlow, we can remove dropout. How to remove it? There are a few methods. Some of them are quite funny: you go on the Internet and see people suggesting you take your graph, print it out, assign a number to every node, find which number is dropout, cut that node out of the graph, and try to rewire the inputs and outputs of the neighboring nodes. Or you parse the nodes, find whatever has "dropout" in the name, and remove it. These are not exactly orthodox methods. The safest way is this: you built the model, so you already know where you're putting dropout. Simply write your model-building function with a boolean flag. If you are training, you build the model with dropout; if you are not training, you build it without dropout. Let's take a look (there's a sketch of this just below). In this case we are building, again, a very simple model with a boolean: if we are training, dropout gets put in; if we are not training, dropout is excluded. This way, when we build the model and freeze it, dropout is out, and we won't have any problem with RandomUniform. We don't have to recompile TensorFlow to include it; we're just safe.

Here's a nice bonus, too. We're using a Lambda layer here because we want to use the dropout operation directly from vanilla TensorFlow. This is just because, when you're going to remove layers, you really want to be sure of what's going on rather than assume. If you have checked that TensorFlow uses inverted dropout, use the vanilla op. Keras actually also uses inverted dropout, but for dropout Keras does something a bit weird: it splits the graph in two, with one path for training and one path for inference. You can still work with that, but you might want to avoid this dual-graph business, so you might prefer the vanilla TensorFlow op, which has a single path and just rescales the input. And that's fine: now dropout goes away at inference and we're safe to use the model on mobile.

Let's take a look at the other network we had in our inspection tool. As you can see here, it says 306 nodes assigned to device GPU:0. So whoever built this network didn't follow the advice we gave before: we have to remove the device, and nothing should stay pinned to the nodes. If you try to deploy this on an iPhone, for example, or on Android, it will say: I have no idea what GPU:0 means, so I cannot run this model. So let's say, okay, you don't really want to re-freeze the graph, because that would be a lot of work.
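First, the promised sketch of the training-flag approach; the layer sizes and names are purely illustrative:

```python
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Conv2D, Activation, Flatten, Dense, Lambda

def build_model(is_training):
    model = Sequential()
    model.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3),
                     name='image_input'))
    model.add(Activation('relu'))
    if is_training:
        # Vanilla TF inverted dropout: scales the kept units by
        # 1/keep_prob at training time, so at inference we simply
        # leave this layer out and nothing needs rescaling
        model.add(Lambda(lambda x: tf.nn.dropout(x, keep_prob=0.8)))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Activation('softmax', name='class_result'))
    return model

train_model = build_model(is_training=True)       # with dropout, for fitting
inference_model = build_model(is_training=False)  # without, for freezing
```

In practice you train the first model, checkpoint it, rebuild with is_training=False, restore the trained weights into that graph, and freeze. Now, back to the network with the GPU pins.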
Fortunately, Google again comes to the rescue with another very interesting tool that can really help you deploy your model on mobile; if you ever want to deploy models on mobile, always use this tool. It's called the Graph Transform Tool, GTT for short, which again you can find here and explore on GitHub. The Graph Transform Tool allows you to do a lot of operations to optimize and rewrite your graph so that you can then deploy it on mobile without any problems.

It allows you, for example, to fold constants. Folding constants means that if you have a set of nodes that always produces the same constant at the end, it will compress them all and just output that constant, because there's no point in keeping the whole chain of nodes. So you cut out nodes. You give it the input and output nodes you want, it goes through the graph, and it checks, for example, which nodes are never actually reached in between: if there are nodes you don't want in there, it just cuts them out. This tool is amazing and can do all of this for you without you doing anything yourself. Again, it's a C++ Google tool, so you download TensorFlow from source, you build it with Bazel (you have to compile this tool specifically), and then you can use it.

Using it is very simple; this is a sample command. You specify which graph goes in (the graph is just the model we saved), which one comes out (whether you overwrite this graph or simply give it a different name), which ones are the input and output layers, and which transforms to apply to the graph. In this case, for example, I'm stripping away nodes that I'm not using, and I'm stripping away identity nodes, because why would I want to keep identity nodes in there? And I'm folding constants, folding the batch norms, and removing the device assignments. This is the basic set of things you really want to do to deploy a model on mobile.

GTT is so awesome that it can do even more advanced stuff. For example, it can do the quantization and weight rounding that Martin was talking about before. If I want my graph to be quantized automatically, I just have to add one transform, quantize_weights, and if I run the command I get my original graph transformed: constants folded, identities and such stripped away, and all the weights quantized.

Just to give you a sample of what happens, I have a few models here. This is the original model; let's take a look at it. It's 12 megabytes. It's a simple SqueezeNet-based model, so it's very small, 12 megabytes. But it's possible that you might want a bigger model, or many models on your phone. Here is the quantized model: 3 megabytes. It's the same model, quantized; you just add the quantize_weights transform and it goes down to 3 megabytes. Unfortunately, the accuracy of the model can also degrade a bit. There is something called fake quantization that you can do during training to improve the quantized result.
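For reference, the same transforms are also exposed to Python in TensorFlow 1.x, so a sketch of that command looks like this (the node names are from our toy model, and I'm assuming the graph_transforms Python binding is available in your install):

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

transforms = [
    'strip_unused_nodes',
    'remove_nodes(op=Identity)',
    'fold_constants(ignore_errors=true)',
    'fold_batch_norms',
    'remove_device',
    'quantize_weights',  # or 'round_weights(num_steps=256)' for rounding
]
optimized = TransformGraph(graph_def,
                           ['image_input'],            # input layers
                           ['class_result/Softmax'],   # output layers
                           transforms)
with tf.gfile.GFile('optimized_model.pb', 'wb') as f:
    f.write(optimized.SerializeToString())
```

As said, quantize_weights can cost some accuracy, and that's where fake quantization comes in.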
If you are really interested in fake quantization, it's a somewhat advanced topic. You can go check out a Google engineer called Pete Warden, who has a wonderful blog; I will put the link in the GitHub notebook. He teaches a bit about how to use fake quantization, or at least explains what it is and how to use it. Very interesting.

But say you don't want to be that extreme: you don't want to lose accuracy, and you still want a compromise that reduces the space your models occupy. Then you can use weight rounding. If you look at this weight-rounded model, it's still 12 megabytes, so it seems applying the weight-rounding transform did nothing. But now let's compress it: 2.8 megabytes. If I compress the original model instead, it's still 11 megabytes. So weight rounding works very well, especially when you consider that when you deploy a model inside an application, the application gets archived before it's sent to any app store, so the model is always going to be compressed. You don't have to zip your model and then extract it, and so on and so forth. You do have to do that if you want to keep updating your model without updating the application; then you might keep your model zipped in a download folder and unzip it when you want to use it. But if your model ships inside the app and stays constant, and whenever you want to upgrade the model you push an app update, then the compression really comes out of the box.

So let's go back to our Python notebook. As a bonus, if you guys are crazy enough, you can do really crazy stuff with TensorFlow. Say you really don't want to take out dropout, because you don't trust removing nodes: you can add the missing operations, like RandomUniform, and recompile TensorFlow for both iOS and Android. The links are here. Basically (I don't have internet here, so I cannot show you) there are two files that list all the operations TensorFlow is going to ship with for your platform: Android, iOS, anything. If you want to add operations, you can. But on the other side, if you want to be really extreme about space (right now TensorFlow occupies about 20 megabytes), you can remove all the operations your graph is not going to use. For example, if you know your graph is never going to use Concat, and if you ever start using Concat you can ship an app update with a new TensorFlow build anyway, then you can safely take Concat out of TensorFlow and recompile. It will not link the Concat files and it will have a smaller footprint. So you can have a TensorFlow that is much smaller than what gets shipped by default, and Google actually suggests you try this. Again, it's at your own risk, but if you feel very brave, you can do it: you go get those files, select the operations you want, and you can check which operations you are actually using with the summarize_graph tool we saw before.
So you use the summarize_graph tool, get the list of all the operations you're using, keep those, delete whatever else you don't need, and TensorFlow gets much smaller. Again, a bonus for the brave.

Okay, now we are finally ready to use TensorFlow on mobile, and there are some examples that Google provides. This, for example, is an iOS example. We can take a look at it, but you can already see that this example is in Objective-C; actually, Objective-C++. What we would really like is to write our inference engine once, against the TensorFlow C++ API, and then deploy it on Android, on iOS, on Windows, wherever. We don't want to rewrite all the session handling for every single platform. And we can do that; we just have to use the underlying TensorFlow C++ API directly.

So let's first look at what Google is doing here. To keep it simple, in this runCNNOnFrame function they take an image from the camera. The idea is that you're always going to have a few components in your mobile app: your UI, where you do all the normal mobile stuff; then a wrapper between the UI and the TensorFlow engine that can talk to both components; and then your underlying C++. What Google does here is take a photo through the camera, pass it through the wrapper, and convert it into a format TensorFlow can understand. If you look, they just take the raw data and reshape it into a form that's convenient for TensorFlow, and then run a session on that data. We're going to mimic what they're doing, but in C++. To do that we're going to use, for example, OpenCV. You don't need OpenCV: if your input is words, for example, not images, you don't need it. But if you're doing computer vision, OpenCV is a nice tool to have; preprocessing is easy, and maybe you already know the tool.

So let's have a quick demo of, in this case, a soda object-detection model running on an iPhone. We have our iPhone here, and we're going to get some images. Soda photos, yes. Wow, this is very small. Okay. In this case, I trained my model to recognize soda products on a mobile device. On the left you have the live stream from the mobile phone, and on the right the photo I'm going to take. So I take this photo, and you see: in half a second you get all these colored rectangles around the objects. That's a good sign; it means the objects get detected. We might want to check whether the objects are really what they seem to be, so we have a list of these objects; let's go and check. Yeah, it seems quite all right: you have some cans on the top two shelves, regular Coke, Coke Light and such. (I'm not partnered with Coca-Cola or any affiliates. You can go and buy Pepsi, that's fine. I'm on video, so both companies are great.) Anyway: as you can see, a task like object detection, which is considered even more difficult and complex than object classification, can be done in just half a second by an iPhone.
An iPhone that's three or four years old might take three to five seconds, but still: if you had to do this task on the cloud, you would have to take the image (and maybe you need a high-resolution image to be able to detect all the details) and push it to the cloud, so depending on your internet connection it's going to take a while. All the processing happens on the cloud, and if you are deploying at scale, you may have thousands of users sending images concurrently, so requests get queued and processed when the cloud is ready, and then the result gets pushed back. So it's not going to be as fast as 0.5 seconds; it could even be three to five minutes. So deploying on mobile, especially if you are an enterprise, can save both time and cost: you have no cloud infrastructure to build and no cloud computing to maintain, the users themselves bring the infrastructure to you, and you get a decentralized inference-engine infrastructure that works cross-platform. You build it once in C++ and you can deploy it on Windows, iOS, and Android.

Okay, I think I'm done. Let's just summarize what we learned today. We built a Keras model and saved it in TensorFlow format. We froze it ourselves using the TensorFlow Python API, and we inspected it with the summarize_graph tool to understand exactly what was going on; we used that tool as a diagnostic to see what was not going well with our models. We transformed and improved our models with the Graph Transform Tool, and then we built a simple cross-platform inference engine in C++. And I guess for 40 minutes, that's enough stuff. If you want to follow up, if what you've seen here is cool and you want to join the team, this is my professional email. If you want to follow up on anything else, need advice, or just want to chat about TensorFlow because it's awesome, you have my personal email here; feel free to contact me. Don't spam me, please. Okay, I'm done. If there are any questions, I'm happy to reply.

Yes? Okay, this one is crazy, yeah? If you recompile TensorFlow, you can add all the operations back, and then you can actually do training on the phone. I don't know why you would want to do that, because you can only use the CPU, unless Sam is going to teach you how to use the GPU. But you have to understand: for training we usually use maybe an NVIDIA Titan X, which is as powerful as ten thousand iPhones. So... Oh yeah, I mean, again, it always depends on how brave you are. I would not suggest it, probably, in 2017. Maybe in a couple of years, who knows, mobile phones will get so powerful that you can train any model, or maybe you'll have models small enough, who knows. But it's definitely something you can do in theory; you just have to recompile to include all the operations, and then what you get on the phone is the vanilla C++ API, which can do basically everything, okay?

Yes, one question from the back. Yeah. Okay, so weight rounding: basically, you just wiggle the weights in a way that... okay. If you have normal weights, then because of the way compression algorithms like zip work, you cannot really gain a lot of space: they can't find repeated patterns, because these weights are all over the place.
But if you wiggle those weights around just a little bit, so that they look nicer to a compression algorithm, you lose very little accuracy but gain a lot of space when you compress. Because of the way these compression algorithms work, they can now match many more repeated patterns, and since they're lossless they still reconstruct exactly the weights you stored.

Actually, this is a very good question. While the Graph Transform Tool can do weight rounding, if you're really serious about getting smaller models, I would advise something called deep compression, and I think I have the reference here. Deep compression is actually very simple and very similar; I have a reference in the Python notebook to a deep compression repository where you can just call it from Python and compress your weights in the same way. It's going to take half an hour instead of five seconds, but the accuracy will be basically the same as the original model, while taking less space once compressed. So again, I'm going to push up the notebook so you can check all the references to deep compression and such.

Oh, XLA compilation for mobile. We are experimenting with that. I think the performance is actually quite great, but I don't want to say too much because it's very experimental. Even TensorFlow suggests not using it in deployment for now, for example if you are deploying to clients, as it's still something very new that they are still improving. By the way, if you are building TensorFlow from source, you can decide to include the XLA compiler during your configuration step. When you build from source, one of the steps is running a configure script, which asks you a lot of questions: whether you want to build with CUDA, whether you want OpenCL support even, and one of those questions is whether you want the XLA compiler shipped with TensorFlow. So if you want to experiment with it, that's the way to go. I think it's very interesting; maybe in one or two years it's going to be really great.