You have an image, and there's this big machine sitting in the middle that gives us some classification outputs. So in this case the input is an image, and the task is classification. Now, you may be excited to know that deep neural networks can do all this, and perhaps you'll go home and try to build an app that uses this technology and deploy it on mobile phones. But one nasty surprise you'll come across is that it's very difficult to port these things onto small devices like mobile phones. The reason is very simple: the thing sitting in the middle is very, very large.

So why is that a problem? If the model is very large, it takes up a lot of memory, and it also takes a lot of time to do inference. What do I mean by inference? Inference is the process of feeding the image forward through the network. So the inference time will be very large, and the memory taken by the network will also be very large. Although these networks work very well, they're really too big and too slow, and the question is how to make them smaller. We call this problem model compression: you take a deep learning model and you compress it.

Why do we need compression? Because deep neural networks have a large number of parameters. The network you saw in the previous slide takes about 200 MB of disk space, which is really large for mobile devices, and it runs at about three frames per second on, say, a mobile CPU. So that big machine you saw can only process three frames in one second, which may not be the speed you're looking for; ideally, you want something much faster than three frames per second. And 200 MB is really too big to store in the memory of a mobile phone. Even if an app is, say, 200 MB, we would hesitate to download it, because it's so big. You want app sizes to be small so that they can run efficiently on mobile phones.

So the point here is that we need to run deep neural networks on platforms with limited computational resources, such as embedded platforms and mobile platforms. Current deep neural networks, taken as they are, may not be able to run on such small devices, so we need to compress them so that they can. This is what we call the problem of model compression. You have a large model with lots of parameters; it's slow and uses GPUs, but it has good accuracy, which is what we like about deep neural networks. What we want from model compression is to turn it into a smaller model such that we can now use CPUs for inference, and possibly also for training. But the point is that even this small model should get good accuracy, an accuracy comparable to the large model. Ideally, it should not lose any accuracy at all.

Typically, when converting from a large model to a small model, we use training data; it's not a one-shot process. You fine-tune the large model so that it becomes smaller. From an optimization perspective, you optimize the large model under the constraint that the resulting model must be much smaller than the original. So now we know what model compression is: just making a model smaller.
So let's look at what kinds of layers are present in a CNN. We have come across convolutional layers, and we have come across fully connected layers; let's assume for now that all neural networks contain just these two kinds of layers. It turns out that these two are almost opposites of each other. What do I mean by that? If you look at speed, convolutional layers are very slow to infer, but they have a very low number of parameters, so they take little memory. Fully connected layers are just the opposite: they are very fast at inference, but they take up really large amounts of memory. So these two are opposites of each other, but that is good for us. If you want to make a network faster, compress the convolutional layers, because we know those are slow. If you want to make a network smaller, that is, make it consume less memory, prune or compress the fully connected layers, because we know those take up a lot of space.

Now, we've seen all sorts of scary diagrams of neural networks. What I'll try to do in this talk is work with this tiny toy neural network. There are only three input neurons here, meaning the input is a vector of just three numbers, and the output is just a single number. There are two hidden layers with four neurons each: a very simple network containing only fully connected layers. But you'll ask, what about convolutional layers? What I'll do in this talk is analyze these toy networks, but it turns out that whatever analysis we do on them can be extended to convolutional layers as well.

When we consider this toy neural network, let's take the piece in the middle: the set of lines connecting the two hidden layers. It turns out you can write this as a matrix. It's a 4 x 4 matrix, because there are four neurons in one hidden layer and four neurons in the other. So far we are good. So all a neural network is, really, is just a bunch of matrices, and all we do during inference is a bunch of matrix-vector multiplications. What we are saying now is that the problem of compressing neural networks is the same as compressing these matrices. There are three matrices in this network; I'll try to compress each of them, and that's how I'll compress the entire neural network.

I just said we have to compress matrices, but what does that even mean? Suppose you have an n x n matrix; it will have n squared entries. If n squared is very large, what do we do? That's where compression comes in. We have to design matrices such that the number of elements is much smaller, that is, come up with techniques that reduce the size of each matrix. To do that, we will look at four general techniques today. These do not correspond to any one particular method; each describes a family of methods which we can use to compress neural networks, or rather, compress matrices.
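Before the four strategies, here is a minimal sketch of the matrix view we just described: the toy network with three inputs, two hidden layers of four neurons, and one output, written as three matrices. The ReLU activation and the absence of bias terms are my own simplifying assumptions for illustration, not something fixed by the talk.

```python
import numpy as np

# The toy network: 3 inputs -> 4 hidden -> 4 hidden -> 1 output.
# Each layer is just a weight matrix; inference is matrix-vector products.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # input -> hidden layer 1
W2 = rng.standard_normal((4, 4))   # hidden 1 -> hidden 2 (the 4 x 4 in the middle)
W3 = rng.standard_normal((1, 4))   # hidden layer 2 -> output

def relu(x):
    return np.maximum(x, 0)

def infer(x):
    # The forward pass is nothing but matrix-vector multiplications
    # interleaved with a nonlinearity.
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return W3 @ h2

x = rng.standard_normal(3)   # a vector of three numbers
print(infer(x))              # a single output number
```

Compressing this network means compressing W1, W2, and W3.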
So we said we'd talk about four strategies. The first strategy is simply to sparsify the matrix. What do I mean by a sparse matrix? A sparse matrix is just a matrix with lots of zeros. So what if it has lots of zeros? It turns out that a very sparse matrix can be stored very efficiently: if a matrix has lots of zeros, you only need to store the entries that are not zero. That reduces our storage requirement. Instead of storing n squared elements, you only need to store k elements, where k is the number of non-zero elements in the matrix. Along with each value, you also need to store its index, that is, where it is non-zero. So you store two things: the values of the non-zero entries and their positions. If you store these two things, you compress the amount of information in the matrix, and you need not store n squared elements. This is one transformation people commonly use to compress neural networks. I think Suthirth also talked about something similar: he wanted to sparsify his convolution kernels, and this is, in fact, the same thing. What people usually do is take a trained neural network, which has a dense matrix sitting inside it, and convert that dense matrix to a sparse one. The conversion from dense to sparse uses data; it's an optimization problem that takes data into account to create sparse matrices.

That's fine, but how does it look in terms of the neural network? This was the hidden layer we were talking about: four neurons here, four neurons there, and everything connected to everything else. With a sparse matrix, a lot of those lines go missing, because they're all zeros and we don't need to store them. As you can see immediately, this is a much simpler neural network than before: it's much faster to evaluate, and a lot easier to store as well. So this is the first way in which you can compress matrices, or neural networks.
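Here is a minimal sketch of that sparsification idea: zero out weights, then store only the non-zero values together with their indices. Thresholding by magnitude is just an illustrative stand-in; as noted above, the real dense-to-sparse conversion is an optimization problem that uses data.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

# Illustrative sparsification: zero out weights below a magnitude threshold.
W_sparse = np.where(np.abs(W) < 0.8, 0.0, W)

# Sparse storage: keep only the non-zero values and where they live.
rows, cols = np.nonzero(W_sparse)
values = W_sparse[rows, cols]
print(f"dense: {W.size} elements; sparse: {values.size} values + indices")

# A sparse matrix-vector product only does work for the non-zero entries.
def sparse_matvec(rows, cols, values, x, n_out):
    y = np.zeros(n_out)
    for r, c, v in zip(rows, cols, values):
        y[r] += v * x[c]
    return y

x = rng.standard_normal(4)
assert np.allclose(sparse_matvec(rows, cols, values, x, 4), W_sparse @ x)
```

In practice you would hand the (values, indices) representation to a dedicated sparse kernel, for example scipy.sparse, rather than looping in Python.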
The second way, which is the sort of work we have been doing in our lab, is to shrink the matrix. What do I mean by shrinking a matrix? You have a 4 x 4 matrix, and the question we are asking is: can you do the same job with a 4 x 2 matrix, or some other matrix with fewer dimensions along one of its axes? This transformation, too, uses data. It's not immediately clear what it's doing, so let's look at the neural network view. What we're saying is that maybe you don't need four neurons in this hidden layer; maybe two of them will do. And if you can get the same performance with two, you're saving on memory: you're not storing 16 elements, only 8, the 4 x 2 matrix. Inference is obviously faster as well.

But let's think for a moment about what this means. You're selecting which neurons are important and which are not, which is analogous to selecting the architecture of the neural network itself. Usually, when you're training a neural network, you decide: let me use 96 filters in the first layer, let me use 300 filters in the second layer. But most of these decisions are pretty arbitrary; you just play around with something until it starts working properly. What we're saying is that when you do that, you're usually creating an over-complete architecture: you have more filters than are actually necessary. What this process does is take a neural network, learn which features are important, and prune away the others. So it learns the architecture of the neural network. You can also use this to train networks from scratch; we have shown that you can use these kinds of techniques to obtain minimal architectures, ones with only the minimal number of neurons needed for the task and no more. Essentially, we give the method the freedom to select the architecture, but we tell it that, given a choice, it should prefer the smaller option over the larger one. This is very effective at selecting small architectures. So along with compression, that is, shrinking the size of the matrix, we are also implicitly selecting, or learning, the architecture of the network. This is in contrast to some of the earlier things we saw about AutoML, where you do a grid search to find how many neurons are ideally required; this can learn all of that in a single shot.

Another way, which is very interesting, is called breaking the matrix. You have a 4 x 4 matrix: can I do something to it to break it into two parts? There are things called matrix factorizations which can break a matrix into two parts, or as many parts as we want. Some of these are general techniques that work for any matrix, and some are designed specifically for neural networks; that's what people have been working on. So what do these do? They convert one matrix into a product of two matrices. If your matrix has some good properties, say low rank, then you save on the number of parameters. Matrix factorization is actually very powerful in practice, and if any of you want to compress neural networks, I would suggest trying this first.

The last thing I'll touch upon is quantizing the matrix. Very simple: you have 32-bit floating-point elements in your matrix; can you do the same job with 8-bit elements? You can do all kinds of quantization, linear or non-linear; you can even make the weights binary. All sorts of variants are present in the literature, but the catch is that you need hardware support to implement whatever exotic quantization you come up with. In the neural network view, it looks like this: minus 1 is red, plus 1 is blue. Your weight matrix now essentially has two states: every entry is either plus 1 or minus 1. This is what quantization does; it makes the network easier to store and also faster. Alternatively, you can go further and binarize the entire neural network. You don't just say the weights are binary; you say everything in the network is binary, so the features are also binary. This is like saying that all the features my neural network computes will be binary codes, and all the weights are doing is transforming one set of binary codes into another. It's just a code transformation that's happening. And it turns out this is extremely effective: it provides a 58x speedup. Quick sketches of the shrinking, factorization, and binarization ideas follow below.
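First, shrinking. A minimal sketch: drop hidden neurons and the weights attached to them, so a 4 x 4 matrix becomes a 2 x 4 one. Scoring neurons by the magnitude of their outgoing weights is purely my illustrative choice here; as described above, the real methods learn which neurons matter from data.

```python
import numpy as np

rng = np.random.default_rng(0)
W2 = rng.standard_normal((4, 4))   # hidden layer 1 -> hidden layer 2
W3 = rng.standard_normal((1, 4))   # hidden layer 2 -> output

# Score each neuron of hidden layer 2 by its outgoing weight magnitude
# (an illustrative proxy; real methods learn importance from data).
scores = np.abs(W3).sum(axis=0)
keep = np.sort(np.argsort(scores)[-2:])   # keep the 2 highest-scoring neurons

W2_small = W2[keep, :]   # 4x4 -> 2x4: drop the pruned neurons' incoming rows
W3_small = W3[:, keep]   # 1x4 -> 1x2: and their outgoing columns

print(W2.size + W3.size, "->", W2_small.size + W3_small.size, "parameters")
```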
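Next, breaking the matrix. A minimal sketch using a truncated SVD, one standard factorization: a single big weight matrix becomes the product of two thin ones. The layer size and target rank below are arbitrary illustrative numbers, and note that the approximation is only faithful when the original matrix really is close to low rank; in practice people fine-tune afterwards to recover accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))    # one big fully connected layer

# Break W into two thinner matrices with a truncated SVD.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                   # target rank (illustrative)
A = U[:, :r] * s[:r]                     # 1024 x 64
B = Vt[:r, :]                            # 64 x 1024

# One product W @ x becomes two cheaper ones, A @ (B @ x).
x = rng.standard_normal(1024)
y = A @ (B @ x)

print(f"{W.size} -> {A.size + B.size} parameters")   # roughly 8x fewer
```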
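And finally, quantization taken to the extreme: binarizing the weights so that each entry is plus 1 or minus 1, storable in one bit instead of 32. The single per-matrix scale alpha below is one common heuristic from the binarization literature, used here as an assumption rather than as the specific method from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# Binarize: every weight becomes +1 or -1 (1 bit each instead of 32),
# plus one real-valued scale so overall magnitudes are roughly preserved.
alpha = np.abs(W).mean()                 # a common scaling heuristic
W_bin = np.where(W >= 0, 1.0, -1.0)      # entries in {+1, -1} only

print("full precision:", W @ x)
print("binarized:     ", alpha * (W_bin @ x))
```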
That 58x speedup is really quite striking; it's the number that first caught people's attention. But the method loses a lot of accuracy, so people are researching this further.

Let me summarize what we've seen so far. We talked about fully connected layers, and said that compressing fully connected layers makes networks smaller. We talked about convolutional layers, and said that compressing convolutional layers makes networks faster. And we saw four general strategies for doing this: sparsification, shrinking, breaking the matrix into parts, and quantization. Needless to say, instead of applying them individually, you can also apply these methods together: sparsification plus shrinking plus breaking gives more and more compression.

So that was the talk, but what next? Have we solved the problem of model compression? Far from it; we still have a long way to go. The reason is that we are still not able to provide impressive speedups. Maybe we can get 2x, 3x, or 4x, but not much more than that, and we need much larger speedups if we want to work with CPUs. And we are still very, very far away from training these networks on CPUs. That is the core problem of model compression: if you can compress a model enough, you could actually do training on CPUs. I'm not sure this is even possible, but it's something people are working on. Another thing, which Suthirth also mentioned, is that working with high-resolution images is very difficult. Suppose your image size is, say, k x k. If you double the size of your input, the number of computations goes up four times, inference becomes four times slower, and the model takes four times as many parameters, which is very bad. So how do you get around the problem of working with high-resolution images? These are all very hard research problems. So for those of you who are new to neural networks and looking to pick a research problem, I would suggest starting with model compression, because this is one area that can have a lot of impact on the way deep learning is done in industry; these kinds of model compression technologies affect a lot of people. So I'll stop here.