This is a dog. This is a cat, dog, dog, cat. This network is pretty good at distinguishing dogs and cats. Let's spice it up. Dog. This is a cat. This is a monkey. Yep. This is a cat. Oh, this is a cat. Looks like the network isn't so great when it has to distinguish more animals. So we need to increase the complexity of our network to better capture the patterns in the data. But what is the best way to scale the network, and by how much do we want to scale? Do we increase the number of layers in the network? Do we increase the length and breadth of each layer? Or do we increase the number of channels in each of the convolution layers? Well, in this video, we're going to see how to determine this quantitatively, so we can scale our networks in the most efficient way. The explanation is going to be in multiple parts. In the first part, we're going to define our problem, and then go deeper with more details from there. Let's get into it.

Let's start by defining an objective. We are given a simple neural network architecture. Our goal is to scale this architecture so that we get the best performance, while our hardware can still handle the scaled architecture. So "network" here is the base architecture we want to scale, like our dog and cat classifier. This network can be scaled in three ways, like we mentioned before. First, we can increase the number of layers in the network; let's call that scaling factor D, for depth. Second, we can increase the number of channels in every layer; let's call that scaling factor W, for layer width. And third, we can increase the length and breadth of each layer; let's call that scaling factor R, for resolution. We can write this network as a function of D, W and R. So calling network(1, 1, 1) gives us the base architecture, which is just scaling everything by a factor of 1, and this is the lower bound. Now the objective is to find the optimal scaling values of D, W and R that give the best performance under these hardware constraints. It's looking a lot more concrete now.

On to part two: how do we find the optimal values? This is done using a technique called grid search. Each one of these dots here represents a trained model for different values of D, W and R. The idea is to train the model, tune the values of D, W and R, train again, and repeat until we get the values of D, W and R that give the best performance. But it really isn't feasible to compute all of these dots and keep training a thousand times. So we need to identify where in this grid we should actually constrain our search; that is, we need to come up with the best ranges for D, W and R, so we know where the best model lies.

Now on to part three: how do we determine the ranges for the scaling values? To get the best performance, we should scale D, W and R evenly. This makes sense intuitively too. If the input has a larger length and breadth, that is, a higher R, then we need a deeper network so that neurons in the later layers have a receptive field large enough to cover the enlarged input, and we also need more channels per layer to capture the necessary features in that enlarged input. So that means we need to increase W and D as well. To further support this point, let's look at these graphs. The x-axis is a measure of the complexity of our network and the y-axis is a measure of accuracy. We have three graphs that show what happens if we increase D, W and R separately, leaving the other two constant.
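To make "network as a function of D, W and R" concrete, here is a minimal Python sketch. The base layer count, channel count and resolution are made-up numbers for illustration, not the actual classifier from the video.

```python
import math

# Hypothetical base architecture (values chosen only for illustration).
BASE_LAYERS = 8          # number of conv layers in the base model
BASE_CHANNELS = 32       # channels per layer in the base model
BASE_RESOLUTION = 64     # input height/width of the base model

def network(d: float, w: float, r: float) -> dict:
    """Configuration of the architecture scaled by depth d, width w, resolution r."""
    return {
        "layers": math.ceil(BASE_LAYERS * d),          # depth scaling
        "channels": math.ceil(BASE_CHANNELS * w),      # width scaling
        "resolution": math.ceil(BASE_RESOLUTION * r),  # resolution scaling
    }

print(network(1, 1, 1))      # the base architecture, the lower bound
print(network(2, 1.5, 1.2))  # one possible scaled-up architecture
```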
And it looks like we see high gains initially, but after a point, scaling one factor alone really doesn't increase performance any more. And that's not great; we want lower complexity with higher gains, after all. So to ensure that we scale more uniformly, we can tie the three factors together with a common exponent, phi: we set the depth to d = alpha^phi, the width to w = beta^phi, and the resolution to r = gamma^phi. Changing only phi then scales all three values together, and we now perform our grid search on the new variables alpha, beta and gamma instead. We have a lower bound of one for each of them, but let's better quantify the upper bound using the hardware constraints.

We have this base architecture for classifying cats and dogs. We can measure its complexity by counting the number of floating point operations (FLOPs) required to get a result; basically, it's the number of additions and multiplications. I'm just going to assume that this cat and dog classifier takes 100,000 FLOPs. Now let's say we are running on a small system where, if we scale this, we can only handle a model that takes 200,000 FLOPs at most; that is, it can only be twice as complex. So what are the values of alpha, beta and gamma under this constraint? This actually gives us a suitable upper bound, but we're still missing one component: the complexity of the scaled network. So how do we compute this? Now we're going to get into the details.

Part four: computing the number of FLOPs, the number of floating point operations. The basic operation in a convolutional neural network is convolution, so how many floating point operations does it take? Let's say the input from the previous layer is a 3x3 matrix and the output we want is just a 1x1 matrix, a single number. This is done by convolving the input with a kernel of size 3x3. We perform nine element-wise multiplications and then sum the results with nine additions (counting each multiply-accumulate as two operations), and this gives 18 floating point operations. Now we're going to see how the number of floating point operations changes when we scale the network in different ways.

Case one: what happens if we double the depth of the network? Well, we're just performing twice as many convolution operations, so we can say that the complexity of the scaled network is D times the complexity of the original base network. Remember, D is the depth scaling factor.

Case two: what happens if we instead double the number of channels in every layer of the network? This means the input would have two channels and the output would also have two channels. To get this, we need to convolve with two kernels of shape 3x3x2. Let's compute the number of floating point operations step by step. The input is of shape 3x3x2 and one kernel is of shape 3x3x2. We perform an element-wise multiplication for every element, which needs 18 multiplications, then add all of these together with 18 additions to get the output number. This entire operation requires 36 floating point operations, but that is just for one channel of the output. To get the second channel, we need to convolve the input with another 3x3x2 kernel, and repeating this process gives 36 more operations. So in total we have 72 floating point operations, which is four times the initial amount. So doubling the number of channels in every layer increases the number of floating point operations of the network by four times.
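Here is a small sketch that reproduces these FLOP counts, using the convention from the walkthrough that one multiply-accumulate counts as two operations. The layer shapes are just the toy 3x3 example, not a real architecture.

```python
def conv_flops(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """FLOPs of one convolution layer: every output element needs
    kernel_h * kernel_w * in_channels multiplications plus the same
    number of additions (one multiply-accumulate = 2 FLOPs)."""
    macs = out_h * out_w * out_channels * kernel_h * kernel_w * in_channels
    return 2 * macs

base = conv_flops(1, 1, 1, 3, 3, 1)   # the single 3x3 convolution
print(base)                            # 18

# Case one: double the depth (d = 2) -> two such layers, 2x the FLOPs.
print(2 * base)                        # 36

# Case two: double the channels (w = 2) -> 3x3x2 input, two 3x3x2 kernels.
print(conv_flops(1, 1, 2, 3, 3, 2))    # 72, i.e. w^2 = 4 times the base
```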
In fact, if we triple W, we see that the number of floating point operations increases nine-fold. So we can say that the complexity of the scaled network is proportional to W squared times the complexity of the original base network.

On to case three: what happens if we double the resolution of every layer in the network? This means we are doubling the length and the breadth of every layer. If the resolution of every layer increases two-fold, then the input would be of shape 6x6 and the output needs to be of shape 2x2, and we can achieve this by convolving with a 3x3 kernel with a stride of three. So let's compute the number of floating point operations required here. Each convolution of this kernel requires 18 floating point operations, like we computed initially, but we need to perform it four times to get the 2x2 output. So if we double the resolution of every layer, the number of floating point operations in the network increases by four times. You can also check that when we increase the resolution three-fold, the network performs nine times as many floating point operations. So we can say that the complexity of the scaled network is proportional to R squared times the complexity of the original base network.

We can now combine these three proportionalities. There are no other factors that this proportionality depends on besides the three scaling factors, so we can replace the proportionality with an equality: FLOPs(scaled) = d * w^2 * r^2 * FLOPs(base). We can replace d, w and r with their alpha, beta and gamma equivalents, d = alpha^phi, w = beta^phi and r = gamma^phi, and factor out that phi, giving (alpha * beta^2 * gamma^2)^phi. Let me bring the number of floating point operations of the base architecture over to the left-hand side, so the left-hand side represents the ratio by which we scaled the complexity. That means that if we doubled the number of floating point operations, the left-hand side would be equal to two; if we tripled it, it would be equal to three. This could be any number, but let's say it takes the form two to the power of phi, that is, we constrain alpha * beta^2 * gamma^2 to be approximately two. You're probably wondering where this two came from, and it's actually somewhat arbitrary: we choose two because it is the smallest integer above one, which gives a tight bound on alpha, beta and gamma so that the grid search stays fast.

We can now use this in our objective. If the budget changed from 100,000 floating point operations to 200,000 floating point operations, that means we scaled the model by a factor of two, which implies that phi should be equal to one. Now, once we have the optimal values of alpha, beta and gamma as a result of the grid search, we can compute D, W and R by setting phi equal to one. This gives us the most efficient model that can have up to two times the number of floating point operations of the base architecture. And by increasing the value of phi to two, three, four, and so on, we can determine new values of D, W and R. When phi is equal to two, we are getting the best model that can have up to four times the number of floating point operations of the base. When we set phi equal to three, we are getting the best model that can have up to eight times the number of floating point operations of the base architecture, and when we set phi equal to four, up to 16 times. And so, we can get a family of efficient models at different complexity levels.
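Here is a minimal sketch of this compound scaling rule. The alpha, beta and gamma values below are the ones reported by the EfficientNet paper's grid search, used only as an example; they are not derived in this video.

```python
# Coefficients from the EfficientNet paper (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

# The constraint that one step of phi roughly doubles the FLOPs.
assert abs(ALPHA * BETA**2 * GAMMA**2 - 2.0) < 0.1

def compound_scale(phi: int):
    """Depth, width and resolution multipliers for a given phi."""
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    flops_ratio = d * (w ** 2) * (r ** 2)   # approximately 2 ** phi
    return d, w, r, flops_ratio

for phi in range(1, 5):
    d, w, r, ratio = compound_scale(phi)
    print(f"phi={phi}: d={d:.2f}, w={w:.2f}, r={r:.2f}, "
          f"FLOPs ratio ~{ratio:.2f} (target {2**phi})")
```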
In practice, we would set phi depending on the capability of our hardware. The results are encouraging. The x-axis is complexity, the y-axis is accuracy. You can see that the red line, which is the family of EfficientNet models, has low complexity and high performance; these models use several times fewer parameters than their counterparts. While it is applied to ResNets here, you can also try this with your own ConvNet architectures to see performance boosts. So I hope that all made sense. There's a lot more content coming in the future. Subscribe for more, and I will see you very soon. Bye-bye!