Convolution is a measure of the overlap between two functions as one slides over the other. Mathematically, it is a sum of products. The standard convolution operation is slow to perform. However, we can speed it up with an alternative method that is the topic of this video: depth-wise separable convolution.

Let's first very quickly go over the basics of convolution on an input volume. Consider an input volume F of shape df × df × m, where df is the width and height of the input volume and m is the number of input channels. If a color image were the input, m would be 3, one channel each for R, G, and B. We convolve this input with a kernel K of shape dk × dk × m, which gives us an output of shape dg × dg × 1. If we apply n such kernels to the input, we get an output volume G of shape dg × dg × n. A convolution operation takes the sum of products of the input and the kernel to return a scalar, and the operation is repeated by sliding the kernel over the input. I've explained this concept in detail in my video on convolutional neural networks; check that out for a clearer understanding.

What I'm more concerned with now is the cost of this convolution operation, so let's take a look at that. We can measure the computation required for convolution by counting the number of multiplications. Why multiplications? Because multiplication is an expensive operation relative to addition. So let's determine the number of multiplications. For one convolution operation, the number of multiplications equals the number of elements in the kernel, which is dk × dk × m. But we slide this kernel over the input: we perform dg convolutions along the width and dg convolutions along the height, hence dg × dg convolutions overall. So the number of multiplications for convolving one kernel over the entire input F is dg² × dk² × m. And this is for just one kernel; if we have n such kernels, the total number of multiplications becomes n × dg² × dk² × m.
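As a quick sanity check of that count, here is a minimal Python sketch. The concrete sizes (df = 32, m = 3, dk = 3, n = 64) are illustrative choices of mine, not from the video, and dg = df - dk + 1 assumes a stride of 1 with no padding:

```python
def standard_conv_mults(dg, dk, m, n):
    # One kernel position costs dk * dk * m multiplications (one per weight).
    # The kernel visits dg x dg positions, and there are n kernels.
    return n * dg**2 * dk**2 * m

# Illustrative numbers (not from the video): a 32x32x3 input,
# 3x3 kernels, 64 output channels, stride 1, no padding.
df, dk, m, n = 32, 3, 3, 64
dg = df - dk + 1  # = 30
print(standard_conv_mults(dg, dk, m, n))  # 1555200
```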
Let's now take a look at depth-wise separable convolutions. In standard convolution, the application of filters across all input channels and the combination of these values are done in a single step. Depth-wise separable convolution, on the other hand, breaks this down into two parts: depth-wise convolution, which performs the filtering stage, and point-wise convolution, which performs the combining stage.

Let's get into some details here. Depth-wise convolution applies convolution to a single input channel at a time, unlike standard convolution, which applies it to all channels at once. Let us take the same input volume F to understand this process. F has shape df × df × m, where df is the width and height of the input volume and m is the number of input channels, as I mentioned before. For depth-wise convolution, we use filters, or kernels, K of shape dk × dk × 1. Here, dk is the width and height of the square kernel, and it has a depth of 1 because the convolution is applied to one channel only, unlike standard convolution, which is applied through the entire depth. And since we apply one kernel to a single input channel, we require m such dk × dk × 1 kernels for the entire input volume F. Each of these m convolutions gives us an output of shape dg × dg × 1.

Now, stacking these outputs together, we have an output volume G of shape dg × dg × m. This is the end of the first phase, depth-wise convolution. It is followed by point-wise convolution, which performs a linear combination of each of these layers. Here, the input is the volume of shape dg × dg × m, and the filter Kpc has shape 1 × 1 × m; this is basically a 1 × 1 convolution operation over all m layers. The output thus has the same width and height as the input, dg × dg, for each filter. If we use n such filters, the output volume becomes dg × dg × n.

So that's great; we've got this down. Now, let's take a look at the complexity of this convolution. We can split it into two parts, as we have two phases. First, we compute the number of multiplications in depth-wise convolution. Here, the kernels have shape dk × dk × 1, so the number of multiplications in one convolution operation is dk × dk, that is, dk². Sliding over an input channel, this convolution is performed dg × dg times, so the number of multiplications for one kernel over one input channel becomes dg² × dk². Such convolutions are applied over all m input channels, each with its own kernel, and hence the total number of multiplications in the first phase, depth-wise convolution, is m × dg² × dk².

Next, we compute the number of multiplications in the second phase, point-wise convolution. Here, the kernels have shape 1 × 1 × m, where m is the depth of the input volume, and hence the number of multiplications for one instance of convolution is m. This is applied over the entire output of the first phase, which has a width and height of dg, so the total number of multiplications for one kernel is dg × dg × m. For n kernels, we have n × dg² × m such multiplications. Thus, the total number of multiplications for depth-wise separable convolution is the sum over the two phases, m × dg² × dk² + n × dg² × m, and we can factor out m × dg² to get m × dg² × (dk² + n).

Now we compare standard convolution with depth-wise separable convolution. Taking the ratio, m × dg² × (dk² + n) divided by n × dg² × dk² × m, we get the sum of the reciprocal of the output depth and the reciprocal of the squared kernel dimension: 1/n + 1/dk². To put into perspective how effective depth-wise separable convolution is, let us take an example. Consider an output feature depth n of 1024 and a kernel of size 3, that is, dk = 3. Plugging these values into the relation, we get 0.112. In other words, standard convolution requires roughly 9 times as many multiplications as depth-wise separable convolution. That is a significant saving in computation.

We can also quickly compare the number of parameters in the two convolutions. In standard convolution, each kernel has dk × dk × m learnable parameters, and since there are n such kernels, there are n × m × dk² parameters in total. In depth-wise separable convolution, we split this once again into two parts: the depth-wise convolution phase uses m kernels of shape dk × dk × 1, and the point-wise convolution phase uses n kernels of shape 1 × 1 × m. So the total is m × dk² + m × n, or, taking m common, m × (dk² + n). Taking the ratio against standard convolution, we get the same ratio as we did for the computation required.
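To make this concrete in code, here is a small Python sketch that mirrors the formulas above. The values dg = 56 and m = 512 are arbitrary placeholders of mine (they cancel out of the ratio); n = 1024 and dk = 3 are the numbers from the example:

```python
def standard_mults(dg, dk, m, n):
    # n kernels of shape dk x dk x m, each applied at dg x dg positions.
    return n * dg**2 * dk**2 * m

def separable_mults(dg, dk, m, n):
    depthwise = m * dg**2 * dk**2   # m kernels of shape dk x dk x 1
    pointwise = n * dg**2 * m       # n kernels of shape 1 x 1 x m
    return depthwise + pointwise

dg, dk, m, n = 56, 3, 512, 1024
ratio = separable_mults(dg, dk, m, n) / standard_mults(dg, dk, m, n)
print(round(ratio, 3))           # 0.112
print(round(1/n + 1/dk**2, 3))   # 0.112 -- matches the closed-form ratio

# Parameter counts show the same ratio:
print(m * dk**2 + m * n)         # depth-wise separable: m * (dk^2 + n)
print(n * m * dk**2)             # standard: n * m * dk^2
```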
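And for a view of the two-step structure in practice, here is a minimal sketch using PyTorch (assuming it is available; the class name and the concrete sizes are my own, not from the video). The `groups=m` argument of `nn.Conv2d` is what restricts each depth-wise kernel to a single input channel:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise filtering followed by point-wise (1x1) combining."""
    def __init__(self, m, n, dk):
        super().__init__()
        # Depth-wise: groups=m gives one dk x dk x 1 kernel per input channel.
        self.depthwise = nn.Conv2d(m, m, kernel_size=dk, groups=m, bias=False)
        # Point-wise: n kernels of shape 1 x 1 x m combine the m channels.
        self.pointwise = nn.Conv2d(m, n, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Illustrative sizes (not from the video): df=32, m=3, n=64, dk=3.
x = torch.randn(1, 3, 32, 32)                     # (batch, m, df, df)
out = DepthwiseSeparableConv(m=3, n=64, dk=3)(x)
print(out.shape)                                  # torch.Size([1, 64, 30, 30])
```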
So we've understood exactly what depth-wise separable convolution is, and also its computational cost relative to traditional standard convolution. But where exactly has this been used? Well, there are some very interesting papers here.

The first is on multimodal neural networks: networks designed to solve multiple problems using a single network. A multimodal network has four parts. The first is modality nets, which convert different input types to a unified internal representation. Then we have an encoder to process inputs, a mixer to combine encoded inputs with previous outputs, and a decoder to generate outputs. A fundamental component of each of these parts is depth-wise separable convolution, and it works effectively in such large networks.

Next up, we have Xception, a convolutional neural network architecture based entirely on depth-wise separable convolution layers. It has shown state-of-the-art performance on large datasets like Google's JFT image dataset, a repository of 350 million images with 17,000 class labels. To put this into perspective, the popular ImageNet took three days to train on; training on even a subset of this JFT dataset took a month, and the model didn't converge. In fact, it would have taken approximately three months to converge had they let it run to full length. This paper is pushing convolutional neural networks to use depth-wise separable convolution as a de facto standard.

Third, we have MobileNets, a neural network architecture that strives to minimize the latency of small-scale networks so that computer vision applications run well on mobile devices. MobileNets use depth-wise separable convolutions in their 28-layer architecture. The paper compares the performance of MobileNets built with standard convolution layers versus depth-wise separable convolution layers. It turns out the accuracy on ImageNet drops by only 1% while using a significantly smaller number of parameters: from 29.3 million, the parameter count comes down to just 4.2 million. Looking at the mult-adds, the number of multiplications and additions, which is a direct measure of computation, we see it has also decreased significantly for the depth-wise separable MobileNet.

So here are some things to remember from this video. First, depth-wise separable convolution decreases the computation and the number of parameters compared to standard convolution. Second, depth-wise separable convolution is a combination of depth-wise convolution followed by point-wise convolution; depth-wise convolution is the filtering step, and point-wise convolution can be thought of as the combining step. Finally, it has been successfully used in neural network architectures like multimodal networks, Xception, and MobileNets.

And that's all I have for you now. Thank you all for stopping by today. If you liked the video, hit that like button. If you want to stick around, hit that subscribe button. If you really want to stick around, hit that bell icon next to the subscribe button to be notified of my uploads immediately. Links to the important papers are down below, so check them out. Have a good day, and I'll see you in the next one. Bye.