Convolutional Neural Networks were introduced by Yann LeCun in 1998 and are used in so many applications today, from image classification to audio synthesis in WaveNet. "Convolutional Neural Network" is one of those phrases that people like to say but don't really know much about. We're going to go through this topic and fill in some gaps in knowledge you may have in the field. Let's start with the basic question: what are convolutional neural networks, and why use them? They are a special kind of neural network for processing data that has a known grid-like topology. This could be one-dimensional time series data, which is a grid of samples over time, or two-dimensional image data, a grid of pixels in space.

Convolutional Neural Networks have three fundamental features that reduce the number of parameters in a neural network. Up first, we have sparse interactions between layers. In typical feed-forward neural nets, every neuron in one layer is connected to every neuron in the next. This leads to a large number of parameters that the network needs to learn, which in turn can cause other problems. For one, too many parameters to estimate means we'll need a lot of training data. Convergence time also increases, and we may end up with an overfitted model. CNNs reduce the number of parameters by relying on indirect interactions. Consider a CNN in which a neuron in layer 5 is connected to three neurons in the previous layer, as opposed to all of them. These three neurons in layer 4 are direct interactions for the neuron in layer 5, because they affect it directly. They constitute the receptive field of the layer 5 neuron in question. Furthermore, these layer 4 neurons have receptive fields of their own in the previous layer, layer 3. Those layer 3 neurons directly affect the layer 4 neurons and indirectly affect the layer 5 neuron. If the layers run deep, we don't need every neuron to be connected to every other to carry information throughout the network. In other words, sparse interactions between layers should suffice.

A second advantage of CNNs is parameter sharing, which further reduces the parameter count, just as sparse interactions did. It is important to understand that CNNs create spatial features. Say we have an image which, after passing through the convolution layer, gives rise to a volume. Then a section of this volume taken through the depth will represent features of the same part of the image. Furthermore, each feature in the same depth layer is generated by the same filter that convolves the image. Don't worry if this statement doesn't make sense right now; it'll be clear when I explain more about the convolution operation. For now, just understand that the individual points in the same depth slice of the feature map, that is, the output 3D volume, are created from the same kernel. In other words, they are created using the same set of shared parameters. This drastically reduces the number of parameters to learn compared to a typical ANN.

Another feature of convolutional neural networks is equivariant representation. A function f is said to be equivariant with respect to another function g if f(g(x)) = g(f(x)) for an input x. Let's take an example to understand this. Consider some image I, where f is the convolution operation and g is the image translation operation. Convolution is equivariant with respect to translation: convolving the image and then translating the result gives the same output as first translating the image and then convolving it.
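Here's a quick numpy sketch of that property in one dimension. It uses circular (wrap-around) convolution so the equality holds exactly even at the boundaries; the signal and kernel values are arbitrary stand-ins chosen just for this illustration.

```python
import numpy as np

def circular_conv(x, k):
    # Convolve x with kernel k, wrapping around the edges (circular convolution),
    # so translation equivariance holds exactly when paired with np.roll.
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(len(k)))
                     for i in range(n)])

x = np.random.rand(8)           # a 1D "signal" (e.g., a short time series)
k = np.array([1.0, -2.0, 1.0])  # a small edge-like kernel

shift_then_conv = circular_conv(np.roll(x, 3), k)   # f(g(x))
conv_then_shift = np.roll(circular_conv(x, k), 3)   # g(f(x))
assert np.allclose(shift_then_conv, conv_then_shift)  # equivariance: f(g(x)) == g(f(x))
```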
Now, why is this useful? The equivariance of convolution with respect to translation is what lets us share parameters across the data. In image data, for example, the first convolution layer usually focuses on edge detection. However, similar edges may occur throughout the image, so it makes sense to represent them all with the same parameters.

Now that we understand some features of CNNs, let's take a look at the types of layers we have in a convolutional neural network. We can broadly classify them into convolution, activation, pooling, and fully connected layers. We'll discuss these one at a time.

Up first, we have the convolutional layer. The convolutional layer is usually the first layer of a CNN, where we convolve the image, or the data in general, using filters or kernels. Filters are small units that we apply across the data through a sliding window. The depth of a filter is the same as the depth of the input, so for a color image, whose RGB values give it a depth of 3, a filter of depth 3 would be applied. The convolution operation involves taking the element-wise product of the filter and the image and summing those values, for every position of the sliding window. The output of the convolution of a 3D filter with a color image is a 2D matrix.

It is important to note that convolution is not only applicable to images, though; we can also convolve one-dimensional time series data. I'll explain this a little more mathematically. Consider a one-dimensional convolution. If the input F and the kernel G are both functions of one-dimensional data, like time series data, then their convolution is given by the first equation below. It looks like fancy math, but it has meaning: the equation measures how much the filter G, flipped and shifted by t, overlaps the input F, accumulated over all values of tau. Since tau less than 0 is meaningless, and tau greater than t represents the value of the function in the future, which we don't know, we can apply tighter bounds to the integral. This gives a single entry in the 1D convolved tensor: the t-th entry. To compute the complete convolved tensor, we iterate t over all possible values.

In practice, we may have multi-dimensional input tensors that require multi-dimensional kernel tensors. In this case, consider an image input I and a kernel H, whose convolution can be written in the two forms below. The first form performs convolution by sliding the image over the kernel, and the second form does it the other way around: it slides the kernel over the image. Since there are usually fewer possible values for x and y in the kernel than in the image, we use the second form for convolution. The result at each position is a scalar value. We repeat the process for every point (x, y) for which the convolution exists on the image. These values are stored in a convolved matrix, represented by I asterisk H. This output constitutes a part of a feature map. This is the mathematical representation of the image-kernel convolution I explained before, where we take the sum of products.
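The equations themselves are only shown on screen in the video, so here is a reconstruction in LaTeX of the standard forms the narration describes (the notation follows the Deep Learning book's convolution chapter; the exact on-screen notation may have differed):

```latex
% 1D convolution of input F with kernel G, with the tightened bounds [0, t]:
(F * G)(t) = \int_{0}^{t} F(\tau)\, G(t - \tau)\, d\tau

% 2D discrete convolution of image I with kernel H.
% First form: slide the image over the kernel.
(I * H)(x, y) = \sum_{i}\sum_{j} I(i, j)\, H(x - i, y - j)
% Second form (equal by commutativity): slide the kernel over the image.
(I * H)(x, y) = \sum_{i}\sum_{j} I(x - i, y - j)\, H(i, j)
```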
Okay, we've got that down, but let me explain it in a way that gives better intuition. Consider a number of learnable filters. They are small spatially, with respect to width and height, but extend through the full depth. In the first layer, we convolve every filter with the image. This means each filter slides along the image and outputs a convolved 2D activation map, or feature map. Note that the depth of the filter is equal to the depth of the input volume. In the case of the first convolution layer, where we're dealing with an input image of depth 3, we'll have filters of size n x n x 3. And by the way, the depth of 3 for the input image is because of the R, G, and B channels.

I'll give you a concrete example here. Consider the MNIST dataset, with 60,000 28 x 28 images. If we perform 2D convolutions with a 3 x 3 filter and a stride of 1, stride being the amount that we move the sliding window after every convolution, then we'll end up with a feature map of size 28 minus 3 plus 1, which is 26 x 26. Now, if we apply, say, 32 such filters to the image, we'll end up with 32 such 26 x 26 feature maps. These are stacked along their depth to create an output volume, in this case of size 26 x 26 x 32. This is the feature map I mentioned earlier. So what do the values in the output represent? Take the top left corner through the depth, a 1 x 1 x 32 section: each of these 32 numbers corresponds to a different feature of the same 3 x 3 section of the original image. These features are dictated by the kernels, or filters, used in the convolution. Some can be structured to find curves, others to find sharp edges, others texture, and so on. In reality, some or most of the features that are used, especially in the deeper layers of a convolutional neural network, are not interpretable by humans. There's much more to this, but I will link several resources down in the description below. I strongly recommend the chapter on convolutional neural networks in the Deep Learning book by Ian Goodfellow.
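To make those shapes concrete, here's a minimal numpy sketch of that convolution step. The `conv2d_valid` helper is written just for this illustration, and random values stand in for real MNIST pixels and learned filter weights:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    # Slide the kernel over the image and take the sum of element-wise
    # products at each position ("valid": no padding, never slide off the edges).
    H, W = image.shape
    h, w = kernel.shape
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+h, j*stride:j*stride+w]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(28, 28)                        # one MNIST-sized image (stand-in values)
filters = [np.random.randn(3, 3) for _ in range(32)]  # 32 "learned" 3x3 kernels (stand-ins)
volume = np.stack([conv2d_valid(image, k) for k in filters], axis=-1)
print(volume.shape)  # (26, 26, 32): 28 - 3 + 1 = 26, and depth = number of filters
```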
For now, I think I've spoken enough about convolution, so I'll move on to the next layer: the activation layer. Only nonlinear activation functions are used between subsequent convolutional layers, because there won't be any learning if we just use linear activations. What do I mean by that? Let A1 and A2 be two subsequent convolutional filters applied to X without a nonlinear activation between them. Because of the associativity of convolution, these two layers are only as effective as a single layer. The same holds true for typical artificial neural networks: ten layers of a typical ANN without activation functions are as effective as just a single layer. Typically, we use leaky ReLU instead of the plain ReLU activation in order to avoid dead neurons and the dying ReLU problem. Note that the activation isn't necessarily executed right after convolution. Many papers follow the convolution, pooling, activation order, but this isn't strictly the case.

Let's take a look at the next layer that I just mentioned: pooling. Pooling involves downsampling the features so that we need to learn fewer parameters during training. Typically, there are two hyperparameters introduced with the pooling layer. The first is the dimension of the spatial extent, in other words, the value of n for which we take an n x n feature representation and map it to a single value. The second is the stride, which is how many features the sliding window skips along the width and height, similar to what we saw in convolution. A common pooling layer uses a 2 x 2 max filter with a stride of 2, which makes it a non-overlapping filter. A max filter returns the maximum value among the features in the region. Average filters, which return the average of the features in the region, can also be used, but max pooling works better in practice.

Since pooling is applied to every depth slice of the 3D volume independently, the depth of the feature map after pooling remains unchanged. Performing pooling also reduces the chance of overfitting, since the downstream layers have fewer parameters to learn. Consider the MNIST example after the output from the convolution layer I discussed previously: we have a 26 x 26 x 32 volume. Using a max pool layer with 2 x 2 filters and a stride of 2, this volume is reduced to a 13 x 13 x 32 feature map. Clearly, we've reduced the number of features to 25% of the original number, which is a significant decrease.

Okay, now let's talk about connecting this to a fully connected layer. So here's a question: what exactly is the use of fully connected layers? The output from the convolution layers represents high-level features in the data. While that output could be flattened and connected directly to the output layer, adding a fully connected layer is usually a cheap way of learning nonlinear combinations of these features. Essentially, the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space, and the fully connected layer is learning a possibly nonlinear function in that space.

So how do we convert the output of a pooling layer into an input for the fully connected layers? The output of a pooling layer is a 3D feature map, a 3D volume of features, but the input to a simple fully connected feed-forward neural network is a one-dimensional feature vector. These 3D volumes are usually very deep at this point, because of the increased number of kernels introduced at every convolutional layer. I say every convolutional layer because convolution, activation, and pooling layers can occur many times before the fully connected layers, and that is the reason for the increased depth. To convert this 3D volume into one dimension, we want the output width and height to be one, which is done by flattening the 3D volume into a 1D vector. Consider our MNIST example again, where the output feature map is 13 x 13 x 32. By flattening this volume, we end up with a vector of size 13 times 13 times 32, or 5408 x 1. That is a 5408-dimensional vector, which is an admissible form for the fully connected layers. From here, we can proceed as we would for fully connected layers. For classification problems, this involves introducing hidden layers and applying a softmax activation to the last layer of neurons.
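Here's a small numpy sketch of that pooling-then-flattening step, continuing from the 26 x 26 x 32 volume computed in the earlier snippet (again with random values standing in for real activations):

```python
import numpy as np

def max_pool2d(volume, size=2, stride=2):
    # Non-overlapping 2x2 max pooling applied independently to every depth
    # slice; only width and height shrink, the depth is preserved.
    H, W, D = volume.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j, :] = window.max(axis=(0, 1))  # max over the spatial window
    return out

volume = np.random.rand(26, 26, 32)  # stand-in for the conv layer output
pooled = max_pool2d(volume)
print(pooled.shape)              # (13, 13, 32)
print(pooled.reshape(-1).shape)  # (5408,) -- the flattened vector fed to the FC layers
```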
Now that we've got all the theory down, let's take a look at a CNN structure for the complete MNIST dataset. We start with 28 x 28 grayscale images. Each image is passed through a convolutional layer where we apply 32 3 x 3 filters. Note that the depth of the filters is the same as that of the input, which is 1 in this case for grayscale images. For a stride of 1, the output width of the convolution is the width of the image minus the width of the filter plus 1, and the same formula applies to the height, so the output has a width and height of 28 minus 3 plus 1, which is 26. Since the depth is equal to the number of filters, the output of the convolution is a 3D volume of dimensions 26 x 26 x 32. We can then pass this into a ReLU or leaky ReLU activation function to get an intermediate feature map. Activation doesn't change the spatial dimensions, so we still have a 26 x 26 x 32 volume. After activation, we pass this into a max pooling layer with 2 x 2 filters and a stride of 2. Such a non-overlapping pooling layer is a common design choice.

In Keras, we can pass a padding parameter while pooling. It takes one of two values. The first is "valid", which means there's no padding, so we never slide the kernel off the borders of the image. The second is "same", in which we pad the input such that the pooling layer can cover the entire image. Let's consider valid pooling, where the output width is the input width minus the width of the filter, divided by the stride, plus 1. Since we don't introduce padded pixels, the division is floored. So the output width is floor((26 minus 2) divided by 2) plus 1, which is 13. The output height is the same, and the depth remains unchanged. This leads to an output volume of 13 x 13 x 32.

After this, say we perform another round of convolution, activation, and pooling. For the convolution, let's use 64 filters of size 3 x 3. In this case the output is 11 x 11 x 64. Why? Because the output width is the input width minus the filter width plus 1 for a convolution of stride 1, which is 13 minus 3 plus 1, and the depth is equal to the number of filters, which is 64. We can pass this through an activation layer, which doesn't change its dimensions. The next max pooling layer reduces the size to 5 x 5 x 64, as the output width and height are floor((11 minus 2) divided by 2) plus 1, which is 5, and the depth is preserved.

We can now feed the output of this pooling layer to a fully connected layer by flattening it to one dimension. So we have 5 times 5 times 64, or 1,600, input features for our ANN. We can then connect these to a hidden layer of, say, 512 neurons, apply dropout, and output to a softmax layer of 10 neurons, which give the probabilities of the digits 0 through 9. And so we have a full-fledged convolutional neural network; a Keras sketch of this exact stack is included at the end of this transcript.

CNNs are being used everywhere. For example, to generate data like images, audio, or text in generative adversarial networks, the generator is modeled as a convolutional neural network. They're also used in gameplay, like Go and the Atari games, through deep Q-learning, which combines CNNs with Q-learning from reinforcement learning. In the field of medicine, they are used in medical imaging, from the segmentation of knee cartilage to the detection of Alzheimer's disease in MRIs. Clearly, there's a lot of potential for CNNs in various fields. I've included links to interesting blogs and papers about CNNs in the description down below. I hope you guys now understand more about what goes on inside convolutional neural networks. It's not magic, and every component has its purpose. Give the video a thumbs up and hit that subscribe button for more awesome content, and I will see you in the next one. Bye-bye.
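As promised, here's a minimal Keras sketch of the full architecture walked through above. The layer sizes come straight from the walkthrough; the dropout rate is an assumption, since the narration doesn't specify one.

```python
# A minimal sketch of the MNIST CNN described in the walkthrough.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # 28 x 28 grayscale MNIST images
    layers.Conv2D(32, (3, 3), activation='relu'),  # -> 26 x 26 x 32
    layers.MaxPooling2D((2, 2)),                   # -> 13 x 13 x 32 (padding='valid' is the default)
    layers.Conv2D(64, (3, 3), activation='relu'),  # -> 11 x 11 x 64
    layers.MaxPooling2D((2, 2)),                   # -> 5 x 5 x 64
    layers.Flatten(),                              # -> 1600 features
    layers.Dense(512, activation='relu'),          # hidden fully connected layer
    layers.Dropout(0.5),                           # assumed rate; not given in the narration
    layers.Dense(10, activation='softmax'),        # probabilities for digits 0 through 9
])
model.summary()  # prints the layer shapes, matching the numbers above
```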