Hello, everyone. AJ here. So as you can see, there's something different. Can you guess what it is? Did you guess? You can see my face! That's right. A few things before I start this video: I've got a little microphone in front of me, so most of my audio is coming from right here, but the room itself has a lot of echo. I haven't really adjusted the setup yet, but it'll get there. Got my little camera too. Hi guys! I can see you, you can see me, so this works so far. In this video I'll be switching between my face and the screen, and we'll see how that goes. I'm going to try changing my setup every video so that I get the best feel, and eventually, hopefully, I'll settle on one location. For now it's my little living room, which is why there's a lot of echo; the floors are wooden. Then I'll probably go back to the room where I usually record, because the carpeting is just a lot better for ambient noise. So that's all I had to say. Enjoy the video.

You know the typical convolutional neural network architecture, the one you see in those getting-started tutorials for PyTorch or any other deep learning library: a stack of convolution, pooling, and activation layers followed by some fully connected layers and a softmax for classification. Yeah, I think you know what I mean. That's just one of the basic neural network architectures, based on the LeNet family. In this video, we're going to talk about different such neural network architectures, emphasizing exactly why they were introduced and what new concepts they brought to the field that make convolutional network research what it is today.

We'll start with the beginning of convolutional neural networks: Yann LeCun's pioneering paper from 1998. In this paper, he introduces a class of neural network architectures, the LeNet family, with LeNet-5 being one of the most common forms that we see to this day. LeNet-5 is a seven-layer neural network architecture, excluding the input, that consists of two alternating pairs of convolution and pooling layers, followed by three fully connected layers at the end. It used convolution to preserve the spatial arrangement of features, average pooling to downsample the feature space, and tanh and sigmoid activations between layers. LeNet-5 obtained a 99% test accuracy on the MNIST dataset.
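To make that concrete, here's a minimal PyTorch-style sketch of a LeNet-5-like network. Treat it as an illustration rather than the exact original: LeCun's network used scaled tanh activations and an RBF-style output layer, which I've approximated here with plain tanh and a linear classifier (the softmax is applied by the loss function).

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Rough LeNet-5-style network: two conv/avg-pool stages, three fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # softmax is applied by the loss (e.g. CrossEntropyLoss)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Example: a batch of 4 grayscale 32x32 images
logits = LeNet5()(torch.randn(4, 1, 32, 32))   # shape: (4, 10)
```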
Understand that this was a time before GPUs, so computers were not able to process such large amounts of data within a reasonable amount of time. Furthermore, we didn't have such large datasets to begin with, and this is the reason why neural networks didn't take off until around 2010. So what exactly happened? Well, back in the day, it was considered that a better algorithm would always yield better results regardless of data, but we know today that this theory is flawed. Fei-Fei Li, then a professor at UIUC, agreed, claiming the best algorithm wouldn't work well if the data it learned from didn't reflect the real world. She then said, "We decided we wanted to do something that was completely historically unprecedented. We're going to map out the entire world of objects." The resulting dataset: ImageNet. From 2010 onwards, an annual ImageNet competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), has been held between researchers to determine which algorithm yields the best performance. Many consider this the dawn of the AI boom.

I bring ImageNet into this discussion because the winners every year always provided something new to the field that would influence the architectures that succeeded them. The 2012 winner of this competition was AlexNet. The architecture consists of eight learned layers: five convolution layers followed by three fully connected layers. Some of the features in this architecture include, well, first, the introduction of the ReLU nonlinearity. The rectified linear unit is a non-saturating nonlinearity. A non-saturating function f has the property that it tends to positive or negative infinity as its input tends to positive or negative infinity; a function f is saturating if it is not non-saturating. Examples of saturating functions are tanh, which ranges from minus one to one, and sigmoid, which ranges from zero to one. Deep convolutional networks with ReLU train several times faster than their tanh-activated counterparts, as shown in this graph: the solid lines show the performance of ReLU and the dashed lines show that of tanh. On the CIFAR-10 dataset, the former was about six times faster than its saturating counterpart. This allows us to train the huge neural networks we see today in reasonable amounts of time. It's very interesting, right?

Now, the second feature that we see in AlexNet is that it can be trained on multiple GPUs; AlexNet specifically is trained on two GPUs. Their parallelization scheme puts half of the neurons on each GPU, with one additional trick: the GPUs communicate only in certain layers. For example, the neurons of layer three take input from all feature maps in layer two, but neurons in layer four take input only from those feature maps in layer three which reside on the same GPU.

Now, a third feature of AlexNet is normalization. ReLUs have the desirable property that they don't require input normalization to prevent them from saturating: if at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, a local normalization scheme still helps generalization, like the one shown on screen. Using this to train a four-layer CNN on the CIFAR-10 dataset, they achieved a 13% test error rate without normalization and an 11% test error rate with normalization. So that's an improvement.

The fourth and final feature of AlexNet is overlapping pooling. Pooling is used to downsample the feature space, and AlexNet performed pooling with a 3×3 window but a stride of 2, so there is an overlap between subsequent positions of the kernel. With this overlapping pooling scheme, the top-1 and top-5 error rates dropped by 0.4% and 0.3%, respectively.
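Here's a rough PyTorch sketch of a single AlexNet-style first stage, just to show three of these ingredients together: ReLU, local response normalization, and overlapping 3×3 max pooling with stride 2. The hyperparameters follow the paper's first layer, but the two-GPU split and the rest of the network are omitted, and I'm using a 227×227 input simply so the shape arithmetic works out cleanly.

```python
import torch
import torch.nn as nn

# One AlexNet-style stage: ReLU instead of tanh/sigmoid, local response
# normalization, and overlapping 3x3 max pooling with stride 2.
stage = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),               # first conv layer
    nn.ReLU(inplace=True),                                     # non-saturating nonlinearity
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),  # local normalization scheme
    nn.MaxPool2d(kernel_size=3, stride=2),                     # window > stride -> overlapping pooling
)

out = stage(torch.randn(1, 3, 227, 227))
print(out.shape)   # (1, 96, 27, 27)
```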
Now let's take a look at the overall architecture. I'll explain this very briefly. Like I said before, it is an eight-layer architecture with five convolution layers and three fully connected layers. The last layer is passed through a softmax activation for 1000 classes. Here, the top set of layers runs on the first GPU and the bottom set of layers runs on the second GPU. The kernels of the second, fourth, and fifth convolution layers are connected only to those kernel maps in the previous layer which reside on the same GPU, while the kernels of the third convolution layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response normalization layers follow the first and second convolution layers. Max pooling layers follow both response normalization layers as well as the fifth convolution layer. The ReLU nonlinearity is applied to the output of every convolutional and fully connected layer. Hope you understood AlexNet. I'm emphasizing it a bit more than the other architectures because it has a large influence on the larger neural networks that we will see later.

Up next, we have the 2013 ILSVRC winner by Zeiler and Fergus: Clarifai. It's a refinement of AlexNet with an eight-layer architecture, where each layer performs several operations. In the first layer, the 224×224 input image is convolved with 96 7×7 filters. This is followed by a ReLU activation, which doesn't change the output volume shape. Next, overlapping max pooling is applied with a 3×3 sliding window and a stride of 2, and finally we apply contrast normalization. This set of operations yields the output feature volume for layer 1. A stack of the same operations is performed in layers 2 through 5. Layers 6 and 7 are fully connected layers, and layer 8 is the output softmax layer that performs the image classification.

Next up, we introduce the runner-up of the 2014 ImageNet challenge, the Visual Geometry Group from the University of Oxford. One of the major changes in their VGGNet architecture is a smaller kernel size. LeNets used 5×5 kernels, AlexNet used 11×11 kernels in its first convolution layer, and Clarifai used 7×7 kernels. In VGGNets, the image is passed through a stack of convolution layers where the filters have a very small receptive field of 3×3. This is the smallest size required to capture the notion of left, right, up, down, and center, and it leads to a significant decrease in parameters. Spatial pooling is carried out by five max pooling layers, which follow some of the conv layers; I say some and not all, as not every conv layer is followed by max pooling. Max pooling is performed over a 2×2 pixel window with a stride of 2, a typical configuration we see today. All hidden layers are followed by the ReLU nonlinearity, following from AlexNet.

In this figure, we see the architectures of the different configurations in the VGG family, labeled A through E. A has the smallest number of weighted layers, 11, with 8 convolution layers and 3 FC layers, and architecture E has the largest number of weighted layers, 19, with 16 convolution layers and 3 FC layers. The architecture also plays with 1×1 convolutions, as we see in configuration C. This increases the nonlinearity of the decision function without affecting the receptive fields, so it can effectively model more complex problems without significantly increasing the number of learned parameters. We can also see that the number of parameters is no greater than in predecessor architectures that demonstrated revolutionary performance in image localization at ILSVRC 2013. This is attributed to the decreased kernel sizes for convolution and the use of 1×1 convolutions.
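To see where the parameter savings come from, here is a small, illustrative comparison in PyTorch: one 7×7 convolution versus a stack of three 3×3 convolutions, which covers the same 7×7 receptive field. The channel count of 64 is arbitrary and only there to make the arithmetic concrete.

```python
import torch.nn as nn

C = 64  # channels in and out, chosen only for illustration

# One 7x7 convolution vs. a stack of three 3x3 convolutions (same 7x7 receptive field).
big_kernel = nn.Conv2d(C, C, kernel_size=7, padding=3)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(n_params(big_kernel))    # 7*7*C*C + C      = 200,768 for C=64
print(n_params(stacked_3x3))   # 3*(3*3*C*C + C)  = 110,784, plus two extra nonlinearities
```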
The next CNN architecture is the Network in Network (NIN) architecture. It makes use of 1×1 convolutions with multi-layer perceptron layers. In normal convolution, we apply an inner product of a number of kernels with the input feature volume to get an output feature volume, typically followed by a nonlinear activation, and such convolution layers are usually stacked in order to learn more complex features. However, stacking convolution layers can lead to really deep networks that greatly increase the number of parameters to learn. So what if we replace these stacked convolutions with a micro network? Well, Network in Network does exactly that, using a multi-layer perceptron (MLP). An MLP is the choice here because it is itself a network trainable by backprop, so it meshes with the existing CNN components, and it significantly reduces the number of parameters to learn without stacking a pile of convolution layers like what we saw in the VGGNets. It is called Network in Network because there are many of these micro networks inside the overall network. Another feature of NIN is that it uses global average pooling. This, when used with the micro networks, creates more interpretable classification results relative to traditional CNNs. Furthermore, fully connected layers are more prone to overfitting and hence rely on dropout as a means of regularization; global average pooling, on the other hand, is a structural regularizer, so it inherently avoids overfitting by its very structure.

This Network in Network architecture inspired a seminal work in the field known as GoogLeNet. Let us talk about the ILSVRC 2014 winner, GoogLeNet. Its 22-layer-deep neural network architecture used 12 times fewer parameters than AlexNet, yet yielded much higher performance. The leading approach for object detection at the time was regions with convolutional neural networks, R-CNNs. GoogLeNet was used for object detection as well, and it attributes those gains to the combination of deeper networks and the R-CNN algorithm.

A fundamental unit of this network is the Inception module, which heavily uses the 1×1 convolutions from the Network in Network architecture. These 1×1 convolutions are typically used for dimensionality reduction. To avoid patch or kernel alignment issues, the Inception module makes use of small kernels of size 1×1, 3×3, and 5×5, the outputs of which are concatenated to form the input to the next stage. These Inception units are stacked one on top of the other throughout the network. However, there is a problem: the deeper we go, the higher-level and more complex the features the network extracts, so each of these 3×3 and 5×5 filters corresponds to smaller and smaller regions as we go deeper into the network. We thus need more parameters, and estimating them can lead to a computational blow-up within just a few layers. To address this, Inception performs reductions through 1×1 convolutions before carrying out each of the expensive 3×3 and 5×5 convolutions. These 1×1 convolutions additionally come with a rectified linear activation, allowing them to model more complex operations, so we don't need to overload the network with Inception modules.

As far as performance is concerned, GoogLeNet obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. It was about 40% better than Clarifai, which was the previous year's best approach. That's quite the improvement.
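Here's a hedged PyTorch sketch of an Inception-style module with the 1×1 reductions placed in front of the expensive 3×3 and 5×5 branches, plus the pooling branch, with all branch outputs concatenated along the channel dimension. The structure follows the module described above; the channel counts in the usage line are roughly the split used in GoogLeNet's first Inception block, but treat them as illustrative.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-style block: parallel 1x1, 3x3, and 5x5 branches plus a pooling
    branch, with 1x1 convolutions used as cheap dimensionality reductions before
    the expensive 3x3 and 5x5 convolutions."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),          # 1x1 reduction
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),          # 1x1 reduction
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # illustrative channel split
y = block(torch.randn(1, 192, 28, 28))                 # -> (1, 64+128+32+32 = 256, 28, 28)
```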
We now move on to the 2015 ILSVRC winner, ResNet. Just when you thought networks couldn't get any deeper, they still do: to train on the ImageNet dataset, the residual network had 152 layers. That's eight times more than the VGGNets, yet it has lower complexity. Now let's step back a bit and just ask ourselves: why are we trying to build deeper networks? What is the point?

This goes back to the universal approximation theorem, from Wikipedia, which is clearly the most verifiable source. In the mathematical theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons, that is, an MLP, can approximate continuous functions under mild assumptions on the activation function. In other words, we can potentially model any kind of problem using a single-hidden-layer neural network. However, in the process we may run into the problem of memorizing the training data, which is popularly known as overfitting. And so a common trend in deep learning research, in order to enhance generalization, is to, well, go deeper, building deeper networks. But we cannot simply stack layers in order to go deeper, because this leads to the vanishing gradient problem. Basically, with the typical backpropagation algorithm, the neural network learns through gradient updates. In very deep networks, that gradient can eventually shrink to zero, and when the gradient is zero, the weights themselves are not updated, so no learning takes place. Performance saturates and even degrades after some point.

To combat this problem, folks at Microsoft introduced shortcut connections that bypass two layers. They called this an identity shortcut connection. The intuition here is that stacking such layers won't harm performance, since the layers can always be bypassed; hence, a ResNet will perform at least as well as its shallower counterpart, so the deeper the network, the better. We can observe the results on the 1000-class ImageNet classification task by comparing the performance of plain networks with their residual network counterparts, where a residual network is just the plain network with the bypasses added. In their paper, we observe an overall better performance from ResNet of about 2.8%. Furthermore, the shallower 18-layer plain net performs better than the deeper 34-layer plain net, which is attributed to the vanishing gradient problem, while the deeper ResNet has the better performance, showing that ResNet was able to address this gradient problem.
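As a sketch of what a shortcut connection looks like in code, here's a minimal ResNet-style basic block in PyTorch: two 3×3 convolutions whose output is added back to the block's input. I've included batch normalization in the usual conv-BN-ReLU arrangement; the paper's blocks also have downsampling variants with projection shortcuts, which I'm leaving out here.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Sketch of a ResNet basic block: two 3x3 conv layers plus an identity
    shortcut that bypasses them (output = F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # the identity shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                          # add the input back in: F(x) + x
        return self.relu(out)

x = torch.randn(2, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)                # (2, 64, 56, 56) -- shape preserved
```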
The next and final architecture we'll talk about is Xception. It's based on Inception and depthwise separable convolutions. In standard convolution, cross-channel correlations and spatial correlations are determined together, simultaneously. Xception, on the other hand, takes advantage of the fact that these two processes can be completely decoupled. In effect, we can perform convolution in two phases. The first is depthwise convolution, which determines the spatial correlations, the relationship of a pixel with respect to its neighbors. Then we have pointwise convolution, which determines the cross-channel correlations, the relationship between points in the feature volume along the different channels. These are the fundamental phases of depthwise separable convolution. I made a video on this, so check it out in the info card at the top. But for now, just understand that depthwise separable convolutions break up the convolution operation into two phases, which reduces the number of parameters and hence computation time.

The typical Inception module first looks at the cross-channel correlations through a set of 1×1 convolutions, mapping the input data into three or four separate spaces that are smaller than the original input space. It then looks at all the spatial correlations in those smaller 3D spaces through regular 3×3 or 5×5 convolutions. Now consider a simplified Inception module without the pooling layers and with only 3×3 convolutions. As an extreme version of this Inception module, we first use a 1×1 convolution to map cross-channel correlations, and then separately map the spatial correlations of every output channel. This extreme version of Inception is what gives Xception, "Extreme Inception," its name. The main differences between this extreme Inception module and a depthwise separable convolution are, well, first, the order of operations: in depthwise separable convolution, the depthwise convolution is performed before the pointwise convolution, whereas in the extreme Inception module the order is reversed. Second, the depthwise and pointwise operations in the extreme Inception module are each followed by a ReLU nonlinearity, whereas depthwise separable convolutions are usually implemented without a nonlinearity in between. From Xception's architecture, we see that it makes use of these depthwise separable convolutions, ReLU activations, and the shortcut connections from ResNet. Xception was also able to achieve state-of-the-art performance on JFT, a 350-million-image dataset, for classification over 17,000 categories.

So what have we learned? The history of convolutional neural network architectures started with the LeNet family, which involves stacking a number of convolution units for feature extraction and pooling units for spatial subsampling. In 2012's AlexNet, convolution operations were repeated between subsequent max pooling layers to extract richer features. In 2013, we saw Zeiler and Fergus's architecture, Clarifai, and in 2014 the VGG architecture. In the VGG architecture, we saw an increase in performance with a decrease in the number of learnable parameters, achieved by using smaller kernel sizes and 1×1 convolutions. We saw the introduction of the Network in Network architecture, which placed micro networks between subsequent convolution layers. This let the network avoid stacking numerous simple convolution layers, decreasing the number of parameters, and furthermore it created more interpretable classification results relative to traditional CNNs. The ILSVRC 2014 winner, GoogLeNet, was a 22-layer-deep neural network architecture that introduced the Inception module; it borrowed heavily from Network in Network and from R-CNNs. These modules allowed the modeling of complex operations, so a small stack of Inception modules could solve computer vision problems with less computation. At the end of 2015, we had the introduction of residual networks, which mitigated the problem of vanishing gradients through shortcut connections; this allowed the training of very deep networks to enhance generalization. And in April 2017, we had Xception, or Extreme Inception. It's based heavily on the Inception module and makes use of depthwise separable convolutions. It also uses the shortcut connections of ResNet to construct deeper networks with high performance.

And that's all I have for you for now. If you liked the video, hit that like button. If you're interested in videos like this, videos on machine learning, data science, and artificial intelligence in general, then hit that subscribe button, and for notifications when I upload, hit that bell icon. All the links to the papers are down in the description below, so check them out. If you're interested in other hot topics in the field, click one of the videos right here, and I will see you in the next one.