So, after AlexNet came a time of many different architecture innovations, so let's go through a few of them. In 2014, the next outstanding network was VGG. It was developed by the Visual Geometry Group in Oxford, and their goal was to beat AlexNet on ImageNet. It uses smaller filters and deeper networks. Let's look a little at the architecture. Instead of having an initial layer with very large filters and big strides, it has small filters throughout. So we go through two convolution layers at the full image size of 224, then comes a max pool layer; we go through another two convolution layers, another max pool, three convolution layers, another max pool. We see how the number of parameters goes up as we go through the network. Another three layers, another max pool, another three layers, another max pool, and then three fully connected layers before the output with the softmax. (A small code sketch of one such convolution-plus-pooling stage follows below.) Okay, so in a way it's similar to AlexNet, but it has smaller filters and it's considerably deeper. So why all the fully connected layers? They form an MLP that is basically able to do interesting non-linear computations on top of the convolutional features. And the architecture, of course, as you saw, changes: it jumps from the eight layers of AlexNet to 16 to 19 layers, depending on how you count. It uses only 3x3 convolutions, with stride 1 and padding 1, and only 2x2 max pools with stride 2. Effectively it therefore has fewer parameters, or at least at certain places it has fewer parameters: instead of a large, say, 11x11 filter, which takes 121 parameters, it has just a 3x3, which gives us nine locally. And having fewer parameters often makes performance better, because, remember, fewer parameters means better generalization performance. But of course it's still an incredibly big network with a lot, a lot of parameters. It achieved about 8% top-five error and was therefore considerably better. And it has certain drawbacks: it was incredibly slow to train, and the weights themselves are quite large in terms of disk space and bandwidth; the size of a VGG model is over 530 megabytes, which made deploying it kind of tiresome at that time.

So next came GoogLeNet. GoogLeNet uses the so-called inception modules: it basically replaces individual layers, where we would say "okay, we do a 3x3 convolution with a 2x2 max pool", with these little inception modules. And here the architecture is: we have a set of inception modules, a grid-size reduction with some tricks, another set of inception modules, another grid-size reduction, and another set of inception modules, and then finally, again, a big readout. In the first version of Inception, you can say the motivation is that kernel-size tuning is hard. Should we use a 5x5 kernel, or a 3x3 kernel, or no kernel at all? It's hard to know. And also, if you're Google, then why decide if you can have all of them? So that's what they did. They first have a 1x1 layer here that just allows you to change the number of feature dimensions if you want, followed by a 5x5. Then they also have a 1x1 followed by a 3x3. Then they have a pooling layer followed by a 1x1, and they have a direct 1x1. And then they concatenate all of these together, and that produces an inception module. So the idea is that, instead of having to choose one, you can have all of them at the same time.
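Going back to VGG for a moment, here is a minimal sketch of one such stage, assuming PyTorch (the lecture doesn't prescribe a framework, and the channel counts are only illustrative): a few 3x3, stride-1, padding-1 convolutions that keep the spatial size, followed by a 2x2, stride-2 max pool that halves it.

```python
import torch
import torch.nn as nn

# Minimal sketch of one VGG-style stage (layer widths are illustrative,
# not the exact VGG-16 configuration).
def vgg_stage(in_channels, out_channels, num_convs):
    layers = []
    for i in range(num_convs):
        # Only 3x3 convolutions, stride 1, padding 1: spatial size is preserved.
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    # Every stage ends with a 2x2 max pool with stride 2, halving the resolution.
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)          # full-size ImageNet input
stage1 = vgg_stage(3, 64, num_convs=2)   # two convs at 224x224, then pool to 112x112
print(stage1(x).shape)                   # torch.Size([1, 64, 112, 112])
```

Stacking such stages while increasing the channel count, and finishing with the fully connected layers and the softmax, is essentially how the full VGG network is built.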
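And here is a similar sketch of the inception module just described, again assuming PyTorch; the branch widths are illustrative, and the activations between convolutions are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of the original inception module; branch widths are illustrative."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)                  # direct 1x1
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1),           # 1x1 reduction,
                                nn.Conv2d(c3_red, c3, 3, padding=1))   # then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1),           # 1x1 reduction,
                                nn.Conv2d(c5_red, c5, 5, padding=2))   # then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # pooling,
                                nn.Conv2d(in_ch, pool_proj, 1))        # then 1x1

    def forward(self, x):
        # Every branch keeps the spatial size, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```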
And like often in machine learning, when we build ensemble methods we get better; in this case, by giving the system these different possibilities, it will do somewhat better. Now, there are of course certain representational bottlenecks: it's often the case that you get worse learning properties when the dimensions of the data change drastically all at once, which happened in the original version. So they then replaced it with the version shown on this slide, where you see the 5x5 has been replaced by two subsequent 3x3 layers. But it's still the same general idea: we basically don't want to choose one of these architectures, we want to have all of them, concatenate them, and then let gradient descent choose how important each of these channels should be.

So what were the results? They get to 6.7% top-five error rate. Now, that wasn't quite as good as humans, because Karpathy tried it and was able to do 5.1%, which is actually quite interesting: it requires considerable training of humans to get good at ImageNet. So you need to train humans on ImageNet so that they get good at ImageNet.

Now, there is a problem if you want to go really deep, and it's the following. If you take a data set like CIFAR-10, which we have here, and you plot the error as a function of the iteration, you always see these jumps, which are the places where the learning rate of SGD is reduced; you get better performance, but slower progress, when you do that. And what can be seen here is that if you make the networks very deep, they don't converge to good values anymore. So 20 layers is actually better than 30 layers, which is better than 40, which is better than 50. So the idea that going deeper helps doesn't seem to be right, at least when we have a fixed period of training. And so it doesn't seem that simply adding more and more layers is actually the solution to the problem we're looking for after all.

So now here comes the idea of ResNets. Look at what we have on the left-hand side: we have VGG, which starts with convolutions, max pool, convolutions, max pool, convolutions, max pool, and so forth. Next to it you can have a 34-layer plain network that basically goes through the same structure, just like VGG but much deeper, with lots of layers in each stage. But here comes the cool idea of ResNets: what you do is you basically model the differences. You take the output here and, keeping in mind that in this whole stack we always have the same dimensions, you add what goes in here to this one, and you add it to this one, and you add it to this one. And by having these connections that go through it, these effective skip connections, what it allows the network to do is model changes relative to what we'd have if we had the identity transformation from here to here. And this is actually quite interesting, because now there's a shortest path: we can go from here through these skips to here, through these skips to here, and so forth. So the network is at the same time a relatively shallow network and a very, very deep network, and you can immediately see why this seems like a good idea. So how does a ResNet block work? We have an input.
Then we have a weight layer with a ReLU, then another weight layer. But now we take the output of this and we add the input x to it, and then we apply the ReLU. So thereby there is this shortcut path, whereas what these weight layers do is basically model how we should change the input x so that the result is as good as possible. And so H(x), what we have here, is the desired mapping, and we hope that we can fit the residual part with a small local network. And if the optimal mapping is close to the identity, it's easier to find those relevant small fluctuations. Now, if you have a very deep network, you can say each layer doesn't need to do all that much to its input; it might be sufficient to make relatively small changes. And therefore it's intuitively desirable to use something like ResNets. And indeed, if they use ResNets, the performance looks very, very different. They find that ResNet-20 does a good job, but if you go from 20 to 30 layers, you're doing better; the deeper you go, the better it gets. And the same thing on ImageNet: they really get mileage out of going from 18 layers to 34 layers here.

Why do you find in neural network papers so much work on these small data sets like CIFAR-10? Well, you can run experiments on it very, very quickly, whereas ImageNet, as you may remember, is a very large data set, and that just means that it costs you a lot of compute to calculate anything with it.

So why do they work? If we look at the paths: if we remove a layer, let's say we remove layer F1, there are still two paths by which we can get from the output to the input. Now imagine this layer for some reason totally cuts out: no gradients, no forward propagation. We can still get gradients through this path and this path, and therefore we have much less to worry about regarding vanishing gradients. And in fact, you can visualize the loss landscape; we will at some point talk about how these visualizations are made. Without skip connections, the loss landscape looks really bizarre, whereas with the skip connections it has a much smoother, much more meaningful shape, where we can hope that optimization will work much better.

Now, the next innovation here is ResNeXt, where you take the same basic idea that we have in ResNet and basically have multiple channels in parallel, a uniform multi-branch structure, and in some cases it really helps. And now let's briefly ask ourselves: how does a ResNet block implement the identity function? What if we needed the output after a bunch of layers to be the same as the input? Well, that's actually very simple: we set all the weights to zero. If the residual branch then does nothing, the block stays an identity function.
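To make this concrete, here is a minimal sketch of such a residual block, again assuming PyTorch. It follows the simplified description above (two weight layers, a ReLU in between, the shortcut added before the final ReLU) and omits the batch normalization that the actual ResNet blocks also contain. It also demonstrates the identity argument just made: with all weights set to zero, the block passes its (non-negative) input through unchanged.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of a residual block: output = ReLU(F(x) + x), dimensions unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))   # weight layer + ReLU
        out = self.conv2(out)            # second weight layer
        return self.relu(out + x)        # add the shortcut, then the final ReLU

block = BasicBlock(64)
# If the residual branch is all zeros, the block reduces to the identity
# (for non-negative inputs, since the final ReLU is applied after the addition).
nn.init.zeros_(block.conv1.weight)
nn.init.zeros_(block.conv2.weight)
x = torch.rand(1, 64, 8, 8)               # non-negative input, as after a previous ReLU
print(torch.allclose(block(x), x))         # True
```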
Now, here's another network that came a little later, DenseNet. In the case of DenseNet, what we do is take every layer's output and not only give it to the next layer, but allow it to skip ahead to the layer after that, and the one after that, and so on. And thereby, of course, the representation gets longer and longer, because it's all concatenated with one another. But it shares a lot of the desirable properties of the ResNet, namely that there are short paths that connect the transition layer, where the max pool happens, to all the previous layers. So in a way, the network is again at the same time a shallow network and a deep network, and that of course helps massively to deal with vanishing-gradient problems, strengthens feature propagation, and encourages feature reuse. If there's something useful in these features that can be used by many of the later layers, it will reuse it. And it also can substantially reduce the number of parameters relative to the ResNet. (A minimal sketch of this dense connectivity follows below.) Now, let's think a little bit about the usefulness of skip connections: take a ResNet, load it, and try to understand the role of the skip connections in this context.
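Here is that sketch of a dense block, again assuming PyTorch and with illustrative channel counts: each layer receives the concatenation of everything produced before it and adds a fixed number of new channels (the "growth rate" from the DenseNet paper). Batch normalization is again omitted for brevity.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a dense block: each layer sees the concatenation of all earlier outputs."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each layer takes everything produced so far and adds growth_rate new channels.
            self.layers.append(nn.Sequential(
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)             # channels grow layer by layer

block = DenseBlock(in_channels=24, growth_rate=12, num_layers=4)
print(block(torch.randn(1, 24, 32, 32)).shape)  # torch.Size([1, 72, 32, 32])
```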
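And as one possible starting point for that exercise, assuming a recent torchvision is available, you can load a ResNet, pick out a single residual block, and verify by hand that it computes ReLU(F(x) + x); the attribute names below are those of torchvision's ResNet-18 implementation.

```python
import torch
from torchvision.models import resnet18

# Load a ResNet-18 and inspect one residual block and its shortcut.
model = resnet18(weights=None)   # or weights="IMAGENET1K_V1" for pretrained weights
model.eval()

block = model.layer1[0]          # a BasicBlock: conv-bn-relu-conv-bn plus the shortcut
print(block)

x = torch.randn(1, 64, 56, 56)   # the shape that layer1 receives
with torch.no_grad():
    with_skip = block(x)
    # The residual branch F(x) on its own, before the shortcut is added.
    f_x = block.bn2(block.conv2(block.relu(block.bn1(block.conv1(x)))))
    print(torch.allclose(with_skip, torch.relu(f_x + x), atol=1e-6))  # True
```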