So let's talk a little bit about the history of deep learning and convnets. Where did the modern class of convnets start? Well, in 2012 we had ImageNet, which had been around since roughly 2009. Now what's ImageNet? Before that we had these simplistic datasets, things like MNIST, but ImageNet was really big: more than a million images in 1,000 categories, and there had been a lot of activity of people using classical machine learning approaches such as SIFT features. So pre-AlexNet, the state of the art was at about 30% error. And then look what happened: in 2012 AlexNet came out, and AlexNet went down from about 30% error to about 16%. Then we see very quickly, year over year, how deep learning got better at it, with VGG in 2014 bringing it down to about a 7% error rate, GoogLeNet bringing it to 6.7%, and ultimately ResNet bringing it to roughly 3.6%, which is better than humans, who are at roughly a 5% error rate on that.

Now, see, what happened is we had had image recognition datasets for a while, and deep learning really revolutionized them. This was the rebirth of the deep learning movement, or of the artificial neural network movement. It was incredibly unpopular pre-AlexNet; once AlexNet came out, all of a sudden much of computer vision, and later NLP and reinforcement learning, moved away from the previous approaches to deep-learning-based approaches.

So let's talk a little bit about AlexNet. It was developed by Alex Krizhevsky and Ilya Sutskever. Now, Ilya is now chief scientist at OpenAI; he was the inventor of sequence-to-sequence learning. And of course Geoffrey Hinton was an author on it, who was Alex's PhD supervisor. He co-invented the Boltzmann machine and did all kinds of things. Arguably Hinton might have been against it, because Hinton very much believed that unsupervised learning was the future. Unsupervised learning still is the future, and arguably unsupervised learning is the way people and babies and animals learn about the world. But in a way, AlexNet opened up the discovery that just going supervised, in a deep learning way, was going to be sufficient if the datasets are very big.

So the problem that AlexNet was applied to was competing on ImageNet, which was Fei-Fei Li's dataset of roughly 14 million images with more than 20,000 categories if we consider all of them. For example, categories could be strawberry or balloon, or the example that you see on the slide on the left-hand side, which is some kind of a monkey. See, I can't even positively identify which kind of a monkey it is.

Let's briefly talk about Geoffrey Hinton, because he has been so deeply influential on the deep learning field. He's an English-Canadian cognitive psychologist and computer scientist. He popularized backpropagation and is seen as the godfather of deep learning. He has been working for many decades towards making artificial neural networks really, really work and solve real-world problems, and he was also always into trying to understand how the brain does it and how human cognition relates to that. He co-invented Boltzmann machines, he contributed to AlexNet, and he advised many of the top people in that area, including Yann LeCun, Ilya Sutskever, Radford Neal, and Brendan Frey. He works at Google and at the University of Toronto at the same time. He is one of the most important people in deep learning; that's why I thought it was important to talk a little bit about him here.
So let's talk about AlexNet. AlexNet is a truly interesting architecture, in particular if you consider the space out of which it came. It had an amazing focus on engineering that many of the previous networks didn't have, so let's talk about what's going on.

So here we start with an image. Images in ImageNet here are 224 by 224 by 3; we have three channels, one for each color. Now, at the first level what they had is an 11 by 11 convolution kernel, which is huge by today's standards; we're usually using much smaller ones today.

Okay, and now it had two graphics cards. Why did it have two graphics cards? Well, three gigabytes is too little, in a way, for really big networks, and that's what GPUs were limited to in 2012. So what they did is they had one part of the network running on one GPU and another part running on the other GPU. Now, mind you, that means that there are different features on the different graphics cards: some of the channels are here, some of the channels are there. And of course we need some interaction, which is allowed, for example, here, and certainly in the fully connected layers, where there's a lot of it. But otherwise a lot of the communication is local on each graphics card, which allowed it to run relatively fast.

Okay, so we basically take that input image, which is given to both graphics cards. Now, what do the graphics cards do? We have the convolution with a stride of four, and then we have a max pool here with a stride of two. Now, interestingly, at this top level they used a stride of four, and that is why they go down from 224 to 55 in image size, while at the same time going from 3 to 48 features per graphics card at that first level. Then, as they go down from 55 to 27, which is roughly a factor of two, they go from 48 to 128 channels. Then there's another max pool layer, they ultimately go to 192 channels, and they have some local response normalization. Then it goes through a last max pooling stage, and then we have two fully connected layers and a softmax cost function at the end. All the commands that you'd need for that, you should have seen by now.

So now, for AlexNet they did a whole bunch of things that really helped; it was a wonderful piece of engineering. The first one is the finding that ReLU converges much, much faster than tanh. Here we have a comparison of the training error rate as a function of the epoch: in dashed we see tanh, in solid we see ReLU, and ReLU works much better. The second one is that they used dropout, where instead of always training the network fully connected, they randomly turn off certain units, which makes the network more stable in a number of ways and more resilient.
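To make that layer-by-layer walkthrough concrete, here is a minimal sketch of an AlexNet-style network in PyTorch. This is a reconstruction under assumptions, not the original implementation: the two-GPU split is folded into single layers, so the channel counts are the combined totals (48 + 48 = 96 per GPU at the first level, and so on), and the exact padding and normalization placement follow the commonly cited layer sizes.

```python
import torch
import torch.nn as nn

# AlexNet-style network, single-GPU version. Spatial sizes in the
# comments follow the 224x224x3 input described above.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # 224 -> 55
            nn.ReLU(inplace=True),                                   # the ReLU trick
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 55 -> 27
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # 128 channels per GPU
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 27 -> 13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # 192 channels per GPU
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                    # the dropout trick described above
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),         # raw logits
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

The last layer produces raw logits; the softmax cost function at the end corresponds to training these logits with something like nn.CrossEntropyLoss, which applies the softmax internally.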
Then here's a really interesting thing. They separated these two filter banks, and keep in mind that there are more interactions within the same filter bank, on the same graphics card, than across them. That generally gives rise to specialization, where you have a lot of color tuning in one of the graphics cards, which in a way specializes in parsing and interpreting color, and a lot of form tuning in the other one, which focuses on form. This is interesting because it's somewhat similar to the way neurons in the brain are divided into different regions.

Now, the results: it smoked the competition, with 15% top-5 error, while the runner-up had 26.2% that year. It was one of the first neural networks trained on GPUs with CUDA, and it had 60 million parameters, which was a very big network at that time. And of course it shaped the field: it started the rebirth of neural networks, and it has been an incredibly influential paper.

So it's time for you to learn a little bit about AlexNet. We will ask you how you think we could improve convnets, and a little bit more about parameter efficiency, and you get to run AlexNet and think a little bit about how it works.
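If you want to poke at a working AlexNet before the exercise, one way is to load torchvision's reference implementation with pretrained ImageNet weights. This sketch assumes you have torchvision installed (version 0.13 or later for the weights API shown here).

```python
import torch
from torchvision import models

# Load torchvision's reference AlexNet with ImageNet-pretrained weights.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

# Count parameters: roughly 61 million, matching the figure above.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")

# Run a dummy 224x224 RGB image through the network.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000]) -- one score per ImageNet class
```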